<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2 Sprint Challenge*

# Natural Language Processing

# Part 1 - Working with Text Data

Use Python string methods to remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [4]:
" ".join(whitespace_string.split())

'This is a string that has a lot of extra whitespace.'
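The one-liner works because `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace. As a cross-check, a regular-expression version should produce the same string. This is a minimal sketch; the two helper names are illustrative and not part of the challenge.

```python
import re

def normalize_with_split(text):
    # split() with no argument splits on runs of any whitespace
    return " ".join(text.split())

def normalize_with_regex(text):
    # Collapse each whitespace run to one space, then trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "
print(normalize_with_split(sample))
print(normalize_with_split(sample) == normalize_with_regex(sample))
```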

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt)

- Day
- Month
- Year


In [42]:
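import pandas as pd

# Read one date string per line into a single 'Full Date' column
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt',
                 names=['Full Date'], sep='\n')
# Parse each line as a datetime, then convert back to 'YYYY-MM-DD' strings
df['Full Date'] = pd.to_datetime(df['Full Date'])
df['Full Date'] = df['Full Date'].astype('str')

# Each named group becomes a column in the frame returned by str.extract
regex = r'(?P<Year>[^-]+)-(?P<Month>[^-]+)-(?P<Day>[^-]+)'
df = pd.concat([df, df['Full Date'].str.extract(regex).astype(int)], axis=1)
df.head()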


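|   | Full Date  | Year | Month | Day |
|---|------------|------|-------|-----|
| 0 | 2015-03-08 | 2015 | 3     | 8   |
| 1 | 2015-03-15 | 2015 | 3     | 15  |
| 2 | 2015-03-22 | 2015 | 3     | 22  |
| 3 | 2015-03-29 | 2015 | 3     | 29  |
| 4 | 2015-04-05 | 2015 | 4     | 5   |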
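To see what the named groups do on a single value, the same pattern can be applied with the standard `re` module. This is a sketch under the assumption that the input is already a dash-separated `YYYY-MM-DD` string, as in the `Full Date` column; `split_date` is a hypothetical helper name.

```python
import re

# Same pattern as the notebook cell: three dash-separated named groups
DATE_PATTERN = re.compile(r"(?P<Year>[^-]+)-(?P<Month>[^-]+)-(?P<Day>[^-]+)")

def split_date(iso_date):
    """Return the Year, Month and Day of a 'YYYY-MM-DD' string as integers."""
    match = DATE_PATTERN.fullmatch(iso_date)
    if match is None:
        raise ValueError(f"not a dash-separated date: {iso_date!r}")
    # groupdict() maps each group name to the text it captured
    return {name: int(value) for name, value in match.groupdict().items()}

print(split_date("2015-03-08"))
```

The group names play the same role here as the column names that `str.extract` produces in the DataFrame version.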


In [36]:
df.dtypes

Full Date    object
dtype: object

# Part 2 - Bag of Words 

### Use the Twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents, ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)
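The order can still change the result. A small sketch, assuming a made-up mini stopword list and plain whitespace splitting as a stand-in for a real tokenizer: stripping punctuation first turns "don't" into "dont", which no longer matches the stopword entry and so survives filtering.

```python
import string

# Hypothetical mini stopword list; the real NLTK list is much longer
STOP_WORDS = {"i", "is", "so", "for", "my", "don't"}

def strip_punctuation(token):
    return "".join(ch for ch in token if ch not in string.punctuation)

def punctuation_first(text):
    # Lowercase, strip punctuation, split on whitespace, then drop stopwords
    cleaned = strip_punctuation(text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def stopwords_first(text):
    # Lowercase, split, drop stopwords, then strip punctuation from survivors
    kept = [tok for tok in text.lower().split() if tok not in STOP_WORDS]
    stripped = [strip_punctuation(tok) for tok in kept]
    return [tok for tok in stripped if tok]

tweet = "I don't care!!!"
print(punctuation_first(tweet))
print(stopwords_first(tweet))
```

Either order satisfies the four requirements; the difference is only in which contractions slip past the stopword filter.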

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv',
                 encoding='utf-8')
df.head()

|   | Sentiment | SentimentText |
|---|-----------|---------------|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


In [2]:
df.shape

(99989, 2)

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import string
from nltk.corpus import stopwords
from nltk import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
# Build the stopword set once; set membership tests are faster than list lookups
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    # Lowercase and drop punctuation characters
    text = "".join(char.lower() for char in text
                   if char not in string.punctuation)
    # Tokenize at the word level, then remove stopwords
    return [token for token in word_tokenize(text) if token not in STOP_WORDS]

df['Tokens'] = df['SentimentText'].apply(process_text)

In [5]:
df.head()

|   | Sentiment | SentimentText | Tokens |
|---|-----------|---------------|--------|
| 0 | 0 | is so sad for my APL frie... | [sad, apl, friend] |
| 1 | 0 | I missed the New Moon trail... | [missed, new, moon, trailer] |
| 2 | 1 | omg its already 7:30 :O | [omg, already, 730] |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... | [omgaga, im, sooo, im, gunna, cry, ive, dentis... |
| 4 | 0 | i think mi bf is cheating on me!!! ... | [think, mi, bf, cheating, tt] |


In [6]:
pd.options.display.max_colwidth = 200
df['Tokens'].sample(10)

2723                                                                                                 [hate, today]
68530                                                                             [bryanlas, ah, well, thats, bad]
93096                     [confessing7girl, yeah, disney, rocks, lol, well, streaming, rubbish, slow, skips, alot]
25266                                                     [adamczar, yay, congrats, wedding, beautiful, much, fun]
48178                                                                               [altonturley, thanx, much, ff]
20226    [jaimemarie, oh, thank, jaime, jaime, im, genuinely, interested, jaime, youre, speech, awesome, jaime, x]
40924    [amyatq13, happy, saturday, hot, stuff, sorry, desk, area, messy, promised, better, cubiclemate, miss, u]
39220                                                                [amystark, wish, could, yoga, st, lukes, 530]
94143                                                                           

### How should TF-IDF scores be interpreted? How are they calculated?

**Interpretation**

The tf-idf vectorization of a text assigns each word in a document a score that increases with the word's frequency in that document and decreases with the number of documents in the dataset that contain the word. Common words like “a” or “the” therefore receive smaller tf-idf scores than words that are specific to a document. The resulting matrix of tf-idf scores contains one row per document and one column per distinct word in the dataset.

**Calculation**

TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)

IDF(w) = log_e(Total number of documents / Number of documents with term w in it)

TF-IDF(w) = TF(w) * IDF(w)
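The three formulas can be checked by hand on a tiny corpus. The sketch below is a direct transcription of them, using a made-up set of tokenized documents; the function names are illustrative. It assumes the term occurs in at least one document, since otherwise the IDF denominator is zero.

```python
import math

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing `term`)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy corpus of already-tokenized documents (made up for illustration)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

print(tf_idf("the", docs[0], docs))  # appears in every document, so IDF is log(1)
print(tf_idf("sat", docs[0], docs))  # appears in one of three documents
```

Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its scores will differ from this textbook version.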

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
 - Stretch goal: Track your results in a DataFrame and produce a visualization of the results.
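Step 3 matters because the column layout must be fixed by the training data alone; words that only appear in the test set are dropped rather than given new columns. A minimal standard-library sketch of that idea follows. The helper names and toy documents are assumptions for illustration, not the scikit-learn implementation.

```python
from collections import Counter

def build_vocabulary(train_docs, max_features=None):
    # Count tokens over the training documents only
    counts = Counter(tok for doc in train_docs for tok in doc)
    most_common = [tok for tok, _ in counts.most_common(max_features)]
    # Map each kept token to a fixed column index, in alphabetical order
    return {tok: i for i, tok in enumerate(sorted(most_common))}

def to_count_vector(doc, vocabulary):
    # Tokens missing from the vocabulary are ignored
    vector = [0] * len(vocabulary)
    for tok in doc:
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector

train_docs = [["good", "day"], ["bad", "day", "bad"]]
test_docs = [["good", "movie"]]

vocab = build_vocabulary(train_docs)
print(vocab)
print([to_count_vector(d, vocab) for d in train_docs])
print([to_count_vector(d, vocab) for d in test_docs])
```

The test document's "movie" has no column, so the train and test matrices keep the same width.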



# 1

In [7]:
# Train test split

# Subset for memory purposes
df = df.sample(25000)

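# Join each token list into one comma-separated string, then strip digits.
# regex=True is required in recent pandas, where str.replace defaults to literal matching.
df['Tokens'] = [','.join(map(str, l)) for l in df['Tokens']]
df['Tokens'] = df['Tokens'].str.replace(r'\d+', '', regex=True)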

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Tokens'], 
                              df['Sentiment'], random_state=420, test_size=0.3)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((17500,), (7500,), (17500,), (7500,))

In [8]:
X_train.head()

16780    done,english,paper,today,tomorrow,science,spanish,wednesday,math,im,done,come,freshman,year
36914                                           ali,yes,french,exam,took,last,week,horrible,horrible
66618      bluefur,thinkreferrals,bikram,yoga,ultimate,workout,challenge,take,,class,together,decide
22923                                                                               aaaarae,oh,sucks
76133                         antondominique,yeah,meron,exam,dunno,halfday,pero,sa,monday,think,wala
Name: Tokens, dtype: object

# 2 & 3

In [9]:
# Create a vocabulary with train data and transform train data

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=2500, stop_words='english')

vectorizer.fit(X_train)

train_word_counts = vectorizer.transform(X_train)

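# get_feature_names_out() is the scikit-learn >= 1.0 name; older releases used get_feature_names()
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())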

print(X_train_vectorized.shape)
X_train_vectorized.head()

(17500, 2500)


Unnamed: 0,aalaap,aaron,aaroncarter,abby,abc,able,absolutely,abt,ac,accent,...,½ï,ð²ð,ðµ,ðµð,ð½ð,ð¾,ð¾ð,ð¾ñ,ñð,ññ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Transform test data with vocab from original model that we trained on

test_word_counts = vectorizer.transform(X_test)

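# get_feature_names_out() is the scikit-learn >= 1.0 name; older releases used get_feature_names()
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())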

print(X_test_vectorized.shape)
X_test_vectorized.head()

(7500, 2500)


Unnamed: 0,aalaap,aaron,aaroncarter,abby,abc,able,absolutely,abt,ac,accent,...,½ï,ð²ð,ðµ,ðµð,ð½ð,ð¾,ð¾ð,ð¾ñ,ñð,ññ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 4

In [11]:
y_train.value_counts()

1    9844
0    7656
Name: Sentiment, dtype: int64
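The classes are somewhat imbalanced, so it helps to know what accuracy a model would get on the training set by always predicting the majority class. A quick sketch using the counts printed above:

```python
# Class counts copied from the y_train.value_counts() output above
class_counts = {1: 9844, 0: 7656}

# The majority class is the label with the largest count
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())

print(majority_class, round(baseline_accuracy, 4))
```

Any classifier worth keeping should beat this baseline of roughly 0.56.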

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# XGB = XGBClassifier(max_depth=8,
#                     n_estimators=20).fit(X_train_vectorized, y_train)
logreg = LogisticRegression(solver='lbfgs', max_iter=500)\
                            .fit(X_train_vectorized, y_train)


train_predictions = logreg.predict(X_train_vectorized)
test_predictions = logreg.predict(X_test_vectorized)


print('Train accuracy:', accuracy_score(y_train, train_predictions))
print('Test accuracy', accuracy_score(y_test, test_predictions))

Train accuracy: 0.7887428571428572
Test accuracy 0.7206666666666667
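For the stretch goal of reporting a metric other than accuracy, precision, recall and F1 can be computed from the counts of true positives, false positives and false negatives. The sketch below uses made-up labels rather than the notebook's predictions, and `precision_recall_f1` is a hypothetical helper name; `sklearn.metrics` offers library versions of these metrics.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```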


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized Twitter dataset.

2) Display the 10 words that are most similar to the word "twitter"

In [0]:
# Reset the tokens column back into list format. This is what Word2Vec wants and needs.

df['Tokens'] = df['SentimentText'].apply(process_text)

In [0]:
from gensim.models import Word2Vec

# gensim 4.x names the dimensionality argument `vector_size` (it was `size` in gensim 3.x)
model = Word2Vec(df['Tokens'], min_count=5, vector_size=100, seed=420)

In [121]:
model.wv.most_similar('twitter')

[('facebook', 0.824047327041626),
 ('link', 0.8227037191390991),
 ('dm', 0.8197536468505859),
 ('blog', 0.819339394569397),
 ('following', 0.8165600895881653),
 ('follow', 0.8063817024230957),
 ('site', 0.8020864725112915),
 ('page', 0.7977306842803955),
 ('cherylanncole', 0.7902941107749939),
 ('carsonjdaly', 0.7894009351730347)]
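The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking, using made-up 3-dimensional vectors in place of the 100-dimensional ones learned above; the function names mirror the idea and are not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    # Rank every other word by cosine similarity to `word`
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration only
toy_vectors = {
    "twitter": [1.0, 1.0, 0.0],
    "facebook": [1.0, 0.9, 0.1],
    "yoga": [0.0, 0.1, 1.0],
}
print(most_similar("twitter", toy_vectors, topn=2))
```

Words whose vectors point in nearly the same direction, like "twitter" and "facebook" in this toy data, score close to 1.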