# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [2]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [4]:
' '.join(whitespace_string.strip().split()).replace('a lot of', 'keine')

'This is a string that has keine extra whitespace.'
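The `.replace('a lot of', 'keine')` call above is a playful extra; the whitespace cleanup itself comes from `split()` and `join()`. Here is a minimal sketch of just that step, with a regular-expression equivalent for comparison. The name `normalize_whitespace` is illustrative and not part of the original notebook.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

def normalize_whitespace(text):
    # split() with no arguments splits on any run of whitespace and ignores
    # leading/trailing whitespace, so a separate strip() is not required.
    return ' '.join(text.split())

cleaned = normalize_whitespace(whitespace_string)

# Regex equivalent: collapse every whitespace run to one space, then trim the ends.
regex_cleaned = re.sub(r'\s+', ' ', whitespace_string).strip()

print(cleaned)
```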

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [120]:
# tools for the job
import pandas as pd
import re as rx
import requests

s = requests.get('https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt')

In [121]:
no_n = s.text.split('\n')
no_r = [s.replace('\r', '') for s in no_n]

In [122]:
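def clean(raw):
    # Month: first capitalized word on each line
    m = [rx.findall(r'[A-Z][a-z]+', i)[0] for i in raw]
    # Day: first run of one or two digits on each line
    d = [rx.findall(r'\d{1,2}', i)[0] for i in raw]
    # Year: first run of four digits on each line
    y = [rx.findall(r'\d{4}', i)[0] for i in raw]

    df_dates = pd.DataFrame({'Month': m, 'Day': d, 'Year': y})

    return df_dates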

clean(no_r)

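| | Month | Day | Year |
|---|---|---|---|
| 0 | March | 8 | 2015 |
| 1 | March | 15 | 2015 |
| 2 | March | 22 | 2015 |
| 3 | March | 29 | 2015 |
| 4 | April | 5 | 2015 |
| 5 | April | 12 | 2015 |
| 6 | April | 19 | 2015 |
| 7 | April | 26 | 2015 |
| 8 | May | 3 | 2015 |
| 9 | May | 10 | 2015 |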
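The three separate `findall` passes can also be written as one pattern with named groups. The sketch below assumes each line looks like `March 8, 2015`, which is inferred from the output above rather than checked against the file. It runs on a small hand-made sample (`sample_lines`) instead of the downloaded text.

```python
import re

# Hypothetical sample in the assumed "Month D, YYYY" layout; the empty string
# stands in for a blank line such as a trailing newline might produce.
sample_lines = ["March 8, 2015", "April 26, 2015", "", "May 10, 2015"]

DATE_PATTERN = re.compile(
    r'(?P<Month>[A-Z][a-z]+)\s+(?P<Day>\d{1,2}),\s+(?P<Year>\d{4})'
)

rows = []
for line in sample_lines:
    match = DATE_PATTERN.search(line)
    if match:  # skip blank or malformed lines instead of raising IndexError
        rows.append(match.groupdict())

print(rows)
```

Passing `rows` to `pd.DataFrame` should give the same three columns that `clean` returns.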


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [54]:
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lambda_school_loaner_32/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [106]:
raw = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv')
raw.head()

| | Sentiment | SentimentText |
|---|---|---|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


In [107]:
df = raw.copy()
df.columns = ['sen', 'sen_text']

In [63]:
df.head()

| | sen | sen_text |
|---|---|---|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


## Somewhat inelegant clean

In [108]:
# tokenize tweets at the word level
df.sen_text = df.sen_text.apply(word_tokenize)

# convert to lowercase
df.sen_text = [[el.lower() for el in row] for row in df.sen_text]

# remove punctuation (to revisit)
table = str.maketrans('','', string.punctuation)
df.sen_text = [[x.translate(table) for x in row] for row in df.sen_text]

# filter for alpha characters
df.sen_text = [[el for el in row if el.isalpha()] for row in df.sen_text]

# remove stop words
df.sen_text = [[el for el in row if el not in stop_words] for row in df.sen_text]

In [73]:
for i in range(10):
    print(df.sen_text[(i*10)])

['sad', 'apl', 'friend']
['must', 'think', 'positive']
['sad', 'iran']
['hate', 'athlete', 'appears', 'tear', 'acl', 'live', 'television']
['amp', 'amp', 'fightiin', 'wiit', 'babes']
['baddest', 'day', 'eveer']
['hate', 'u', 'leysh']
['never', 'thought', 'become', 'second', 'choice']
['send', 'sunshine', 'northern', 'ireland', 'going', 'swimming', 'today', 'kezbat']
['jonas', 'day', 'almost']
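The cell-by-cell steps above can also be folded into a single function. This sketch swaps in a regular-expression tokenizer and a tiny hand-written stop word set (`DEMO_STOP_WORDS`) so that it runs without any NLTK downloads. Both are stand-ins, so its output will not match `word_tokenize` plus the NLTK stop word list exactly.

```python
import re

# Hypothetical, deliberately tiny stop word set used only for this demo.
DEMO_STOP_WORDS = {'i', 'is', 'so', 'for', 'my', 'the', 'on', 'me', 'its'}

def clean_tweet(text, stop_words=DEMO_STOP_WORDS):
    # 1) lowercase
    lowered = text.lower()
    # 2) tokenize and drop punctuation/digits in one pass: keep runs of letters
    #    (this splits contractions at the apostrophe)
    tokens = re.findall(r'[a-z]+', lowered)
    # 3) remove stop words
    return [tok for tok in tokens if tok not in stop_words]

print(clean_tweet("omg its already 7:30 :O"))
```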


### How should TF-IDF scores be interpreted? How are they calculated?

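**TF-IDF** is all in the name: **_Term Frequency, Inverse Document Frequency_**. A score measures how important a term is to a **particular** document within a corpus. It is the product of two factors: term frequency (how often the term occurs in the document) and inverse document frequency (in its basic form, the log of the total number of documents divided by the number of documents that contain the term). A term that appears in almost every document gets an IDF near zero and is heavily down-weighted, while a term concentrated in a few documents keeps a high weight. A high TF-IDF score therefore means the term is frequent in that document and rare across the rest of the corpus.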
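To make the calculation concrete, here is a minimal sketch of the textbook formula, tf(t, d) × log(N / df(t)), on a three-document toy corpus. The corpus and function names are made up for illustration. `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents.
corpus = [
    ['sad', 'day', 'today'],
    ['happy', 'day', 'today'],
    ['sad', 'sad', 'iran'],
]

def document_frequency(docs):
    # Number of documents in which each term appears at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def tf_idf(docs):
    n_docs = len(docs)
    df = document_frequency(docs)
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # relative term frequency times unsmoothed inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(corpus)
for doc_scores in scores:
    print({term: round(value, 3) for term, value in doc_scores.items()})
```

In the third document, `iran` outranks `sad` even though `sad` occurs twice, because `iran` appears in only one document of the corpus.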

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
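Step 3 matters because the vocabulary should come from the training data alone. Words that appear only in the test set are dropped at transform time. The following is a minimal pure-Python sketch of that fit/transform split. The helper names and toy documents are hypothetical. `CountVectorizer` and `TfidfVectorizer` follow the same fit-then-transform pattern.

```python
from collections import Counter

# Hypothetical toy documents, already cleaned and space-joined.
train_docs = ['sad day', 'happy day today']
test_docs = ['happy sunshine day']

def fit_vocabulary(docs):
    # Learn a term -> column index mapping from the given documents only.
    terms = sorted({tok for doc in docs for tok in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocab):
    # Count occurrences of known terms; tokens outside the vocabulary are ignored.
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(term, 0) for term in vocab])
    return rows

vocab = fit_vocabulary(train_docs)          # fit on train only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)   # 'sunshine' has no column

print(vocab)
print(train_matrix)
print(test_matrix)
```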



In [75]:
# processing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [90]:
df.sen_text = [' '.join(i) for i in df.sen_text]

In [93]:
# data/target assignment
X = df.sen_text
y = df.sen

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [86]:
print(y_train)
print(X_train)

11111    0
5078     0
20029    1
70187    0
89492    0
51725    1
45990    1
53562    1
31009    1
76743    1
34358    1
21810    1
50973    1
8810     1
26928    0
81952    0
3601     0
57022    1
29240    0
26681    1
63014    1
23719    1
67997    1
16014    0
4302     0
87070    0
25963    1
11089    0
8267     0
94345    1
        ..
17881    0
78513    1
22326    0
49730    1
61626    0
52164    0
57398    1
84308    0
83861    1
9880     0
96140    1
30159    1
38721    0
85093    1
51021    1
18158    1
80416    1
75042    1
76424    0
8323     0
3091     0
60390    0
1676     0
70852    1
55027    1
97767    0
31351    1
98725    1
49039    1
72960    1
Name: sen, Length: 74991, dtype: int64
11111               [quot, heart, quot, like, stuck, hear]
5078     [already, contemplating, skipping, church, hea...
20029    [hayley, ace, well, sneaky, way, get, fit, fee...
70187                        [good, pussy, dat, den, neva]
89492    [nt, usually, really, good, making, sure, pal

## TFIDF + Logistic

In [96]:
params =  {'tfidfvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
           'tfidfvectorizer__max_features' : [50, 100, None],
          }

pipe = make_pipeline(TfidfVectorizer(),
                     LogisticRegression(solver='lbfgs', max_iter=200))

log_grid_cv = GridSearchCV(pipe, params, cv=3, scoring='roc_auc')
log_grid_cv.fit(X_train, y_train);

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...enalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3), (3, 5)], 'tfidfvectorizer__max_features': [25, 50, 100, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [109]:
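print('Best Params', log_grid_cv.best_params_)

# Note: roc_auc_score expects (y_true, y_score). These calls pass hard
# predictions first and the true labels second, so the printed values are
# not the usual ROC AUC computed from predict_proba scores.
print(f'Train ROCAUC: {roc_auc_score(log_grid_cv.predict(X_train), y_train):.2f}')
print(f'Test ROCAUC: {roc_auc_score(log_grid_cv.predict(X_test), y_test):.2f}')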

Best Params {'tfidfvectorizer__max_features': None, 'tfidfvectorizer__ngram_range': (1, 2)}
Train ROCAUC: 0.89
Test ROCAUC: 0.76
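scikit-learn's `roc_auc_score` takes the true labels first and the scores second, and it is normally given continuous scores such as `predict_proba(X)[:, 1]` rather than the 0/1 output of `predict`. The sketch below uses a small pure-Python AUC (the `roc_auc` helper and the toy numbers are illustrative, not from the notebook) to show how thresholding the scores first can change the result.

```python
def roc_auc(y_true, y_score):
    # AUC as the probability that a random positive outranks a random
    # negative, counting ties as half a win.
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.2, 0.7, 0.3, 0.8, 0.9]
hard_labels = [int(p >= 0.5) for p in probabilities]

auc_from_scores = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)

print(f'AUC from probabilities: {auc_from_scores:.3f}')
print(f'AUC from hard labels:   {auc_from_labels:.3f}')
```

With a fitted pipeline, the conventional call would look like `roc_auc_score(y_test, log_grid_cv.predict_proba(X_test)[:, 1])`. The figures reported above were not produced that way.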


## TFIDF + Naive Bayes

In [110]:
params =  {'tfidfvectorizer__ngram_range' : [(1,2)],
           'tfidfvectorizer__max_features' : [None],
          }

pipe = make_pipeline(TfidfVectorizer(),
                     MultinomialNB())

nb_grid_cv = GridSearchCV(pipe, params, cv=3, scoring='roc_auc')
nb_grid_cv.fit(X_train, y_train);

In [111]:
# roc_auc_score expects (y_true, y_score); here the hard predictions come first.
print(f'Train ROCAUC: {roc_auc_score(nb_grid_cv.predict(X_train), y_train):.2f}')
print(f'Test ROCAUC: {roc_auc_score(nb_grid_cv.predict(X_test), y_test):.2f}')

Train ROCAUC: 0.96
Test ROCAUC: 0.76


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [89]:
from gensim.models.word2vec import Word2Vec

In [112]:
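# Word2Vec needs token lists, so split any rows joined into strings earlier (size= is the gensim < 4.0 name)
model = Word2Vec([d.split() if isinstance(d, str) else d for d in df.sen_text], window=3, size=500)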

In [123]:
list(model.wv.vocab)

['sad',
 'friend',
 'missed',
 'new',
 'moon',
 'trailer',
 'omg',
 'already',
 'im',
 'sooo',
 'gunna',
 'cry',
 'dentist',
 'since',
 'get',
 'crown',
 'put',
 'think',
 'mi',
 'bf',
 'cheating',
 'tt',
 'worry',
 'much',
 'chillin',
 'sunny',
 'work',
 'tomorrow',
 'tv',
 'tonight',
 'handed',
 'uniform',
 'today',
 'miss',
 'hmmmm',
 'wonder',
 'number',
 'must',
 'positive',
 'thanks',
 'haters',
 'face',
 'day',
 'weekend',
 'sucked',
 'far',
 'jb',
 'isnt',
 'showing',
 'australia',
 'ok',
 'thats',
 'win',
 'lt',
 'way',
 'feel',
 'right',
 'man',
 'completely',
 'useless',
 'rt',
 'funny',
 'twitter',
 'http',
 'feeling',
 'strangely',
 'fine',
 'gon',
 'na',
 'go',
 'listen',
 'celebrate',
 'huge',
 'roll',
 'thunder',
 'scary',
 'cut',
 'beard',
 'growing',
 'well',
 'year',
 'start',
 'happy',
 'meantime',
 'iran',
 'one',
 'see',
 'cause',
 'else',
 'following',
 'pretty',
 'awesome',
 'level',
 'writing',
 'massive',
 'blog',
 'tweet',
 'myspace',
 'comp',
 'shut',
 'lost

In [114]:
model.wv.most_similar('twitter', topn=5)

[('facebook', 0.8317605257034302),
 ('sent', 0.8304129838943481),
 ('email', 0.8274568915367126),
 ('dm', 0.8265565633773804),
 ('link', 0.8208439350128174)]

In [115]:
#hmmmm
model.wv.most_similar('drunk', topn=5)

[('eventually', 0.9904677867889404),
 ('texts', 0.9903397560119629),
 ('boss', 0.9896416664123535),
 ('doctor', 0.9893132448196411),
 ('faint', 0.9890397191047668)]

In [124]:
model.wv.most_similar('swimming', topn=5)

[('somewhere', 0.9852827787399292),
 ('swim', 0.981758177280426),
 ('near', 0.9813928604125977),
 ('focused', 0.9805941581726074),
 ('nap', 0.9804528951644897)]
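The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking in plain Python on three made-up 3-dimensional vectors. The `toy_vectors` values and helper names are hypothetical and unrelated to the trained model.

```python
import math

# Hypothetical toy embeddings; a real model here would use 500 dimensions.
toy_vectors = {
    'twitter': [0.9, 0.1, 0.0],
    'facebook': [0.8, 0.2, 0.1],
    'swimming': [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=5):
    # Score every other word against the query and keep the best topn.
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = most_similar('twitter', toy_vectors, topn=2)
print(neighbors)
```

Scores very close to 1 for unrelated-looking words, as in the `drunk` query above, mean the vectors point in almost the same direction. That may be a sign that the model has not separated those words well.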