# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [65]:
import re
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string                      # for punctuations
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, \
                                            TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import roc_auc_score, accuracy_score
from gensim.models.word2vec import Word2Vec
from nltk import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samirgadkari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [66]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [67]:
print(' '.join(whitespace_string.split()))

This is a string that has a lot of extra whitespace.
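The `split()`/`join()` idiom above collapses every run of whitespace, including newlines, into a single space. A regular-expression version gives the same result; this is a minimal sketch, and `normalize_whitespace` is a hypothetical helper name that does not appear in the notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "
cleaned_string = normalize_whitespace(whitespace_string)
print(cleaned_string)
```

Both approaches treat spaces, tabs, and newlines alike, so either works when line breaks carry no meaning.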


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [68]:
!wget https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt

--2019-03-29 10:27:47--  https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.188.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.188.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 296 [text/plain]
Saving to: ‘dates.txt.3’


2019-03-29 10:27:48 (17.6 MB/s) - ‘dates.txt.3’ saved [296/296]



In [69]:
with open('dates.txt') as f:
    lines = f.readlines()
    
    regex = r'(\w+)\s(\d+),\s(\d+)'
    def split_date(x):
        res = re.findall(regex, x)
        return res[0]
    
    zipped_dates = list(map(split_date, lines))
    
df = pd.DataFrame(data=zipped_dates, columns=['Month', 'Day', 'Year'])
df

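|   | Month | Day | Year |
|---|-------|-----|------|
| 0 | March | 8 | 2015 |
| 1 | March | 15 | 2015 |
| 2 | March | 22 | 2015 |
| 3 | March | 29 | 2015 |
| 4 | April | 5 | 2015 |
| 5 | April | 12 | 2015 |
| 6 | April | 19 | 2015 |
| 7 | April | 26 | 2015 |
| 8 | May | 3 | 2015 |
| 9 | May | 10 | 2015 |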
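The extraction above indexes `res[0]` directly, which would raise an `IndexError` on any line that contains no date. The sketch below uses named groups and skips non-matching lines. The sample lines are an assumption about the file's format, inferred from the table above, and `DATE_RE` and `parse_dates` are hypothetical names.

```python
import re

# Month name, day, comma, four-digit year, e.g. "March 8, 2015".
DATE_RE = re.compile(r'(?P<month>[A-Za-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})')

def parse_dates(lines):
    """Return (month, day, year) tuples for every line that contains a date."""
    rows = []
    for line in lines:
        match = DATE_RE.search(line)
        if match is None:
            continue  # skip blank or malformed lines instead of failing
        rows.append((match.group('month'),
                     int(match.group('day')),
                     int(match.group('year'))))
    return rows

# Assumed sample input; the real file is read with f.readlines().
sample_lines = ["March 8, 2015\n", "\n", "April 12, 2015\n", "not a date\n"]
rows = parse_dates(sample_lines)
print(rows)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=['Month', 'Day', 'Year'])` in the same way as `zipped_dates`.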


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents, ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [70]:
twitter_df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets'
                         '/master/twitter_sentiment_binary.csv',
                         header=0)
twitter_df.head()

|   | Sentiment | SentimentText |
|---|-----------|---------------|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


In [71]:
twitter_df.isnull().sum()

Sentiment        0
SentimentText    0
dtype: int64

In [74]:
def lowercase(xs):
    return [s.lower() for s in xs]

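def remove_stopwords(xs):
    # NOTE: at this stage each element of xs is a whole tweet, not a single
    # word, so this membership test compares entire tweets against the stop
    # words and leaves the words inside each tweet untouched. Split each
    # tweet into tokens first if per-word stop-word removal is intended.
    stop_words = stopwords.words('english')
    return [w for w in xs if w not in stop_words]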

def remove_punctuations(xs):
    translations = str.maketrans('', '', string.punctuation)
    return [   [word.translate(translations) \
                for word in sentence.split()] \
           for sentence in xs]

def join_words(xs):
    return [' '.join(sentence) for sentence in xs]

def clean_text(df, columns, transformations):
    
    def transform(xs, transformations):
        if len(transformations) == 0:
            return xs
        
        return transform(transformations[0](xs), transformations[1:])
        
    df2 = df.copy()
    for c in columns:
        col = df[c].values
        
        df2[c] = transform(col, transformations)
    return df2

cleaned = clean_text(twitter_df, ['SentimentText'], \
                     [lowercase, remove_stopwords,  \
                      remove_punctuations, join_words])
cleaned.head()

|   | Sentiment | SentimentText |
|---|-----------|---------------|
| 0 | 0 | is so sad for my apl friend |
| 1 | 0 | i missed the new moon trailer |
| 2 | 1 | omg its already 730 o |
| 3 | 0 | omgaga im sooo im gunna cry ive been at this ... |
| 4 | 0 | i think mi bf is cheating on me tt |
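The cleaned output above still contains stop words such as `is`, `so`, and `my`, which suggests the stop-word filter was applied to whole tweets rather than to individual words. The sketch below tokenizes first and then filters per token. `STOP_WORDS` is a deliberately tiny, hypothetical stand-in for `stopwords.words('english')` so the example runs without downloading NLTK data, and `clean_tweet` is a hypothetical helper name.

```python
import string

# Hypothetical mini stop-word set; in the notebook this would be
# set(stopwords.words('english')).
STOP_WORDS = {'is', 'so', 'for', 'my', 'i', 'the', 'on', 'me'}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tweet(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on whitespace, strip punctuation, drop stop words."""
    tokens = text.lower().split()
    tokens = [tok.translate(PUNCT_TABLE) for tok in tokens]
    # Drop tokens that became empty (pure punctuation) and stop words.
    return [tok for tok in tokens if tok and tok not in stop_words]

tokens = clean_tweet("is so sad for my APL friend!!!")
print(tokens)
```

Using a `set` for the stop words keeps each membership test constant-time, which matters with roughly 100,000 tweets.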


In [75]:
cleaned.Sentiment.value_counts()

1    56457
0    43532
Name: Sentiment, dtype: int64

### The value counts show the target is roughly balanced (about 56% labeled 1 and 44% labeled 0), so either ROC AUC or accuracy is a reasonable metric.
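The two metrics consume different inputs: accuracy compares hard class labels, while ROC AUC ranks predicted scores. The pure-Python sketch below illustrates the difference on made-up labels and scores. It is not how scikit-learn implements either metric, and the data is invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented example data.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, auc)
```

Accuracy depends on the chosen threshold, whereas ROC AUC depends only on how the scores order the examples.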

### How should TF-IDF scores be interpreted? How are they calculated?

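TF-IDF stands for Term Frequency - Inverse Document Frequency.

Interpretation:
TF-IDF scores give more weight to words that appear frequently within a document
and less weight to words that appear in many documents. A high score marks a word
that is frequent in one document but rare across the corpus.

They are calculated using:
  - TF: (how many times the word appeared in the document)/(total number of words in document)
  - IDF: log base 2((number of documents)/(number of documents in which word appeared)). The base of the logarithm only rescales the scores; scikit-learn's `TfidfVectorizer` uses the natural log with smoothing by default.
  - TF-IDF score: TF * IDF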
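The three formulas above can be worked through on a toy corpus. This is a minimal sketch of the textbook definition with a base-2 logarithm, not of scikit-learn's smoothed and normalized variant; the documents are invented.

```python
import math

# Invented, already-tokenized corpus of three documents.
docs = [
    ['sad', 'sad', 'friend'],
    ['happy', 'friend'],
    ['happy', 'day'],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log2(N / df). Assumes `term` occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_sad = tf_idf('sad', docs[0], docs)        # frequent here, rare elsewhere
score_friend = tf_idf('friend', docs[0], docs)  # appears in two of three docs
print(score_sad, score_friend)
```

In the first document `sad` outscores `friend` because it occurs twice there and in no other document, while `friend` is shared with a second document.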

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
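Step 3 matters because the vocabulary must come from the training data only; words that first appear in the test set are ignored at transform time. The sketch below shows that fit/transform split with a hand-rolled bag-of-words. `fit_vocabulary` and `transform` are hypothetical helpers that only mimic the idea behind `CountVectorizer.fit` and `.transform`, and the documents are invented.

```python
def fit_vocabulary(train_docs):
    """Map each distinct training token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count-vectorize docs using a fixed vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            idx = vocabulary.get(tok)
            if idx is not None:  # tokens unseen during fitting are dropped
                row[idx] += 1
        rows.append(row)
    return rows

train_docs = ['sad friend', 'happy happy day']
test_docs = ['happy friend', 'unseen word']

vocabulary = fit_vocabulary(train_docs)          # learned from train only
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)   # same columns as train
print(vocabulary)
print(test_matrix)
```

Because both matrices share the training vocabulary, they have the same number of columns, which is what lets a classifier trained on one score the other.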



In [76]:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    print('X_train, y_train:', X_train.shape, y_train.shape)
    print('X_test, y_test', X_test.shape, y_test.shape)
    
    return X_train, X_test, y_train, y_test

X = cleaned['SentimentText']
y = cleaned['Sentiment']
X_train, X_test, y_train, y_test = split_data(X, y)

X_train, y_train: (79991,) (79991,)
X_test, y_test (19998,) (19998,)


In [77]:
vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)
print(vectorizer.vocabulary_)



In [78]:
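def vect_transform(vectorizer, X):
    X_transformed = vectorizer.transform(X)
    # .toarray() converts the sparse matrix to a dense one; with ~100k features
    # this can need tens of gigabytes of memory.
    # get_feature_names() was removed in scikit-learn 1.2; on newer versions
    # use vectorizer.get_feature_names_out() instead.
    X_transformed_df = pd.DataFrame(X_transformed.toarray(), 
                                    columns=vectorizer.get_feature_names())
    print('Transformed shape:', X_transformed_df.shape)
    return X_transformed_df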

In [79]:
X_train_vectorized = vect_transform(vectorizer, X_train)
X_train_vectorized.head()

Transformed shape: (79991, 101730)


Unnamed: 0,00,000,0000abcd,0001t,000martha,001,0010x0010,005603,00711,007heather007,...,ø³ø¹ø,øµø,øµù,ø¹,ø¹ø,ø¹øª,ø¹ù,ùø,ùøª,ùù
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [80]:
X_test_vectorized  = vect_transform(vectorizer, X_test)
X_test_vectorized.head()

Transformed shape: (19998, 101730)


Unnamed: 0,00,000,0000abcd,0001t,000martha,001,0010x0010,005603,00711,007heather007,...,ø³ø¹ø,øµø,øµù,ø¹,ø¹ø,ø¹øª,ø¹ù,ùø,ùøª,ùù
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### To make it easy to run multiple tests, we will create a function that takes a list of transformations and a scoring function. Then we can call this function multiple times/in a loop. This repeats the Vectorizer step done above, but is more reusable.
### Maybe I should have removed the above code, but it is good to see multiple ways of processing the data.

In [81]:
def test_model(pipeline_list, scoring_function, X_train, X_test, y_train, y_test):
    pipeline = Pipeline(pipeline_list)
    pipeline.fit(X_train, y_train)
    
    if scoring_function == roc_auc_score:
        print('Scoring using roc_auc_score:')
        # Column 1 is taken as the positive class here. It's best to use
        # pipeline.classes_ to find the class-to-index mapping on the second
        # index of the return value from the predict_proba function.
        y_pred_train = pipeline.predict_proba(X_train)[:, 1]
        y_pred_test  = pipeline.predict_proba(X_test)[:, 1]
    elif scoring_function == accuracy_score:
        print('Scoring using accuracy:')
        y_pred_train = pipeline.predict(X_train)  # Cannot use predict_proba for accuracy
        y_pred_test  = pipeline.predict(X_test)
        
    print('  train score:', scoring_function(y_train, y_pred_train))    
    print('  test score:', scoring_function(y_test, y_pred_test))
    
    # print('vocabulary:', pipeline.named_steps[pipeline_list[0][0]].vocabulary_)
    return pipeline   # pipeline returned in case you want to do something else with it

In [82]:
p = test_model([('CountVectorizer', CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')),
                ('LogisticRegression', LogisticRegression(solver='lbfgs', max_iter=200, random_state=1))],
               accuracy_score, X_train, X_test, y_train, y_test)



Scoring using accuracy:
  train score: 0.9062144491255266
  test score: 0.7498749874987499


In [83]:
p = test_model([('CountVectorizer', CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')),
                ('LogisticRegression', LogisticRegression(solver='lbfgs', max_iter=200, random_state=1))],
               roc_auc_score, X_train, X_test, y_train, y_test)



Scoring using roc_auc_score:
  train score: 0.9636457369587224
  test score: 0.8230347183967857


In [84]:
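# NOTE: this step is named 'TfidfVectorizer' but builds a CountVectorizer, so the
# run repeats the CountVectorizer experiment and reports the same scores.
# Replace CountVectorizer(...) with TfidfVectorizer(...) to evaluate TF-IDF features.
p = test_model([('TfidfVectorizer', CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')),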
                ('LogisticRegression', LogisticRegression(solver='lbfgs', max_iter=200, random_state=1))],
               accuracy_score, X_train, X_test, y_train, y_test)



Scoring using accuracy:
  train score: 0.9062144491255266
  test score: 0.7498749874987499


In [85]:
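# NOTE: as in the previous cell, the step named 'TfidfVectorizer' builds a
# CountVectorizer, so the ROC AUC scores repeat the CountVectorizer run.
p = test_model([('TfidfVectorizer', CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')),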
                ('LogisticRegression', LogisticRegression(solver='lbfgs', max_iter=200, random_state=1))],
               roc_auc_score, X_train, X_test, y_train, y_test)



Scoring using roc_auc_score:
  train score: 0.9636457369587224
  test score: 0.8230347183967857


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [86]:
lemmatizer = WordNetLemmatizer()
lemmatized = [[lemmatizer.lemmatize(tok) for tok in sentence.split()] \
              for sentence in cleaned.SentimentText]
lemmatized[:3]

[['is', 'so', 'sad', 'for', 'my', 'apl', 'friend'],
 ['i', 'missed', 'the', 'new', 'moon', 'trailer'],
 ['omg', 'it', 'already', '730', 'o']]

In [87]:
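def run_w2v_model(data, min_count, size):
    # Written against the gensim 3.x API. In gensim 4.0 and later, pass
    # vector_size= instead of size=, and use model.wv.key_to_index in place
    # of model.wv.vocab.
    model = Word2Vec(data, min_count=min_count, size=size)
    print('model:', model)
    print('vocab len:', len(model.wv.vocab), 'vocab:', list(model.wv.vocab))
    return model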

In [88]:
model = run_w2v_model(lemmatized, 1, 5)

model: Word2Vec(vocab=116068, size=5, alpha=0.025)


In [89]:
model.wv.most_similar('twitter')

[('sulk', 0.9966536164283752),
 ('joogle', 0.9952749609947205),
 ('lvatt', 0.9950293302536011),
 ('amandaelyss', 0.9950042963027954),
 ('famï¿½es', 0.9950007796287537),
 ('ur', 0.9947282075881958),
 ('mmhmm', 0.9946691989898682),
 ('myparkingfinecom', 0.9945354461669922),
 ('squad', 0.9932874441146851),
 ('weet', 0.9929564595222473)]
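The scores returned by `most_similar` are cosine similarities between word vectors. The pure-Python sketch below shows that ranking on invented 2-dimensional vectors. `toy_vectors` and both helper functions are hypothetical and are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors for illustration only.
toy_vectors = {
    'twitter': [1.0, 0.0],
    'tweet': [0.9, 0.1],
    'facebook': [0.6, 0.8],
    'sad': [-1.0, 0.0],
}
ranking = most_similar('twitter', toy_vectors, topn=3)
print(ranking)
```

With only 5 dimensions and `min_count=1`, many unrelated words can end up pointing in nearly the same direction, which may explain why every score in the list above exceeds 0.99. A larger vector size and a higher `min_count` would be worth trying.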