# U4 L4 P5 - Build an NLP Model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

- Data cleaning / processing / language parsing
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.

### Acknowledgements

We'll use a Kaggle data set consisting of 1.6 million tweets, labeled as [0 = negative, 2 = neutral, 4 = positive].

More information about the data set can be found [here](http://help.sentiment140.com/for-students/). The original research paper can be found [here](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

Link to download the data set:
https://www.kaggle.com/kazanova/sentiment140/home

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 1000)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
import spacy
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression

In [6]:
# Instantiate the SpaCy module
nlp = spacy.load('en')

In [2]:
# Read the data set file into a data frame
tweetsDf = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin1',
                       names=['target','id','date','flag','user','text'])

print(tweetsDf.shape)
tweetsDf.head(3)

### Column Metadata

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- id: The id of the tweet
- date: the date of the tweet
- flag: The query (if there is no query, then this value is NO_QUERY)
- user: the user that tweeted
- text: the text of the tweet
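
As a quick sanity check on the `target` column, the numeric polarity codes can be mapped to readable labels and the class balance inspected. The sketch below is illustrative only: it uses three made-up rows in the same six-column layout rather than the real file, and `LABELS`, `sample_csv`, and `sample` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical mapping from the polarity codes described above to names
LABELS = {0: 'negative', 2: 'neutral', 4: 'positive'}

# Three made-up rows in the same six-column layout as the data set
sample_csv = io.StringIO(
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,so tired today\n'
    '4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,love this song\n'
    '4,3,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,user_c,great morning\n'
)

sample = pd.read_csv(sample_csv, names=['target', 'id', 'date', 'flag', 'user', 'text'])

# Translate the codes to labels and compute the share of each class
sample['label'] = sample['target'].map(LABELS)
class_share = sample['label'].value_counts(normalize=True)
print(class_share)
```

Running the same two lines on `tweetsDf` would show which of the three codes actually occur in the sample and how balanced they are.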

## Bag-of-words and parts of speech model

In [None]:
# Extract a smaller sample to reduce processing time
tweetsDfhead = tweetsDf.sample(100000)
tweetsDfhead.reset_index(drop=True, inplace=True)

In [35]:
# Remove newlines and other extra whitespace by splitting and rejoining
tweetsDfhead['tokenized'] = tweetsDfhead.text.apply(lambda x: ' '.join(x.split()))

# Create SpaCy tokens from words
tweetsDfhead['tokenized'] = tweetsDfhead.tokenized.apply(lambda x: nlp(x))

# Create a flat list of tokens from all tweets
tokenlist = [token for doc in tweetsDfhead.tokenized for token in doc]

# Convert token list to more efficient numpy array
token_array = np.asarray(tokenlist)

# Delete token list to free up memory
del tokenlist
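
The sample output further down shows raw tweets containing HTML entities such as `&quot;`, which the tokenizer splits into fragments. One optional extra cleaning step, not part of the pipeline above, would be to decode those entities before normalizing whitespace. A minimal sketch using only the standard library (`clean_tweet` is a hypothetical helper name):

```python
import html


def clean_tweet(text):
    """Decode HTML entities, then collapse runs of whitespace into single spaces."""
    decoded = html.unescape(text)
    return ' '.join(decoded.split())


raw = '@Netra &quot;just twit&quot;  is a\ngreat motto'
cleaned = clean_tweet(raw)
print(cleaned)
```

Such a helper could be applied with `tweetsDfhead.text.apply(clean_tweet)` in place of the split-and-rejoin lambda.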

In [111]:
# Count lemmas (excluding punctuation and stop words) and keep the 50 most common
top_words = Counter([token.lemma_ for token in token_array
             if not token.is_punct
             and not token.is_stop]).most_common(50)

# Extract the most common words into a list
common_words = [item[0] for item in top_words]
common_words[:5]

['-PRON-', 'be', 'not', 'go', 'good']
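
To make the counting step concrete, here is a toy version that does not need SpaCy. It assumes the tweets have already been reduced to lists of lemmas and uses a small made-up stop-word set; `toy_tweets` and `toy_stop_words` are illustrative names only.

```python
from collections import Counter

# Made-up, already-lemmatized tweets
toy_tweets = [
    ['go', 'to', 'work', 'today'],
    ['good', 'day', 'at', 'work'],
    ['go', 'home', 'good', 'day'],
    ['work', 'be', 'good'],
]
toy_stop_words = {'to', 'at', 'be'}

# Flatten the tweets, drop stop words, and count what is left
lemma_counts = Counter(
    lemma
    for tweet in toy_tweets
    for lemma in tweet
    if lemma not in toy_stop_words
)

# most_common(n) returns (lemma, count) pairs sorted by descending count
top_pairs = lemma_counts.most_common(2)
top_lemmas = [lemma for lemma, _ in top_pairs]
print(top_pairs)
```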

In [None]:
# Create a zero-initialized column for every common word
for i in common_words:
    tweetsDfhead[i] = 0

# Create wordcount features in the data frame
# Process each row, counting the occurrence of words in each tweet
for i, sentence in enumerate(tweetsDfhead.tokenized):

    # Convert the sentence to lemmas, then filter out punctuation, stop words, and uncommon words
    words = [token.lemma_
             for token in sentence
             if (not token.is_punct
                 and not token.is_stop
                 and token.lemma_ in common_words)]

    # Populate the row with word counts
    for word in words:
        tweetsDfhead.loc[i, word] += 1

    # This counter is just to make sure the kernel didn't hang
    if i % 100 == 0:
        print("Processing row {}".format(i))
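
Updating one cell at a time with `.loc` inside a Python loop tends to be slow on 100,000 rows. One possible alternative, sketched here under the assumption that each tweet is available as a list of lemmas, is to build each row of counts in plain Python and create the frame from all rows at once. `count_row` is a hypothetical helper, and the vocabulary and tweets are made up.

```python
from collections import Counter


def count_row(lemmas, vocabulary):
    """Return the count of each vocabulary word in one tweet, in vocabulary order."""
    vocab_set = set(vocabulary)
    counts = Counter(lemma for lemma in lemmas if lemma in vocab_set)
    return [counts[word] for word in vocabulary]


vocabulary = ['go', 'good', 'day']  # stand-in for common_words
toy_tweets = [
    ['go', 'home', 'good', 'day'],
    ['good', 'good', 'coffee'],
]

# One list of counts per tweet, with columns in vocabulary order
rows = [count_row(tweet, vocabulary) for tweet in toy_tweets]
print(rows)
```

The resulting list of rows could then be passed to `pd.DataFrame(rows, columns=vocabulary)` and joined to the tweet frame in a single step.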

In [144]:
# Add zero-initialized count columns for the parts of speech types
pos_types = ['PROPN', 'ADV', 'NOUN', 'ADJ', 'VERB', 'CCONJ', 'PRON', 'NUM',
        'X', 'INTJ', 'DET', 'ADP', 'PUNCT', 'PART', 'SYM', 'SPACE']

for i in pos_types:
    tweetsDfhead[i] = 0

In [None]:
# Create POS count features in the data frame
# Process each row, counting the occurrence of parts of speech in each tweet
for i, sentence in enumerate(tweetsDfhead.tokenized):

    # Extract the coarse-grained part-of-speech tag of every token in the tweet
    POSs = [token.pos_ for token in sentence]

    # Populate the row with part-of-speech counts
    for POS in POSs:
        tweetsDfhead.loc[i, POS] += 1

    # This counter is just to make sure the kernel didn't hang
    if i % 100 == 0:
        print("Processing row {}".format(i))
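
The loop above assumes every tag the tagger emits is one of the 16 columns created earlier. Depending on the SpaCy version, other tags from the Universal POS tag set (for example `AUX` or `SCONJ`) may appear, so a more defensive count might track unexpected tags separately. This is a sketch on a made-up tag sequence, with `count_pos` as a hypothetical helper:

```python
from collections import Counter

POS_TYPES = ['PROPN', 'ADV', 'NOUN', 'ADJ', 'VERB', 'CCONJ', 'PRON', 'NUM',
             'X', 'INTJ', 'DET', 'ADP', 'PUNCT', 'PART', 'SYM', 'SPACE']


def count_pos(tags, known_tags=POS_TYPES):
    """Count known tags in column order and collect any unexpected tags."""
    counts = Counter(tags)
    row = [counts[tag] for tag in known_tags]
    unexpected = sorted(set(counts) - set(known_tags))
    return row, unexpected


# Made-up tag sequence for a single tweet
row, unexpected = count_pos(['ADJ', 'NOUN', 'VERB', 'ADJ', 'AUX'])
print(dict(zip(POS_TYPES, row)), unexpected)
```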

In [211]:
tweetsDfhead.head(2)

| | target | id | date | flag | user | text | tokenized | -PRON- | be | not | ... | PRON | NUM | X | INTJ | DET | ADP | PUNCT | PART | SYM | SPACE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1880773791 | Fri May 22 02:27:19 PDT 2009 | NO_QUERY | BenQIndia | @Netra &quot;just twit&quot; is a great motto..thanks! see you around | (@Netra, &, quot;just, twit&quot, ;, is, a, great, motto, .., thanks, !, see, you, around) | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 |
| 1 | 0 | 2030872846 | Thu Jun 04 09:00:16 PDT 2009 | NO_QUERY | brooke_hyatt | Dead frogs smell amazing | (Dead, frogs, smell, amazing) | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Testing the BOW + POS model

The Random Forest is the only model that overfits the training data: it scores about 0.96 on the training set but only about 0.62 under cross-validation. Overall accuracy isn't great for any of the three models, but all of them score above 62%.

In [169]:
# Split the data set to train and test samples
Y = tweetsDfhead.target
X = tweetsDfhead.iloc[:,10:]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

In [170]:
# Instantiate and train the Random Forest Classifier model
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9649333333333333

Cross validation test scores: [0.62381881 0.62501875 0.62541254]


In [171]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression()
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.6636166666666666

Cross validation test scores: [0.66334183 0.66101695 0.66179118]


In [172]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.6649166666666667

Cross validation test scores: [0.65816709 0.65921704 0.66171617]
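
The cells above call `cross_val_score` on the held-out 40% split, which clones each estimator and refits it on folds of that split. A common alternative is to cross-validate on the training split and keep the held-out split for a single final check. The sketch below illustrates that pattern on synthetic data from `make_classification`, not on the tweets; the model settings and the `evaluate` helper are assumptions chosen for a quick run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the feature matrix and labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)


def evaluate(model):
    """Cross-validate on the training split, then score once on the held-out split."""
    cv_scores = cross_val_score(model, X_train, y_train, cv=3)
    model.fit(X_train, y_train)
    return {
        'train': model.score(X_train, y_train),
        'cv_mean': cv_scores.mean(),
        'held_out': model.score(X_test, y_test),
    }


models = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}
results = {name: evaluate(model) for name, model in models.items()}

for name, scores in results.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

A large gap between `train` and `cv_mean` is the same overfitting signal discussed above, while `held_out` gives an estimate that played no part in model selection.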


### Testing the BOW only model
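The Random Forest Classifier overfits far less without the parts of speech features (0.72 training score vs. about 0.63 cross-validated), and its cross-validated score is essentially unchanged. Logistic Regression and Gradient Boosting both score about 2 percentage points lower than with the BOW + POS features.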

In [173]:
# Re-generate the train and test samples and run the model with BOW features only
Y = tweetsDfhead.target
X = tweetsDfhead.iloc[:,10:-16]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

In [174]:
# Instantiate and train the Random Forest Classifier model
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.7212833333333334

Cross validation test scores: [0.62906855 0.62081896 0.63133813]


In [175]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression()
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.64765

Cross validation test scores: [0.64369282 0.64354282 0.64393939]


In [176]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.6460833333333333

Cross validation test scores: [0.64039298 0.63971801 0.64528953]


## TF-IDF Model
The TF-IDF model is much more computationally intensive and performs worse with Gradient Boosting (about 0.59 cross-validated, versus about 0.66 with BOW + POS).

Also, both Random Forest and Logistic Regression overfit the training dataset.

That said, TF-IDF improved the cross-validated Logistic Regression score by a little under 1 percentage point (from about 0.662 to about 0.669).

In [203]:
# Instantiate Sklearn's TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_df=.005, # drop words that occur in more than 0.5% of the tweets
                             min_df=1, # only use words that appear at least once
                             stop_words='english', # drop common English stop words
                             lowercase=True, # convert everything to lower case
                             use_idf=True, # we definitely want to use inverse document frequencies in our weighting
                             norm='l2', # Applies a correction factor so that longer tweets and shorter tweets get treated equally
                             smooth_idf=True # Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

#Applying the vectorizer
vectorized = vectorizer.fit_transform(tweetsDfhead.text)
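
When `max_df` is a float, scikit-learn interprets it as a proportion of documents, so `max_df=.005` keeps only terms that appear in at most 0.5% of the tweets. The toy example below, on a made-up four-document corpus, shows the same mechanism with `max_df=0.5`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "good" appears in 3 of 4 documents, "morning" in 2 of 4
toy_corpus = [
    'good morning twitter',
    'good night',
    'good coffee today',
    'sad rainy morning',
]

# A float max_df is a proportion: terms in more than 50% of documents are dropped
toy_vectorizer = TfidfVectorizer(max_df=0.5)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

With a threshold as low as 0.005, frequent sentiment-bearing words are excluded along with filler words, so `max_df` is a natural parameter to revisit when tuning.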

In [205]:
# Splitting into training and test sets
Y = tweetsDfhead.target
X_train, X_test, y_train, y_test = train_test_split(vectorized, Y, test_size=0.4, random_state=0)

In [206]:
# Instantiate and train the Random Forest Classifier model
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9703666666666667

Cross validation test scores: [0.62614369 0.62824359 0.6340384 ]


In [207]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression()
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.8296666666666667

Cross validation test scores: [0.67054147 0.66701665 0.66884188]


In [208]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.5930833333333333

Cross validation test scores: [0.58902055 0.58332083 0.58378338]


## Conclusion

While the top score came from the TF-IDF model with Logistic Regression, it was less than 1 percentage point better than the "Bag of Words + Parts of Speech" model, and it came at a much higher computational cost.
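
The comparison can be checked by averaging the three cross-validation scores printed for each model above. The numbers below are copied from those outputs; the dictionary layout is just one convenient way to organize them.

```python
from statistics import mean

# Cross-validation scores copied from the outputs above
reported_scores = {
    ('BOW + POS', 'Random Forest'): [0.62381881, 0.62501875, 0.62541254],
    ('BOW + POS', 'Logistic Regression'): [0.66334183, 0.66101695, 0.66179118],
    ('BOW + POS', 'Gradient Boosting'): [0.65816709, 0.65921704, 0.66171617],
    ('BOW', 'Random Forest'): [0.62906855, 0.62081896, 0.63133813],
    ('BOW', 'Logistic Regression'): [0.64369282, 0.64354282, 0.64393939],
    ('BOW', 'Gradient Boosting'): [0.64039298, 0.63971801, 0.64528953],
    ('TF-IDF', 'Random Forest'): [0.62614369, 0.62824359, 0.6340384],
    ('TF-IDF', 'Logistic Regression'): [0.67054147, 0.66701665, 0.66884188],
    ('TF-IDF', 'Gradient Boosting'): [0.58902055, 0.58332083, 0.58378338],
}

mean_scores = {key: mean(values) for key, values in reported_scores.items()}
best = max(mean_scores, key=mean_scores.get)
gap = mean_scores[best] - mean_scores[('BOW + POS', 'Logistic Regression')]

# Rank the feature-set / model combinations from best to worst
for (features, model), score in sorted(mean_scores.items(), key=lambda kv: -kv[1]):
    print(f'{features:10s} {model:20s} {score:.4f}')
```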

Next steps:

- Tune the "Bag of Words + Parts of Speech" model parameters and try to increase accuracy by at least 5 percentage points;
- Run the model with a larger sample (e.g., 500k tweets).
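
For the tuning step, one option is an exhaustive grid search with cross-validation. The sketch below runs `GridSearchCV` over the regularization strength `C` of a Logistic Regression on synthetic data; the grid values are arbitrary assumptions, not recommendations for the tweet features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the BOW + POS feature matrix
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid over the inverse regularization strength
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# Fit one model per grid value and fold, keeping the best mean CV accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', round(search.best_score_, 3))
```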