# Tweet Sentiment Classification
### Module 4 Project - Kai Graham

# Overview of Process (CRISP-DM)
I will be following the Cross-Industry Standard Process for Data Mining to build a classifier that will determine the sentiment of tweets.  The CRISP-DM process includes the following key steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

# 1. Business Understanding
* Who are the stakeholders?
* What business problems will this solve?
* What problems are inside the scope of this project?
* What problems are outside scope?
* What data sources are available to us?
* Timeline of project / deadlines?
* Do stakeholders from different parts of the company all agree?

My overall goal is to create a classifier that assigns each tweet to one of three sentiment classes: positive, negative, or neutral.  Because the data is limited to tweets about Apple and Google products, the main stakeholders for this project are likely Apple and Google themselves.  Product managers and other managers within these companies could use a tool like this to track public sentiment surrounding product launches and software updates.  Although outside the specific scope of this project, the classifier's output could be combined with time series metrics to track changes in sentiment over time.

# 2. Data Understanding
* What data is available to us, where does it live?
* What is our target?
* What predictors are available?
* What is distribution of data?
* EDA to show most common words and other corpus statistics.  Will require some initial processing.

The main dataset used throughout this data science process will be coming from CrowdFlower via the following url: `https://data.world/crowdflower/brands-and-product-emotions`. 

The following summary of the dataset is provided on CrowdFlower:

*Contributors evaluated tweets about multiple brands and products. The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion.*

In [54]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.collocations import *
import string
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV 
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE

In [2]:
# failure to specify 'latin1' encoding results in errors
# error_df = pd.read_csv('judge-1377884607_tweet_product_company.csv')
# error_df.head()

In [3]:
# load dataset
raw_df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin_1')

In [4]:
# print first rows of dataset
raw_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [5]:
# show info of df
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


Looking at the above outputs, we can see that there are 9,093 entries in our dataset across three columns.  Raw text from tweets is held in the `tweet_text` column; sentiment is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column; and the brand or product the emotion is directed at is held in the `emotion_in_tweet_is_directed_at` column.

At first glance, we can likely drop the `emotion_in_tweet_is_directed_at` column, as we are more interested in whether the sentiment of a given tweet is positive, neutral, or negative based on its text.  The main predictors we will use are processed features derived from the `tweet_text` column.

Our target variable, which can also be thought of as our class labels, is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column.

In [6]:
# display value counts
display(raw_df['emotion_in_tweet_is_directed_at'].value_counts())
display(raw_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts())

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

Unsurprisingly given the origin of our dataset, the products identified are all Apple or Google products.  Looking at sentiment, the majority of entries fall under a neutral sentiment ('No emotion toward brand or product'), with the next largest group tagged as 'Positive emotion'.  There is clear class imbalance, with only 570 entries belonging to the 'Negative emotion' class.
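To put the imbalance in numbers, the counts printed above can be converted to proportions.  The snippet below is a minimal sketch that hard-codes those counts rather than reading the dataset; the `label_counts` name is illustrative only.

```python
# Class counts copied from the value_counts() output above
label_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(label_counts.values())

# Share of the dataset belonging to each label
proportions = {label: count / total for label, count in label_counts.items()}

for label, share in proportions.items():
    print(f"{label:<36} {share:.1%}")
```

Negative tweets make up only about 6% of the raw labels, which is worth keeping in mind when choosing class weights or resampling strategies later on.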

In [7]:
# rename columns so easier to work with
df = raw_df.copy()
df.columns = ['text', 'product_brand', 'sentiment']
df.head()

Unnamed: 0,text,product_brand,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [8]:
# explore potential missing values
df.isna().sum()

text                1
product_brand    5802
sentiment           0
dtype: int64

We see that there is only one missing value in the text column, none in the sentiment column, and a large number (5,802) in the product_brand column.  Given that we plan to work almost entirely with the text and sentiment columns, this is unlikely to pose a large issue.

In [9]:
# display missing value in the text column
df.loc[df['text'].isna()]

Unnamed: 0,text,product_brand,sentiment
6,,,No emotion toward brand or product


In [10]:
# display missing values in the product_brand column
df.loc[df['product_brand'].isna()]

Unnamed: 0,text,product_brand,sentiment
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product
...,...,...,...
9087,"@mention Yup, but I don't have a third app yet...",,No emotion toward brand or product
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [11]:
# display sentiment breakdowns of missing product_brand entries
df.loc[df['product_brand'].isna()]['sentiment'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: sentiment, dtype: int64

We see that the majority of missing product_brand values are also labeled as no emotion toward brand or product.  This makes sense, as many of the neutral-labeled tweets may not be directed at a specific brand or product and would therefore be missing a product_brand tag.  Additionally, this column will not be used in our process of tweet classification.

Drop unnecessary columns and handle missing value for additional EDA.

In [12]:
# drop product_brand column
clean_df = df.drop(['product_brand'], axis=1)

# handle missing values
clean_df = clean_df.dropna(subset=['text'])
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       9092 non-null   object
 1   sentiment  9092 non-null   object
dtypes: object(2)
memory usage: 213.1+ KB


In [13]:
# further examine tweets labeled as "I can't tell"
ambiguous_text = clean_df.loc[clean_df['sentiment'] == "I can't tell", 'text']
for i in range(10):
    display(ambiguous_text.iloc[i])

'Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh'

'\x89ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow'

'Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw'

'The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw'

"Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link}"

'SYD #SXSW crew your iPhone extra juice pods have been procured.'

'Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}'

'Gave into extreme temptation at #SXSW and bought an iPad 2... #impulse'

'Catch 22\x89Û_ I mean iPad 2 at #SXSW : {link}'

'Forgot my iPhone for #sxsw. Android only. Knife to a gun fight'

Looking at a sample of the tweets labeled "I can't tell", there is no clear class label that each should belong to.  Given this, and the small number of tweets with this label, they will be removed from the dataset.

In [14]:
# separate dataset into tweets and class_labels for additional EDA
tweets = clean_df['text']
class_labels = clean_df['sentiment']

In [15]:
# tokenize tweets and print the total vocabulary size of our dataset
tokenized = list(map(nltk.word_tokenize, tweets.dropna())) 
raw_tweet_vocab = set()
for tweet in tokenized:
    raw_tweet_vocab.update(tweet)
print(len(raw_tweet_vocab))

13212


The raw tweets contain just over 13,200 unique tokens.  Note that at this stage punctuation marks and case variants (for example 'SXSW' and 'sxsw') are counted as separate tokens.

In [16]:
# print average tweet size
mean_tweet_size = []
for tweet in tokenized:
    mean_tweet_size.append(len(tweet))
np.mean(mean_tweet_size)

24.414980202375716

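The average tweet size within the dataset is just over 24 tokens.  Because the tokenizer splits off punctuation, this is higher than the number of whitespace-separated words.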
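The gap between token counts and word counts can be illustrated on one of the tweets shown earlier.  This is only a rough sketch: the regular expression below is a simple stand-in for a punctuation-aware tokenizer and is not how `nltk.word_tokenize` works internally.

```python
import re

tweet = "Gave into extreme temptation at #SXSW and bought an iPad 2... #impulse"

# Whitespace-separated "words"
words = tweet.split()

# Rough stand-in for a punctuation-aware tokenizer (NOT nltk.word_tokenize):
# runs of word characters, or single punctuation characters
tokens = re.findall(r"\w+|[^\w\s]", tweet)

print(f"{len(words)} words vs. {len(tokens)} tokens")
```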

In [17]:
# display frequency distribution of raw dataset
tweets_concat = []
for tweet in tokenized:
    tweets_concat += tweet
    
# display the 25 most common tokens
unprocessed_freq_dist = nltk.FreqDist(tweets_concat)
unprocessed_freq_dist.most_common(25)

[('#', 15875),
 ('@', 7194),
 ('mention', 7123),
 ('.', 5601),
 ('SXSW', 4787),
 ('sxsw', 4523),
 ('link', 4311),
 ('}', 4298),
 ('{', 4296),
 ('the', 3928),
 (',', 3533),
 ('to', 3521),
 ('RT', 2947),
 ('at', 2859),
 (';', 2800),
 ('&', 2707),
 ('for', 2440),
 ('!', 2398),
 ('a', 2174),
 ('Google', 2136),
 ('iPad', 2129),
 (':', 2075),
 ('Apple', 1882),
 ('in', 1833),
 ('quot', 1696)]

At first glance, we can see that many of the most frequent tokens are stopwords or punctuation.  For additional EDA, we will try removing stopwords to see if additional information can be extracted from the data.

In [18]:
def initial_tweet_process(tweet, stopwords_list):
    """
    Function to initially process a tweet to assist in EDA / data understanding. 
    Input: tweet of type string, stopwords_list of words to remove
    Returns: tokenized tweet, converted to lowercase, with all stopwords removed
    """
    # tokenize
    tokens = nltk.word_tokenize(tweet)
    
    # remove stopwords and lowercase
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    
    # return processed tweet
    return stopwords_removed

In [19]:
# set up initial stopwords list
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']
stopwords_list += ['mention', 'sxsw', 'link', 'rt', 'quot', 'google', 'apple']

In [20]:
# separate dataset based on class label
neutral_tweets = clean_df.loc[clean_df['sentiment'] == 'No emotion toward brand or product']
positive_tweets = clean_df.loc[clean_df['sentiment'] == 'Positive emotion']
negative_tweets = clean_df.loc[clean_df['sentiment'] == 'Negative emotion']
ambig_tweets = clean_df.loc[clean_df['sentiment'] == "I can't tell"]
all_tweets = clean_df.copy()

In [21]:
# process the datasets
processed_neutral = neutral_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_positive = positive_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_negative = negative_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_ambig = ambig_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_all = all_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))

In [22]:
def concat_tweets(tweets):
    """
    Function to concatenate a list of tokenized tweets into one flat list of tokens.
    Input: tweets (list of tokenized tweets, each a list of tokens)
    Returns: single list containing every token from every tweet
    """
    tweets_concat = []
    for tweet in tweets:
        tweets_concat += tweet
    return tweets_concat
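For reference, the same flattening and counting can be sketched with the standard library alone.  Here `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it), and the sample tokens are made up for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized tweets (each tweet is a list of tokens)
processed = [
    ["ipad", "2", "launch"],
    ["iphone", "app"],
    ["ipad", "app", "design"],
]

# Flatten the list of lists into one flat list of tokens
all_tokens = list(chain.from_iterable(processed))

# Count token occurrences, analogous to nltk.FreqDist(all_tokens)
freq = Counter(all_tokens)
print(freq.most_common(2))
```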

In [23]:
# concatenate the datasets
concat_neutral = concat_tweets(list(processed_neutral))
concat_positive = concat_tweets(list(processed_positive))
concat_negative = concat_tweets(list(processed_negative))
concat_ambig = concat_tweets(list(processed_ambig))
concat_all = concat_tweets(list(processed_all))

# produce frequency distributions for datasets
freqdist_neutral = nltk.FreqDist(concat_neutral)
freqdist_positive = nltk.FreqDist(concat_positive)
freqdist_negative = nltk.FreqDist(concat_negative)
freqdist_ambig = nltk.FreqDist(concat_ambig)
freqdist_all = nltk.FreqDist(concat_all)

In [24]:
# display top neutral words
print('Top Neutral Words')
freqdist_neutral.most_common(15)

Top Neutral Words


[('ipad', 1212),
 ('store', 867),
 ('iphone', 815),
 ('new', 678),
 ("'s", 648),
 ('austin', 630),
 ('amp', 601),
 ('2', 550),
 ('circles', 490),
 ('social', 481),
 ('launch', 465),
 ('today', 441),
 ('app', 355),
 ('network', 355),
 ('android', 350)]

In [25]:
# display top positive words
print('Top Positive Words')
freqdist_positive.most_common(15)

Top Positive Words


[('ipad', 1003),
 ('store', 545),
 ('iphone', 523),
 ("'s", 493),
 ('2', 490),
 ('app', 396),
 ('new', 360),
 ('austin', 294),
 ('amp', 211),
 ('ipad2', 209),
 ('android', 198),
 ('launch', 160),
 ('get', 157),
 ("n't", 152),
 ('pop-up', 151)]

In [26]:
# display negative words
print('Top Negative Words')
freqdist_negative.most_common(15)

Top Negative Words


[('ipad', 188),
 ('iphone', 162),
 ("n't", 87),
 ("'s", 77),
 ('2', 64),
 ('app', 60),
 ('store', 46),
 ('new', 43),
 ('like', 39),
 ('circles', 34),
 ('social', 31),
 ('apps', 30),
 ('people', 29),
 ('design', 28),
 ('need', 25)]

In [27]:
print('Top Ambiguous Words')
freqdist_ambig.most_common(15)

Top Ambiguous Words


[('ipad', 43),
 ('iphone', 32),
 ('store', 22),
 ("'s", 19),
 ('2', 18),
 ('circles', 17),
 ('austin', 16),
 ("n't", 14),
 ('social', 12),
 ('like', 11),
 ('go', 10),
 ('new', 9),
 ('pop-up', 9),
 ('line', 9),
 ('today', 8)]

In [28]:
# display top all 
print('Top Words from All Tweets')
freqdist_all.most_common(15)

Top Words from All Tweets


[('ipad', 2446),
 ('iphone', 1532),
 ('store', 1480),
 ("'s", 1237),
 ('2', 1122),
 ('new', 1090),
 ('austin', 964),
 ('amp', 836),
 ('app', 817),
 ('circles', 658),
 ('launch', 653),
 ('social', 648),
 ('today', 580),
 ('android', 577),
 ("n't", 482)]

Comparing the top words, we see that several product-related words (for example 'ipad', 'iphone', and 'store') appear frequently under every class label and are therefore less helpful for classification.  We will extend the stopwords list and reprint the frequency distributions in normalized form.
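One way to make this comparison explicit is to intersect the top-word lists of the three main classes.  The sketch below hard-codes the top-15 words from the outputs above; the variable names are illustrative and the cut-off of 15 is simply the one used in those outputs.

```python
# Top-15 words per class, copied from the frequency outputs above
top_neutral = ["ipad", "store", "iphone", "new", "'s", "austin", "amp", "2",
               "circles", "social", "launch", "today", "app", "network", "android"]
top_positive = ["ipad", "store", "iphone", "'s", "2", "app", "new", "austin",
                "amp", "ipad2", "android", "launch", "get", "n't", "pop-up"]
top_negative = ["ipad", "iphone", "n't", "'s", "2", "app", "store", "new",
                "like", "circles", "social", "apps", "people", "design", "need"]

# Words that rank in the top 15 of every class carry little class signal
shared = set(top_neutral) & set(top_positive) & set(top_negative)
print(sorted(shared))
```

A word such as 'android' falls just outside the negative top 15, so a cut-off based check like this is only a starting point for choosing stopwords.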

In [29]:
# create additional stopwords
additional_stopwords = ['ipad', 'iphone', 'android', 'store']

In [30]:
def print_normalized_word_freq(freq_dist, n=15):
    """
    Print the top n words of a frequency distribution with their normalized frequencies (count / total tokens).
    """
    total_word_count = sum(freq_dist.values())
    top = freq_dist.most_common(n)
    
    print('Word\t\t\tNormalized Frequency')
    for word in top:
        normalized_freq = word[1] / total_word_count
        print('{} \t\t\t {:.4}'.format(word[0], normalized_freq))
    
    return None

In [31]:
# neutral normalized frequency
print_normalized_word_freq(freqdist_neutral)

Word			Normalized Frequency
ipad 			 0.02441
store 			 0.01746
iphone 			 0.01641
new 			 0.01365
's 			 0.01305
austin 			 0.01269
amp 			 0.0121
2 			 0.01108
circles 			 0.009868
social 			 0.009687
launch 			 0.009365
today 			 0.008881
app 			 0.007149
network 			 0.007149
android 			 0.007049


In [32]:
# positive normalized frequency
print_normalized_word_freq(freqdist_positive)

Word			Normalized Frequency
ipad 			 0.03489
store 			 0.01896
iphone 			 0.01819
's 			 0.01715
2 			 0.01705
app 			 0.01378
new 			 0.01252
austin 			 0.01023
amp 			 0.00734
ipad2 			 0.007271
android 			 0.006888
launch 			 0.005566
get 			 0.005462
n't 			 0.005288
pop-up 			 0.005253


In [33]:
# negative normalized frequency
print_normalized_word_freq(freqdist_negative)

Word			Normalized Frequency
ipad 			 0.03262
iphone 			 0.02811
n't 			 0.01509
's 			 0.01336
2 			 0.0111
app 			 0.01041
store 			 0.007981
new 			 0.00746
like 			 0.006766
circles 			 0.005899
social 			 0.005378
apps 			 0.005205
people 			 0.005031
design 			 0.004858
need 			 0.004337


In [34]:
# ambig normalized frequency
print_normalized_word_freq(freqdist_ambig)

Word			Normalized Frequency
ipad 			 0.02937
iphone 			 0.02186
store 			 0.01503
's 			 0.01298
2 			 0.0123
circles 			 0.01161
austin 			 0.01093
n't 			 0.009563
social 			 0.008197
like 			 0.007514
go 			 0.006831
new 			 0.006148
pop-up 			 0.006148
line 			 0.006148
today 			 0.005464


In [35]:
def print_bigrams(tweets_concat, n=15):
    """
    Function takes concatenated tweets and prints most common bigrams
    """
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_scored = finder.score_ngrams(bigram_measures.raw_freq)
    display(tweet_scored[:n])
    return None

In [36]:
# neutral bigrams
print_bigrams(concat_neutral)

[(('ipad', '2'), 0.00896201715873847),
 (('social', 'network'), 0.0070084988117775),
 (('new', 'social'), 0.006424457244129375),
 (('called', 'circles'), 0.005175816651226487),
 (('network', 'called'), 0.00507511982921819),
 (('major', 'new'), 0.004531356990373383),
 (('launch', 'major'), 0.004370242075160108),
 (('pop-up', 'store'), 0.0038264792363153018),
 (('possibly', 'today'), 0.003745921778708664),
 (('circles', 'possibly'), 0.0037257824143070045),
 (('temporary', 'store'), 0.002960486567043944),
 (('store', 'austin'), 0.002678535465420711),
 (('iphone', 'app'), 0.0025979780078140735),
 (('downtown', 'austin'), 0.0023563056349941596),
 (('marissa', 'mayer'), 0.002215330084182543)]

In [37]:
# positive bigrams
print_bigrams(concat_positive)

[(('ipad', '2'), 0.014646025395720994),
 (('iphone', 'app'), 0.004731257610019134),
 (('pop-up', 'store'), 0.003965907114280745),
 (('social', 'network'), 0.002922247347364759),
 (('temporary', 'store'), 0.002783092711775961),
 (('new', 'social'), 0.0026787267350843625),
 (('downtown', 'austin'), 0.002504783440598365),
 (('store', 'downtown'), 0.002435206122803966),
 (('ipad', 'app'), 0.0024004174639067665),
 (('network', 'called'), 0.0019829535571403724),
 (('called', 'circles'), 0.0019481648982431728),
 (('marissa', 'mayer'), 0.0019481648982431728),
 (('new', 'ipad'), 0.0019481648982431728),
 (('launch', 'major'), 0.0018785875804487736),
 (('major', 'new'), 0.0018785875804487736)]

In [38]:
# negative bigrams
print_bigrams(concat_negative)

[(('ipad', '2'), 0.008674531575294934),
 (('iphone', 'app'), 0.003990284524635669),
 (('ipad', 'design'), 0.003296321998612075),
 (('design', 'headaches'), 0.0029493407356002777),
 (('new', 'social'), 0.002775850104094379),
 (('social', 'network'), 0.0024288688410825814),
 (('news', 'apps'), 0.002255378209576683),
 (('fascist', 'company'), 0.002081887578070784),
 (('ipad', 'news'), 0.002081887578070784),
 (('major', 'new'), 0.002081887578070784),
 (('ca', "n't"), 0.0019083969465648854),
 (('called', 'circles'), 0.0019083969465648854),
 (('network', 'called'), 0.0019083969465648854),
 (('company', 'america'), 0.0017349063150589867),
 (('iphone', 'battery'), 0.0017349063150589867)]

In [39]:
# ambiguous bigrams
print_bigrams(concat_ambig)

[(('ipad', '2'), 0.01092896174863388),
 (('social', 'network'), 0.0047814207650273225),
 (('called', 'circles'), 0.004098360655737705),
 (('network', 'called'), 0.004098360655737705),
 (('pop-up', 'store'), 0.004098360655737705),
 (('new', 'social'), 0.0034153005464480873),
 (('store', 'austin'), 0.0034153005464480873),
 (('circles', 'possibly'), 0.00273224043715847),
 (('iphone', 'battery'), 0.0020491803278688526),
 (('iphone', 'game'), 0.0020491803278688526),
 (('launch', 'new'), 0.0020491803278688526),
 (("'re", 'going'), 0.001366120218579235),
 (("'s", 'gon'), 0.001366120218579235),
 (('ai', 'profile'), 0.001366120218579235),
 (('android', 'party'), 0.001366120218579235)]

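Similar to the frequency distributions, we see a number of the same bigrams showing up among the different class labels.  We will move on to show the pointwise mutual information (PMI) for each class, which highlights word pairs that co-occur more often than their individual frequencies would suggest.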
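PMI compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))).  The following is a standard-library sketch of that calculation on made-up tokens, not NLTK's implementation; as a simplification it uses the total token count as the normalizer for both unigrams and bigrams.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Return {bigram: PMI} for adjacent token pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:
            # log2( P(w1, w2) / (P(w1) * P(w2)) ) with P estimated as count / n
            scores[(w1, w2)] = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
    return scores

# Hypothetical token stream
tokens = "ipad 2 launch ipad 2 line ipad app design headaches".split()
scores = bigram_pmi(tokens)

for pair in [("ipad", "2"), ("design", "headaches")]:
    print(pair, round(scores[pair], 3))
```

The pair that occurs only once scores higher than the frequent ('ipad', '2') pair because both of its words are rare.  This sensitivity to rare words is why a minimum frequency filter is usually applied before ranking by PMI.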

In [40]:
def display_pmi(tweets_concat, freq_filter=10):
    """
    Function that takes concatenated tweets and a freq_filter number. Displays PMI scores. 
    """
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    tweet_pmi_finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_pmi_finder.apply_freq_filter(freq_filter)
    tweet_pmi_scored = tweet_pmi_finder.score_ngrams(bigram_measures.pmi)
    display(tweet_pmi_scored)
    return None

In [41]:
# neutral pmi
display_pmi(concat_neutral)

[(('speak.', 'mark'), 12.277694226941772),
 (('lonely', 'planet'), 12.140190703191836),
 (('speech', 'therapy'), 11.899182603688041),
 (('augmented', 'reality'), 11.792267399771529),
 (('mark', 'belinsky'), 11.792267399771529),
 (('therapy', 'communication'), 11.761679079938107),
 (('communication', 'showcased'), 11.692731726220615),
 (('dwnld', 'groundlink'), 11.692731726220615),
 (('barry', 'diller'), 11.599622321829134),
 (('league', 'extraordinary'), 11.59962232182913),
 (('south', 'southwest'), 11.27769422694177),
 (('mike', 'tyson'), 11.277694226941769),
 (('exhibit', 'hall'), 11.207304899050374),
 (('interrupt', 'regularly'), 11.207304899050373),
 (('regularly', 'scheduled'), 11.207304899050373),
 (('afford', 'attend'), 11.178158553390855),
 (('living', 'social-type'), 11.140190703191838),
 (('red', 'cross'), 11.076060365772122),
 (('consider', 'saving'), 11.046840226857128),
 (('schools', 'marketing'), 10.938556842022185),
 (('showcased', 'conference'), 10.818262608304472),
 ((

In [42]:
# positive pmi
display_pmi(concat_positive)

[(('speak.', 'mark'), 11.226060909579708),
 (('belinsky', '911tweets'), 11.10053002749585),
 (('mark', 'belinsky'), 11.10053002749585),
 (('holler', 'gram'), 10.811023410300866),
 (('physical', 'worlds'), 10.797217610775835),
 (('gon', 'na'), 10.723560569050527),
 (('convention', 'center'), 10.530915491108132),
 (('choice', 'awards'), 10.489095315413502),
 (('includes', 'uberguide'), 10.351591791663568),
 (('connect', 'digital'), 10.096777892634744),
 (('song', 'info'), 9.904132814692346),
 (('uberguide', 'sponsored'), 9.878137606159402),
 (('core', 'action'), 9.778824587912718),
 (('looking', 'forward'), 9.553635567608215),
 (('brain', 'search'), 9.37806400302476),
 (('911tweets', 'panel'), 9.327940523598922),
 (('marketing', 'experts'), 9.293175105438246),
 (('schools', 'marketing'), 9.22606090957971),
 (('think', 'speak.'), 9.196313566185657),
 (('giving', 'away'), 9.158946713721171),
 (('shop', 'core'), 9.118311054184383),
 (('best', 'andoid'), 8.856827099913989),
 (('marissa', 'ma

In [43]:
# negative pmi
display_pmi(concat_negative)

[(('fascist', 'company'), 8.463107276780697),
 (('company', 'america'), 8.422465292283349),
 (('network', 'called'), 8.344955925062434),
 (('design', 'headaches'), 7.685499698117143),
 (('launch', 'major'), 7.547996174367208),
 (('social', 'network'), 7.439122636236958),
 (('called', 'circles'), 7.164383679420613),
 (('news', 'apps'), 7.116478741265009),
 (('major', 'new'), 6.844197444136203),
 (('new', 'social'), 6.112393555085774),
 (('ca', "n't"), 6.0499111243260195),
 (('ipad', '2'), 4.582121958271836),
 (('iphone', 'battery'), 4.568042116568966),
 (('ipad', 'design'), 4.3788383598830904),
 (('ipad', 'news'), 4.353303267775955),
 (('iphone', 'app'), 3.769675977738615)]

In [44]:
display_pmi(concat_ambig)

[(('ipad', '2'), 4.919510082139633)]

Looking at PMI scores, we can see some trends starting to emerge within our dataset.  Looking at positive tweets, we see combinations such as ('choice', 'awards') and ('league', 'extraordinary'), which contrast with some of the combinations showing in the negative-labeled tweets, such as ('fascist', 'company') and ('design', 'headaches').

# 3. Data Preparation
Leverage information learned during the data understanding phase to preprocess the dataset and prepare the data for modeling. 

In [45]:
# set seed for reproducibility
SEED = 1

In [46]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [47]:
# pull in copy of dataset
clean_df = raw_df.copy()

# relabel columns
clean_df.columns = ['text', 'product_brand', 'sentiment']

# drop product_brand column, handle missing values and duplicates
clean_df = clean_df.drop('product_brand', axis=1)
clean_df = clean_df.dropna()
clean_df = clean_df.drop_duplicates()

# remove ambiguous tweets
clean_df = clean_df.loc[clean_df['sentiment'] != "I can't tell"]

In [48]:
# separate dataset into text and class_labels
text = clean_df['text']
class_labels = clean_df['sentiment']

In [49]:
# split tweets and labels into train and test sets for validation purposes
X_train, X_test, y_train, y_test = train_test_split(text, class_labels, stratify=class_labels, random_state=SEED)
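The effect of `stratify` can be checked on a small synthetic example.  The labels and proportions below are invented for illustration; only `train_test_split` and its `stratify` and `random_state` parameters come from the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% neutral, 30% positive, 10% negative
labels = ["neutral"] * 120 + ["positive"] * 60 + ["negative"] * 20
texts = [f"tweet {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, stratify=labels, random_state=1
)

def share(ys, label):
    """Fraction of entries in ys that carry the given label."""
    return Counter(ys)[label] / len(ys)

print("train negative share:", share(y_tr, "negative"))
print("test negative share: ", share(y_te, "negative"))
```

With stratification, the minority class keeps roughly the same share in both splits, so the test set remains representative of the rare negative class.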

In [50]:
# update stopwords list per our data understanding findings
updated_stopwords = stopwords_list + additional_stopwords

In [51]:
def preprocess_tweet(tweet, stopwords_list):
    """
    Function to preprocess a tweet. 
    Takes: tweet, stopwords list
    Returns: processed tweet with stopwords removed and converted to lowercase
    """
    processed = re.sub(r"'", '', tweet) # remove apostrophes
    processed = re.sub(r'\s+', ' ', processed) # collapse excess white space
    tokens = nltk.word_tokenize(processed)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return ' '.join(stopwords_removed)

In [52]:
# preprocess train and test sets
X_train_preprocessed = X_train.apply(lambda x: preprocess_tweet(x, updated_stopwords))
X_test_preprocessed = X_test.apply(lambda x: preprocess_tweet(x, updated_stopwords))

Now that we have split our data into train and test sets and preprocessed both, we are ready to vectorize our data.  TF-IDF vectorization is a natural choice for classification because it up-weights words that are distinctive to a subset of tweets and down-weights words that appear everywhere.  We will also fit a count vectorizer to see whether one representation performs better than the other (count vectorized vs. TF-IDF vectorized).
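To make the weighting concrete, here is a small standard-library sketch of TF-IDF on three made-up documents.  It uses the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization; it is meant as an illustration of the idea rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical preprocessed tweets (already tokenized)
docs = [
    ["ipad", "design", "headaches"],
    ["ipad", "love", "app"],
    ["ipad", "line", "austin"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
for term, weight in weights.items():
    print(f"{term:<10} {weight:.3f}")
```

'ipad' appears in every document and so receives a lower weight than 'design' or 'headaches', even though each occurs once in the first document.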

In [53]:
# create vectorizers with unigram and bigrams
count_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word', min_df=5)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', min_df=5)

# fit to preprocessed data
X_train_count = count_vectorizer.fit_transform(X_train_preprocessed)
X_test_count = count_vectorizer.transform(X_test_preprocessed) 
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_preprocessed) 
X_test_tfidf = tfidf_vectorizer.transform(X_test_preprocessed) 
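The effect of `ngram_range` and `min_df` on the vocabulary can be seen on a toy corpus.  The documents below are invented, and `min_df=2` is used instead of 5 only because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed tweets
docs = [
    "new social network",
    "social network launch",
    "temporary store austin",
    "temporary store downtown",
]

# Unigrams and bigrams, keeping only terms found in at least 2 documents
vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer="word", min_df=2)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

Terms that occur in a single document, such as 'austin' or 'new social', are dropped.  Note also that the default `token_pattern` only keeps tokens of two or more alphanumeric characters, so a one-character token such as '2' would not survive vectorization unless the pattern is changed.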

### Word Embeddings

In [56]:
# tokenize datasets
tokenized_X_train = X_train_preprocessed.map(nltk.word_tokenize).values
tokenized_X_test = X_test_preprocessed.map(nltk.word_tokenize).values

# get total training vocabulary size
total_train_vocab = set(word for tweet in tokenized_X_train for word in tweet)
train_vocab_size = len(total_train_vocab)
print(f'There are {train_vocab_size} unique tokens in the processed training set.')

There are 8951 unique tokens in the processed training set.


In [57]:
def glove_vectors(vocab):
    """
    Returns appropriate vectors from GloVe file.
    Input: vocabulary set to use.
    """
    glove = {}
    with open('glove.6B.50d.txt', 'rb') as f:
        for line in f:
            parts = line.split()
            word = parts[0].decode('utf-8')
            if word in vocab:
                vector = np.array(parts[1:], dtype=np.float32)
                glove[word] = vector
    return glove
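The function above relies on the plain-text GloVe layout, where each line holds a word followed by its vector components.  The sketch below parses two made-up lines from an in-memory buffer instead of the real file; the words and values are purely illustrative, and it reads text rather than bytes for simplicity.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word followed by its vector
fake_glove = io.StringIO(
    "launch 0.1 0.2 0.3\n"
    "banana 0.4 0.5 0.6\n"
)
vocab = {"launch", "design"}

vectors = {}
for line in fake_glove:
    parts = line.split()
    word = parts[0]
    if word in vocab:  # keep only words that occur in our vocabulary
        vectors[word] = np.array([float(v) for v in parts[1:]], dtype=np.float32)

print(vectors)
```

Words in the vocabulary that have no line in the file ('design' here) simply end up without a vector, which the vectorizer has to handle downstream.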

In [58]:
glove = glove_vectors(total_train_vocab)

In [59]:
class W2vVectorizer(object):
    
    def __init__(self, w2v):
        # Takes in a dictionary of words and vectors as input
        self.w2v = w2v
        if len(w2v) == 0:
            self.dimensions = 0
        else:
            self.dimensions = len(w2v[next(iter(w2v))])
    
    # Note: Even though it doesn't do anything, it's required that this object implement a fit method or else
    # it can't be used in a scikit-learn pipeline  
    def fit(self, X, y=None):
        return self
            
    def transform(self, X):
        return np.array([
            np.mean([self.w2v[w] for w in words if w in self.w2v]
                   or [np.zeros(self.dimensions)], axis=0) for words in X])
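The `transform` step represents each tweet as the mean of its word vectors, falling back to a zero vector when none of its tokens has an embedding.  A minimal sketch of that idea, using made-up 3-dimensional embeddings in place of the 50-dimensional GloVe vectors:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (stand-ins for real GloVe vectors)
embeddings = {
    "great": np.array([1.0, 0.0, 2.0]),
    "launch": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def mean_vector(tokens):
    """Average the embeddings of known tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(known, axis=0) if known else np.zeros(dim)

tweets = [["great", "launch", "unknownword"], ["zzz"]]
X = np.array([mean_vector(tokens) for tokens in tweets])
print(X)
```

Averaging discards word order and treats every known token equally, which is a simple but lossy way to turn a variable-length tweet into a fixed-length feature vector.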

In [60]:
# instantiate vectorizer objects with glove
w2v_vectorizer = W2vVectorizer(glove)

# transform training and testing data
X_train_w2v = w2v_vectorizer.transform(tokenized_X_train)
X_test_w2v = w2v_vectorizer.transform(tokenized_X_test)

Now that we have vectorized our datasets, we are ready to move on to the modeling stage.

# 4. Modeling

This is a classification task: predict the sentiment of a tweet from the text within the tweet. Three primary models will be relied on for classification:
1. Random Forests
2. Linear SVM
3. Neural Networks

Overfitting will be addressed through hyperparameter tuning, such as constraining tree depth and leaf size in the random forests, in addition to other parameter tuning. 

This is a multi-class classification problem, with three available class labels (Neutral, Positive, or Negative). The performance metric we will focus on throughout this process will be accuracy. 

### Random Forest

In [61]:
# instantiate random forest classifiers, with balanced class_weight
rf_count = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')
rf_tfidf = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')
rf_w2v = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')

# fit to training sets
rf_count.fit(X_train_count, y_train)
rf_tfidf.fit(X_train_tfidf, y_train)
rf_w2v.fit(X_train_w2v, y_train)

RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=1)

In [62]:
# Count Vectorized
count_train_score = rf_count.score(X_train_count, y_train)
count_test_score = rf_count.score(X_test_count, y_test)
print(f'Count Vectorized Train Score: {count_train_score}')
print(f'Count Vectorized Test Score: {count_test_score}')
print('--------')

# TF-IDF Vectorized
tfidf_train_score = rf_tfidf.score(X_train_tfidf, y_train)
tfidf_test_score = rf_tfidf.score(X_test_tfidf, y_test)
print(f'TF-IDF Vectorized Train Score: {tfidf_train_score}')
print(f'TF-IDF Vectorized Test Score: {tfidf_test_score}')
print('--------')

# W2V Vectorized
w2v_train_score = rf_w2v.score(X_train_w2v, y_train)
w2v_test_score = rf_w2v.score(X_test_w2v, y_test)
print(f'Word2Vec Vectorized Train Score: {w2v_train_score}')
print(f'Word2Vec Vectorized Test Score: {w2v_test_score}')

Count Vectorized Train Score: 0.9512341062079281
Count Vectorized Test Score: 0.6630776132794975
--------
TF-IDF Vectorized Train Score: 0.9512341062079281
TF-IDF Vectorized Test Score: 0.6585912965455362
--------
Word2Vec Vectorized Train Score: 0.9545250560957367
Word2Vec Vectorized Test Score: 0.6514131897711979


Reviewing baseline random forest scores for our three vectorized datasets (count, TF-IDF, and word vectors built from GloVe), we can see that results are fairly consistent across our vectorization methods. However, training accuracy of ~95% against test accuracy of ~65-66% shows that all three models are overfitting heavily to the training data.  

Random forests built from unconstrained trees are prone to overfitting, and we have not yet tuned any hyperparameters. Set up a randomized search, followed by a narrower grid search, to optimize them. 
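As a quick sanity check on the cost of the search, the sketch below counts how many combinations the randomized-search parameter space used in this section contains. The parameter values are copied from the search cells; `n_iter` and `cv` mirror the settings passed to `RandomizedSearchCV`. It is only arithmetic and does not fit any models.

```python
from itertools import product

# Same search space as the randomized-search cells in this section.
rf_params = {
    'max_depth': [10, 25, 50, 75, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'n_estimators': [25, 50, 75, 100],
}

# Every possible combination of the listed values: 6 * 3 * 3 * 4.
all_combinations = list(product(*rf_params.values()))
n_combinations = len(all_combinations)

# Randomized search only samples n_iter of them, each evaluated on cv folds.
n_iter, cv = 100, 3
n_fits = n_iter * cv

print(f'{n_combinations} possible combinations, {n_iter} sampled, {n_fits} fits')
```

So the randomized search covers a little under half of the full space, which is why a narrower grid search around the best values is a reasonable follow-up.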

#### Random Forest - Count Vectorized

In [68]:
# set search parameters
rf_params_count = {
    'max_depth': [10, 25, 50, 75, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'n_estimators': [25, 50, 75, 100]
}

# instantiate random forest classifier for randomized search
rf_classifier = RandomForestClassifier(random_state=SEED, class_weight='balanced', n_jobs=-1)

# instantiate random search
rf_count_rs = RandomizedSearchCV(estimator=rf_classifier,
                                 param_distributions=rf_params_count,
                                 return_train_score=True,
                                 scoring='accuracy',
                                 n_iter=100,
                                 verbose=1,
                                 cv=3,
                                 random_state=SEED)

In [69]:
# fit to count-vectorized data
rf_count_rs.fit(X_train_count, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3,
                   estimator=RandomForestClassifier(class_weight='balanced',
                                                    n_jobs=-1, random_state=1),
                   n_iter=100,
                   param_distributions={'max_depth': [10, 25, 50, 75, 100,
                                                      None],
                                        'min_samples_leaf': [1, 2, 5],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [25, 50, 75, 100]},
                   random_state=1, return_train_score=True, scoring='accuracy',
                   verbose=1)

In [70]:
# print count-vectorized randomized-search results
mean_train_score_count = np.mean(rf_count_rs.cv_results_['mean_train_score'])
mean_test_score_count = np.mean(rf_count_rs.cv_results_['mean_test_score'])
print(f'Random Search Train Accuracy (Count Vect.): {mean_train_score_count}')
print(f'Random Search Test Accuracy (Count Vect.): {mean_test_score_count}')

# display best params
rf_count_rs.best_params_

Random Search Train Accuracy (Count Vect.): 0.729998733324767
Random Search Test Accuracy (Count Vect.): 0.5947952362887449


{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 100}

In [71]:
# run gridsearch with params around these values to see if results can be improved further
# need to further address overfitting
grid_search_params = {
    'min_samples_split': [4, 5],
    'min_samples_leaf': [3, 4],
    'max_depth': [25, 50, 75],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False],
    'criterion': ['entropy', 'gini']
}

# instantiate classifier
rf_classifier = RandomForestClassifier(n_jobs=-1, random_state=SEED, class_weight='balanced', n_estimators=100)

# instantiate grid search
rf_gs_count = GridSearchCV(estimator=rf_classifier, 
                           param_grid=grid_search_params, 
                           cv=3, 
                           scoring='accuracy', 
                           return_train_score=True,
                           verbose=1)

In [72]:
# fit to count vectorized
rf_gs_count.fit(X_train_count, y_train)

Fitting 3 folds for each of 96 candidates, totalling 288 fits


GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=1),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['entropy', 'gini'],
                         'max_depth': [25, 50, 75],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [3, 4],
                         'min_samples_split': [4, 5]},
             return_train_score=True, scoring='accuracy', verbose=1)

In [73]:
# print count-vectorized results
mean_train_score_count = np.mean(rf_gs_count.cv_results_['mean_train_score'])
mean_test_score_count = np.mean(rf_gs_count.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (Count Vect.): {mean_train_score_count}')
print(f'Grid Search Test Accuracy (Count Vect.): {mean_test_score_count}')

# display best params
rf_gs_count.best_params_

Grid Search Train Accuracy (Count Vect.): 0.659510896388931
Grid Search Test Accuracy (Count Vect.): 0.5737062579643757


{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 75,
 'max_features': 'auto',
 'min_samples_leaf': 3,
 'min_samples_split': 4}

In [80]:
# run best count-vect model with these params
best_rf_count = RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=SEED,
                                       bootstrap=True,
                                       criterion='gini',
                                       max_depth=75,
                                       max_features='auto',
                                       min_samples_leaf=3,
                                       min_samples_split=4)

# fit to count vect data
best_rf_count.fit(X_train_count, y_train)

# print testing score
print(f'Best Tuned Random Forest (Count Vectorized) Test Accuracy: {best_rf_count.score(X_test_count, y_test)}')

Best Tuned Random Forest (Count Vectorized) Test Accuracy: 0.62673844773441


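Looking at our tuned params, mean cross-validated accuracy across the grid-search candidates is ~57%, against mean training accuracy of ~66%, and the tuned model scores ~63% on the held-out test set.  The gap between training and validation accuracy is far narrower than in the baseline model, so tuning appears to have reduced overfitting, though test accuracy is slightly below the untuned baseline (~66%).  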
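One caveat when reading these numbers: `np.mean(cv_results_['mean_test_score'])` averages over every candidate in the search, not just the selected one. The sketch below uses a small, hypothetical `cv_results`-style dictionary (the values are made up, not taken from this notebook) to show the difference between that average and the score of the best candidate. On a fitted search object, my understanding is that the same information is exposed as `best_index_` and `best_score_`.

```python
import numpy as np

# Hypothetical cv_results_-style arrays for four candidates (illustrative values only).
cv_results = {
    'mean_train_score': np.array([0.95, 0.80, 0.70, 0.62]),
    'mean_test_score': np.array([0.60, 0.63, 0.61, 0.58]),
}

# Averaging over every candidate mixes strong and weak settings together.
mean_over_candidates = cv_results['mean_test_score'].mean()

# The selected model is the candidate with the highest mean validation score.
best_index = int(np.argmax(cv_results['mean_test_score']))
best_cv_score = cv_results['mean_test_score'][best_index]
best_train_score = cv_results['mean_train_score'][best_index]
train_validation_gap = best_train_score - best_cv_score

print(f'Mean over all candidates: {mean_over_candidates:.3f}')
print(f'Best candidate CV accuracy: {best_cv_score:.3f}')
print(f'Train/validation gap for best candidate: {train_validation_gap:.3f}')
```

Comparing the train and validation scores at the best index gives a more direct read on overfitting for the model that is actually used.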

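Results are somewhat strong relative to uniform random guessing, which would be expected to generate an accuracy score of ~33% across three balanced classes. The classes here are not balanced, however, so a "simple" model that always predicts the majority class is the more demanding baseline. 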

Run similar process with TF-IDF Vectorized data.

#### Random Forest - TF-IDF Vectorized

In [84]:
# set search parameters
rf_params_tfidf = {
    'max_depth': [10, 25, 50, 75, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'n_estimators': [25, 50, 75, 100]
}

# instantiate random forest classifier for randomized search
rf_classifier = RandomForestClassifier(random_state=SEED, class_weight='balanced', n_jobs=-1)

# instantiate random search
rf_tfidf_rs = RandomizedSearchCV(estimator=rf_classifier,
                                 param_distributions=rf_params_tfidf,
                                 return_train_score=True,
                                 scoring='accuracy',
                                 n_iter=100,
                                 verbose=1,
                                 cv=3,
                                 random_state=SEED)

In [85]:
# fit to vectorized dataset
rf_tfidf_rs.fit(X_train_tfidf, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3,
                   estimator=RandomForestClassifier(class_weight='balanced',
                                                    n_jobs=-1, random_state=1),
                   n_iter=100,
                   param_distributions={'max_depth': [10, 25, 50, 75, 100,
                                                      None],
                                        'min_samples_leaf': [1, 2, 5],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [25, 50, 75, 100]},
                   random_state=1, return_train_score=True, scoring='accuracy',
                   verbose=1)

In [86]:
# print search results
mean_train_score_tfidf = np.mean(rf_tfidf_rs.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(rf_tfidf_rs.cv_results_['mean_test_score'])
print(f'Random Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Random Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
rf_tfidf_rs.best_params_

Random Search Train Accuracy (TF-IDF): 0.7468722628099855
Random Search Test Accuracy (TF-IDF): 0.5943212311516302


{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 100}

In [87]:
# similar to count vectorized, run refined grid search and try to further address overfitting
grid_search_params = {
    'min_samples_split': [4, 5],
    'min_samples_leaf': [3, 4],
    'max_depth': [25, 50, 75],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False],
    'criterion': ['entropy', 'gini']
}

# instantiate classifier
rf_classifier = RandomForestClassifier(n_jobs=-1, random_state=SEED, class_weight='balanced', n_estimators=100)

# instantiate grid search
rf_gs_tfidf = GridSearchCV(estimator=rf_classifier, 
                           param_grid=grid_search_params, 
                           cv=3, 
                           scoring='accuracy', 
                           return_train_score=True,
                           verbose=1)

In [88]:
# fit to training data
rf_gs_tfidf.fit(X_train_tfidf, y_train)

Fitting 3 folds for each of 96 candidates, totalling 288 fits


GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=1),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['entropy', 'gini'],
                         'max_depth': [25, 50, 75],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [3, 4],
                         'min_samples_split': [4, 5]},
             return_train_score=True, scoring='accuracy', verbose=1)

In [89]:
# print search results
mean_train_score_tfidf = np.mean(rf_gs_tfidf.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(rf_gs_tfidf.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
rf_gs_tfidf.best_params_

Grid Search Train Accuracy (TF-IDF): 0.6863403538168945
Grid Search Test Accuracy (TF-IDF): 0.5787108687891517


{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 75,
 'max_features': 'auto',
 'min_samples_leaf': 3,
 'min_samples_split': 4}

In [91]:
# run best tfidf model with identified params
best_rf_tfidf = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced', n_estimators=100,
                                       min_samples_split=4, 
                                       min_samples_leaf=3,
                                       max_depth=75,
                                       max_features='auto',
                                       bootstrap=True,
                                       criterion='gini')

# fit to tfidf vectorized
best_rf_tfidf.fit(X_train_tfidf, y_train)

# print testing score
print(f'Best Tuned Random Forest (TF-IDF Vectorized) Test Accuracy: {best_rf_tfidf.score(X_test_tfidf, y_test)}')

Best Tuned Random Forest (TF-IDF Vectorized) Test Accuracy: 0.6258411843876177


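Results for the TF-IDF vectorized data are similar to those for the count vectorized data, with mean cross-validated accuracy of ~58%, mean training accuracy close to 69%, and held-out test accuracy of ~63%.  The slightly wider gap between training and validation accuracy suggests the model may be overfitting slightly more on the TF-IDF vectorized data. 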
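Since count and TF-IDF vectorization perform so similarly here, it may help to see how TF-IDF reweights raw counts. The sketch below computes TF-IDF by hand on three hypothetical tokenized tweets. It uses a smoothed IDF followed by L2 normalization, which I believe matches scikit-learn's documented default, but treat the exact formula as an assumption rather than a description of the vectorizer used above.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized tweets (not from the dataset).
docs = [
    ["ipad", "launch", "great"],
    ["ipad", "line", "long"],
    ["google", "launch", "party"],
]

n_docs = len(docs)
# Document frequency: number of tweets each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term counts scaled by a smoothed IDF, then L2-normalized."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(docs[0])
for term, weight in first.items():
    print(f'{term}: {weight:.3f}')
```

In the first tweet, "great" appears in only one document and so receives a higher weight than "ipad" or "launch", which each appear in two. With short tweets where most terms occur once, the reweighting is modest, which may be one reason the two vectorizations score so closely.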

#### Random Forest - Word2Vec Vectorized

In [92]:
# set search parameters
rf_params_w2v = {
    'max_depth': [10, 25, 50, 75, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'n_estimators': [25, 50, 75, 100]
}

# instantiate random forest classifier for randomized search
rf_classifier = RandomForestClassifier(random_state=SEED, class_weight='balanced', n_jobs=-1)

# instantiate random search
rf_w2v_rs = RandomizedSearchCV(estimator=rf_classifier,
                               param_distributions=rf_params_w2v,
                               return_train_score=True,
                               scoring='accuracy',
                               n_iter=100,
                               verbose=1,
                               cv=3,
                               random_state=SEED)

In [93]:
# train to dataset
rf_w2v_rs.fit(X_train_w2v, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3,
                   estimator=RandomForestClassifier(class_weight='balanced',
                                                    n_jobs=-1, random_state=1),
                   n_iter=100,
                   param_distributions={'max_depth': [10, 25, 50, 75, 100,
                                                      None],
                                        'min_samples_leaf': [1, 2, 5],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [25, 50, 75, 100]},
                   random_state=1, return_train_score=True, scoring='accuracy',
                   verbose=1)

In [94]:
# print word2vec vectorized randomized-search results
mean_train_score_w2v = np.mean(rf_w2v_rs.cv_results_['mean_train_score']) 
mean_test_score_w2v = np.mean(rf_w2v_rs.cv_results_['mean_test_score'])
print(f'Random Search Train Accuracy (word2vec): {mean_train_score_w2v}')
print(f'Random Search Test Accuracy (word2vec): {mean_test_score_w2v}')

# display best params
rf_w2v_rs.best_params_

Random Search Train Accuracy (word2vec): 0.9582281511194074
Random Search Test Accuracy (word2vec): 0.6426536295000428


{'n_estimators': 75,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_depth': None}

In [109]:
# further refine with grid search and further address overfitting
rf_params_w2v = {
    'min_samples_split': [5, 6],
    'min_samples_leaf': [3, 4],
    'max_depth': [5, 6, 7],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False],
    'criterion': ['entropy', 'gini']
}

# instantiate classifier
rf_classifier = RandomForestClassifier(n_jobs=-1, random_state=SEED, class_weight='balanced', n_estimators=75)

# instantiate grid search
rf_gs_w2v = GridSearchCV(estimator=rf_classifier, 
                           param_grid=rf_params_w2v, 
                           cv=3, 
                           scoring='accuracy', 
                           return_train_score=True,
                           verbose=1)

In [110]:
rf_gs_w2v.fit(X_train_w2v, y_train)

Fitting 3 folds for each of 96 candidates, totalling 288 fits


GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_estimators=75, n_jobs=-1,
                                              random_state=1),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['entropy', 'gini'],
                         'max_depth': [5, 6, 7],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [3, 4],
                         'min_samples_split': [5, 6]},
             return_train_score=True, scoring='accuracy', verbose=1)

In [112]:
# print word2vec vectorized grid search results
mean_train_score_w2v = np.mean(rf_gs_w2v.cv_results_['mean_train_score']) 
mean_test_score_w2v = np.mean(rf_gs_w2v.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (word2vec): {mean_train_score_w2v}')
print(f'Grid Search Test Accuracy (word2vec): {mean_test_score_w2v}')

# display best params
rf_gs_w2v.best_params_

Grid Search Train Accuracy (word2vec): 0.6980460028180712
Grid Search Test Accuracy (word2vec): 0.5374097277540485


{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 7,
 'max_features': 'auto',
 'min_samples_leaf': 3,
 'min_samples_split': 5}

A shallower `max_depth` range is used for the word-vector data because overfitting appeared stronger there. With overfitting reduced and a number of hyperparameters set, we can see performance is still not great (mean cross-validated accuracy of ~54%).  Move on to a linear SVC to see if we can improve results.

### Linear SVM

In [118]:
# create linear SVC
svc_count = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=5000)
svc_tfidf = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=5000)
svc_w2v = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=5000)

# fit to training sets
svc_count.fit(X_train_count, y_train)
svc_tfidf.fit(X_train_tfidf, y_train)
svc_w2v.fit(X_train_w2v, y_train)

LinearSVC(class_weight='balanced', max_iter=5000, random_state=1)

In [119]:
# Count Vectorized
count_train_score = svc_count.score(X_train_count, y_train)
count_test_score = svc_count.score(X_test_count, y_test)
print(f'Count Vectorized Train Score: {count_train_score}')
print(f'Count Vectorized Test Score: {count_test_score}')

# TF-IDF Vectorized
tfidf_train_score = svc_tfidf.score(X_train_tfidf, y_train)
tfidf_test_score = svc_tfidf.score(X_test_tfidf, y_test)
print(f'TF-IDF Vectorized Train Score: {tfidf_train_score}')
print(f'TF-IDF Vectorized Test Score: {tfidf_test_score}')

# Word2Vec Vectorized
w2v_train_score = svc_w2v.score(X_train_w2v, y_train)
w2v_test_score = svc_w2v.score(X_test_w2v, y_test)
print(f'Word2Vec Vectorized Train Score: {w2v_train_score}')
print(f'Word2Vec Vectorized Test Score: {w2v_test_score}')

Count Vectorized Train Score: 0.8764397905759163
Count Vectorized Test Score: 0.6464782413638402
TF-IDF Vectorized Train Score: 0.8480179506357517
TF-IDF Vectorized Test Score: 0.6401973979362943
Word2Vec Vectorized Train Score: 0.5992520568436799
Word2Vec Vectorized Test Score: 0.5895020188425303


Looking at baseline LinearSVC results, we can see that we are not overfitting to the training data as badly as we were with the random forest models above, although the Count and TF-IDF models still show a gap of more than 20 points between training and test accuracy. Baseline test scores are in line with each other when comparing the Count and TF-IDF vectorized datasets. 

Try to address overfitting and improve results with a grid search.
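The only hyperparameter searched for the SVC is `C`, the inverse of regularization strength: smaller values penalize large weights more heavily and should reduce overfitting. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the vectorized tweets, by tracking the size of the learned weights as `C` grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for the vectorized tweets.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)

norms = {}
for C in [0.001, 0.01, 0.1, 1]:
    model = LinearSVC(C=C, class_weight='balanced', max_iter=10000, random_state=1)
    model.fit(X, y)
    # Overall size of the learned weight matrix (one row per class).
    norms[C] = np.linalg.norm(model.coef_)
    print(f'C={C}: weight norm={norms[C]:.3f}, train accuracy={model.score(X, y):.3f}')
```

Smaller `C` values shrink the weights toward zero, which typically lowers training accuracy and narrows the gap to validation accuracy.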

In [120]:
# set params for grid search
svc_params = {
    'C': [0.01, 0.1, 1, 10]
}

In [121]:
# grid search 
svc_classifier = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=10000)
svc_grid_search = GridSearchCV(svc_classifier,
                               svc_params,
                               return_train_score=True,
                               scoring='accuracy')

# fit to count data
svc_grid_search.fit(X_train_count, y_train)



GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.01, 0.1, 1, 10]}, return_train_score=True,
             scoring='accuracy')

In [122]:
# print count-vectorized grid-search results
mean_train_score_count = np.mean(svc_grid_search.cv_results_['mean_train_score'])
mean_test_score_count = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (Count Vect.): {mean_train_score_count}')
print(f'Grid Search Test Accuracy (Count Vect.): {mean_test_score_count}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (Count Vect.): 0.8609573672400898
Grid Search Test Accuracy (Count Vect.): 0.6398279730740464


{'C': 0.01}

In [123]:
# failing to converge, try with tfidf vectorized data
svc_grid_search.fit(X_train_tfidf, y_train)

GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.01, 0.1, 1, 10]}, return_train_score=True,
             scoring='accuracy')

In [124]:
# print tfidf-vectorized grid search results
mean_train_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (TF-IDF): 0.8109667165295438
Grid Search Test Accuracy (TF-IDF): 0.6428571428571428


{'C': 0.1}

With LinearSVC, our TF-IDF vectorized data appears to be performing slightly better, with slightly less overfitting to the training data.  Additionally, test score results are marginally better than the "dummy" model that guesses neutral every time. Moving forward, we will stick with TF-IDF vectorization and try to remove some more of the overfitting that is occurring. 

In [125]:
svc_params = {
    'C': [0.001, 0.01]
}

# grid search 
svc_classifier = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=10000)
svc_grid_search = GridSearchCV(svc_classifier,
                               svc_params,
                               return_train_score=True,
                               scoring='accuracy')

# fit to tfidf data
svc_grid_search.fit(X_train_tfidf, y_train)

GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.001, 0.01]}, return_train_score=True,
             scoring='accuracy')

In [126]:
# print tfidf-vectorized grid search results
mean_train_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (TF-IDF): 0.6357329842931937
Grid Search Test Accuracy (TF-IDF): 0.6218399401645475


{'C': 0.01}

Running the updated grid search on the TF-IDF vectorized data, we see some improved results. Our mean training accuracy of 63.6% is in line with our mean cross-validated accuracy of 62%, so we are likely no longer overfitting to the training data.  While results are not particularly strong, we are still outperforming the "dummy" model, which would score closer to 60%. 
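The "dummy" comparison can be made explicit by computing the accuracy of always predicting the most common class. The label counts below are hypothetical, chosen only to illustrate an imbalanced three-class problem; they are not the actual class distribution of this dataset.

```python
from collections import Counter

# Hypothetical label counts for an imbalanced three-class problem.
labels = ['neutral'] * 60 + ['positive'] * 33 + ['negative'] * 7

# A majority-class model predicts the most frequent label for every tweet,
# so its accuracy equals that label's share of the data.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(f'Always predicting "{majority_label}" scores {baseline_accuracy:.2f}')
```

scikit-learn also provides `DummyClassifier(strategy='most_frequent')`, which would give the same baseline on the real `y_train` and `y_test` and fit into the scoring code used above.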

In [137]:
# fit to w2v vectorized data
svc_grid_search.fit(X_train_w2v, y_train)

GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.001, 0.01]}, return_train_score=True,
             scoring='accuracy')

In [138]:
# print w2v-vectorized grid search results
mean_train_score_w2v = np.mean(svc_grid_search.cv_results_['mean_train_score']) 
mean_test_score_w2v = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (Word2Vec): {mean_train_score_w2v}')
print(f'Grid Search Test Accuracy (Word2Vec): {mean_test_score_w2v}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (Word2Vec): 0.6006731488406881
Grid Search Test Accuracy (Word2Vec): 0.5915482423335827


{'C': 0.001}

In [139]:
# run a best version of the SVC model
best_svc = LinearSVC(class_weight='balanced',
                     max_iter=10000,
                     random_state=SEED,
                     C=0.01)

# fit to data
best_svc.fit(X_train_tfidf, y_train)

LinearSVC(C=0.01, class_weight='balanced', max_iter=10000, random_state=1)

In [140]:
# generate scores
print(f'Best Identified SVC Train Accuracy: {best_svc.score(X_train_tfidf, y_train)}')
print(f'Best Identified SVC Test Accuracy: {best_svc.score(X_test_tfidf, y_test)}')

Best Identified SVC Train Accuracy: 0.6785340314136126
Best Identified SVC Test Accuracy: 0.6541049798115747


### Deep Learning with Word Embeddings

In [141]:
# import necessary libraries
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence

Using TensorFlow backend.


In [142]:
# convert labels to one-hot encoded format
y_train_encoded = pd.get_dummies(y_train).values
y_test_encoded = pd.get_dummies(y_test).values

In [143]:
# create tokenizer and limit overall vocab size to the 20,000 most frequent words
tokenizer = text.Tokenizer(num_words=20000)

# fit on text
tokenizer.fit_on_texts(list(X_train_preprocessed))
list_tokenized = tokenizer.texts_to_sequences(X_train_preprocessed)
X_t = sequence.pad_sequences(list_tokenized, maxlen=100)
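To clarify what `pad_sequences(..., maxlen=100)` does to each tweet, here is a small pure-Python sketch of the behavior. The `pad_pre` helper is hypothetical, and it assumes the Keras defaults of padding and truncating at the front of the sequence with zeros, which is my understanding of the defaults rather than something shown in this notebook.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# A short sequence is padded on the left; a long one loses its earliest tokens.
short = pad_pre([12, 7, 3], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 12, 7, 3]
print(long)   # [3, 4, 5, 6, 7]
```

With `maxlen=100`, nearly every tweet will be mostly padding, so a smaller `maxlen` closer to the longest tokenized tweet may be worth trying.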

In [144]:
# set embedding size
embedding_size = 50

# construct neural network
model = Sequential()
model.add(Embedding(20000, embedding_size))
model.add(LSTM(25, return_sequences=True))
model.add(Dropout(0.5))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5)) # to help reduce overfitting
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5)) # to help reduce overfitting
model.add(Dense(3, activation='softmax')) # 3 potential class labels

In [145]:
# compile model with params
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [146]:
# model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 50)          1000000   
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 25)          7600      
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 25)          0         
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 25)                0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                1300      
_________________________________________________________________
dropout_3 (Dropout)          (None, 50)               
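The parameter counts in the summary can be checked by hand from the layer sizes in the model definition. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the output-layer count is derived from the `Dense(3)` layer in the code rather than read from the summary.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = {
    'embedding': embedding_params(20000, 50),
    'lstm': lstm_params(50, 25),
    'dense_hidden': dense_params(25, 50),
    'dense_output': dense_params(50, 3),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f'{name}: {n}')
print(f'total: {total}')
```

Almost all of the parameters sit in the embedding layer, so with roughly 6,000 training tweets the embedding is the part of the network most likely to overfit.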

In [147]:
# fit model
model.fit(X_t, y_train_encoded, epochs=5, batch_size=50, validation_split=0.1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 6016 samples, validate on 669 samples
Epoch 1/5

KeyboardInterrupt: 