# Tweet Sentiment Classification
### Module 4 Project - Kai Graham

# Overview of Process (CRISP-DM)
I will be following the Cross-Industry Standard Process for Data Mining to build a classifier that will determine the sentiment of tweets.  The CRISP-DM process includes the following key steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

# 1. Business Understanding
* Who are the stakeholders?
* What Business Problems will this solve?
* What problems are inside the scope of this project?
* What problems are outside the scope?
* What data sources are available to us?
* Timeline of project / deadlines?
* Do stakeholders from different parts of the company all agree?

My overall goal is to build a classifier that assigns each tweet to one of three sentiment classes: positive, negative, or neutral.  Because the data is limited to tweets about Apple and Google products, the most likely stakeholders for this project are Apple and Google themselves.  Product managers and other managers within these companies could use a tool like this to track public sentiment surrounding product launches and software updates.  Although outside the specific scope of this project, the classifier's output could be combined with time series metrics to track how sentiment changes over time.

# 2. Data Understanding
* What data is available to us, where does it live?
* What is our target?
* What predictors are available?
* What is distribution of data?
* EDA to show most common words and other corpus statistics.  Will require some initial processing.

The main dataset used throughout this data science process will be coming from CrowdFlower via the following url: `https://data.world/crowdflower/brands-and-product-emotions`. 

The following summary of the dataset is provided on CrowdFlower:

*Contributors evaluated tweets about multiple brands and products. The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion.*

In [54]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.collocations import *
import string
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE

In [2]:
# failure to specify 'latin1' encoding results in errors
# error_df = pd.read_csv('judge-1377884607_tweet_product_company.csv')
# error_df.head()

In [3]:
# load dataset
raw_df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin_1')
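
The encoding error mentioned above can be illustrated without the CSV itself. The sketch below is a minimal example on a made-up byte string (an assumption, not a row from the dataset): a byte that is valid in Latin-1 but not in UTF-8 raises `UnicodeDecodeError` under the default decoding, while Latin-1 decoding succeeds.

```python
# Hypothetical bytes standing in for one line of the file: "café" encoded as Latin-1.
raw_bytes = b"caf\xe9 at #sxsw"

# Decoding as UTF-8 (the pandas default) fails on the 0xE9 byte.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so this never raises.
latin1_text = raw_bytes.decode("latin_1")

print(utf8_ok)
print(latin1_text)
```

Because Latin-1 accepts any byte, a successful load does not guarantee that every character was decoded as its author intended, so some residue in the tweet text is still possible.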

In [4]:
# print first rows of dataset
raw_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [5]:
# show info of df
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


Looking at the above outputs, we can see that there are a total of 9,093 entries in our dataset, with a total of three columns.  Raw text from tweets is held in the `tweet_text` column; sentiment is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column; and the brand or product the emotion is directed at is held in the `emotion_in_tweet_is_directed_at` column.  

At first glance, we can likely drop the `emotion_in_tweet_is_directed_at` column, as we are more interested in whether the sentiment in a given tweet is positive, neutral, or negative based on the text.  The main predictors we will use are processed features derived from the `tweet_text` column.

Our target variable, which can also be thought of as our class labels, is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column.  

In [6]:
# display value counts
display(raw_df['emotion_in_tweet_is_directed_at'].value_counts())
display(raw_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts())

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

Unsurprisingly given the origin of our dataset, the products identified are either Apple or Google products.  Looking at sentiment, the majority of entries fall under a neutral sentiment ('No emotion toward brand or product'), with the next largest group being tagged as 'Positive emotion'.  There is clear class imbalance, with only 570 entries belonging to the 'Negative emotion' class. 
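
To put numbers on the imbalance, the short sketch below converts the value counts printed above into class shares. The counts are copied from that output; the variable names are illustrative only.

```python
# Sentiment counts copied from the value_counts() output above
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each class as a percentage of all labeled tweets
for label, share in shares.items():
    print(f"{label:<36} {share:.1%}")

# Ratio of the largest class to the 'Negative emotion' class
imbalance_ratio = counts["No emotion toward brand or product"] / counts["Negative emotion"]
print(f"neutral : negative = {imbalance_ratio:.1f} : 1")
```

Neutral tweets outnumber negative tweets by more than nine to one, which is worth keeping in mind when reading accuracy scores later on.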

In [7]:
# rename columns so easier to work with
df = raw_df.copy()
df.columns = ['text', 'product_brand', 'sentiment']
df.head()

Unnamed: 0,text,product_brand,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [8]:
# explore potential missing values
df.isna().sum()

text                1
product_brand    5802
sentiment           0
dtype: int64

We see that there is only one missing value in the text column, none in the sentiment column, and a large number (5,802) in the product_brand column.  Given that we plan to work almost entirely with the text and sentiment columns, this is unlikely to pose a large issue. 

In [9]:
# display missing value in the text column
df.loc[df['text'].isna()]

Unnamed: 0,text,product_brand,sentiment
6,,,No emotion toward brand or product


In [10]:
# display missing values in the product_brand column
df.loc[df['product_brand'].isna()]

Unnamed: 0,text,product_brand,sentiment
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product
...,...,...,...
9087,"@mention Yup, but I don't have a third app yet...",,No emotion toward brand or product
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [11]:
# display sentiment breakdowns of missing product_brand entries
df.loc[df['product_brand'].isna()]['sentiment'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: sentiment, dtype: int64

We see that the majority of missing product_brand values are also labeled as no emotion toward brand or product, which makes sense, as many of the neutral-labeled tweets may not be directed at a specific brand or product and would therefore be missing a product_brand tag.  Additionally, this column will not be used in our process of tweet classification. 

Next, drop the unnecessary column and remove the row with missing text before additional EDA.

In [12]:
# drop product_brand column
clean_df = df.drop(['product_brand'], axis=1)

# handle missing values
clean_df = clean_df.dropna(subset=['text'])
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       9092 non-null   object
 1   sentiment  9092 non-null   object
dtypes: object(2)
memory usage: 213.1+ KB


In [13]:
# further examine tweets labeled as "I can't tell"
# select the text column by name rather than by position
ambiguous_text = clean_df.loc[clean_df['sentiment'] == "I can't tell", 'text']
for i in range(10):
    display(ambiguous_text.iloc[i])

'Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh'

'\x89ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow'

'Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw'

'The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw'

"Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link}"

'SYD #SXSW crew your iPhone extra juice pods have been procured.'

'Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}'

'Gave into extreme temptation at #SXSW and bought an iPad 2... #impulse'

'Catch 22\x89Û_ I mean iPad 2 at #SXSW : {link}'

'Forgot my iPhone for #sxsw. Android only. Knife to a gun fight'

Looking at a sample of the tweets labeled as "I can't tell", there is no clear class label that each should belong to.  Given this, and the small number of tweets with this label, they will be removed from the dataset during data preparation. 

In [14]:
# separate dataset into tweets and class_labels for additional EDA
tweets = clean_df['text']
class_labels = clean_df['sentiment']

In [15]:
# tokenize tweets and print the total vocabulary size of our dataset
tokenized = list(map(nltk.word_tokenize, tweets.dropna())) 
raw_tweet_vocab = set()
for tweet in tokenized:
    raw_tweet_vocab.update(tweet)
print(len(raw_tweet_vocab))

13212


Looking at the text within the tweets, there is a total vocabulary size of just over 13,200.

In [16]:
# compute the average number of tokens per tweet
tweet_lengths = [len(tweet) for tweet in tokenized]
np.mean(tweet_lengths)

24.414980202375716

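The average tweet size within the dataset is just over 24 tokens.  Because `nltk.word_tokenize` treats punctuation such as `#` and `@` as separate tokens, the average number of actual words per tweet is lower. 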

In [17]:
# display frequency distribution of raw dataset
tweets_concat = []
for tweet in tokenized:
    tweets_concat += tweet
    
# display the 15 most common words
unprocessed_freq_dist = nltk.FreqDist(tweets_concat)
unprocessed_freq_dist.most_common(15)

[('#', 15875),
 ('@', 7194),
 ('mention', 7123),
 ('.', 5601),
 ('SXSW', 4787),
 ('sxsw', 4523),
 ('link', 4311),
 ('}', 4298),
 ('{', 4296),
 ('the', 3928),
 (',', 3533),
 ('to', 3521),
 ('RT', 2947),
 ('at', 2859),
 (';', 2800)]

At first glance, we can see that many of the most frequent tokens are stopwords or punctuation.  For additional EDA, we will remove stopwords to see whether more information can be extracted from the data.

In [18]:
def initial_tweet_process(tweet, stopwords_list):
    """
    Function to initially process a tweet to assist in EDA / data understanding. 
    Input: tweet of type string, stopwords_list of words to remove
    Returns: tokenized tweet, converted to lowercase, with all stopwords removed
    """
    # tokenize
    tokens = nltk.word_tokenize(tweet)
    
    # remove stopwords and lowercase
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    
    # return processed tweet
    return stopwords_removed

In [19]:
# set up initial stopwords list
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

In [20]:
# separate dataset based on class label
neutral_tweets = clean_df.loc[clean_df['sentiment'] == 'No emotion toward brand or product']
positive_tweets = clean_df.loc[clean_df['sentiment'] == 'Positive emotion']
negative_tweets = clean_df.loc[clean_df['sentiment'] == 'Negative emotion']
ambig_tweets = clean_df.loc[clean_df['sentiment'] == "I can't tell"]

In [21]:
# process the four datasets
processed_neutral = neutral_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_positive = positive_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_negative = negative_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))
processed_ambig = ambig_tweets['text'].apply(lambda x: initial_tweet_process(x, stopwords_list))

In [22]:
def concat_tweets(tweets):
    """
    Function to concatenate a list of tokenized tweets into one flat list of tokens.
    Input: tweets (list of tokenized tweets, each a list of tokens)
    Returns: single list containing every token from every tweet
    """
    tweets_concat = []
    for tweet in tweets:
        tweets_concat += tweet
    return tweets_concat

In [23]:
# concatenate the four datasets
concat_neutral = concat_tweets(list(processed_neutral))
concat_positive = concat_tweets(list(processed_positive))
concat_negative = concat_tweets(list(processed_negative))
concat_ambig = concat_tweets(list(processed_ambig))

# produce frequency distributions for four datasets
freqdist_neutral = nltk.FreqDist(concat_neutral)
freqdist_positive = nltk.FreqDist(concat_positive)
freqdist_negative = nltk.FreqDist(concat_negative)
freqdist_ambig = nltk.FreqDist(concat_ambig)

In [24]:
# display top neutral words
print('Top Neutral Words')
freqdist_neutral.most_common(15)

Top Neutral Words


[('sxsw', 5671),
 ('mention', 4513),
 ('link', 2946),
 ('rt', 1853),
 ('google', 1683),
 ('apple', 1223),
 ('ipad', 1212),
 ('quot', 1018),
 ('store', 867),
 ('iphone', 815),
 ('new', 678),
 ("'s", 648),
 ('austin', 630),
 ('amp', 601),
 ('2', 550)]

In [25]:
# display top positive words
print('Top Positive Words')
freqdist_positive.most_common(15)

Top Positive Words


[('sxsw', 3105),
 ('mention', 2194),
 ('link', 1217),
 ('ipad', 1003),
 ('rt', 937),
 ('apple', 925),
 ('google', 716),
 ('store', 545),
 ('iphone', 523),
 ("'s", 493),
 ('2', 490),
 ('quot', 464),
 ('app', 396),
 ('new', 360),
 ('austin', 294)]

In [26]:
# display negative words
print('Top Negative Words')
freqdist_negative.most_common(15)

Top Negative Words


[('sxsw', 581),
 ('mention', 313),
 ('ipad', 188),
 ('quot', 175),
 ('iphone', 162),
 ('google', 145),
 ('rt', 138),
 ('apple', 120),
 ('link', 102),
 ("n't", 87),
 ("'s", 77),
 ('2', 64),
 ('app', 60),
 ('store', 46),
 ('new', 43)]

In [27]:
print('Top Ambiguous Words')
freqdist_ambig.most_common(15)

Top Ambiguous Words


[('sxsw', 159),
 ('mention', 104),
 ('google', 51),
 ('link', 48),
 ('ipad', 43),
 ('quot', 39),
 ('apple', 37),
 ('rt', 34),
 ('iphone', 32),
 ('store', 22),
 ("'s", 19),
 ('2', 18),
 ('circles', 17),
 ('austin', 16),
 ("n't", 14)]

Comparing the top words found in positive, neutral, negative, and ambiguous tweets shows that the same words are common across all classes.  Some of these words include `sxsw`, `mention`, `google`, `link`, `rt`, `quot`.  Adding some of these words to our stopwords list may help.
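
The overlap can be checked directly with set operations. The sketch below copies the top-15 words for the neutral, positive, and negative classes from the outputs above and reports which words are shared by all three and which appear only in the negative list; the variable names are illustrative.

```python
# Top-15 words per class, copied from the frequency distributions above
top_neutral = {"sxsw", "mention", "link", "rt", "google", "apple", "ipad", "quot",
               "store", "iphone", "new", "'s", "austin", "amp", "2"}
top_positive = {"sxsw", "mention", "link", "ipad", "rt", "apple", "google", "store",
                "iphone", "'s", "2", "quot", "app", "new", "austin"}
top_negative = {"sxsw", "mention", "ipad", "quot", "iphone", "google", "rt", "apple",
                "link", "n't", "'s", "2", "app", "store", "new"}

# Words that appear in the top 15 of every class carry little class signal
shared = top_neutral & top_positive & top_negative

# Words that appear only in the negative top 15
negative_only = top_negative - (top_neutral | top_positive)

print(len(shared), sorted(shared))
print(sorted(negative_only))
```

Thirteen of the fifteen words are shared by all three classes, and the only word unique to the negative list is the contraction token `n't`, which supports treating the most frequent shared tokens as stopword candidates.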

In [28]:
# create additional stopwords
additional_stopwords = ['sxsw', 'mention', 'link', 'rt', 'quot']

In [29]:
def print_normalized_word_freq(freq_dist, n=15):
    """
    Print a normalized frequency distribution from a given distribution. Returns top n results. 
    """
    total_word_count = sum(freq_dist.values())
    top = freq_dist.most_common(n)
    
    print('Word\t\t\tNormalized Frequency')
    for word in top:
        normalized_freq = word[1] / total_word_count
        print('{} \t\t\t {:.4}'.format(word[0], normalized_freq))
    
    return None

In [30]:
# neutral normalized frequency
print_normalized_word_freq(freqdist_neutral)

Word			Normalized Frequency
sxsw 			 0.08271
mention 			 0.06582
link 			 0.04297
rt 			 0.02703
google 			 0.02455
apple 			 0.01784
ipad 			 0.01768
quot 			 0.01485
store 			 0.01265
iphone 			 0.01189
new 			 0.009889
's 			 0.009451
austin 			 0.009189
amp 			 0.008766
2 			 0.008022


In [31]:
# positive normalized frequency
print_normalized_word_freq(freqdist_positive)

Word			Normalized Frequency
sxsw 			 0.08106
mention 			 0.05728
link 			 0.03177
ipad 			 0.02619
rt 			 0.02446
apple 			 0.02415
google 			 0.01869
store 			 0.01423
iphone 			 0.01365
's 			 0.01287
2 			 0.01279
quot 			 0.01211
app 			 0.01034
new 			 0.009399
austin 			 0.007676


In [32]:
# negative normalized frequency
print_normalized_word_freq(freqdist_negative)

Word			Normalized Frequency
sxsw 			 0.07918
mention 			 0.04265
ipad 			 0.02562
quot 			 0.02385
iphone 			 0.02208
google 			 0.01976
rt 			 0.01881
apple 			 0.01635
link 			 0.0139
n't 			 0.01186
's 			 0.01049
2 			 0.008722
app 			 0.008177
store 			 0.006269
new 			 0.00586


In [33]:
# ambig normalized frequency
print_normalized_word_freq(freqdist_ambig)

Word			Normalized Frequency
sxsw 			 0.08213
mention 			 0.05372
google 			 0.02634
link 			 0.02479
ipad 			 0.02221
quot 			 0.02014
apple 			 0.01911
rt 			 0.01756
iphone 			 0.01653
store 			 0.01136
's 			 0.009814
2 			 0.009298
circles 			 0.008781
austin 			 0.008264
n't 			 0.007231


In [34]:
def print_bigrams(tweets_concat, n=15):
    """
    Function takes concatenated tweets and prints most common bigrams
    """
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_scored = finder.score_ngrams(bigram_measures.raw_freq)
    display(tweet_scored[:n])
    return None

In [35]:
# neutral bigrams
print_bigrams(concat_neutral)

[(('rt', 'mention'), 0.026210236140079637),
 (('link', 'sxsw'), 0.008853429792447602),
 (('sxsw', 'link'), 0.008503376555184435),
 (('ipad', '2'), 0.006490570440921223),
 (('mention', 'mention'), 0.00631554382228964),
 (('sxsw', 'rt'), 0.005761292863289625),
 (('link', 'rt'), 0.005119528594973819),
 (('mention', 'google'), 0.005119528594973819),
 (('social', 'network'), 0.005075771940315923),
 (('apple', 'store'), 0.004871574218579076),
 (('mention', 'sxsw'), 0.004856988667026443),
 (('new', 'social'), 0.004652790945289596),
 (('sxsw', 'mention'), 0.004448593223552749),
 (('network', 'called'), 0.0036755589912632544),
 (('called', 'circles'), 0.003515117924184303)]

In [36]:
# positive bigrams
print_bigrams(concat_positive)

[(('rt', 'mention'), 0.023627392110278568),
 (('ipad', '2'), 0.01099130616400804),
 (('sxsw', 'link'), 0.008067253217763621),
 (('mention', 'sxsw'), 0.005691460198940031),
 (('apple', 'store'), 0.005587029736574159),
 (('sxsw', 'rt'), 0.005117092655927734),
 (('link', 'sxsw'), 0.004856016500013054),
 (('link', 'rt'), 0.0038378194919458006),
 (('mention', 'mention'), 0.0038117118763543326),
 (('sxsw', 'mention'), 0.0035767433360311203),
 (('iphone', 'app'), 0.0034984204892567162),
 (('sxsw', 'apple'), 0.003211236717750568),
 (('store', 'sxsw'), 0.0030545910242017597),
 (('mention', 'google'), 0.0027935148682870794),
 (('austin', 'sxsw'), 0.002636869174738271)]

In [37]:
# negative bigrams
print_bigrams(concat_negative)

[(('rt', 'mention'), 0.018669937312619244),
 (('sxsw', 'rt'), 0.007358953393295176),
 (('ipad', '2'), 0.00681384573453257),
 (('mention', 'sxsw'), 0.0046334150994821474),
 (('link', 'sxsw'), 0.003952030526028891),
 (('sxsw', 'mention'), 0.003679476696647588),
 (('apple', 'store'), 0.003270645952575634),
 (('sxsw', 'link'), 0.003270645952575634),
 (('iphone', 'app'), 0.002998092123194331),
 (('mention', 'google'), 0.002998092123194331),
 (('mention', 'mention'), 0.0028618152085036794),
 (('sxsw', 'iphone'), 0.002725538293813028),
 (('ipad', 'design'), 0.0025892613791223765),
 (('sxsw', 'ipad'), 0.0025892613791223765),
 (('sxsw', 'quot'), 0.0024529844644317253)]

In [38]:
# ambiguous bigrams
print_bigrams(concat_ambig)

[(('rt', 'mention'), 0.017045454545454544),
 (('sxsw', 'link'), 0.00878099173553719),
 (('ipad', '2'), 0.008264462809917356),
 (('apple', 'store'), 0.006714876033057851),
 (('mention', 'mention'), 0.006714876033057851),
 (('link', 'sxsw'), 0.005681818181818182),
 (('sxsw', 'mention'), 0.005681818181818182),
 (('sxsw', 'rt'), 0.005165289256198347),
 (('mention', 'sxsw'), 0.004132231404958678),
 (('google', 'circles'), 0.003615702479338843),
 (('mention', 'google'), 0.003615702479338843),
 (('social', 'network'), 0.003615702479338843),
 (('austin', 'sxsw'), 0.0030991735537190084),
 (('called', 'circles'), 0.0030991735537190084),
 (('network', 'called'), 0.0030991735537190084)]

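Similar to the frequency distributions, we see many of the same bigrams showing up across the different class labels.  We will move on to the pointwise mutual information (PMI) score for each class, which measures how much more often two words occur together than would be expected if they were independent.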
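
As a reminder of what the score means, PMI for a pair of words x and y is log2( P(x, y) / (P(x) P(y)) ). The sketch below is a simplified, standard-library version of that formula using made-up counts; it is meant to build intuition and is not NLTK's own implementation.

```python
import math

def pmi(pair_count, first_count, second_count, total_tokens):
    """Pointwise mutual information (base 2) from raw counts."""
    p_pair = pair_count / total_tokens
    p_first = first_count / total_tokens
    p_second = second_count / total_tokens
    return math.log2(p_pair / (p_first * p_second))

# Hypothetical rare pair: both words occur 10 times, always together
rare_exclusive = pmi(pair_count=10, first_count=10, second_count=10, total_tokens=10240)

# Hypothetical frequent pair: co-occurs exactly as often as chance predicts
frequent_independent = pmi(pair_count=100, first_count=1000, second_count=1000, total_tokens=10000)

print(round(rare_exclusive, 3))
print(round(frequent_independent, 3))
```

Rare words that only ever appear together receive very high scores, which is why a minimum frequency filter is applied before scoring: without it, pairs seen once or twice would dominate the ranking.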

In [39]:
def display_pmi(tweets_concat, freq_filter=10):
    """
    Function that takes concatenated tweets and a freq_filter number. Displays PMI scores. 
    """
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    tweet_pmi_finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_pmi_finder.apply_freq_filter(freq_filter)
    tweet_pmi_scored = tweet_pmi_finder.score_ngrams(bigram_measures.pmi)
    display(tweet_pmi_scored)
    return None

In [40]:
# neutral pmi
display_pmi(concat_neutral)

[(('lonely', 'planet'), 12.605668913018919),
 (('speech', 'therapy'), 12.364660813515124),
 (('augmented', 'reality'), 12.257745609598611),
 (('mark', 'belinsky'), 12.257745609598611),
 (('therapy', 'communication'), 12.22715728976519),
 (('communication', 'showcased'), 12.158209936047697),
 (('dwnld', 'groundlink'), 12.158209936047697),
 (('barry', 'diller'), 12.065100531656217),
 (('league', 'extraordinary'), 12.065100531656213),
 (('south', 'southwest'), 11.743172436768853),
 (('mike', 'tyson'), 11.743172436768852),
 (('exhibit', 'hall'), 11.672783108877457),
 (('interrupt', 'regularly'), 11.672783108877455),
 (('regularly', 'scheduled'), 11.672783108877455),
 (('afford', 'attend'), 11.643636763217938),
 (('living', 'social-type'), 11.60566891301892),
 (('red', 'cross'), 11.541538575599205),
 (('consider', 'saving'), 11.512318436684211),
 (('schools', 'marketing'), 11.404035051849268),
 (('marketing', 'experts'), 11.27850416976541),
 (('150', 'million'), 11.241978293740296),
 (('con

In [41]:
# positive pmi
display_pmi(concat_positive)

[(('belinsky', '911tweets'), 11.514676389302611),
 (('mark', 'belinsky'), 11.514676389302611),
 (('holler', 'gram'), 11.225169772107627),
 (('physical', 'worlds'), 11.211363972582596),
 (('gon', 'na'), 11.137706930857284),
 (('convention', 'center'), 10.94506185291489),
 (('choice', 'awards'), 10.903241677220263),
 (('includes', 'uberguide'), 10.765738153470329),
 (('connect', 'digital'), 10.510924254441502),
 (('song', 'info'), 10.318279176499107),
 (('looking', 'forward'), 9.967781929414976),
 (('brain', 'search'), 9.79221036483152),
 (('911tweets', 'panel'), 9.742086885405683),
 (('marketing', 'experts'), 9.707321467245007),
 (('schools', 'marketing'), 9.64020727138647),
 (('think', 'speak.'), 9.610459927992418),
 (('giving', 'away'), 9.573093075527932),
 (('shop', 'core'), 9.449995255799172),
 (('best', 'andoid'), 9.27097346172075),
 (('marissa', 'mayer'), 9.130149579679204),
 (('video', 'streaming'), 9.129245352109091),
 (('route', 'around'), 9.048581040384304),
 (('ever', 'heard'

In [42]:
# negative pmi
display_pmi(concat_negative)

[(('fascist', 'company'), 8.811423845854492),
 (('company', 'america'), 8.770781861357147),
 (('network', 'called'), 8.693272494136231),
 (('design', 'headaches'), 8.033816267190938),
 (('launch', 'major'), 7.896312743441005),
 (('social', 'network'), 7.787439205310752),
 (('called', 'circles'), 7.51270024849441),
 (('news', 'apps'), 7.464795310338804),
 (('major', 'new'), 7.192514013209998),
 (('new', 'social'), 6.460710124159569),
 (('ca', "n't"), 6.398227693399816),
 (('today', 'link'), 5.320748940722099),
 (('apple', 'store'), 4.9956811383041675),
 (('ipad', '2'), 4.930438527345631),
 (('iphone', 'battery'), 4.916358685642763),
 (('ipad', 'design'), 4.727154928956885),
 (('ipad', 'news'), 4.7016198368497495),
 (('google', 'launch'), 4.6612620992336105),
 (('rt', 'mention'), 4.540659968498282),
 (('google', 'circles'), 4.480689853591787),
 (('iphone', 'app'), 4.0538622093926975),
 (('sxsw', 'rt'), 2.3051398812293122),
 (('quot', 'apple'), 1.9425011004449946),
 (('link', 'sxsw'), 1.8

In [43]:
display_pmi(concat_ambig)

[(('ipad', '2'), 5.322673481130185),
 (('apple', 'store'), 4.95041797114944),
 (('rt', 'mention'), 4.175354797241615),
 (('sxsw', 'link'), 2.108480622519423),
 (('sxsw', 'rt'), 1.8404455356272624),
 (('link', 'sxsw'), 1.4804493999063801),
 (('mention', 'mention'), 1.2184235191335038),
 (('sxsw', 'mention'), 0.3649721824864436)]

Looking at PMI scores, we can see some trends starting to emerge within our dataset.  Looking at positive tweets, we see combinations such as ('choice', 'awards') and ('looking', 'forward'), which contrast with some of the combinations in the negative-labeled tweets, such as ('fascist', 'company') and ('design', 'headaches'). 

# 3. Data Preparation
Leverage information learned during data understanding phase to preprocess dataset and prepare data for modeling. 

In [44]:
# set seed for reproducibility
SEED = 1

In [45]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [46]:
# pull in copy of dataset
clean_df = raw_df.copy()

# relabel columns
clean_df.columns = ['text', 'product_brand', 'sentiment']

# drop product_brand column, handle missing values and duplicates
clean_df = clean_df.drop('product_brand', axis=1)
clean_df = clean_df.dropna()
clean_df = clean_df.drop_duplicates()

# remove ambiguous tweets
clean_df = clean_df.loc[clean_df['sentiment'] != "I can't tell"]

In [47]:
# separate dataset into text and class_labels
text = clean_df['text']
class_labels = clean_df['sentiment']

In [48]:
# split tweets and labels into train and test sets for validation purposes
X_train, X_test, y_train, y_test = train_test_split(text, class_labels, stratify=class_labels, random_state=SEED)

In [49]:
# update stopwords list per our data understanding findings
updated_stopwords = stopwords_list + additional_stopwords

In [50]:
def preprocess_tweet(tweet, stopwords_list):
    """
    Function to preprocess a tweet. 
    Takes: tweet, stopwords list
    Returns: processed tweet with stopwords removed and converted to lowercase
    """
    processed = re.sub(r"'", '', tweet) # remove apostrophes
    processed = re.sub(r'\s+', ' ', processed) # collapse excess white space
    tokens = nltk.word_tokenize(processed)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return ' '.join(stopwords_removed)
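
The two regular-expression steps at the start of `preprocess_tweet` can be tried in isolation. The sketch below uses only the standard library and a made-up tweet; `normalize_text` is a hypothetical helper name, not part of the notebook.

```python
import re

def normalize_text(tweet):
    """Remove apostrophes, then collapse runs of whitespace into single spaces."""
    no_apostrophes = re.sub(r"'", "", tweet)
    return re.sub(r"\s+", " ", no_apostrophes)

# Hypothetical tweet with a contraction, repeated spaces, and a newline
example = "I can't   wait\nfor #iPad 2"
normalized = normalize_text(example)
print(normalized)
```

Removing apostrophes before tokenizing keeps a contraction such as "can't" together as "cant" instead of letting the tokenizer split it into pieces like `ca` and `n't`, which showed up as separate tokens during EDA.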

In [51]:
# preprocess train and test sets
X_train_preprocessed = X_train.apply(lambda x: preprocess_tweet(x, updated_stopwords))
X_test_preprocessed = X_test.apply(lambda x: preprocess_tweet(x, updated_stopwords))

Now that we have split our data into train and test sets and preprocessed both, we are ready to vectorize our data.  TF-IDF vectorization is a natural candidate for classification because it up-weights words that are distinctive to particular documents, but rather than assume it is best, we will compare count vectorization and TF-IDF vectorization to see whether one performs better than the other.
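
To make the difference between the two vectorizers concrete, the sketch below computes the inverse document frequency weight that TF-IDF multiplies onto raw counts. It assumes the smoothed formula scikit-learn uses by default, idf = ln((1 + n) / (1 + df)) + 1, and the document counts are made up for illustration.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# Hypothetical corpus of 4 tweets
n_docs = 4
idf_everywhere = smooth_idf(n_docs, document_frequency=4)  # word appears in every tweet
idf_rare = smooth_idf(n_docs, document_frequency=1)        # word appears in one tweet

print(round(idf_everywhere, 3))
print(round(idf_rare, 3))
```

A word that appears in every tweet keeps a weight of 1, while a word confined to one tweet is weighted almost twice as heavily, so TF-IDF reduces the influence of ubiquitous terms relative to plain counts.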

In [62]:
# create vectorizers with unigram and bigrams
count_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word', min_df=5)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', min_df=5)

# fit to data
X_train_count = count_vectorizer.fit_transform(X_train_preprocessed)
X_test_count = count_vectorizer.transform(X_test_preprocessed)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_preprocessed)
X_test_tfidf = tfidf_vectorizer.transform(X_test_preprocessed)

Now that we have vectorized our datasets, we are ready to move on to the modeling stage.

# 4. Modeling

This is a classification task: predict the sentiment of a tweet from the text within the tweet. Three primary models will be relied on for classification:
1. Random Forests
2. Linear SVM
3. Neural Networks (Word Embeddings / Sequential)

Overfitting will be addressed through hyperparameter tuning, such as pruning (limiting the depth and leaf size of) the trees used in random forests / XGBoost, in addition to other parameter tuning. 

This is a multi-class classification problem with three available class labels (Neutral, Positive, or Negative). The performance metric we will focus on throughout this process is accuracy; because the classes are imbalanced, accuracy will be compared against the ~60% that a model guessing neutral for every entry would achieve. 

### Random Forest

In [63]:
# instantiate random forest classifiers, with balanced class_weight
rf_count = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')
rf_tfidf = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')

# fit to training sets
rf_count.fit(X_train_count, y_train)
rf_tfidf.fit(X_train_tfidf, y_train)

RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=1)

In [66]:
# Count Vectorized
count_train_score = rf_count.score(X_train_count, y_train)
count_test_score = rf_count.score(X_test_count, y_test)
print(f'Count Vectorized Train Score: {count_train_score}')
print(f'Count Vectorized Test Score: {count_test_score}')

# TF-IDF Vectorized
tfidf_train_score = rf_tfidf.score(X_train_tfidf, y_train)
tfidf_test_score = rf_tfidf.score(X_test_tfidf, y_test)
print(f'TF-IDF Vectorized Train Score: {tfidf_train_score}')
print(f'TF-IDF Vectorized Test Score: {tfidf_test_score}')

Count Vectorized Train Score: 0.9546746447270007
Count Vectorized Test Score: 0.6523104531179902
TF-IDF Vectorized Train Score: 0.9546746447270007
TF-IDF Vectorized Test Score: 0.6742934051144011


Comparing the two methods of vectorization, we can see that the TF-IDF is producing slightly better results with a testing accuracy of ~67% compared to that in the count vectorized dataset of ~65%.  Both methods show testing accuracy that is slightly better than our "dummy" model that could produce ~60% accuracy from guessing neutral for all entries. 
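
The "dummy" baseline referred to above can be sketched in a few lines. The example below uses toy labels in roughly the proportions seen in the value counts (an assumption, not the actual test split) and reports both the accuracy and the per-class recall of always predicting the majority class; `always_majority_report` is a hypothetical helper.

```python
from collections import Counter

def always_majority_report(y_true):
    """Accuracy and per-class recall of predicting the most common label every time."""
    majority_label, majority_count = Counter(y_true).most_common(1)[0]
    accuracy = majority_count / len(y_true)
    # Every majority example is recovered; every other class is never predicted
    recall = {label: 1.0 if label == majority_label else 0.0 for label in set(y_true)}
    return majority_label, accuracy, recall

# Toy labels: 60% neutral, 33% positive, 7% negative (illustrative proportions only)
y_toy = ["neutral"] * 60 + ["positive"] * 33 + ["negative"] * 7

majority_label, baseline_accuracy, baseline_recall = always_majority_report(y_toy)
print(majority_label, baseline_accuracy)
print(baseline_recall)
```

The baseline reaches 60% accuracy while never identifying a single positive or negative tweet, which is why a model only a few points above it has learned relatively little about the minority classes.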

Additionally, looking at the gap between training and testing scores, we can see that our model is likely overfitting to the training data. We will move forward with tuning hyperparameters to try to improve performance and address overfitting.

In [69]:
# set grid search parameters
random_forest_params = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [4, 5, 6],
    'min_samples_leaf': [3, 4, 5],
}

In [70]:
# instantiate classifier for grid search
rf_classifier = RandomForestClassifier(random_state=SEED, n_jobs=-1, class_weight='balanced')
rf_grid_search = GridSearchCV(rf_classifier, 
                              random_forest_params, 
                              return_train_score=True,
                              scoring='accuracy')

# fit to count-vectorized data
rf_grid_search.fit(X_train_count, y_train)

GridSearchCV(estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=1),
             param_grid={'criterion': ['entropy', 'gini'],
                         'max_depth': [5, 10, 15],
                         'min_samples_leaf': [3, 4, 5],
                         'min_samples_split': [4, 5, 6]},
             return_train_score=True, scoring='accuracy')

In [75]:
# print count-vectorized grid-search results
mean_train_score_count = np.mean(rf_grid_search.cv_results_['mean_train_score'])
mean_test_score_count = np.mean(rf_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (Count Vect.): {mean_train_score_count}')
print(f'Grid Search Test Accuracy (Count Vect.): {mean_test_score_count}')

# display best params
rf_grid_search.best_params_

Grid Search Train Accuracy (Count Vect.): 0.5784343056594365
Grid Search Test Accuracy (Count Vect.): 0.5346796310147095


{'criterion': 'entropy',
 'max_depth': 15,
 'min_samples_leaf': 3,
 'min_samples_split': 4}

Looking at our grid search results, we can see overfitting has largely been addressed by tuning hyperparameters and pruning our trees.  Note that the scores printed above are averages across every parameter combination in the grid, not the score of the best combination alone.  While overfitting has improved, our test accuracy has dropped: the average cross-validated test score of ~53% is now below our "dummy" model that guesses "neutral" for every entry, which scores around 60%. 
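The train and test figures printed above come from `np.mean` over the `cv_results_` arrays, so they average over every candidate in the grid. The score of the selected candidate is available separately as `best_score_`. A minimal sketch of the difference, assuming synthetic stand-in data from `make_classification` and a deliberately small grid (`X_demo`, `y_demo`, and `demo_search` are illustrative names, not part of this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class stand-in data (an assumption; the notebook uses vectorized tweets)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=1)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=1, class_weight='balanced'),
    {'max_depth': [2, 5, 10]},
    return_train_score=True,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

results = demo_search.cv_results_
grid_average = np.mean(results['mean_test_score'])  # averaged over all candidates
best_cv_score = demo_search.best_score_             # selected candidate only
best_train_score = results['mean_train_score'][demo_search.best_index_]

print(f'Average test accuracy over the grid: {grid_average}')
print(f'Best candidate CV test accuracy:     {best_cv_score}')
print(f'Best candidate CV train accuracy:    {best_train_score}')
```

Because the best candidate can never score below the grid average, `best_score_` is the fairer number to compare against the "dummy" baseline. With the default `refit=True`, the refitted winner is also available as `best_estimator_`.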

In [76]:
# fit to TF-IDF Vectorized dataset
rf_grid_search.fit(X_train_tfidf, y_train)

GridSearchCV(estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=1),
             param_grid={'criterion': ['entropy', 'gini'],
                         'max_depth': [5, 10, 15],
                         'min_samples_leaf': [3, 4, 5],
                         'min_samples_split': [4, 5, 6]},
             return_train_score=True, scoring='accuracy')

In [79]:
# print tfidf-vectorized grid search results
mean_train_score_tfidf = np.mean(rf_grid_search.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(rf_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
rf_grid_search.best_params_

Grid Search Train Accuracy (TF-IDF): 0.5873431396991605
Grid Search Test Accuracy (TF-IDF): 0.532469043463808


{'criterion': 'gini',
 'max_depth': 15,
 'min_samples_leaf': 3,
 'min_samples_split': 4}

In [80]:
# fit model to best TF-IDF params
best_random_forest = RandomForestClassifier(class_weight='balanced',
                                            n_jobs=-1,
                                            random_state=SEED,
                                            criterion='gini',
                                            max_depth=15,
                                            min_samples_leaf=3,
                                            min_samples_split=4)

best_random_forest.fit(X_train_tfidf, y_train)

RandomForestClassifier(class_weight='balanced', max_depth=15,
                       min_samples_leaf=3, min_samples_split=4, n_jobs=-1,
                       random_state=1)

Compared with the count-vectorized data, the TF-IDF grid search results are in line, with an average cross-validated test accuracy of ~53% in both cases.  Performance is again slightly worse than our "dummy" model, which always predicts neutral and scores around 60%.  We move forward with additional modeling techniques to try to find improved results.

### Linear SVM

In [88]:
# create linear SVC
svc_count = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=5000)
svc_tfidf = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=5000)

# fit to training sets
svc_count.fit(X_train_count, y_train)
svc_tfidf.fit(X_train_tfidf, y_train)

LinearSVC(class_weight='balanced', max_iter=5000, random_state=1)

In [89]:
# Count Vectorized
count_train_score = svc_count.score(X_train_count, y_train)
count_test_score = svc_count.score(X_test_count, y_test)
print(f'Count Vectorized Train Score: {count_train_score}')
print(f'Count Vectorized Test Score: {count_test_score}')

# TF-IDF Vectorized
tfidf_train_score = svc_tfidf.score(X_train_tfidf, y_train)
tfidf_test_score = svc_tfidf.score(X_test_tfidf, y_test)
print(f'TF-IDF Vectorized Train Score: {tfidf_train_score}')
print(f'TF-IDF Vectorized Test Score: {tfidf_test_score}')

Count Vectorized Train Score: 0.8954375467464473
Count Vectorized Test Score: 0.655002243158367
TF-IDF Vectorized Train Score: 0.8587883320867614
TF-IDF Vectorized Test Score: 0.6558995065051593


Looking at baseline LinearSVC results, we can see that we are not overfitting as badly to the training data as we were with the random forest model above. Baseline test scores are in line with each other when comparing the Count and TF-IDF vectorized datasets. 

Next, we try to address overfitting and improve results with a grid search.

In [94]:
# set params for grid search
svc_params = {
    'C': [0.01, 0.1, 1, 10]
}

In [95]:
# grid search 
svc_classifier = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=10000)
svc_grid_search = GridSearchCV(svc_classifier,
                               svc_params,
                               return_train_score=True,
                               scoring='accuracy')

# fit to count data
svc_grid_search.fit(X_train_count, y_train)



GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.01, 0.1, 1, 10]}, return_train_score=True,
             scoring='accuracy')

In [96]:
# print count-vectorized grid-search results
mean_train_score_count = np.mean(svc_grid_search.cv_results_['mean_train_score'])
mean_test_score_count = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (Count Vect.): {mean_train_score_count}')
print(f'Grid Search Test Accuracy (Count Vect.): {mean_test_score_count}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (Count Vect.): 0.874850411368736
Grid Search Test Accuracy (Count Vect.): 0.6465968586387435


{'C': 0.01}

In [97]:
# failing to converge, try with tfidf vectorized data
svc_grid_search.fit(X_train_tfidf, y_train)

GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.01, 0.1, 1, 10]}, return_train_score=True,
             scoring='accuracy')

In [98]:
# print tfidf-vectorized grid search results
mean_train_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (TF-IDF): 0.8177356020942408
Grid Search Test Accuracy (TF-IDF): 0.6457367240089753


{'C': 0.1}

With LinearSVC, our TF-IDF vectorized data shows comparable test accuracy (~64.6% for both vectorizations) with slightly less overfitting to the training data (average train accuracy of ~82% vs. ~87%).  Additionally, test score results are marginally better than the "dummy" model that guesses neutral every time. Moving forward, we will stick with TF-IDF vectorization and try to remove some more of the overfitting that is occurring. 

In [103]:
svc_params = {
    'C': [0.001, 0.01]
}

# grid search 
svc_classifier = LinearSVC(random_state=SEED, class_weight='balanced', max_iter=10000)
svc_grid_search = GridSearchCV(svc_classifier,
                               svc_params,
                               return_train_score=True,
                               scoring='accuracy')

# fit to tfidf data
svc_grid_search.fit(X_train_tfidf, y_train)

GridSearchCV(estimator=LinearSVC(class_weight='balanced', max_iter=10000,
                                 random_state=1),
             param_grid={'C': [0.001, 0.01]}, return_train_score=True,
             scoring='accuracy')

In [104]:
# print tfidf-vectorized grid search results
mean_train_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_train_score']) 
mean_test_score_tfidf = np.mean(svc_grid_search.cv_results_['mean_test_score'])
print(f'Grid Search Train Accuracy (TF-IDF): {mean_train_score_tfidf}')
print(f'Grid Search Test Accuracy (TF-IDF): {mean_test_score_tfidf}')

# display best params
svc_grid_search.best_params_

Grid Search Train Accuracy (TF-IDF): 0.6361817501869858
Grid Search Test Accuracy (TF-IDF): 0.6201944652206433


{'C': 0.01}

Running the updated grid search on the TF-IDF vectorized data, we see some improved results. Our average training accuracy of 63.6% is in line with our average test accuracy of 62%, so we are likely no longer overfitting to the training data.  While results are not particularly strong, we are still outperforming the "dummy" model, which would score closer to 60%. 
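The narrowing gap follows from how `C` works in `LinearSVC`: smaller values apply stronger regularization, which generally trades training accuracy for a smaller train/test gap. A minimal sketch of sweeping `C` and recording that gap, assuming synthetic stand-in data (`X_demo`, `y_demo`, and the `gaps` dictionary are illustrative, not part of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class stand-in data (an assumption; the notebook uses TF-IDF vectors)
X_demo, y_demo = make_classification(n_samples=400, n_features=50, n_informative=10,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1,
                                          stratify=y_demo)

gaps = {}
for C in [0.001, 0.01, 0.1, 1]:
    model = LinearSVC(C=C, class_weight='balanced', max_iter=10000, random_state=1)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[C] = train_acc - test_acc  # larger gap suggests more overfitting
    print(f'C={C}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gaps[C]:.3f}')
```

Inspecting the gap per value of `C`, rather than a single average over the grid, makes it easier to see where regularization starts to cost test accuracy.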

In [105]:
# run a best version of the SVC model
best_svc = LinearSVC(class_weight='balanced',
                     max_iter=10000,
                     random_state=SEED,
                     C=0.01)

# fit to data
best_svc.fit(X_train_tfidf, y_train)

LinearSVC(C=0.01, class_weight='balanced', max_iter=10000, random_state=1)

In [106]:
# generate scores
print(f'Best Identified SVC Train Accuracy: {best_svc.score(X_train_tfidf, y_train)}')
print(f'Best Identified SVC Test Accuracy: {best_svc.score(X_test_tfidf, y_test)}')

Best Identified SVC Train Accuracy: 0.6797307404637247
Best Identified SVC Test Accuracy: 0.655002243158367


### Neural Networks