# Tweet Sentiment Classification
### Module 4 Project - Kai Graham

# CRISP-DM
I will be following the Cross-Industry Standard Process for Data Mining to build a classifier that will determine the sentiment of tweets.  The CRISP-DM process includes the following key steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

# 1. Business Understanding
* Who are the stakeholders?
* What Business Problems will this solve?
* What problems are inside the scope of this project?
* What problems are outside the scope?
* What data sources are available to us?
* Timeline of project / deadlines?
* Do stakeholders from different parts of the company all agree?

My overall goal is to create a classifier that assigns each tweet to one of three sentiment classes: positive, negative, or neutral.  Because the data is limited to tweets about Apple and Google products, the most likely stakeholders for this project are Apple and Google themselves.  Product managers and other managers at either company could use a tool like this to track public sentiment around product launches and software updates.  Although it is outside the scope of this project, combining the classifier's output with time series metrics would let a company track how sentiment changes over time.

# 2. Data Understanding
* What data is available to us, where does it live?
* What is our target?
* What predictors are available?
* What is distribution of data?

The main dataset used throughout this data science process comes from CrowdFlower via the following url: `https://data.world/crowdflower/brands-and-product-emotions`.

The following summary of the dataset is provided on CrowdFlower:

*Contributors evaluated tweets about multiple brands and products. The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion.*

In [1]:
# import necessary libraries to load dataset and perform initial data understanding
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# failure to specify 'latin1' encoding results in errors
# error_df = pd.read_csv('judge-1377884607_tweet_product_company.csv')
# error_df.head()

In [3]:
# load dataset
raw_df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin_1')

In [4]:
# print first rows of dataset
raw_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [5]:
# show info of df
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


Looking at the above outputs, we can see that there are a total of 9,093 entries in our dataset, with a total of three columns.  Raw text from tweets is held in the `tweet_text` column; sentiment is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column; and the brand or product the emotion is directed at is held in the `emotion_in_tweet_is_directed_at` column.

At first glance, we can likely drop the `emotion_in_tweet_is_directed_at` column, as we are more interested in whether the sentiment of a given tweet is positive, neutral, or negative based on its text.  The main predictors we will use are processed features derived from the `tweet_text` column.

Our target variable, which can also be thought of as our class labels, is held in the `is_there_an_emotion_directed_at_a_brand_or_product` column.

In [6]:
# display value counts
display(raw_df['emotion_in_tweet_is_directed_at'].value_counts())
display(raw_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts())

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

Unsurprisingly, given the origin of our dataset, the products identified are either Apple or Google products.  Looking at sentiment, the majority of entries fall under a neutral sentiment ('No emotion toward brand or product'), with the next largest group being tagged as 'Positive emotion'.  There is clear class imbalance present, with only 570 entries belonging to the 'Negative emotion' class.

As we move into the data preparation stage, class imbalance will need to be addressed, in addition to the number of missing values clearly present in the dataset. 

# 3. Data Preparation
This section will handle certain data cleaning processes, display additional summary information about the dataset, and prepare final datasets for modeling. 

In [7]:
# import additional necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
import string
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE

In [8]:
# set random seed for reproducibility
SEED = 23

In [9]:
# rename columns so easier to work with
df = raw_df.copy()
df.columns = ['text', 'product_brand', 'sentiment']
df.head()

| | text | product_brand | sentiment |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [10]:
# check for missing values
df.isna().sum()

text                1
product_brand    5802
sentiment           0
dtype: int64

In [11]:
# display missing text entry
df.loc[df['text'].isna()]

| | text | product_brand | sentiment |
|---|---|---|---|
| 6 | NaN | NaN | No emotion toward brand or product |


In [12]:
# drop as text is missing
clean_df = df.dropna(subset=['text'])

In [13]:
# examine missing product/brand entries
missing_prod_brand = clean_df.loc[clean_df['product_brand'].isna()]

# display value counts
missing_prod_brand['sentiment'].value_counts()

No emotion toward brand or product    5297
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: sentiment, dtype: int64

In [14]:
# drop product/brand column as we are only focused on sentiment
clean_df = clean_df.drop(['product_brand'], axis=1)

In [15]:
# check for additional missing values now that column is dropped
clean_df.isna().any()

text         False
sentiment    False
dtype: bool

In [16]:
# check for duplicates
clean_df.duplicated().sum()

22

In [17]:
# remove duplicates
final_clean_df = clean_df.drop_duplicates()
final_clean_df.duplicated().any()

False

In [18]:
# examine a sample of tweets labeled "I can't tell"
for i in range(7):
    display(final_clean_df.loc[final_clean_df['sentiment'] == "I can't tell"].iloc[i]['text'])

'Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh'

'\x89ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow'

'Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw'

'The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw'

"Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link}"

'SYD #SXSW crew your iPhone extra juice pods have been procured.'

'Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}'

Looking at a sample of the tweets labeled "I can't tell", there is no clear class that each should belong to.  Given this, and the small number of tweets with this label (156 in the raw data), I will remove them from the dataset.

In [19]:
# remove "I can't tell" tweets from dataset
final_clean_df = final_clean_df.loc[final_clean_df['sentiment'] != "I can't tell"]

In [20]:
# split final clean dataset into tweets and class_labels 
tweets = final_clean_df['text']
class_labels = final_clean_df['sentiment']

In [21]:
# tokenize tweets
tokenized = list(map(nltk.word_tokenize, tweets))

In [22]:
# vocabulary of dataset
raw_tweet_vocabulary = set()
for tweet in tokenized:
    raw_tweet_vocabulary.update(tweet)
len(raw_tweet_vocabulary)

13075

In [23]:
# display frequency distribution of unprocessed dataset
tweets_concat = []
for tweet in tokenized:
    tweets_concat += tweet

unprocessed_freq_dist = nltk.FreqDist(tweets_concat)
unprocessed_freq_dist.most_common(25)

[('#', 15593),
 ('@', 7075),
 ('mention', 7005),
 ('.', 5480),
 ('SXSW', 4696),
 ('sxsw', 4432),
 ('link', 4247),
 ('}', 4234),
 ('{', 4232),
 ('the', 3855),
 ('to', 3460),
 (',', 3459),
 ('RT', 2899),
 ('at', 2808),
 (';', 2748),
 ('&', 2657),
 ('for', 2399),
 ('!', 2370),
 ('a', 2128),
 ('iPad', 2088),
 ('Google', 2077),
 (':', 2030),
 ('Apple', 1850),
 ('in', 1804),
 ('quot', 1657)]

In [24]:
# pull in stopwords from english language and append punctuation
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

In [25]:
# helper function to remove stopwords and convert everything to lowercase
def initial_tweet_process(tweet):
    """
    [...]
    """
    tokens = nltk.word_tokenize(tweet)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed

In [26]:
# process all tweets in dataset
processed_tweets = list(map(initial_tweet_process, tweets))

In [27]:
# split processed tweets by class label to see if there are any differences
neutral_tweets = final_clean_df.loc[final_clean_df['sentiment'] == 'No emotion toward brand or product']
positive_tweets = final_clean_df.loc[final_clean_df['sentiment'] == 'Positive emotion']
negative_tweets = final_clean_df.loc[final_clean_df['sentiment'] == 'Negative emotion']

# process three datasets
processed_neutral_tweets = list(map(initial_tweet_process, neutral_tweets['text']))
processed_positive_tweets = list(map(initial_tweet_process, positive_tweets['text']))
processed_negative_tweets = list(map(initial_tweet_process, negative_tweets['text']))

In [28]:
# helper function to concatenate tweets
def tweet_concat(tweets):
    tweets_concat = []
    for tweet in tweets:
        tweets_concat += tweet
    return tweets_concat

In [29]:
# apply to all datasets
all_concat = tweet_concat(processed_tweets)
neutral_concat = tweet_concat(processed_neutral_tweets)
positive_concat = tweet_concat(processed_positive_tweets)
negative_concat = tweet_concat(processed_negative_tweets)

In [30]:
# create frequency distributions
all_freqdist = nltk.FreqDist(all_concat)
neutral_freqdist = nltk.FreqDist(neutral_concat)
positive_freqdist = nltk.FreqDist(positive_concat)
negative_freqdist = nltk.FreqDist(negative_concat)

In [31]:
# print top 25 most common words for each 
top_n = 25

# all
print(f'Top {top_n} most common (All): ')
display(all_freqdist.most_common(top_n))

# neutral
print(f'Top {top_n} most common (Neutral):')
display(neutral_freqdist.most_common(top_n))

# positive
print(f'Top {top_n} most common (Positive):')
display(positive_freqdist.most_common(top_n))

# negative
print(f'Top {top_n} most common (Negative):')
display(negative_freqdist.most_common(top_n))

Top 25 most common (All): 


[('sxsw', 9334),
 ('mention', 7006),
 ('link', 4249),
 ('rt', 2914),
 ('google', 2531),
 ('ipad', 2401),
 ('apple', 2265),
 ('quot', 1657),
 ('iphone', 1498),
 ('store', 1457),
 ("'s", 1216),
 ('2', 1104),
 ('new', 1074),
 ('austin', 946),
 ('amp', 827),
 ('app', 810),
 ('launch', 640),
 ('circles', 634),
 ('social', 629),
 ('android', 570),
 ('today', 565),
 ("n't", 468),
 ('ipad2', 454),
 ('network', 451),
 ('pop-up', 411)]

Top 25 most common (Neutral):


[('sxsw', 5658),
 ('mention', 4501),
 ('link', 2933),
 ('rt', 1841),
 ('google', 1672),
 ('apple', 1223),
 ('ipad', 1212),
 ('quot', 1018),
 ('store', 867),
 ('iphone', 815),
 ('new', 671),
 ("'s", 647),
 ('austin', 630),
 ('amp', 597),
 ('2', 550),
 ('circles', 483),
 ('social', 474),
 ('launch', 458),
 ('today', 434),
 ('app', 355),
 ('android', 350),
 ('network', 348),
 ('via', 271),
 ('called', 270),
 ('free', 260)]

Top 25 most common (Positive):


[('sxsw', 3097),
 ('mention', 2192),
 ('link', 1214),
 ('ipad', 1001),
 ('rt', 935),
 ('apple', 922),
 ('google', 714),
 ('store', 544),
 ('iphone', 523),
 ("'s", 492),
 ('2', 490),
 ('quot', 464),
 ('app', 395),
 ('new', 360),
 ('austin', 292),
 ('amp', 208),
 ('ipad2', 208),
 ('android', 196),
 ('launch', 160),
 ('get', 157),
 ("n't", 152),
 ('pop-up', 151),
 ('one', 146),
 ('great', 137),
 ('party', 132)]

Top 25 most common (Negative):


[('sxsw', 579),
 ('mention', 313),
 ('ipad', 188),
 ('quot', 175),
 ('iphone', 160),
 ('google', 145),
 ('rt', 138),
 ('apple', 120),
 ('link', 102),
 ("n't", 87),
 ("'s", 77),
 ('2', 64),
 ('app', 60),
 ('store', 46),
 ('new', 43),
 ('like', 39),
 ('circles', 34),
 ('social', 31),
 ('apps', 30),
 ('people', 29),
 ('design', 28),
 ('need', 25),
 ('android', 24),
 ('austin', 24),
 ('get', 24)]

In [32]:
# vocab size of each
print(f'All Tweets Vocab: {len(all_freqdist)}')
print(f'Neutral Tweets Vocab: {len(neutral_freqdist)}')
print(f'Positive Tweets Vocab: {len(positive_freqdist)}')
print(f'Negative Tweets Vocab: {len(negative_freqdist)}')

All Tweets Vocab: 10419
Neutral Tweets Vocab: 7818
Positive Tweets Vocab: 5362
Negative Tweets Vocab: 2128


In [33]:
# function to print normalized word_freq
def print_normalized_word_freq(frequency_distribution, n=25):
    """
    Print a normalized frequency distribution from given distribution. Returns top n. 
    """
    total_word_count = sum(frequency_distribution.values())
    top = frequency_distribution.most_common(n)
    
    print('Word\t\t\tNormalized Frequency')
    for word in top:
        normalized_freq = word[1] / total_word_count
        print('{} \t\t\t {:.4}'.format(word[0], normalized_freq))
        
    return None

In [34]:
print('All:')
print_normalized_word_freq(all_freqdist)

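print('\nNeutral:')
# pass the neutral distribution here; the captured output below was produced with
# negative_freqdist, so its 'Neutral:' block lists the negative-class frequencies
print_normalized_word_freq(neutral_freqdist)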

print('\n Positive')
print_normalized_word_freq(positive_freqdist)

print('\n Negative')
print_normalized_word_freq(negative_freqdist)

All:
Word			Normalized Frequency
sxsw 			 0.08194
mention 			 0.06151
link 			 0.0373
rt 			 0.02558
google 			 0.02222
ipad 			 0.02108
apple 			 0.01988
quot 			 0.01455
iphone 			 0.01315
store 			 0.01279
's 			 0.01068
2 			 0.009692
new 			 0.009429
austin 			 0.008305
amp 			 0.00726
app 			 0.007111
launch 			 0.005619
circles 			 0.005566
social 			 0.005522
android 			 0.005004
today 			 0.00496
n't 			 0.004109
ipad2 			 0.003986
network 			 0.003959
pop-up 			 0.003608

Neutral:
Word			Normalized Frequency
sxsw 			 0.07907
mention 			 0.04274
ipad 			 0.02567
quot 			 0.0239
iphone 			 0.02185
google 			 0.0198
rt 			 0.01884
apple 			 0.01639
link 			 0.01393
n't 			 0.01188
's 			 0.01051
2 			 0.00874
app 			 0.008193
store 			 0.006282
new 			 0.005872
like 			 0.005326
circles 			 0.004643
social 			 0.004233
apps 			 0.004097
people 			 0.00396
design 			 0.003824
need 			 0.003414
android 			 0.003277
austin 			 0.003277
get 			 0.003277

 Positive
Word			Normalized 

In [35]:
# function to display bigrams
def display_bigrams(tweets_concat, n=25):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    tweet_finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_scored = tweet_finder.score_ngrams(bigram_measures.raw_freq)
    display(tweet_scored[:n])
    return None

In [36]:
display_bigrams(all_concat)

[(('rt', 'mention'), 0.02479171970608117),
 (('ipad', '2'), 0.008041506816845026),
 (('sxsw', 'link'), 0.008032727879272051),
 (('link', 'sxsw'), 0.007005592183233985),
 (('sxsw', 'rt'), 0.005592183233985023),
 (('mention', 'mention'), 0.005276141481357926),
 (('mention', 'sxsw'), 0.005144457417763302),
 (('apple', 'store'), 0.005012773354168679),
 (('link', 'rt'), 0.0044333634743523335),
 (('sxsw', 'mention'), 0.004169995347163087),
 (('mention', 'google'), 0.004126100659298212),
 (('social', 'network'), 0.0038539535945359893),
 (('new', 'social'), 0.003555469717054842),
 (('mention', 'rt'), 0.0029672808996655223),
 (('via', 'mention'), 0.0028443757736438735),
 (('store', 'sxsw'), 0.002835596836070899),
 (('sxsw', 'apple'), 0.002809260023351974),
 (('network', 'called'), 0.0027478074603411494),
 (('google', 'launch'), 0.0027214706476222246),
 (('austin', 'sxsw'), 0.00267757595975735),
 (('called', 'circles'), 0.0026424602094654503),
 (('mention', 'apple'), 0.002607344459173551),
 (('i

In [37]:
display_bigrams(neutral_concat)

[(('rt', 'mention'), 0.02610450576931514),
 (('link', 'sxsw'), 0.008716126296085055),
 (('sxsw', 'link'), 0.008526009447344945),
 (('ipad', '2'), 0.006507845976103774),
 (('mention', 'mention'), 0.0063323535003436725),
 (('sxsw', 'rt'), 0.005645007970283274),
 (('link', 'rt'), 0.005133154915982977),
 (('mention', 'google'), 0.005045408678102927),
 (('social', 'network'), 0.004986911186182892),
 (('apple', 'store'), 0.004884540575322833),
 (('mention', 'sxsw'), 0.004869916202342825),
 (('new', 'social'), 0.004562804369762646),
 (('sxsw', 'mention'), 0.004460433758902587),
 (('network', 'called'), 0.003582971380102078),
 (('mention', 'rt'), 0.0034367276503019933),
 (('called', 'circles'), 0.003422103277321985),
 (('google', 'launch'), 0.003422103277321985),
 (('major', 'new'), 0.003158864563681832),
 (('via', 'mention'), 0.0031442401907018237),
 (('launch', 'major'), 0.0030711183258017812),
 (('austin', 'sxsw'), 0.0028956258500416796),
 (('store', 'sxsw'), 0.0028956258500416796),
 (('men

In [38]:
display_bigrams(positive_concat)

[(('rt', 'mention'), 0.02363441254220431),
 (('ipad', '2'), 0.011018923233962363),
 (('sxsw', 'link'), 0.008061350014395268),
 (('mention', 'sxsw'), 0.005705760724474573),
 (('apple', 'store'), 0.0055748946528123115),
 (('sxsw', 'rt'), 0.005077603580495721),
 (('link', 'sxsw'), 0.004815871437171199),
 (('link', 'rt'), 0.0038474625068704686),
 (('mention', 'mention'), 0.0038212892925380167),
 (('sxsw', 'mention'), 0.003585730363545947),
 (('iphone', 'app'), 0.0035072107205485906),
 (('sxsw', 'apple'), 0.003219305362891617),
 (('store', 'sxsw'), 0.003062266076896904),
 (('mention', 'google'), 0.002800533933572382),
 (('via', 'mention'), 0.002643494647577669),
 (('austin', 'sxsw'), 0.0026173214332452168),
 (('mention', 'apple'), 0.0025388017902478605),
 (('pop-up', 'store'), 0.0024341089329180518),
 (('mention', 'rt'), 0.0023817625042531474),
 (('sxsw', 'ipad'), 0.0022247232182584344),
 (('social', 'network'), 0.002198550003925982),
 (('google', 'maps'), 0.0020676839322637214),
 (('new', 

In [39]:
display_bigrams(negative_concat)

[(('rt', 'mention'), 0.01870817970777004),
 (('sxsw', 'rt'), 0.007374027038099139),
 (('ipad', '2'), 0.006827802813054759),
 (('mention', 'sxsw'), 0.004642905912877236),
 (('link', 'sxsw'), 0.00396012563157176),
 (('sxsw', 'mention'), 0.0036870135190495697),
 (('apple', 'store'), 0.0032773453502662844),
 (('sxsw', 'link'), 0.0032773453502662844),
 (('iphone', 'app'), 0.0030042332377440938),
 (('mention', 'google'), 0.0030042332377440938),
 (('mention', 'mention'), 0.0028676771814829987),
 (('ipad', 'design'), 0.0025945650689608085),
 (('sxsw', 'ipad'), 0.0025945650689608085),
 (('sxsw', 'iphone'), 0.0025945650689608085),
 (('sxsw', 'quot'), 0.0024580090126997134),
 (('design', 'headaches'), 0.002321452956438618),
 (('mention', 'quot'), 0.0021848969001775228),
 (('new', 'social'), 0.0021848969001775228),
 (('google', 'circles'), 0.0020483408439164277),
 (('iphone', 'sxsw'), 0.0020483408439164277),
 (('quot', 'sxsw'), 0.0020483408439164277),
 (('social', 'network'), 0.0019117847876553326

In [40]:
# function to display pmi
def display_pmi(tweets_concat, freq_filter=5):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    tweet_pmi_finder = BigramCollocationFinder.from_words(tweets_concat)
    tweet_pmi_finder.apply_freq_filter(freq_filter)
    tweet_pmi_scored = tweet_pmi_finder.score_ngrams(bigram_measures.pmi)
    display(tweet_pmi_scored)
    return None
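
For context on the scores this helper returns: pointwise mutual information (PMI) compares how often two words appear side by side with how often they would if they were independent. The sketch below recomputes that quantity by hand on a toy token list as log2(count(x, y) * N / (count(x) * count(y))), where N is the number of tokens. I believe this matches the form NLTK uses, but treat that as an assumption; `toy_tokens` and `pmi` are illustrative names, not part of the notebook.

```python
import math
from collections import Counter


def pmi(tokens, first, second):
    """PMI of the adjacent pair (first, second) within a flat token list."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    pair_count = pair_counts[(first, second)]
    if pair_count == 0:
        raise ValueError("pair never occurs, so PMI is undefined")
    return math.log2(pair_count * len(tokens)
                     / (word_counts[first] * word_counts[second]))


# Toy corpus: 'ice cream' always co-occurs, while 'sxsw' sits next to many words.
toy_tokens = ["sxsw", "ice", "cream", "sxsw", "ipad", "sxsw", "ice", "cream"]

print(pmi(toy_tokens, "ice", "cream"))  # pair whose words only appear together
print(pmi(toy_tokens, "sxsw", "ice"))   # frequent word next to a rarer one
```

Because the score divides by the individual word counts, pairs of rare words that always appear together (names, for instance) rise to the top, which is why a frequency filter is applied before scoring.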

In [41]:
display_pmi(all_concat)

[(('98', 'accuracy'), 14.212559713232345),
 (('jc', 'penney'), 14.212559713232345),
 (('knitted', 'staircase'), 14.212559713232345),
 (('naomi', 'campbell'), 14.212559713232345),
 (('parking', '5-10'), 14.212559713232345),
 (('pauly', 'celebs'), 14.212559713232345),
 (('alternate', 'routes'), 13.990167291895897),
 (('aron', 'pilhofer'), 13.990167291895897),
 (('charlie', 'sheen'), 13.990167291895897),
 (('follower', 'swarm'), 13.990167291895897),
 (('likeability', 'virgin'), 13.990167291895897),
 (('lynn', 'teo'), 13.990167291895897),
 (('sheen', 'goddesses'), 13.990167291895897),
 (('swarm', 'ensues'), 13.990167291895897),
 (('cameron', 'sinclair'), 13.797522213953503),
 (('elusive', "'power"), 13.797522213953503),
 (('zazzlsxsw', 'you\x89ûªll'), 13.797522213953503),
 (('staircase', 'attendance'), 13.7975222139535),
 (('launchrock', 'comp'), 13.62759721251119),
 (('participating', 'launchrock'), 13.62759721251119),
 (('poked', 'liked'), 13.62759721251119),
 (('\x89ûïcheck-in', 'offers

In [42]:
display_pmi(neutral_concat)

[(('holler', 'gram'), 13.739337609077296),
 (('98', 'accuracy'), 13.4763032032435),
 (('charlie', 'sheen'), 13.4763032032435),
 (('false', 'alarm'), 13.4763032032435),
 (('jc', 'penney'), 13.4763032032435),
 (('knitted', 'staircase'), 13.4763032032435),
 (('parking', '5-10'), 13.4763032032435),
 (('sheen', 'goddesses'), 13.4763032032435),
 (('swarm', 'ensues'), 13.4763032032435),
 (('barton', 'hollow'), 13.253910781907052),
 (('elusive', "'power"), 13.253910781907052),
 (('entered', 'automatically'), 13.253910781907052),
 (('follower', 'swarm'), 13.253910781907052),
 (('poked', 'liked'), 13.253910781907052),
 (("'80s-themed", 'costume'), 13.061265703964658),
 (('acoustic', 'solo'), 13.061265703964658),
 (('agnerd', 'confession'), 13.061265703964658),
 (('cameron', 'sinclair'), 13.061265703964658),
 (('tim', "o'reilly"), 13.061265703964658),
 (('staircase', 'attendance'), 13.061265703964654),
 (('invite', 'follower'), 13.031518360570605),
 (('5-10', 'min'), 12.891340702522346),
 (('char

In [43]:
display_pmi(positive_concat)

[(('alternate', 'routes'), 12.899621266905065),
 (('ice', 'cream'), 12.899621266905065),
 (('interrupt', 'regularly'), 12.899621266905065),
 (('league', 'extraordinary'), 12.636586861071269),
 (('lustre', 'pearl'), 12.636586861071269),
 (('exhibit', 'hall'), 12.414194439734821),
 (('regularly', 'scheduled'), 12.414194439734821),
 (('speech', 'therapy'), 12.414194439734821),
 (('haha', 'awesomely'), 11.899621266905061),
 (('south', 'southwest'), 11.899621266905061),
 (('march', '9-15'), 11.762117743155128),
 (('stream', '+others'), 11.762117743155128),
 (('awesomely', 'rad'), 11.762117743155127),
 (('maggie', 'mae'), 11.762117743155127),
 (('belinsky', '911tweets'), 11.511055978987411),
 (('mark', 'belinsky'), 11.511055978987411),
 (('150', 'million'), 11.48458376762622),
 (('macbook', 'pro'), 11.414194439734821),
 (('physical', 'worlds'), 11.395578761567474),
 (('64gig', 'wifi'), 11.221549361792427),
 (('holler', 'gram'), 11.221549361792427),
 (('fam', 'showing'), 11.134086520542084),


In [44]:
display_pmi(negative_concat)

[(('kara', 'swisher'), 9.83821908050231),
 (('barry', 'diller'), 9.516290985614948),
 (('marissa', 'mayer'), 9.516290985614948),
 (('among', 'digital'), 9.475649001117603),
 (('digital', 'delegates'), 9.475649001117603),
 (('japan', 'relief'), 9.378787461865013),
 (('fast', 'among'), 9.253256579781155),
 (('way', 'caring'), 9.13777936236122),
 (('fades', 'fast'), 9.101253486336105),
 (('heard', 'weekend'), 9.101253486336104),
 (('best', 'thing'), 8.931328484893793),
 (('caring', 'much'), 8.931328484893793),
 (('much', 'business'), 8.931328484893793),
 (('classiest', 'fascist'), 8.860939157002395),
 (('fascist', 'company'), 8.80847173710826),
 (('company', 'america'), 8.767829752610915),
 (('elegant', 'fascist'), 8.767829752610915),
 (('network', 'called'), 8.690320385389999),
 (('possibly', 'today'), 8.475649001117603),
 (('lost', 'way'), 8.459707457248584),
 (('money', 'japan'), 8.378787461865013),
 (("'ve", 'heard'), 8.346365984172635),
 (('thing', "'ve"), 8.083331578338843),
 (('des

In [45]:
# copy tweets and labels over again to prepare final preprocessed data for modeling stage
raw_tweets = tweets.copy()
raw_class_labels = class_labels.copy()

In [46]:
# split tweets and labels into training / test sets for validation purposes
X_train, X_test, y_train, y_test = train_test_split(raw_tweets, 
                                                    raw_class_labels, 
                                                    stratify=raw_class_labels,
                                                    random_state=SEED)

In [47]:
# update stopwords list to remove common words
stopwords_list += ['mention', 'link', 'rt', 'quot']
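
Tokens such as `quot` and `amp` are most likely leftovers of HTML entities (`&quot;`, `&amp;`) that the tokenizer splits apart. As an alternative to listing them as stopwords, the entities could be decoded before tokenizing. This is a minimal sketch using the standard library's `html.unescape`; the `decode_entities` helper and the sample string are hypothetical and not part of the notebook's pipeline.

```python
import html


def decode_entities(tweet):
    """Turn HTML entities such as &quot; and &amp; back into plain characters."""
    return html.unescape(tweet)


# Hypothetical tweet text containing two kinds of entity.
sample = '&quot;Apple has opened a pop-up store&quot; &amp; the line is long #sxsw'
print(decode_entities(sample))
```

Decoded quotation marks and ampersands are then removed by the punctuation entries already in `stopwords_list`.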

In [48]:
# function to process a tweet and remove stopwords
def process_tweet(tweet, stopwords_list):
    # strip apostrophes, then collapse runs of whitespace into a single space
    processed = re.sub("'", '', tweet)
    processed = re.sub(r'\s+', ' ', processed)
    tokens = nltk.word_tokenize(processed)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return ' '.join(stopwords_removed)

In [49]:
# process training and test text
X_train_processed = X_train.apply(lambda x: process_tweet(x, stopwords_list))
X_test_processed = X_test.apply(lambda x: process_tweet(x, stopwords_list))

In [50]:
# function to vectorize tweets
def tweet_vectorizer(X_train, y_train, X_test):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                 strip_accents='unicode',
                                 decode_error='replace',
                                 analyzer='word',
                                 min_df=2)
    
    # vectorize training and test sets
    vectorized_X_train = vectorizer.fit_transform(X_train)
    vectorized_X_test = vectorizer.transform(X_test)
    
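    # select top k of vectorized features (a no-op when there are fewer than k features)
    selector = SelectKBest(f_classif, k=min(20000, vectorized_X_train.shape[1]))
    vectorized_X_train = selector.fit_transform(vectorized_X_train, y_train)
    vectorized_X_test = selector.transform(vectorized_X_test)
    return vectorized_X_train, vectorized_X_test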

In [51]:
# vectorize
X_train_vect, X_test_vect = tweet_vectorizer(X_train_processed, 
                                             y_train,
                                             X_test_processed)
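
To make the vectorizer settings above more concrete, the sketch below reproduces two of its ingredients by hand on three toy documents: the `min_df=2` filter, which drops terms found in fewer than two documents, and the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, which is scikit-learn's documented default for `TfidfVectorizer`. It covers unigrams only and skips the final normalization step, so it is an illustration rather than a reimplementation; all names are hypothetical.

```python
import math
from collections import Counter

toy_docs = [
    "ipad launch sxsw",
    "ipad store sxsw",
    "google circles sxsw",
]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc.split()))

# min_df=2 keeps only terms that appear in at least two documents.
min_df = 2
kept_terms = sorted(term for term, df in doc_freq.items() if df >= min_df)

# Smoothed idf: a term found in every document gets the minimum weight of 1.0.
n_docs = len(toy_docs)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in kept_terms}

print(kept_terms)
print(idf)
```

A term like `sxsw`, which appears in nearly every tweet in this dataset, therefore receives close to the lowest possible weight without being removed outright.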

In [52]:
# show class imbalance
display(y_train.value_counts())
print('--Normalized--')
display(y_train.value_counts(normalize=True))

No emotion toward brand or product    4031
Positive emotion                      2227
Negative emotion                       427
Name: sentiment, dtype: int64

--Normalized--


No emotion toward brand or product    0.602992
Positive emotion                      0.333134
Negative emotion                      0.063874
Name: sentiment, dtype: float64

In [53]:
# oversample the minority classes in the vectorized training data with SMOTE
X_train_resampled, y_train_resampled = SMOTE(random_state=SEED).fit_resample(X_train_vect, y_train)

In [54]:
# print new class values
y_train_resampled.value_counts(normalize=True)

Negative emotion                      0.333333
Positive emotion                      0.333333
No emotion toward brand or product    0.333333
Name: sentiment, dtype: float64
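
For intuition about what the resampling step produces: SMOTE creates each synthetic minority example by picking a minority point, picking one of its nearest minority neighbours, and placing a new point somewhere on the line segment between them. The sketch below shows only that interpolation step on two hand-written 2-D points; it is a simplified illustration, not imbalanced-learn's implementation, and `interpolate` is a hypothetical helper.

```python
import random


def interpolate(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # fraction of the way from point towards neighbor, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(23)  # seeded so the sketch is repeatable

minority_point = [0.0, 1.0]
minority_neighbor = [2.0, 3.0]

synthetic = interpolate(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because synthetic points are built from existing examples, the resampling above is applied to the training split only, which leaves the test set as observed data.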

# 4. Modeling
* Is this a classification task?
* What models will be used?
* How to deal with overfitting?
* Performance metrics?

This is a classification task: the goal is to predict the sentiment of a tweet from the text within the tweet. Three primary models will be relied on for classification:
1. Random Forest
2. XGBoost
3. Linear SVM

Overfitting will be addressed through hyperparameter tuning, such as constraining the trees used in random forests / XGBoost (for example via maximum depth and minimum leaf size), in addition to tuning other parameters.

This is a multi-class classification problem with three available class labels (Neutral, Positive, or Negative). Accuracy will be the performance metric we focus on throughout this process; because the classes are imbalanced, it should be read against the majority-class baseline rather than in isolation.
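
Since accuracy can look reasonable even when a minority class is never predicted, per-class recall is a useful companion check. The sketch below computes both by hand on a made-up set of labels in which every tweet is predicted as neutral; the label lists and helper names are hypothetical and only mirror the rough 60/30/10 class split. scikit-learn's `recall_score` with `average=None` returns the same per-class values.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its tweets that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / total for label, total in totals.items()}


# Hypothetical ground truth: 6 neutral, 3 positive, 1 negative tweet.
y_true = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * len(y_true)

overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)

print(f"accuracy: {overall:.2f}")
for label, recall in recalls.items():
    print(f"recall for {label}: {recall:.2f}")
```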

In [55]:
# create function to store model results
def store_model_results(X_train, X_test, y_train, y_test, model, model_name):
    
    # generate predictions for train and test set
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    
    # generate accuracy scores
    train_accuracy = accuracy_score(y_train, train_preds)
    test_accuracy = accuracy_score(y_test, test_preds)
    
    # store results in a dataframe
    train_results = pd.DataFrame([[f'Train-{model_name}', train_accuracy]], columns=['Model', 'Accuracy'])
    test_results = pd.DataFrame([[f'Test-{model_name}', test_accuracy]], columns=['Model', 'Accuracy'])
    results = pd.concat([train_results, test_results], axis=0)
    
    return results

## Random Forest Modeling

In [56]:
# instantiate initial model
random_forest_clf = RandomForestClassifier(random_state=SEED)

# fit to resampled data
random_forest_clf.fit(X_train_resampled, y_train_resampled)

# store results
baseline_random_forest = store_model_results(X_train_resampled,
                                             X_test_vect,
                                             y_train_resampled,
                                             y_test,
                                             random_forest_clf,
                                             'baseline_rf')

In [57]:
# print results of baseline
baseline_random_forest

| | Model | Accuracy |
|---|---|---|
| 0 | Train-baseline_rf | 0.977177 |
| 0 | Test-baseline_rf | 0.678331 |


Looking at the baseline results, we can see that we are getting a testing accuracy score of ~68%.  On the rebalanced training set, a 'simple' model would be expected to get ~33% accuracy if it guessed one class the entire time.  On our imbalanced test set, a 'simple' model would be expected to get ~60% accuracy by guessing neutral every time.  Our random forest's overall accuracy is above that 60% baseline, although accuracy alone does not show how well it handles the non-neutral classes.

The delta between train (~98%) and test (~68%) results shows our model is likely overfitting to the training data.  To address this and try to improve results further, grid search will be used to tune hyperparameters and find optimal model parameters.
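
The two 'simple' baselines mentioned above can be computed from the training label counts shown earlier. This sketch hard-codes those counts and assumes the stratified test split has the same class proportions as the training split; the variable names are illustrative.

```python
# Training label counts copied from the y_train.value_counts() output.
train_counts = {
    "No emotion toward brand or product": 4031,
    "Positive emotion": 2227,
    "Negative emotion": 427,
}

# Always guessing the most common class on the original (imbalanced) distribution.
majority_baseline = max(train_counts.values()) / sum(train_counts.values())

# Always guessing a single class once the classes have been balanced.
balanced_baseline = 1 / len(train_counts)

test_accuracy = 0.678331  # baseline random forest test accuracy reported above

print(f"Majority-class baseline:     {majority_baseline:.3f}")
print(f"Balanced one-class baseline: {balanced_baseline:.3f}")
print(f"Lift over majority baseline: {test_accuracy - majority_baseline:.3f}")
```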

In [58]:
# create parameter grid
random_forest_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, None],
    'min_samples_split': [4, 5, 6],
    'min_samples_leaf': [2, 3, 4],
    'n_jobs': [-1]
}

In [89]:
# instantiate GridSearchCV object
random_forest_grid_search = GridSearchCV(random_forest_clf,
                                         random_forest_params,
                                         cv=3,
                                         return_train_score=True)

# fit to resampled data
random_forest_grid_search.fit(X_train_resampled, y_train_resampled)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=23),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, None],
                         'min_samples_leaf': [2, 3, 4],
                         'min_samples_split': [4, 5, 6], 'n_jobs': [-1]},
             return_train_score=True)

In [94]:
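# generate train and test scores
# mean_train_score: CV training accuracy averaged over every parameter combination in the grid
# mean_test_score: accuracy of the refit best estimator on the held-out test set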
mean_train_score = np.mean(random_forest_grid_search.cv_results_['mean_train_score'])
mean_test_score = random_forest_grid_search.score(X_test_vect, y_test)

# print scores and best parameters
print(f'Mean Training Score: {mean_train_score}')
print(f'Mean Testing Score: {mean_test_score}')
print("Best Parameter Combination Found During Grid Search:")
random_forest_grid_search.best_params_

Mean Training Score: 0.7343410946101199
Mean Testing Score: 0.6760879318079857
Best Parameter Combination Found During Grid Search:


{'criterion': 'entropy',
 'max_depth': None,
 'min_samples_leaf': 2,
 'min_samples_split': 6,
 'n_jobs': -1}

In [97]:
random_forest_grid_search.score(X_train_resampled, y_train_resampled)

0.9344248738939882

In [96]:
random_forest_grid_search.cv_results_['mean_train_score']

array([0.64148685, 0.64148685, 0.64140416, 0.63896471, 0.63896471,
       0.63896471, 0.63662863, 0.63662863, 0.63662863, 0.67024329,
       0.67003655, 0.66964374, 0.6637726 , 0.6637726 , 0.6637726 ,
       0.66009277, 0.66009277, 0.66009277, 0.93485931, 0.9334329 ,
       0.93277134, 0.89880555, 0.89880555, 0.89880555, 0.87199254,
       0.87199254, 0.87199254, 0.6391715 , 0.63910948, 0.63910948,
       0.63664936, 0.63664936, 0.63664936, 0.63547096, 0.63547096,
       0.63547096, 0.67129769, 0.67115298, 0.67115298, 0.66480627,
       0.66480627, 0.66480627, 0.66060961, 0.66060961, 0.66060961,
       0.93343288, 0.93246125, 0.93225448, 0.89535315, 0.89535315,
       0.89535315, 0.86682432, 0.86682432, 0.86682432])

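The averaged training score above (~0.73) is the mean over all 54 parameter combinations, including the shallow `max_depth` settings, so it understates how closely the selected model fits the training data.  The best combination itself scores ~0.93 on the resampled training set versus ~0.68 on the test set, so a substantial amount of overfitting likely remains.  Next, fit a `best_random_forest` model with these parameters and save the results so we can compare to other models we end up using.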
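
One thing to keep in mind when reading `cv_results_`: `mean_train_score` holds one value per parameter combination, so averaging the whole array mixes weak and strong settings together. The sketch below shows the distinction using a small hand-written stand-in for `cv_results_` (the numbers are hypothetical); on a fitted `GridSearchCV`, the position of the selected combination is available as `best_index_`.

```python
# Hypothetical stand-in for GridSearchCV.cv_results_ with three combinations.
cv_results = {
    "mean_train_score": [0.64, 0.67, 0.93],
    "mean_test_score": [0.60, 0.62, 0.80],
}

# Position of the combination with the best cross-validated score
# (what a fitted GridSearchCV exposes as best_index_).
scores = cv_results["mean_test_score"]
best_index = max(range(len(scores)), key=scores.__getitem__)

grid_average_train = sum(cv_results["mean_train_score"]) / len(scores)
best_train = cv_results["mean_train_score"][best_index]

print(f"Average train score over the grid:       {grid_average_train:.3f}")
print(f"Train score of the selected combination: {best_train:.3f}")
```

Comparing the selected combination's own training score with its validation or test score gives a more direct read on overfitting than the grid-wide average.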

In [91]:
# instantiate with above parameters
best_random_forest = RandomForestClassifier(criterion='entropy',
                                            max_depth=None,
                                            min_samples_leaf=2,
                                            min_samples_split=6,
                                            n_jobs=-1, 
                                            random_state=SEED)

# fit to resampled training data
best_random_forest.fit(X_train_resampled, y_train_resampled)

RandomForestClassifier(criterion='entropy', max_depth=20, min_samples_leaf=5,
                       min_samples_split=6, n_jobs=-1, random_state=12)

In [92]:
# store results
best_rf_results = store_model_results(X_train_resampled,
                                      X_test_vect,
                                      y_train_resampled,
                                      y_test,
                                      best_random_forest,
                                      'best_rf')

In [93]:
# display best random forest results
best_rf_results

| | Model | Accuracy |
|---|---|---|
| 0 | Train-best_rf | 0.742248 |
| 0 | Test-best_rf | 0.622252 |
