# Tweet Sentiment Classification (Module 4 Project - Kai Graham)

## Overview of Process - CRISP-DM
I will be following the Cross-Industry Standard Process for Data Mining (CRISP-DM), with the following iterative steps.
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

## 1. Business Understanding
I will be building a classifier to sort tweets based on sentiment (positive vs. negative vs. neutral).

[...] Further information needed about stakeholders, etc. 

## 2. Data Understanding
The dataset used within this process comes from [...], obtained from [...]

This section will focus on importing and exploring the data available to us as we begin to think about modeling and text processing.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# set random seed
np.random.seed(23)

In [3]:
# load dataset and begin exploring
df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin_1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [4]:
# more information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


Our dataset contains 9,093 rows and three columns: the tweet text, the product or company the sentiment is directed at, and the sentiment label.

In [5]:
# rename columns so they are easier to work with 
df.columns = ['text', 'product', 'sentiment']
df.head()

Unnamed: 0,text,product,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [6]:
# check for missing values
df.isna().sum()

text            1
product      5802
sentiment       0
dtype: int64

While there is only one missing value in the text column, the product column has a large number of missing values (5,802).  Start by handling the text column.

In [7]:
# display missing entry
df.loc[df['text'].isna()]

Unnamed: 0,text,product,sentiment
6,,,No emotion toward brand or product


In [8]:
# drop as text is missing
clean_df = df.dropna(subset=['text'])
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       9092 non-null   object
 1   product    3291 non-null   object
 2   sentiment  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


In [9]:
# quite a few product entries are missing - examine further (uses the original df)
missing_products = df.loc[df['product'].isna()]

In [10]:
# sentiment counts of missing product rows
missing_products['sentiment'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: sentiment, dtype: int64

Our dataset is missing just over 5,800 brand/product labels.  The majority of these rows are labeled "No emotion toward brand or product" (effectively neutral), which makes logical sense, but a few hundred carry other labels.  As our classifier is focused on classifying the sentiment of tweets and is less concerned with the products / brands within the text, we will drop the product column.

In [11]:
# drop product column
clean_df = clean_df.drop(['product'], axis=1)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       9092 non-null   object
 1   sentiment  9092 non-null   object
dtypes: object(2)
memory usage: 213.1+ KB


In [12]:
# check for additional missing values
clean_df.isna().any()

text         False
sentiment    False
dtype: bool

In [13]:
# check for duplicated entries
clean_df.duplicated().sum()

22

In [14]:
# remove duplicated values
clean_df = clean_df.drop_duplicates()
clean_df.duplicated().any()

False

The dataset no longer contains unnecessary columns, missing values, or duplicate entries.  Next, explore the dataset further prior to data preprocessing and preparation for modeling.

In [15]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9070 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       9070 non-null   object
 1   sentiment  9070 non-null   object
dtypes: object(2)
memory usage: 212.6+ KB


Our dataset now contains 9,070 labeled tweets.  Explore the breakdown of sentiment labels and start to think about whether any class imbalance will need to be handled.

In [16]:
# value counts of sentiment column
clean_df['sentiment'].value_counts()

No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: sentiment, dtype: int64

In [17]:
# examine some of the tweets labeled as "I can't tell" -- display first 7
cant_tell = clean_df.loc[clean_df['sentiment'] == "I can't tell", 'text']
for tweet in cant_tell.head(7):
    display(tweet)

'Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh'

'\x89ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow'

'Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw'

'The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw'

"Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link}"

'SYD #SXSW crew your iPhone extra juice pods have been procured.'

'Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}'

After briefly reviewing some of the tweets labeled as "I can't tell", it seems that some could fall into the neutral category, while others could fall into the negative category (for example, tweets about queue length at the stores).

For this reason, I am choosing to drop all entries labeled as "I can't tell". There are only 156 of these entries, so a large portion of the dataset is not being removed, and I feel comfortable with the amount of data remaining.

In [18]:
# remove all I can't tell from the dataset as we don't have proper labels for these
clean_df = clean_df.loc[clean_df['sentiment'] != "I can't tell"]
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8914 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       8914 non-null   object
 1   sentiment  8914 non-null   object
dtypes: object(2)
memory usage: 208.9+ KB


In [19]:
clean_df['sentiment'].value_counts()

No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
Name: sentiment, dtype: int64

While we have handled unlabeled data (by removing it), the remaining value counts show that the majority of tweets in our dataset (5,375) are labeled as "No emotion toward brand or product", or in other words, neutral. Additionally, there are 2,970 positively labeled tweets and only 569 negatively labeled tweets.

Our dataset clearly has some class imbalance.  This will be kept in mind and handled in the next section, data preparation.
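To put numbers on the imbalance, the class shares can be computed directly from the label counts printed above. This is a minimal sketch; `label_counts`, `class_shares`, and `imbalance_ratio` are illustrative names and not part of the notebook.

```python
# Label counts copied from the value_counts() output above
label_counts = {
    "No emotion toward brand or product": 5375,
    "Positive emotion": 2970,
    "Negative emotion": 569,
}

total = sum(label_counts.values())  # 8914 tweets after cleaning

# Share of each class in the cleaned dataset
class_shares = {label: count / total for label, count in label_counts.items()}

for label, share in class_shares.items():
    print(f"{label:<36} {share:.1%}")

# Ratio between the largest and smallest classes
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

Roughly 60% of tweets are neutral, 33% positive, and 6% negative, so a classifier that always predicts the neutral class would already reach about 60% accuracy on the three-class problem.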

Prior to moving on to further data preparation, tokenize clean_df and explore some corpus statistics to get a sense of which words are most common.

In [20]:
# separate tweet text and labels before building frequency distributions
data = clean_df['text']
labels = clean_df['sentiment']

In [41]:
# import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
import string
from nltk.collocations import *

In [22]:
# tokenize entire text
tokenized_tweets = list(map(nltk.word_tokenize, data))

In [23]:
# total vocabulary of our dataset
total_vocab = set()
for tweet in tokenized_tweets:
    total_vocab.update(tweet)
len(total_vocab)

13075

In [24]:
# examine first 20 tokens of first tokenized tweet
tokenized_tweets[0][:20]

['.',
 '@',
 'wesley83',
 'I',
 'have',
 'a',
 '3G',
 'iPhone',
 '.',
 'After',
 '3',
 'hrs',
 'tweeting',
 'at',
 '#',
 'RISE_Austin',
 ',',
 'it',
 'was',
 'dead']

There are 13,075 unique tokens within our dataset, prior to removing any stopwords or punctuation.  Just from the one tweet above, we can see our dataset will benefit from removing stopwords and punctuation.

In [26]:
# display frequency distribution of unprocessed dataset
tweets_concat = []
for tweet in tokenized_tweets:
    tweets_concat += tweet
    
tweet_freqdist = FreqDist(tweets_concat)
tweet_freqdist.most_common(25)

[('#', 15593),
 ('@', 7075),
 ('mention', 7005),
 ('.', 5480),
 ('SXSW', 4696),
 ('sxsw', 4432),
 ('link', 4247),
 ('}', 4234),
 ('{', 4232),
 ('the', 3855),
 ('to', 3460),
 (',', 3459),
 ('RT', 2899),
 ('at', 2808),
 (';', 2748),
 ('&', 2657),
 ('for', 2399),
 ('!', 2370),
 ('a', 2128),
 ('iPad', 2088),
 ('Google', 2077),
 (':', 2030),
 ('Apple', 1850),
 ('in', 1804),
 ('quot', 1657)]

Looking at the 25 most common tokens in our corpus, we can see that the majority of these are stopwords or punctuation.

In [38]:
# pull in stop words from english language and append punctuation
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']
words_stopped = [word for word in tweets_concat if word not in stopwords_list]

In [28]:
# remove stopwords, and process everything to lowercase
def process_tweet(tweet):
    tokens = nltk.word_tokenize(tweet)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed

In [30]:
processed_data = list(map(process_tweet, data))

In [33]:
# display frequency distribution with stopwords removed and punctuation removed
tweets_concat = []
for tweet in processed_data:
    tweets_concat += tweet
    
tweet_freqdist = FreqDist(tweets_concat)
tweet_freqdist.most_common(25)

[('sxsw', 9334),
 ('mention', 7006),
 ('link', 4249),
 ('rt', 2914),
 ('google', 2531),
 ('ipad', 2401),
 ('apple', 2265),
 ('quot', 1657),
 ('iphone', 1498),
 ('store', 1457),
 ("'s", 1216),
 ('2', 1104),
 ('new', 1074),
 ('austin', 946),
 ('amp', 827),
 ('app', 810),
 ('launch', 640),
 ('circles', 634),
 ('social', 629),
 ('android', 570),
 ('today', 565),
 ("n't", 468),
 ('ipad2', 454),
 ('network', 451),
 ('pop-up', 411)]

In [34]:
# vocab size
len(tweet_freqdist)

10419

There are more than 10,400 unique words remaining in our vocabulary after removing stop words and punctuation.

In [36]:
# produce normalized word frequency
total_word_count = sum(tweet_freqdist.values())
top_25 = tweet_freqdist.most_common(25)
print('Word\t\t\tNormalized Frequency')
for word in top_25:
    normalized_frequency = word[1] / total_word_count
    print('{} \t\t\t {:.4}'.format(word[0], normalized_frequency))

Word			Normalized Frequency
sxsw 			 0.08194
mention 			 0.06151
link 			 0.0373
rt 			 0.02558
google 			 0.02222
ipad 			 0.02108
apple 			 0.01988
quot 			 0.01455
iphone 			 0.01315
store 			 0.01279
's 			 0.01068
2 			 0.009692
new 			 0.009429
austin 			 0.008305
amp 			 0.00726
app 			 0.007111
launch 			 0.005619
circles 			 0.005566
social 			 0.005522
android 			 0.005004
today 			 0.00496
n't 			 0.004109
ipad2 			 0.003986
network 			 0.003959
pop-up 			 0.003608


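We can see that sxsw, mention, link, and rt are at the top of our frequency distribution. These are likely related to tweet mechanics (the event hashtag, anonymized @-handles, URL placeholders, and retweet markers) rather than sentiment, and can likely be added to our stoplist.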
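One way to act on this observation is to extend the stoplist with these platform-specific tokens. The sketch below is illustrative only: `TWEET_MECHANICS`, `BASE_STOPWORDS`, and `strip_tokens` are hypothetical names, the choice of tokens is an assumption based on the frequency table above, and a tiny stand-in replaces `stopwords.words('english')` so the example runs without downloading the NLTK corpus.

```python
import string

# Tokens tied to tweet mechanics or HTML escaping rather than sentiment.
# This particular set is an assumption based on the frequency table above.
TWEET_MECHANICS = {"sxsw", "mention", "link", "rt", "quot", "amp"}

# Stand-in for stopwords.words('english'); kept tiny so the sketch runs
# without downloading the NLTK stopword corpus.
BASE_STOPWORDS = {"the", "to", "at", "for", "a", "in"}

extended_stoplist = BASE_STOPWORDS | TWEET_MECHANICS | set(string.punctuation)


def strip_tokens(tokens, stoplist=extended_stoplist):
    """Lowercase tokens and drop any that appear in the stoplist."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stoplist]


example = ["RT", "@", "mention", "Apple", "store", "at", "#", "SXSW", "{", "link", "}"]
print(strip_tokens(example))  # ['apple', 'store']
```

Whether to drop `sxsw` itself is a judgment call: it appears in nearly every tweet, so it is unlikely to help separate the classes, but removing it also changes which bigrams are formed.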

In [43]:
# create bigrams to see which combinations of words are most frequent
bigram_measures = nltk.collocations.BigramAssocMeasures()
tweet_finder = BigramCollocationFinder.from_words(words_stopped)
tweet_scored = tweet_finder.score_ngrams(bigram_measures.raw_freq)
tweet_scored[:25]

[(('rt', 'mention'), 0.02479171970608117),
 (('ipad', '2'), 0.008041506816845026),
 (('sxsw', 'link'), 0.008032727879272051),
 (('link', 'sxsw'), 0.007005592183233985),
 (('sxsw', 'rt'), 0.005592183233985023),
 (('mention', 'mention'), 0.005276141481357926),
 (('mention', 'sxsw'), 0.005144457417763302),
 (('apple', 'store'), 0.005012773354168679),
 (('link', 'rt'), 0.0044333634743523335),
 (('sxsw', 'mention'), 0.004169995347163087),
 (('mention', 'google'), 0.004126100659298212),
 (('social', 'network'), 0.0038539535945359893),
 (('new', 'social'), 0.003555469717054842),
 (('mention', 'rt'), 0.0029672808996655223),
 (('via', 'mention'), 0.0028443757736438735),
 (('store', 'sxsw'), 0.002835596836070899),
 (('sxsw', 'apple'), 0.002809260023351974),
 (('network', 'called'), 0.0027478074603411494),
 (('google', 'launch'), 0.0027214706476222246),
 (('austin', 'sxsw'), 0.00267757595975735),
 (('called', 'circles'), 0.0026424602094654503),
 (('mention', 'apple'), 0.002607344459173551),
 (('i

In [45]:
# evaluate mutual information scores
tweet_pmi_finder = BigramCollocationFinder.from_words(words_stopped)
tweet_pmi_finder.apply_freq_filter(5)
tweet_pmi_scored = tweet_pmi_finder.score_ngrams(bigram_measures.pmi)
tweet_pmi_scored

[(('98', 'accuracy'), 14.212559713232345),
 (('jc', 'penney'), 14.212559713232345),
 (('knitted', 'staircase'), 14.212559713232345),
 (('naomi', 'campbell'), 14.212559713232345),
 (('parking', '5-10'), 14.212559713232345),
 (('pauly', 'celebs'), 14.212559713232345),
 (('alternate', 'routes'), 13.990167291895897),
 (('aron', 'pilhofer'), 13.990167291895897),
 (('charlie', 'sheen'), 13.990167291895897),
 (('follower', 'swarm'), 13.990167291895897),
 (('likeability', 'virgin'), 13.990167291895897),
 (('lynn', 'teo'), 13.990167291895897),
 (('sheen', 'goddesses'), 13.990167291895897),
 (('swarm', 'ensues'), 13.990167291895897),
 (('cameron', 'sinclair'), 13.797522213953503),
 (('elusive', "'power"), 13.797522213953503),
 (('zazzlsxsw', 'you\x89ûªll'), 13.797522213953503),
 (('staircase', 'attendance'), 13.7975222139535),
 (('launchrock', 'comp'), 13.62759721251119),
 (('participating', 'launchrock'), 13.62759721251119),
 (('poked', 'liked'), 13.62759721251119),
 (('\x89ûïcheck-in', 'offers
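The PMI ranking above is dominated by rare pairs such as names, which follows from how pointwise mutual information is defined. The sketch below uses the standard textbook definition with probabilities estimated from raw counts; it is assumed, not verified here, to correspond to what `bigram_measures.pmi` reports. The function name `pmi` and the toy counts are illustrative.

```python
from math import log2


def pmi(bigram_count, first_count, second_count, total_tokens):
    """Pointwise mutual information (base 2) of a bigram from raw counts.

    PMI = log2( P(x, y) / (P(x) * P(y)) )
    with each probability estimated as count / total_tokens.
    """
    p_xy = bigram_count / total_tokens
    p_x = first_count / total_tokens
    p_y = second_count / total_tokens
    return log2(p_xy / (p_x * p_y))


# Toy counts (not taken from the dataset): two words that each occur 5 times
# and always occur together, in a corpus of 80 tokens.
print(pmi(bigram_count=5, first_count=5, second_count=5, total_tokens=80))  # 4.0

# The same pair where each word also occurs 15 times on its own
print(pmi(bigram_count=5, first_count=20, second_count=20, total_tokens=80))  # 0.0
```

Words that only ever appear together score highest, which is why a frequency filter such as `apply_freq_filter(5)` is useful: it discards pairs seen fewer than five times, whose scores would otherwise be inflated by chance.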

Now that there is a solid understanding of our dataset and initial corpus statistics, we can move on to further data preparation. 

## 3. Data Preparation

In [49]:
# handle class imbalance

In [52]:
# experiment with bigrams and other feature engineering

In [53]:
# split dataset into train and test set
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(binary_clean_df, random_state=SEED)

NameError: name 'binary_clean_df' is not defined
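The `NameError` arises because neither `binary_clean_df` nor `SEED` is defined anywhere in the cells shown. The sketch below is one possible reconstruction under stated assumptions: that `binary_clean_df` is the positive/negative subset of `clean_df`, that its label column is named `emotion` (the name later cells index), and that `SEED` reuses the value 23 passed to `np.random.seed`. The binary assumption is consistent with the later output: 2,970 + 569 = 3,539 tweets, and the default 75/25 split of 3,539 rows gives 2,654 training rows. `make_binary_df` and `toy_clean_df` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 23  # assumption: reuse the value passed to np.random.seed earlier


def make_binary_df(clean_df):
    """Keep only positive/negative tweets and rename the label column.

    Assumption: later cells index an 'emotion' column, so 'sentiment'
    is renamed here to match.
    """
    keep = clean_df["sentiment"].isin(["Positive emotion", "Negative emotion"])
    return clean_df.loc[keep].rename(columns={"sentiment": "emotion"})


# Toy stand-in for clean_df so the sketch runs on its own
toy_clean_df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(8)],
    "sentiment": ["Positive emotion"] * 4
    + ["Negative emotion"] * 2
    + ["No emotion toward brand or product"] * 2,
})

binary_clean_df = make_binary_df(toy_clean_df)
train_df, test_df = train_test_split(binary_clean_df, random_state=SEED)
print(len(train_df), len(test_df))  # 4 2
```

Given the class imbalance, passing `stratify=binary_clean_df['emotion']` to `train_test_split` would keep the positive/negative ratio the same in both splits; the original cell does not do this.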

In [85]:
# split into data and target
train_data = train_df['text']
train_target = train_df['emotion']

test_data = test_df['text']
test_target = test_df['emotion']

In [72]:
# pull in stop words from english language
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

In [73]:
# create function to process a single tweet
def process_tweet(tweet):
    """
    Input: tweet of type str
    Function tokenizes tweet using function from nltk
    Lowercase every token, remove any stopwords found in stopwords_list from the tokenized tweet, 
    and return the results
    """
    tokens = nltk.word_tokenize(tweet)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed

In [74]:
# use map function to call process_tweet on our data
processed_data = list(map(process_tweet, train_data))

In [75]:
processed_data[0]

['google',
 'crisis',
 'response',
 'site',
 'w/',
 'good',
 'info',
 'japanese',
 'earthquake/tsunami',
 'link',
 'sxsw',
 'sxswi']

In [76]:
train_data.head()

7684    #Google Crisis Response has a site up w/ good ...
9063    @mention You should get the iPad 2  to save yo...
8457    It was either go to #SXSW or wait in line and ...
2040    Sweet... Apple listened to us!  A temp Apple S...
285     At #SXSW, Apple schools the marketing experts ...
Name: text, dtype: object

In [77]:
# looks like our tokenizing worked properly, as well as the removal of some stop words

In [78]:
# get total vocabulary size of our training set
total_vocab = set()
for tweet in processed_data:
    total_vocab.update(tweet)
len(total_vocab)

5376

In [79]:
# total number of unique tokens in our training set is 5,376

In [81]:
# create frequency distribution to see which words appear the most
tweets_concat = []
for tweet in processed_data:
    tweets_concat += tweet
    
tweet_freqdist = FreqDist(tweets_concat)
tweet_freqdist.most_common(200)

[('sxsw', 2758),
 ('mention', 1878),
 ('link', 989),
 ('ipad', 891),
 ('rt', 810),
 ('apple', 769),
 ('google', 656),
 ('iphone', 514),
 ('quot', 479),
 ('store', 424),
 ('2', 418),
 ("'s", 412),
 ('app', 325),
 ('new', 295),
 ('austin', 241),
 ('android', 174),
 ("n't", 172),
 ('amp', 169),
 ('ipad2', 166),
 ('launch', 135),
 ('get', 134),
 ('pop-up', 121),
 ('one', 120),
 ('time', 116),
 ('social', 113),
 ('great', 112),
 ('circles', 111),
 ('party', 107),
 ('today', 101),
 ('line', 100),
 ('like', 100),
 ('free', 100),
 ('via', 97),
 ("'m", 97),
 ('cool', 96),
 ('apps', 89),
 ('people', 87),
 ('maps', 87),
 ('day', 87),
 ('go', 83),
 ('good', 79),
 ('sxswi', 79),
 ('got', 77),
 ('love', 75),
 ('mobile', 75),
 ('network', 72),
 ('awesome', 71),
 ('opening', 70),
 ('temporary', 68),
 ("'re", 67),
 ('w/', 66),
 ('see', 66),
 ('check', 65),
 ('downtown', 64),
 ('need', 64),
 ('\x89ûï', 59),
 ('thanks', 58),
 ('first', 58),
 ('best', 58),
 ('called', 57),
 ('going', 56),
 ('popup', 55),


Given that this is a frequency distribution across both of our sentiments (positive and negative), it is likely that the most frequent words above are the least informative, as they are shared between both classes.  Knowing this, we will try to focus on words that appear frequently in one class but not the other.
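A simple way to surface class-specific words is to compare each token's relative frequency in one class against the other. The sketch below is only one of several possible approaches; `relative_freqs` and `distinctive_tokens` are hypothetical helpers, and the token lists are hand-written toy data rather than tweets from the dataset.

```python
from collections import Counter


def relative_freqs(token_lists):
    """Map each token to its share of all tokens in the given tweets."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def distinctive_tokens(class_a, class_b, top_n=3):
    """Tokens ranked by how much more frequent they are in class_a than class_b."""
    freq_a = relative_freqs(class_a)
    freq_b = relative_freqs(class_b)
    diffs = {token: freq - freq_b.get(token, 0.0) for token, freq in freq_a.items()}
    return sorted(diffs, key=diffs.get, reverse=True)[:top_n]


# Toy, hand-written token lists (not drawn from the dataset)
positive = [["sxsw", "great", "ipad"], ["sxsw", "great", "app"]]
negative = [["sxsw", "long", "line"], ["sxsw", "dead", "iphone"]]

print(distinctive_tokens(positive, negative, top_n=1))  # ['great']
print(distinctive_tokens(negative, positive, top_n=3))
```

In the toy data `sxsw` is equally common in both classes, so its difference is zero and it never ranks as distinctive, which mirrors the reasoning above.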

In [82]:
# vectorize with TF-IDF

In [84]:
# import proper libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [86]:
# instantiate vectorizer
vectorizer = TfidfVectorizer()

# vectorize train and test data
tf_idf_data_train = vectorizer.fit_transform(train_data)
tf_idf_data_test = vectorizer.transform(test_data)

In [88]:
# look at shape of our vectorized data
tf_idf_data_train.shape

(2654, 5199)

Our vectorized data contains 2,654 tweets, with 5,199 unique terms in the vocabulary.  The vast majority of these columns for any given tweet will be zero, since every tweet contains only a small subset of the total vocabulary.
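The relationship between matrix shape, non-zero entries, and sparsity is easier to see on a tiny corpus. The sketch below is illustrative: `docs` and `toy_vectorizer` are made-up names and data, and the default `TfidfVectorizer` settings are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (not from the dataset): three short documents, seven distinct terms
docs = [
    "apple store line",
    "google maps launch",
    "apple ipad launch",
]

toy_vectorizer = TfidfVectorizer()
matrix = toy_vectorizer.fit_transform(docs)  # SciPy sparse matrix

n_docs, n_terms = matrix.shape
avg_non_zero = matrix.nnz / n_docs              # stored non-zero weights per document
sparsity = 1 - matrix.nnz / (n_docs * n_terms)  # fraction of cells that are zero

print(matrix.shape)        # (3, 7)
print(avg_non_zero)        # 3.0
print(round(sparsity, 3))  # 0.571
```

Each row has one column per vocabulary term but stores a weight only for the terms that occur in that document, so sparsity grows quickly as the vocabulary grows relative to document length.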

In [89]:
# display number of non-zero columns in the vectors
non_zero_cols = tf_idf_data_train.nnz / float(tf_idf_data_train.shape[0])
print(f'Average Number of Non-Zero Elements in Vectorized Tweets: {non_zero_cols}')

percent_sparse = 1 - (non_zero_cols / float(tf_idf_data_train.shape[1]))
print(f'Percentage of columns containing 0: {percent_sparse}')

Average Number of Non-Zero Elements in Vectorized Tweets: 16.66164280331575
Percentage of columns containing 0: 0.9967952216189044


As we can see above, the average tweet contains roughly 17 non-zero columns, meaning about 99.7% of the entries in each row are zero.

## Modeling

In [90]:
# import necessary libraries
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [93]:
# instantiate initial models
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [95]:
# fit naive bayes model
nb_classifier.fit(tf_idf_data_train, train_target)
nb_train_preds = nb_classifier.predict(tf_idf_data_train)
nb_test_preds = nb_classifier.predict(tf_idf_data_test)

In [96]:
# fit random forest classifier
rf_classifier.fit(tf_idf_data_train, train_target)
rf_train_preds = rf_classifier.predict(tf_idf_data_train)
rf_test_preds = rf_classifier.predict(tf_idf_data_test)

In [99]:
# print results
nb_train_score = accuracy_score(train_target, nb_train_preds)
nb_test_score = accuracy_score(test_target, nb_test_preds)
rf_train_score = accuracy_score(train_target, rf_train_preds)
rf_test_score = accuracy_score(test_target, rf_test_preds)

In [100]:
print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.8497 		 Testing Accuracy: 0.8418

----------------------------------------------------------------------

Random Forest
Training Accuracy: 1.0 		 Testing Accuracy: 0.8655


In [101]:
# think about further lemmatizing, n-grams, etc.

## 3. Data Preparation

## 4. Modeling
This is a classification task, aimed at classifying tweets based on their sentiment.  As a result, we will iterate through a number of potential models / hyperparameters to arrive at the optimal model for our task.

The following metrics will be generated to help evaluate models:
* Accuracy: total number of correct predictions out of total observations
* Recall: number of true positives out of actual total positives.
* Precision: number of true positives out of predicted positives.
* F1 Score: harmonic mean of precision and recall. 
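The definitions above can be made concrete with a small worked example. This is a minimal sketch for the binary case, computed from scratch so that each formula is visible; `binary_metrics` is a hypothetical helper and the labels are toy values, not model output.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.700
# precision: 0.667
# recall: 0.500
# f1: 0.571
```

In practice `sklearn.metrics` provides `precision_score`, `recall_score`, and `f1_score`; for the three-class problem these need an `average` argument (for example `'macro'`) to combine per-class scores.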