# Tweet Sentiment Classification (Module 4 Project - Kai Graham)

## Overview of Process - CRISP-DM
I will be following the Cross-Industry Standard Process for Data Mining (CRISP-DM), with the following iterative steps.
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

## 1. Business Understanding
I will be building a classifier to sort tweets based on sentiment (positive vs. negative vs. neutral).

[...] Further information needed about stakeholders, etc. 

## 2. Data Understanding
The dataset used within this process comes from [...], obtained from [...]

This section will focus on importing and exploring the data available to us as we begin to think about modeling and text processing.

In [35]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [36]:
# set random seed
SEED = 23

In [37]:
# need to think about validation and train-test splits

In [38]:
# load dataset and begin exploring
df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin_1')
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [39]:
# as we can see above, we have successfully loaded the dataset
# further information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [40]:
# rename columns so they are easier to work with 
df.columns = ['text', 'product', 'emotion']
df.head()

| | text | product | emotion |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [41]:
# check for missing values
df.isna().sum()

text          1
product    5802
emotion       0
dtype: int64

In [42]:
# quite a few product entries appear to be missing - examine further
missing_products = df.loc[df['product'].isna()]
missing_products.head()

| | text | product | emotion |
|---|---|---|---|
| 5 | @teachntech00 New iPad Apps For #SpeechTherapy... | NaN | No emotion toward brand or product |
| 6 | NaN | NaN | No emotion toward brand or product |
| 16 | Holler Gram for iPad on the iTunes App Store -... | NaN | No emotion toward brand or product |
| 32 | Attn: All #SXSW frineds, @mention Register fo... | NaN | No emotion toward brand or product |
| 33 | Anyone at #sxsw want to sell their old iPad? | NaN | No emotion toward brand or product |


In [43]:
# see if there are any entries not listed as no emotion toward brand or product
missing_products['emotion'].unique()

array(['No emotion toward brand or product', 'Positive emotion',
       'Negative emotion', "I can't tell"], dtype=object)

In [44]:
missing_products['emotion'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: emotion, dtype: int64

In [45]:
# examine the one missing text entry
df.loc[df['text'].isna()]

| | text | product | emotion |
|---|---|---|---|
| 6 | NaN | NaN | No emotion toward brand or product |


In [46]:
# looks like no text, we will drop this entry
clean_df = df.dropna(subset=['text'])
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     9092 non-null   object
 1   product  3291 non-null   object
 2   emotion  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


In [47]:
# remove all I can't tell from the dataset as we don't have proper labels for these
clean_df = clean_df.loc[clean_df['emotion'] != "I can't tell"]
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8936 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     8936 non-null   object
 1   product  3282 non-null   object
 2   emotion  8936 non-null   object
dtypes: object(3)
memory usage: 279.2+ KB


In [48]:
# for the time being we will ignore the product column as we are only focused on 
# emotion of the texts - drop the product column 
clean_df = clean_df.drop(['product'], axis=1)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8936 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     8936 non-null   object
 1   emotion  8936 non-null   object
dtypes: object(2)
memory usage: 209.4+ KB


In [49]:
clean_df.head()

| | text | emotion |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Positive emotion |


In [50]:
# check if any further missing values or duplicates
clean_df.isna().any()

text       False
emotion    False
dtype: bool

In [51]:
# check duplicates
clean_df.duplicated().sum()

22

In [52]:
# remove duplicates as there are only 22 in our dataset
clean_df = clean_df.drop_duplicates()

In [53]:
# check it worked
clean_df.duplicated().any()

False

In [54]:
# no more duplicates -- good to move on to the next stage

In [55]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8914 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     8914 non-null   object
 1   emotion  8914 non-null   object
dtypes: object(2)
memory usage: 208.9+ KB


In [56]:
# we have 8914 records remaining, see how many are listed as neutral
clean_df['emotion'].value_counts()

No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
Name: emotion, dtype: int64

In [57]:
# looks like the majority are no emotion toward brand or product, but to begin
# we will focus just on building a binary nlp model
# drop entries listed as no emotion
binary_clean_df = clean_df.loc[clean_df['emotion'] != 'No emotion toward brand or product']
binary_clean_df['emotion'].value_counts()

Positive emotion    2970
Negative emotion     569
Name: emotion, dtype: int64

In [69]:
# split dataset into train and test set
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(binary_clean_df, random_state=SEED)
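
Positive tweets outnumber negative ones roughly five to one (2,970 vs. 569), so a plain random split can leave the train and test sets with slightly different class mixes. The following is a minimal sketch, assuming we want both partitions to keep the same class ratio, that passes the labels to the `stratify` parameter of `train_test_split`. The toy texts and labels are made up for illustration and are not drawn from the dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, made-up data: 80 positive and 20 negative labels (not the real tweets)
texts = [f"tweet {i}" for i in range(100)]
labels = ["Positive emotion"] * 80 + ["Negative emotion"] * 20

# stratify=labels asks the splitter to preserve the 80/20 class ratio
# in both the training and the test partition
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=23
)

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)
print(train_counts)
print(test_counts)
```

Applied to this notebook, the equivalent call would pass `stratify=binary_clean_df['emotion']`. That would change the split and therefore the scores reported further down.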

In [85]:
# split into data and target
train_data = train_df['text']
train_target = train_df['emotion']

test_data = test_df['text']
test_target = test_df['emotion']

In [71]:
# import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
import string

In [72]:
# pull in stop words from english language
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

In [73]:
# create function to process a single tweet
def process_tweet(tweet):
    """
    Input: tweet of type str
    Tokenize the tweet with nltk, lowercase every token,
    remove any token found in stopwords_list,
    and return the remaining tokens as a list
    """
    tokens = nltk.word_tokenize(tweet)
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed

In [74]:
# use map function to call process_tweet on our data
processed_data = list(map(process_tweet, train_data))

In [75]:
processed_data[0]

['google',
 'crisis',
 'response',
 'site',
 'w/',
 'good',
 'info',
 'japanese',
 'earthquake/tsunami',
 'link',
 'sxsw',
 'sxswi']

In [76]:
train_data.head()

7684    #Google Crisis Response has a site up w/ good ...
9063    @mention You should get the iPad 2  to save yo...
8457    It was either go to #SXSW or wait in line and ...
2040    Sweet... Apple listened to us!  A temp Apple S...
285     At #SXSW, Apple schools the marketing experts ...
Name: text, dtype: object

In [77]:
# looks like our tokenizing worked properly, as well as the removal of some stop words

In [78]:
# get total vocabulary size of our training set
total_vocab = set()
for tweet in processed_data:
    total_vocab.update(tweet)
len(total_vocab)

5376

In [79]:
# total number of unique tokens in our training set is 5,376

In [81]:
# create frequency distribution to see which words appear the most
tweets_concat = []
for tweet in processed_data:
    tweets_concat += tweet
    
tweet_freqdist = FreqDist(tweets_concat)
tweet_freqdist.most_common(200)

[('sxsw', 2758),
 ('mention', 1878),
 ('link', 989),
 ('ipad', 891),
 ('rt', 810),
 ('apple', 769),
 ('google', 656),
 ('iphone', 514),
 ('quot', 479),
 ('store', 424),
 ('2', 418),
 ("'s", 412),
 ('app', 325),
 ('new', 295),
 ('austin', 241),
 ('android', 174),
 ("n't", 172),
 ('amp', 169),
 ('ipad2', 166),
 ('launch', 135),
 ('get', 134),
 ('pop-up', 121),
 ('one', 120),
 ('time', 116),
 ('social', 113),
 ('great', 112),
 ('circles', 111),
 ('party', 107),
 ('today', 101),
 ('line', 100),
 ('like', 100),
 ('free', 100),
 ('via', 97),
 ("'m", 97),
 ('cool', 96),
 ('apps', 89),
 ('people', 87),
 ('maps', 87),
 ('day', 87),
 ('go', 83),
 ('good', 79),
 ('sxswi', 79),
 ('got', 77),
 ('love', 75),
 ('mobile', 75),
 ('network', 72),
 ('awesome', 71),
 ('opening', 70),
 ('temporary', 68),
 ("'re", 67),
 ('w/', 66),
 ('see', 66),
 ('check', 65),
 ('downtown', 64),
 ('need', 64),
 ('\x89ûï', 59),
 ('thanks', 58),
 ('first', 58),
 ('best', 58),
 ('called', 57),
 ('going', 56),
 ('popup', 55),


Because this frequency distribution spans both sentiments (positive and negative), the most common words above are likely to be the least informative for classification, since they are shared by both classes. We will therefore focus on words that appear frequently in one class but not the other.
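
One way to look for class-specific words is to count tokens separately per label and compare the two counts. The sketch below uses only the standard library and a few hand-made token lists. The example data and the helper names `class_frequencies` and `distinctive_tokens` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Hypothetical, hand-made token lists standing in for processed tweets
processed = [
    (["ipad", "great", "sxsw"], "Positive emotion"),
    (["love", "iphone", "sxsw"], "Positive emotion"),
    (["iphone", "battery", "dead", "sxsw"], "Negative emotion"),
]


def class_frequencies(examples):
    """Count token occurrences separately for each label."""
    counts = {}
    for tokens, label in examples:
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def distinctive_tokens(counts, label, other):
    """Tokens seen under `label` but never under `other`, most frequent first."""
    only = {tok: n for tok, n in counts[label].items() if tok not in counts[other]}
    return sorted(only, key=lambda tok: (-only[tok], tok))


freqs = class_frequencies(processed)
positive_only = distinctive_tokens(freqs, "Positive emotion", "Negative emotion")
negative_only = distinctive_tokens(freqs, "Negative emotion", "Positive emotion")
print(positive_only)
print(negative_only)
```

Shared tokens such as `sxsw` and `iphone` drop out of both lists. On real data, a softer criterion such as a ratio of relative frequencies would likely work better than strict exclusion.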

In [82]:
# vectorize with TF-IDF

In [84]:
# import proper libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [86]:
# instantiate vectorizer
vectorizer = TfidfVectorizer()

# vectorize train and test data
tf_idf_data_train = vectorizer.fit_transform(train_data)
tf_idf_data_test = vectorizer.transform(test_data)

In [88]:
# look at shape of our vectorized data
tf_idf_data_train.shape

(2654, 5199)

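Our vectorized training data contains 2,654 tweets and a vocabulary of 5,199 unique terms. For any given tweet, the vast majority of these columns will be zero, since each tweet contains only a small subset of the total vocabulary. Note that the vectorizer was fit on the raw tweet text with its default tokenizer rather than on the output of `process_tweet`, which explains why this vocabulary size differs from the 5,376 tokens counted earlier.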
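
If the custom preprocessing should also drive the TF-IDF features, `TfidfVectorizer` accepts a callable through its `tokenizer` parameter. The sketch below is a hedged illustration. It uses a small regex-based tokenizer (`simple_tokenize`) and a tiny hand-picked stop word set as stand-ins for `process_tweet` and `stopwords_list`, so that it runs without downloading NLTK data. The two example tweets are invented.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in stop word set (assumption): a few common words plus punctuation
STOPWORDS = {"the", "a", "is", "to", "for"} | set(string.punctuation)


def simple_tokenize(tweet):
    """Lowercase, split into word and punctuation tokens, drop stop words."""
    tokens = re.findall(r"[#@]?\w+|[^\w\s]", tweet.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


# Invented example tweets, not taken from the dataset
tweets = ["The iPad is great!", "Waiting for the iPhone, a long line."]

# tokenizer= replaces the default tokenization; token_pattern=None makes it
# explicit that the default regex pattern is unused; lowercase=False because
# the tokenizer already lowercases
vectorizer = TfidfVectorizer(
    tokenizer=simple_tokenize, lowercase=False, token_pattern=None
)
matrix = vectorizer.fit_transform(tweets)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(matrix.shape)
```

In the notebook itself, passing `tokenizer=process_tweet` in the same way should make the vectorizer's vocabulary consistent with the token counts computed earlier.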

In [89]:
# display number of non-zero columns in the vectors
non_zero_cols = tf_idf_data_train.nnz / float(tf_idf_data_train.shape[0])
print(f'Average Number of Non-Zero Elements in Vectorized Tweets: {non_zero_cols}')

percent_sparse = 1 - (non_zero_cols / float(tf_idf_data_train.shape[1]))
print(f'Percentage of columns containing 0: {percent_sparse}')

Average Number of Non-Zero Elements in Vectorized Tweets: 16.66164280331575
Percentage of columns containing 0: 0.9967952216189044


As shown above, the average tweet has roughly 17 non-zero columns (16.66), so about 99.7% of the entries in each vector are zero.
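
The sparsity figure follows directly from the two numbers reported above. As a small worked check, using only values copied from the notebook output:

```python
# Figures copied from the notebook output above
n_features = 5199                 # vocabulary size (columns)
avg_nonzero = 16.66164280331575   # average non-zero entries per tweet

# Share of columns that are non-zero for an average tweet, and its complement
density = avg_nonzero / n_features
sparsity = 1 - density
print(f"{sparsity:.2%} of entries are zero on average")
```

Note that the value printed by the notebook is a fraction (0.9968), not a percentage, despite the label in the print statement.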

## Modeling

In [90]:
# import necessary libraries
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [93]:
# instantiate initial models
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [95]:
# fit naive bayes model
nb_classifier.fit(tf_idf_data_train, train_target)
nb_train_preds = nb_classifier.predict(tf_idf_data_train)
nb_test_preds = nb_classifier.predict(tf_idf_data_test)

In [96]:
# fit random forest classifier
rf_classifier.fit(tf_idf_data_train, train_target)
rf_train_preds = rf_classifier.predict(tf_idf_data_train)
rf_test_preds = rf_classifier.predict(tf_idf_data_test)

In [99]:
# print results
nb_train_score = accuracy_score(train_target, nb_train_preds)
nb_test_score = accuracy_score(test_target, nb_test_preds)
rf_train_score = accuracy_score(train_target, rf_train_preds)
rf_test_score = accuracy_score(test_target, rf_test_preds)

In [100]:
print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.8497 		 Testing Accuracy: 0.8418

----------------------------------------------------------------------

Random Forest
Training Accuracy: 1.0 		 Testing Accuracy: 0.8655
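
These accuracy figures should be read against the class imbalance. A quick sketch of the majority-class baseline, using the class counts from the `value_counts()` output earlier, is shown below. It is computed on the full binary dataset, so the exact baseline on the test split may differ slightly, since the composition of that split is not shown.

```python
# Class counts copied from the value_counts() output earlier in the notebook
positive = 2970
negative = 569

# Accuracy of a model that always predicts the majority class,
# measured on the full binary dataset (the test split may differ slightly)
baseline_accuracy = max(positive, negative) / (positive + negative)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

With a baseline of about 0.84, the Naive Bayes test accuracy of 0.8418 is only marginally better than always predicting "Positive emotion", which suggests accuracy alone is not a sufficient metric here. The Random Forest's training accuracy of 1.0 against a test accuracy of 0.8655 also points to overfitting.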


In [101]:
# think about further lemmatizing, n-grams, etc.
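
As a starting point for the n-gram idea, here is a minimal pure-Python sketch of how bigrams can be built from a token list. The helper name `ngrams` is an illustrative assumption, and the tokens are the first few from the processed tweet shown earlier.

```python
def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) from a token list."""
    # zip over n shifted copies of the list; stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["google", "crisis", "response", "site"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

In practice, `TfidfVectorizer` can generate these features itself through its `ngram_range` parameter, for example `ngram_range=(1, 2)` for unigrams plus bigrams, so a hand-written helper like this is mainly useful for exploration.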

## 3. Data Preparation

## 4. Modeling
This is a classification task, aimed at classifying tweets based on their sentiment.  As a result, we will iterate through a number of potential models / hyperparameters to arrive at the optimal model for our task.

The following metrics will be generated to help evaluate the models:
* Accuracy: number of correct predictions out of total observations.
* Recall: number of true positives out of all actual positives.
* Precision: number of true positives out of all predicted positives.
* F1 Score: harmonic mean of precision and recall.
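
The definitions above can be sketched in plain Python for the binary case. The function name `binary_metrics` and the toy labels are assumptions for illustration. In the notebook itself, the equivalent functions in `sklearn.metrics` (`accuracy_score`, `recall_score`, `precision_score`, `f1_score`) would be the natural choice.

```python
def binary_metrics(y_true, y_pred, positive_label):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive_label and truth == positive_label:
            tp += 1
        elif pred == positive_label:
            fp += 1
        elif truth == positive_label:
            fn += 1
        else:
            tn += 1

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: P = positive, N = negative
y_true = ["P", "P", "P", "N", "N"]
y_pred = ["P", "P", "N", "P", "N"]
metrics = binary_metrics(y_true, y_pred, positive_label="P")
print(metrics)
```

Given the class imbalance in this dataset, it may be worth computing these metrics with "Negative emotion" as the positive label as well, since that is the class a majority-biased model is most likely to miss.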