<a href="https://colab.research.google.com/github/anushkarao5/USAirlinesSentimentAnalysis/blob/main/USAirlinesSentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Discerning Customer Sentiment Towards 6 U.S. Airlines Using Twitter Reviews

In this notebook, we will explore various techniques for deciphering the sentiment concealed within Twitter reviews. Harnessing the power of natural language processing (NLP) and machine learning, we will construct various classifiers and assess their accuracies in categorizing airline-specific reviews into positive, negative, and neutral classes. Our primary objective is to find a classifier that excels in accurately categorizing tweets across all sentiment classes.


This analysis can give airlines both a high-level understanding of customer sentiment and actionable insights that could increase their competitiveness within the airline industry. By monitoring recurring themes in each sentiment class, airlines could learn in which areas they excel and which areas they need to improve to elevate brand sentiment.

# Outline:
- Loading Data and Basic Exploratory Data Analysis
- Text preprocessing
- Text vectorization
  - Bag of Words Vectorization
  - TFIDF Vectorization
- Modeling using classical statistics models
- Switching to neural networks
  - Word2Vec word embeddings
  - GloVe word embeddings
- Neural Network Models
- Evaluating performance metrics for all classifiers

Click on the first icon in the sidebar to view the table of contents and jump around the notebook.


In [None]:
# importing libraries and packages
import nltk
nltk.download("popular")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

import seaborn as sns
from nltk.corpus import stopwords
import string
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
!pip install contractions
import contractions

from bs4 import BeautifulSoup
import re
from plotly.offline import init_notebook_mode, iplot
from sklearn.feature_extraction.text import CountVectorizer

init_notebook_mode(connected=True)


plt.style.use('ggplot')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
[?25hInstalling collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


## Loading Data and Basic EDA

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
file_path="/content/gdrive/MyDrive/Data/Tweets.csv"

In [None]:
# loading data
tweets=pd.read_csv(file_path)

In [None]:
# viewing first five rows of the data frame
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
#14,640 tweets, 14 features, 1 target variable (airline_sentiment)
tweets.shape

(14640, 15)

In [None]:
# @title
init_notebook_mode(connected=True)
%matplotlib inline

value_counts = tweets.airline_sentiment.value_counts()

value_counts_df = value_counts.reset_index()
value_counts_df.columns = ['Sentiment', 'Count']

fig=px.bar(value_counts_df, x='Sentiment', y='Count',title='Sentiment Distribution',color_discrete_sequence=['red'])
fig.show(renderer="colab")


- Sentiment distribution is imbalanced
- Majority of tweets are negative. We will have to take this into account when building our classifiers.


In [None]:
# @title
%matplotlib inline
value_counts = tweets.airline.value_counts()

value_counts_df = value_counts.reset_index()
value_counts_df.columns = ['Airline', 'Count']
value_counts_df
fig2=px.bar(value_counts_df, x='Airline', y='Count',title='Airline Distribution',color_discrete_sequence=['blue'])
fig2.show(renderer="colab")

- Distribution of tweets across airlines is imbalanced

In [None]:
# @title
color_map = {'negative': 'red', 'neutral': 'yellow', 'positive': 'blue'}

fig3 = px.histogram(tweets, x='airline', color='airline_sentiment', title='Sentiment Distribution by Airline',
              labels={'airline_sentiment': 'Sentiment'}, barmode='group',color_discrete_map=color_map)

fig3.update_traces(marker=dict(opacity=0.7))

fig3.show(renderer='colab')



 For all airlines except Virgin America, the ratio of neutral, positive, and negative tweets is imbalanced. For these airlines, negative tweets form the largest class.

In [None]:
# Creating a new column converting negative, neutral, and positive tweets to -1,0, and 1 respectively
# This will help in later model building
tweets['Sentiment']=tweets.airline_sentiment.apply(lambda x: 1 if x=='positive' else 0 if x=='neutral'else -1 if x=='negative' else None)

In [None]:
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),1
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-1
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-1


## Text Preprocessing
- What is the purpose of text preprocessing?
- Feeding in cleaner data should yield better results in our models
- We can think of this as "normalizing" our text data
- Note that different models may perform better with or without certain preprocessing steps
- For example, neural networks often perform better when words are not stemmed, as neural networks are able to learn complex patterns directly from raw text
- However, since we will begin with non-neural network machine learning models, we will thoroughly preprocess the text before inputting it into our model


### Building a preprocessing function for non NN models
- clean any html tags (e.g. break statements)
- remove handles (e.g. @catsarecool)
- remove website urls
- expand contractions (e.g. "isn't" becomes "is not")
- remove numbers and tokens that mix letters and digits
- strip punctuation
- lower case all words
- remove stop words
- stem words to their "root"




In [None]:
# removing handles
def remove_tags(text):
    regex=re.compile('(@[A-Za-z0-9]+)')
    text=re.sub(regex,'',text)
    return text
# removing html tags
def remove_html(text):
    regex=re.compile('<.*?>')
    cleantext=re.sub(regex,'',text)
    # return the cleaned string (returning `text` would leave the tags in place)
    return cleantext

In [None]:
# expanding contractions (e.g. "don't" -> "do not")
def decontracted(text):
    return(contractions.fix(text))


In [None]:
# cleaning punctuation
def remove_punc(sentence):
    for i in sentence:
        if i in string.punctuation:
            sentence=sentence.replace(i,"")

    return (sentence)

In [None]:
# getting stopwords
# stopwords are words that show up commonly in the english language but add little semantic value to text
stop_words=set(stopwords.words("english"))
print(stop_words)


{'because', 'through', 'on', 'under', "haven't", 'up', 'whom', 'him', 'once', "weren't", 'myself', 'other', 'your', 'than', 'its', 'further', 'all', 'before', 'ma', 'any', "needn't", 'such', 'himself', 'it', 'don', 'above', 'he', 'nor', 'we', 'wouldn', 'at', 'haven', 'as', 'his', 'am', "you've", 'our', 'themselves', 'mustn', 'doesn', 'in', 'own', 'against', 'too', 'between', 'y', 'm', 'has', 'about', 'a', 'out', 'theirs', 'her', 'isn', 'did', 'will', 'be', 'if', 'why', 'here', 'these', 'then', 'mightn', 'll', 'ours', "isn't", "mightn't", 'shan', "didn't", 'do', 'but', 't', 've', 'with', "couldn't", "won't", 'this', 'when', "you're", 'what', 'just', 'didn', "shan't", "that'll", 'hadn', "hasn't", "should've", 'i', 'so', 'itself', "shouldn't", 'are', 'that', 'from', 'an', 'having', 'below', 'is', 'there', 'same', 'won', 'the', 'should', 'they', 'only', 'weren', 'again', 'both', "don't", 'doing', 's', 'me', 'of', 'more', 'ain', 'during', "you'll", 'now', 'and', 'into', "wouldn't", 'wasn', 

In [None]:
# removing stop words
from nltk.tokenize import word_tokenize
def remove_stops(sentence):
    # tokenize the sentence and keep only the tokens that are not stop words
    filtered_sent=[]
    sent_tokens=word_tokenize(sentence)
    for word in sent_tokens:
        if word not in stop_words:
            filtered_sent.append(word)
    return(filtered_sent)

In [None]:
# stemming words
# stemming words reduces them down to their base form. This allows our model to group together variations of the same word.
from nltk.stem.snowball import SnowballStemmer
snow_stemmer=SnowballStemmer(language='english')

def stemmed_sent(text):
    stemmed_sent=[]
    for i in text:
        stemmed_sent.append(snow_stemmer.stem(i))
    x=' '.join(i for i in stemmed_sent)
    return(x)

In [None]:
# putting it all together
def preprocessor(text):
    # remove html tags, e.g. <br>
    text=remove_html(text)
    # remove @ handles, e.g. @catsrcool
    text=remove_tags(text)
    # remove website urls
    text=re.sub(r"http\S+","",text)
    # expand contractions, e.g. "isn't" -> "is not"
    text=decontracted(text)
    # remove numbers and any token that contains a digit
    text=re.sub(r"\S*\d\S*","",text)
    # replace every run of non-letter characters (punctuation, symbols) with a single space
    # [^A-Za-z]+ matches one or more characters that are NOT a-z or A-Z
    text=re.sub('[^A-Za-z]+',' ',text)
    # collapse repeated spaces
    text=re.sub(' +',' ',text)
    # remove punctuation (already covered by the step above; kept as a safeguard)
    text=remove_punc(text)
    # lower case everything
    text=text.lower()
    # remove stop words (returns a list of tokens)
    text=remove_stops(text)
    # stem each token and join the tokens back into one string
    text=stemmed_sent(text)

    return(text)

Let's see how our preprocessor works


In [None]:
# choosing a random tweet
import random
random.seed(42)
# randint includes both endpoints, so the upper bound must be the last valid row index
rand=random.randint(0, len(tweets)-1)
exm_tweet=tweets.text[rand]
print('Unprocessed Tweet:',exm_tweet)

print('Processed Tweet:', preprocessor(exm_tweet))


Unprocessed Tweet: @USAirways AND my rebooked flt isn't until Monday??  AND I don't get a voucher for a hotel?!  Never again, US airways.
Processed Tweet: rebook flt monday get voucher hotel never us airway


Our preprocessor has successfully removed tags, punctuation, and stopwords. It has also lower cased and stemmed all words and expanded contractions.

In [None]:
# applying preprocessor function to all tweets and saving preprocessed tweets in new column in data frame
tweets['preprocessed_tweets']=tweets.text.apply(lambda x: preprocessor(x) )
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment,preprocessed_tweets
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0,said
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),1,plus ad commerci experi tacki
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0,today must mean need take anoth trip
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-1,realli aggress blast obnoxi entertain guest fa...
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-1,realli big bad thing




## Text Vectorization


Many models cannot handle text data as it is. Therefore we must convert the words into vectors of numbers that the model can interpret. There are several ways we can do this.
For non NN models, we will use Bag of Words and TFIDF Vectorization. Later, for the NNs, we will discuss alternative word vectorization options.
- The next two subsections walk through BOW and TFIDF on a small example corpus.


### Bag of Words Vectorization
- Each unique word is represented as a feature (column), and each tweet is represented as a row
- Each cell records how often the word occurs in the tweet: 0 if the word is absent, 1 if it appears once, and so on. When no word repeats within a tweet, this looks like one-hot encoding for text
- Let's take a simple corpus with three sentences that we would like to vectorize:

In [None]:
corpus= ['cats are cool',
        'dogs are cool',
        'animals are the coolest']

We can use sklearn's CountVectorizer to produce a data frame of word counts, with the rows as sentences in the corpus and the columns as the unique words in the corpus. No word repeats within a sentence here, so every entry is 0 or 1.


In [None]:
# @title
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# fit_transform creates dummy variables from unique words in the corpus
cv_exmp=cv.fit_transform(corpus)
cv_exmp.toarray()

# these are the unique vocab words in the corpus
sorted(cv.vocabulary_.keys())
new_index_values = ['Sent 1', 'Sent 2', 'Sent3']

cv.get_feature_names_out()
df=pd.DataFrame((cv_exmp).toarray(),columns=(cv.get_feature_names_out()))
df.index=new_index_values
df

Unnamed: 0,animals,are,cats,cool,coolest,dogs,the
Sent 1,0,1,1,1,0,0,0
Sent 2,0,1,0,1,0,1,0
Sent3,1,1,0,0,1,0,1


Limitations of BOW:
- Word order is not preserved
- We do not know semantic relationships between words
- Every word is weighted by its raw count alone, so a word that shows up in nearly every tweet counts as much as a rare, distinctive one
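To make the count-versus-presence distinction concrete, here is a minimal pure-Python sketch of both encodings. The three sentences are a hypothetical mini-corpus (the last one repeats "cool" on purpose), and `count_vector` and `binary_vector` are illustrative helper names, not sklearn functions. sklearn's `CountVectorizer` stores counts by default and switches to presence/absence with `binary=True`.

```python
from collections import Counter

# Hypothetical mini-corpus; the last sentence repeats "cool" on purpose
sentences = ["cats are cool", "dogs are cool", "cool cats are cool"]

# One column per unique word, in alphabetical order
vocab = sorted({word for s in sentences for word in s.split()})

def count_vector(sentence):
    # Number of times each vocabulary word occurs in the sentence
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

def binary_vector(sentence):
    # 1 if the vocabulary word occurs at all, 0 otherwise
    present = set(sentence.split())
    return [1 if word in present else 0 for word in vocab]

print(vocab)
for s in sentences:
    print(s, "| counts:", count_vector(s), "| binary:", binary_vector(s))
```

The two encodings only differ for the third sentence, where "cool" occurs twice. Reordering the words of a sentence leaves both vectors unchanged, which is the loss of word order noted above.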

### TF-IDF Vectorization
- Term frequency inverse document frequency
- The basic idea: the more times a given term appears in a document (a particular sentence), the more important the word is to understanding the document
- At the same time, terms that appear in almost every document are likely not important in understanding a specific document
- TF-IDF factors in both of these concerns


#### Steps:


1) calculate term frequency for each word in document (sentence):
- idea behind term frequency: the more often a word appears in a document, the more important it is to understanding that document
- term frequency = (number of times a word appeared in a document)/ (number of words in the document)


2) calculate inverse document frequency:
- idea of IDF: words that appear in most documents likely do not provide much information in understanding a particular document
- Inverse document Frequency = log(number of documents (or sentences) in a corpus/ # of documents containing a particular word)

3) Multiply term frequency for each word in the sentence by its corresponding inverse document frequency. Do this for all sentences.
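The three steps can also be computed programmatically. The sketch below is a plain-Python illustration under the definitions given in the steps (term count divided by the length of that document, and a base-10 log for IDF); the helper names `term_frequency` and `inverse_document_frequency` are made up for this example.

```python
import math

# Same toy corpus as the Bag of Words example
corpus = ["cats are cool", "dogs are cool", "animals are the coolest"]
docs = [sentence.split() for sentence in corpus]
n_docs = len(docs)

def term_frequency(word, doc):
    # Step 1: occurrences of the word divided by the length of this document
    return doc.count(word) / len(doc)

def inverse_document_frequency(word):
    # Step 2: log10(total documents / documents containing the word)
    containing = sum(1 for doc in docs if word in doc)
    return math.log10(n_docs / containing)

vocab = sorted({word for doc in docs for word in doc})

# Step 3: one {word: TF * IDF} dictionary per document
tfidf = [
    {word: term_frequency(word, doc) * inverse_document_frequency(word) for word in vocab}
    for doc in docs
]

for sentence, scores in zip(corpus, tfidf):
    nonzero = {w: round(s, 4) for w, s in scores.items() if s > 0}
    print(sentence, "->", nonzero)
```

Library implementations use variations of this formula. sklearn's `TfidfVectorizer`, for instance, by default smooths the IDF, uses the natural log, and L2-normalizes each row, so its values will not match this hand formula exactly.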


For our example, we will use the same corpus as before


In [None]:
corpus

['cats are cool', 'dogs are cool', 'animals are the coolest']

- Term Frequency: the number of times a specific term appears in a document/ the number of terms in the document


In [None]:
# @title
# Step 1: calculate term frequency
# each count is divided by the length of its own sentence:
# sentences 1 and 2 have 3 words, sentence 3 has 4 words
unique_words=['are','coolest','cool','cats','dogs','the','animals']
TF= pd.DataFrame({
          'word':unique_words,
           'TF Sent 1': ['1/3','0','1/3','1/3','0','0','0'],
            'TF Sent 2':['1/3','0','1/3','0','1/3','0','0'],
            'TF Sent 3':['1/4','1/4','0','0','0','1/4','1/4']
        })

TF


Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3
0,are,1/3,1/3,1/4
1,coolest,0,0,1/4
2,cool,1/3,1/3,0
3,cats,1/3,0,0
4,dogs,0,1/3,0
5,the,0,0,1/4
6,animals,0,0,1/4


Inverse Document Frequency represents the importance of the term in the whole corpus:
- To calculate we take log (# of documents/ # of documents the word occurred in)


In [None]:
# @title
# Step 2: calculate inverse document frequency
TF_IDF=TF.copy()
TF_IDF["IDF"]=['log(3/3)','log(3/1)','log(3/2)','log(3/1)','log(3/1)','log(3/1)','log(3/1)']
TF_IDF


Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3,IDF
0,are,1/3,1/3,1/4,log(3/3)
1,coolest,0,0,1/4,log(3/1)
2,cool,1/3,1/3,0,log(3/2)
3,cats,1/3,0,0,log(3/1)
4,dogs,0,1/3,0,log(3/1)
5,the,0,0,1/4,log(3/1)
6,animals,0,0,1/4,log(3/1)


For example, to calculate the IDF of the word "cats", we take the log of the number of total documents (3) divided by the number of documents containing the word cats (1).
- now we multiply the term frequency of each word in each of the sentences by the respective IDF

In [None]:
# @title
# Step 3: multiply each term frequency by the corresponding IDF (log base 10)
import math
TF_IDF['TFIDF1']=[1/3*(math.log(1,10)),0,1/3*(math.log((3/2),10)),1/3*(math.log((3/1),10)),0,0,0]
TF_IDF['TFIDF2']=[1/3*(math.log(1,10)),0,1/3*(math.log((3/2),10)),0,1/3*(math.log((3/1),10)),0,0]
TF_IDF['TFIDF3']=[1/4*(math.log(1,10)),1/4*(math.log(3,10)),0,0,0,1/4*(math.log(3,10)),1/4*(math.log(3,10))]
TF_IDF

Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3,IDF,TFIDF1,TFIDF2,TFIDF3
0,are,1/3,1/3,1/4,log(3/3),0.0,0.0,0.0
1,coolest,0,0,1/4,log(3/1),0.0,0.0,0.11928
2,cool,1/3,1/3,0,log(3/2),0.058697,0.058697,0.0
3,cats,1/3,0,0,log(3/1),0.15904,0.0,0.0
4,dogs,0,1/3,0,log(3/1),0.0,0.15904,0.0
5,the,0,0,1/4,log(3/1),0.0,0.0,0.11928
6,animals,0,0,1/4,log(3/1),0.0,0.0,0.11928


Cleaning up to show only TFIDF scores

In [None]:
# @title
TF_IDF_clean=TF_IDF.copy()
columns=TF_IDF['word']
TF_IDF_clean=TF_IDF_clean[['TFIDF1','TFIDF2','TFIDF3']]
TF_IDF_clean

Unnamed: 0,TFIDF1,TFIDF2,TFIDF3
0,0.0,0.0,0.0
1,0.0,0.0,0.11928
2,0.058697,0.058697,0.0
3,0.15904,0.0,0.0
4,0.0,0.15904,0.0
5,0.0,0.0,0.11928
6,0.0,0.0,0.11928


In [None]:
# @title
# transpose so that each row is a sentence and each column is a word
TF_IDF_clean=TF_IDF_clean.T
TF_IDF_clean.columns=columns
TF_IDF_clean

word,are,coolest,cool,cats,dogs,the,animals
TFIDF1,0.0,0.0,0.058697,0.15904,0.0,0.0,0.0
TFIDF2,0.0,0.0,0.058697,0.0,0.15904,0.0,0.0
TFIDF3,0.0,0.11928,0.0,0.0,0.0,0.11928,0.11928


This is how we would represent each sentence in the corpus using TFIDF scores. As we can see, words that are important to a specific document have a higher score: example "cats" for document 1, "dogs" for document 2, and "coolest" for document 3. Words that appear in all documents like "are" have the lowest scores.


While this can be an improvement from the Bag of words vectorizer (TFIDF tells us more than simply whether a word is present), the TFIDF vectorizer still fails to suggest the relationships between words. Still, both Bag of Words and TFIDF work fairly well with many models.


## Splitting Data into training and testing and vectorizing data

Before we vectorize all of our tweets, we split our data into training and testing sets. Splitting first keeps the test data out of the training process: the vectorizer's vocabulary and weights are learned from the training tweets only, so no test data influences any part of training.
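The fit-on-train, transform-on-test idea can be sketched in plain Python. The tweets below are hypothetical, already-preprocessed examples, and `transform` is an illustrative helper rather than the sklearn method of the same name.

```python
# Hypothetical, already-preprocessed tweets
train_tweets = ["flight delay", "great crew", "lost bag"]
test_tweets = ["great flight", "rude agent"]

# "Fit": learn the vocabulary from the training tweets only
vocab = sorted({word for tweet in train_tweets for word in tweet.split()})
index = {word: i for i, word in enumerate(vocab)}

def transform(tweet):
    # "Transform": count known words; words never seen in training are ignored
    vector = [0] * len(vocab)
    for word in tweet.split():
        if word in index:
            vector[index[word]] += 1
    return vector

train_vectors = [transform(t) for t in train_tweets]
test_vectors = [transform(t) for t in test_tweets]

print(vocab)
print(test_vectors)
```

Words that occur only in the test tweets ("rude", "agent") get no column, so the test set cannot change the feature space the model is trained on.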


In [None]:
from sklearn.model_selection import train_test_split
X=tweets.preprocessed_tweets
y=tweets.Sentiment

# holding out 20 percent of data from the training process to evaluate the model's performance

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# applying Bag of Words models discussed earlier to our tweets
cv = CountVectorizer()
# fit_transform learns vocab from X_train and then creates numeric vectors for each tweet using CV
X_train_cv=cv.fit_transform(X_train)
# transform takes learned vocab (fitted on X_train) and applies it to the test data
X_test_cv=cv.transform(X_test)


In [None]:
print(X_train_cv.shape)
print(X_test_cv.shape)
# there are 7322 unique words

(11712, 7322)
(2928, 7322)


In [None]:
# to see the words in the learned vocabulary
cv.get_feature_names_out()


array(['aa', 'aaaand', 'aaadvantag', ..., 'zrh', 'zuke', 'zurich'],
      dtype=object)

In [None]:
# let's see what BOW looks like in the training data
# each row represents a tweet. Each cell holds the number of times the word occurs in that tweet (0 if it is absent).
pd.DataFrame((X_train_cv).toarray(),columns=(cv.get_feature_names_out())).head()

Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# seeing what BOW vectors look like in test data
pd.DataFrame((X_test_cv).toarray(),columns=((sorted(cv.vocabulary_.keys())))).head()

Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [None]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)
# 7322 unique words

(11712, 7322)
(2928, 7322)


TFIDF on training data

In [None]:
# @title
# to get the name of each word
words=vectorizer.get_feature_names_out()

tfidf_vec_train_df=pd.DataFrame(X_train_tfidf.toarray(),columns=words)
print(tfidf_vec_train_df.shape)
tfidf_vec_train_df.head()
# 11712 tweets in train data, 7322 unique words

(11712, 7322)


Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


TFIDF on test data

In [None]:
# @title
tfidf_vec_test_df=pd.DataFrame(X_test_tfidf.toarray(),columns=words)
print(tfidf_vec_test_df.shape)
tfidf_vec_test_df.head()
# 2928 tweets in test data, 7322 unique words

(2928, 7322)


Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modeling with non NN Models

We will use five models to classify our tweets:

**Multinomial Logistic Regression**
<details>
  <summary> More info </summary>

- Multinomial logistic regression uses softmax coding, an extension of the logistic function, to calculate the probability that a tweet falls into each sentiment class:

<div style="display: flex; justify-content: center; align-items: center; height: 300px;">
  <img src="https://drive.google.com/uc?id=1GrdlRXTxgsZitMkPMrDIs8BIx7cQ2zzd" width="300">
</div>

- The softmax coding function, where:
  - K= total number of classes (3 in our case)
  - x= input features of a particular tweet (the BOW or TFIDF vector)
  - Pr(Y=k|X=x): the probability that a tweet falls into class k given the tweet’s input features
  - bk0,bk1,....: coefficients associated with the kth class
    - The coefficients for each class are found by maximizing the likelihood function
- We calculate:
  - Pr(Y = positive | X = input features of a tweet)
  - Pr(Y = neutral | X = input features of a tweet )
  - Pr(Y = negative | X = input features of a tweet)
- The three probabilities should sum to 1 (probability distribution over the sentiment classes for a given tweet)
-  We assign the tweet to the class which has the highest probability
- Image source:
James, Gareth, et al. An Introduction to Statistical Learning: With Applications in R. Springer, 2021. pg 141, 4.13

</details>
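The softmax step described for multinomial logistic regression can be sketched as follows: each class gets a linear score from its own coefficients, and the exponentiated scores are normalized so that they sum to 1. The vocabulary, coefficient values, and the function name `softmax_probabilities` are hypothetical, chosen only for illustration; a fitted model would learn the coefficients from data.

```python
import math

def softmax_probabilities(x, coefficients):
    """x: feature values; coefficients: {class: (intercept, [weights])}."""
    # Linear score for each class: b_k0 + b_k1*x_1 + ... + b_kp*x_p
    scores = {
        k: b0 + sum(b * xi for b, xi in zip(bs, x))
        for k, (b0, bs) in coefficients.items()
    }
    # Subtract the largest score for numerical stability (probabilities are unchanged)
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical coefficients for a 3-word vocabulary: ["delay", "thank", "flight"]
coefficients = {
    "negative": (0.5, [1.2, -0.8, 0.1]),
    "neutral": (0.0, [-0.3, -0.2, 0.4]),
    "positive": (-0.2, [-1.0, 1.5, 0.0]),
}

tweet_vector = [0, 1, 1]  # the tweet contains "thank" and "flight"
probs = softmax_probabilities(tweet_vector, coefficients)
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

The tweet is assigned to whichever class receives the largest probability, which is also the class with the largest linear score.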

 **Multinomial Naive Bayes**
<details>
  <summary> More info </summary>

- Multinomial naive Bayes uses Bayes' theorem to calculate the probability that a tweet falls into each class. It then assigns the tweet to the class with the highest probability.

<div style="display: flex; justify-content: center; align-items: center; height: 300px;">
  <img src="https://drive.google.com/uc?id=1NMI2sjm--MFseMOGNMTMurxghqOIRLAY" width="300">
</div>

- From Bayes theorem, where:
  - Pr (Y=k | X=x): probability that the tweet falls into a certain class k given the input features
  - πk: prior probability of a tweet falling into class k
  - f_k(x): Pr(X | Y=k): The likelihood of observing feature vector x given class k
- **In MNB, we make the assumption that within a certain class, the p predictors are independent (the words within a certain class are independent)**
- This makes calculating f_k(x) much easier. Instead of computing the joint probability of all the words in a tweet given a class, we take the product of the individual probabilities of observing a word given the class:
  - f_k(x)= Pr ( word 1 | class k ) * Pr ( word 2 | class k ) *  Pr ( word 3 | class k )
    - We calculate this value separately for all classes
- We substitute these values into the Bayes theorem formula, calculating
  - Pr ( Y = negative | X= tweet)
  - Pr ( Y= neutral | X= tweet)
  - Pr ( Y= positive | X = tweet)
- We assign the tweet to the class with the highest probability

- Image source:
James, Gareth, et al. An Introduction to Statistical Learning: With Applications in R. Springer, 2021. pg 142, 4.15

</details>
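A minimal sketch of the naive Bayes calculation: multiply each class prior by the per-word likelihoods, then normalize so the three posteriors sum to 1. All priors and likelihoods below are hypothetical numbers picked for illustration, and `posterior` is an illustrative helper name.

```python
# Hypothetical class priors and per-word likelihoods Pr(word | class)
priors = {"negative": 0.6, "neutral": 0.2, "positive": 0.2}
likelihoods = {
    "negative": {"flight": 0.05, "late": 0.04, "thanks": 0.005},
    "neutral": {"flight": 0.06, "late": 0.01, "thanks": 0.01},
    "positive": {"flight": 0.05, "late": 0.002, "thanks": 0.06},
}

def posterior(words):
    # Numerator of Bayes' theorem: prior * product of the word likelihoods
    unnormalized = {}
    for k in priors:
        score = priors[k]
        for word in words:
            score *= likelihoods[k][word]
        unnormalized[k] = score
    # Dividing by the sum over classes gives probabilities that sum to 1
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}

probs = posterior(["flight", "late"])
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Practical implementations usually work with log probabilities and smooth the word likelihoods so that unseen words do not zero out the product; sklearn's `MultinomialNB` exposes the smoothing strength as `alpha`.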

 **Support Vector Classifier**
<details>
  <summary> More info </summary>

- Support Vector Classifier works by transforming vectorized tweets into points in a higher-dimensional space
- SVC aims to find a decision boundary in this space that separates the vectorized tweets into their respective sentiment classes
- The SVC decision boundary (a hyperplane) maximizes the margin, or the distance between the decision boundary and the support vectors
- Support vectors are the points closest to the decision boundary that play an important role in deciding where the decision boundary lies
- When making new predictions, we transform the vectorized tweet into the high dimensional space and make the classification based on where the point lies relative to the hyperplane

</details>

 **Random Forest Classifier**
<details>
  <summary> More info </summary>

- The Random Forest Classifier works by combining the predictions of multiple decision trees to make a final prediction
- Each decision tree is built from bootstrapped samples (random sampling with replacement of the tweets from the original data set) using a random subset of features (words)

<div style="display: flex; justify-content: center; align-items: center; height: 300px;">
  <img src="https://drive.google.com/uc?id=1azTNX_LapR3NhFzeJv3pBdNg1YTsW_yr" width="300">
</div>


- Example: suppose this decision tree is created from one bootstrapped sample of our data and a handful of random features ("okay", "happy", "angry")
- Each tree makes a prediction on where the observation goes
    - Suppose the observation is “It was okay”
    - We simply fall down the decision tree for this bootstrapped sample and land on Neutral
- We do this for all the decision trees
- We then “vote” on which class our tweet falls in by choosing the class that most trees voted for
- This example is an extremely simplified version using bag of words vectorization.
- In the real model, the tweets have already been vectorized, and there are many more splitting nodes. This example still provides decent intuition.

</details>
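The voting step of a random forest can be sketched with three hand-written stand-ins for trained trees. The functions `tree_a`, `tree_b`, and `tree_c` are hypothetical; real trees would be learned from bootstrapped samples and random feature subsets rather than written by hand.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees
def tree_a(words):
    if "okay" in words:
        return "neutral"
    return "positive" if "happy" in words else "negative"

def tree_b(words):
    if "angry" in words:
        return "negative"
    return "positive" if "happy" in words else "neutral"

def tree_c(words):
    return "negative" if "angry" in words else "positive"

forest = [tree_a, tree_b, tree_c]

def forest_predict(tweet):
    # Every tree votes; the class with the most votes wins
    words = tweet.lower().split()
    votes = Counter(tree(words) for tree in forest)
    return votes.most_common(1)[0][0], dict(votes)

label, votes = forest_predict("It was okay")
print(label, votes)
```

For "It was okay", two of the three stand-in trees vote neutral, so the forest predicts neutral even though one tree disagrees.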


 **XGB Classifier**

<details>

  <summary> More info </summary>

- XGBoost classification is another type of ensemble, tree-based model that combines multiple decision trees to create a stronger classifier
- XGBoost makes an initial prediction for each tweet, and then builds decision trees iteratively, with each new tree focused on correcting the errors of the trees before it
- The idea is that each added tree should lower the ensemble's remaining error.
- XGBoost combines predictions from  all trees to make a final prediction on the sentiment of a tweet.



</details>






In [None]:
# importing models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score

### Addressing class imbalances
Before we begin modeling and choosing our metrics, we must address how imbalanced our data is.

In [None]:
fig.show(renderer="colab")


- Each of the five models has different parameters to adjust class weights.
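- For logistic regression, random forest, and the support vector classifier, we can simply pass in class_weight='balanced'. This automatically adjusts the weights assigned to different classes during training to account for the class imbalances. Classes that dominate the data will be assigned lesser weights, and classes that make up the minority will be assigned larger weights. Multinomial naive bayes (MNB) does not have this parameter, and neither does XGBoost's XGBClassifier, which instead accepts per-sample weights through the sample_weight argument of fit.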
- Adjusting class weights modifies the loss function during training. More weight is given to misclassification of the minority classes, which encourages the model to improve performance in minority classes.



In [None]:
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight(class_weight = "balanced", classes= [-1,0,1], y= y_train)
class_weights

array([0.53560159, 1.54982136, 2.05042017])
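
As a rough check on where these weights come from, the sketch below recomputes them by hand. It assumes the "balanced" heuristic documented by scikit-learn, weight = n_samples / (n_classes * n_samples_in_class), which is the same as 1 / (n_classes * class proportion). The class proportions are the ones computed from `y_train` further down in this notebook; the variable names are illustrative.

```python
# Proportion of each class in the training data (values from this notebook)
class_priors = {
    -1: 0.6223531420765027,   # negative
    0: 0.2150785519125683,    # neutral
    1: 0.16256830601092895,   # positive
}

n_classes = len(class_priors)

# balanced weight = n_samples / (n_classes * n_samples_in_class)
#                 = 1 / (n_classes * class proportion)
balanced_weights = {
    label: 1 / (n_classes * prior) for label, prior in class_priors.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 8))
```

The rarer a class is, the larger its weight, so a mistake on a positive tweet costs roughly four times as much as a mistake on a negative tweet.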

- The negative class is given the lowest weight as it is highly represented in the data, and the positive class is given the greatest weight to account for the sparsity.
- For multinomial naive bayes, which does not have this parameter, we simply have to resort to the default class priors. Class priors represent the prior probability of a tweet falling into the negative, neutral, or positive class. These values are calculated from our training data. Since our negative class has many more observations than the others, the imbalanced class priors can bias our MNB model toward the majority class.
- Ideally, we would adjust class priors to account for class imbalances. However, there is no principled way to choose the adjusted values. Hand-picked class priors often lead to overfitting or distortion of the data. Therefore, we will simply allow the model to calculate the class priors from our data.


In [None]:
set(y_train)
class_counts = [len(y_train[y_train == class_label]) for class_label in set(y_train)]
class_counts
total_samples = len(y_train)
total_samples
class_priors = [class_count / total_samples for class_count in class_counts]

class_priors
# 0.2150785519125683 is associated with the first class label encountered in the training set (0, neutral)
# 0.16256830601092895 is associated with the second class label encountered in the training set (1, positive)
# 0.6223531420765027 is associated with the third class label encountered in the training set (-1, negative)

[0.2150785519125683, 0.16256830601092895, 0.6223531420765027]

We don't need to pass in class priors to the MNB model because the model automatically calculates these values from the training data.



### Evaluation Metrics

**Accuracy**
- The number of correct predictions divided by the total number of predictions. While accuracy is a popular evaluation metric, it is not the only metric that we should consider in a classification problem with highly imbalanced classes. Accuracy alone can be misleading in evaluating how well our classifier recognizes observations that fall into minority classes.
  - A simple example. Suppose we have a group of observations where 90% of the observations fall into the null class and 10% fall into the alternative class. If we have a classifier that predicts that every observation falls into the null class, that classifier then has 90% accuracy. However, this is still a poor classifier. It cannot recognize observations that fall into the minority class.
- Looking at precision, recall, and F1 scores can give us a better understanding of a classifier's performance.
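
The 90%/10% example above can be checked in a few lines. This is a toy sketch with made-up labels (0 for the majority class, 1 for the minority class), not data from this notebook.

```python
# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) observations
y_true = [0] * 90 + [1] * 10
# A "classifier" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class: share of true minority observations we detected
minority_total = sum(t == 1 for t in y_true)
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(accuracy, minority_recall)  # 0.9 0.0
```

The accuracy looks respectable, yet the recall of 0 shows that not a single minority observation was detected.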

**Precision**
- how often our model is correct when it predicts that an observation falls into a certain class
- If a classifier has high precision, it means that the model is likely correct when it makes a prediction that an observation falls into a certain class.

**Recall**
- how good our model is at detecting the true positives of a class
- If a classifier has high recall, it is able to detect the true positive of a certain class well.

**This is perhaps our most important metric, as we are interested in seeing how well our model can detect the true positives of the minority classes**

**F1 score**
- the harmonic mean of precision and recall, a way for us to combine both of the above metrics into one metric. A higher F1 score indicates a better classifier.
- We will take both accuracy and f1 scores with a grain of salt; both these scores can be heavily impacted by the majority class, which we are not interested in.

#### Click "9 cells hidden" for an example of how these scores are calculated

To gain more intuition, let's look at the evaluation metrics for one specific model. This model uses count vectorization and multinomial logistic regression.


In [None]:
pipe_lr_cv=Pipeline([
            ('cv',CountVectorizer()),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=2000,solver='lbfgs'))])

In [None]:
from sklearn.model_selection import cross_val_score

# fitting the model
pipe_lr_cv.fit(X_train,y_train)

# cross val scores
scores=cross_val_score(pipe_lr_cv,X_train,y_train,cv=5)
print(scores)
print(scores.mean())

# seeing how well our model performs on previously unseen data
y_pred=pipe_lr_cv.predict(X_test)
accuracy_score(y_test,y_pred)



[0.73452838 0.75202732 0.74380871 0.75192143 0.73441503]
0.7433401745774703


0.7653688524590164

We now look at precision, recall, and F1 scores. To gain more intuition as to how these metrics are calculated, we print a confusion matrix. Confusion matrices allow us to compare the predicted values for each class against the actual values of each class.


In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np
class_names_actual=['Negative Actual','Neutral Actual','Positive Actual']
class_names_predicted=['Negative Pred','Neutral Pred','Positive Pred']
class_names=['Negative','Neutral', 'Positive']
# rows are the actual classes, columns are the predicted classes
cm=confusion_matrix(y_test,y_pred,labels=[-1,0,1])

In [None]:
cm_df=pd.DataFrame(cm,index=class_names_actual,columns=class_names_predicted)
cm_df


| | Negative Pred | Neutral Pred | Positive Pred |
|---|---|---|---|
| Negative Actual | 1489 | 298 | 102 |
| Neutral Actual | 110 | 401 | 69 |
| Positive Actual | 49 | 59 | 351 |


In [None]:
# We calculate the precision, recall, and F1 score for the negative class by hand.
# Note: the hard-coded counts below differ slightly from the confusion matrix displayed above (e.g. 1488 vs. 1489).
# Precision = true positives / predicted positives.
# For the negative class, the true positives (tweets predicted negative that were actually negative) number 1488,
# and the predicted positives (all tweets predicted negative, regardless of the actual class) number 1488+111+47.
precision = 1488/(1488+111+47)
print('Precision:', precision)
# Recall = true positives / actual positives: how many of the actual negatives we detected.
recall = 1488/(1488+301+100)
print('Recall:', recall)
# The F1 score represents both precision and recall in one metric (their harmonic mean):
F1 = (2*precision*recall)/(precision+recall)
print('F1 Score:', F1)
# The same calculation applies to each individual class

Precision: 0.9040097205346294
Recall: 0.787718369507676
F1 Score: 0.8418670438472418


Interpreting these scores.
- Precision: Our classifier was correct in its predictions that a tweet belongs to the negative class 90% of the time.
- Recall: Our classifier is able to identify about 79% of the negative tweets.
- To speed up this process, we use classification reports, which calculate these scores for us.

In [None]:
# the classification report lets us look at the precision, recall, and f1- scores for every class.
from sklearn.metrics import classification_report
report=classification_report(y_test,y_pred)
print(report)


              precision    recall  f1-score   support

          -1       0.90      0.79      0.84      1889
           0       0.53      0.69      0.60       580
           1       0.67      0.76      0.72       459

    accuracy                           0.77      2928
   macro avg       0.70      0.75      0.72      2928
weighted avg       0.79      0.77      0.77      2928
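
To connect the report back to the confusion matrix shown earlier, the sketch below recomputes every per-class score, plus the accuracy and the macro and weighted average recall, from the matrix counts alone. The helper name `per_class_metrics` is illustrative and not part of scikit-learn.

```python
labels = [-1, 0, 1]
# Rows are actual classes, columns are predicted classes (counts from the matrix above)
cm = [
    [1489, 298, 102],
    [110, 401, 69],
    [49, 59, 351],
]

def per_class_metrics(cm, k):
    """Precision, recall, and F1 for the class at row/column index k."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)  # column total: everything predicted as class k
    actual = sum(cm[k])                    # row total: everything truly in class k
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, k) for k, label in enumerate(labels)}
total = sum(sum(row) for row in cm)

accuracy = sum(cm[k][k] for k in range(len(labels))) / total
# Macro average: every class counts equally
macro_recall = sum(m[1] for m in metrics.values()) / len(metrics)
# Weighted average: every class counts in proportion to its support
weighted_recall = sum(metrics[label][1] * sum(cm[k]) for k, label in enumerate(labels)) / total

for label, (p, r, f) in metrics.items():
    print(label, round(p, 2), round(r, 2), round(f, 2))
print(round(accuracy, 2), round(macro_recall, 2), round(weighted_recall, 2))
```

The weighted average recall is always equal to the accuracy, which is why it is dominated by the majority class; the macro average treats the three classes equally and is therefore more informative about the minority classes.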



### Evaluation Metrics using "default" parameters


We will begin evaluating our models. We use a simple pipeline for reproducibility. The first step in each pipeline is the count vectorizer or the TF-IDF vectorizer, and the second is the classifier we are trying: logistic regression, multinomial naive Bayes, random forest, support vector classifier, or XGBoost.

In [None]:
pipe_lr_cv=Pipeline([
            ('cv',CountVectorizer()),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=4000,solver='lbfgs'))])
pipe_lr_tfidf=Pipeline([('tfidf',(TfidfVectorizer())),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=4000,solver='lbfgs'))])
pipe_nb_cv=Pipeline([('cv',CountVectorizer()),
                     ('MNB',MultinomialNB())])
pipe_nb_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
                     ('MNB',MultinomialNB())])
pipe_rf_cv=Pipeline([('cv',CountVectorizer()),
            ('RF',RandomForestClassifier(random_state=42))])
pipe_rf_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('RF',RandomForestClassifier(random_state=42))])
pipe_svc_cv=Pipeline([('cv',CountVectorizer()),
            ('SVC',svm.SVC(kernel='rbf'))])
pipe_svc_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('SVC',svm.SVC(kernel='rbf'))])
pipe_xgb_cv=Pipeline([('cv',CountVectorizer()),
            ('XGB',xgb.XGBClassifier())])
pipe_xgb_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('XGB',xgb.XGBClassifier())])


Adding models to a list so we can iterate through them

In [None]:
models_default= [pipe_lr_cv,pipe_lr_tfidf,pipe_nb_cv,pipe_nb_tfidf,pipe_rf_cv,pipe_rf_tfidf,pipe_svc_cv,pipe_svc_tfidf,pipe_xgb_cv,pipe_xgb_tfidf]
model_names=['log_reg_cv','log_reg_tfidf','naive_bayes_cv','naive_bayes_tfidf','random_forest_cv','random_forest_tfidf','support_vec_class_cv','support_vec_class_tfidf','xgb_cv','xgb_tfidf']

In [None]:
# fitting the models
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
class_names = [-1, 0, 1]

for i, model in enumerate(models_default):
    print(f"Model: {model_names[i]}")
    print("-----------------------------")

    if i in (8, 9):
        # XGBoost expects class labels 0..n_classes-1, so encode -1/0/1 as 0/1/2
        le = LabelEncoder()
        y_train_xgb = le.fit_transform(y_train)
        model.fit(X_train, y_train_xgb)
        # map the encoded predictions back to the original -1/0/1 labels
        y_pred = le.inverse_transform(model.predict(X_test))
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    # printing results
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)
    print("\n")


Model: log_reg_cv
-----------------------------
Accuracy: 0.7653688524590164
Classification Report:
              precision    recall  f1-score   support

          -1       0.90      0.79      0.84      1889
           0       0.53      0.69      0.60       580
           1       0.67      0.76      0.72       459

    accuracy                           0.77      2928
   macro avg       0.70      0.75      0.72      2928
weighted avg       0.79      0.77      0.77      2928



Model: log_reg_tfidf
-----------------------------
Accuracy: 0.7599043715846995
Classification Report:
              precision    recall  f1-score   support

          -1       0.89      0.79      0.84      1889
           0       0.52      0.66      0.58       580
           1       0.67      0.75      0.71       459

    accuracy                           0.76      2928
   macro avg       0.69      0.74      0.71      2928
weighted avg       0.79      0.76      0.77      2928



Model: naive_bayes_cv
---------

We create a data frame to easily compare these values

In [None]:
# creating a data frame so that we can easily view these values
default_report=[]
for i, model in enumerate(models_default):
        if i == 8 or i == 9:
            # XGBoost predicts the encoded labels 0/1/2; map them back to -1/0/1
            mapping = {0: -1, 1: 0, 2: 1}
            y_pred = [mapping[x] for x in model.predict(X_test)]
        else:
            y_pred = model.predict(X_test)

        report=classification_report(y_test,y_pred,zero_division=1,output_dict=True)

        # Concatenating classification reports for easier comparison. Not important to understand code.
        report_df = pd.DataFrame(report).transpose()
        report_df.reset_index(inplace=True)
        report_df = report_df.rename(columns={'index': 'labels'})
        model_name = model_names[i]
        report_df['Model Name'] = model_name

        pivot_df=report_df.pivot(index='Model Name',columns='labels')
        pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]
        columns=list(pivot_df.columns)

        pivot_df.columns=columns
        columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
        final_df=pivot_df.drop(columns=columns_to_drop)
        final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
        final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
        default_report.append(final_df)


In [None]:
# @title
default_reports_df=pd.concat(default_report)
default_reports_df.reset_index(drop=False, inplace=True)


In [None]:
# @title
# melt the dataframe for Plotly
melted_df = default_reports_df.melt(id_vars=['Model Name'],
                                     value_vars=['recall (-1)', 'recall (0)', 'recall (1)'],
                                     var_name='Class', value_name='Recall')

# create a grouped bar chart
fig = px.bar(melted_df, x='Model Name', y='Recall', color='Class', barmode='group',
             title='Recall for Different Classes by Model',
             labels={'Model Name': 'Model', 'Recall': 'Recall', 'Class': 'Sentiment Class'},
             color_discrete_sequence=['red', 'yellow', 'blue'])

# show the plot
fig.show(renderer='colab')

As we can see, most of our models have high recall in the negative class. This is to be expected because our data is dominated by negative tweets. We are more interested in seeing the recall in the positive and neutral classes. To create a single score that allows us to gauge our classifier, we average the recall in the neutral and positive classes. We will do the same for precision, although that is not our main concern.


In [None]:
# average recall in minority classes
default_reports_df['average_recall_pos_nue'] = (default_reports_df['recall (0)'] + default_reports_df['recall (1)']) / 2
# average precision in minority classes
default_reports_df['average_precision_pos_nue'] = (default_reports_df['precision (0)'] + default_reports_df['precision (1)']) / 2


In [None]:
# @title
sorted_df = default_reports_df.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Different Models',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

fig.show(renderer='colab')


The average recall in the minority classes leaves much to be desired. Most of our models fall around 50% average recall in the minority classes, with the exception of logistic regression, which performs noticeably better, and naive Bayes with TF-IDF, which performs noticeably worse. Let's see if we can improve this.

### Improving recall in minority classes
- Our goal is to improve recall (the share of true positives we detect) in the neutral and positive classes, as those are the lowest scores.
- Let's see if we can improve our recall by testing out different combinations of hyperparameters. We want to find the hyperparameters that maximize recall in the neutral and positive classes.
- We start by creating parameter grids for each of our five models.
- As a base metric to evaluate our models, we will use macro average recall, which is the simple average of the recall scores across all classes.

In [None]:
# creating scorer
from sklearn.metrics import make_scorer, recall_score
macro_recall_scorer = make_scorer(recall_score, average='macro')


In [None]:
# creating param grids
lr_param_grid = [{'LR__penalty': ['l2'],
                   'LR__C': [0.01, 0.1, 1, 10],
                   'LR__solver': ['newton-cg'],
                   'LR__max_iter': [100, 1000,10000]
                 }]


nb_param_grid= [{
    'MNB__alpha': [0.1, 0.5, 1.0, 2.0],
}]



rf_param_grid = [{
    'RF__n_estimators': [50, 100,150],
    'RF__max_depth': [None, 10,50],
    'RF__min_samples_split': [2, 5,10],
    'RF__min_samples_leaf': [1, 2,4]
}]


svc_param_grid = [{'SVC__kernel': ['linear', 'rbf'],
                    'SVC__C': [1, 2, 3]}]


xgb_param_grid = [{'XGB__learning_rate': [.1,.2],
                    'XGB__max_depth': [1, 2,5,10],
                    'XGB__min_child_weight': [1,2],
                    'XGB__subsample': [1.0, 0.1],
                    'XGB__n_estimators': [50,100,150]}]

### Hyperparameter Tuning

**Multinomial Logistic Regression**
<details>
  <summary> More info </summary>

**Penalty**:
- Type of regularization (constraints on the coefficients) for the model, which is used to prevent overfitting and control model complexity
  - L2 regularization adds the squared magnitude of the coefficients associated with the features in the model as the penalty term.
  - Can shrink coefficients close to but not exactly to 0
- We do not use L1 regularization because it is not compatible with the newton-cg solver we have chosen


**C**:
- Inverse of the regularization strength; controls how closely our model is fit to the data
- A low value of C applies stronger regularization, meaning the model is kept simple at the risk of underfitting the data.
- A high value of C applies weaker regularization, allowing the model to be more complex at the risk of overfitting our data


**Solver**:
- Algorithm used for optimization
- We use newton-cg as it is compatible with multiclass logistic regression


**Max_iter**:
- Maximum number of iterations the optimizer is allowed to run while trying to converge to a solution


</details>

 **Multinomial Naive Bayes**
<details>
  <summary> More info </summary>

**alpha**
- Smoothing parameter to address zero probabilities
- Zero probabilities are problematic because they suggest that an event is impossible (ex: the probability that a certain word occurred in a negative tweet may be 0 in the training data, but may actually occur in the testing data)
- We add a small alpha value to each word count to "smooth the probabilities" and prevent them from being zero
- Smaller alpha values indicate less smoothing, so probabilities are more affected by raw counts
- Larger alphas indicate more smoothing, so probabilities will be more uniform

</details>
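
To make the effect of alpha concrete, here is a minimal sketch of additive smoothing for a single word and class, using the standard formula (word count + alpha) / (total word count + alpha * vocabulary size). The counts, the vocabulary size, and the function name `smoothed_word_prob` are hypothetical and chosen only for illustration.

```python
def smoothed_word_prob(word_count, total_count, vocab_size, alpha=1.0):
    """Smoothed estimate of P(word | class) for multinomial naive Bayes."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical class: 1000 word occurrences in negative tweets, vocabulary of 500 words
total_count, vocab_size = 1000, 500

# A word that never appeared in negative training tweets still gets a nonzero probability
for alpha in (0.1, 0.5, 1.0, 2.0):
    unseen = smoothed_word_prob(0, total_count, vocab_size, alpha)
    frequent = smoothed_word_prob(50, total_count, vocab_size, alpha)
    print(alpha, unseen, frequent)
```

As alpha grows, the unseen word's probability rises and the frequent word's probability falls, so the two estimates move toward each other; that is the sense in which larger alphas make the probabilities more uniform.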

 **Support Vector Classifier**
<details>
  <summary> More info </summary>

 **C**
- Regularization parameter; the strength of the regularization is inversely proportional to C.
- Smaller C results in more regularization, allowing a larger margin and helping to prevent overfitting.
- Larger C results in less regularization, leading to a smaller margin and closer fit to training data, potentially overfitting.

**kernel**
- Used to transform data into a higher-dimensional space.
- Options include:
  - **Linear**
    - Represents a linear relationship between features and the target variable.
  - **RBF (Radial Basis Function)**
    - Used for capturing nonlinear patterns in data.




</details>
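
For intuition about the RBF kernel, here is a minimal sketch of the similarity it computes between two vectors, exp(-gamma * ||x - y||^2). The toy vectors and the gamma value are hypothetical; in the pipelines in this notebook the SVC computes this internally on the vectorized tweets.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF similarity: 1.0 for identical vectors, approaching 0 as they move apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical word-count vectors for three tweets
tweet_a = [1, 0, 2]
tweet_b = [1, 0, 1]   # differs from tweet_a by one word count
tweet_c = [0, 3, 0]   # very different from tweet_a

print(rbf_kernel(tweet_a, tweet_a, gamma=0.5))
print(rbf_kernel(tweet_a, tweet_b, gamma=0.5))
print(rbf_kernel(tweet_a, tweet_c, gamma=0.5))
```

Because the similarity depends on distance rather than on a weighted sum of features, the resulting decision boundary can curve around groups of similar tweets, which is how the RBF kernel captures nonlinear patterns.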

 **Random Forest Classifier**
<details>
  <summary> More info </summary>

**n_estimators**
- number of trees in the ensemble
- usually more trees can provide better performance but will increase computational time


**max_depth**
- maximum depth of a single decision tree
- smaller depths result in smaller trees and prevent overfitting
- larger depths increase complexity and may result in overfitting


**min_samples_split**
- minimum samples needed to split a node during tree construction
- if the number of samples in a node falls below the value of min_samples_split, then the node is not split again. It becomes a leaf node.
- larger values prevent splitting nodes with a small number of samples, which could prevent overfitting


**min_samples_leaf**
- minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf samples in each resulting branch
- greater values can prevent fitting to noise and outliers

</details>


 **XGB Classifier**

<details>

  <summary> More info </summary>
  
**Learning_rate**
- Scales the contribution of each new tree, which controls how quickly the model learns from training data.
- Small learning rates make boosting more conservative: learning is slower and more trees are needed, but the model is less prone to overfitting.
- Large learning rates result in faster learning but make it easier to overfit the training data.

**Max_depth**
- The maximum number of levels from the root node to a leaf node in each tree.

**Min_child_weight**
- Minimum sum of sample weights required in a child node; a split that would create a child below this threshold is not made.
- Larger values make the model more conservative and reduce its complexity.

**Subsample**
- Fraction of the training data sampled to build each tree.
- Values less than 1 may prevent overfitting by having each tree learn from a different part of the data.



</details>


We will now begin hyperparameter tuning to see if we can improve recall in the minority classes. To do this, we will use GridSearchCV. GridSearchCV checks all combinations of hyperparameters and saves the combination that maximizes the scoring function (macro average recall) for a particular model type.

<details>

  <summary> More on GridSearchCV </summary>
  
- Let us consider how GridSearchCV with three folds works for the RF model type. Based on the parameter grid, we check 81 unique RF models because there are 3 possible values for each of the 4 hyperparameters (3 * 3 * 3 * 3 = 81).
- For each model, we use cross validation with three folds to estimate its macro average recall.
- We split the training data into three folds. We hold out the first fold as the validation set and build our random forest model on the remaining two folds. Then we evaluate the model's performance on the validation set using the macro recall scorer.
- We repeat this process two more times, so that each of the folds has been used as the validation set once. We average the three recall scores from the validation sets, and store the averaged value as the score for that particular RF model.
- Since we have 81 possible RF models, we repeat this process a total of 81 times, so that we have 81 scores.
- The model with the highest score is stored as the best estimator for the RF model type.



</details>
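
The count of 81 candidate models can be verified by enumerating the random forest grid directly. The sketch below only mimics the enumeration step with `itertools.product`; it does not fit any models, and the grid values are copied from `rf_param_grid` above.

```python
from itertools import product

# Same values as rf_param_grid above
rf_param_grid = {
    'RF__n_estimators': [50, 100, 150],
    'RF__max_depth': [None, 10, 50],
    'RF__min_samples_split': [2, 5, 10],
    'RF__min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per hyperparameter is one candidate model
names = list(rf_param_grid)
candidates = [dict(zip(names, values)) for values in product(*rf_param_grid.values())]

n_folds = 3
print(len(candidates))            # 81
print(len(candidates) * n_folds)  # 243
print(candidates[0])
```

With three folds, the search fits 243 models for this grid, plus one final refit of the best combination on the full training set, which is why the tuning cells below can take a while to run.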



In [None]:
# grid search objects
# estimator= pipeline
# param grid = param grid based on model
# scoring = macro average recall
# refit = 'recall' ensures that the best estimator is retrained on the full training data set
# cv = 3: 3 folds

lr_cv_grid_search = GridSearchCV(estimator=pipe_lr_cv,
        param_grid=lr_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
lr_tfidf_grid_search = GridSearchCV(estimator=pipe_lr_tfidf,
        param_grid=lr_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
nb_cv_grid_search = GridSearchCV(estimator=pipe_nb_cv,
        param_grid=nb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
nb_tfidf_grid_search = GridSearchCV(estimator=pipe_nb_tfidf,
        param_grid=nb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
rf_cv_grid_search = GridSearchCV(estimator=pipe_rf_cv,
        param_grid=rf_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
rf_tfidf_grid_search = GridSearchCV(estimator=pipe_rf_tfidf,
        param_grid=rf_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
svc_cv_grid_search = GridSearchCV(estimator=pipe_svc_cv,
        param_grid=svc_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
svc_tfidf_grid_search = GridSearchCV(estimator=pipe_svc_tfidf,
        param_grid=svc_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
xgb_cv_grid_search = GridSearchCV(estimator=pipe_xgb_cv,
        param_grid=xgb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
xgb_tfidf_grid_search = GridSearchCV(estimator=pipe_xgb_tfidf,
        param_grid=xgb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)

We fit each of the models using X_train and y_train. Then we save the best estimator from each of the 10 models in a list called best_estimators.

In [None]:
grids=[lr_cv_grid_search,lr_tfidf_grid_search,nb_cv_grid_search,nb_tfidf_grid_search,rf_cv_grid_search,rf_tfidf_grid_search,svc_cv_grid_search,svc_tfidf_grid_search,xgb_cv_grid_search,xgb_tfidf_grid_search]

In [None]:
# fitting the models and storing best estimator in a list
best_estimators = []

# XGBoost expects class labels 0..n_classes-1, so encode -1/0/1 as 0/1/2 once up front
le = LabelEncoder()
y_train_xgb = le.fit_transform(y_train)

for grid_search in grids:
    if grid_search is xgb_cv_grid_search or grid_search is xgb_tfidf_grid_search:
        grid_search.fit(X_train, y_train_xgb)
    else:
        grid_search.fit(X_train, y_train)
    best_estimators.append(grid_search.best_estimator_)


In [None]:
best_estimators

[Pipeline(steps=[('cv', CountVectorizer()),
                 ('LR',
                  LogisticRegression(C=0.1, class_weight='balanced',
                                     multi_class='multinomial',
                                     solver='newton-cg'))]),
 Pipeline(steps=[('tfidf', TfidfVectorizer()),
                 ('LR',
                  LogisticRegression(C=1, class_weight='balanced',
                                     multi_class='multinomial',
                                     solver='newton-cg'))]),
 Pipeline(steps=[('cv', CountVectorizer()), ('MNB', MultinomialNB(alpha=0.1))]),
 Pipeline(steps=[('tfidf', TfidfVectorizer()),
                 ('MNB', MultinomialNB(alpha=0.1))]),
 Pipeline(steps=[('cv', CountVectorizer()),
                 ('RF',
                  RandomForestClassifier(n_estimators=150, random_state=42))]),
 Pipeline(steps=[('tfidf', TfidfVectorizer()),
                 ('RF', RandomForestClassifier(random_state=42))]),
 Pipeline(steps=[('cv', CountV

In [None]:

grid_dict = {0: 'Logistic Regression CV', 1: 'Logistic Regression TFIDF',
             2: 'Multinomial Naive Bayes CV', 3: 'Multinomial Naive Bayes TFIDF',
             4: 'Random Forest CV',5:'Random Forest TFIDF',6:'SVC CV',7:'SVC TFIDF',
            8:'XGB CV',9:'XGB TFIDF'}

In [None]:
# simply concatenating the classification reports of our best models for easier comparison

reports=[]


for i,estimator in enumerate (best_estimators):
    if i==8 or i==9:
        y_pred_modified = estimator.predict(X_test)
        y_pred= le.inverse_transform(y_pred_modified)
    else:
        y_pred=estimator.predict(X_test)
#     print('Classification report for', grid_dict[i] )
#     print(classification_report(y_test,y_pred,zero_division=1))
    report=classification_report(y_test,y_pred,zero_division=1,output_dict=True)

    # Concatenating classification reports for easier comparison.
    report_df = pd.DataFrame(report).transpose()
    report_df.reset_index(inplace=True)
    report_df = report_df.rename(columns={'index': 'labels'})
    model_name = list(grid_dict.values())[i]
    report_df['Model Name'] = model_name

    pivot_df=report_df.pivot(index='Model Name',columns='labels')
    pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]
    columns=list(pivot_df.columns)

    pivot_df.columns=columns
    columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
    final_df=pivot_df.drop(columns=columns_to_drop)
    final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
    final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
    reports.append(final_df)


In [None]:
combined_reports_df=pd.concat(reports)
combined_reports_df.reset_index(drop=False, inplace=True)


In [None]:
combined_reports_df

Unnamed: 0,Model Name,Accuracy,precision (-1),precision (0),precision (1),precision (macro avg),precision (weighted avg),recall (-1),recall (0),recall (1),recall (macro avg),recall (weighted avg),f1-score (-1),f1-score (0),f1-score (1),f1-score (macro avg),f1-score (weighted avg),support (-1),support (0),support (1)
0,Logistic Regression CV,0.753757,0.916347,0.509615,0.656604,0.694189,0.795061,0.759661,0.731034,0.75817,0.749622,0.753757,0.83068,0.600567,0.703741,0.711663,0.765198,1889.0,580.0,459.0
1,Logistic Regression TFIDF,0.759563,0.894674,0.517473,0.670565,0.694237,0.784823,0.791424,0.663793,0.749455,0.734891,0.759563,0.839888,0.581571,0.707819,0.709759,0.768015,1889.0,580.0,459.0
2,Multinomial Naive Bayes CV,0.773224,0.827215,0.58351,0.723301,0.711342,0.76265,0.894653,0.475862,0.649237,0.673251,0.773224,0.859613,0.524217,0.684271,0.689367,0.765688,1889.0,580.0,459.0
3,Multinomial Naive Bayes TFIDF,0.757855,0.763836,0.666667,0.806818,0.745774,0.751326,0.95712,0.341379,0.464052,0.587517,0.757855,0.849624,0.451539,0.589212,0.630125,0.729946,1889.0,580.0,459.0
4,Random Forest CV,0.767418,0.814726,0.582796,0.735065,0.710862,0.756295,0.896241,0.467241,0.616558,0.660014,0.767418,0.853542,0.51866,0.670616,0.680939,0.75853,1889.0,580.0,459.0
5,Random Forest TFIDF,0.775273,0.804979,0.634328,0.753501,0.730936,0.763106,0.924299,0.439655,0.586057,0.650003,0.775273,0.860522,0.519348,0.659314,0.679728,0.761398,1889.0,580.0,459.0
6,SVC CV,0.794057,0.841066,0.62768,0.768638,0.745795,0.787443,0.902065,0.555172,0.651416,0.702884,0.794057,0.870498,0.589204,0.705189,0.72163,0.788863,1889.0,580.0,459.0
7,SVC TFIDF,0.786885,0.832113,0.613588,0.759804,0.735168,0.777491,0.902594,0.498276,0.675381,0.692084,0.786885,0.865922,0.549952,0.71511,0.710328,0.779691,1889.0,580.0,459.0
8,XGB CV,0.790642,0.826775,0.635762,0.777494,0.746677,0.781212,0.912123,0.496552,0.662309,0.690328,0.790642,0.867355,0.557599,0.715294,0.713416,0.782159,1889.0,580.0,459.0
9,XGB TFIDF,0.787568,0.833986,0.617886,0.758883,0.736919,0.779406,0.901535,0.524138,0.651416,0.692363,0.787568,0.866446,0.567164,0.701055,0.711555,0.781235,1889.0,580.0,459.0


That's a lot of data! While all the data contains valuable information, let's focus on the data that is most relevant to evaluating our models' performance.
- Recall: We are most interested in seeing how well our model did in detecting instances from the minority classes. That is, how often our model was able to detect neutral and positive tweets. These values are in the recall (0) and recall (1) columns. This is our most important metric.
- We may also want to look at precision-- how often our predictions were correct for a certain class.
- We are willing to have a lower precision, as getting a tweet classification wrong isn't as important as detecting tweets from the minority class.
- Let us first look at the precision vs. recall in all three classes separately
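
To make the recall and precision columns concrete, here is a minimal sketch that computes both per class from scratch. The labels are made up for illustration and are not taken from our dataset, and `per_class_metrics` is a hypothetical helper rather than something defined elsewhere in this notebook.

```python
def per_class_metrics(y_true, y_pred, labels=(-1, 0, 1)):
    """Return {label: (precision, recall)} computed from two label lists."""
    metrics = {}
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # everything we labelled `label`
        actual = sum(t == label for t in y_true)     # everything that really is `label`
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

# toy labels (assumed): 4 negative, 2 neutral, 2 positive tweets
y_true = [-1, -1, -1, -1, 0, 0, 1, 1]
y_pred = [-1, -1, -1, 0, 0, -1, 1, -1]

toy_metrics = per_class_metrics(y_true, y_pred)
for label, (precision, recall) in toy_metrics.items():
    print(f"class {label:>2}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy example the positive class has perfect precision but only half of the positive tweets are found, which is exactly the kind of result that a precision-only view would hide.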

In [None]:
# @title
# just creating scatter plots
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot



def create_scatter_plot(x_column, y_column, x_label, y_label, color,title):
    fig = figure(
        title=title,
        width=300,
        height=300
    )
    scatter = fig.scatter(
        x=x_column,
        y=y_column,
        size=10,
        source=source,
        color=color
    )

    hover = HoverTool()
    hover.tooltips = [
        ('Model Name', '@{Model Name}'),
        ('Recall', f'@{{{x_column}}}'),
        ('Precision', f'@{{{y_column}}}')
    ]
    fig.add_tools(hover)

    fig.title.text_font_size = '12.5pt'
    fig.xaxis.axis_label = f'Recall ({x_label} Tweets)'
    fig.yaxis.axis_label = f'Precision ({y_label} Tweets)'
    fig.xaxis.axis_label_text_font_size = '11pt'
    fig.yaxis.axis_label_text_font_size = '11pt'

    return fig

source = ColumnDataSource(data=combined_reports_df)
output_notebook()

fig = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red', 'Precision vs. Recall (Negative)')
fig2 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green', 'Precision vs. Recall (Neutral)')
# y_label fixed to 'Positive' so the y-axis reads "Precision (Positive Tweets)"
fig3 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Positive', 'blue', 'Precision vs. Recall (Positive)')

grid = gridplot([[fig, fig2, fig3]])
show(grid, notebook_handle=True)


- Interestingly, the model with the highest recall in the negative class, MNB with TFIDF, has the lowest recall in the neutral and positive classes.
- The models with the lowest recall in the negative class (logistic regression) have the highest recall in the neutral and positive classes. However, logistic regression also has the lowest precision of all the models in the neutral and negative classes, meaning that when it predicts that a tweet belongs to the negative or neutral class, it is often wrong. This is a tradeoff we will have to make. Since we value recall in the minority classes, we will have to accept lower precision in these classes.
- We will dive into potential reasons why our models performed the way they did soon.

In [None]:
# @title
# melt the DataFrame for Plotly
melted_df = combined_reports_df.melt(id_vars=['Model Name'],
                                     value_vars=['recall (-1)', 'recall (0)', 'recall (1)'],
                                     var_name='Class', value_name='Recall')

# create a grouped bar chart
fig = px.bar(melted_df, x='Model Name', y='Recall', color='Class', barmode='group',
             title='Recall for Different Classes by Model after hyperparameter tuning',
             labels={'Model Name': 'Model', 'Recall': 'Recall', 'Class': 'Sentiment Class'},
             color_discrete_sequence=['red', 'yellow', 'blue'])

# show the plot
fig.show(renderer='colab')

In [None]:
combined_reports_df['average_recall_pos_nue'] = (combined_reports_df['recall (0)'] + combined_reports_df['recall (1)']) / 2


In [None]:
# @title
sorted_df = combined_reports_df.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Minority Recall for Different Models After Tuning',
             # the label key must match the plotted column name to take effect
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

# Show the plot
fig.show(renderer='colab')


In [None]:
# @title
# comparing average recall for default and tuned models
grid_dict = {
    0: 'Logistic Regression CV', 1: 'Logistic Regression TFIDF',
    2: 'Multinomial Naive Bayes CV', 3: 'Multinomial Naive Bayes TFIDF',
    4: 'Random Forest CV', 5: 'Random Forest TFIDF',
    6: 'SVC CV', 7: 'SVC TFIDF', 8: 'XGB CV', 9: 'XGB TFIDF'
}

# create a DataFrame with the specified columns
data = []
for i, model_name in grid_dict.items():
    default_recall = default_reports_df.loc[i, 'average_recall_pos_nue']
    final_recall = combined_reports_df.loc[i, 'average_recall_pos_nue']
    data.append([model_name, default_recall, final_recall])

columns = ['Model Name', 'Default Report Avg Recall', 'Tuned Report Avg Recall']
combined_recall_df = pd.DataFrame(data, columns=columns)

(combined_recall_df)

Unnamed: 0,Model Name,Default Report Avg Recall,Tuned Report Avg Recall
0,Logistic Regression CV,0.728043,0.744602
1,Logistic Regression TFIDF,0.706624,0.706624
2,Multinomial Naive Bayes CV,0.502378,0.56255
3,Multinomial Naive Bayes TFIDF,0.209708,0.402716
4,Random Forest CV,0.545802,0.5419
5,Random Forest TFIDF,0.512856,0.512856
6,SVC CV,0.561547,0.603294
7,SVC TFIDF,0.508961,0.586829
8,XGB CV,0.565089,0.579431
9,XGB TFIDF,0.472119,0.587777


In [None]:
# @title
combined_recall_df = combined_recall_df.sort_values(by='Tuned Report Avg Recall', ascending=False)

melted_df = combined_recall_df.melt(id_vars=['Model Name'],
                                    value_vars=['Default Report Avg Recall', 'Tuned Report Avg Recall'],
                                    var_name='Report', value_name='Avg Recall')

fig = px.bar(melted_df, x='Model Name', y='Avg Recall', color='Report', barmode='group',
             title='Comparison of Average Recall for Positive/Neutral Sentiments after Tuning',
             labels={'Model Name': 'Model', 'Avg Recall': 'Average Recall', 'Report': 'Report'},
             color_discrete_sequence=['red', 'green'])

fig.update_xaxes(tickangle=-45)

fig.show(renderer='colab')


As we can see, for every model except Random Forest with CV, hyperparameter tuning resulted in increased or unchanged average recall. The most noticeable increase in average minority-class recall was for MNB with TFIDF, which nearly doubled (from about 0.21 to 0.40)!
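
The comparison above can be double-checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the values from the table above rather than reading `combined_recall_df`; the names `avg_recall`, `changes`, and `decreased` are introduced here only for illustration.

```python
# (default, tuned) average minority recall, copied from the table above
avg_recall = {
    "Logistic Regression CV": (0.728043, 0.744602),
    "Logistic Regression TFIDF": (0.706624, 0.706624),
    "Multinomial Naive Bayes CV": (0.502378, 0.56255),
    "Multinomial Naive Bayes TFIDF": (0.209708, 0.402716),
    "Random Forest CV": (0.545802, 0.5419),
    "Random Forest TFIDF": (0.512856, 0.512856),
    "SVC CV": (0.561547, 0.603294),
    "SVC TFIDF": (0.508961, 0.586829),
    "XGB CV": (0.565089, 0.579431),
    "XGB TFIDF": (0.472119, 0.587777),
}

# absolute change and tuned/default ratio for every model
changes = {
    name: (tuned - default, tuned / default)
    for name, (default, tuned) in avg_recall.items()
}

# print models from largest to smallest improvement
for name, (diff, ratio) in sorted(changes.items(), key=lambda item: item[1][0], reverse=True):
    print(f"{name:<30} change={diff:+.3f}  ratio={ratio:.2f}x")

# models whose average minority recall went down after tuning
decreased = [name for name, (diff, _) in changes.items() if diff < 0]
print("decreased:", decreased)
```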

### Looking at Macro Precision vs. Macro Recall and Weighted Precision vs. Weighted Recall

In [None]:
# @title


source = ColumnDataSource(data=combined_reports_df)
output_notebook()


fig4 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
fig4.xaxis.axis_label = 'Macro Average Recall'
fig4.yaxis.axis_label = 'Macro Average Precision'



fig5 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
fig5.xaxis.axis_label = 'Weighted Average Recall'
fig5.yaxis.axis_label = 'Weighted Average Precision'

grid = gridplot([[fig4,fig5]])
show(grid, notebook_handle=True)

-  While it is interesting to look at weighted precision vs. weighted recall, our main interest is macro precision vs. macro recall because we want to give equal importance to the precision and recall of each class.
- Logistic regression using both TFIDF and CV performs best in terms of recall but worst in terms of precision. This is a classic example of the precision-recall tradeoff. Our LR model has a higher recall, meaning it makes predictions for the positive and neutral classes more liberally. This may capture many true positives of the minority class at the expense of incorrectly predicting that tweets fall into these classes.
- Support Vector Classifier and XGBoost are the next best, and multinomial naive Bayes and random forest perform the worst, mostly due to their low recall in the neutral and positive classes (unable to detect true positives in the minority classes).
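
To see why macro and weighted averages can tell different stories, the sketch below recomputes both for Multinomial Naive Bayes with TFIDF. It assumes the three recall values and the last three columns of the report (1889, 580, 459) are the per-class recalls and supports for classes -1, 0, and 1; the variable names are illustrative.

```python
# per-class recall for Multinomial Naive Bayes TFIDF and class supports,
# copied from the report above (column meaning is an assumption)
recalls = {-1: 0.95712, 0: 0.341379, 1: 0.464052}
support = {-1: 1889, 0: 580, 1: 459}

# macro average: every class counts equally
macro_recall = sum(recalls.values()) / len(recalls)

# weighted average: every class counts in proportion to its number of tweets
weighted_recall = sum(recalls[c] * support[c] for c in recalls) / sum(support.values())

print(f"macro recall:    {macro_recall:.6f}")
print(f"weighted recall: {weighted_recall:.6f}")
```

The weighted average is pulled up by the large negative class, while the macro average exposes the weak recall in the two minority classes. This is why the macro view fits our goal better.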


**Interesting results**

Why did CountVectorizer (CV) outperform TF-IDF for most models?

- **Information Loss in TF-IDF**
  - CountVectorizer records the raw count of every word, even words that matter little to the meaning of the tweet.
  - TF-IDF, however, downweights words that appear in many tweets, so a frequent word gets less weight even when it is important to the sentiment of a tweet.

- **Multi-class Sensitivity**
  - CV directly captures distinctive words like “happy”, “angry”, and “okay” through their counts, which for short tweets are mostly 0 or 1. The algorithm can then decide on the sentiment class based on the presence or absence of such words.
  - When there are minority classes, the relevant words in those classes are always captured, without the risk of being downweighted as in TF-IDF.
  - This can help the algorithm distinguish between the three sentiments.

- **Handling Noise**
  - TF-IDF can be more affected by noise in the data than CV: it may give a rare but irrelevant word more weight, while CV only records its count.
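
The downweighting effect can be illustrated with a small sketch. It uses a made-up corpus and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which is, to my understanding, the scikit-learn default; row normalization is ignored here, so the numbers are not what `TfidfVectorizer` would output. `smoothed_idf` is a hypothetical helper.

```python
import math

# toy corpus (assumed): "flight" appears in every tweet, "tacky" in only one
docs = [
    ["flight", "late", "again"],
    ["flight", "great", "crew"],
    ["flight", "tacky", "commercials"],
    ["flight", "late", "refund"],
]

def smoothed_idf(term, docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_flight = smoothed_idf("flight", docs)
idf_tacky = smoothed_idf("tacky", docs)

# in the third tweet both words have a raw count of 1,
# but TF-IDF weights the rarer word more heavily
print(f"count weight:  flight=1, tacky=1")
print(f"tf-idf weight: flight={1 * idf_flight:.3f}, tacky={1 * idf_tacky:.3f}")
```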


**Understanding Model Performance**

We will evaluate the models based on their average recall in the minority classes, choosing the version of each model (CV or TFIDF) that performed best after hyperparameter tuning.

In descending order:

**Logistic Regression (CV) (74%)**

<details>
  <summary>Click to Expand</summary>

Why did logistic regression perform the best (average recall of 74% in minority classes)?
- Soft Decision boundaries to identify minority classes:
  - LR uses soft decision boundaries to separate classes. What does this mean? Instead of making a strict decision about where a tweet belongs, LR computes a probability score for the tweet belonging to each of the classes. It then selects the class with the highest probability.
  - This allows the model to express uncertainty and acknowledge that some tweets have characteristics of multiple classes.
  - This can be advantageous because it reduces the bias towards the majority class. Instead of favoring the majority class due to class imbalance, the model will assign probabilities to all the classes, making it better equipped to handle complex relationships in the imbalanced data. This can lead to higher recall in minority classes.
- Potential Linear Separability:
  -  In some instances, sentiment classes can be separated using linear decision boundaries, which logistic regression automatically assumes. Given the high recall in the minority classes, the sentiment classes may be linearly separable.


</details>


**Support Vector Classifier (CV) (60%)**

<details>
  <summary>Click to Expand</summary>

- Based on how well logistic regression performed, we have strong reason to assume that the classes are linearly separable.
- However, the best estimator in SVC using countvectorizer uses an RBF kernel, which creates nonlinear decision boundaries from the points in the transformed space.
- This sort of decision boundary could perform poorly if the data is actually linearly separable.

</details>


**XGB (TFIDF) (59%)**
<details>
  <summary>Click to Expand</summary>

Why do we have poor recall in the minority classes?


- Limited minority samples available in the training set:
  - XGBoost builds its trees based on samples from the data set. While class_weights=’balanced’ can adjust the loss function, it cannot correct for the limited number of minority samples available for sampling. If there are not enough minority samples for the model to learn from, misclassifications are expected.


- Overcomplexity:
  - Many tree-based ensemble models like XGBoost are known for their high flexibility and their ability to handle complex, non-linear relationships in the data. This is often advantageous but can lead to overfitting to noise and outliers in the data.
</details>



**MNB (CV) (56%)**
<details>
  <summary>Click to Expand</summary>

Why do we have poor recall in the minority classes?


- Independence assumption:
  - MNB makes the assumption that each feature (word) contributes to the class prediction independently of other features in the tweet. In other words, the presence of a certain word in a tweet is independent of other words in the tweet given the class label.  However this assumption often does not work well in sentiment analysis, where the order and context of words is critical. This can lead to an oversimplification of our data, which may result in misclassification.

**It is important to note that while we adjusted the class weights for each of the other models, we did not do that for MNB because there was no option to do so. This could account for why MNB performed so poorly without hyperparameter tuning.**


</details>

**RF (CV) (54%)**
<details>
  <summary>Click to Expand</summary>

Why do we have poor recall in the minority classes?

- Limited minority samples available in the training set: Similar to XGBoost, random forests may struggle to build strong classifying trees when there are so few samples from the minority classes.
  - As the random forest model is built from bootstrapped samples of the original dataset, the minority classes will be underrepresented, and the model may not have enough data to perform well.
- Overlapping classes:
  - When similar features appear in all classes, which is often the case for sentiment classification (similar words in all classes), random forest may struggle to form the nuanced distinctions that separate the classes. This can be especially challenging in tweets with sarcasm or dry humor:
  - Example: “I love when my flight takes off three hours late!”
- Making these subtle distinctions will be difficult.

</details>

It is important to note that only Logistic Regression performed substantially better than the other models (about 14 percentage points above the next best model). All the other models performed similarly, with a maximum difference of 6 percentage points (between SVC and RF). The reasoning I provided should be understood as possible explanations of model performance in the context of sentiment classification.

**Most importantly, we must understand that extensive hyperparameter tuning could completely change the order of model performance! This section provides only a small glimpse into the power of various classifiers in SA.**


# Neural Networks


In [None]:
# importing all necessary libraries and packages
import re  # used by preprocessor_nn below

import gensim
from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (Bidirectional, Conv1D, Dense, Dropout, Embedding,
                                     Flatten, GlobalMaxPooling1D, LSTM, SpatialDropout1D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot


- We have tried using non-neural-network machine learning models to classify our tweets. Let us now see how neural networks affect our performance.
- Previously, we used count vectorizer and TFIDF to vectorize our tweets. Both techniques produce a sparse vector for each tweet. Sparse vectors are vectors that are predominantly filled with 0s.
- This form of representation works fairly well with traditional machine learning models but is not always effective with neural networks, which are designed to handle dense inputs.
- For our neural networks, we want to represent these tweets with dense vectors, or vectors that have very few 0s.
- We will explore two dense vectorization techniques, Word2Vec and GloVe.


### Preparing Data
- Before we can jump into Word2Vec and GloVe embeddings, we must tokenize our tweets. Tokenization breaks each tweet down into word tokens.
  - Example: ["My dog rocks"] --> ["My","dog","rocks"]
- We do this for every tweet in our corpus.
- Each unique word in our corpus is then assigned an integer index. The Keras tokenizer assigns these by frequency, so the most common words get the smallest integers.
- Suppose the integer representation for ["My","dog","rocks"] is [3,2,1].
- These integer sequences are the input for our model.


Where do Word2Vec and GloVe embeddings come in?
- Word2Vec and GloVe embeddings are vectors given to individual words. These vectors are created through a separate process where the model learns to represent words through their contexts.
- Each word in our tokenized tweet ["My","dog","rocks"] or [3,2,1] will have its own vector representation.


Using Tokenization with Embeddings:
- The tokenized tweets are the inputs in our NN models. The embeddings serve as a look up dictionary. Each word in our tokenized tweet can be looked up in the embeddings dictionary. For example, ["My","dog","rocks"] will have three separate vectors based on the three words. The vectors will be the features in our NN models.
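
The lookup described above can be sketched with plain Python lists. The word index and the 4-dimensional vectors below are invented for illustration (our real embeddings have 100 dimensions), and `embed` is a hypothetical helper.

```python
# hypothetical word index and 4-dimensional embeddings, for illustration only
word_index = {"rocks": 1, "dog": 2, "my": 3}
embedding_table = [
    [0.0, 0.0, 0.0, 0.0],   # row 0: reserved for padding
    [0.2, -0.1, 0.5, 0.3],  # row 1: "rocks"
    [0.7, 0.1, -0.4, 0.0],  # row 2: "dog"
    [-0.3, 0.6, 0.1, 0.2],  # row 3: "my"
]

def embed(tokens, word_index, embedding_table):
    """Map tokens to integer ids, then look up one vector per id."""
    ids = [word_index[token.lower()] for token in tokens]
    return ids, [embedding_table[i] for i in ids]

token_ids, vectors = embed(["My", "dog", "rocks"], word_index, embedding_table)
print(token_ids)                      # [3, 2, 1]
print(len(vectors), len(vectors[0]))  # 3 4
```

A three-word tweet therefore becomes a 3 x 4 block of numbers here, and a 100-token padded tweet with 100-dimensional embeddings becomes a 100 x 100 block in our models.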




- Before tokenizing tweets, we alter the preprocessing function.
- Since neural networks can learn more complex relationships directly from data, we skip the stemming step. Note that the function below still removes stopwords.


In [None]:

def preprocessor_nn(text):
    # remove HTML tags, e.g. <br>
    text = remove_html(text)
    # remove @ mentions, e.g. @catsrcool
    text = remove_tags(text)
    # remove URLs
    text = re.sub(r"http\S+", "", text)
    # expand contractions
    text = decontracted(text)
    # remove numbers and any token that contains a digit
    text = re.sub(r"\S*\d\S*", "", text)
    # replace every run of non-letter characters (punctuation, #, ...) with a space
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # collapse repeated spaces
    text = re.sub(r" +", " ", text)
    # clean any remaining punctuation
    text = remove_punc(text)
    # lowercase everything
    text = text.lower()
    # remove stop words (the output below shows this returns a list of tokens)
    text = remove_stops(text)
    # stemming is intentionally skipped for the neural network models
    # text = stemmed_sent(text)

    return text

In [None]:
# how our preprocessor works now
preprocessor_nn(tweets['text'][1])

['plus', 'added', 'commercials', 'experience', 'tacky']

In [None]:
# splitting the data
X=tweets.text.apply(lambda x: preprocessor_nn(x) )
y=tweets.Sentiment
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

Let's start tokenizing.
We use the Keras tokenizer, which assigns each unique word in the vocabulary
an integer.
- 0 is reserved for padding (discussed later)
- 1 is reserved for out-of-vocabulary (OOV) words, which is useful when vectorizing test data that contains words not seen in the training data
- oov_token=1 specifies that all previously unseen words (words not in the training data) will be given the token 1
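
The indexing rules above can be mimicked in a few lines of plain Python. This is only a toy stand-in to show the idea, not how the Keras tokenizer is implemented; `build_word_index` and `to_sequence` are hypothetical helpers and the tweets are made up.

```python
from collections import Counter

def build_word_index(texts, oov_token=1):
    """Toy stand-in for a fitted tokenizer: OOV gets index 1, then words by frequency."""
    counts = Counter(word for tokens in texts for word in tokens)
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(tokens, word_index, oov_token=1):
    """Replace each token by its index, falling back to the OOV index."""
    return [word_index.get(word, word_index[oov_token]) for word in tokens]

train_texts = [["flight", "cancelled"], ["flight", "late"], ["thanks", "flight"]]
word_index = build_word_index(train_texts)
print(word_index)

# "delayed" never appeared in the training texts, so it maps to the OOV index 1
test_sequence = to_sequence(["flight", "delayed"], word_index)
print(test_sequence)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value.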


In [None]:
word_tokenizer = Tokenizer(oov_token=1)

# building the tokens on the unique words in X_train
word_tokenizer.fit_on_texts(X_train)

In [None]:
# looking at the first five entries of the word index
first_five_entries = dict(list(word_tokenizer.word_index.items())[:5])

print(first_five_entries)


{1: 1, 'flight': 2, 'get': 3, 'thanks': 4, 'cancelled': 5}


In [None]:
# there are 9737 unique words in our corpus, including our OOV token
vocab_length=len(word_tokenizer.word_index)+1
vocab_length

9738

Why do we add 1 to the vocab length?
- Note that the indexing for our word_tokenizer begins at 1 rather than 0 (see first_five_entries), because 0 is reserved for padding. The embedding matrix needs one row for every index from 0 up to the greatest token ID (9737), so we add 1 to get 9738 rows.

In [None]:
# tokenizing both X_train and X_test

X_train_tokenized=word_tokenizer.texts_to_sequences(X_train)
X_test_tokenized=word_tokenizer.texts_to_sequences(X_test)

# an example of what tokenization looks like
X_train_tokenized[0:5]

[[799, 11, 1436, 49, 226],
 [106, 330, 115, 230, 141, 148, 583, 91, 506, 1172, 108],
 [137, 57, 203, 103, 3542, 278, 159, 952, 953],
 [2, 19, 107, 132, 39, 90, 1039],
 [571, 244, 367, 30, 554, 417, 2108, 1173, 384, 268, 190]]

We want all the vectors to be of equal length. For this we use padding to make each tweet a vector of length 100. If the tweet is too short, 0s are added after it (padding='post') until there are 100 tokens; a tweet longer than 100 tokens would be truncated.
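
As a rough illustration of what post-padding does, here is a pure-Python sketch. `pad_post` is a hypothetical helper, not the Keras function; it pads at the end and, for sequences that are too long, keeps the last `maxlen` tokens, which mirrors what I understand to be the `pad_sequences` default of truncating from the front.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad with `value` at the end; if too long, keep the last `maxlen` tokens."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return sequence + [value] * (maxlen - len(sequence))

# a short tweet (token ids from the example above) padded to length 8
short_tweet = [799, 11, 1436, 49, 226]
padded = pad_post(short_tweet, maxlen=8)
print(padded)     # [799, 11, 1436, 49, 226, 0, 0, 0]

# a made-up 10-token tweet truncated to length 8
long_tweet = list(range(1, 11))
truncated = pad_post(long_tweet, maxlen=8)
print(truncated)  # [3, 4, 5, 6, 7, 8, 9, 10]
```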


In [None]:
maxlen=100
X_train_tokenized=pad_sequences(X_train_tokenized,padding='post',maxlen=maxlen)
X_test_tokenized=pad_sequences(X_test_tokenized,padding='post',maxlen=maxlen)



In [None]:
# example of padded tweet
X_train_tokenized[0]

array([ 799,   11, 1436,   49,  226,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)

### Handling Class imbalances
We mentioned earlier that the distribution of tweet sentiments is highly imbalanced. Previously, we used class_weights='balanced' to account for the imbalance.
For our neural networks, we will use an oversampling technique called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new instances of a minority class that are similar to instances already in that class. It creates as many instances as needed for each minority class to match the size of the majority class. This is also useful because it gives our models "more data" to learn from.
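
The core idea behind SMOTE is interpolation between two nearby minority samples. The sketch below shows only that step on made-up 2-dimensional points; the real algorithm also searches for nearest neighbours, which is skipped here, and `interpolate` is a hypothetical helper. The class counts are the ones from our training split.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    lam = rng.random()  # random position in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]

rng = random.Random(42)
sample = [1.0, 4.0]    # made-up minority sample
neighbor = [3.0, 2.0]  # made-up neighbour from the same class
synthetic = interpolate(sample, neighbor, rng)
print(synthetic)

# how many synthetic samples each class needs to match the majority class
counts = {-1: 7289, 0: 2519, 1: 1904}
target = max(counts.values())
to_generate = {label: target - n for label, n in counts.items()}
print(to_generate)  # {-1: 0, 0: 4770, 1: 5385}
```

Adding 4770 neutral and 5385 positive samples to the 11712 original tweets gives 21867 training rows, or 7289 per class.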


In [None]:
# before SMOTE
len(X_train)
y_train.value_counts()

-1    7289
 0    2519
 1    1904
Name: Sentiment, dtype: int64

In [None]:
# using smote
from imblearn.over_sampling import SMOTE
# auto means that the model will increase the instances in the minority class to match the majority class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tokenized, y_train)

In [None]:
# after smote
len(X_train_resampled)
y_train_resampled.value_counts()

-1    7289
 1    7289
 0    7289
Name: Sentiment, dtype: int64

Transforming Target Variable


- For our neural network models, we also have to one-hot encode our target variable. This turns the numerical labels (-1, 0, 1) into categorical indicators, which prevents the model from treating them as ordinal.
- One-hot encoding also matches the output layer: the softmax activation produces one probability per class, and each probability is compared against the corresponding entry of the one-hot target.
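
Here is a small sketch of how a one-hot target and a softmax output fit together. The logits are made up, and the loss shown is categorical cross-entropy, which is an assumption on my part since the compile step is not shown in this part of the notebook. All three functions are written from scratch for illustration.

```python
import math

classes = [-1, 0, 1]  # sorted class labels, one column each

def one_hot(label):
    """Encode a label as a vector with a single 1 in its class column."""
    return [1.0 if label == c else 0.0 for c in classes]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # shift for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

logits = [2.0, 0.5, -1.0]  # made-up raw scores for one tweet
probs = softmax(logits)
target = one_hot(-1)       # the tweet is actually negative
loss = categorical_cross_entropy(target, probs)

print([round(p, 3) for p in probs], round(sum(probs), 3))
print(target, round(loss, 3))
```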




In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
y_train_nn = encoder.fit_transform(y_train_resampled.values.reshape(-1,1))
y_test_nn = encoder.transform(y_test.values.reshape(-1,1))


In [None]:
print(y_train_nn.shape)
print(y_test_nn.shape)

(21867, 3)
(2928, 3)


### Word2Vec Text Vectorization
- We are finally going to train our word2vec model!
- Word2Vec uses a two-layer simple neural network to vectorize words
- Each word is represented as a vector, and a word vector's relative position to another word vector suggests its semantic meaning. For example, we would expect the word vector for "happy" to be close to the word vector for "joyful".
- This model generally has two approaches
- In the Continuous Bag-of-Words approach, the model learns by guessing target words from neighboring words (the dog likes ?)
  - target word: treat
- In the Skip-Gram approach, the model attempts to guess neighboring context words from the specified word
  - We treat each target word and context word as a new observation
  - For example, in the sentence "the dog likes treats", if the target word were "dog", the model would try to predict the context words "the" and "likes". The model would learn from these two pairs: (target word: dog, context word: the) and (target word: dog, context word: likes)
  - Skip-gram is generally reported to work well with smaller corpora and rare words, while Continuous Bag-of-Words trains faster
- From the gensim library, we will import the Word2Vec model, which we will train with our training data
- By default, Gensim's Word2Vec uses the Continuous Bag-of-Words approach
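
The difference between the two approaches is easiest to see in the training examples they generate. The sketch below builds those examples for the sentence used above; `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for illustration and are not part of Gensim.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (target, context) pair per neighbouring word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = ["the", "dog", "likes", "treats"]

# skip-gram pairs for the target word "dog" with a window of 1
dog_pairs = [pair for pair in skipgram_pairs(sentence, window=1) if pair[0] == "dog"]
print(dog_pairs)       # [('dog', 'the'), ('dog', 'likes')]

# CBOW example for the last word with a window of 3
treats_example = cbow_examples(sentence, window=3)[3]
print(treats_example)  # (['the', 'dog', 'likes'], 'treats')
```

In Gensim, the `sg` argument of `Word2Vec` switches between the two: `sg=0` (the default) is CBOW and `sg=1` is Skip-Gram.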


Word2Vec parameters

- sentences = X_train: passing in the corpus of words to train our model on
- vector_size=100: we want the vector representation of each word to have 100 dimensions
- window=5: To capture semantic meaning, we look at up to five words before and five words after the target word. Based on these relationships, we build the word vectors for each unique word.
- min_count=1: minimum number of occurrences for a word to be included in the vocabulary. Since we have a small dataset, we choose min_count=1 to capture as much data as possible




In [None]:

w2v_model=Word2Vec(sentences=X_train,vector_size=100,window=5,min_count=1)
# 9736 unique words
# 2 fewer than vocab_length (9738): Word2Vec has no OOV token, and vocab_length includes the extra 1 added for index 0
len((w2v_model.wv.index_to_key))

9736

In [None]:
# taking a look at the word to vec representation of the word "flights"
# as expected, there are 100 numbers representing the word
print(w2v_model.wv['flights'].shape)
w2v_model.wv['flights']

(100,)


array([-5.7033193e-01,  7.2683305e-01,  8.7783024e-02, -2.4009908e-02,
       -9.8888405e-02, -1.4692439e+00,  1.9702315e-01,  1.7681420e+00,
       -6.9984740e-01, -7.0598060e-01, -4.2849624e-01, -1.2909800e+00,
       -2.9127050e-02,  1.1972674e-01,  6.7899173e-01, -3.5623607e-01,
        3.4086233e-01, -6.3832629e-01, -1.7850080e-01, -1.4242325e+00,
        3.7956622e-01,  2.8166869e-01,  2.9410106e-01, -2.9112980e-01,
       -8.0852389e-02,  2.3966298e-01, -5.4365587e-01, -8.8053173e-01,
       -8.1619638e-01, -1.9107888e-02,  5.3257412e-01,  1.7733657e-01,
        2.5420931e-01, -7.4996758e-01, -1.9874513e-01,  9.6876532e-01,
        3.1601673e-04, -5.9464478e-01, -4.9898329e-01, -1.4261857e+00,
        1.8827841e-01, -6.7286456e-01, -1.4520207e-01, -3.8219652e-01,
        5.8622330e-01, -3.4148347e-01, -5.8888727e-01, -2.9746544e-01,
        3.1709614e-01,  5.2587962e-01,  5.0530773e-01, -4.9520808e-01,
       -3.9860912e-02,  2.6795059e-01, -4.9765012e-01,  2.3418115e-01,
      

- We create an embedding matrix that will serve as the initial weights in our neural network model
- We iterate through every unique word in the word index and add the word2vec representation of that word to the matrix


In [None]:
embedding_matrix_w2v=np.zeros((vocab_length,100))
for word, i in word_tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix_w2v[i]=w2v_model.wv[word]

In [None]:
embedding_matrix_w2v.shape
# we have the same amount of words as in our word tokenizer. Each word is represented by w2v representations.

(9738, 100)

### Building our neural networks
- We will be building three neural networks:  a simple neural network, a convolutional neural network, and a long short term memory neural network. We will first define all the models, and then fit them all together.


#### SNN with W2V
- Let us begin with a simple, feed forward neural network.
- In a simple neural network, the connections between the neurons do not form cycles.
- The data flows in one direction, from the input layers, to the hidden layers, and then to the output layer, hence the name feedforward.
- We will use our W2V embedding matrix as our initial weights.

In [None]:
# SNN overview

snn_model_w2v = Sequential()
snn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
snn_model_w2v.add(Flatten())
snn_model_w2v.add(Dense(128, activation='relu'))
snn_model_w2v.add(Dropout(0.3))
snn_model_w2v.add(Dense(64, activation='relu'))
snn_model_w2v.add(Dense(3, activation='softmax'))

In [None]:
# SNN explanation
# Initializing a keras sequential model to build our model by adding layers
snn_model_w2v = Sequential()

# Embedding layer
# Converts integer tokenized tweets into vector representation (W2V) embeddings of length 100
# Input_dim = length of our vocabulary
# Output_dim: length of our word embeddings (each word is represented by 100-dimensional vector)
# Weights: as mentioned earlier, weights for this layer are initialized using W2V embedding matrix
# Trainable = true: allows the weights (word embeddings) to be updated during training
snn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))

# Flatten layer
# Flattens 2 dimensional output of embedding layer to a one dimensional vector that can be passed into the next fully connected dense layer
snn_model_w2v.add(Flatten())

# Dense layer
# Adds a dense layer with 128 units using ReLU activation to introduce non-linearity
# Relu function replaces all negative values with zero while letting positive values go unchanged
# Each of the 128 units captures a specific pattern from the previous layer and produces 1 output
snn_model_w2v.add(Dense(128, activation='relu'))

# Dropout
# Dropout layer helps prevent overfitting by randomly setting a fraction (0.3) of its inputs to 0 during training
snn_model_w2v.add(Dropout(0.3))

# Dense layer
# Another dense layer. Each of the 64 units captures a specific feature or pattern from the previous layer and produces 1 output
snn_model_w2v.add(Dense(64, activation='relu'))

# Output layer
# Returns probabilities that the observations fall into each of the three classes
# 3 units, one for each class
# Softmax activation: computes probability over the three classes, should sum to 1
snn_model_w2v.add(Dense(3, activation='softmax'))

Model Architecture


In [None]:
snn_model_w2v.summary()


Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 100, 100)          973800    
                                                                 
 flatten_4 (Flatten)         (None, 10000)             0         
                                                                 
 dense_24 (Dense)            (None, 128)               1280128   
                                                                 
 dropout_10 (Dropout)        (None, 128)               0         
                                                                 
 dense_25 (Dense)            (None, 64)                8256      
                                                                 
 dense_26 (Dense)            (None, 3)                 195       
                                                                 
Total params: 2262379 (8.63 MB)
Trainable params: 226

Breaking down model architecture
- embedding_10 (Embedding): The output shape is a 100 x 100 matrix.
  - Takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word into a 100-dimensional vector using the initial w2v weights.
- flatten_4 (Flatten): The output shape is a 10,000 unit vector.
  - Takes the 100 x 100 matrix of embedding vectors and flattens it to a one-dimensional vector of length 10,000. This prepares the data to be fed into dense layers.
- dense_24 (Dense): The output shape is a 128 unit vector.
  - This layer produces a vector of length 128 by capturing the most relevant patterns and relationships in the 10,000 value flattened layer.
- dropout_10 (Dropout): The output is a 128 unit vector.
  - The shape remains the same, but we temporarily set some of the neurons to 0 to prevent overfitting.
- dense_25 (Dense): The output is a 64 unit vector.
  - Another dense layer, which outputs a vector of length 64, capturing more important patterns and relationships in the data.
- dense_26 (Dense): The output is a 3 unit vector.
  - Outputs a vector of length three, which gives the probability that each observation falls into each class.

The "None" in the architecture represents the batch size, or how many data points are processed at a time.
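
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. This sketch assumes the standard counting rules (an embedding has one weight per vocabulary entry and dimension; a dense layer has inputs x units weights plus one bias per unit); `dense_params` and the dictionary keys are illustrative names.

```python
vocab_length, embedding_dim, maxlen = 9738, 100, 100

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_params = {
    "embedding": vocab_length * embedding_dim,               # one vector per index
    "flatten": 0,                                            # only reshapes
    "dense_128": dense_params(maxlen * embedding_dim, 128),  # 10,000 inputs
    "dropout": 0,                                            # no weights
    "dense_64": dense_params(128, 64),
    "dense_3": dense_params(64, 3),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:<10} {count:>9,}")
print(f"{'total':<10} {total:>9,}")
```

Most of the parameters sit in the first dense layer, because flattening turns every tweet into 10,000 inputs.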

#### CNN with W2V

CNNs are used largely in image processing but can also be applied to text classification. A CNN for text can use a 1D convolutional layer, which slides a filter (kernel) along the sequence of input embeddings to extract local patterns and relationships.
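
To make the sliding-filter idea concrete, here is a from-scratch sketch of a single filter with a window of three tokens, followed by global max pooling. The embeddings and filter weights are made up, and unlike our model (which uses padding='same') the sketch uses no padding, so the output is shorter than the input. `conv1d_valid` is a hypothetical helper, not a Keras function.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of embedding vectors (no padding), then apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias + sum(
            w * x
            for kernel_row, vector in zip(kernel, window)
            for w, x in zip(kernel_row, vector)
        )
        outputs.append(max(0.0, total))  # ReLU
    return outputs

# made-up 2-dimensional embeddings for a 5-token tweet
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
# one filter with kernel_size=3: one weight row per position in the window
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

feature_map = conv1d_valid(sequence, kernel)
pooled = max(feature_map)  # global max pooling keeps the strongest response

print(feature_map)  # [1.0, 1.0, 0.0]
print(pooled)       # 1.0
```

A layer with 128 filters repeats this with 128 different kernels, so global max pooling leaves one number per filter.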


In [None]:
# CNN overview

cnn_model_w2v = Sequential()
cnn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
cnn_model_w2v.add(Conv1D(128, 3, activation='relu', padding='same'))
cnn_model_w2v.add(GlobalMaxPooling1D())
cnn_model_w2v.add(Dropout(0.5))
cnn_model_w2v.add(Dense(64, activation='relu'))
cnn_model_w2v.add(Dense(3, activation='softmax'))


In [None]:
# Creating a Sequential model so we can build the model in layers
cnn_model_w2v = Sequential()


# Embedding layer
# Converts integer-tokenized tweets into W2V embedding vectors of length 100
# input_dim: size of our vocabulary
# output_dim: length of our word embeddings (each word is represented by 100 numbers)
# weights: as mentioned earlier, the word embeddings matrix we created using w2v
# trainable=True: allows the weights (word embeddings) to be updated during training
cnn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))

# First Conv1D layer
# Filters slide across the data to extract local patterns
# Filters=128: Each of the 128 filters is responsible for capturing specific patterns in the input data
# kernel_size=3: A kernel size of three indicates that we use a sliding window of size three to move along the input data, examining three word embeddings at a time.
# activation= relu: relu activation function to introduce non-linearity
# padding= 'same': ensures output sequence length for the layer is the same as the input sequence length
cnn_model_w2v.add(Conv1D(128, 3, activation='relu', padding='same'))

# Global Max Pooling layer
# A form of dimensionality reduction
# Extracts maximum value of each of the 128 filters, capturing most important info from each filter
cnn_model_w2v.add(GlobalMaxPooling1D())

# Dropout
# During training, randomly sets half the input values to 0 to prevent overfitting
cnn_model_w2v.add(Dropout(0.5))

# Dense layer
# Each of the 64 units captures a specific pattern from the previous layer and produces 1 output
cnn_model_w2v.add(Dense(64, activation='relu'))

# Output layer
# Returns probabilities that the observations fall into each of the three classes
# 3 units, 3 classes
# Softmax activation: computes probability over the three classes, should sum to 1
cnn_model_w2v.add(Dense(3, activation='softmax'))


Understanding Conv1D and GlobalMaxPooling




To better understand the Conv1D layer and the max pooling layers, let us take a simple example.
- Suppose each tweet has a maximum of 4 words instead of 100. Suppose also that we have only three filters instead of 128. We include padding, which is just zero vectors, so that we have the same number of windows (4) as the length of our input sequence (4).


<img src="https://drive.google.com/uc?id=1dv8uMiZhq_FFmRfhcrh-nL6yz9d7dald" alt="New Image" width="400" height="300">




- Each of the three filters will be applied to each of the four windows. The output of the conv layer will be 4x3 (4 rows represent four windows, and three columns represent the filter values). That provides a very simplified explanation of the Conv1D layer.
- Next we apply the max pooling layer. For each of the three filters, we choose the maximum value. In other words, we choose the maximum value from each of the columns. This is a form of dimensionality reduction.
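
The toy example above can be sketched in NumPy. This is a minimal illustration rather than what Keras does internally: the embeddings and filter weights are random stand-ins, and all variable names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, embed_dim, n_filters, kernel_size = 4, 100, 3, 3
tweet = rng.normal(size=(seq_len, embed_dim))                   # 4 word embeddings (random stand-ins)
filters = rng.normal(size=(kernel_size, embed_dim, n_filters))  # 3 filters, each spanning 3 words
bias = np.zeros(n_filters)

# 'same' padding for kernel_size=3: one zero vector on each side of the tweet
padded = np.vstack([np.zeros((1, embed_dim)), tweet, np.zeros((1, embed_dim))])

conv_out = np.empty((seq_len, n_filters))
for t in range(seq_len):
    window = padded[t:t + kernel_size]                          # shape (3, 100)
    # each filter value is the sum of elementwise products over the window
    conv_out[t] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
conv_out = np.maximum(conv_out, 0)                              # ReLU activation

pooled = conv_out.max(axis=0)                                   # global max pooling: max of each column
print(conv_out.shape, pooled.shape)
```

The convolution output has one row per window and one column per filter, and pooling keeps a single value per filter.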



In [None]:
cnn_model_w2v.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_12 (Embedding)    (None, 100, 100)          973800    
                                                                 
 conv1d_7 (Conv1D)           (None, 100, 128)          38528     
                                                                 
 global_max_pooling1d_7 (Gl  (None, 128)               0         
 obalMaxPooling1D)                                               
                                                                 
 dropout_12 (Dropout)        (None, 128)               0         
                                                                 
 dense_29 (Dense)            (None, 64)                8256      
                                                                 
 dense_30 (Dense)            (None, 3)                 195       
                                                     

Breaking down model architecture
- embedding_175 (Embedding): The output shape is (None, 100, 100)
  - Takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word into a 100 dimensional vector using initial w2v weights.
- conv1d_67 (Conv1D): The output shape is (None,100,128)
  - The filters consider three consecutive word embeddings at a time (look at three words at a time). The three consecutive word embeddings represent 1 window. For every window, all 128 filters are applied, and the values of the 128 filters are stored as a row. For example, the filter values applied to the first window are stored in the first row.
  - Since there are 100 windows after padding, there will be 100 rows.
- global_max_pooling1d_64: The output shape is a vector of length 128.
  - From each of the 128 filters (each column of our 100 x 128 matrix), we choose the max value. This is a form of dimensionality reduction.
- dropout: The output shape is (None, 128).
  - During training, we temporarily set half the values to 0 to prevent overfitting.
- dense_252 (Dense): The output is a vector of length 64.
  - We process output from previous layers and learn important relationships, reducing size to 64 dimensional vectors.
- dense_253 (Dense): The output is a vector of length 3.
  - Outputs a vector of length three which provides the probability that each observation falls into a certain class



In [None]:
from keras.layers import Bidirectional, GlobalMaxPooling1D

#### LSTM with W2V

- The Long Short Term Memory network is a type of recurrent neural network designed to capture longer-term dependencies and relationships in sequential data.
- LSTMs utilize memory cells and gates to control the flow of information and to decide which information is relevant.
- Since they are good at capturing long-term dependencies, LSTMs are well suited to natural language processing and time series problems.
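
To make the role of the gates concrete, here is a minimal NumPy sketch of a single LSTM time step in its standard textbook form. The function `lstm_step`, the weight shapes, and the random inputs are assumptions made for illustration; the notebook itself relies on the Keras `LSTM` layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, and output gates
    g = np.tanh(g)                                # candidate values to write to memory
    c_t = f * c_prev + i * g                      # forget part of the old memory, add new information
    h_t = o * np.tanh(c_t)                        # expose part of the memory as the output
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 100, 4                         # 4 units to keep the example small
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):       # a 5-word "tweet" of random embeddings
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The memory cell `c` is carried from step to step, which is what lets information from early words influence later outputs.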

In [None]:
# LSTM overview

lstm_model_w2v = Sequential()
lstm_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
lstm_model_w2v.add(Bidirectional(LSTM(128, return_sequences=True)))
lstm_model_w2v.add(GlobalMaxPooling1D())
lstm_model_w2v.add(Dense(64, activation='relu'))
lstm_model_w2v.add(Dropout(0.5))
lstm_model_w2v.add(Dense(3, activation='softmax'))


In [None]:
# LSTM Model explanation
# Create a Sequential model so we can build the model in layers
lstm_model_w2v = Sequential()


# Embedding layer
# Converts integer-tokenized tweets into W2V embedding vectors of length 100
# input_dim: size of our vocabulary
# output_dim: length of our word embeddings (each word is represented by 100 numbers)
# weights: as mentioned earlier, the word embeddings we created using w2v
# trainable=True: allows the weights (word embeddings) to be updated during training
lstm_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))

# LSTM layer
# 128 LSTM units read the tweet left to right; at each time step they summarize the words seen so far
# Another 128 units read the tweet right to left; at each time step they summarize the words from the end back to that step
# Each LSTM unit captures a particular relationship between words in a sequence
lstm_model_w2v.add(Bidirectional(LSTM(128, return_sequences=True)))


# Global pooling layer
# Extracts maximum value of each of the 256 LSTM units (max value of each column from the previous layer), capturing most important info from each unit
lstm_model_w2v.add(GlobalMaxPooling1D())


# Dense layer
# Each of the 64 units captures a specific pattern from the previous layer and produces 1 output
lstm_model_w2v.add(Dense(64, activation='relu'))


# Dropout
# During training, randomly sets half the input values to 0 to prevent overfitting
lstm_model_w2v.add(Dropout(0.5))


# Output layer
# Returns probabilities that the observations fall into each of the three classes
# 3 units, 3 classes
# Softmax activation: computes probability over the three classes, should sum to 1
lstm_model_w2v.add(Dense(3, activation='softmax'))

LSTM Model Architecture

In [None]:
lstm_model_w2v.summary()


Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_14 (Embedding)    (None, 100, 100)          973800    
                                                                 
 bidirectional_4 (Bidirecti  (None, 100, 256)          234496    
 onal)                                                           
                                                                 
 global_max_pooling1d_9 (Gl  (None, 256)               0         
 obalMaxPooling1D)                                               
                                                                 
 dense_33 (Dense)            (None, 64)                16448     
                                                                 
 dropout_14 (Dropout)        (None, 64)                0         
                                                                 
 dense_34 (Dense)            (None, 3)               



- embedding_175 (Embedding): The output shape is (None,100,100)
  - Takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word in the tokenized tweet into a 100 dimensional vector using initial w2v weights.

- bidirectional_1: The output shape is (None, 100, 256)
  - Each word is treated as its own time step. Since there are 100 words, there are 100 time steps, and therefore 100 rows.
  - Consider the fifth time step. The forward LSTM has processed the first through fifth words, and its 128 units capture relationships in that part of the sequence.
  - The backward LSTM reads the tweet in the opposite direction, so at the fifth time step its 128 units have processed the words from the 100th back to the fifth.
  - Every row therefore combines two summaries: the forward half of the last row (time step 100) has seen the whole tweet, and so has the backward half of the first row.

- globalMax pooling: The output shape is a 256 long vector.
  - We take the maximum value from each column (256 columns) and store that as a vector. This is a form of dimensionality reduction.

- dense_12 (Dense): The output is a 64 unit vector.
  - We process output from previous layers and learn important relationships, reducing size to 64 dimensional vectors.

- dropout_5 (Dropout): The output is a 64 unit vector.
  - We set half of the values from the previous layer to 0. This protects from overfitting.

- dense_253 (Dense): The output is a vector of length 3.
  - Outputs a vector of length three which provides the probability that each observation falls into a certain class
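
The parameter counts for the bidirectional layer and the dense layer that follows it can be reproduced with the standard LSTM formula. This is a small sketch assuming one bias vector per gate; the helper `lstm_params` is illustrative and not part of the notebook.

```python
def lstm_params(input_dim, units):
    # four blocks (input, forget, output gates and the candidate),
    # each with input weights, recurrent weights, and a bias
    return 4 * (input_dim * units + units * units + units)

embed_dim, units = 100, 128
one_direction = lstm_params(embed_dim, units)
bidirectional = 2 * one_direction       # forward and backward LSTMs have separate weights
dense_64 = (2 * units) * 64 + 64        # the 256 pooled values feed the 64-unit dense layer
print(one_direction, bidirectional, dense_64)
```

Under these assumptions the results agree with the 234,496 and 16,448 parameters reported in the summary.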




### GloVe word embeddings

- Another popular type of dense word embedding is GloVe. The GloVe embeddings have already been pre-trained on a large corpus. Depending on the file chosen, each word is represented by a vector of 50, 100, 200, or 300 numbers. We will use 100 numbers.
- Similar to Word2Vec, word vectors that are close to each other in the vector space have similar semantic meanings.
- GloVe differs from Word2Vec in that it aims to capture the global context of a word rather than just the local context, which we specified earlier in the W2V model with the window parameter (how many words to look at before and after the target word).
- We will now load our GloVe embeddings. The GloVe file contains 400,000 words, each represented by a vector of 100 numbers already defined in the GloVe model.
- We want to load these words into a dictionary with key = word and value = 100-dimensional float vector.


In [None]:
# simply creating a dictionary with all the GloVe words
from numpy import asarray
glove_dictionary = dict()

# each line holds a word followed by its 100 vector components
with open('/content/gdrive/MyDrive/Data/a2_glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records=line.split()
        word=records[0]
        vector_dimensions=asarray(records[1:],dtype='float32')
        glove_dictionary[word]=vector_dimensions

In [None]:
# there are 400,000 words in the GloVe file
len(glove_dictionary)

400000

In [None]:
len(glove_dictionary['the'])
# each word is represented by 100 numbers

100

In [None]:
# we will add the respective GloVe word embeddings for each word in our corpus to our embeddings matrix
from numpy import asarray
from numpy import zeros
embedding_matrix_glove=zeros((vocab_length,100))
for word, index in word_tokenizer.word_index.items():
    embedding_vector=glove_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix_glove[index]=embedding_vector

In [None]:
# the matrix has one row per vocabulary index; words without a GloVe vector keep an all-zero row
embedding_matrix_glove.shape

(9738, 100)
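
The shape alone does not show how many vocabulary words actually received a GloVe vector. One way to check is to count the rows that stayed at zero. The sketch below uses toy stand-ins for `word_tokenizer.word_index` and `glove_dictionary` (the words and vectors are made up), and it assumes index 0 is reserved for padding.

```python
import numpy as np

# Toy stand-ins for word_tokenizer.word_index and glove_dictionary (illustrative values only)
toy_word_index = {"flight": 1, "delayed": 2, "jetblue": 3}
toy_glove = {
    "flight": np.ones(100, dtype="float32"),
    "delayed": np.full(100, 0.5, dtype="float32"),
}

toy_matrix = np.zeros((len(toy_word_index) + 1, 100))  # row 0 assumed reserved for padding
for word, index in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        toy_matrix[index] = vector

# words whose row is still all zeros had no pre-trained vector
missing = [w for w, i in toy_word_index.items() if not toy_matrix[i].any()]
coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(coverage, 2))
```

Running the same check against the real `embedding_matrix_glove` would show how much of the corpus vocabulary starts from pre-trained weights.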

From here the code for the three models will be identical, except that our initial weights will be the embedding_matrix_glove instead of the embedding_matrix_w2v.

#### SNN with GloVe

In [None]:
snn_model_glove = Sequential()
snn_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
snn_model_glove.add(Flatten())
snn_model_glove.add(Dense(128, activation='relu'))
snn_model_glove.add(Dropout(0.3))
snn_model_glove.add(Dense(64, activation='relu'))
snn_model_glove.add(Dense(3, activation='softmax'))  # Output layer

#### CNN with GloVe

In [None]:
cnn_model_glove = Sequential()
cnn_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
cnn_model_glove.add(Conv1D(128, 3, activation='relu', padding='same'))  # Experiment with different filter sizes
# cnn_model_glove.add(Conv1D(128, 3, activation='relu', padding='same'))
cnn_model_glove.add(GlobalMaxPooling1D())
cnn_model_glove.add(Dropout(0.5))
cnn_model_glove.add(Dense(64, activation='relu'))
cnn_model_glove.add(Dense(3, activation='softmax'))

#### LSTM with GloVe

In [None]:
lstm_model_glove = Sequential()
lstm_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
lstm_model_glove.add(Bidirectional(LSTM(128, return_sequences=True)))  # Bidirectional LSTM
lstm_model_glove.add(GlobalMaxPooling1D())  # Global Max Pooling
lstm_model_glove.add(Dense(64, activation='relu'))  # Additional Dense layer
lstm_model_glove.add(Dropout(0.5))
lstm_model_glove.add(Dense(3, activation='softmax'))


### Fitting all our models

In [None]:
nn_models=[snn_model_w2v,cnn_model_w2v,lstm_model_w2v,snn_model_glove,cnn_model_glove,lstm_model_glove]

In [None]:
grid_dict_nn = {0: 'Simple NN W2V', 1: 'CNN NN W2v',
             2: 'LSTM NN W2V', 3: 'Simple NN GloVe',
             4: 'CNN NN GloVe',5:'LSTM NN GloVe'}

In [None]:
reports_nn=[]
for i,model in enumerate(nn_models):
    model_name = grid_dict_nn[i]
    print(model_name)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_history = model.fit(X_train_resampled, y_train_nn, batch_size=32, epochs=20, verbose=1, validation_split=0.2)
    # argmax returns class indices 0, 1, 2; subtracting 1 maps them back to the labels -1, 0, 1
    y_prob=model.predict(X_test_tokenized)
    y_classes=y_prob.argmax(axis=-1)

    y_classes_transformed = y_classes - 1

    print(classification_report(y_test,y_classes_transformed,zero_division=0))
    report=classification_report(y_test,y_classes_transformed,zero_division=0,output_dict=True)

    # Concatenating classification reports for easier comparison. Not important to understand code.
    report_df = pd.DataFrame(report).transpose()
    report_df.reset_index(inplace=True)
    report_df = report_df.rename(columns={'index': 'labels'})
    report_df['Model Name'] = model_name

    # One row per model, one column per (metric, label) pair
    pivot_df=report_df.pivot(index='Model Name',columns='labels')
    pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]

    # The 'accuracy' row repeats the same value in every metric column; keep one copy as 'Accuracy'
    columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
    final_df=pivot_df.drop(columns=columns_to_drop)
    final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
    final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
    reports_nn.append(final_df)





Simple NN W2V
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
              precision    recall  f1-score   support

          -1       0.86      0.53      0.65      1889
           0       0.32      0.63      0.43       580
           1       0.47      0.65      0.55       459

    accuracy                           0.57      2928
   macro avg       0.55      0.60      0.54      2928
weighted avg       0.69      0.57      0.59      2928

CNN NN W2v
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
              precision    recall  f1-score   support

          -1       0.84      0.58      0.69      1889
           0       0.32
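
The shift by one in the loop above ties the softmax output back to the sentiment labels. The sketch below illustrates the round trip, assuming `y_train_nn` was one-hot encoded from the labels shifted to 0, 1, 2 (that step is not shown in this section); the probabilities are made-up numbers.

```python
import numpy as np

labels = np.array([-1, 0, 1, -1])        # sentiment labels as used in y_test
class_index = labels + 1                 # -1, 0, 1 -> 0, 1, 2
one_hot = np.eye(3)[class_index]         # assumed layout of the rows of y_train_nn

# Pretend softmax output for four tweets (made-up numbers, each row sums to 1)
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.3, 0.5],
                   [0.4, 0.5, 0.1]])
predicted = y_prob.argmax(axis=-1) - 1   # most likely class index, shifted back to -1, 0, 1
print(one_hot)
print(predicted)
```

The last tweet shows a misclassification: its true label is -1, but the highest probability sits in the neutral column.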

In [None]:
combined_reports_df_nn=pd.concat(reports_nn)
combined_reports_df_nn.reset_index(drop=False, inplace=True)


In [None]:
combined_reports_df_nn

Unnamed: 0,Model Name,Accuracy,precision (-1),precision (0),precision (1),precision (macro avg),precision (weighted avg),recall (-1),recall (0),recall (1),recall (macro avg),recall (weighted avg),f1-score (-1),f1-score (0),f1-score (1),f1-score (macro avg),f1-score (weighted avg),support (-1),support (0),support (1)
0,Simple NN W2V,0.567623,0.85567,0.324779,0.471609,0.550686,0.690301,0.527263,0.632759,0.651416,0.603813,0.567623,0.652473,0.42924,0.547118,0.542944,0.591738,1889.0,580.0,459.0
1,CNN NN W2v,0.585383,0.843653,0.316476,0.57561,0.57858,0.697207,0.577025,0.668966,0.514161,0.586717,0.585383,0.685319,0.429679,0.543153,0.552717,0.612394,1889.0,580.0,459.0
2,LSTM NN W2V,0.575137,0.840383,0.319073,0.493671,0.551042,0.682767,0.557438,0.617241,0.594771,0.589817,0.575137,0.670274,0.420682,0.539526,0.543494,0.600336,1889.0,580.0,459.0
3,Simple NN GloVe,0.606557,0.843703,0.336833,0.59743,0.592655,0.704692,0.588671,0.663793,0.607843,0.620102,0.606557,0.693483,0.446895,0.602592,0.58099,0.630389,1889.0,580.0,459.0
4,CNN NN GloVe,0.604167,0.874689,0.330196,0.65625,0.620378,0.732589,0.557967,0.725862,0.640523,0.641451,0.604167,0.681319,0.453908,0.648291,0.594506,0.631094,1889.0,580.0,459.0
5,LSTM NN GloVe,0.594604,0.872483,0.358646,0.48056,0.570563,0.70926,0.550556,0.675862,0.673203,0.633207,0.594604,0.675105,0.468619,0.560799,0.568174,0.616284,1889.0,580.0,459.0


In [None]:
# @title
melted_df = combined_reports_df_nn.melt(id_vars=['Model Name'],
                                     value_vars=['recall (-1)', 'recall (0)', 'recall (1)'],
                                     var_name='Class', value_name='Recall')

fig = px.bar(melted_df, x='Model Name', y='Recall', color='Class', barmode='group',
             title='Recall for Different Classes NN',
             labels={'Model Name': 'Model', 'Recall': 'Recall', 'Class': 'Sentiment Class'},
             color_discrete_sequence=['red', 'yellow', 'blue'])

fig.show(renderer='colab')

In [None]:
# @title
combined_reports_df_nn['average_recall_pos_nue'] = (combined_reports_df_nn['recall (0)'] + combined_reports_df_nn['recall (1)']) / 2

sorted_df = combined_reports_df_nn.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Minority Classes',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

fig.show(renderer='colab')


In [None]:
# @title
source = ColumnDataSource(data=combined_reports_df_nn)
# output_notebook()

fig2 = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red','Precision vs. Recall (Negative)')
fig3 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green','Precision vs. Recall (Neutral)')
fig4 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Neutral', 'blue','Precision vs. Recall (Positive)')

grid = gridplot([[fig2, fig3, fig4]])
show(grid, notebook_handle=True)

In [None]:
# @title

source = ColumnDataSource(data=combined_reports_df_nn)
output_notebook()


fig4 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
fig4.xaxis.axis_label = 'Macro Average Recall'
fig4.yaxis.axis_label = 'Macro Average Precision'



fig5 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
fig5.xaxis.axis_label = 'Weighted Average Recall'
fig5.yaxis.axis_label = 'Weighted Average Precision'

grid = gridplot([[fig4,fig5]])
show(grid, notebook_handle=True)


### Evaluating Neural Network Models

#### **GloVe vs Word2Vec**
For all models except SNN, GloVe embeddings produced better results. For our SNN, the Word2Vec embeddings had an average minority-class recall about 0.6 percentage points higher than GloVe. For CNN and LSTM, however, the GloVe embeddings had an average minority-class recall roughly 9 and 7 percentage points higher, respectively, than the Word2Vec embeddings.
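
The comparison rests on the average of the neutral and positive recalls. As a quick check, the sketch below recomputes that average from the recall values in the results table above; the dictionary layout is just a convenient stand-in for the DataFrame.

```python
recalls = {  # (recall for neutral, recall for positive), copied from the results table
    "Simple NN W2V": (0.632759, 0.651416),
    "CNN NN W2v": (0.668966, 0.514161),
    "LSTM NN W2V": (0.617241, 0.594771),
    "Simple NN GloVe": (0.663793, 0.607843),
    "CNN NN GloVe": (0.725862, 0.640523),
    "LSTM NN GloVe": (0.675862, 0.673203),
}

# average recall over the two minority classes
avg = {name: (neu + pos) / 2 for name, (neu, pos) in recalls.items()}

for name, value in sorted(avg.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {value:.3f}")
```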

**Why is it that GloVe embeddings result in better performance?**

Our GloVe embeddings performed better than the Word2Vec embeddings. Why could this be?
- Limited training size:
  - We trained our Word2Vec model on a data set of limited size. It is likely that the model did not have enough data to accurately recognize patterns between words.
  - The GloVe embeddings come pre-trained on an enormous corpus, so the word embeddings are likely more representative of the words.
- Semantic Relationships in global context
  - GloVe embeddings are built from word co-occurrence counts over the whole corpus, which helps them capture semantic relationships between words in a global context.




#### **Ranking Model Performance**
We will rank our models using the embeddings that lead to the highest minority recall. It is important to note that our models performed very similarly. Let us understand why our models performed the way they did.

1) CNN (68%)

<details>
<summary>Click to expand</summary>

- Local Pattern Recognition: CNNs are well suited to detecting local patterns and features in data because they look at data in windows. These local patterns and combinations of words or phrases could inform us about specific sentiments.
- Not capturing noise: Since CNNs focus primarily on local patterns in the data, they are less likely to capture irrelevant patterns than models like LSTM, which can remember these patterns for a while.

</details>

2) LSTM (67%)

<details>
<summary>Click to expand</summary>

- Memory: An LSTM's memory cells allow the model to retain information over longer sequences. LSTMs can remember important information from earlier in a tweet, which can help with understanding the entire tweet. While this can make them prone to overfitting, it can often be advantageous in text analysis problems, where we want to retain bits of information throughout the text.
- Combining Information: Since LSTMs are able to retain memory, they can combine important information from earlier in a tweet with later information to make a more holistic decision.

</details>

3) SNN (64%)

<details>
<summary>Click to expand</summary>

- While SNNs are not always the first choice for sentiment analysis, it is possible that the quality of the word embeddings, in both Word2Vec and GloVe, allowed the model to learn the complex relationships between words and the sentiment classes.

</details>

It is also interesting to note that CNN with GloVe had not only the highest macro average recall but also the highest macro average precision among the NN models. This differs from our best performer among the non-NN models: logistic regression had the highest macro average recall but the lowest macro average precision.





### Evaluating All Performance Metrics for all Models

In [None]:
# combining all model performances
all_combined_models=pd.concat([combined_reports_df, combined_reports_df_nn])

In [None]:
all_combined_models
all_combined_models= all_combined_models.reset_index(drop=True)


In [None]:
all_combined_models

Unnamed: 0,Model Name,Accuracy,precision (-1),precision (0),precision (1),precision (macro avg),precision (weighted avg),recall (-1),recall (0),recall (1),...,recall (weighted avg),f1-score (-1),f1-score (0),f1-score (1),f1-score (macro avg),f1-score (weighted avg),support (-1),support (0),support (1),average_recall_pos_nue
0,Logistic Regression CV,0.753757,0.916347,0.509615,0.656604,0.694189,0.795061,0.759661,0.731034,0.75817,...,0.753757,0.83068,0.600567,0.703741,0.711663,0.765198,1889.0,580.0,459.0,0.744602
1,Logistic Regression TFIDF,0.759563,0.894674,0.517473,0.670565,0.694237,0.784823,0.791424,0.663793,0.749455,...,0.759563,0.839888,0.581571,0.707819,0.709759,0.768015,1889.0,580.0,459.0,0.706624
2,Multinomial Naive Bayes CV,0.773224,0.827215,0.58351,0.723301,0.711342,0.76265,0.894653,0.475862,0.649237,...,0.773224,0.859613,0.524217,0.684271,0.689367,0.765688,1889.0,580.0,459.0,0.56255
3,Multinomial Naive Bayes TFIDF,0.757855,0.763836,0.666667,0.806818,0.745774,0.751326,0.95712,0.341379,0.464052,...,0.757855,0.849624,0.451539,0.589212,0.630125,0.729946,1889.0,580.0,459.0,0.402716
4,Random Forest CV,0.767418,0.814726,0.582796,0.735065,0.710862,0.756295,0.896241,0.467241,0.616558,...,0.767418,0.853542,0.51866,0.670616,0.680939,0.75853,1889.0,580.0,459.0,0.5419
5,Random Forest TFIDF,0.775273,0.804979,0.634328,0.753501,0.730936,0.763106,0.924299,0.439655,0.586057,...,0.775273,0.860522,0.519348,0.659314,0.679728,0.761398,1889.0,580.0,459.0,0.512856
6,SVC CV,0.794057,0.841066,0.62768,0.768638,0.745795,0.787443,0.902065,0.555172,0.651416,...,0.794057,0.870498,0.589204,0.705189,0.72163,0.788863,1889.0,580.0,459.0,0.603294
7,SVC TFIDF,0.786885,0.832113,0.613588,0.759804,0.735168,0.777491,0.902594,0.498276,0.675381,...,0.786885,0.865922,0.549952,0.71511,0.710328,0.779691,1889.0,580.0,459.0,0.586829
8,XGB CV,0.790642,0.826775,0.635762,0.777494,0.746677,0.781212,0.912123,0.496552,0.662309,...,0.790642,0.867355,0.557599,0.715294,0.713416,0.782159,1889.0,580.0,459.0,0.579431
9,XGB TFIDF,0.787568,0.833986,0.617886,0.758883,0.736919,0.779406,0.901535,0.524138,0.651416,...,0.787568,0.866446,0.567164,0.701055,0.711555,0.781235,1889.0,580.0,459.0,0.587777


In [None]:
# @title
sorted_df = all_combined_models.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Different Models After Tuning',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

# Show the plot
fig.show(renderer='colab')


In [None]:
# @title
source = ColumnDataSource(data=all_combined_models)
output_notebook()

fig2 = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red','Precision vs. Recall (Negative)')
fig3 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green','Precision vs. Recall (Neutral)')
fig4 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Neutral', 'blue','Precision vs. Recall (Positive)')

grid = gridplot([[fig2, fig3, fig4]])
show(grid, notebook_handle=True)

In [None]:
# @title
source = ColumnDataSource(data=all_combined_models)
output_notebook()


fig5 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
fig5.xaxis.axis_label = 'Macro Average Recall'
fig5.yaxis.axis_label = 'Macro Average Precision'



fig6 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
fig6.xaxis.axis_label = 'Weighted Average Recall'
fig6.yaxis.axis_label = 'Weighted Average Precision'

grid = gridplot([[fig5,fig6]])
show(grid, notebook_handle=True)

Interesting observations on Macro Precision vs. Recall

<details>
  <summary>Click to Expand</summary>


- While our NNs have a higher minority recall, they score quite low on the Macro Average Recall (MAR). Why is this?
  - The main factor dragging down the MAR score is the low recall in the negative class. Compared to the other models, this recall is much lower. Why is this?
  - To provide more balanced data to our NNs, we used SMOTE, which oversampled the minority classes. This created an equal number of instances of every class.
  - In the non-NN models, we did not use SMOTE. Their training data was dominated by the negative class, so they were more prone to predicting that a tweet was negative. This could explain the relatively low negative-class recall of the NNs compared to the other models.

- Similar recall in all sentiment classes: The NNs provide similar recall in each of the classes. The non-NN models generally had extremely high recall in the negative class (90%) and very low recall in the minority classes (40-50%). The NNs, on the other hand, had more balanced recall scores (55% in the negative class, 72% in the neutral class, and 60% in the positive class). In the context of our problem, we prefer a model that has higher recall in the minority classes.


- Why do NNs have lower precision?
  - We also note that NNs have lower macro average precision. Why is this?
  - While NNs have higher recall in the minority classes, they also have lower precision in these classes (precision-recall tradeoff). Our model may be more likely to classify an observation into a minority class even if the observation does not belong there.

</details>
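
To see how the negative class pulls the macro average down, the sketch below recomputes macro average recall (the plain mean of the three per-class recalls) and the minority-class average for one NN and one non-NN model. The recall values are copied from the tables above; the choice of these two models is only for illustration.

```python
per_class_recall = {  # (negative, neutral, positive), copied from the results tables
    "CNN NN GloVe": (0.557967, 0.725862, 0.640523),
    "SVC CV": (0.902065, 0.555172, 0.651416),
}

summary = {}
for name, (neg, neu, pos) in per_class_recall.items():
    summary[name] = {
        "macro": (neg + neu + pos) / 3,   # every class counts equally
        "minority": (neu + pos) / 2,      # neutral and positive only
    }
    print(f"{name:13s} macro={summary[name]['macro']:.3f} minority={summary[name]['minority']:.3f}")
```

The CNN comes out ahead on the minority average but behind on the macro average, because its negative-class recall is far lower than the SVC's.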




### Main takeaways

**Which text vectorization techniques are most effective?**

<details>
  <summary>Click to Expand</summary>

- CV outperformed TFIDF in almost all non-NN models. This is likely because CV keeps the raw count of every word, while TFIDF downweights words that appear in many tweets, even when those words carry sentiment.
- GloVe outperformed Word2Vec in almost all models. GloVe was pre-trained on a much larger corpus, so more complex relationships could have been learned. GloVe also looks at the global context of the word instead of the word in the context of a specific window (Word2Vec).

</details>
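
The downweighting effect can be illustrated with a few made-up tweets. The sketch below is a simplified, pure-Python version of the two weightings: it uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, but skips the final normalization step, and the tweets and helper name `smoothed_idf` are assumptions made for the example.

```python
import math
from collections import Counter

tweets = [  # made-up examples
    "flight delayed again",
    "delayed two hours",
    "delayed and no refund",
    "great crew",
]
docs = [t.split() for t in tweets]
n_docs = len(docs)

# number of tweets each word appears in
doc_freq = Counter(word for doc in docs for word in set(doc))

def smoothed_idf(word):
    # smoothed inverse document frequency: frequent words get a smaller weight
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

third = Counter(docs[2])  # bag-of-words counts for the third tweet
for word in ("delayed", "refund"):
    print(word, third[word], round(third[word] * smoothed_idf(word), 3))
```

Both words have a count of 1 in the third tweet, so CV treats them equally, while TFIDF gives the common but clearly negative word "delayed" a smaller weight than "refund".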


**Which models are best for sentiment classification?**

<details>
  <summary>Click to Expand</summary>

Model Performance Ranked (using the best vectorizer for each)


1) Logistic Regression (CV)
2) CNN (GloVe)
3) LSTM (GloVe)
4) SNN (GloVe or Word2Vec)
5) SVC (CV)
6) XGB (TFIDF)
7) MNB (CV)
8) RF (CV)

- Logistic regression using both TFIDF and CV outperforms all other models. Why could this be? As mentioned earlier, LR is a linear model that assumes a linear relationship between the input features and the log-odds of each class. If our sentiment classes are close to linearly separable, which might be the case given how well LR performed, then LR's simple linear decision boundary might prove most effective.
  - Moreover, the soft decision boundary lets the model express how likely each class is for a given tweet, so it can acknowledge that a tweet has characteristics of several classes. This can reduce the bias of simply assigning many tweets to the majority class.
- NNs are fierce competitors. This is no surprise. The rich word embeddings likely gave them a competitive edge. Moreover, NNs specialize in capturing sequential dependencies and local patterns, making them better at picking up the order and context of words.
- Weak tree-based classifiers:
  - XGBoost and Random Forest may have underperformed because there were not enough samples from the minority classes to build representative trees.
- Multinomial Naive Bayes: MNB likely underperformed due to its assumption that, given the class label, the presence of a word is independent of the other words in the tweet. This is likely not the case in sentiment analysis (SA), where word order and context are critical.
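
Two of the points above can be illustrated with a small sketch: the soft class probabilities that logistic regression produces, and the fact that a bag-of-words model cannot see word order. The training tweets are invented, and the pipeline settings (`class_weight="balanced"`, `max_iter=1000`) are assumptions for this example rather than the exact configuration used earlier in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training tweets, two per class.
train_tweets = [
    "terrible delay and rude staff",
    "lost my bag terrible service",
    "flight leaves at noon",
    "what time is boarding",
    "great crew and smooth flight",
    "thank you for the great service",
]
train_labels = ["negative", "negative", "neutral", "neutral", "positive", "positive"]

model = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(train_tweets, train_labels)

# A tweet with mixed sentiment, and the same words in a different order.
mixed = "great crew but terrible delay"
shuffled = "terrible crew but great delay"

proba_mixed = model.predict_proba([mixed])[0]
proba_shuffled = model.predict_proba([shuffled])[0]

# Soft decision: one probability per class, summing to 1.
for label, p in zip(model[-1].classes_, proba_mixed):
    print(f"{label}: {p:.2f}")

# Both tweets contain the same bag of words, so the model cannot tell them apart.
print("Same probabilities after reordering:", (proba_mixed == proba_shuffled).all())
```

The reordered tweet receives the same probabilities as the original, which is the limitation that sequence-aware models such as LSTMs and CNNs are meant to address.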


Overall, logistic regression and NNs are the best classifiers.

It is important to remember that intensive hyperparameter tuning may completely change the ranking of the models. However, this notebook is meant to give an idea of how many popular models approach sentiment classification.


</details>




**Why did neural networks outperform most non NN models?**

<details>
  <summary>Click to Expand</summary>

- Quality of embeddings: Making a direct comparison between NN models and non NN models can be difficult because we used different vectorization techniques for NN models and non NN models. It is possible that the quality of the word embedding we used in our NN models led to better results. GloVe and Word2Vec both look at the semantic relationship between words while CV and TFIDF do not.
- Handling imbalanced data: In our traditional models, we passed in class_weight='balanced' to handle the imbalanced data, which reweights the loss function to penalize misclassifications of the minority classes more heavily.
  - However, in our NNs, we used an oversampling technique called SMOTE, which creates synthetic instances of the minority classes. We did this mostly because we wanted more training data, as NNs can perform poorly when they are trained on little data. Still, this difference in handling could have affected the comparison.
- NN and sequential learning: In text analysis, the order and context of words matter (ex: "Max loves Maya" is not the same as "Maya loves Max"). NNs, especially LSTMs and CNNs, are designed to capture sequential dependencies and local patterns in text data, which helps them capture the order and context of words. This can make them better suited for text classification than non NN models trained on CV or TFIDF features, which discard word order.
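
As a rough illustration of what the 'balanced' option does, the sketch below computes the per-class weights with scikit-learn's `compute_class_weight` and checks them against the formula `n_samples / (n_classes * class_count)`. The label counts are hypothetical and only mimic an imbalanced dataset. SMOTE is not shown here because it lives in the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 6 negative, 3 neutral, and 1 positive tweet.
y = np.array(["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1)
classes = np.array(["negative", "neutral", "positive"])

# Weights that the 'balanced' option would assign to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand: n_samples / (n_classes * class_count).
class_counts = np.array([(y == c).sum() for c in classes])
manual_weights = len(y) / (len(classes) * class_counts)

for label, w in zip(classes, weights):
    print(f"{label}: weight={w:.3f}")
```

The rarest class gets the largest weight, so each misclassified minority-class tweet contributes more to the loss than a misclassified majority-class tweet.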


</details>


**Application**


<details>
  <summary>Click to Expand</summary>

Our best sentiment classifier has 75% macro average recall and 70% precision. This is not bad, especially given how imbalanced our sentiment classes were. This logistic regression SA classification model can be used when brands want to quickly gauge public sentiment towards their company. The classified tweets can then be filtered by class so that brands can draw insights from recurrent trends in each class.
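
A minimal sketch of the filtering step might look like the following. The DataFrame, its column names (`text`, `predicted_sentiment`), and the helper `top_terms` are all hypothetical; in practice the predictions would come from the fitted classifier and the ignore list would be a proper stop word list.

```python
from collections import Counter

import pandas as pd

# Hypothetical classified tweets.
df = pd.DataFrame(
    {
        "text": [
            "delayed flight and lost bag",
            "flight delayed three hours",
            "friendly crew great flight",
            "what time does boarding start",
        ],
        "predicted_sentiment": ["negative", "negative", "positive", "neutral"],
    }
)


def top_terms(frame, sentiment, n=3, ignore=("and", "flight")):
    """Return the n most common words among tweets predicted as `sentiment`."""
    texts = frame.loc[frame["predicted_sentiment"] == sentiment, "text"]
    words = [word for text in texts for word in text.split() if word not in ignore]
    return Counter(words).most_common(n)


negative_terms = top_terms(df, "negative")
print(negative_terms)
```

Recurring terms in the negative class (here, "delayed") point to the areas where an airline may need to improve.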



</details>

