# Objective: Discern Customer Sentiment Towards 6 U.S. Airlines Using Tweets
Exploring different text vectorization and modeling techniques for sentiment classification

### Outline: 
- Loading Data and Basic Exploratory Data Analysis 
- Text preprocessing 
- Text vectorization 
- Modeling using classical statistical models 
- Switching to neural networks 
    - Word2Vec word embeddings
    - GloVe word embeddings
    - Neural Network Models 
- Evaluating performance metrics for all classifiers

In [1]:
# importing libraries 
import nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

import seaborn as sns
from nltk.corpus import stopwords
import string
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
import contractions


from bs4 import BeautifulSoup
import re
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)


plt.style.use('ggplot')


## Loading Data and Basic EDA

In [2]:
tweets=pd.read_csv('Tweets.csv')

In [3]:
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
#14,640 tweets, 15 features
tweets.shape

(14640, 15)

In [5]:
init_notebook_mode(connected=True)
%matplotlib inline

value_counts = tweets.airline_sentiment.value_counts()

# Convert the value counts Series to a DataFrame
value_counts_df = value_counts.reset_index()
value_counts_df.columns = ['Sentiment', 'Count']

fig=px.bar(value_counts_df, x='Sentiment', y='Count',title='Sentiment Distribution',color_discrete_sequence=['red'])
fig.show()
init_notebook_mode(connected=True)



The sentiment classes are imbalanced: the majority of tweets are negative. We will have to take this into account when modeling.

In [6]:
init_notebook_mode(connected=True)

%matplotlib inline
value_counts = tweets.airline.value_counts()

# Convert the value counts Series to a DataFrame
value_counts_df = value_counts.reset_index()
value_counts_df.columns = ['Airline', 'Count']
value_counts_df
fig2=px.bar(value_counts_df, x='Airline', y='Count',title='Airline Distribution',color_discrete_sequence=['blue'])
fig2.show()

In [7]:
%matplotlib inline
init_notebook_mode(connected=True)

# Map each sentiment label to a fixed bar color
color_map = {'negative': 'red', 'neutral': 'yellow', 'positive': 'blue'}

fig3 = px.histogram(tweets, x='airline', color='airline_sentiment', title='Sentiment Distribution by Airline',
              labels={'airline_sentiment': 'Sentiment'}, barmode='group',color_discrete_map=color_map)

# Adjust the bar opacity here
fig3.update_traces(marker=dict(opacity=0.7))

fig3.show()


For all airlines except Virgin America, the ratio of neutral, positive, and negative tweets is imbalanced. Negative tweets form the largest group overall. 

In [8]:
# We create a new column mapping negative, neutral, and positive tweets to -1, 0, and 1, respectively. This will help in later model building. 
tweets['Sentiment']=tweets.airline_sentiment.apply(lambda x: 1 if x=='positive' else 0 if x=='neutral'else -1 if x=='negative' else None)

In [9]:
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),1
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-1
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-1


## Text Preprocessing 
- What is the purpose of text preprocessing? 
- Feeding in cleaner data should yield better results in our models 
- We can think of this as "normalizing" our text data
- Note that different models may perform better with or without certain preprocessing steps 
- For example, neural networks often perform better when words are not stemmed, as they can learn complex patterns directly from raw text 
    - Stemming reduces a word to its root form (running --> run)
- However, since we will begin with classical machine learning models, we will thoroughly preprocess the text before passing it to our models 

### Building a preprocessing function for traditional models
- clean any HTML tags (e.g. break tags such as <br>)
- remove handles (e.g. @catsarecool)
- remove website URLs
- remove numbers and tokens that mix letters with digits 
- strip punctuation 
- lower case all words
- remove stop words 
- stem words to their "root" 

In [10]:
# Removing handles (e.g. @catsarecool)
def remove_tags(text):
    regex=re.compile('(@[A-Za-z0-9]+)')
    text=re.sub(regex,'',text)
    return text
# Removing HTML tags (e.g. <br>)
def remove_html(text):
    regex=re.compile('<.*?>')
    cleantext=re.sub(regex,'',text)
    # return the cleaned string, not the untouched input
    return cleantext

In [11]:
# Expanding contractions (e.g. "isn't" -> "is not")
def decontracted(text):
    return(contractions.fix(text))


In [12]:
# cleaning punctuation
def remove_punc(sentence):
    for i in sentence: 
        if i in string.punctuation: 
            sentence=sentence.replace(i,"")
            
    return (sentence)

In [13]:
# getting stopwords
# Stopwords are words that show up commonly in the English language but add little semantic value to text
stop_words=set(stopwords.words("english"))
print(stop_words)

# removing stop words; returns the remaining tokens as a list
from nltk.tokenize import word_tokenize
def remove_stops(sentence): 
    filtered_sent=[]
    sent_tokens=word_tokenize(sentence)
    for word in sent_tokens: 
        if word not in stop_words:
            filtered_sent.append(word)
    return(filtered_sent)

{'both', 'when', 'below', 'between', "couldn't", 'will', 'up', 'not', 'having', 'm', 'was', "should've", "wasn't", 'itself', 'hadn', 'of', 'very', 'do', 'been', 'weren', 'few', 'more', 'same', 'by', 'out', 'is', 'himself', 'about', "hasn't", 'doesn', "doesn't", 'with', "isn't", 'there', 'ourselves', "she's", 'hers', 'theirs', 't', 'who', 'we', 'why', 'she', 'nor', 'needn', 'themselves', 'into', 've', 'being', "mustn't", 'at', 'mightn', 'those', 'haven', 'over', 'any', 'i', 'above', 'just', 'your', "weren't", 'wouldn', 'this', 'what', "shan't", 'me', 'on', "hadn't", 'during', 'd', "wouldn't", 'its', 'for', 'isn', 'am', 'as', "that'll", 'where', 'are', 'down', "haven't", 'if', 'my', 'her', 'shouldn', 'our', 'from', 'won', 'll', 'against', 'wasn', 'yourselves', 'or', 're', 'shan', "didn't", 'in', "you'll", 'which', 'couldn', 'through', 'them', 'didn', 'whom', 'herself', 'be', 'it', 'now', 'ain', 'you', 'aren', "shouldn't", 'that', "you've", "you'd", 'so', 'before', 'a', 'to', 'they', 'all

In [14]:
# stemming words
# Stemming reduces words to their base form. This allows our model to group together variations of the same word. 
from nltk.stem.snowball import SnowballStemmer 
snow_stemmer=SnowballStemmer(language='english')

# takes a list of tokens and returns the stemmed tokens joined into one string
def stemmed_sent(tokens): 
    stemmed_tokens=[]
    for token in tokens: 
        stemmed_tokens.append(snow_stemmer.stem(token))
    return ' '.join(stemmed_tokens)

In [15]:
#Putting it all together
def preprocessor(text):
    # removes html tags; e.g. <br>
    text=remove_html(text)
    # removes @ handles; e.g. @catsrcool
    text=remove_tags(text)
    # removes website urls 
    text=re.sub(r"http\S+","",text)
    # expands contractions; e.g. "don't" -> "do not"
    text=decontracted(text)
    # removes numbers and any token that mixes letters with digits
    text=re.sub(r"\S*\d\S*","",text)
    # replaces every run of non-letter characters with a single space;
    # [^A-Za-z]+ matches one or more characters that are NOT a-z or A-Z (the ^ inside the brackets negates the set),
    # so this step also strips punctuation
    text=re.sub('[^A-Za-z]+',' ',text)
    # removes extra spaces 
    text=re.sub(' +',' ',text)
    # cleans any remaining punctuation
    text=remove_punc(text)
    # lower cases everything
    text=text.lower()
    # removes stop words (returns a list of tokens)
    text=remove_stops(text)
    # stems each token and joins the tokens back into one string
    text=stemmed_sent(text)
    
    return(text)

### Let's see how our preprocessor works 


In [16]:
# choosing a random tweet 
import random
random.seed(42)
# randint includes both endpoints, so the upper bound must be the last valid row index
rand=random.randint(0, len(tweets)-1)
exm_tweet=tweets.text[rand]
print('Unprocessed Tweet:',exm_tweet)

print('Processed Tweet:', preprocessor(exm_tweet))


Unprocessed Tweet: @USAirways AND my rebooked flt isn't until Monday??  AND I don't get a voucher for a hotel?!  Never again, US airways.
Processed Tweet: rebook flt monday get voucher hotel never us airway


Our preprocessor has successfully removed tags, punctuation, and stopwords. It has also lower cased and stemmed all words and expanded contractions. 

In [17]:
# applying preprocessor function to all tweets and saving preprocessed tweets in new column in data frame 
tweets['preprocessed_tweets']=tweets.text.apply(lambda x: preprocessor(x) )
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment,preprocessed_tweets
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0,said
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),1,plus ad commerci experi tacki
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0,today must mean need take anoth trip
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-1,realli aggress blast obnoxi entertain guest fa...
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-1,realli big bad thing


## Text Vectorization Intuition
Many models cannot handle text data as it is. Therefore, we must convert the words into vectors of numbers that the model can interpret. There are several ways to do this. 
For traditional machine learning models, we will use Bag of Words and TF-IDF vectorization. Later, for the neural networks, we will discuss alternative word vectorization options. 

### Bag of Words Vectorization
- Each unique word is represented as a feature (column), and each tweet is represented as a row 
- In the binary variant, we put a 1 if the word is present in the tweet and a 0 if it is not (one-hot encoding for text); the count variant stores how many times the word occurs instead
- Let's take a simple corpus with three sentences that we would like to vectorize:

In [18]:
corpus= ['cats are cool',
        'dogs are cool',
        'animals are the coolest']

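We can use the CountVectorizer class from sklearn to produce a data frame with the sentences in the corpus as rows and the unique words in the corpus as columns. By default CountVectorizer stores word counts; because no word repeats within a sentence here, every entry is 0 or 1, which matches the one-hot encoding. 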

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# fit_transform learns the vocabulary (the unique words in the corpus) and returns one count vector per sentence
cv_exmp=cv.fit_transform(corpus)
cv_exmp.toarray()

# these are the unique vocab words in the corpus 
sorted(cv.vocabulary_.keys())
new_index_values = ['Sent 1', 'Sent 2', 'Sent3']

cv.get_feature_names_out()
df=pd.DataFrame((cv_exmp).toarray(),columns=(cv.get_feature_names_out()))
df.index=new_index_values
df

Unnamed: 0,animals,are,cats,cool,coolest,dogs,the
Sent 1,0,1,1,1,0,0,0
Sent 2,0,1,0,1,0,1,0
Sent3,1,1,0,0,1,0,1


##### Limitations of binary bag of words: 
- word order is not preserved 
- we do not know semantic relationships between words
- we mark only whether the word appeared or not in the sentence, not the number of instances the word appeared
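
The last limitation is specific to the binary variant. The sketch below is a minimal pure-Python illustration, not scikit-learn's implementation: the helper names `build_vocabulary` and `vectorize` and the second sentence with repeated words are made up for this example. It shows how a count vector keeps the number of occurrences while a binary vector does not. In scikit-learn, the same switch is the `binary=True` option of `CountVectorizer`.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of unique whitespace-separated tokens."""
    return sorted({token for doc in docs for token in doc.split()})

def vectorize(doc, vocabulary, binary=False):
    """Map one document to counts (or 0/1 flags) over the vocabulary."""
    counts = Counter(doc.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Hypothetical sentences; the second one repeats words on purpose.
docs = ['cats are cool', 'cool cats are cool cats']
vocab = build_vocabulary(docs)

count_vector = vectorize(docs[1], vocab)                # keeps how often each word occurs
binary_vector = vectorize(docs[1], vocab, binary=True)  # keeps only presence or absence

print(vocab)
print('counts:', count_vector)
print('binary:', binary_vector)
```

Both vectors still discard word order, so the first two limitations apply to either variant.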

### TF-IDF
- TF-IDF, term frequency-inverse document frequency, is another text vectorization tool 
- The basic idea: the more times a given term appears in a document (a particular sentence), the more important the word is to understanding the document 
- At the same time, terms that appear in almost every document are likely not important to understanding a specific document 
- TF-IDF factors in both of these concerns

#### Steps:

1) calculate term frequency for each word in document (sentence): 
- idea behind term frequency: the more often a word appears in a document, the more important it is to understanding that document 
- term frequency = (number of times a word appeared in a document)/ (number of words in the document)

2) calculate inverse document frequency: 
- idea of IDF: words that appear in most documents are unlikely to provide much information about a particular document
- inverse document frequency = log(number of documents (sentences) in the corpus / number of documents containing a particular word)
   
3) Multiply the term frequency of each word in the sentence by its corresponding inverse document frequency. Do this for all sentences.

We will use the same corpus as before

In [20]:
corpus

['cats are cool', 'dogs are cool', 'animals are the coolest']

In [21]:
# the set-based one-liner below returns the words in an arbitrary order, so we list them explicitly instead
# unique_words = list(set(' '.join(corpus).split()))
# unique_words
unique_words=['are','coolest','cool','cats','dogs','the','animals']


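- Term Frequency: the number of times a specific term appears in a document / the number of terms in the document. Note that the hand-built table below uses 7, the number of unique words in the corpus, as the denominator for every sentence; dividing by each sentence's own length (3, 3, and 4 words) would give 1/3, 1/3, and 1/4 instead. 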


In [22]:
# Step 1-- calculate term frequency
TF= pd.DataFrame({
          'word':unique_words,
           'TF Sent 1': ['1/7','0','1/7','1/7','0','0','0'],
            'TF Sent 2':['1/7','0','1/7','0','1/7','0','0'],
            'TF Sent 3':['1/7','1/7','0','0','0','1/7','1/7']
        })

TF


Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3
0,are,1/7,1/7,1/7
1,coolest,0,0,1/7
2,cool,1/7,1/7,0
3,cats,1/7,0,0
4,dogs,0,1/7,0
5,the,0,0,1/7
6,animals,0,0,1/7


In [23]:
# Step 2: calculate inverse document frequency
TF_IDF=TF.copy()
TF_IDF["IDF"]=['log(3/3)','log(3/1)','log(3/2)','log(3/1)','log(3/1)','log(3/1)','log(3/1)']
TF_IDF

Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3,IDF
0,are,1/7,1/7,1/7,log(3/3)
1,coolest,0,0,1/7,log(3/1)
2,cool,1/7,1/7,0,log(3/2)
3,cats,1/7,0,0,log(3/1)
4,dogs,0,1/7,0,log(3/1)
5,the,0,0,1/7,log(3/1)
6,animals,0,0,1/7,log(3/1)


Inverse document frequency measures how rare a term is across the whole corpus; rarer terms receive a higher IDF: 

For example, to calculate the IDF of the word "cats", we take the log of the number of total documents (3) divided by the number of documents containing the word cats (1). 

In [24]:
# step 3: Multiply TF matrix with IDF respectively

import math
TF_IDF['TFIDF1']=[1/7*(math.log(1,10)),0,1/7*(math.log((3/2),10)),1/7*(math.log((3/1),10)),0,0,0]
TF_IDF['TFIDF2']=[1/7*(math.log(1,10)),0,1/7*(math.log((3/2),10)),0,1/7*(math.log((3/1),10)),0,0]
TF_IDF['TFIDF3']=[1/7*(math.log(1,10)),1/7*(math.log(3,10)),0,0,0,1/7*(math.log(3,10)),1/7*(math.log(3,10))]
TF_IDF

Unnamed: 0,word,TF Sent 1,TF Sent 2,TF Sent 3,IDF,TFIDF1,TFIDF2,TFIDF3
0,are,1/7,1/7,1/7,log(3/3),0.0,0.0,0.0
1,coolest,0,0,1/7,log(3/1),0.0,0.0,0.06816
2,cool,1/7,1/7,0,log(3/2),0.025156,0.025156,0.0
3,cats,1/7,0,0,log(3/1),0.06816,0.0,0.0
4,dogs,0,1/7,0,log(3/1),0.0,0.06816,0.0
5,the,0,0,1/7,log(3/1),0.0,0.0,0.06816
6,animals,0,0,1/7,log(3/1),0.0,0.0,0.06816


In [25]:
TF_IDF_clean=TF_IDF.copy()
columns=TF_IDF['word']
TF_IDF_clean=TF_IDF_clean[['TFIDF1','TFIDF2','TFIDF3']]
TF_IDF_clean



Unnamed: 0,TFIDF1,TFIDF2,TFIDF3
0,0.0,0.0,0.0
1,0.0,0.0,0.06816
2,0.025156,0.025156,0.0
3,0.06816,0.0,0.0
4,0.0,0.06816,0.0
5,0.0,0.0,0.06816
6,0.0,0.0,0.06816


In [26]:
TF_IDF_clean=TF_IDF_clean.T
TF_IDF_clean.columns=columns
TF_IDF_clean


word,are,coolest,cool,cats,dogs,the,animals
TFIDF1,0.0,0.0,0.025156,0.06816,0.0,0.0,0.0
TFIDF2,0.0,0.0,0.025156,0.0,0.06816,0.0,0.0
TFIDF3,0.0,0.06816,0.0,0.0,0.0,0.06816,0.06816


This is how we would represent each sentence in the corpus using TFIDF scores. As we can see, words that are important to a specific document have a higher score: example "cats" for document 1, "dogs" for document 2, and "coolest" for document 3. Words that appear in all documents like "are" have the lowest scores. 

While this can be an improvement over the Bag of Words vectorizer (TF-IDF tells us more than simply whether a word is present), the TF-IDF vectorizer still fails to capture the relationships between words. Still, both Bag of Words and TF-IDF work fairly well with many classic models. 
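
The three TF-IDF steps can also be computed programmatically. The sketch below is a from-scratch illustration in plain Python, with helper names (`term_frequency`, `inverse_document_frequency`) chosen for this example. It assumes the term-frequency definition given in the steps, dividing by each sentence's own length, and a base-10 logarithm to match `math.log(x, 10)` in the notebook. Its values therefore differ from a table built with a fixed denominator of 7, and also from scikit-learn's `TfidfVectorizer`, which by default smooths the IDF and normalizes each row.

```python
import math

corpus = ['cats are cool', 'dogs are cool', 'animals are the coolest']
tokenized = [doc.split() for doc in corpus]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

def term_frequency(word, doc_tokens):
    """Step 1: occurrences of the word divided by the document length."""
    return doc_tokens.count(word) / len(doc_tokens)

def inverse_document_frequency(word):
    """Step 2: log10 of (number of documents / documents containing the word)."""
    containing = sum(1 for doc in tokenized if word in doc)
    return math.log10(n_docs / containing)

# Step 3: multiply TF by IDF for every word in every document.
tfidf = [
    {word: term_frequency(word, doc) * inverse_document_frequency(word)
     for word in vocabulary}
    for doc in tokenized
]

for i, row in enumerate(tfidf, start=1):
    nonzero = {word: round(score, 4) for word, score in row.items() if score > 0}
    print(f'Sent {i}:', nonzero)
```

The ranking of words matches the discussion above: "are" scores zero everywhere, while the words unique to one sentence score highest in that sentence.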


Before we vectorize all of our tweets, we split our data into training and testing sets. We split before vectorizing to prevent the test data from affecting the training process: the vocabulary and document frequencies should be learned from the training tweets only. 
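
To make the idea concrete, here is a minimal pure-Python sketch of the fit-on-train, transform-on-test pattern. The functions `fit_vocabulary` and `transform` and the toy sentences are hypothetical stand-ins written for this example; they only imitate the general behavior of a count vectorizer. Words that appear only in the test data never enter the vocabulary, so they cannot influence the features.

```python
def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    vocab = sorted({token for doc in train_docs for token in doc.split()})
    return {token: index for index, token in enumerate(vocab)}

def transform(docs, vocabulary):
    """Count known words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

# Hypothetical toy data: 'zurich' appears only in the test document.
train_docs = ['flight delay', 'great flight crew']
test_docs = ['great delay zurich']

vocabulary = fit_vocabulary(train_docs)
X_train_toy = transform(train_docs, vocabulary)
X_test_toy = transform(test_docs, vocabulary)

print(sorted(vocabulary))
print(X_train_toy)
print(X_test_toy)
```

The test matrix has the same number of columns as the training matrix, which is what a model fitted on the training features expects.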


## Splitting Data into training and testing

In [27]:
from sklearn.model_selection import train_test_split
X=tweets.preprocessed_tweets
y=tweets.Sentiment
# holding out 20 percent of the data from the training process to evaluate the models' performance
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [28]:
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

11712
11712
2928
2928


In [29]:
# applying Bag of Words models discussed earlier to our tweets
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# fit_transform learns vocab from X_train and then creates numeric vector for each tweet using CV
X_train_cv=cv.fit_transform(X_train)
# transform takes learned vocab (fitted on X_train) and applies it to the test data
X_test_cv=cv.transform(X_test)


In [30]:
print(X_train_cv.shape)
print(X_test_cv.shape)
# there are 7322 unique words

(11712, 7322)
(2928, 7322)


In [31]:
# listing the vocabulary learned from the training tweets
cv.get_feature_names_out()


array(['aa', 'aaaand', 'aaadvantag', ..., 'zrh', 'zuke', 'zurich'],
      dtype=object)

In [32]:
#let's see what this looks like in the training data
# again, each row represents a tweet. Each entry is the number of times the word appears in that tweet (0 if it is absent). 
pd.DataFrame((X_train_cv).toarray(),columns=(cv.get_feature_names_out())).head()

Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# seeing what the vectors look like for the test data
pd.DataFrame((X_test_cv).toarray(),columns=((sorted(cv.vocabulary_.keys())))).head()

Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### TFIDF

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [35]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)
# 7322 unique words, the same vocabulary size as the count vectorizer

(11712, 7322)
(2928, 7322)


In [36]:
words=vectorizer.get_feature_names_out()

tfidf_vec_train_df=pd.DataFrame(X_train_tfidf.toarray(),columns=words)
print(tfidf_vec_train_df.shape)
tfidf_vec_train_df.head()
# 11712 tweets in train data, 7322 unique words

(11712, 7322)


Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
tfidf_vec_test_df=pd.DataFrame(X_test_tfidf.toarray(),columns=words)
print(tfidf_vec_test_df.shape)
tfidf_vec_test_df.head()
# 2928 tweets in test data, 7322 unique words

(2928, 7322)


Unnamed: 0,aa,aaaand,aaadvantag,aaalwaysl,aadavantag,aadelay,aadv,aadvantag,aafail,aal,...,zfv,zig,zip,zipper,zombi,zone,zoom,zrh,zuke,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modeling with Traditional Machine Learning Models
- Managing class imbalances 
- Deciding on evaluation metrics 
- Evaluating models 

In [38]:
# importing models 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score

We will explore how well five models classify our tweets. 
-  Logistic Regression
-  Multinomial Naive Bayes 
-  Random Forest Classifier
-  Support Vector Classifier 
-  XGBoost classifier 


### Addressing class imbalances
Before we begin modeling and choosing our metrics, we must address how imbalanced our data is. 

In [39]:
init_notebook_mode(connected=True)

fig.show()

- The five models offer different ways to adjust for class imbalance. 
- The scikit-learn classifiers used here (logistic regression, random forest, and the support vector classifier) accept class_weight='balanced'. This automatically adjusts the weights assigned to the classes during training: classes that dominate the data receive smaller weights, and minority classes receive larger weights. Multinomial naive Bayes (MNB) and XGBClassifier do not take this parameter; for XGBoost, per-sample weights can instead be passed to fit through sample_weight. 


In [40]:
from sklearn.utils.class_weight import compute_class_weight
# classes is passed as a NumPy array, which recent scikit-learn versions require
class_weights = compute_class_weight(class_weight = "balanced", classes= np.array([-1,0,1]), y= y_train)
class_weights

array([0.53560159, 1.54982136, 2.05042017])
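
The "balanced" heuristic follows a simple formula: weight = n_samples / (n_classes * class_count). The sketch below reproduces the three weights by hand. The class counts are an assumption of this example: they are recovered from the class priors printed in this notebook multiplied by the 11712 training tweets, not read from the data directly.

```python
# Training-set class counts implied by the notebook's class priors
# (prior * 11712 training tweets); an assumption of this sketch.
class_counts = {-1: 7289, 0: 2519, 1: 1904}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' weight for each class: n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 4))
```

Because the class count sits in the denominator, the most frequent class always receives the smallest weight.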

- The negative class is given the lowest weight because it is heavily represented in the data, and the positive class is given the greatest weight to account for its sparsity. 
- For multinomial naive Bayes, which does not have this parameter, we can instead adjust the class priors. Class priors represent the prior probability of a tweet falling into the negative, neutral, or positive class. Since our negative class has many more observations than the others, the imbalanced class priors can impact our MNB model. 
- By changing the class priors, we can change the balance of the classes in the training process, giving more weight to the minority classes. This is what we will do. 
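
To see how priors can tip a naive Bayes decision, consider the rule "pick the class with the largest log prior + log likelihood". The sketch below is a hypothetical illustration: the log-likelihoods are invented for one ambiguous tweet, the empirical priors are rounded, and uniform priors are only one possible way to rebalance, not necessarily the choice made in this notebook. In scikit-learn, `MultinomialNB` exposes priors through its `class_prior` and `fit_prior` parameters.

```python
import math

def predict(log_likelihoods, priors):
    """Return the class with the highest log prior + log likelihood."""
    scores = {
        label: math.log(priors[label]) + log_likelihoods[label]
        for label in priors
    }
    return max(scores, key=scores.get)

# Hypothetical log P(tweet | class) values for one ambiguous tweet.
log_likelihoods = {'negative': -10.0, 'neutral': -9.6, 'positive': -11.0}

# Rounded empirical priors (negative dominates) versus uniform priors.
empirical_priors = {'negative': 0.62, 'neutral': 0.22, 'positive': 0.16}
uniform_priors = {'negative': 1 / 3, 'neutral': 1 / 3, 'positive': 1 / 3}

print('empirical priors ->', predict(log_likelihoods, empirical_priors))
print('uniform priors   ->', predict(log_likelihoods, uniform_priors))
```

With the empirical priors the majority class wins even though the likelihood slightly favors neutral; with uniform priors the likelihood alone decides.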

In [41]:
# empirical class priors, listed in sorted label order (-1, 0, 1) to match the class order scikit-learn uses
class_labels = sorted(set(y_train))
class_counts = [len(y_train[y_train == class_label]) for class_label in class_labels]
total_samples = len(y_train)
class_priors = [class_count / total_samples for class_count in class_counts]

class_priors

[0.6223531420765027, 0.2150785519125683, 0.16256830601092895]

### Evaluation Metrics
- We will be using four evaluation metrics: accuracy, precision, recall, and F1 score
- Accuracy: the number of correct predictions divided by the total number of predictions. While accuracy is a popular evaluation metric, it should not be the only metric we consider in a classification problem with highly imbalanced classes. Accuracy alone can be misleading, especially in evaluating how well our classifier recognizes observations that fall into minority classes. 
    - A simple example: suppose we have a group of observations where 90% of the observations fall into the null class and only 10% fall into the alternative class. A classifier that predicts that every observation falls into the null class has 90% accuracy. However, it is still a poor classifier, as it cannot recognize observations that fall into the minority class. 
- Looking at precision, recall, and F1 scores can give us a better understanding of a classifier's performance.
- Precision: in the case of multi-class predictions, precision is the number of true positives for a specific class divided by the number of predicted positives for that class 
    - If a classifier has high precision for a class, it is likely to be correct when it predicts that an observation falls into that class. 
- Recall: the number of true positives for a class divided by the number of actual positives for that class 
    - If a classifier has high recall, it detects the true positives of a class well. This is perhaps our most important metric, as we are interested in seeing how well our model can detect the true positives in each class. 
- F1 score: the harmonic mean of precision and recall, a way to combine both of the above metrics into one. A higher F1 score indicates a better classifier.  
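
The 90/10 example above can be checked in a few lines. This is a minimal sketch in plain Python with made-up labels ('null' and 'alt') and helper functions written for this illustration. Returning 0.0 when a class is never predicted is a convention chosen here to avoid dividing by zero.

```python
# Hypothetical data: 90 'null' observations, 10 'alt' observations,
# and a classifier that always predicts the majority class.
y_true = ['null'] * 90 + ['alt'] * 10
y_pred = ['null'] * 100

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted = sum(p == positive for p in y_pred)
    actual = sum(t == positive for t in y_true)
    # Convention for this sketch: a metric is 0.0 when its denominator is 0.
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print('accuracy:', accuracy(y_true, y_pred))
print('alt  (precision, recall, f1):', precision_recall_f1(y_true, y_pred, 'alt'))
print('null (precision, recall, f1):', precision_recall_f1(y_true, y_pred, 'null'))
```

The accuracy looks strong, yet recall for the minority class is zero, which is exactly the failure that accuracy hides.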

To gain more intuition, let's look at the four evaluation metrics for one specific model. This model uses count vectorization and multinomial logistic regression.

In [42]:
pipe_lr_cv=Pipeline([
            ('cv',CountVectorizer()),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=2000,solver='lbfgs'))])

In [43]:
from sklearn.model_selection import cross_val_score

# fitting the model 
pipe_lr_cv.fit(X_train,y_train)

# cross val scores 
scores=cross_val_score(pipe_lr_cv,X_train,y_train,cv=5)
print(scores)
print(scores.mean())

# seeing how well our model performs on previously unseen data 
y_pred=pipe_lr_cv.predict(X_test)
accuracy_score(y_test,y_pred)



[0.73452838 0.75202732 0.74380871 0.75192143 0.73441503]
0.7433401745774703


0.7653688524590164

We now look at precision, recall, and F1 scores. To gain more intuition as to how these metrics are calculated, we print a confusion matrix. Confusion matrices allow us to compare the predicted values for each class against the actual values of each class. 

In [44]:
from sklearn.metrics import confusion_matrix
import numpy as np
class_names_actual=['Negative Actual','Neutral Actual','Positive Actual']
class_names_predicted=['Negative Pred','Neutral Pred','Positive Pred']
class_names=['Negative','Neutral', 'Positive']
# rows are the actual classes and columns are the predicted classes, both ordered -1, 0, 1
cm=confusion_matrix(y_test,y_pred,labels=[-1,0,1])

In [45]:
cm_df=pd.DataFrame(cm,index=class_names_actual,columns=class_names_predicted)
cm_df


Unnamed: 0,Negative Pred,Neutral Pred,Positive Pred
Negative Actual,1489,298,102
Neutral Actual,110,401,69
Positive Actual,49,59,351


In [46]:
# We calculate the precision, recall, and F1 score for the negative class from the confusion matrix above. 
# Precision = true positives / predicted positives 
# For the negative class, the true positives (tweets predicted negative that were actually negative) are cm[0,0] = 1489,
# and the predicted positives (all tweets predicted negative, regardless of the actual label) are the first column: 1489+110+49. 
true_positives=cm[0,0]
precision=true_positives/cm[:,0].sum()
print(f'Precision: {precision:.4f}')
# Recall = true positives / actual positives: how many of the actual negatives we detected.
# The actual negatives are the first row: 1489+298+102. 
recall=true_positives/cm[0,:].sum()
print(f'Recall: {recall:.4f}')
# The F1 score represents both precision and recall in one metric: 
F1=(2*precision*recall)/(precision+recall)
print(f'F1 Score: {F1:.4f}')
# the same calculation applies to each of the other classes

Precision: 0.9035
Recall: 0.7882
F1 Score: 0.8420


In [47]:
# the classification report lets us look at the precision, recall, and f1- scores for every class.
from sklearn.metrics import classification_report
report=classification_report(y_test,y_pred)
print(report)


              precision    recall  f1-score   support

          -1       0.90      0.79      0.84      1889
           0       0.53      0.69      0.60       580
           1       0.67      0.76      0.72       459

    accuracy                           0.77      2928
   macro avg       0.70      0.75      0.72      2928
weighted avg       0.79      0.77      0.77      2928
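
Every per-class row of this report, and the overall accuracy, can be reproduced from the confusion matrix alone. The sketch below uses plain Python lists, with the matrix values copied from the output above. The function name `per_class_metrics` is chosen for this illustration.

```python
labels = [-1, 0, 1]
# Rows are actual classes, columns are predicted classes (order: -1, 0, 1).
cm = [
    [1489, 298, 102],
    [110, 401, 69],
    [49, 59, 351],
]

def per_class_metrics(cm, k):
    """Precision, recall, and F1 for the class at index k."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, k) for k, label in enumerate(labels)}
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for label, (precision, recall, f1) in metrics.items():
    print(f'{label:>2}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}')
print(f'accuracy={accuracy:.4f}')
```

Precision reads down a column and recall reads across a row, which is why a class can score well on one and poorly on the other.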



We will begin evaluating all our models with these metrics.

We use a simple pipeline for reproducibility. The first step is the count vectorizer or TF-IDF vectorizer, and the second is the classifier we are trying: logistic regression, multinomial naive Bayes, random forest, support vector classifier, or XGBoost. Note that only the logistic regression pipelines below set class_weight='balanced'; the other classifiers run with their default settings.

In [48]:
pipe_lr_cv=Pipeline([
            ('cv',CountVectorizer()),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=4000,solver='lbfgs'))])
pipe_lr_tfidf=Pipeline([('tfidf',(TfidfVectorizer())),
            ('LR',LogisticRegression(multi_class='multinomial',class_weight='balanced',max_iter=4000,solver='lbfgs'))])
pipe_nb_cv=Pipeline([('cv',CountVectorizer()),
                     ('MNB',MultinomialNB())])
pipe_nb_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
                     ('MNB',MultinomialNB())])
pipe_rf_cv=Pipeline([('cv',CountVectorizer()),
            ('RF',RandomForestClassifier(random_state=42))])
pipe_rf_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('RF',RandomForestClassifier(random_state=42))])
pipe_svc_cv=Pipeline([('cv',CountVectorizer()),
            ('SVC',svm.SVC(kernel='rbf'))])
pipe_svc_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('SVC',svm.SVC(kernel='rbf'))])
pipe_xgb_cv=Pipeline([('cv',CountVectorizer()),
            ('XGB',xgb.XGBClassifier())])
pipe_xgb_tfidf=Pipeline([('tfidf',TfidfVectorizer()),
            ('XGB',xgb.XGBClassifier())])


### Evaluation Metrics using "default" parameters


In [49]:
models_default= [pipe_lr_cv,pipe_lr_tfidf,pipe_nb_cv,pipe_nb_tfidf,pipe_rf_cv,pipe_rf_tfidf,pipe_svc_cv,pipe_svc_tfidf,pipe_xgb_cv,pipe_xgb_tfidf]
model_names=['log_reg_cv','log_reg_tfidf','naive_bayes_cv','naive_bays_tfidf','random_forest_cv','random_forest_tfidf','support_vec_clas_cv','support_vec_class_tfidf','xgb_cv','xgb_tfidf']

In [50]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

class_names = [-1, 0, 1]

for i, model in enumerate(models_default):
    print(f"Model: {model_names[i]}")
    print("-----------------------------")
    
    if i == 8 or i == 9:
        # XGBoost expects class labels 0..n_classes-1, so encode -1/0/1 as 0/1/2
        le = LabelEncoder()
        y_train_xgb = le.fit_transform(y_train)
        model.fit(X_train, y_train_xgb)
        # predict once and map the encoded predictions back to -1/0/1
        y_pred = le.inverse_transform(model.predict(X_test))

    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)    
        
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
   
    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)
    print("\n")


Model: log_reg_cv
-----------------------------
Accuracy: 0.7653688524590164
Classification Report:
              precision    recall  f1-score   support

          -1       0.90      0.79      0.84      1889
           0       0.53      0.69      0.60       580
           1       0.67      0.76      0.72       459

    accuracy                           0.77      2928
   macro avg       0.70      0.75      0.72      2928
weighted avg       0.79      0.77      0.77      2928



Model: log_reg_tfidf
-----------------------------
Accuracy: 0.7599043715846995
Classification Report:
              precision    recall  f1-score   support

          -1       0.89      0.79      0.84      1889
           0       0.52      0.66      0.58       580
           1       0.67      0.75      0.71       459

    accuracy                           0.76      2928
   macro avg       0.69      0.74      0.71      2928
weighted avg       0.79      0.76      0.77      2928



Model: naive_bayes_cv
---------

In [51]:
default_report=[]
for i, model in enumerate(models_default):
        if i == 8 or i == 9:
            # XGBoost was trained on encoded labels (0, 1, 2); map them back to -1, 0, 1
            mapping = {0: -1, 1: 0, 2: 1}
            y_pred = [mapping[label] for label in model.predict(X_test)]
        else:
            y_pred = model.predict(X_test)
        
        report=classification_report(y_test,y_pred,zero_division=1,output_dict=True)
    
        # Concatenating classification reports for easier comparison; understanding this code is not essential.
        report_df = pd.DataFrame(report).transpose()
        report_df.reset_index(inplace=True)
        report_df = report_df.rename(columns={'index': 'labels'})
        model_name = model_names[i]
        report_df['Model Name'] = model_name

        pivot_df=report_df.pivot(index='Model Name',columns='labels')
        pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]
        columns=list(pivot_df.columns)

        pivot_df.columns=columns
        columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
        final_df=pivot_df.drop(columns=columns_to_drop)
        final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
        final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
        default_report.append(final_df)




In [52]:
default_reports_df=pd.concat(default_report)
default_reports_df.reset_index(drop=False, inplace=True)


In [53]:
default_reports_df

Unnamed: 0,Model Name,Accuracy,precision (-1),precision (0),precision (1),precision (macro avg),precision (weighted avg),recall (-1),recall (0),recall (1),recall (macro avg),recall (weighted avg),f1-score (-1),f1-score (0),f1-score (1),f1-score (macro avg),f1-score (weighted avg),support (-1),support (0),support (1)
0,log_reg_cv,0.765369,0.903519,0.529024,0.672414,0.701652,0.793108,0.788248,0.691379,0.764706,0.748111,0.765369,0.841956,0.599402,0.715596,0.718985,0.774101,1889.0,580.0,459.0
1,log_reg_tfidf,0.759904,0.894737,0.51817,0.670565,0.694491,0.785002,0.791953,0.663793,0.749455,0.735067,0.759904,0.840213,0.582011,0.707819,0.710014,0.768312,1889.0,580.0,459.0
2,naive_bayes_cv,0.781421,0.794814,0.690265,0.784091,0.75639,0.772424,0.941239,0.403448,0.601307,0.648665,0.781421,0.861852,0.509249,0.680641,0.683914,0.763599,1889.0,580.0,459.0
3,naive_bays_tfidf,0.712432,0.702703,0.735714,0.895161,0.777859,0.739412,0.991001,0.177586,0.24183,0.470139,0.712432,0.822315,0.286111,0.380789,0.496405,0.646885,1889.0,580.0,459.0
4,random_forest_cv,0.767077,0.815065,0.579618,0.738342,0.711008,0.756399,0.893594,0.47069,0.620915,0.661733,0.767077,0.852525,0.519505,0.674556,0.682196,0.758659,1889.0,580.0,459.0
5,random_forest_tfidf,0.775273,0.804979,0.634328,0.753501,0.730936,0.763106,0.924299,0.439655,0.586057,0.650003,0.775273,0.860522,0.519348,0.659314,0.679728,0.761398,1889.0,580.0,459.0
6,support_vec_clas_cv,0.788934,0.824536,0.634573,0.777174,0.745428,0.779482,0.917946,0.5,0.623094,0.680347,0.788934,0.868737,0.559306,0.691657,0.706567,0.779683,1889.0,580.0,459.0
7,support_vec_class_tfidf,0.789959,0.796986,0.717868,0.810198,0.775017,0.783385,0.951826,0.394828,0.623094,0.656583,0.789959,0.867551,0.509455,0.704433,0.693813,0.771046,1889.0,580.0,459.0
8,xgb_cv,0.786202,0.821343,0.655889,0.742424,0.739886,0.776198,0.912652,0.489655,0.640523,0.680943,0.786202,0.864594,0.560711,0.687719,0.704341,0.776671,1889.0,580.0,459.0
9,xgb_tfidf,0.763661,0.782609,0.646853,0.739691,0.723051,0.748989,0.933827,0.318966,0.625272,0.626022,0.763661,0.851557,0.427252,0.677686,0.652165,0.740251,1889.0,580.0,459.0


In [54]:

# Melt the DataFrame for Plotly
melted_df = default_reports_df.melt(id_vars=['Model Name'],
                                     value_vars=['recall (-1)', 'recall (0)', 'recall (1)'],
                                     var_name='Class', value_name='Recall')

# Create a grouped bar chart
fig = px.bar(melted_df, x='Model Name', y='Recall', color='Class', barmode='group',
             title='Recall for Different Classes by Model',
             labels={'Model Name': 'Model', 'Recall': 'Recall', 'Class': 'Sentiment Class'},
             color_discrete_sequence=['red', 'yellow', 'blue'])

# Show the plot
fig.show()

In [55]:
default_reports_df['average_recall_pos_nue'] = (default_reports_df['recall (0)'] + default_reports_df['recall (1)']) / 2
default_reports_df['average_precision_pos_nue'] = (default_reports_df['precision (0)'] + default_reports_df['precision (1)']) / 2


In [56]:


sorted_df = default_reports_df.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Different Models',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

# Show the plot
fig.show()


The average recall in the minority classes leaves much to be desired. Most of our models fall around 50% average recall in the minority classes (roughly 0.47 to 0.57), with the exception of logistic regression, which is significantly higher (about 0.71 to 0.73), and naive Bayes using TF-IDF, which is significantly lower (about 0.21). Let's see if we can improve this.

### Improving Recall in Minority classes
- Our goal is to improve recall (the share of a class's actual instances that the model correctly identifies) in the neutral and positive classes, as those have the lowest scores.
- Let's see if we can improve our recall by testing different combinations of hyperparameters. We want to find the hyperparameters that maximize recall in the neutral and positive classes.
- We start by creating parameter grids for each of our five model types (each is paired with both vectorizers, giving ten pipelines).
- As a base metric to evaluate our models, I will use macro average recall, which is the unweighted mean of the per-class recall scores.
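
As a quick check of what macro average recall means, the sketch below recomputes it from the per-class recall and support values printed for `log_reg_cv` in the default results table. The variable names are illustrative and not part of the notebook's pipeline.

```python
# Per-class recall and support for log_reg_cv, copied from the default report table.
recall = {-1: 0.788248, 0: 0.691379, 1: 0.764706}
support = {-1: 1889, 0: 580, 1: 459}

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class is weighted by its number of test examples.
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

Up to rounding, the two values reproduce the `recall (macro avg)` and `recall (weighted avg)` columns for that row. Because the macro average ignores class size, a gain in neutral or positive recall moves it as much as an equal gain in the much larger negative class.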

In [57]:
from sklearn.metrics import make_scorer, recall_score
macro_recall_scorer = make_scorer(recall_score, average='macro')


In [58]:
lr_param_grid = [{'LR__penalty': ['l2'],
                   'LR__C': [0.01, 0.1, 1, 10],
                   'LR__solver': ['newton-cg'],
                   'LR__max_iter': [100, 1000,10000]  
                 }]


nb_param_grid= [{
    'MNB__alpha': [0.1, 0.5, 1.0, 2.0],  
}]



rf_param_grid = [{
    'RF__n_estimators': [50, 100,150],           
    'RF__max_depth': [None, 10,50],              
    'RF__min_samples_split': [2, 5,10],          
    'RF__min_samples_leaf': [1, 2,4]             
}]


svc_param_grid = [{'SVC__kernel': ['linear', 'rbf'],
                    'SVC__C': [1, 2, 3]}]


xgb_param_grid = [{'XGB__learning_rate': [.1,.2],
                    'XGB__max_depth': [1, 2,5,10],
                    'XGB__min_child_weight': [1,2],
                    'XGB__subsample': [1.0, 0.1],
                    'XGB__n_estimators': [50,100,150]}]


Grid search CV exhaustively calculates the macro recall for every parameter combination we enter. For example, for logistic regression we have 1 * 4 * 1 * 3 = 12 possible models (one penalty, four values of C, one solver, and three values of max_iter). Each model is evaluated with three-fold cross-validation: we hold out a third of the training set for validation, fit the model on the remaining two folds, and evaluate the recall on the held-out fold. We do this three times in total, so that each of the three folds serves as the validation set once, and then average the three recall scores. That average is the score for that particular model. Grid search CV keeps the one model out of the twelve that has the highest macro recall.
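
To make the bookkeeping concrete, here is a minimal sketch that enumerates the candidate settings for the logistic regression grid using only the standard library. It copies the values from `lr_param_grid` without the `LR__` prefix; `lr_grid`, `candidates`, and `n_fits` are names made up for this illustration.

```python
from itertools import product

# Same values as lr_param_grid, without the 'LR__' pipeline prefix.
lr_grid = {
    "penalty": ["l2"],
    "C": [0.01, 0.1, 1, 10],
    "solver": ["newton-cg"],
    "max_iter": [100, 1000, 10000],
}
n_folds = 3  # cv=3

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(lr_grid, values)) for values in product(*lr_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * n_folds  # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
print(candidates[0])
```

The same counting explains why the random forest and XGBoost searches take so long: their grids contain many more combinations, and every combination is fit once per fold. With `refit` set, the winning configuration is then fit one more time on the full training set.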

In [59]:
# we could probably clean this up with a for loop 
lr_cv_grid_search = GridSearchCV(estimator=pipe_lr_cv,
        param_grid=lr_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
lr_tfidf_grid_search = GridSearchCV(estimator=pipe_lr_tfidf,
        param_grid=lr_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
nb_cv_grid_search = GridSearchCV(estimator=pipe_nb_cv,
        param_grid=nb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',                        
        cv=3)
nb_tfidf_grid_search = GridSearchCV(estimator=pipe_nb_tfidf,
        param_grid=nb_param_grid,
       scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
rf_cv_grid_search = GridSearchCV(estimator=pipe_rf_cv,
        param_grid=rf_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
rf_tfidf_grid_search = GridSearchCV(estimator=pipe_rf_tfidf,
        param_grid=rf_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
svc_cv_grid_search = GridSearchCV(estimator=pipe_svc_cv,
        param_grid=svc_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
svc_tfidf_grid_search = GridSearchCV(estimator=pipe_svc_tfidf,
        param_grid=svc_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
xgb_cv_grid_search = GridSearchCV(estimator=pipe_xgb_cv,
        param_grid=xgb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)
xgb_tfidf_grid_search = GridSearchCV(estimator=pipe_xgb_tfidf,
        param_grid=xgb_param_grid,
        scoring={'recall': macro_recall_scorer},
        refit='recall',
        cv=3)

We fit each grid search on X_train and y_train. Then we save the best estimator from each of the ten grid searches in a list called best_estimators.

In [60]:
grids=[lr_cv_grid_search,lr_tfidf_grid_search,nb_cv_grid_search,nb_tfidf_grid_search,rf_cv_grid_search,rf_tfidf_grid_search,svc_cv_grid_search,svc_tfidf_grid_search,xgb_cv_grid_search,xgb_tfidf_grid_search]

In [61]:
best_estimators = []
best_params=[]

for grid_search in grids:
    if grid_search in (xgb_cv_grid_search, xgb_tfidf_grid_search):
        # XGBoost expects class labels 0..n_classes-1, so encode -1/0/1 as 0/1/2
        le = LabelEncoder()
        y_train_xgb = le.fit_transform(y_train)
        y_test_xgb = le.transform(y_test)
        grid_search.fit(X_train, y_train_xgb)
    else:
        grid_search.fit(X_train, y_train)
    best_estimators.append(grid_search.best_estimator_)
    best_params.append(grid_search.best_params_)


KeyboardInterrupt: 

In [None]:
best_estimators

In [None]:
grid_dict = {0: 'Logistic Regression CV', 1: 'Logistic Regression TFIDF', 
             2: 'Multinomial Naive Bayes CV', 3: 'Multinomial Naive Bayes TFIDF', 
             4: 'Random Forest CV',5:'Random Forest TFIDF',6:'SVC CV',7:'SVC TFIDF',
            8:'XGB CV',9:'XGB TFIDF'}

In [None]:

reports=[]


for i,estimator in enumerate (best_estimators): 
    if i==8 or i==9:
        y_pred_modified = estimator.predict(X_test)
        y_pred= le.inverse_transform(y_pred_modified)  
    else: 
        y_pred=estimator.predict(X_test)
#     print('Classification report for', grid_dict[i] )
#     print(classification_report(y_test,y_pred,zero_division=1))
    report=classification_report(y_test,y_pred,zero_division=1,output_dict=True)
    
    # Concatenating classification reports for easier comparison; understanding this code is not essential.
    report_df = pd.DataFrame(report).transpose()
    report_df.reset_index(inplace=True)
    report_df = report_df.rename(columns={'index': 'labels'})
    model_name = grid_dict[i]
    report_df['Model Name'] = model_name

    pivot_df=report_df.pivot(index='Model Name',columns='labels')
    pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]
    columns=list(pivot_df.columns)

    pivot_df.columns=columns
    columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
    final_df=pivot_df.drop(columns=columns_to_drop)
    final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
    final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
    reports.append(final_df)


In [None]:
combined_reports_df=pd.concat(reports)
combined_reports_df.reset_index(drop=False, inplace=True)


In [None]:
combined_reports_df

That's a lot of data! While all of it contains valuable information, let's focus on the data most relevant to evaluating our models' success.
- First, we look at improvements over the default models.
- Recall: We are most interested in how well our models detect instances of the minority classes, that is, how often they detect neutral and positive tweets. These observations are in the recall (0) and recall (1) columns. This is our most important metric.
- We may also want to look at precision: how often our predictions for a given class were correct.
- We are willing to accept lower precision, as misclassifying a tweet matters less to us than detecting tweets from the minority classes.
- Let us first look at precision vs. recall in each of the three classes separately.
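
The difference between the two metrics is easiest to see on a confusion matrix. The sketch below uses a small, entirely hypothetical matrix (these are not results from our models) and a helper named `precision_recall` written only for this illustration.

```python
# Hypothetical confusion matrix: rows are true classes, columns are predicted classes.
labels = [-1, 0, 1]
confusion = [
    [80, 15, 5],   # true -1
    [10, 25, 5],   # true  0
    [4, 6, 30],    # true  1
]

def precision_recall(confusion, k):
    """Return (precision, recall) for the class at row/column index k."""
    true_positives = confusion[k][k]
    predicted_as_k = sum(row[k] for row in confusion)  # column total
    actually_k = sum(confusion[k])                     # row total
    return true_positives / predicted_as_k, true_positives / actually_k

for k, label in enumerate(labels):
    p, r = precision_recall(confusion, k)
    print(f"class {label:>2}: precision={p:.2f} recall={r:.2f}")
```

Recall divides by the row total (all tweets that truly belong to the class), while precision divides by the column total (all tweets predicted as that class). Predicting a minority class more often can therefore raise its recall while lowering its precision, which is the trade-off we are willing to accept.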

In [None]:
# just creating scatter plots
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot



def create_scatter_plot(x_column, y_column, x_label, y_label, color,title):
    fig = figure(
        title=title,
        width=300,
        height=300
    )
    scatter = fig.scatter(
        x=x_column,
        y=y_column,
        size=10,
        source=source,
        color=color
    )
    
    hover = HoverTool()
    hover.tooltips = [
        ('Model Name', '@{Model Name}'),
        ('Recall', f'@{{{x_column}}}'),
        ('Precision', f'@{{{y_column}}}')
    ]
    fig.add_tools(hover)
    
    fig.title.text_font_size = '12.5pt'
    fig.xaxis.axis_label = f'Recall ({x_label} Tweets)'
    fig.yaxis.axis_label = f'Precision ({y_label} Tweets)'
    fig.xaxis.axis_label_text_font_size = '11pt'
    fig.yaxis.axis_label_text_font_size = '11pt'
    
    return fig

source = ColumnDataSource(data=combined_reports_df)
output_notebook()

fig = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red','Precision vs. Recall (Negative)')
fig2 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green','Precision vs. Recall (Neutral)')
fig3 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Positive', 'blue','Precision vs. Recall (Positive)')

grid = gridplot([[fig, fig2, fig3]])
show(grid, notebook_handle=True)

In [None]:
# Melt the DataFrame for Plotly
melted_df = combined_reports_df.melt(id_vars=['Model Name'],
                                     value_vars=['recall (-1)', 'recall (0)', 'recall (1)'],
                                     var_name='Class', value_name='Recall')

# Create a grouped bar chart
fig = px.bar(melted_df, x='Model Name', y='Recall', color='Class', barmode='group',
             title='Recall for Different Classes by Model after hyperparameter tuning',
             labels={'Model Name': 'Model', 'Recall': 'Recall', 'Class': 'Sentiment Class'},
             color_discrete_sequence=['red', 'yellow', 'blue'])

# Show the plot
fig.show()

In [None]:
combined_reports_df['average_recall_pos_nue'] = (combined_reports_df['recall (0)'] + combined_reports_df['recall (1)']) / 2


In [None]:

sorted_df = combined_reports_df.sort_values(by='average_recall_pos_nue', ascending=False)

fig = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Different Models',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

# Show the plot
fig.show()


In [None]:
grid_dict = {
    0: 'Logistic Regression CV', 1: 'Logistic Regression TFIDF', 
    2: 'Multinomial Naive Bayes CV', 3: 'Multinomial Naive Bayes TFIDF', 
    4: 'Random Forest CV', 5: 'Random Forest TFIDF', 
    6: 'SVC CV', 7: 'SVC TFIDF', 8: 'XGB CV', 9: 'XGB TFIDF'
}

# Create a DataFrame with the specified columns
data = []
for i, model_name in grid_dict.items():
    default_recall = default_reports_df.loc[i, 'average_recall_pos_nue']
    final_recall = combined_reports_df.loc[i, 'average_recall_pos_nue']
    data.append([model_name, default_recall, final_recall])

columns = ['Model Name', 'Default Report Avg Recall', 'Tuned Report Avg Recall']
combined_recall_df = pd.DataFrame(data, columns=columns)

# Display the new DataFrame
(combined_recall_df)

In [None]:
combined_recall_df = combined_recall_df.sort_values(by='Tuned Report Avg Recall', ascending=False)

melted_df = combined_recall_df.melt(id_vars=['Model Name'],
                                    value_vars=['Default Report Avg Recall', 'Tuned Report Avg Recall'],
                                    var_name='Report', value_name='Avg Recall')

# Create a side-by-side bar graph using Plotly Express
fig = px.bar(melted_df, x='Model Name', y='Avg Recall', color='Report', barmode='group',
             title='Comparison of Average Recall for Positive/Neutral Sentiments after Tuning',
             labels={'Model Name': 'Model', 'Avg Recall': 'Average Recall', 'Report': 'Report'},
             color_discrete_sequence=['red', 'green'])

fig.update_xaxes(tickangle=-45)

fig.show()


As we can see, for each model except Random Forest with CV (count vectorizer), hyperparameter tuning resulted in increased average recall. Our most noticeable increase in average recall in the minority classes was MNB with TFIDF, which doubled!

- All of these graphs provide useful information.
- However, given the volume of data in the Negative class, it is no surprise that our models perform fairly well at recognizing Negative tweets (the lowest recall is still above 75%). The lowest recall in this class comes from the logistic regression models using count vectorization and TFIDF. MNB using TFIDF and CV are the next lowest.
- The Neutral and Positive classes provide more insight into how our models perform.
- Interestingly, logistic regression using count vectorization yields the highest recall in the Neutral class, followed by logistic regression using TFIDF, MNB using CV, and MNB using TFIDF. This is almost the opposite of what we observed in the Negative class.
- We see similar results in the Positive class. LR using CV, MNB using CV, LR using TFIDF, and MNB using TFIDF perform similarly.

### Looking at Macro Precision vs. Macro Recall and Weighted Precision vs. Weighted Recall

In [None]:


# source = ColumnDataSource(data=combined_reports_df)
# output_notebook()


# fig4 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
# fig4.xaxis.axis_label = 'Macro Average Recall'
# fig4.yaxis.axis_label = 'Macro Average Precision'



# fig5 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
# fig5.xaxis.axis_label = 'Weighted Average Recall'
# fig5.yaxis.axis_label = 'Weighted Average Precision'

# grid = gridplot([[fig4,fig5]])
# show(grid, notebook_handle=True)

# Neural Networks

In [62]:
# importing all necessary libraries for this section
# (duplicate imports removed; everything is imported through tensorflow.keras for consistency)

from tensorflow.keras.preprocessing.text import Tokenizer, one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, Bidirectional, Conv1D, Dense, Dropout,
                                     Embedding, Flatten, GlobalMaxPooling1D, LSTM,
                                     SpatialDropout1D)
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split


- We have tried using classical machine learning models to classify our tweets. Let us now see how neural networks affect our performance.
- Previously, we used count vectorizer and TFIDF to vectorize our tweets. Both techniques produce a sparse vector for each tweet, that is, a vector filled predominantly with 0s.
- This form of representation works fairly well with traditional machine learning models but is not always as effective with neural networks, which are designed to handle dense input.
- For our neural networks, we want to represent these tweets with dense vectors, that is, vectors with very few 0s.
- We will explore two dense vectorization techniques, Word2Vec and GloVe.
- We begin with Word2Vec.
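
To illustrate how sparse a bag-of-words representation is, here is a small standard-library sketch. The three tweets are made up for the example, and the counting is a simplified stand-in for what a count vectorizer does, not the scikit-learn implementation.

```python
# Hypothetical mini-corpus of already-preprocessed tweets.
tweets_tokens = [
    ["flight", "cancelled", "no", "help"],
    ["great", "crew", "thanks"],
    ["flight", "delayed", "again"],
]

# Bag-of-words: one column per vocabulary word, so most entries are 0.
vocab = sorted({word for tweet in tweets_tokens for word in tweet})
count_vectors = [[tweet.count(word) for word in vocab] for tweet in tweets_tokens]

zeros = sum(value == 0 for row in count_vectors for value in row)
sparsity = zeros / (len(count_vectors) * len(vocab))

print("vocabulary size:", len(vocab))
print("first count vector:", count_vectors[0])
print(f"share of zeros: {sparsity:.2f}")
```

Even with nine vocabulary words, most entries are zero. With a vocabulary of several thousand words and tweets of a dozen tokens, nearly every entry is zero, whereas a dense embedding describes each word with a short vector of real numbers.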

### Word2Vec Text Vectorization
- Word2Vec uses a simple two-layer neural network to vectorize words
- Each word is represented as a vector, and a word vector's position relative to other word vectors reflects its semantic meaning. For example, we would expect the word vector for "happy" to be close to the word vector for "joyful".
- The model generally has two approaches
- In the Continuous Bag-of-Words (CBOW) approach, the model learns by guessing a target word from its neighboring words (the dog likes ?)
    - target word: treats
- In the Skip-Gram approach, the model attempts to guess the neighboring context words from a given target word
    - We treat each pair of target word and context word as a new observation
    - For example, in the sentence "the dog likes treats", if the target word is "dog", the model tries to predict the context words "the" and "likes". The model learns from these two pairs: (target word: dog, context word: the) and (target word: dog, context word: likes)
    - Skip-gram is slower to train than CBOW, but it is often reported to represent rare words better
- From the gensim library, we will import the Word2Vec model, which we will train on our training data
- By default, Gensim's Word2Vec uses the Continuous Bag-of-Words approach (sg=0)
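
The two training setups can be sketched by listing the examples each one would generate for "the dog likes treats". This is a simplified illustration: the helper names `skipgram_pairs` and `cbow_examples` are made up here, and refinements used by real implementations (such as subsampling and negative sampling) are left out.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: predict each neighbor from the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context words, target) examples: predict the center word from its neighbors."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = ["the", "dog", "likes", "treats"]

# Skip-gram pairs whose target word is "dog", with one word of context on each side.
print([pair for pair in skipgram_pairs(sentence, window=1) if pair[0] == "dog"])

# CBOW example for the last word: guess "treats" from the words before it.
print(cbow_examples(sentence, window=3)[-1])
```

In Gensim, the choice between the two is made with the `sg` parameter of `Word2Vec`: `sg=0` selects CBOW and `sg=1` selects skip-gram.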

In [63]:
import gensim
from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument


In [64]:
# As neural networks can learn more complex relationships in the data, we adjust our preprocessing function to limit the amount of preprocessing we do.
# Specifically, we remove stemming.
def preprocessor_nn(text):
    # remove html tags, e.g. <br>
    text=remove_html(text)
    # remove @ tags, e.g. @catsrcool
    text=remove_tags(text)
    # remove URLs
    text=re.sub(r"http\S+","",text)
    # expand contractions
    text=decontracted(text)
    # remove numbers and any words mixed with numbers
    text=re.sub(r"\S*\d\S*","",text)
    # replace every run of non-letter characters (punctuation, #, stray symbols) with a single space
    text=re.sub('[^A-Za-z]+',' ',text)
    # collapse repeated spaces
    text=re.sub(' +',' ',text)
    # remove any remaining punctuation
    text=remove_punc(text)
    # lowercase everything
    text=text.lower()
    # remove stop words
    text=remove_stops(text)
    # stemming is intentionally skipped for the neural network models
    # text=stemmed_sent(text)

    return(text)

In [65]:
# the preprocessor is otherwise unchanged; we skip stemming to give the neural network a better chance of finding relationships between words
preprocessor_nn(tweets['text'][1])

['plus', 'added', 'commercials', 'experience', 'tacky']

In [66]:
# splitting the data 
X=tweets.text.apply(lambda x: preprocessor_nn(x) )
y=tweets.Sentiment
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

We use the Keras tokenizer, which assigns each unique word an integer.
- 0 is reserved for padding (discussed later)
- 1 is reserved for out-of-vocabulary (OOV) words (useful when vectorizing test data that contains words not seen in the training data)
- oov_token=1 sets the placeholder token used for such words; since the OOV token always receives index 1, any word not in X_train is mapped to 1
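
The indexing convention can be sketched in a few lines of plain Python. This is an illustration of the idea rather than the Keras implementation; `build_word_index` and `to_sequence` are hypothetical helpers, and the three training tweets are made up.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved indices, following the convention described above

def build_word_index(token_lists):
    """Assign IDs starting at 2, most frequent word first."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=2)}

def to_sequence(tokens, word_index):
    """Replace each word with its ID, or with OOV_ID if it was never seen in training."""
    return [word_index.get(word, OOV_ID) for word in tokens]

train = [["flight", "cancelled"], ["flight", "delayed"], ["thanks", "flight"]]
word_index = build_word_index(train)

print(word_index)
# "rebooked" never appears in the training tweets, so it maps to the OOV index.
print(to_sequence(["flight", "rebooked", "thanks"], word_index))
```

The index is built from the training tweets only, so unseen test words cannot leak into the vocabulary; they all collapse to the single OOV index.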

In [67]:
word_tokenizer = Tokenizer(oov_token=1)
# building the token index from the unique words in X_train
word_tokenizer.fit_on_texts(X_train)

In [68]:
first_five_entries = dict(list(word_tokenizer.word_index.items())[:5])

# This is how the word_tokenizer dictionary is formatted.
# Each unique word in the training data is given a unique integer.
# The value 1 is given to any word that was not in the training data (any word not previously assigned an integer).
print(first_five_entries)


{1: 1, 'flight': 2, 'get': 3, 'thanks': 4, 'cancelled': 5}


In [69]:
vocab_length=len(word_tokenizer.word_index)+1
vocab_length

9738

Why do we add 1 to the vocab length?
- Note that the indexing for our word_tokenizer begins at 1 rather than 0, because 0 is reserved for padding. To be able to index up to the greatest token ID (9737), the vocabulary needs 9737 + 1 = 9738 slots (IDs 0 through 9737).

In [70]:
# tokenizing both X_train and X_test: replacing each word with its integer token.
X_train_tokenized=word_tokenizer.texts_to_sequences(X_train)
X_test_tokenized=word_tokenizer.texts_to_sequences(X_test)
(X_train_tokenized[0:5])

[[799, 11, 1436, 49, 226],
 [106, 330, 115, 230, 141, 148, 583, 91, 506, 1172, 108],
 [137, 57, 203, 103, 3542, 278, 159, 952, 953],
 [2, 19, 107, 132, 39, 90, 1039],
 [571, 244, 367, 30, 554, 417, 2108, 1173, 384, 268, 190]]

In order to feed these values into our model, they all need to be the same length. For this we use padding to make each tweet a vector of length 100. If a tweet is too short, 0s are added after it until there are 100 tokens.
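
As a rough sketch of what post-padding does, the hypothetical helper `pad_post` below pads a token list with 0s at the end. For sequences longer than `maxlen` it keeps the last `maxlen` tokens, which is my understanding of the Keras default (`truncating='pre'`); treat that detail as an assumption. A small `maxlen` is used so the output stays readable.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Pad with pad_value at the end, or keep only the last maxlen tokens if too long."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return sequence + [pad_value] * (maxlen - len(sequence))

# A short tweet gets zeros appended after its tokens.
print(pad_post([799, 11, 1436, 49, 226], maxlen=8))

# A tweet that is too long is cut down to maxlen tokens.
print(pad_post([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], maxlen=8))
```

Because 0 is never assigned to a real word, the model can treat it as "no token here" without confusing it with vocabulary entries.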


In [71]:
maxlen=100 
X_train_tokenized=pad_sequences(X_train_tokenized,padding='post',maxlen=maxlen)
X_test_tokenized=pad_sequences(X_test_tokenized,padding='post',maxlen=maxlen)



Now all our tweets have been tokenized!

### Handling Class imbalances 
We mentioned earlier that the distribution of tweet sentiments is highly imbalanced. Previously, we used both class_weight='balanced' and adjusted class priors to account for the imbalance.
For our neural networks, we will use an oversampling technique called SMOTE. SMOTE works by creating synthetic instances of the minority classes that are similar to instances already in those classes. It creates as many minority-class instances as are needed to match the number of instances in the majority class. This is also useful because it gives our models "more data" to learn from.
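
The core idea behind SMOTE, placing a new point somewhere on the line segment between a minority-class point and one of its nearest neighbors, can be sketched as follows. This is a simplified illustration and not the imbalanced-learn implementation: the neighbors are given by hand instead of being found with k-nearest neighbors, and `smote_like_sample` is a hypothetical helper.

```python
import random

def smote_like_sample(point, neighbors, rng):
    """Create one synthetic point on the line segment between point and a random neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Hypothetical 2-D minority-class point and two of its nearest neighbors.
minority_point = [1.0, 2.0]
nearest_neighbors = [[2.0, 2.0], [1.0, 4.0]]

synthetic = [smote_like_sample(minority_point, nearest_neighbors, rng) for _ in range(3)]
for sample in synthetic:
    print([round(value, 3) for value in sample])
```

In this notebook the features passed to SMOTE are the padded token-ID sequences, so the interpolation acts on the ID values themselves.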

In [72]:
# before SMOTE 
len(X_train)
y_train.value_counts()

-1    7289
 0    2519
 1    1904
Name: Sentiment, dtype: int64

In [73]:
# Using smote
from imblearn.over_sampling import SMOTE
# auto means that the model will increase the instances in the minority class to match the majority class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tokenized, y_train)

In [74]:
len(X_train_resampled)
y_train_resampled.value_counts()

-1    7289
 1    7289
 0    7289
Name: Sentiment, dtype: int64

- For our neural network models, we also have to one-hot encode our target variable. This is important because it turns the numerical labels into categorical variables, which prevents the model from mistaking the numerical labels (-1, 0, 1) for ordinal values.
- One-hot encoding also matches the output of the softmax activation function, which gives the probability that an observation falls in each class.
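
For intuition, one-hot encoding of our three sentiment labels can be written by hand as below. `one_hot_encode` is a hypothetical helper for illustration only; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(labels, classes):
    """Map each label to a 0/1 vector with a single 1 in the column for its class."""
    column = {c: i for i, c in enumerate(classes)}
    encoded = []
    for label in labels:
        row = [0.0] * len(classes)
        row[column[label]] = 1.0
        encoded.append(row)
    return encoded

classes = [-1, 0, 1]  # negative, neutral, positive
print(one_hot_encode([-1, 0, 1, -1], classes))
```

Each row has exactly one 1, so the target has the same shape as a three-way softmax output: one value per class.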



In [75]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
y_train_nn = encoder.fit_transform(y_train_resampled.values.reshape(-1,1))
y_test_nn = encoder.transform(y_test.values.reshape(-1,1))


In [76]:
print(y_train_nn.shape)
print(y_test_nn.shape)

(21867, 3)
(2928, 3)


We are finally ready to train our Word2Vec model!
- sentences=X_train: the corpus of tokenized tweets to train our model on
- vector_size=100: we want the vector representation of each word to have 100 dimensions
- window=5: to capture semantic meaning, we look at up to five words before and five words after the target word. Based on these relationships, we build the word vector for each unique word.
- min_count=1: minimum number of occurrences for a word to be included in the vocabulary. Since we have a small dataset, we choose min_count=1 to capture as much data as possible


In [77]:

w2v_model=Word2Vec(sentences=X_train,vector_size=100,window=5,min_count=1)
# 9736 words in our Word2Vec model
# this is 2 fewer than vocab_length (9738) because the Word2Vec vocabulary has no padding index and no OOV token
len((w2v_model.wv.index_to_key))

9736

In [78]:
w2v_model.wv.index_to_key

['flight',
 'get',
 'thanks',
 'cancelled',
 'service',
 'help',
 'time',
 'customer',
 'would',
 'us',
 'amp',
 'hours',
 'hold',
 'flights',
 'plane',
 'thank',
 'please',
 'still',
 'one',
 'need',
 'back',
 'delayed',
 'bag',
 'gate',
 'call',
 'flightled',
 'got',
 'hour',
 'today',
 'like',
 'phone',
 'airline',
 'late',
 'guys',
 'fly',
 'know',
 'way',
 'waiting',
 'airport',
 'could',
 'going',
 'trying',
 'great',
 'day',
 'tomorrow',
 'change',
 'wait',
 'flying',
 'make',
 'people',
 'really',
 'go',
 'never',
 'weather',
 'check',
 'last',
 'good',
 'delay',
 'home',
 'even',
 'love',
 'minutes',
 'want',
 'united',
 'seat',
 'dm',
 'new',
 'agent',
 'another',
 'see',
 'told',
 'bags',
 'take',
 'luggage',
 'first',
 'w',
 'someone',
 'ticket',
 'number',
 'due',
 'let',
 'worst',
 'getting',
 'yes',
 'travel',
 'lost',
 'ever',
 'work',
 'baggage',
 'email',
 'next',
 'hrs',
 'much',
 'aa',
 'crew',
 'days',
 'made',
 'flighted',
 'seats',
 'response',
 'trip',
 'right',

In [79]:
w2v_model.wv['flight']

array([-3.7615287e-01,  8.5725266e-01,  2.5987071e-01,  7.3546015e-02,
       -3.0203399e-01, -1.3391430e+00,  3.3354762e-01,  1.7654251e+00,
       -5.8895457e-01, -7.4730760e-01, -4.4605944e-01, -1.1145122e+00,
        3.9260078e-02,  3.6485210e-01,  4.0418455e-01, -2.9382744e-01,
        3.7353748e-01, -5.9253871e-01, -3.2933635e-01, -1.4683536e+00,
        2.3295435e-01,  1.7544623e-01,  1.6776170e-01, -4.1336927e-01,
       -1.0141542e-01,  1.7239764e-01, -6.6001052e-01, -8.1728500e-01,
       -6.7839271e-01,  6.9302268e-02,  4.8842663e-01,  1.5539096e-01,
        3.3977520e-01, -7.2689217e-01, -2.8009936e-01,  9.4049758e-01,
        1.2920299e-01, -5.1844132e-01, -4.8186728e-01, -1.5740737e+00,
        2.5458280e-02, -5.0807947e-01, -2.1134545e-01, -5.3422922e-01,
        4.6598428e-01, -4.7604510e-01, -5.7212901e-01, -3.5838899e-01,
        2.9212499e-01,  5.7933432e-01,  3.9553839e-01, -4.3917257e-01,
       -3.1876230e-01,  2.1178989e-01, -3.7229744e-01,  2.0745037e-01,
      

In [80]:
# taking a look at the Word2Vec representation of the first word in the vocabulary (index 0, 'flight')
# as expected, there are 100 numbers representing the word
# print(w2v_model.wv[0].shape)
# w2v_model.wv[0].shape

- We create an embedding matrix that will serve as the initial weights in our neural network model.
- We iterate through every unique word in the tokenizer's word index and add the Word2Vec representation of that word to the matrix. Rows for indices without a Word2Vec vector (such as the padding index) stay all zeros.


In [81]:
embedding_matrix_w2v=np.zeros((vocab_length,100))
for word, i in word_tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix_w2v[i]=w2v_model.wv[word]

In [82]:
# vector representation for the word 'flight', the first real word in our data (index 2 in the tokenizer)
embedding_matrix_w2v[2]

array([-3.76152873e-01,  8.57252657e-01,  2.59870708e-01,  7.35460147e-02,
       -3.02033991e-01, -1.33914304e+00,  3.33547622e-01,  1.76542509e+00,
       -5.88954568e-01, -7.47307599e-01, -4.46059436e-01, -1.11451221e+00,
        3.92600782e-02,  3.64852101e-01,  4.04184550e-01, -2.93827444e-01,
        3.73537481e-01, -5.92538714e-01, -3.29336345e-01, -1.46835363e+00,
        2.32954353e-01,  1.75446227e-01,  1.67761698e-01, -4.13369268e-01,
       -1.01415418e-01,  1.72397643e-01, -6.60010517e-01, -8.17285001e-01,
       -6.78392708e-01,  6.93022683e-02,  4.88426626e-01,  1.55390963e-01,
        3.39775205e-01, -7.26892173e-01, -2.80099362e-01,  9.40497577e-01,
        1.29202992e-01, -5.18441319e-01, -4.81867284e-01, -1.57407367e+00,
        2.54582800e-02, -5.08079469e-01, -2.11345449e-01, -5.34229219e-01,
        4.65984285e-01, -4.76045102e-01, -5.72129011e-01, -3.58388990e-01,
        2.92124987e-01,  5.79334319e-01,  3.95538390e-01, -4.39172566e-01,
       -3.18762302e-01,  

In [83]:
embedding_matrix_w2v.shape
# we have the same number of rows as entries in our word tokenizer; each word is represented by its Word2Vec vector

(9738, 100)
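The shape alone does not tell us how many rows were actually filled. A minimal sketch of a coverage check is below; `word_index` and `toy_vectors` are hypothetical stand-ins for `word_tokenizer.word_index` and `w2v_model.wv`, and the 3-dimensional vectors are made up for illustration.

```python
import numpy as np

# Hypothetical toy stand-ins for word_tokenizer.word_index and w2v_model.wv
word_index = {"<OOV>": 1, "flight": 2, "delayed": 3, "thanks": 4}
toy_vectors = {
    "flight": np.array([0.1, -0.2, 0.3]),
    "delayed": np.array([0.5, 0.4, -0.1]),
    "thanks": np.array([-0.3, 0.2, 0.6]),
}
vocab_length = len(word_index) + 1  # +1 because index 0 is reserved for padding
dim = 3

# Same loop as above: copy a vector into row i only when the word has one
embedding_matrix = np.zeros((vocab_length, dim))
for word, i in word_index.items():
    if word in toy_vectors:
        embedding_matrix[i] = toy_vectors[word]

# Rows that are still all zeros were never filled
empty_rows = np.where(~embedding_matrix.any(axis=1))[0]
print(embedding_matrix.shape)
print(empty_rows.tolist())
```

In this toy setup only the padding row (0) and the OOV row (1) stay empty, which mirrors the difference between the 9738 rows of the matrix and the 9736 words in the Word2Vec model.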

### Building our neural networks
- We will be building three neural networks: a simple neural network, a convolutional neural network, and an LSTM (long short-term memory) recurrent neural network. I will first provide the code and an explanation for each of the networks, and then we will fit all the models.

#### Simple Neural Network using W2V as initial weights

In [84]:

snn_model_w2v = Sequential()
snn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
snn_model_w2v.add(Flatten())
snn_model_w2v.add(Dense(128, activation='relu'))
snn_model_w2v.add(Dropout(0.3))
snn_model_w2v.add(Dense(64, activation='relu'))
snn_model_w2v.add(Dense(3, activation='softmax'))  

In [85]:
# initiating a sequential model so we can build the model by adding layers
snn_model_w2v = Sequential()
# first layer is the embedding layer
# - input_dim: size of our vocabulary
# - output_dim: length of our word embedding (each word is represented by 100 numbers)
# - weights: as mentioned earlier, the word embeddings we created using w2v
# - trainable=True: allows the weights (word embeddings) to be updated during training
snn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
# flattening the 2-dimensional (words x embedding) representation into a 1-dimensional vector that can be passed into the next fully connected dense layer
snn_model_w2v.add(Flatten())
# adding a dense layer with 128 units using ReLU activation to introduce non-linearity, which helps the model learn complex relationships in the data
snn_model_w2v.add(Dense(128, activation='relu'))
# using a dropout rate of 0.3, which randomly sets 30% of the previous layer's outputs to 0 during training. This helps prevent overfitting.
snn_model_w2v.add(Dropout(0.3))
# adding another dense layer, this time with 64 units
snn_model_w2v.add(Dense(64, activation='relu'))
# adding an output layer of three units with softmax activation, which returns the probability that the observation falls in each of the three classes
snn_model_w2v.add(Dense(3, activation='softmax'))

##### Model Architecture


In [86]:
snn_model_w2v.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 100)          973800    
                                                                 
 flatten_1 (Flatten)         (None, 10000)             0         
                                                                 
 dense_3 (Dense)             (None, 128)               1280128   
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 dense_5 (Dense)             (None, 3)                 195       
                                                                 
Total params: 2,262,379
Trainable params: 2,262,379
No

##### Breaking down model architecture
- The summary lists the output shape and the number of parameters of each layer.
- embedding_1 (Embedding): takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word into a 100-dimensional vector using the initial w2v weights.
- flatten_1 (Flatten): takes the 100 x 100 matrix of embedding vectors and flattens it into a one-dimensional vector of length 10,000. This prepares the data to be fed into the dense layers.
- dense_3 (Dense): produces a vector of length 128 by capturing the most relevant patterns and relationships in the 10,000-value flattened layer.
- dropout_1 (Dropout): the shape remains the same, but during training we temporarily set some of the neurons to 0 to prevent overfitting.
- dense_4 (Dense): another dense layer, which outputs a vector of length 64 and captures more important patterns and relationships in the data.
- dense_5 (Dense): outputs a vector of length three, which provides the probability that each observation falls into each class.
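The parameter counts in the summary can be reproduced by hand. The sketch below assumes the values visible in the outputs above (a vocabulary of 9738 entries, 100-dimensional embeddings, sequences of length 100); the helper `dense_params` is hypothetical and simply counts weights plus biases.

```python
vocab_length, embed_dim, maxlen = 9738, 100, 100


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


layer_params = {
    "embedding": vocab_length * embed_dim,            # one 100-d vector per vocabulary entry
    "flatten": 0,                                     # reshaping only, nothing to learn
    "dense_128": dense_params(maxlen * embed_dim, 128),
    "dropout": 0,                                     # no weights
    "dense_64": dense_params(128, 64),
    "dense_3": dense_params(64, 3),
}
total = sum(layer_params.values())

for name, n in layer_params.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")
```

The total comes to 2,262,379, matching the summary. More than half of the parameters sit in the first dense layer, because flattening turns each tweet into a 10,000-value input.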


#### CNN Using Word2Vec
- uses convolution layers to learn hierarchies of features from the input data

In [87]:
cnn_model_w2v = Sequential()
cnn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
cnn_model_w2v.add(Conv1D(128, 3, activation='relu', padding='same'))
cnn_model_w2v.add(GlobalMaxPooling1D())
cnn_model_w2v.add(Dropout(0.5)) 
cnn_model_w2v.add(Dense(64, activation='relu'))
cnn_model_w2v.add(Dense(3, activation='softmax'))


In [88]:
# model explained
# initiating a sequential model so we can build the model by adding layers
cnn_model_w2v = Sequential()
# first layer is the embedding layer
# - input_dim: size of our vocabulary
# - output_dim: length of our word embedding (each word is represented by 100 numbers)
# - weights: as mentioned earlier, the word embeddings we created using w2v
# - trainable=True: allows the weights (word embeddings) to be updated during training
cnn_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
# applies a 1D convolution, which slides filters over small segments of the input sequence
# 128 is the number of filters used in the convolution layer. Each filter is responsible for learning a specific feature/pattern from the input
# 3 is the number of consecutive words the convolution operation looks at at a time
# activation='relu': activation function to introduce non-linearity
# padding='same': pads the sequence so the output has the same length (100) as the input
# we could keep stacking Conv1D layers, but we will simply use one
cnn_model_w2v.add(Conv1D(128, 3, activation='relu', padding='same'))
# extracts the maximum value of each of the 128 filters across the sequence, capturing the most important feature from each filter
cnn_model_w2v.add(GlobalMaxPooling1D())
# helps prevent overfitting by randomly setting half of the previous layer's outputs to 0 during training
cnn_model_w2v.add(Dropout(0.5))
# adding a dense layer with 64 neurons
cnn_model_w2v.add(Dense(64, activation='relu'))
# final layer that produces a probability score for each class
cnn_model_w2v.add(Dense(3, activation='softmax'))


##### Model Architecture 

In [None]:
cnn_model_w2v.summary()

##### Breaking down model architecture
- embedding (Embedding): takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word in the sequence into a 100-dimensional vector using the initial w2v weights.
- conv1d (Conv1D): processes the sequences of length 100 and uses 128 filters to identify various features, creating a feature map of shape 100 x 128.
- Stacking a second Conv1D layer would repeat this step and try to identify more features; the model above uses a single one.
- global_max_pooling1d (GlobalMaxPooling1D): extracts the max value from each feature map, which reduces the output to a single vector of 128 values.
- dropout (Dropout): temporarily sets half of the neurons to 0 during training to prevent overfitting.
- dense (Dense): processes the output from the previous layers and learns important relationships, reducing the size to a 64-dimensional vector.
    - The nonlinearity allows the model to identify even more complex relationships.
- dense (Dense): outputs a vector of length three, which provides the probability scores that each observation falls into each class.
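To see where the 100 x 128 feature map and the 128-value pooled vector come from, here is a minimal NumPy sketch of a 1D convolution with 'same' padding followed by global max pooling. It is an illustration with random numbers, not the Keras implementation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, n_filters, kernel_size = 100, 100, 128, 3

x = rng.normal(size=(seq_len, embed_dim))  # one embedded tweet: 100 words x 100 numbers
kernels = rng.normal(size=(kernel_size, embed_dim, n_filters)) * 0.01
bias = np.zeros(n_filters)

# 'same' padding for kernel_size=3: one row of zeros before and after the sequence
pad = (kernel_size - 1) // 2
x_padded = np.pad(x, ((pad, pad), (0, 0)))

feature_map = np.empty((seq_len, n_filters))
for t in range(seq_len):
    window = x_padded[t:t + kernel_size]  # three consecutive word vectors, shape (3, 100)
    # each filter multiplies the window elementwise and sums everything up
    feature_map[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
feature_map = np.maximum(feature_map, 0.0)  # ReLU

pooled = feature_map.max(axis=0)  # global max pooling: strongest response of each filter
print(feature_map.shape, pooled.shape)
```

Because of the padding, the feature map keeps one row per word position; pooling then keeps only the strongest activation of each filter, regardless of where in the tweet it occurred.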
 

In [None]:
from keras.layers import Bidirectional, GlobalMaxPooling1D

#### Long Short Term Memory using Word2Vec embeddings as initial weights
- a type of RNN that specializes in capturing long-term dependencies and maintaining information over extended sequences
- has different gates that allow the model to forget, update, and output information, which is helpful in processing long sequences
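The gates can be sketched in a few lines of NumPy. This is a simplified single time step with random weights, meant only to show how the forget, input, and output gates interact; the function `lstm_step` and its weight layout are assumptions for illustration, not the Keras internals.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values that could be written
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g             # update the long-term memory
    h_t = o * np.tanh(c_t)               # output for this time step
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units, steps = 100, 128, 5
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(steps, input_dim)):  # five hypothetical word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state `c` is what lets information survive across many steps: the forget gate can keep it almost unchanged while the input gate decides when to add something new.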

In [None]:
lstm_model_w2v = Sequential()
lstm_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
lstm_model_w2v.add(Bidirectional(LSTM(128, return_sequences=True)))  # Bidirectional LSTM
lstm_model_w2v.add(GlobalMaxPooling1D())  # Global Max Pooling
lstm_model_w2v.add(Dense(64, activation='relu'))  # Additional Dense layer
lstm_model_w2v.add(Dropout(0.5))
lstm_model_w2v.add(Dense(3, activation='softmax'))


In [None]:
# model explained
# initiating a sequential model so we can build the model by adding layers
lstm_model_w2v = Sequential()
# first layer is the embedding layer
# - input_dim: size of our vocabulary
# - output_dim: length of our word embedding (each word is represented by 100 numbers)
# - weights: as mentioned earlier, the word embeddings we created using w2v
# - trainable=True: allows the weights (word embeddings) to be updated during training
lstm_model_w2v.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_w2v], input_length=maxlen, trainable=True))
# uses an LSTM layer with 128 units (cells)
# wraps the LSTM layer in a Bidirectional wrapper so that it can process sequences in both the forward and backward direction
# return_sequences=True: returns the full sequence of outputs (one per time step) instead of only the last one
lstm_model_w2v.add(Bidirectional(LSTM(128, return_sequences=True)))
# collapses the sequence dimension by selecting the maximum value over all time steps for each feature
lstm_model_w2v.add(GlobalMaxPooling1D())
# adds a dense layer with 64 units and ReLU activation
lstm_model_w2v.add(Dense(64, activation='relu'))
# randomly drops out half of the units during training
lstm_model_w2v.add(Dropout(0.5))
# final layer that produces a probability score for each class
lstm_model_w2v.add(Dense(3, activation='softmax'))


##### Model Architecture

In [None]:
lstm_model_w2v.summary()


- embedding (Embedding): takes input sequences of length 100 (our tokenized and padded vectors) and transforms each word in the sequence into a 100-dimensional vector using the initial w2v weights.
- bidirectional (Bidirectional): processes input sequences of length 100 in the forward and backward directions, giving 128 x 2 = 256 features per time step.
    - There are two sets of outputs: one from the original sequence order and one from the reversed sequence order.
- global_max_pooling1d (GlobalMaxPooling1D): for each of the 256 features, takes the max value across the sequence and puts it into a vector.
- dense (Dense): a dense layer of 64 neurons that takes the 256 features from the previous layer, captures meaningful relationships, and condenses them to 64 values.
- dropout (Dropout): drops out half of the neurons during training to prevent overfitting.
- dense (Dense): outputs a vector of length three, which provides the probability scores that each observation falls into each class.
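The notebook shows no summary output for this model, so the sketch below derives the expected parameter counts from the standard formulas rather than copying them from a run. It assumes the same vocabulary size (9738) and embedding length (100) as above; `lstm_params` and `dense_params` are hypothetical helpers.

```python
vocab_length, embed_dim, units = 9738, 100, 128


def lstm_params(input_dim, units):
    """Four gate/candidate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


bilstm_out = 2 * units  # forward and backward outputs are concatenated: 256 features

layer_params = {
    "embedding": vocab_length * embed_dim,
    "bidirectional_lstm": 2 * lstm_params(embed_dim, units),  # one LSTM per direction
    "global_max_pooling": 0,
    "dense_64": dense_params(bilstm_out, 64),
    "dropout": 0,
    "dense_3": dense_params(64, 3),
}
total = sum(layer_params.values())

for name, n in layer_params.items():
    print(f"{name:>20}: {n:,}")
print(f"{'total':>20}: {total:,}")
```

Compared with the simple network, the recurrent layer is much smaller than the 10,000-input dense layer, so most of the parameters here belong to the embedding itself.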

### GloVe word embeddings 

- Another popular type of dense vector word embedding is the GloVe pretrained word embeddings. The GloVe word embeddings have already been pretrained on a large corpus. Based on the user's choice, the vectors that represent each word can be 50, 100, 200, or 300 numbers (for the glove.6B files). We will use 100 numbers.
- Similar to Word2Vec, word vectors that are close to each other in the vector space have similar semantic meanings.
- GloVe differs from Word2Vec in that it aims to capture the global context of a word (corpus-wide co-occurrence counts) rather than just the local context, which we specified earlier with the window parameter (how many words to look at before and after the target word).
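"Close to each other" is usually measured with cosine similarity. A minimal sketch with made-up 4-dimensional vectors is below; the numbers are hypothetical and chosen only so that two delay-related words point in a similar direction. Real GloVe or Word2Vec vectors would be looked up from the trained embeddings instead.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy embeddings, not taken from GloVe
toy = {
    "delayed":   np.array([0.9, 0.1, 0.0, 0.2]),
    "cancelled": np.array([0.8, 0.2, 0.1, 0.3]),
    "thanks":    np.array([-0.1, 0.9, 0.4, 0.0]),
}

sim_close = cosine_similarity(toy["delayed"], toy["cancelled"])
sim_far = cosine_similarity(toy["delayed"], toy["thanks"])
print(round(sim_close, 3), round(sim_far, 3))
```

With real embeddings the same function can be applied to `glove_dictionary[word]` or `w2v_model.wv[word]` to compare how each embedding places the same pair of words.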

#### Loading GloVe embeddings
- We will now load our GloVe embeddings. The GloVe embeddings contain 40,000 words, each represented by a vector of 100 numbers already defined in the GloVe model.
- We want to load these words into a dictionary with key = word and value = a vector of 100 floats.

In [None]:
# simply creating a dictionary with all the GloVe words
from numpy import asarray
glove_dictionary = dict()

# each line holds a word followed by its 100 vector components, separated by spaces
with open('a2_glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        glove_dictionary[word] = vector_dimensions

In [None]:
# there are 40,000 words in the GloVe file
len(glove_dictionary)

In [None]:
len(glove_dictionary['the'])
# each word is represented by 100 numbers 

In [None]:
# we will add the respective GloVe word embedding for each word in our corpus to our embedding matrix
from numpy import asarray
from numpy import zeros
embedding_matrix_glove=zeros((vocab_length,100))
for word, index in word_tokenizer.word_index.items():
    embedding_vector=glove_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix_glove[index]=embedding_vector

In [None]:
# the matrix has one row per tokenizer index; words not found in GloVe keep an all-zero row
embedding_matrix_glove.shape

From here, the code for the three models will be identical, except that our initial weights will be embedding_matrix_glove instead of embedding_matrix_w2v.

#### SNN with GloVe

In [None]:
snn_model_glove = Sequential()
snn_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
snn_model_glove.add(Flatten())
snn_model_glove.add(Dense(128, activation='relu'))
snn_model_glove.add(Dropout(0.3))
snn_model_glove.add(Dense(64, activation='relu'))
snn_model_glove.add(Dense(3, activation='softmax'))  # Output layer

#### CNN with GloVe

In [None]:
cnn_model_glove = Sequential()
cnn_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
cnn_model_glove.add(Conv1D(128, 3, activation='relu', padding='same'))  # Experiment with different filter sizes
cnn_model_glove.add(GlobalMaxPooling1D())
cnn_model_glove.add(Dropout(0.5)) 
cnn_model_glove.add(Dense(64, activation='relu'))
cnn_model_glove.add(Dense(3, activation='softmax'))

#### LSTM with GloVe

In [None]:
lstm_model_glove = Sequential()
lstm_model_glove.add(Embedding(input_dim=vocab_length, output_dim=100, weights=[embedding_matrix_glove], input_length=maxlen, trainable=True))
lstm_model_glove.add(Bidirectional(LSTM(128, return_sequences=True)))  # Bidirectional LSTM
lstm_model_glove.add(GlobalMaxPooling1D())  # Global Max Pooling
lstm_model_glove.add(Dense(64, activation='relu'))  # Additional Dense layer
lstm_model_glove.add(Dropout(0.5))
lstm_model_glove.add(Dense(3, activation='softmax'))


### Fitting all our models

In [None]:
nn_models=[snn_model_w2v,cnn_model_w2v,lstm_model_w2v,snn_model_glove,cnn_model_glove,lstm_model_glove]

In [None]:
grid_dict_nn = {0: 'Simple NN W2V', 1: 'CNN NN W2V',
                2: 'LSTM NN W2V', 3: 'Simple NN GloVe',
                4: 'CNN NN GloVe', 5: 'LSTM NN GloVe'}

In [None]:
reports_nn=[]
for i,model in enumerate(nn_models):
    model_name = grid_dict_nn[i]
    print(model_name)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_history = model.fit(X_train_resampled, y_train_nn, batch_size=32, epochs=20, verbose=1, validation_split=0.2)
    # predicted class = column with the highest softmax probability (0, 1 or 2)
    y_prob=model.predict(X_test_tokenized)
    y_classes=y_prob.argmax(axis=-1)

    # shifting the column indices 0, 1, 2 back to the original labels -1, 0, 1
    y_classes_transformed = y_classes - 1

    print(classification_report(y_test,y_classes_transformed,zero_division=0))
    report=classification_report(y_test,y_classes_transformed,zero_division=0,output_dict=True)

    # Concatenating classification reports for easier comparison. Not important to understand this code.
    report_df = pd.DataFrame(report).transpose()
    report_df.reset_index(inplace=True)
    report_df = report_df.rename(columns={'index': 'labels'})
    report_df['Model Name'] = model_name

    # one row per model, one column per (metric, label) pair
    pivot_df=report_df.pivot(index='Model Name',columns='labels')
    pivot_df.columns = [f'{col[0]} ({col[1]})' if col[1] else col[0] for col in pivot_df.columns]

    # the 'accuracy' row repeats the same value in every metric column, so we keep only one copy
    columns_to_drop=['precision (accuracy)','recall (accuracy)','f1-score (accuracy)','support (macro avg)','support (weighted avg)']
    final_df=pivot_df.drop(columns=columns_to_drop)
    final_df.rename(columns={'support (accuracy)': 'Accuracy'}, inplace=True)
    final_df = final_df[['Accuracy'] + [col for col in final_df.columns if col != 'Accuracy']]
    reports_nn.append(final_df)
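The `- 1` shift in the loop works because the one-hot columns are ordered by the sorted labels -1, 0, 1. A minimal sketch of a more explicit mapping is below, using hypothetical probabilities; with the fitted scikit-learn encoder, the `categories` array would presumably come from `encoder.categories_[0]`.

```python
import numpy as np

# Sorted class labels: negative, neutral, positive (assumed column order of the one-hot targets)
categories = np.array([-1, 0, 1])

# Hypothetical softmax outputs for four tweets
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
])

col_idx = y_prob.argmax(axis=-1)  # column with the highest probability
y_pred = categories[col_idx]      # look the label up instead of hard-coding "- 1"
print(y_pred.tolist())
```

Looking the label up keeps the predictions correct even if the label encoding changes, whereas `y_classes - 1` silently relies on the labels being exactly -1, 0, 1.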
    


In [None]:
combined_reports_df_nn=pd.concat(reports_nn)
combined_reports_df_nn['average_recall_pos_nue'] = (combined_reports_df_nn['recall (0)'] + combined_reports_df_nn['recall (1)']) / 2

combined_reports_df_nn.reset_index(drop=False, inplace=True)


In [None]:
combined_reports_df_nn

In [None]:
source = ColumnDataSource(data=combined_reports_df_nn)
output_notebook()

fig = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red','Precision vs. Recall (Negative)')
fig2 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green','Precision vs. Recall (Neutral)')
fig3 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Positive', 'blue','Precision vs. Recall (Positive)')

grid = gridplot([[fig, fig2, fig3]])
show(grid, notebook_handle=True)

In [None]:

source = ColumnDataSource(data=combined_reports_df_nn)
output_notebook()


fig4 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
fig4.xaxis.axis_label = 'Macro Average Recall'
fig4.yaxis.axis_label = 'Macro Average Precision'

fig5 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
fig5.xaxis.axis_label = 'Weighted Average Recall'
fig5.yaxis.axis_label = 'Weighted Average Precision'

grid = gridplot([[fig4,fig5]])
show(grid, notebook_handle=True)

### Why is it that GloVe embeddings result in better performance? 
Our GloVe embeddings performed better than the Word2Vec embeddings. Why could this be?
- Limited training size:
    - We trained our Word2Vec model on a dataset of limited size. It is likely that the model did not have enough data to accurately recognize patterns between words.
    - The GloVe embeddings come pretrained on an enormous corpus, so the word embeddings are likely more representative of the words.
- Semantic relationships:
    - GloVe embeddings are built from word co-occurrence statistics over the whole corpus (the global context), which may help them capture semantic relationships between words better than our locally trained model.

In [None]:
# Overall conclusions
all_combined_models=pd.concat([combined_reports_df, combined_reports_df_nn])

In [None]:
all_combined_models = all_combined_models.reset_index(drop=True)


In [None]:
all_combined_models

In [None]:
source = ColumnDataSource(data=all_combined_models)
output_notebook()

fig = create_scatter_plot('recall (-1)', 'precision (-1)', 'Negative', 'Negative', 'red','Precision vs. Recall (Negative)')
fig2 = create_scatter_plot('recall (0)', 'precision (0)', 'Neutral', 'Neutral', 'green','Precision vs. Recall (Neutral)')
fig3 = create_scatter_plot('recall (1)', 'precision (1)', 'Positive', 'Positive', 'blue','Precision vs. Recall (Positive)')

grid = gridplot([[fig, fig2, fig3]])
show(grid, notebook_handle=True)

In [None]:
sorted_df = all_combined_models.sort_values(by='average_recall_pos_nue', ascending=False)

fig4 = px.bar(sorted_df, x='Model Name', y='average_recall_pos_nue', title='Average Recall for Different Models After Tuning',
             labels={'Model Name': 'Model', 'average_recall_pos_nue': 'Average Recall'},
             color_discrete_sequence=['green'])

# Show the plot
fig4.show()


In [None]:
# source = ColumnDataSource(data=all_combined_models)
# output_notebook()


# fig5 = create_scatter_plot('recall (macro avg)', 'precision (macro avg)', 'Positive', 'Neutral', 'darkviolet','Macro Precision vs. Macro Recall')
# fig5.xaxis.axis_label = 'Macro Average Recall'
# fig5.yaxis.axis_label = 'Macro Average Precision'

# fig6 = create_scatter_plot('recall (weighted avg)', 'precision (weighted avg)', 'Positive', 'Neutral', 'deeppink','Weighted Precision vs. Weighted Recall')
# fig6.xaxis.axis_label = 'Weighted Average Recall'
# fig6.yaxis.axis_label = 'Weighted Average Precision'

# grid = gridplot([[fig5,fig6]])
# show(grid, notebook_handle=True)