## NLP - Lesson 3 - Sentiment Analysis

In Natural Language Processing (NLP), the goal is to learn patterns and extract insights from textual data using a computer. Because computers cannot understand raw text directly, we first convert the text into numerical data, which can then be used as input to both traditional and modern models. Techniques such as word embeddings (for example, word2vec) turn text of any length into fixed-size numeric vectors that machine learning algorithms can handle. I am writing this notebook to teach myself, and possibly others, these techniques from scratch and also using well-known Python libraries.
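
To make the idea of "text to numbers" concrete, here is a minimal from-scratch sketch of the simplest such conversion, a bag-of-words count vector. The function names and the two toy sentences are illustrative assumptions, not part of the dataset or of any library.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy sentences, used only for illustration
toy_texts = ["forest fire near town", "fire fire everywhere"]
vocab = build_vocabulary(toy_texts)
vectors = [bag_of_words(text, vocab) for text in toy_texts]

print(vocab)    # {'everywhere': 0, 'fire': 1, 'forest': 2, 'near': 3, 'town': 4}
print(vectors)  # [[0, 1, 1, 1, 1], [1, 2, 0, 0, 0]]
```

Each sentence becomes a vector whose length equals the vocabulary size, which is the kind of input a classifier can consume.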

### Libraries import

In [1]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, auc

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

In [3]:
from nltk.corpus import stopwords

In [4]:
pd.set_option('display.max_colwidth',None)

### Data - Twitter Disaster Classification

I downloaded this data from the Kaggle **Twitter Disaster Classification** competition.
Source - [Twitter Disaster Classification](https://www.kaggle.com/competitions/nlp-getting-started/overview)

In [5]:
data = pd.read_csv("./data/train.csv")
data.head(3)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected | 1 |


In [6]:
data.shape

(7613, 5)

In [7]:
data.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [8]:
data.nunique()

id          7613
keyword      221
location    3341
text        7503
target         2
dtype: int64

### Train-Test-Split

Let's split the data into train and test sets. We won't perform EDA in this notebook, as we have already conducted some basic EDA in [00_vectorizer](https://github.com/Vishaldawar/nlp_learnings/blob/main/00_vectorizer.ipynb).

In [9]:
train, test = train_test_split(data, test_size=0.25, random_state=100)
train.shape, test.shape

((5709, 5), (1904, 5))

In [10]:
train['target'].mean(), test['target'].mean(), data['target'].mean()

(0.43370117358556665, 0.4175420168067227, 0.4296597924602653)
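
The positive rate differs slightly between train (0.434) and test (0.418) because the split above is purely random. One common way to keep the rates closer is a stratified split; in scikit-learn this is the `stratify` argument of `train_test_split`. The sketch below illustrates the idea from scratch on hypothetical toy labels. It is not what this notebook ran, and the function name is an assumption.

```python
import random


def stratified_split(labels, test_size=0.25, seed=100):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 40 positives and 60 negatives
toy_labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(toy_labels)

train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

On these toy labels both parts end up with a positive rate of 0.4, because the same fraction is drawn from each class.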

In [11]:
class modelling:
    def __init__(self, train_x, train_y, test_x, test_y, target):
        self.train_X = train_x
        self.train_y = train_y
        self.test_X = test_x
        self.test_y = test_y
        self.target = target

    def train_model(self, clf):
        self.clf = clf
        self.clf.fit(self.train_X, self.train_y)

    def get_metrics(self, X, y, dataset='given', ret_metrics=False):
        print("#"*125)
        self.proba = self.clf.predict_proba(X)[:,1]
        self.pred = self.clf.predict(X)
        auc = roc_auc_score(y, self.proba)
        prec = precision_score(y, self.pred)
        rec = recall_score(y, self.pred)
        f1 = f1_score(y, self.pred)
        print(f"AUC for {dataset} : ",auc)
        print(f"Precision for {dataset} : ",prec)
        print(f"Recall for {dataset} : ",rec)
        print(f"F1 Score for {dataset} : ",f1)
        print("#"*125)
        if ret_metrics:
            return auc, prec, rec, f1

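    def plot_roc_auc_curve(self, y, pred):
        ## local import: the `sklearn` package itself is never imported at the top of the notebook
        from sklearn.metrics import roc_curve, auc
        fpr, tpr, threshold = roc_curve(y, pred)
        roc_auc = auc(fpr, tpr)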
        
        plt.title('Receiver Operating Characteristic')
        plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
        plt.legend(loc = 'lower right')
        plt.plot([0, 1], [0, 1],'r--')
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        plt.show()
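
To clarify what `get_metrics` reports, here is a minimal from-scratch sketch of precision, recall, F1 and ROC AUC for binary labels. The function names and the toy labels and scores are assumptions for illustration. The AUC here uses the pairwise-ranking definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half), which is equivalent to the area under the ROC curve.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive gets the higher score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

precision, recall, f1 = binary_metrics(y_true, y_pred)
auc_value = roc_auc(y_true, scores)
print(precision, recall, f1, auc_value)  # 0.75 0.75 0.75 0.875
```

Precision, recall and F1 depend on the 0.5 threshold, while AUC depends only on how the scores rank the examples. This is why the class passes `predict` output to the first three and `predict_proba` output to `roc_auc_score`.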

In [12]:
def clean_text(row,col):
    val = row[col]
    val = val.lower() ## removing case sensitivity
    val = re.compile(r'https?://\S+|www\.\S+').sub(r'',val) ## removing hyper-link information
    val = re.compile(r'[^a-zA-Z0-9]').sub(r' ',val).strip() ## keeping only alpha-numeric data and removing leading and trailing white spaces
    return val

def vector_clean_text(df):
    df['cleaned_text'] = df.apply(clean_text, args = ['text'], axis=1)
    return df

def tokenize(df):
    df['cleaned_text'] = df['cleaned_text'].map(lambda x : x.split(' '))
    return df

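STOP_WORDS = set(stopwords.words('english')) ## build the stop-word set once instead of reloading the list for every word

def filter_stopwords(row,col):
    word_list = row[col]
    filtered_words = [word for word in word_list if word not in STOP_WORDS] ## removing stopwords
    filtered_words = [word for word in filtered_words if word.strip() != ''] ## removing empty tokens
    filtered_words = " ".join([word for word in filtered_words if len(word) > 3]) ## keeping only words longer than 3 characters
    return filtered_words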

def vector_filter_stopwords(df):
    df['cleaned_text'] = df.apply(filter_stopwords, args=['cleaned_text'], axis=1)
    return df['cleaned_text']


clean_text_trans = FunctionTransformer(vector_clean_text)
tokenize_trans = FunctionTransformer(tokenize)
stopwords_trans = FunctionTransformer(vector_filter_stopwords)

In [13]:
sk_pipe = Pipeline([("clean", clean_text_trans), ("tokenize", tokenize_trans), ("stopwords", stopwords_trans)])
sk_pipe.fit(train)

In [14]:
train_data = train.copy()
train_data['cleaned_text'] = sk_pipe.transform(train_data)

In [15]:
test_data = test.copy()
test_data['cleaned_text'] = sk_pipe.transform(test_data)

In [16]:
train_data.head(10)

| | id | keyword | location | text | target | cleaned_text |
|---|---|---|---|---|---|---|
| 3398 | 4866 | explode | | Learn How I Gained Access To The Secrets Of The Top Earners &amp; Used Them To Explode My Home Business Here: http://t.co/SGXP1U5OL1 Please #RT | 0 | learn gained access secrets earners used explode home business please |
| 7330 | 10490 | wildfire | Vail Valley | We should all have a fire safety plan. RT @Matt_Kroschel: MOCK WILDFIRE near #Vail as agencies prepare for the worst. http://t.co/SWwyLRk0fv | 0 | fire safety plan matt kroschel mock wildfire near vail agencies prepare worst |
| 4903 | 6979 | massacre | Cimerak - Pangandaran | Review: Dude Bro Party Massacre III http://t.co/f0WQlobOoy by Patrick BromleyThe title sa http://t.co/THpBDPdj35 | 0 | review dude party massacre patrick bromleythe title |
| 6376 | 9112 | suicide%20bomb | | //./../.. Pic of 16yr old PKK suicide bomber who detonated bomb in Turkey Army trench released http://t.co/Sj57BoKsiB -/ | 1 | 16yr suicide bomber detonated bomb turkey army trench released |
| 1244 | 1792 | buildings%20on%20fire | Fort Walton Beach, Fl | They are evacuating buildings in that area of State Road 20. We still don't have confirmation of what is on fire. | 1 | evacuating buildings area state road still confirmation fire |
| 2071 | 2973 | dead | | beforeitsnews : Hundreds feared dead after Libyan migrant boat capsizes during rescue Û_ http://t.co/MjoeeBDLXn) http://t.co/fvEn1ex0PS | 1 | beforeitsnews hundreds feared dead libyan migrant boat capsizes rescue |
| 3603 | 5144 | fatal | | 11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh... | 1 | year charged manslaughter toddler report year charged manslaughter fatal |
| 2947 | 4238 | drowned | San Francisco, CA | 80 tons of cocaine worth 125 million dollars drowned in #Alameda .....now that's a American drought #coke | 1 | tons cocaine worth million dollars drowned alameda american drought coke |
| 2180 | 3124 | debris | Hamilton, Ontario Canada | Malaysia seem more certain than France.\n\nPlane debris is from missing MH370 http://t.co/eXZnmxbINJ | 1 | malaysia seem certain france plane debris missing mh370 |
| 7390 | 10575 | windstorm | | @blakeshelton DON'T be a FART ??in a WINDSTORM.FOLLOW ME ALREADY. JEEZ. | 1 | blakeshelton fart windstorm follow already jeez |


The text now looks much cleaner. As the examples show, hashtag and mention symbols, punctuation and links are gone. We have also removed case sensitivity, tokenized the text, and removed stopwords (to, and, of, etc.) along with every word of three characters or fewer. Now we can count occurrences of words with a count vectorizer or a TF-IDF vectorizer and implement sentiment analysis.
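
The cleaning steps can be traced on a single string with the following self-contained sketch. It uses the same two regular expressions as `clean_text`, but the tiny stop-word set and the example tweet (including its URL) are hypothetical stand-ins. The notebook itself uses the much larger NLTK English list.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses stopwords.words('english') from NLTK
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")


def clean_tweet(text, stop_words=TOY_STOPWORDS, min_len=4):
    """Lowercase, drop links and punctuation, then drop stopwords and short words."""
    text = text.lower()                              # remove case sensitivity
    text = URL_PATTERN.sub("", text)                 # remove hyperlinks
    text = NON_ALNUM_PATTERN.sub(" ", text).strip()  # keep only letters and digits
    tokens = text.split(" ")                         # may contain empty strings
    kept = [tok for tok in tokens
            if tok and tok not in stop_words and len(tok) >= min_len]
    return " ".join(kept)


example = "The fire is near the #Forest! See http://t.co/abc123"
cleaned = clean_tweet(example)
print(cleaned)  # fire near forest
```

Note that "see" disappears because of the length filter, not because it is a stopword. The same filter also drops short but potentially informative tokens in the real data.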

### Modelling

In [17]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [18]:
%%time

## Creating the vocabulary

all_texts_train = train_data['cleaned_text'].values
all_texts_test = test_data['cleaned_text'].values
len(all_texts_train), len(all_texts_test)

CPU times: user 43 µs, sys: 2 µs, total: 45 µs
Wall time: 47 µs


(5709, 1904)

In [19]:
all_texts_train[:5]

array(['learn gained access secrets earners used explode home business please',
       'fire safety plan matt kroschel mock wildfire near vail agencies prepare worst',
       'review dude party massacre patrick bromleythe title',
       '16yr suicide bomber detonated bomb turkey army trench released',
       'evacuating buildings area state road still confirmation fire'],
      dtype=object)

In [20]:
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

In [21]:
features = ['cleaned_text']
target = 'target'

In [22]:
tfidf = TfidfVectorizer()
train_x = tfidf.fit_transform(all_texts_train)
test_x = tfidf.transform(all_texts_test)
train_y = train_data[target]
test_y = test_data[target]
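
For intuition about what `TfidfVectorizer` computes, here is a from-scratch sketch using the smoothed formula that scikit-learn applies by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The function name and toy texts are assumptions. The sketch tokenizes by whitespace only, so it will not reproduce the library's output exactly on real data (the library, for instance, ignores single-character tokens by default).

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    docs = [text.split() for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows


# Hypothetical cleaned texts
toy_texts = ["fire near forest", "fire drill today"]
vocab, rows = tfidf_vectors(toy_texts)

print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Because "fire" appears in both toy documents, it receives a lower weight than words that occur in only one of them.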

In [23]:
lr = LogisticRegression(n_jobs=-1, random_state=100)
lr_obj = modelling(train_x, train_y, test_x, test_y, target)

In [24]:
lr_obj.train_model(lr)
lr_obj.get_metrics(train_x, train_y, dataset='train',ret_metrics=False)
lr_obj.get_metrics(test_x, test_y, dataset='test',ret_metrics=False)

#############################################################################################################################
AUC for train :  0.9618397238294306
Precision for train :  0.953248031496063
Recall for train :  0.7823101777059773
F1 Score for train :  0.8593611357586513
#############################################################################################################################
#############################################################################################################################
AUC for test :  0.8671282984841009
Precision for test :  0.8346213292117465
Recall for test :  0.6792452830188679
F1 Score for test :  0.7489597780859917
#############################################################################################################################


In [25]:
xgb = XGBClassifier(n_jobs=-1, random_state=100)
xgb_obj = modelling(train_x, train_y, test_x, test_y, target)

In [26]:
xgb_obj.train_model(xgb)
xgb_obj.get_metrics(train_x, train_y, dataset='train',ret_metrics=False)
xgb_obj.get_metrics(test_x, test_y, dataset='test',ret_metrics=False)

#############################################################################################################################
AUC for train :  0.9286385927233644
Precision for train :  0.9330117899249732
Recall for train :  0.7031502423263328
F1 Score for train :  0.8019345923537541
#############################################################################################################################
#############################################################################################################################
AUC for test :  0.8311527751784996
Precision for test :  0.8126064735945485
Recall for test :  0.6
F1 Score for test :  0.6903039073806078
#############################################################################################################################


In [27]:
xgb2 = XGBClassifier(n_estimators=200, max_depth=8,n_jobs=-1, random_state=100)
xgb_obj2 = modelling(train_x, train_y, test_x, test_y, target)

In [28]:
xgb_obj2.train_model(xgb2)
xgb_obj2.get_metrics(train_x, train_y, dataset='train',ret_metrics=False)
xgb_obj2.get_metrics(test_x, test_y, dataset='test',ret_metrics=False)

#############################################################################################################################
AUC for train :  0.969862489362776
Precision for train :  0.9591836734693877
Recall for train :  0.8162358642972536
F1 Score for train :  0.8819550512764565
#############################################################################################################################
#############################################################################################################################
AUC for test :  0.8406037509002955
Precision for test :  0.8095975232198143
Recall for test :  0.6578616352201258
F1 Score for test :  0.7258848022206801
#############################################################################################################################


All three models overfit, as we also saw in an earlier notebook: train AUC lies between 0.93 and 0.97, while test AUC lies between 0.83 and 0.87. Now let's apply word2vec using an existing library.
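
The size of the overfitting can be quantified as the gap between train and test metrics. The snippet below does this arithmetic on the AUC and F1 values printed above, rounded to four decimals. The dictionary layout and function name are illustrative choices.

```python
# Metrics copied from the outputs above, rounded to four decimals
reported = {
    "LogisticRegression": {
        "train_auc": 0.9618, "test_auc": 0.8671, "train_f1": 0.8594, "test_f1": 0.7490},
    "XGBoost (default)": {
        "train_auc": 0.9286, "test_auc": 0.8312, "train_f1": 0.8019, "test_f1": 0.6903},
    "XGBoost (200 trees, depth 8)": {
        "train_auc": 0.9699, "test_auc": 0.8406, "train_f1": 0.8820, "test_f1": 0.7259},
}


def train_test_gaps(metrics):
    """Return the train-minus-test gap for AUC and F1."""
    return {
        "auc_gap": round(metrics["train_auc"] - metrics["test_auc"], 4),
        "f1_gap": round(metrics["train_f1"] - metrics["test_f1"], 4),
    }


gaps = {name: train_test_gaps(m) for name, m in reported.items()}
for name, gap in gaps.items():
    print(f"{name:30s} AUC gap = {gap['auc_gap']:.4f}  F1 gap = {gap['f1_gap']:.4f}")
```

The deeper XGBoost model has the largest gap on both metrics. Logistic regression has the smallest gap and also the best test scores of the three.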

In [29]:
all_texts_train[:2]

array(['learn gained access secrets earners used explode home business please',
       'fire safety plan matt kroschel mock wildfire near vail agencies prepare worst'],
      dtype=object)

In [30]:
from gensim.models import word2vec, FastText

In [31]:
training_words = list(map(lambda x: x.split(), all_texts_train))
len(training_words)

5709

In [32]:
training_words[1]

['fire',
 'safety',
 'plan',
 'matt',
 'kroschel',
 'mock',
 'wildfire',
 'near',
 'vail',
 'agencies',
 'prepare',
 'worst']

In [33]:
%%time

vector_size = 100

word2vec_model = word2vec.Word2Vec(training_words,
                 vector_size=vector_size,
                 workers=8,
                 min_count=1)

print("Vocabulary Length:", len(word2vec_model.wv.key_to_index))

Vocabulary Length: 12637
CPU times: user 259 ms, sys: 6.37 ms, total: 265 ms
Wall time: 139 ms


In [34]:
print(word2vec_model)

Word2Vec<vocab=12637, vector_size=100, alpha=0.025>


In [35]:
word2vec_model.wv['learn'].shape

(100,)

In [36]:
similar_words = word2vec_model.wv.most_similar('learn', topn=5)
print(similar_words)

[('near', 0.44372454285621643), ('wahpeton', 0.4011099636554718), ('going', 0.3980615735054016), ('38pm', 0.3869110345840454), ('defend', 0.3681480586528778)]
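
`most_similar` ranks words by cosine similarity between their vectors. The neighbours of 'learn' above look unrelated and have low scores, which is plausible for embeddings trained on only 5,709 short tweets. The following sketch shows the ranking from scratch. The two-dimensional toy vectors and the function names are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional embeddings
toy_vectors = {"fire": [1.0, 0.0], "blaze": [0.9, 0.1], "party": [0.0, 1.0]}
neighbours = most_similar("fire", toy_vectors)
print(neighbours)
```

Here "blaze" ranks first because its vector points in almost the same direction as "fire", while "party" is orthogonal and scores 0.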


In [37]:
train_data.head(3)

| | id | keyword | location | text | target | cleaned_text |
|---|---|---|---|---|---|---|
| 3398 | 4866 | explode | | Learn How I Gained Access To The Secrets Of The Top Earners &amp; Used Them To Explode My Home Business Here: http://t.co/SGXP1U5OL1 Please #RT | 0 | learn gained access secrets earners used explode home business please |
| 7330 | 10490 | wildfire | Vail Valley | We should all have a fire safety plan. RT @Matt_Kroschel: MOCK WILDFIRE near #Vail as agencies prepare for the worst. http://t.co/SWwyLRk0fv | 0 | fire safety plan matt kroschel mock wildfire near vail agencies prepare worst |
| 4903 | 6979 | massacre | Cimerak - Pangandaran | Review: Dude Bro Party Massacre III http://t.co/f0WQlobOoy by Patrick BromleyThe title sa http://t.co/THpBDPdj35 | 0 | review dude party massacre patrick bromleythe title |


In [38]:
train_data['tokenized_text'] = train_data['cleaned_text'].map(lambda x : x.split())
test_data['tokenized_text'] = test_data['cleaned_text'].map(lambda x : x.split())

In [39]:
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

We will represent each tweet by the mean of the embeddings of all the words present in it.

In [40]:
%%time

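embedding_matrix_train = np.zeros([train_data.shape[0],vector_size])
c1 = 0 ## running count of words without an embedding
for idx, row in train_data.iterrows():
    vec = np.zeros([vector_size,])
    c = 0 ## number of words in this tweet that have an embedding
    for word in row['tokenized_text']:
        try:
            vec += word2vec_model.wv[word]
            c += 1
        except KeyError: ## word is not in the word2vec vocabulary
            c1 += 1
            print(f"No embedding for word : {word}, {c1}")
            continue
    vec = vec/max(c,1) ## mean embedding; stays all zeros if no word was found
    embedding_matrix_train[idx] = vec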

CPU times: user 391 ms, sys: 2.36 s, total: 2.75 s
Wall time: 217 ms


In [41]:
test_data.shape

(1904, 7)

In [42]:
%%time

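embedding_matrix_test = np.zeros([test_data.shape[0],vector_size])
c1 = 0 ## running count of test words without an embedding
for idx, row in test_data.iterrows():
    vec = np.zeros([vector_size,])
    c = 0 ## number of words in this tweet that have an embedding
    for word in row['tokenized_text']:
        try:
            vec += word2vec_model.wv[word]
            c += 1
        except KeyError: ## word was never seen while training word2vec
            c1 += 1
            # print(f"No embedding for word : {word}, {c1}")
            continue
    vec = vec/max(c,1) ## mean embedding; stays all zeros if no word was found
    embedding_matrix_test[idx] = vec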

CPU times: user 39.5 ms, sys: 1.28 ms, total: 40.8 ms
Wall time: 40 ms


In [43]:
embedding_matrix_train.shape, embedding_matrix_test.shape

((5709, 100), (1904, 100))
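
The averaging done in the two loops above can be sketched on a toy example. The embedding dictionary, its two-dimensional vectors and the function name are hypothetical. The logic mirrors the loops: sum the vectors of the known words, skip unknown ones, and divide by the number of words found (or by 1 if none were found).

```python
def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    total = [0.0] * dim
    found = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:  # out-of-vocabulary word: skip it
            continue
        total = [a + b for a, b in zip(total, vec)]
        found += 1
    return [v / max(found, 1) for v in total]


# Hypothetical 2-dimensional embeddings
toy_embeddings = {"fire": [1.0, 0.0], "forest": [0.0, 1.0]}

known = mean_embedding(["fire", "forest", "unknownword"], toy_embeddings, dim=2)
unknown = mean_embedding(["unknownword"], toy_embeddings, dim=2)
print(known)    # [0.5, 0.5]
print(unknown)  # [0.0, 0.0]
```

A tweet made up entirely of unknown words is mapped to the zero vector, so the classifier receives no signal for it.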

In [44]:
xgb_word2vec_model = XGBClassifier(n_estimators=200, max_depth=8,n_jobs=-1, random_state=100)
xgb_word2vec_obj = modelling(embedding_matrix_train, train_y, embedding_matrix_test, test_y, target)

In [45]:
xgb_word2vec_obj.train_model(xgb_word2vec_model)
xgb_word2vec_obj.get_metrics(embedding_matrix_train, train_y, dataset='train',ret_metrics=False)
xgb_word2vec_obj.get_metrics(embedding_matrix_test, test_y, dataset='test',ret_metrics=False)

#############################################################################################################################
AUC for train :  0.9995690768713394
Precision for train :  0.9893964110929854
Recall for train :  0.9798061389337641
F1 Score for train :  0.984577922077922
#############################################################################################################################
#############################################################################################################################
AUC for test :  0.7818012714724013
Precision for test :  0.7128129602356407
Recall for test :  0.6088050314465409
F1 Score for test :  0.6567164179104478
#############################################################################################################################


## Conclusion

With word2vec embeddings, the XGBoost model reaches roughly 98% or higher on every tracked metric on the train set. However, performance on the test set has dropped (AUC 0.78 and F1 0.66, both below the TF-IDF models), so the model overfits even more strongly than before. The embeddings were trained only on the 5,709 training tweets with a vector size of 100, so the knowledge they carry is limited. While creating embeddings for the test set, we also saw that many words were not found in the vocabulary. These words have no embeddings, so we could not extract any knowledge from them.
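
One way to put a number on the missing-vocabulary problem is the out-of-vocabulary (OOV) rate of the test tokens. Below is a minimal sketch on hypothetical toy data with an assumed function name. With the objects from this notebook, the vocabulary would be the keys of `word2vec_model.wv.key_to_index` and the texts would be `test_data['tokenized_text']`.

```python
def oov_rate(tokenized_texts, vocabulary):
    """Fraction of tokens that are not present in the vocabulary."""
    total = 0
    missing = 0
    for tokens in tokenized_texts:
        for tok in tokens:
            total += 1
            if tok not in vocabulary:
                missing += 1
    return missing / total if total else 0.0


# Hypothetical training vocabulary and tokenized test texts
train_vocab = {"fire", "forest", "near"}
toy_test_texts = [["fire", "near", "school"], ["earthquake"]]

rate = oov_rate(toy_test_texts, train_vocab)
print(rate)  # 0.5
```

A high OOV rate would support the explanation above. Embeddings pretrained on a larger corpus are one common way to reduce it, though that was not tried in this notebook.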