# **LIBRARIES IMPORT**

In [1]:
import sys
sys.path.append('D:\\Projects\\nlp-projects\\utils')
print(sys.path)

['C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\DLLs', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312\\Lib', 'C:\\Users\\legion\\AppData\\Local\\Programs\\Python\\Python312', 'd:\\Projects\\nlp-projects\\.venv', '', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\win32', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\win32\\lib', 'd:\\Projects\\nlp-projects\\.venv\\Lib\\site-packages\\Pythonwin', 'D:\\Projects\\nlp-projects\\utils']


In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
import os
import re
from dotenv import load_dotenv
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from urllib.parse import quote_plus
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer, TreebankWordTokenizer, wordpunct_tokenize
import nltk
import utils
import process_tweets
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec, FastText
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import AdaBoostClassifier


# **DATA WRANGLING & PROCESSING PIPELINE**

Our dataset is a collection of tweets with their sentiment labels. It is split into two files: `twitter_training.csv`, used to train the model, and `twitter_validation.csv`, used to validate it. Each file is loaded into a pandas DataFrame, and a preview of its rows is displayed to get a sense of the data.

In [3]:
dataframe = pd.read_csv('data/twitter_training.csv')
dataframe.drop(['id','game'],axis=1,inplace=True)

dataframe

Unnamed: 0,label,text
0,Positive,im getting on borderlands and i will murder yo...
1,Positive,I am coming to the borders and I will kill you...
2,Positive,im getting on borderlands and i will kill you ...
3,Positive,im coming on borderlands and i will murder you...
4,Positive,im getting on borderlands 2 and i will murder ...
...,...,...
74677,Positive,Just realized that the Windows partition of my...
74678,Positive,Just realized that my Mac window partition is ...
74679,Positive,Just realized the windows partition of my Mac ...
74680,Positive,Just realized between the windows partition of...


The tweets require extensive preprocessing before they can be used to train a model. The preprocessing step converts each tweet to lowercase and removes user handles, words starting with a dollar sign, hyperlinks, hashtags, punctuation, words with 2 or fewer letters, HTML special entities, excess whitespace, stopwords, and characters beyond the Basic Multilingual Plane (BMP) of Unicode. The processed tweet is then returned.
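
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the actual `process_tweets.processTweet` implementation, which is not shown in this notebook. The function name `clean_tweet`, the tiny `STOPWORDS` set, the regular expressions, and the order of the steps are all assumptions made for the example.

```python
import re

# Hypothetical stopword list for illustration only; the notebook's real list is not shown.
STOPWORDS = {"the", "and", "will", "you", "are", "for"}

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps described above (assumed order and patterns)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # hyperlinks
    text = re.sub(r"&\w+;", " ", text)                       # HTML special entities, e.g. &amp;
    text = re.sub(r"[@#$]\w+", " ", text)                    # handles, hashtags, $-prefixed words
    text = "".join(ch for ch in text if ord(ch) <= 0xFFFF)   # drop characters beyond the BMP
    text = re.sub(r"[^\w\s]", " ", text.lower())             # lowercase, then strip punctuation
    # Splitting on whitespace collapses repeated spaces; short words and stopwords are dropped.
    words = [w for w in text.split() if len(w) > 2 and w not in STOPWORDS]
    return " ".join(words)

sample = "@user Check https://t.co/abc &amp; buy $TSLA now!! #hype \U0001F600 I am on it"
print(clean_tweet(sample))  # check buy now
```

Here the whole hashtag token is removed rather than only the `#` sign; the source does not say which of the two the original function does.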


In [4]:
dataframe['text'] = dataframe['text'].apply(process_tweets.processTweet)

dataframe

Unnamed: 0,label,text
0,Positive,getting borderlands murder
1,Positive,coming borders kill
2,Positive,getting borderlands kill
3,Positive,coming borderlands murder
4,Positive,getting borderlands murder
...,...,...
74677,Positive,just realized windows partition mac like yea...
74678,Positive,just realized mac window partition years behi...
74679,Positive,just realized windows partition mac years be...
74680,Positive,just realized windows partition mac like ye...


### **Tokenization**

In [5]:
tokenized_dataframe = dataframe.copy()

tokenized_dataframe['text'] = tokenized_dataframe['text'].apply(word_tokenize)

tokenized_dataframe

Unnamed: 0,label,text
0,Positive,"[getting, borderlands, murder]"
1,Positive,"[coming, borders, kill]"
2,Positive,"[getting, borderlands, kill]"
3,Positive,"[coming, borderlands, murder]"
4,Positive,"[getting, borderlands, murder]"
...,...,...
74677,Positive,"[just, realized, windows, partition, mac, like..."
74678,Positive,"[just, realized, mac, window, partition, years..."
74679,Positive,"[just, realized, windows, partition, mac, year..."
74680,Positive,"[just, realized, windows, partition, mac, like..."


### **Sanity Check**

In [6]:
# Number of unique words
unique_words = set(word for sentence in tokenized_dataframe['text'] for word in sentence)
print(len(unique_words))

# Number of non-empty sentences
non_empty_sentences = len([sentence for sentence in tokenized_dataframe['text'] if sentence])
print(non_empty_sentences)

34062
72993


In [25]:
# remove empty sentences
tokenized_dataframe = tokenized_dataframe[tokenized_dataframe['text'].apply(lambda x: len(x) > 0)]

tokenized_dataframe

Unnamed: 0,label,text
0,3,"[getting, borderlands, murder]"
1,3,"[coming, borders, kill]"
2,3,"[getting, borderlands, kill]"
3,3,"[coming, borderlands, murder]"
4,3,"[getting, borderlands, murder]"
...,...,...
74677,3,"[just, realized, windows, partition, mac, like..."
74678,3,"[just, realized, mac, window, partition, years..."
74679,3,"[just, realized, windows, partition, mac, year..."
74680,3,"[just, realized, windows, partition, mac, like..."


In [8]:
# Lemmatization is disabled in this run; uncomment the lines below to enable it.
# tokenized_dataframe['text'] = tokenized_dataframe['text'].apply(lambda x: utils.lemma(x))
# tokenized_dataframe

### **Label Encoding**

In [9]:
label_enc = LabelEncoder()

tokenized_dataframe['label'] = label_enc.fit_transform(tokenized_dataframe['label'])

tokenized_dataframe

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokenized_dataframe['label'] = label_enc.fit_transform(tokenized_dataframe['label'])


Unnamed: 0,label,text
0,3,"[getting, borderlands, murder]"
1,3,"[coming, borders, kill]"
2,3,"[getting, borderlands, kill]"
3,3,"[coming, borderlands, murder]"
4,3,"[getting, borderlands, murder]"
...,...,...
74677,3,"[just, realized, windows, partition, mac, like..."
74678,3,"[just, realized, mac, window, partition, years..."
74679,3,"[just, realized, windows, partition, mac, year..."
74680,3,"[just, realized, windows, partition, mac, like..."
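
The integer codes in the table above follow from how `LabelEncoder` works: it stores the sorted unique class names and encodes each label as its index in that sorted list. The sketch below reproduces that mapping in plain Python for the four labels seen in this dataset; the short `labels` list is a made-up sample, not data from the files.

```python
# A made-up sample containing the four label values that appear in this notebook.
labels = ["Positive", "Neutral", "Negative", "Irrelevant", "Positive"]

# Sort the unique class names, then map each name to its position in that order.
classes = sorted(set(labels))
mapping = {name: idx for idx, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)  # {'Irrelevant': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3}
print(encoded)  # [3, 2, 1, 0, 3]
```

This is why every `Positive` row in the preview shows `3`.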


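**Repeating the same steps for the validation dataset**

The validation set goes through the same cleaning, tokenization, and empty-row filtering. The label encoder fitted on the training set is reused through `transform`, so both sets share one label-to-integer mapping.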

In [10]:
dataframe_validation = pd.read_csv('data/twitter_validation.csv')
dataframe_validation.drop(['id','game'],axis=1,inplace=True)

dataframe_validation

Unnamed: 0,label,text
0,Irrelevant,I mentioned on Facebook that I was struggling ...
1,Neutral,BBC News - Amazon boss Jeff Bezos rejects clai...
2,Negative,@Microsoft Why do I pay for WORD when it funct...
3,Negative,"CSGO matchmaking is so full of closet hacking,..."
4,Neutral,Now the President is slapping Americans in the...
...,...,...
995,Irrelevant,⭐️ Toronto is the arts and culture capital of ...
996,Irrelevant,tHIS IS ACTUALLY A GOOD MOVE TOT BRING MORE VI...
997,Positive,Today sucked so it’s time to drink wine n play...
998,Positive,Bought a fraction of Microsoft today. Small wins.


In [11]:
dataframe_validation['text'] = dataframe_validation['text'].apply(process_tweets.processTweet)

dataframe_validation

Unnamed: 0,label,text
0,Irrelevant,mentioned facebook struggling motivation r...
1,Neutral,bbc news amazon boss jeff bezos rejects claims...
2,Negative,why pay word functions poorly chromebook
3,Negative,csgo matchmaking full closet hacking truly aw...
4,Neutral,now president slapping americans face reall...
...,...,...
995,Irrelevant,toronto arts culture capital canada wonder...
996,Irrelevant,this actually good move tot bring more viewers...
997,Positive,today sucked time drink wine play borderlands...
998,Positive,bought fraction microsoft today small wins


In [12]:
tokenized_dataframe_validation = dataframe_validation.copy()

tokenized_dataframe_validation['text'] = tokenized_dataframe_validation['text'].apply(word_tokenize)

tokenized_dataframe_validation

Unnamed: 0,label,text
0,Irrelevant,"[mentioned, facebook, struggling, motivation, ..."
1,Neutral,"[bbc, news, amazon, boss, jeff, bezos, rejects..."
2,Negative,"[why, pay, word, functions, poorly, chromebook]"
3,Negative,"[csgo, matchmaking, full, closet, hacking, tru..."
4,Neutral,"[now, president, slapping, americans, face, re..."
...,...,...
995,Irrelevant,"[toronto, arts, culture, capital, canada, wond..."
996,Irrelevant,"[this, actually, good, move, tot, bring, more,..."
997,Positive,"[today, sucked, time, drink, wine, play, borde..."
998,Positive,"[bought, fraction, microsoft, today, small, wins]"


In [13]:
# Number of unique words
unique_words = set(word for sentence in tokenized_dataframe_validation['text'] for word in sentence)
print(len(unique_words))

# Number of non-empty sentences
non_empty_sentences = len([sentence for sentence in tokenized_dataframe_validation['text'] if sentence])
print(non_empty_sentences)

4085
999


In [14]:
# remove empty sentences
tokenized_dataframe_validation = tokenized_dataframe_validation[tokenized_dataframe_validation['text'].apply(lambda x: len(x) > 0)]

tokenized_dataframe_validation

Unnamed: 0,label,text
0,Irrelevant,"[mentioned, facebook, struggling, motivation, ..."
1,Neutral,"[bbc, news, amazon, boss, jeff, bezos, rejects..."
2,Negative,"[why, pay, word, functions, poorly, chromebook]"
3,Negative,"[csgo, matchmaking, full, closet, hacking, tru..."
4,Neutral,"[now, president, slapping, americans, face, re..."
...,...,...
995,Irrelevant,"[toronto, arts, culture, capital, canada, wond..."
996,Irrelevant,"[this, actually, good, move, tot, bring, more,..."
997,Positive,"[today, sucked, time, drink, wine, play, borde..."
998,Positive,"[bought, fraction, microsoft, today, small, wins]"


In [15]:
tokenized_dataframe_validation['label'] = label_enc.transform(tokenized_dataframe_validation['label'])

tokenized_dataframe_validation

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokenized_dataframe_validation['label'] = label_enc.transform(tokenized_dataframe_validation['label'])


Unnamed: 0,label,text
0,0,"[mentioned, facebook, struggling, motivation, ..."
1,2,"[bbc, news, amazon, boss, jeff, bezos, rejects..."
2,1,"[why, pay, word, functions, poorly, chromebook]"
3,1,"[csgo, matchmaking, full, closet, hacking, tru..."
4,2,"[now, president, slapping, americans, face, re..."
...,...,...
995,0,"[toronto, arts, culture, capital, canada, wond..."
996,0,"[this, actually, good, move, tot, bring, more,..."
997,3,"[today, sucked, time, drink, wine, play, borde..."
998,3,"[bought, fraction, microsoft, today, small, wins]"
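
Both label-encoding cells trigger the pandas warning about setting a value on a copy of a slice, because the frame being assigned to was produced by boolean filtering. One common way to avoid it is to call `.copy()` on the filtered result before assigning a column. The sketch below uses a small made-up frame, and a plain dictionary stands in for the fitted `label_enc`; both are assumptions for illustration.

```python
import pandas as pd

# Small made-up frame with one empty token list.
df = pd.DataFrame({
    "label": ["Positive", "Neutral", "Negative"],
    "text": [["small", "wins"], [], ["pay", "word"]],
})

# .copy() makes the filtered result an independent frame,
# so the column assignment below is unambiguous.
non_empty = df[df["text"].apply(len) > 0].copy()

# Hypothetical stand-in for label_enc.transform(...).
mapping = {"Irrelevant": 0, "Negative": 1, "Neutral": 2, "Positive": 3}
non_empty["label"] = non_empty["label"].map(mapping)

print(non_empty)
```

The original `df` keeps its string labels, which is usually what is wanted when the raw frame is reused later.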


### **Word2Vec**

### **Fine-Tuning**

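Fine-tuning is implemented through a custom callback function; because of computational constraints, it is not run in this notebook. Each classifier below is instead evaluated once through `utils.evaluate_clf_model` with a fixed `vector_size=13`.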

In [19]:
utils.evaluate_clf_model(tokenized_dataframe, tokenized_dataframe_validation, SVC(), vector_size=13)

Accuracy:  0.5265265265265265  - F1:  0.4540793258294027


(0.5265265265265265, 0.4540793258294027)
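
The helper `utils.evaluate_clf_model` is not shown in this notebook, so its internals are unknown. A classifier such as `SVC` needs one fixed-length vector per tweet, and a common way to get that from word embeddings is to average the vectors of the tokens in each tweet. The sketch below illustrates only that averaging step, under the assumption that the helper does something similar; the function `sentence_vector` and the toy 3-dimensional `embeddings` are hypothetical, and a trained Word2Vec model would supply the real vectors.

```python
from statistics import fmean

# Toy embeddings with made-up values; not taken from any trained model.
embeddings = {
    "small": [1.0, 0.0, 2.0],
    "wins": [3.0, 2.0, 0.0],
}

def sentence_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all token vectors.
    return [fmean(dim) for dim in zip(*known)]

print(sentence_vector(["small", "wins", "today"], embeddings, size=3))  # [2.0, 1.0, 1.0]
```

Out-of-vocabulary tokens such as `today` are skipped here; other strategies exist, and the source does not say which one the helper uses.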

In [20]:
utils.evaluate_clf_model(tokenized_dataframe, tokenized_dataframe_validation, GaussianNB(), vector_size=13)

Accuracy:  0.46346346346346345  - F1:  0.4257875578845655


(0.46346346346346345, 0.4257875578845655)

In [21]:
utils.evaluate_clf_model(tokenized_dataframe, tokenized_dataframe_validation, LogisticRegression(), vector_size=13)

Accuracy:  0.48148148148148145  - F1:  0.4165693311203361


(0.48148148148148145, 0.4165693311203361)

In [22]:
utils.evaluate_clf_model(tokenized_dataframe, tokenized_dataframe_validation, AdaBoostClassifier(), vector_size=13)



Accuracy:  0.47147147147147145  - F1:  0.4132542701133447


(0.47147147147147145, 0.4132542701133447)

In [23]:
utils.evaluate_clf_model(tokenized_dataframe, tokenized_dataframe_validation, LogisticRegressionCV(max_iter=1000), vector_size=13)

Accuracy:  0.4804804804804805  - F1:  0.4026972378906446


(0.4804804804804805, 0.4026972378906446)
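
Every run above reports an accuracy and an F1 score, with F1 consistently lower than accuracy. The notebook does not state which F1 averaging `utils.evaluate_clf_model` uses. The sketch below shows how accuracy and a macro-averaged F1 could be computed for a four-class problem, assuming macro averaging purely for illustration; the label lists are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels using the notebook's encoding (0..3).
y_true = [0, 1, 2, 3, 3, 3]
y_pred = [0, 1, 3, 3, 3, 1]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case class `2` is never predicted, so its F1 is 0 and the macro average (7/12) falls below the accuracy (2/3). A similar effect may explain the gap in the results above, though that is not confirmed by the source.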