# Sentiment Analysis: VADER

Dataset used: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

### Classification Report 

### Result Analysis 

# Sentiment Analysis: Naive Bayes 

In [None]:
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
from collections import Counter 


In [3]:
train_path = 'sentiment_train.csv'
train_data = pd.read_csv(train_path, on_bad_lines="warn", encoding='latin1')
train_texts = train_data['text'].astype(str).tolist()
train_labels = np.array(train_data['sentiment']) 

In [None]:
test_path = 'sentiment-topic-test.tsv' 
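# The test file appears to be UTF-8: read as latin1, its curly apostrophes show up garbled (e.g. "Itâs")
test_data = pd.read_csv(test_path, sep='\t', on_bad_lines="warn", encoding='utf-8')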
test_texts = test_data['sentence'].astype(str).tolist()
test_labels = np.array(test_data['sentiment'])

In [None]:
print("Frequency distribution of the training dataset")
Counter(train_labels)

Counter({'neutral': 11118, 'positive': 8582, 'negative': 7781})

In [19]:
# Share of each sentiment label in the training data, as a percentage
counter = dict(Counter(train_labels))

sum_instances = sum(counter.values())

for label in counter:
    counter[label] = f"{round((counter[label] / sum_instances) * 100, ndigits=2)}%"

print("Training data proportion")
counter

Training data proportion


{'neutral': '40.46%', 'negative': '28.31%', 'positive': '31.23%'}
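
These proportions matter for Naive Bayes because `MultinomialNB`, with its default `fit_prior=True`, estimates class priors from the label frequencies in the training data. The sketch below recomputes those priors by hand from the counts printed earlier; it does not fit a model, and the variable names are only illustrative.

```python
import math

# Label counts from the training-set frequency distribution shown earlier.
train_counts = {"neutral": 11118, "positive": 8582, "negative": 7781}
total = sum(train_counts.values())

# With fit_prior=True, the class prior is the relative label frequency;
# the classifier works with its logarithm.
log_priors = {label: math.log(count / total) for label, count in train_counts.items()}

# Print the classes from most to least likely a priori.
for label, log_prior in sorted(log_priors.items(), key=lambda item: -item[1]):
    print(f"{label:>8}  prior={math.exp(log_prior):.4f}  log prior={log_prior:.3f}")
```

Neutral gets the largest prior, so when the word evidence for a sentence is weak the model leans towards predicting neutral.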

In [13]:
print("Frequency distribution of the testing dataset")
Counter(test_labels)

Frequency distribution of the testing dataset


Counter({'positive': 6, 'neutral': 6, 'negative': 6})

In [None]:
# Bag-of-words representation; min_df=2 drops tokens that occur in fewer than two training texts
vec = CountVectorizer(min_df=2,
                      tokenizer=nltk.word_tokenize,
                      stop_words=stopwords.words('english'))

train_features = vec.fit_transform(train_texts)

test_features = vec.transform(test_texts)

model = MultinomialNB()
clf = model.fit(train_features, train_labels)
y_pred = clf.predict(test_features)
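
To make the effect of `min_df` in the cell above concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the helper `vocabulary_for` are illustrative assumptions, not project data). It uses `CountVectorizer`'s default tokenisation rather than `nltk.word_tokenize`, so it needs no NLTK downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): three short, tweet-like texts.
docs = [
    "what a win",
    "what a great win",
    "great game tonight",
]

def vocabulary_for(min_df):
    """Return the sorted vocabulary kept for a given min_df value."""
    vec = CountVectorizer(min_df=min_df)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

vocab_all = vocabulary_for(1)   # keep every token
vocab_min2 = vocabulary_for(2)  # keep tokens found in at least two texts

print(vocab_all)
print(vocab_min2)
```

With `min_df=2`, "game" and "tonight" disappear because each occurs in only one text. On short texts, raising the threshold further removes words quickly, including ones that carry sentiment.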




### Classification Report

In [5]:
print(classification_report(test_labels, y_pred))

              precision    recall  f1-score   support

    negative       0.40      0.33      0.36         6
     neutral       0.45      0.83      0.59         6
    positive       1.00      0.33      0.50         6

    accuracy                           0.50        18
   macro avg       0.62      0.50      0.48        18
weighted avg       0.62      0.50      0.48        18



### Result Analysis 


In [10]:
for text, pred, actual in zip(test_texts, y_pred, test_labels):
    print(f"Text: {text}\nPredicted Label: {pred}\nActual Label: {actual}\n")

Text: The atmosphere at the stadium tonight was electric.
Predicted Label: negative
Actual Label: positive

Text: The game was so intense I forgot to breathe at times. What a win!
Predicted Label: negative
Actual Label: positive

Text: It had me hooked from the first chapter.
Predicted Label: neutral
Actual Label: positive

Text: It’s more of a slow burn than a page-turner, but it’s well-written, I guess.
Predicted Label: negative
Actual Label: neutral

Text: It’s split into two timelines, which keeps it interesting but also a bit confusing at times.
Predicted Label: neutral
Actual Label: neutral

Text: I could watch this film a hundred times and still find something new to love about it.
Predicted Label: neutral
Actual Label: positive

Text: Best thriller I’ve seen in ages. Had me on the edge of my seat the entire time.
Predicted Label: positive
Actual Label: positive

Text: How do you concede three goals in ten minutes? The whole defence needs replacing.
Predicted Label: neut

The model struggles with negative sentiment, as the precision (0.40) and recall (0.33) for the negative label show. Neutral sentiment, on the other hand, has fairly high recall (0.83) but lower precision (0.45): the model finds most neutral sentences but also labels many non-neutral sentences as neutral. This could be because neutral is the largest class in the training data (around 40%). For positive sentiment the precision is perfect (1.00), so every sentence predicted as positive is actually positive, but the low recall (0.33) shows that most positive sentences are not classified as positive. With only 6 test sentences per class, each of these scores rests on a handful of predictions.
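
The precision and recall values discussed here can be traced back to a confusion matrix. The sketch below uses a matrix inferred from the reported precision, recall and support values (a reconstruction, not output from the model) and recomputes the per-class scores from it. The helper `precision_recall` is illustrative.

```python
labels = ["negative", "neutral", "positive"]

# Rows = actual label, columns = predicted label.
# Assumption: inferred from the classification report, not re-run from the model.
confusion = [
    [2, 4, 0],  # actual negative
    [1, 5, 0],  # actual neutral
    [2, 2, 2],  # actual positive
]

def precision_recall(matrix, index):
    """Precision and recall for the class at position `index`."""
    true_positives = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    return true_positives / predicted_total, true_positives / actual_total

scores = {label: precision_recall(confusion, i) for i, label in enumerate(labels)}
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))

for label, (precision, recall) in scores.items():
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, 11 of the 18 test sentences are predicted as neutral, which accounts for both the high neutral recall and the low neutral precision.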

- Choices we have to justify in the poster:
    + Why Naive Bayes -> literature

    + min_df value -> compare results with min_df=5 or min_df=10. These perform worse because the data consists of short, tweet-like sentences, so many words that matter for sentiment occur in only a few texts but should still be included when determining sentiment. Setting min_df too high risks losing these key words.

    + BoW representation -> back up with literature: focus on how it works and that it works well with NB

    + Removing stopwords -> compare results with and without stopword removal. Removing stopwords lets the model focus on words that actually help determine sentiment and ignore common function words (e.g., "the", "is").

    + Why NLTK tokenisation instead of letting CountVectorizer handle tokenisation -> literature: explain how NLTK tokenises (handles punctuation, contractions, special cases) and how CountVectorizer tokenises by default (a regular expression that keeps runs of two or more word characters and treats punctuation purely as a separator)
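
For the tokenisation choice, it may help to see what `CountVectorizer` does when no custom tokenizer is passed. The sketch below applies its documented default `token_pattern` with Python's `re` module to a sentence adapted from the test set. The helper `default_sklearn_tokens` is illustrative and only approximates the vectorizer's default analysis (lowercasing followed by the regex).

```python
import re

# Default `token_pattern` of scikit-learn's CountVectorizer:
# runs of two or more word characters; punctuation only acts as a separator.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_sklearn_tokens(text):
    """Approximate CountVectorizer's default analysis: lowercase, then regex match."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

sentence = "It's well-written, I guess."
tokens = default_sklearn_tokens(sentence)
print(tokens)
```

The default pattern drops punctuation and single-character tokens and splits "well-written" in two. `nltk.word_tokenize` is not run here because it needs NLTK's tokenizer data, but it would be expected to keep punctuation marks and clitics such as `'s` as separate tokens, which then become features unless they are filtered out.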

# Comparison