# Sentiment Analysis: VADER

Dataset used: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

#### Running VADER on our data

In [6]:
# Import libraries
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import pandas as pd
import gensim
import numpy as np
import sklearn
from sklearn.metrics import classification_report
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import matplotlib.pyplot as plt 
import seaborn as sn 
import nltk 
from nltk.stem import WordNetLemmatizer 
from sklearn import svm

In [7]:
def vader_output_to_label(vader_output):
    """
    Map VADER output, e.g.
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to a label based on the sign of the compound score:
    a) compound > 0.0  -> 'positive'
    b) compound == 0.0 -> 'neutral'
    c) compound < 0.0  -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0.0:
        return 'negative'
    if compound > 0.0:
        return 'positive'
    # Only a compound score of exactly 0.0 reaches this point
    return 'neutral'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
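
The mapping above treats only a compound score of exactly 0.0 as neutral. The VADER project's README suggests a small neutral band instead (positive at 0.05 or above, negative at -0.05 or below). The sketch below shows that variant for comparison; the function name `compound_to_label` and its `threshold` parameter are hypothetical and not part of this notebook, and whether the band improves the scores on this dataset would have to be checked.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score to a label using a symmetric neutral band.

    Hypothetical helper for illustration; the 0.05 default follows the
    typical cutoff suggested in the VADER README.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    # Scores strictly between -threshold and +threshold count as neutral
    return 'neutral'


# Weakly polarised scores now fall into the neutral band
print(compound_to_label(0.01))
print(compound_to_label(0.4215))
print(compound_to_label(-0.3))
```

With the sign-based rule used in this notebook, a score of 0.01 is already labelled positive; with the band it is labelled neutral.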

In [8]:
# Load the dataset
path = 'sentiment_train.csv'
dataset = pd.read_csv(path, encoding='latin1')

In [30]:
text = dataset['text']
print(len(text))
sentiment = dataset['sentiment']

27481


In [None]:
tweets = []
all_vader_output = []
gold = []

vader_model = SentimentIntensityAnalyzer()

for _, data in dataset.iterrows():
    the_tweet = data['text']
    if isinstance(the_tweet, str) and the_tweet.strip():  # only score non-empty string tweets
        vader_output = vader_model.polarity_scores(the_tweet)
        vader_label = vader_output_to_label(vader_output)
    else:
        # Empty or missing tweet: fall back to 'neutral' instead of
        # silently reusing the label of the previous tweet
        vader_label = 'neutral'

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(data['sentiment'])

    # Print tweet and its classification
    print(f"Tweet: {the_tweet}")
    print(f"Predicted Sentiment: {vader_label}")
    print(f"Actual Sentiment: {data['sentiment']}")
    print()

Tweet:  I`d have responded, if I were going
Predicted Sentiment: neutral
Actual Sentiment: neutral

Tweet:  Sooo SAD I will miss you here in San Diego!!!
Predicted Sentiment: negative
Actual Sentiment: negative

Tweet: my boss is bullying me...
Predicted Sentiment: negative
Actual Sentiment: negative

Tweet:  what interview! leave me alone
Predicted Sentiment: negative
Actual Sentiment: negative

Tweet:  Sons of ****, why couldn`t they put them on the releases we already bought
Predicted Sentiment: neutral
Actual Sentiment: negative

Tweet: http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth
Predicted Sentiment: positive
Actual Sentiment: neutral

Tweet: 2am feedings for the baby are fun when he is all smiles and coos
Predicted Sentiment: positive
Actual Sentiment: positive

Tweet: Soooo high
Predicted Sentiment: neutral
Actual Sentiment: neutral

Tweet:  Both of you
Predicted Sentiment: neutral
Actual Sentiment: neutral

Tweet:  Journey!? Wow..

### Classification Report 

In [25]:
print(classification_report(gold, all_vader_output))

              precision    recall  f1-score   support

    negative       0.70      0.60      0.65      7781
     neutral       0.72      0.46      0.56     11118
    positive       0.55      0.87      0.68      8582

    accuracy                           0.63     27481
   macro avg       0.65      0.65      0.63     27481
weighted avg       0.66      0.63      0.62     27481
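
To make the columns of the report easier to read, the sketch below recomputes precision, recall, F1 and support per class by hand. The helper `per_class_metrics` and the toy label lists are hypothetical and only illustrate the definitions; they are not the notebook's data, and `classification_report` remains the function actually used here.

```python
from collections import Counter


def per_class_metrics(gold, predicted):
    """Compute precision, recall, F1 and support for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true class but was missed

    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_total = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {'precision': precision, 'recall': recall,
                          'f1': f1, 'support': support}
    return metrics


# Toy labels for illustration only (not taken from the dataset)
toy_gold = ['positive', 'positive', 'negative', 'neutral', 'neutral', 'negative']
toy_pred = ['positive', 'neutral', 'neutral', 'neutral', 'positive', 'negative']

for label, scores in per_class_metrics(toy_gold, toy_pred).items():
    print(label, scores)
```

Precision is the share of predictions for a class that are correct, recall is the share of gold items of that class that were found, and support is the number of gold items per class.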



#### Running VADER on given test data

In [26]:
path = 'sentiment-topic-test.tsv'
data_tsv = pd.read_csv(path, sep="\t")

In [28]:
tweets = []
all_vader_output_tsv = []
gold_tsv = []

vader_model = SentimentIntensityAnalyzer()

for _, data in data_tsv.iterrows():
    the_tweet = data['sentence']
    if isinstance(the_tweet, str) and the_tweet.strip():  # only score non-empty string sentences
        vader_output = vader_model.polarity_scores(the_tweet)
        vader_label = vader_output_to_label(vader_output)
    else:
        # Empty or missing sentence: fall back to 'neutral' instead of
        # silently reusing the label of the previous sentence
        vader_label = 'neutral'

    tweets.append(the_tweet)
    all_vader_output_tsv.append(vader_label)
    gold_tsv.append(data['sentiment'])

    # Print sentence and its classification
    print(f"Tweet: {the_tweet}")
    print(f"Predicted Sentiment: {vader_label}")
    print(f"Actual Sentiment: {data['sentiment']}")
    print()


Tweet: The atmosphere at the stadium tonight was electric.
Predicted Sentiment: neutral
Actual Sentiment: positive

Tweet: The game was so intense I forgot to breathe at times. What a win!
Predicted Sentiment: positive
Actual Sentiment: positive

Tweet: It had me hooked from the first chapter.
Predicted Sentiment: neutral
Actual Sentiment: positive

Tweet: It’s more of a slow burn than a page-turner, but it’s well-written, I guess.
Predicted Sentiment: neutral
Actual Sentiment: neutral

Tweet: It’s split into two timelines, which keeps it interesting but also a bit confusing at times.
Predicted Sentiment: negative
Actual Sentiment: neutral

Tweet: I could watch this film a hundred times and still find something new to love about it.
Predicted Sentiment: positive
Actual Sentiment: positive

Tweet: Best thriller I’ve seen in ages. Had me on the edge of my seat the entire time.
Predicted Sentiment: positive
Actual Sentiment: positive

Tweet: How do you concede three goals in ten minutes? 

In [29]:
print(classification_report(gold_tsv, all_vader_output_tsv))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         6
     neutral       0.33      0.50      0.40         6
    positive       0.43      0.50      0.46         6

    accuracy                           0.33        18
   macro avg       0.25      0.33      0.29        18
weighted avg       0.25      0.33      0.29        18



### Result Analysis

The VADER model performs considerably better on the dataset we chose (accuracy 0.63, macro F1 0.63) than on the given test data (accuracy 0.33, macro F1 0.29). Part of this gap may be explained by the difference in support: our data consists of 27,481 tweets, whereas the test data consists of only 18 sentences, so every single misclassification there shifts the accuracy by more than five percentage points. Even so, the classification report for the test data shows that the model struggles with negative sentences in particular: none of the six negative sentences was labelled correctly (precision and recall of 0.00). A likely reason is that some of the 18 sentences are phrased in a way that misleads the model, especially when exaggerating words such as 'incredibly' are used.
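
To give a rough sense of how much the sample size matters, the sketch below computes an approximate 95% Wilson score interval for the two reported accuracies. The helper `wilson_interval` is hypothetical, the inputs are the rounded values from the two classification reports, and the interval treats each prediction as an independent trial, which is a simplifying assumption.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Approximate Wilson score interval for a proportion p_hat observed on n samples.

    z=1.96 corresponds to a confidence level of roughly 95%.
    """
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Accuracy and support as printed in the two classification reports (rounded)
for name, acc, n in [('our data', 0.63, 27481), ('test data', 0.33, 18)]:
    low, high = wilson_interval(acc, n)
    print(f"{name}: accuracy {acc:.2f}, n={n}, interval ({low:.2f}, {high:.2f})")
```

For 18 sentences the interval is wide, so the test-set accuracy alone gives only a coarse estimate of how well the model handles this kind of text.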


# Sentiment Analysis: Naive Bayes 

### Classification Report

### Result Analysis 


# Comparison