# Sentiment Analysis of 1.6 M Tweets using VADER

### Introduction to VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool developed by C.J. Hutto and Eric Gilbert. It is widely used because it is simple and works well on social media text. VADER combines a human-rated lexicon of thousands of words, each with an associated sentiment score, with rules and heuristics that handle linguistic nuances such as negation, intensifiers, punctuation, and capitalization. "Valence aware" means that it accounts for both the polarity and the intensity of each word in context. For a given text it returns positive, negative, and neutral proportions, plus a compound score that summarizes the overall polarity. Typical applications include social media analysis, customer feedback analysis, and brand monitoring.

### Objective:
The primary objective of this project is to compare the performance of VADER on uncleaned (raw) tweet text and on the same text after cleaning. The results also serve as a baseline score for sentiment analysis with VADER.

In [1]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\MILAN\AppData\Roaming\nltk_data...


True

In [2]:
sia = SentimentIntensityAnalyzer()

# "amaing" is a misspelling and is not in the VADER lexicon, so only "love" affects the scores
text = "I love this product. It is amaing."

sentiment_scores = sia.polarity_scores(text)
sentiment_scores

{'neg': 0.0, 'neu': 0.543, 'pos': 0.457, 'compound': 0.6369}

In [3]:
compound_score = sentiment_scores['compound']

if compound_score >= 0.05:
    print("Positive")
elif compound_score <= -0.05:
    print("Negative")
else:
    print("Neutral")

Positive


### Importing Libraries and Modules

In [4]:
import numpy as np
import pandas as pd

import re

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss

import nltk
nltk.data.path.append("/path/to/nltk_data")

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

pd.options.mode.copy_on_write = True

### Importing Dataset

In [7]:
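# The file has no header row. Without header=None, pandas uses the first tweet as
# column names, which is why one record is missing from the row counts below.
tweets = pd.read_csv("C:/Users/MILAN/Downloads/training.1600000.processed.noemoticon.csv", encoding="ISO-8859-1")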
tweets.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
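
The header row in the output above is actually a tweet record, because the CSV has no header line. A minimal sketch of how `header=None` together with `names` keeps that first record is shown below. The two records are shortened, made-up stand-ins in the same six-column layout, not rows from the dataset.

```python
import io
import pandas as pd

# Two header-less records in the same six-column layout (illustrative values only)
raw = io.StringIO(
    '0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,first tweet\n'
    '4,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,second tweet\n'
)

cols = ["target", "ids", "Date", "flag", "user", "text_"]

# header=None tells pandas that the first line is data, not column names
demo = pd.read_csv(raw, header=None, names=cols)

print(len(demo))             # both records are kept
print(list(demo.columns))
```

Applying the same two arguments to the real file would change the row count and therefore the random sample drawn later, so the outputs recorded in this notebook would not be reproduced exactly.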


In [8]:
tweets.columns = ["target", "ids", "Date", "flag", "user", "text_"]
tweets.head()

Unnamed: 0,target,ids,Date,flag,user,text_
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [9]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1599999 non-null  int64 
 1   ids     1599999 non-null  int64 
 2   Date    1599999 non-null  object
 3   flag    1599999 non-null  object
 4   user    1599999 non-null  object
 5   text_   1599999 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [10]:
tweets.describe(include=['object'])

Unnamed: 0,Date,flag,user,text_
count,1599999,1599999,1599999,1599999
unique,774362,1,659775,1581465
top,Mon Jun 15 12:53:14 PDT 2009,NO_QUERY,lost_dog,isPlayer Has Died! Sorry
freq,20,1599999,549,210


In [11]:
tweets.target.unique()

array([0, 4], dtype=int64)

In [12]:
tweets['target'] = tweets['target'].replace(4, 1)

In [13]:
tweets['ids'].nunique()

1598314

### Data Preprocessing

In [14]:
tweets = tweets.sample(n=100000, random_state=42)
tweets = tweets.drop_duplicates(subset=['ids'])
tweets.reset_index(drop=True, inplace=True)

column_to_delete = ['flag']
tweets = tweets.drop(column_to_delete, axis=1)
tweets.head()

Unnamed: 0,target,ids,Date,user,text_
0,0,2200003313,Tue Jun 16 18:18:13 PDT 2009,DEWGetMeTho77,@Nkluvr4eva My poor little dumpling In Holmde...
1,0,1467998601,Mon Apr 06 23:11:18 PDT 2009,Young_J,I'm off too bed. I gotta wake up hella early t...
2,0,2300049112,Tue Jun 23 13:40:12 PDT 2009,dougnawoschik,I havent been able to listen to it yet My spe...
3,0,1993474319,Mon Jun 01 10:26:09 PDT 2009,thireven,now remembers why solving a relatively big equ...
4,0,2256551006,Sat Jun 20 12:56:51 PDT 2009,taracollins086,"Ate too much, feel sick"


Removing usernames from text: We iterate over each row of the text column and check whether the tweet starts with an '@' symbol, which indicates a mention. If it does, we find the index of the first space character and keep only the text that follows it, which removes the leading mention.

In [15]:
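# Work on a copy of the raw tweet text
tweets['text'] = tweets['text_']

# Strip a leading @mention: drop everything up to and including the first space
for i in range(len(tweets['text'])):
    str_val = tweets['text'].iloc[i]
    if str_val.startswith("@"):
        space_idx = str_val.find(" ")
        # find() returns -1 when the tweet is only a mention; leave such rows unchanged
        if space_idx != -1:
            tweets.loc[i, 'text'] = str_val[space_idx + 1:]

tweets.drop(columns=['text_'], inplace=True)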

tweets.head()

Unnamed: 0,target,ids,Date,user,text
0,0,2200003313,Tue Jun 16 18:18:13 PDT 2009,DEWGetMeTho77,My poor little dumpling In Holmdel vids he wa...
1,0,1467998601,Mon Apr 06 23:11:18 PDT 2009,Young_J,I'm off too bed. I gotta wake up hella early t...
2,0,2300049112,Tue Jun 23 13:40:12 PDT 2009,dougnawoschik,I havent been able to listen to it yet My spe...
3,0,1993474319,Mon Jun 01 10:26:09 PDT 2009,thireven,now remembers why solving a relatively big equ...
4,0,2256551006,Sat Jun 20 12:56:51 PDT 2009,taracollins086,"Ate too much, feel sick"
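
The row-by-row mention removal described above can also be written as a single vectorized string operation. The sketch below is an alternative under that assumption, not the code that produced the output shown; the sample tweets are invented for illustration.

```python
import pandas as pd

demo = pd.Series([
    "@someone thanks for the follow",   # leading mention followed by text
    "no mention here",                  # no mention: unchanged
    "@onlymention",                     # mention with nothing after it: left as-is
])

# ^@\S+\s matches a leading @handle plus the single whitespace character after it
stripped = demo.str.replace(r"^@\S+\s", "", regex=True)

print(stripped.tolist())
```

Only a mention at the very start of the tweet is removed, as in the loop; mentions later in the text are kept.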


In [16]:
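# Check that every date string has the same fixed-width layout as the first one,
# e.g. "Mon Apr 06 22:19:45 PDT 2009" (spaces at positions 3, 7, 10, 19 and 23)
l = len(tweets.Date.iloc[0])
for i in tweets.Date:
    if len(i) != l or any(i[p] != " " for p in (3, 7, 10, 19, 23)):
        print(i, "inconsistent")
        break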

Handling the Date column: We extract the weekday, month, day, and year from the 'Date' column, combine them into a single `date_time` column, and drop the original 'Date' column. The time of day and the time zone are discarded.

In [17]:
# Split each date string once: [weekday, month, day, time, time zone, year]
parts = tweets['Date'].str.split()

# Rebuild "weekday month day year" and parse it with an explicit format
tweets['date_time'] = pd.to_datetime(
    parts.str[0] + ' ' + parts.str[1] + ' ' + parts.str[2] + ' ' + parts.str[-1],
    format='%a %b %d %Y',
)

tweets.drop(columns=['Date'], inplace=True)

tweets.head()

Unnamed: 0,target,ids,user,text,date_time
0,0,2200003313,DEWGetMeTho77,My poor little dumpling In Holmdel vids he wa...,2009-06-16
1,0,1467998601,Young_J,I'm off too bed. I gotta wake up hella early t...,2009-04-06
2,0,2300049112,dougnawoschik,I havent been able to listen to it yet My spe...,2009-06-23
3,0,1993474319,thireven,now remembers why solving a relatively big equ...,2009-06-01
4,0,2256551006,taracollins086,"Ate too much, feel sick",2009-06-20
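
The step above keeps only the calendar date. If the time of day is also wanted, one possible approach is to remove the time-zone abbreviation and parse the rest with an explicit format. This is a sketch under the assumption that every string ends in a three-letter zone followed by a four-digit year, as in the rows shown; the resulting timestamps are naive, so no time-zone conversion takes place.

```python
import pandas as pd

demo = pd.Series([
    "Mon Apr 06 22:19:45 PDT 2009",
    "Tue Jun 16 18:18:13 PDT 2009",
])

# Drop the three-letter time-zone abbreviation that precedes the year
no_tz = demo.str.replace(r" [A-Z]{3} (\d{4})$", r" \1", regex=True)

# Parse weekday, month, day, time and year in one pass
parsed = pd.to_datetime(no_tz, format="%a %b %d %H:%M:%S %Y")

print(parsed.iloc[0])
```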


### NLTK Text Cleaning Pipeline

This section builds a text cleaning pipeline with NLTK (Natural Language Toolkit). The function clean_text() removes special characters and digits, tokenizes the text, lowercases the tokens, removes stopwords, and lemmatizes what remains.

In [18]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

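# Kaggle-specific workaround: download WordNet to the working directory and unzip it there.
# `unzip` is not available on Windows (see the output below); the downloads above are enough there.
nltk.download('wordnet', "/kaggle/working/nltk_data/")
nltk.download('omw-1.4', "/kaggle/working/nltk_data/")
! unzip /kaggle/working/nltk_data/corpora/wordnet.zip -d /kaggle/working/nltk_data/corpora
! unzip /kaggle/working/nltk_data/corpora/omw-1.4.zip -d /kaggle/working/nltk_data/corpora
nltk.data.path.append("/kaggle/working/nltk_data/")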

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MILAN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\MILAN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\MILAN\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\MILAN\AppData\Roaming\nltk_data...
[nltk_data] Downloading package wordnet to
[nltk_data]     /kaggle/working/nltk_data/...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /kaggle/working/nltk_data/...
'unzip' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


In [19]:
def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Join tokens back into a string
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text

In [20]:
# Apply the cleaning pipeline to every tweet (this also avoids shadowing the built-in `str`)
tweets['cleaned_text'] = tweets['text'].apply(clean_text)

tweets.head()

Unnamed: 0,target,ids,user,text,date_time,cleaned_text
0,0,2200003313,DEWGetMeTho77,My poor little dumpling In Holmdel vids he wa...,2009-06-16,poor little dumpling holmdel vids really tryin...
1,0,1467998601,Young_J,I'm off too bed. I gotta wake up hella early t...,2009-04-06,im bed got ta wake hella early tomorrow morning
2,0,2300049112,dougnawoschik,I havent been able to listen to it yet My spe...,2009-06-23,havent able listen yet speaker busted
3,0,1993474319,thireven,now remembers why solving a relatively big equ...,2009-06-01,remembers solving relatively big equation two ...
4,0,2256551006,taracollins086,"Ate too much, feel sick",2009-06-20,ate much feel sick
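
Cleaning removes several cues that VADER's rules rely on: punctuation emphasis, capitalization, emoticons, and negation words (NLTK's English stopword list includes "not"). The sketch below mimics the regex, lowercasing, and stopword steps with the standard library only. `demo_clean` and `DEMO_STOPWORDS` are hypothetical stand-ins for illustration; they are not the notebook's `clean_text()` or NLTK's stopword list.

```python
import re

# Stand-in stopword set for illustration only; the notebook uses NLTK's English list
DEMO_STOPWORDS = {"not", "is", "the"}

def demo_clean(text):
    """Mimic the regex, lowercasing and stopword steps (no lemmatization)."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # drops '!', ':(' and apostrophes
    tokens = text.lower().split()             # whitespace split instead of word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

print(demo_clean("The food is NOT good!!! :("))
```

In this example the negation, the capital letters, the exclamation marks, and the emoticon all disappear, leaving only "food good".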


### Sentiment Score Computation

In [21]:
def sent_scores(col):
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys;
    # collect one dict per tweet into a DataFrame aligned with tweets
    scores = pd.DataFrame(
        [sia.polarity_scores(t) for t in tweets[col]],
        index=tweets.index,
    )

    # Add one column per score, prefixed with the source column name
    for key in ["neg", "neu", "pos", "compound"]:
        tweets[col + "_" + key] = scores[key]

sent_scores("text")
sent_scores("cleaned_text")

tweets.head()

Unnamed: 0,target,ids,user,text,date_time,cleaned_text,text_neg,text_neu,text_pos,text_compound,cleaned_text_neg,cleaned_text_neu,cleaned_text_pos,cleaned_text_compound
0,0,2200003313,DEWGetMeTho77,My poor little dumpling In Holmdel vids he wa...,2009-06-16,poor little dumpling holmdel vids really tryin...,0.151,0.732,0.117,-0.4013,0.214,0.621,0.166,-0.4013
1,0,1467998601,Young_J,I'm off too bed. I gotta wake up hella early t...,2009-04-06,im bed got ta wake hella early tomorrow morning,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0,2300049112,dougnawoschik,I havent been able to listen to it yet My spe...,2009-06-23,havent able listen yet speaker busted,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,1993474319,thireven,now remembers why solving a relatively big equ...,2009-06-01,remembers solving relatively big equation two ...,0.168,0.711,0.122,-0.2263,0.241,0.584,0.175,-0.2263
4,0,2256551006,taracollins086,"Ate too much, feel sick",2009-06-20,ate much feel sick,0.452,0.548,0.0,-0.5106,0.524,0.476,0.0,-0.5106


In [22]:
min_value = tweets['text_compound'].min()
max_value = tweets['text_compound'].max()

print("Minimum value in text_compound:", min_value)
print("Maximum value in text_compound:", max_value)

Minimum value in text_compound: -0.9776
Maximum value in text_compound: 0.9928


### Evaluation

The compound score is a 'normalized, weighted composite score'. It is computed by summing the valence scores of the words in the text that appear in the lexicon, adjusting the sum according to VADER's rules, and normalizing the result to the range -1 (most extreme negative) to +1 (most extreme positive). It is a useful single-dimensional measure of sentiment for a given sentence. Typical threshold values are as follows:

* Positive sentiment: Compound score ≥ 0.05
* Neutral sentiment: -0.05 < Compound score < 0.05
* Negative sentiment: Compound score ≤ -0.05
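
The normalization and the thresholds above can be sketched in a few lines. The function names are hypothetical; the formula x / sqrt(x² + α) with α = 15 is the normalization VADER uses, and the valence of 3.2 is an assumed value for a single sentiment-bearing word.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1); alpha=15 is VADER's default constant."""
    return score / math.sqrt(score * score + alpha)

def label(compound):
    """Map a compound score to a class using the thresholds listed above."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

# Assumed valence sum of 3.2 for a text with one sentiment-bearing word
c = normalize(3.2)
print(round(c, 4), label(c))
```

With that assumed valence the result is close to the compound score of 0.6369 reported for the example sentence at the start of the notebook.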

In [23]:
def evaluate(col):
    
    # Map the compound score to a label: 1 = positive, 0 = negative, None = neutral
    tweets['predicted_sentiment'] = tweets[col].apply(lambda x: 1 if x >= 0.05 else (0 if x <= -0.05 else None))
    
    # Remove rows with neutral sentiment (NaNs).
    # Caution: inplace=True modifies the global DataFrame, so rows dropped in one
    # call are also absent in every later call to evaluate().
    tweets.dropna(subset=['predicted_sentiment'], inplace=True)

    accuracy = accuracy_score(tweets['target'], tweets['predicted_sentiment'])
    precision = precision_score(tweets['target'], tweets['predicted_sentiment'])
    recall = recall_score(tweets['target'], tweets['predicted_sentiment'])
    f1 = f1_score(tweets['target'], tweets['predicted_sentiment'])
    # ROC-AUC is computed on the hard 0/1 predictions, not on the compound score
    roc_auc = roc_auc_score(tweets['target'], tweets['predicted_sentiment'])
    
    # Precision and F1 are computed above but not reported
    print("============================= "+col+" =============================")
    print("ROC-AUC score:", roc_auc)
    print("Accuracy:", accuracy)
    print("Recall:", recall)
    
evaluate('text_compound')
evaluate('cleaned_text_compound')

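============================= text_compound =============================
ROC-AUC score: 0.7204268320995052
Accuracy: 0.7191188040912667
Recall: 0.8616529201961658
============================= cleaned_text_compound =============================
ROC-AUC score: 0.702428166876544
Accuracy: 0.7025635893109907
Recall: 0.8685962456070173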
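
The scores above cover only the tweets that VADER did not rate as neutral. A minimal sketch of an evaluation helper that leaves the DataFrame untouched, reports how many rows received a label, and computes ROC-AUC on the continuous compound score is shown below. `evaluate_copy` is a hypothetical name and the toy data is invented, so none of the printed values relate to the results above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_copy(df, score_col, target_col="target", threshold=0.05):
    """Hypothetical helper: score one compound column without modifying df."""
    decided = df[score_col].abs() >= threshold      # True where VADER is not neutral
    subset = df.loc[decided]
    predicted = (subset[score_col] >= threshold).astype(int)
    return {
        "coverage": decided.mean(),                 # share of rows that received a label
        "accuracy": accuracy_score(subset[target_col], predicted),
        # ROC-AUC on the continuous score rather than on hard 0/1 labels
        "roc_auc": roc_auc_score(subset[target_col], subset[score_col]),
    }

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "target": [1, 0, 1, 0, 1],
    "text_compound": [0.6, -0.4, 0.0, 0.3, -0.2],
})

res = evaluate_copy(toy, "text_compound")
print(res)
print(len(toy))   # the input frame keeps all of its rows
```

Because the helper never drops rows from its input, calling it for a second score column evaluates that column on its own set of non-neutral tweets.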


### Conclusion