## NLTK VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-driven sentiment analyzer that scores a piece of text as negative, neutral, or positive.

## 1. Import VADER

- Required packages: pandas, matplotlib, nltk, scikit-learn (used for the classification report in section 6).

- Files downloaded to AppData\Roaming\nltk_data when the code is run: the vader_lexicon files.

- Required datasets (CSVs) for this program: Tweets.csv, replies.csv.

- All CSVs (input and output) are read from and written to the notebook's folder.

In [118]:
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to C:\Users\Adam
[nltk_data]     Chen\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## 2. Read in the dataset Tweets.csv (our labeled dataset for evaluating VADER) and clean it up

- The textID and selected_text columns are not used by this program.

In [119]:
dataset = pd.read_csv('Tweets.csv', encoding='ISO-8859-1')
dataset_drop = dataset.drop(['textID', 'selected_text'], axis=1)
dataset_drop

| | textID | text | selected_text | sentiment |
|---|---|---|---|---|
| 0 | cb774db0d1 | I`d have responded, if I were going | I`d have responded, if I were going | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | Sooo SAD | negative |
| 2 | 088c60f138 | my boss is bullying me... | bullying me | negative |
| 3 | 9642c003ef | what interview! leave me alone | leave me alone | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | Sons of ****, | negative |
| ... | ... | ... | ... | ... |
| 27476 | 4eac33d1c0 | wish we could come see u on Denver husband l... | d lost | negative |
| 27477 | 4f4c4fc327 | I`ve wondered about rake to. The client has ... | , don`t force | negative |
| 27478 | f67aae2310 | Yay good for both of you. Enjoy the break - y... | Yay good for both of you. | positive |
| 27479 | ed167662a5 | But it was worth it ****. | But it was worth it ****. | positive |


### 2.1. Further cleaning up of the dataset, leaving only text and sentiment

- Separate the dataset into text (datasetX) and sentiment labels (datasetY).

In [120]:
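# Tweet text; astype(str) also turns missing values into the string 'nan'
datasetX = dataset_drop['text'].astype(str)
# Human-assigned labels: negative, neutral, or positive
datasetY = dataset_drop['sentiment']
datasetY.value_counts()  # not displayed: a cell only shows its last expression
datasetX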

0                      I`d have responded, if I were going
1            Sooo SAD I will miss you here in San Diego!!!
2                                my boss is bullying me...
3                           what interview! leave me alone
4         Sons of ****, why couldn`t they put them on t...
                               ...                        
27476     wish we could come see u on Denver  husband l...
27477     I`ve wondered about rake to.  The client has ...
27478     Yay good for both of you. Enjoy the break - y...
27479                           But it was worth it  ****.
27480       All this flirting going on - The ATG smiles...
Name: text, Length: 27481, dtype: object

## 3. Build VADER

- An example shows how VADER scores a single sentence.

In [121]:
vader = SentimentIntensityAnalyzer()
sample = 'I really love NVIDIA'
vader.polarity_scores(sample)

{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6697}
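
The `compound` value is a normalized sum of word valences. As a sanity check on the output above, the sketch below reproduces 0.6697 by hand. It assumes, without reading the lexicon file, that `love` has a valence of 3.2, that the intensifier `really` adds 0.293 to the word that follows it, and that the raw sum is squashed with `x / sqrt(x**2 + 15)`. The function name `normalize_compound` is only illustrative.

```python
import math

# Assumed values, not read from the vader_lexicon files here.
LOVE_VALENCE = 3.2    # assumed lexicon valence of "love"
REALLY_BOOST = 0.293  # assumed boost contributed by the intensifier "really"
ALPHA = 15            # assumed normalization constant


def normalize_compound(raw_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# "I really love NVIDIA": only "love" carries valence, boosted by "really".
raw = LOVE_VALENCE + REALLY_BOOST
compound = round(normalize_compound(raw), 4)
print(compound)
```

If those assumptions hold, the printed value agrees with the `compound` entry returned by `polarity_scores` for the sample sentence.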

## 4. Get Raw Prediction data from VADER

- Apply VADER to every text in the dataset.

In [122]:
datasetPredictedRaw = datasetX.apply(lambda x:vader.polarity_scores(x))
datasetPredictedRaw

0        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
1        {'neg': 0.474, 'neu': 0.526, 'pos': 0.0, 'comp...
2        {'neg': 0.494, 'neu': 0.506, 'pos': 0.0, 'comp...
3        {'neg': 0.538, 'neu': 0.462, 'pos': 0.0, 'comp...
4        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
                               ...                        
27476    {'neg': 0.128, 'neu': 0.722, 'pos': 0.15, 'com...
27477    {'neg': 0.0, 'neu': 0.89, 'pos': 0.11, 'compou...
27478    {'neg': 0.0, 'neu': 0.572, 'pos': 0.428, 'comp...
27479    {'neg': 0.0, 'neu': 0.68, 'pos': 0.32, 'compou...
27480    {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'comp...
Name: text, Length: 27481, dtype: object

## 5. Process the output of VADER

- Format the VADER scores into a DataFrame with one column per score.

In [123]:
datasetPredicted = pd.DataFrame()
datasetPredicted['neg'] = datasetPredictedRaw.apply(lambda score_dict: score_dict['neg'])
datasetPredicted['neu'] = datasetPredictedRaw.apply(lambda score_dict: score_dict['neu'])
datasetPredicted['pos'] = datasetPredictedRaw.apply(lambda score_dict: score_dict['pos'])
datasetPredicted['compound'] = datasetPredictedRaw.apply(lambda score_dict: score_dict['compound'])
datasetPredicted

| | neg | neu | pos | compound |
|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 1 | 0.474 | 0.526 | 0.000 | -0.7437 |
| 2 | 0.494 | 0.506 | 0.000 | -0.5994 |
| 3 | 0.538 | 0.462 | 0.000 | -0.3595 |
| 4 | 0.000 | 1.000 | 0.000 | 0.0000 |
| ... | ... | ... | ... | ... |
| 27476 | 0.128 | 0.722 | 0.150 | 0.1027 |
| 27477 | 0.000 | 0.890 | 0.110 | 0.3818 |
| 27478 | 0.000 | 0.572 | 0.428 | 0.9136 |
| 27479 | 0.000 | 0.680 | 0.320 | 0.3291 |
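
The four `apply` calls above can be collapsed, because pandas can build a DataFrame directly from a list of dicts. A sketch with two hand-copied score dicts standing in for `datasetPredictedRaw` (the names `raw_scores` and `scores` are illustrative, not part of the notebook):

```python
import pandas as pd

# Stand-in for datasetPredictedRaw: a Series of VADER-style score dicts.
raw_scores = pd.Series([
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.474, 'neu': 0.526, 'pos': 0.0, 'compound': -0.7437},
])

# One row per dict, one column per key; the original index is carried over.
scores = pd.DataFrame(raw_scores.tolist(), index=raw_scores.index)
print(scores)
```

This should give the same columns as the cell above while walking the Series only once.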


### 5.1. Extracting predictions based on the compound score column

- A compound score of exactly 0 is labeled neutral, a score above 0 positive, and a score below 0 negative.

In [124]:
datasetPredicted['prediction'] = datasetPredicted['compound'].apply(lambda c: 'positive' if c>0 else ('neutral' if c==0 else 'negative') )
datasetPredicted

| | neg | neu | pos | compound | prediction |
|---|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 0.0000 | neutral |
| 1 | 0.474 | 0.526 | 0.000 | -0.7437 | negative |
| 2 | 0.494 | 0.506 | 0.000 | -0.5994 | negative |
| 3 | 0.538 | 0.462 | 0.000 | -0.3595 | negative |
| 4 | 0.000 | 1.000 | 0.000 | 0.0000 | neutral |
| ... | ... | ... | ... | ... | ... |
| 27476 | 0.128 | 0.722 | 0.150 | 0.1027 | positive |
| 27477 | 0.000 | 0.890 | 0.110 | 0.3818 | positive |
| 27478 | 0.000 | 0.572 | 0.428 | 0.9136 | positive |
| 27479 | 0.000 | 0.680 | 0.320 | 0.3291 | positive |
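
The sign-based rule used here treats only an exact 0 as neutral, so a faint score such as 0.1027 already counts as positive. The VADER authors' documentation suggests a dead band of ±0.05 around zero instead. The sketch below shows that variant; it is not what this notebook runs, the function name `label_from_compound` is hypothetical, and the value 0.03 is made up for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label, with a neutral band around zero."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# The first three scores appear in the table above; 0.03 is illustrative.
labels = [label_from_compound(c) for c in (0.0, -0.7437, 0.1027, 0.03)]
print(labels)
```

Switching to a banded rule would change the predictions and therefore the classification report, so the numbers reported in this notebook apply to the sign-based rule only.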


## 6. Generating a classification report on VADER

In [125]:
from sklearn.metrics import classification_report
print(classification_report(datasetY,datasetPredicted['prediction']))

              precision    recall  f1-score   support

    negative       0.70      0.60      0.65      7781
     neutral       0.72      0.46      0.56     11118
    positive       0.55      0.87      0.68      8582

    accuracy                           0.63     27481
   macro avg       0.65      0.65      0.63     27481
weighted avg       0.66      0.63      0.62     27481
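
In the report, the positive class has low precision (0.55) and high recall (0.87): VADER finds most positive tweets but also labels many other tweets as positive. To make the columns concrete, the sketch below computes precision, recall, and F1 for one class from scratch. The five labels are toy data invented for the example, and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: one neutral text is wrongly predicted as positive.
y_true = ['positive', 'positive', 'neutral', 'neutral', 'negative']
y_pred = ['positive', 'positive', 'positive', 'neutral', 'negative']

precision, recall, f1 = per_class_metrics(y_true, y_pred, 'positive')
print(precision, recall, f1)
```

The same F1 formula ties the report's columns together: for the negative class, 2 × 0.70 × 0.60 / (0.70 + 0.60) ≈ 0.65.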



# Verification of VADER against a real-life scenario

- 'replies.csv' contains replies to one of Elon Musk's tweets. Our front-end team members pulled them using the API and provided them to us for testing purposes.

In [126]:
df = pd.read_csv('replies.csv')
df

| | Hello Mr. Musk, I ask you to provide Starlink Internet for Iranians. Censorship and low internet speed suffer the people of iran. | 1528992634827681800 |
|---|---|---|
| 0 | Musk has done so much for the American people.... | 1528989682918301703 |
| 1 | @elonmusk Did Elon Musk forget *Starlinks' own... | 1528988660606980097 |
| 2 | I love you elon musk | 1528983442234347521 |
| 3 | Elon Musk what is your next step of trying to ... | 1528982239157321728 |
| 4 | @MrLeonMusk Dear @elonmusk I beg you to please... | 1528976306121019392 |
| ... | ... | ... |
| 449 | maybe Elon musk is 1st trillioner | 1528877473160523776 |
| 450 | @elonmusk We just bought the electric Porsche ... | 1528877393481326592 |
| 451 | 😂 ANTENNA TO WAITING MR MUSK | 1528877239668023298 |
| 452 | I speak for everyone when i say, In all my lif... | 1528877047170441218 |
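
The column header in the output above is itself a reply and a tweet ID, which suggests that replies.csv has no header row and that `pd.read_csv` consumed the first reply as column names. A minimal sketch of one way to avoid that, using two made-up rows in place of the real file; the column names `text` and `tweet_id` are hypothetical:

```python
import io

import pandas as pd

# Two invented rows standing in for a header-less replies.csv.
csv_text = 'first reply,101\nsecond reply,102\n'

# Default behavior: the first line becomes the header, so one row is lost.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every line as data; names supplies the column labels.
with_names = pd.read_csv(io.StringIO(csv_text), header=None,
                         names=['text', 'tweet_id'])

print(len(with_default), len(with_names))
```

Reading the real file this way would presumably add one reply to the data, so the outputs recorded in this notebook correspond to the original `pd.read_csv('replies.csv')` call.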


## Running the dataset into VADER and returning the score dataset

In [127]:
df_drop = df.iloc[:,0]
dfPredictedRaw = df_drop.apply(lambda x:vader.polarity_scores(x))
dfPredicted = pd.DataFrame()
dfPredicted['neg'] = dfPredictedRaw.apply(lambda score_dict: score_dict['neg'])
dfPredicted['neu'] = dfPredictedRaw.apply(lambda score_dict: score_dict['neu'])
dfPredicted['pos'] = dfPredictedRaw.apply(lambda score_dict: score_dict['pos'])
dfPredicted['compound'] = dfPredictedRaw.apply(lambda score_dict: score_dict['compound'])
dfPredicted

| | neg | neu | pos | compound |
|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 1 | 0.121 | 0.879 | 0.000 | -0.6072 |
| 2 | 0.000 | 0.417 | 0.583 | 0.6369 |
| 3 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 4 | 0.083 | 0.545 | 0.372 | 0.9643 |
| ... | ... | ... | ... | ... |
| 449 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 450 | 0.000 | 0.827 | 0.173 | 0.7673 |
| 451 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 452 | 0.057 | 0.826 | 0.118 | 0.5123 |


## Generating the prediction column

- Also count how many positive, neutral, and negative replies VADER thinks Elon Musk received on this tweet.

In [128]:
dfPredicted['prediction'] = dfPredicted['compound'].apply(lambda c: 'positive' if c>0 else ('neutral' if c==0 else 'negative') )
dfPredicted['prediction'].value_counts()

positive    293
neutral      96
negative     65
Name: prediction, dtype: int64
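
The raw counts are easier to compare as shares of the total. The sketch below does the arithmetic on the three counts printed above; in the notebook itself, `value_counts(normalize=True)` on the prediction column should give the same proportions.

```python
# Counts copied from the value_counts() output above.
counts = {'positive': 293, 'neutral': 96, 'negative': 65}

total = sum(counts.values())
# Share of each label, rounded to three decimals.
shares = {label: round(n / total, 3) for label, n in counts.items()}

print(total, shares)
```

By this count, roughly two thirds of the scored replies are labeled positive under the sign-based rule.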

## Saving a combination of text and VADER predictions into a CSV

In [132]:
dfCombined = pd.concat([df_drop,dfPredicted['prediction']],axis='columns')
dfCombined.to_csv('real_life_prediction_results_VADER.csv')