# VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-based approach to sentiment prediction: each word is assigned a sentiment value, and the values are then combined into a score for the whole review.
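
As a rough illustration of the lexicon idea, the sketch below sums per-word valences from a tiny hand-made lexicon and squashes the total into the range [-1, 1]. The two valences are illustrative values chosen so that the results match the `compound` scores shown later in this notebook. The normalisation `x / sqrt(x^2 + 15)` is, to our understanding, the one used by the reference VADER implementation. The real analyzer adds further rules (negation, intensifiers, punctuation) that are left out here.

```python
import math

# Toy lexicon (assumption): valences picked for illustration only.
TOY_LEXICON = {"loved": 2.9, "wow": 2.8}

def toy_compound(words, lexicon=TOY_LEXICON, alpha=15):
    """Sum the valence of every known word and normalise to [-1, 1]."""
    total = sum(lexicon.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)

print(round(toy_compound(["loved"]), 4))          # 0.5994
print(round(toy_compound(["wow"]), 4))            # 0.5859
print(round(toy_compound(["this", "place"]), 4))  # 0.0
```

Words missing from the lexicon contribute nothing, which is why a sentence without any sentiment word ends up with a score of 0.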

### Importing Libraries

In [77]:
import pandas as pd

### Loading Data

In [78]:
dataset = pd.read_csv('Sentiments.csv', delimiter='\t')

In [79]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Labels   1000 non-null   int64 
 1   Reviews  1000 non-null   object
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [80]:
dataset.Reviews[:5]

0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
Name: Reviews, dtype: object

### VADER Explanation

In [81]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sent_analysis = SentimentIntensityAnalyzer()
sent_analysis.polarity_scores(dataset.Reviews[0])

{'neg': 0.0, 'neu': 0.435, 'pos': 0.565, 'compound': 0.5994}

In [82]:
text = dataset.Reviews[0]
text.split(' ')

['Wow...', 'Loved', 'this', 'place.']

In [83]:
# Only the result of the last expression in a cell is displayed.
sent_analysis.polarity_scores('Wow')
sent_analysis.polarity_scores('Wow...')
sent_analysis.polarity_scores('wow')

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.5859}

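Here, you can see that changing the case of a single word has no effect on this approach, but the trailing "..." has a significant one. (VADER does treat ALL-CAPS words inside mixed-case text as emphasis, which is not tested here.) So, for the VADER approach, data preprocessing is a very important step.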

In [84]:
sent_analysis.polarity_scores(':')  #No meaning
sent_analysis.polarity_scores(')')  #No meaning
sent_analysis.polarity_scores(':)') #Smile or Happy or Positive Emotion

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4588}

We can also infer that it handles emoticons, so they should be kept in the text because they can change the result.

In [85]:
import re
new_text = re.sub("[^'a-z]",' ',text.lower())
words = new_text.split(' ')
words

['wow', '', '', '', 'loved', 'this', 'place', '']
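
The empty strings in this output come from `split(' ')`, which splits on every single space. A small sketch of the same cleaning step using `split()` with no argument, which collapses runs of whitespace (the sample string is the first review shown above):

```python
import re

text = "Wow... Loved this place."

# Replace everything except lowercase letters and apostrophes with a space.
new_text = re.sub("[^'a-z]", " ", text.lower())

# split() without an argument drops the empty strings that split(' ') keeps.
words = new_text.split()
print(words)  # ['wow', 'loved', 'this', 'place']
```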

In [86]:
for word in list(set(words)):
    score = sent_analysis.polarity_scores(word)
    print(word, score)

 {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
loved {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.5994}
wow {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.5859}
this {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
place {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In [87]:
sent_analysis.polarity_scores(text)

{'neg': 0.0, 'neu': 0.435, 'pos': 0.565, 'compound': 0.5994}

### Data Preprocessing

In [88]:
from tqdm import tqdm

In [89]:
processed_reviews = []
for review in tqdm(dataset.Reviews):
    tokenize_review = re.sub("[^'a-z]",' ',review.lower())
    processed_reviews.append(tokenize_review)

100%|██████████| 1000/1000 [00:00<00:00, 184640.96it/s]


In [95]:
processed_reviews[6]

"honeslty it didn't taste that fresh  "

In [96]:
dataset['Processed_Review'] = processed_reviews
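
Note that the filter used in this section also strips emoticons such as `:)`, which the earlier cells showed to carry sentiment. A minimal sketch of a variant that keeps them, assuming a small hypothetical emoticon set (`EMOTICONS` and `clean_keep_emoticons` are names introduced here, not part of NLTK):

```python
import re

# Hypothetical, deliberately small set of emoticons to preserve.
EMOTICONS = {":)", ":(", ":D", ";)"}

def clean_keep_emoticons(text):
    """Lowercase and strip punctuation, but keep whole-token emoticons."""
    cleaned = []
    for token in text.split():
        if token in EMOTICONS:
            cleaned.append(token)
        else:
            # Same filter as above: keep lowercase letters and apostrophes.
            token = re.sub("[^'a-z]", " ", token.lower())
            cleaned.extend(token.split())
    return " ".join(cleaned)

print(clean_keep_emoticons("Wow... Loved this place :)"))
# wow loved this place :)
```

This only catches emoticons separated from words by whitespace; an emoticon glued to a word (for example `place:)`) would still be stripped.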

### Testing Result

In [101]:
review_score = []
for review in processed_reviews:
    score = sent_analysis.polarity_scores(review)
    if score['compound'] > 0:
        review_score.append(1)
    elif score['compound'] < 0:
        review_score.append(-1)
    else:
        review_score.append(0)

In [102]:
dataset['Vader_Score']= review_score
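
The loop above treats any non-zero `compound` value as positive or negative. The VADER authors' documentation suggests a small dead zone around zero instead (±0.05 is the commonly quoted value). The helper below is a hypothetical sketch of that variant; `to_label` and `threshold` are names introduced here, and the input scores are illustrative.

```python
def to_label(compound, threshold=0.05):
    """Map a compound score to 1 (positive), -1 (negative) or 0 (neutral)."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0

print([to_label(c) for c in (0.5994, -0.3412, 0.0, 0.02)])  # [1, -1, 0, 0]
```

A wider dead zone yields more neutral predictions, so it is worth checking how the choice interacts with a dataset that has only positive and negative labels.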

In [104]:
# number of correct predictions (out of 1000 reviews)
sum(dataset['Labels'] == dataset['Vader_Score'])

653

The accuracy is 65.3% (653 of 1000), which is not good. But since the score falls into three categories (positive, negative, neutral), let's analyse the result.
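
One way to start that analysis is to count each (label, prediction) pair, which separates neutral predictions from outright sign flips. The lists below are made-up toy values, not the notebook's data.

```python
from collections import Counter

labels      = [1, -1, -1, 1, -1, 1]   # toy ground truth (assumption)
predictions = [1, -1,  0, 1,  1, 0]   # toy VADER-style labels (assumption)

# Count how often each (true label, predicted label) combination occurs.
pairs = Counter(zip(labels, predictions))
for (label, pred), count in sorted(pairs.items()):
    print(f"label={label:2d} predicted={pred:2d} count={count}")

# Reviews that received a neutral prediction.
neutral_predictions = sum(n for (_, pred), n in pairs.items() if pred == 0)
print("predicted neutral:", neutral_predictions)
```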

### Analysing Difference

In [122]:
dataset.head(30)

| | Labels | Reviews | Processed_Review | Vader_Score |
|---|---|---|---|---|
| 0 | 1 | Wow... Loved this place. | wow loved this place | 1 |
| 1 | -1 | Crust is not good. | crust is not good | -1 |
| 2 | -1 | Not tasty and the texture was just nasty. | not tasty and the texture was just nasty | -1 |
| 3 | 1 | Stopped by during the late May bank holiday of... | stopped by during the late may bank holiday of... | 1 |
| 4 | 1 | The selection on the menu was great and so wer... | the selection on the menu was great and so wer... | 1 |
| 5 | -1 | Now I am getting angry and I want my damn pho. | now i am getting angry and i want my damn pho | -1 |
| 6 | -1 | Honeslty it didn't taste THAT fresh.) | honeslty it didn't taste that fresh | -1 |
| 7 | -1 | The potatoes were like rubber and you could te... | the potatoes were like rubber and you could te... | 1 |
| 8 | 1 | The fries were great too. | the fries were great too | 1 |
| 9 | 1 | A great touch. | a great touch | 1 |


For a few reviews the score is 0, for example at indices 10, 11, 13 and 17.

In [107]:
#would not go back, index=11
print(sent_analysis.polarity_scores("would not go back"))

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


"would not go back" has a score of 0. No word in the sentence carries a sentiment value, although a human easily understands the review as negative. This is a limitation of VADER: it cannot classify a sentence that contains no sentiment word.

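A quick way to spot such cases is to check which tokens of a review appear in the lexicon at all. The sketch below uses a tiny made-up lexicon so that it runs on its own. With NLTK, the analyzer's `lexicon` attribute (a dict mapping words to valences, to our understanding) could be passed instead.

```python
# Tiny stand-in lexicon with made-up valences (assumption); not VADER's real word list.
toy_lexicon = {"good": 1.9, "great": 3.1, "nasty": -2.6, "angry": -2.3}

def lexicon_hits(text, lexicon):
    """Return the words of `text` that have an entry in `lexicon`."""
    return [word for word in text.lower().split() if word in lexicon]

print(lexicon_hits("would not go back", toy_lexicon))  # []
print(lexicon_hits("crust is not good", toy_lexicon))  # ['good']
```

An empty result means the review has nothing for a purely lexicon-based scorer to work with, whatever a human reader makes of it.
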
In [124]:
#index = 17
print(sent_analysis.polarity_scores("waitress was a little slow in service"))

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In [134]:
#index=28
print(dataset['Reviews'][28])
sent_analysis.polarity_scores("totally overwhelmed")

Took an hour to get our food only 4 tables in restaurant my food was Luke warm, Our sever was running around like he was totally overwhelmed.


{'neg': 0.0, 'neu': 0.401, 'pos': 0.599, 'compound': 0.1263}

We can see that it assigns "totally overwhelmed" a positive score, which is incorrect. VADER can fail to take context into account, which leads to wrong results.

So, let's consider only the sentences that contain a sentiment word, drop all sentences scored as neutral, and compute the accuracy again.

In [136]:
check = dataset[dataset['Vader_Score']!=0] 
sum(check['Labels'] == check['Vader_Score'])/len(check)

0.8524804177545692

If only the reviews with a non-zero score are considered (766 of the 1000, of which 653 are correct), the accuracy of the model is 85%, which is quite good. So, we can conclude that:
1. VADER is fast.
2. It is easy to implement.
3. It is very good at predicting the sentiment of simple sentences.
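
Because the 85% figure only covers reviews that received a non-zero score, it helps to report coverage alongside it. A minimal sketch with made-up toy lists (`coverage_and_accuracy` is a name introduced here):

```python
def coverage_and_accuracy(labels, predictions):
    """Return (share of non-neutral predictions, accuracy on that share)."""
    scored = [(y, p) for y, p in zip(labels, predictions) if p != 0]
    coverage = len(scored) / len(labels)
    accuracy = sum(y == p for y, p in scored) / len(scored) if scored else 0.0
    return coverage, accuracy

labels      = [1, -1, -1, 1, -1]   # toy ground truth (assumption)
predictions = [1, -1,  0, 1,  1]   # toy predictions (assumption)
print(coverage_and_accuracy(labels, predictions))  # (0.8, 0.75)
```

Reading the two numbers together avoids overstating performance: high accuracy on a small scored subset still leaves the remaining reviews unclassified.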