### Sentiment analysis (SA) with VADER

In the official [repo](https://github.com/cjhutto/vaderSentiment) you can find the lexicon, the list of rules, and the support for emoji recognition.

In [1]:
!pip install vaderSentiment  # VADER is a repository: it has a lexicon of 7520 words,
                             # divided into groups by the direction of their sentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     |████████████████████████████████| 125 kB 4.2 MB/s eta 0:00:01
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [3]:
sentences=["I like text mining so much, it is super fun", 
           "I can't bear this lecture anymore", 
           "today is quite sunny outside",
           "I hate flamingos, they're so inappropriate",
           "Catch utf-8 emoji such as 💘 and 💋 and 😁"]

In [4]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<50} {}".format(sentence, str(vs)))  # compound: sum of the word valence scores, normalized to [-1, 1]

I like text mining so much, it is super fun------- {'neg': 0.0, 'neu': 0.419, 'pos': 0.581, 'compound': 0.8658}
I can't bear this lecture anymore----------------- {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
today is quite sunny outside---------------------- {'neg': 0.0, 'neu': 0.564, 'pos': 0.436, 'compound': 0.4754}
I hate flamingos, they're so inappropriate-------- {'neg': 0.425, 'neu': 0.575, 'pos': 0.0, 'compound': -0.5719}
Catch utf-8 emoji such as 💘 and 💋 and 😁----------- {'neg': 0.0, 'neu': 0.583, 'pos': 0.417, 'compound': 0.875}


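Emoji [cheat sheet](https://unicode.org/emoji/charts/emoji-list.html): to write a code point as a Python string escape, replace `U+` with `\U000`. This works for the five-digit code points used by most emoji; for example, U+1F600 becomes `\U0001F600`.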

In [5]:
print("\U0001F600")

😀
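
The same conversion can be done programmatically. A minimal sketch, assuming code points written in the `U+XXXX` notation of the Unicode chart; the helper name `from_codepoint` is made up for illustration and is not part of vaderSentiment:

```python
def from_codepoint(code_point: str) -> str:
    """Turn a 'U+1F600'-style code point into the character it names."""
    if not code_point.upper().startswith("U+"):
        raise ValueError(f"expected a 'U+XXXX' code point, got {code_point!r}")
    # Drop the 'U+' prefix and read the rest as a hexadecimal number.
    return chr(int(code_point[2:], 16))


print(from_codepoint("U+1F600"))  # same character as "\U0001F600"
```

Unlike the manual escape, this also handles code points with fewer than five hexadecimal digits, because no zero padding is needed.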


[Here](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/emoji_utf8_lexicon.txt) you can find the VADER emoji lexicon: it maps each emoji to a short textual description, which is then scored like ordinary text.

### About the scoring

The Positive, Negative and Neutral scores represent the proportion of text that falls in these categories. This means our first sentence was rated as 58% Positive, 42% Neutral and 0% Negative. The three scores should therefore add up to 1, up to rounding.

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
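
A minimal sketch of the normalization step alone, assuming the form used in the reference `vaderSentiment.py`, namely x / sqrt(x² + α) with α = 15. The raw sums below are made-up inputs, not values produced by the lexicon or the rules:

```python
import math


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp defensively in case of floating-point overshoot.
    return max(-1.0, min(1.0, score))


# Illustrative raw sums (hypothetical, not taken from the lexicon).
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+5.1f} -> {normalize(raw):+.4f}")
```

Larger sums saturate toward -1 or +1, so texts with many strongly rated words tend to land near the extremes of the scale.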

The compound score is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. Typical threshold values (used in the literature cited in the VADER README) are:

positive sentiment: compound score >= 0.05 <br>
neutral sentiment: (compound score > -0.05) and (compound score < 0.05) <br>
negative sentiment: compound score <= -0.05
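
These thresholds translate directly into a small helper. A sketch only: the function name `label_from_compound` and its `threshold` parameter are illustrative choices, not part of the vaderSentiment API.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores printed for the five example sentences above.
for compound in (0.8658, 0.0, 0.4754, -0.5719, 0.875):
    print(compound, label_from_compound(compound))
```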

__[Some rules](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py):__  <br>


<img src='img/vader_rules.PNG'>

__Get word polarity:__
    
In the [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) you can find, for each token in the vocabulary (7520 entries): <br>
the token, its mean valence, the standard deviation, and the raw scores given by 10 independent human evaluators <br>

<img src='img/vader_lexicon.PNG'>
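
To read those four fields programmatically, one option is to parse each line yourself. A sketch, assuming the file is tab-separated in the order token, mean, SD, list of ratings; the `LexiconEntry` type, the `parse_lexicon_line` helper and the sample line are all invented for illustration, and the sample is not a real lexicon entry:

```python
import ast
from statistics import mean
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    token: str
    mean_valence: float
    std_dev: float
    ratings: list


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one tab-separated lexicon line into its four fields."""
    token, mean_valence, std_dev, ratings = line.rstrip("\n").split("\t")
    return LexiconEntry(token, float(mean_valence), float(std_dev),
                        ast.literal_eval(ratings))


# Made-up line in the assumed format; it is NOT a real lexicon entry.
sample = "exampleword\t1.5\t0.5\t[1, 2, 1, 2, 1, 2, 1, 2, 1, 2]"
entry = parse_lexicon_line(sample)
print(entry.token, entry.mean_valence, mean(entry.ratings))
```

In the 3.3.2 release installed above, the parsed mean values should also be available as the dictionary `analyzer.lexicon`, which is usually the quicker route when only the mean valence of a word is needed.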



### Polarity detection of Amazon data with VADER

In [12]:
import pandas as pd
df = pd.read_csv("amazon_cellphones_binary.csv")
df

Unnamed: 0,asin,reviewText,overall,bin_y
0,B007D6J64K,probably favorite cover super sassy protective...,5,1
1,B007D6J64K,case_protects phone damage,5,1
2,B007D6J64K,nice,4,1
3,B007D6J64K,wa another favorite one thanks quality pricing,5,1
4,B007D6J64K,decent_case lot_of_protection,5,1
...,...,...,...,...
26840,B0096QI0QK,easy_to_put phone protects really_well obvious...,5,1
26841,B0096QI0QK,much_better quality expected price great_alter...,5,1
26842,B0096QI0QK,one_of_the_best screen_protector used applicat...,4,1
26843,B0096QI0QK,kit included microfiber_cloth soft card help a...,5,1


In [13]:
df = df[~df['reviewText'].isna()].copy()   # drop rows with missing review text; .copy() avoids SettingWithCopyWarning on later assignments

Compute polarity scores

In [15]:
# New columns are created by assignment, e.g. df['test'] = df['overall'] + df['bin_y']
df['scores'] = df['reviewText'].apply(analyzer.polarity_scores)
df.head()

Unnamed: 0,asin,reviewText,overall,bin_y,scores
0,B007D6J64K,probably favorite cover super sassy protective...,5,1,"{'neg': 0.149, 'neu': 0.532, 'pos': 0.319, 'co..."
1,B007D6J64K,case_protects phone damage,5,1,"{'neg': 0.615, 'neu': 0.385, 'pos': 0.0, 'comp..."
2,B007D6J64K,nice,4,1,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound..."
3,B007D6J64K,wa another favorite one thanks quality pricing,5,1,"{'neg': 0.0, 'neu': 0.459, 'pos': 0.541, 'comp..."
4,B007D6J64K,decent_case lot_of_protection,5,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."


Extract the compound value

In [16]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.head()    # keep only compound, because it expresses the polarity of the sentence

Unnamed: 0,asin,reviewText,overall,bin_y,scores,compound
0,B007D6J64K,probably favorite cover super sassy protective...,5,1,"{'neg': 0.149, 'neu': 0.532, 'pos': 0.319, 'co...",0.5859
1,B007D6J64K,case_protects phone damage,5,1,"{'neg': 0.615, 'neu': 0.385, 'pos': 0.0, 'comp...",-0.4939
2,B007D6J64K,nice,4,1,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.4215
3,B007D6J64K,wa another favorite one thanks quality pricing,5,1,"{'neg': 0.0, 'neu': 0.459, 'pos': 0.541, 'comp...",0.7096
4,B007D6J64K,decent_case lot_of_protection,5,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0


Compute binary predictions

In [17]:
df['predicted'] = df['compound'].apply(lambda c: 1 if c >= 0 else 0)  # a compound of exactly 0 (neutral) counts as positive
df.head()    # predicted: 1 = positive sentiment, 0 = negative sentiment

Unnamed: 0,asin,reviewText,overall,bin_y,scores,compound,predicted
0,B007D6J64K,probably favorite cover super sassy protective...,5,1,"{'neg': 0.149, 'neu': 0.532, 'pos': 0.319, 'co...",0.5859,1
1,B007D6J64K,case_protects phone damage,5,1,"{'neg': 0.615, 'neu': 0.385, 'pos': 0.0, 'comp...",-0.4939,0
2,B007D6J64K,nice,4,1,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.4215,1
3,B007D6J64K,wa another favorite one thanks quality pricing,5,1,"{'neg': 0.0, 'neu': 0.459, 'pos': 0.541, 'comp...",0.7096,1
4,B007D6J64K,decent_case lot_of_protection,5,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1


In [18]:
from sklearn.metrics import classification_report

print(classification_report(df.bin_y, df.predicted.astype('int64'))) 

# With BoW + RF the accuracy was about 0.44; here it is 0.82. That is still lower than with deep learning,
# where it was 0.89, but here we did no training at all.
# recall (sensitivity) := of the samples that truly belong to class j, the fraction classified as j
# precision := of the samples predicted as class j, the fraction that truly belong to class j
# macro avg := unweighted mean of the two per-class values above (used when classes are imbalanced)
# weighted avg := mean of the two per-class values above, weighted by the support of each class

              precision    recall  f1-score   support

           0       0.36      0.34      0.35      3754
           1       0.89      0.90      0.89     21964

    accuracy                           0.82     25718
   macro avg       0.63      0.62      0.62     25718
weighted avg       0.81      0.82      0.81     25718
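
The two average rows can be recomputed from the per-class rows of the report. A small sketch using the rounded precision values and the supports printed above, so the results match the report only up to rounding:

```python
# Per-class precision and support, copied from the report above.
precision = {0: 0.36, 1: 0.89}
support = {0: 3754, 1: 21964}

total = sum(support.values())

# Macro average: plain mean over classes, every class counts the same.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg precision:    {macro_avg:.3f}")
print(f"weighted avg precision: {weighted_avg:.3f}")
```

The weighted average is pulled toward class 1, which holds about 85% of the reviews, while the macro average makes the weak performance on class 0 visible.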

