## Pre-processing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler



In [None]:
# Concatenate the 4 dataframes

filename1 = 'drive/MyDrive/TextAnalytics/training_set_1.csv'
filename2 = 'drive/MyDrive/TextAnalytics/training_set_2.csv'
filename3 = 'drive/MyDrive/TextAnalytics/test_set.csv'
filename4 = 'drive/MyDrive/TextAnalytics/remaining_set.csv'

df_train1 = pd.read_csv(filename1)
df_train2 = pd.read_csv(filename2)
df_test = pd.read_csv(filename3)
df_remaining = pd.read_csv(filename4)

frames = [df_train1, df_train2, df_test, df_remaining]
df = pd.concat(frames)

df = df[['body', 'body_tok', 'pos_review', 'review_rating']]

len(df)

53337

In [None]:
df.head(2)

Unnamed: 0,body,body_tok,pos_review,review_rating
0,I’m happy with the way the phone looks but upo...,"['happy', 'way', 'phone', 'looks', 'upon', 'op...",0,2
1,the brand itself is not a problem. the problem...,"['brand', 'problem', 'problem', 'seller', 'pho...",0,1


## Building the classifier

In [None]:
# Download the lexicon

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
# Analyzer used for the sentiment analysis

vader = SentimentIntensityAnalyzer()

In [None]:
# Example

vader.polarity_scores('the battery is ok, but the screen is terrible')

{'compound': -0.5499, 'neg': 0.325, 'neu': 0.549, 'pos': 0.125}

## VADER on *body*

In [None]:
# Compute the compound scores

scores = []
for e in df['body']:
    cp = vader.polarity_scores(e)
    scores.append(cp['compound'])

In [None]:
# Store the compound score and 'pos_review' of each review

scores_and_labels = list(zip(scores, df['pos_review']))

scores_and_labels[:5]

[(-0.1923, 0), (-0.4651, 0), (0.9451, 0), (0.807, 1), (0.6275, 0)]

In [None]:
# Create the column with the compound scores

df['vader_compound'] = scores

df.tail(6)

Unnamed: 0,body,body_tok,pos_review,review_rating,vader_compound
13331,The product has been very good. I had used thi...,"['product', 'good', 'used', 'cell', 'phone', '...",1,5,0.8777
13332,For the price you really get a solid boost mot...,"['price', 'really', 'get', 'solid', 'boost', '...",1,5,0.7745
13333,This phone isn't kidding when it says military...,"['phone', 'kidding', 'says', 'military', 'spec...",1,4,-0.821
13334,Wouldn't know anything about the cell phone I ...,"['would', 'know', 'anything', 'cell', 'phone',...",0,1,-0.2144
13335,I got this phone just as secondary cell phone....,"['got', 'phone', 'secondary', 'cell', 'phone',...",0,3,0.4404
13336,This is probably the wrost experience i had wi...,"['probably', 'wrost', 'experience', 'amazon', ...",0,1,0.7845


Our *pos_review* label is 0 for reviews with up to 3 stars and 1 for reviews with 4 or more stars.

By contrast, *vader_compound* ranges from -1 to 1. We min-max normalize it to [0, 1] and derive a binary variable that is 0 on [0, 0.6] and 1 on (0.6, 1].
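As a rough illustration of what the 0.6 cut-off means on VADER's original scale, the sketch below inverts the min-max normalization. It assumes the observed minimum and maximum of the compound scores are exactly -1 and 1, which the notebook does not confirm; `to_raw_compound` is a hypothetical helper name.

```python
def to_raw_compound(normalized, lo=-1.0, hi=1.0):
    """Invert min-max normalization: map a value in [0, 1] back to [lo, hi].

    lo and hi are ASSUMED to be the observed minimum and maximum of the scores.
    """
    return normalized * (hi - lo) + lo

raw_threshold = to_raw_compound(0.6)
print(round(raw_threshold, 4))
```

Under that assumption, a normalized threshold of 0.6 corresponds to a raw compound score of about 0.2, and a normalized value of 0.5 corresponds to a raw score of 0.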

In [None]:
# The scalers do not work here, so we normalize to [0, 1] with a function

def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))

df['vader_compound'] = NormalizeData(df['vader_compound'])
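The notebook does not show why the scalers failed. One plausible reason, stated here as an assumption, is that scikit-learn scalers expect a 2D array while a single column is 1D. The sketch below, on made-up scores, reshapes the column before calling `MinMaxScaler` and checks that the result matches the manual formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up compound scores, for illustration only
compound = np.array([-0.821, -0.2144, 0.4404, 0.7845, 0.8777])

def normalize_data(data):
    """Manual min-max normalization, same formula as NormalizeData above."""
    return (data - np.min(data)) / (np.max(data) - np.min(data))

# reshape(-1, 1) turns the 1D column into a single-feature 2D array;
# ravel() flattens the scaler output back to 1D
scaled = MinMaxScaler().fit_transform(compound.reshape(-1, 1)).ravel()
manual = normalize_data(compound)

print(np.allclose(scaled, manual))
```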

In [None]:
df.tail(6)

Unnamed: 0,body,body_tok,pos_review,review_rating,vader_compound
13331,The product has been very good. I had used thi...,"['product', 'good', 'used', 'cell', 'phone', '...",1,5,0.938704
13332,For the price you really get a solid boost mot...,"['price', 'really', 'get', 'solid', 'boost', '...",1,5,0.886938
13333,This phone isn't kidding when it says military...,"['phone', 'kidding', 'says', 'military', 'spec...",1,4,0.086627
13334,Wouldn't know anything about the cell phone I ...,"['would', 'know', 'anything', 'cell', 'phone',...",0,1,0.390901
13335,I got this phone just as secondary cell phone....,"['got', 'phone', 'secondary', 'cell', 'phone',...",0,3,0.719352
13336,This is probably the wrost experience i had wi...,"['probably', 'wrost', 'experience', 'amazon', ...",0,1,0.891954


In [None]:
# Mirrors the split we applied to the stars
# 1, 2, 3 stars = 0 -> lower 60% of the normalized range
# 4, 5 stars = 1 -> upper 40% of the normalized range

df['vader_compound_binary'] = [1 if e > 0.6 else 0 for e in df['vader_compound']]

In [None]:
df.tail(6)

Unnamed: 0,body,body_tok,pos_review,review_rating,vader_compound,vader_compound_binary
13331,The product has been very good. I had used thi...,"['product', 'good', 'used', 'cell', 'phone', '...",1,5,0.938704,1
13332,For the price you really get a solid boost mot...,"['price', 'really', 'get', 'solid', 'boost', '...",1,5,0.886938,1
13333,This phone isn't kidding when it says military...,"['phone', 'kidding', 'says', 'military', 'spec...",1,4,0.086627,0
13334,Wouldn't know anything about the cell phone I ...,"['would', 'know', 'anything', 'cell', 'phone',...",0,1,0.390901,0
13335,I got this phone just as secondary cell phone....,"['got', 'phone', 'secondary', 'cell', 'phone',...",0,3,0.719352,1
13336,This is probably the wrost experience i had wi...,"['probably', 'wrost', 'experience', 'amazon', ...",0,1,0.891954,1


#### Binary evaluation

In [None]:
# Helper that prints the evaluation metrics

def model_evaluation(real_v, pred_v):
    print(f"Accuracy score: {accuracy_score(real_v, pred_v)}")
    print("Classification report:")
    print(classification_report(real_v, pred_v))
    cm = confusion_matrix(real_v, pred_v)
    print(f"Confusion matrix \n {cm}")

In [None]:
model_evaluation(df['pos_review'], df['vader_compound_binary'])

Accuracy score: 0.8032697752029548
Classification report:
              precision    recall  f1-score   support

           0       0.69      0.70      0.70     17116
           1       0.86      0.85      0.85     36221

    accuracy                           0.80     53337
   macro avg       0.77      0.78      0.78     53337
weighted avg       0.80      0.80      0.80     53337

Confusion matrix 
 [[12021  5095]
 [ 5398 30823]]
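The figures in the report can be re-derived from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does the arithmetic by hand for class 1; the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[12021, 5095],
      [5398, 30823]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of the reviews predicted positive, how many are positive
recall_pos = tp / (tp + fn)     # of the truly positive reviews, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_pos:.2f} recall={recall_pos:.2f}")
```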


#### Multiclass evaluation

We split *vader_compound* into 5 bins so it can be compared with the star ratings (*review_rating*).

In [None]:
stars = []

for e in df['vader_compound']:
    if e >= 0 and e < 0.2:
        stars.append(1)
    elif e >= 0.2 and e < 0.4:
        stars.append(2)
    elif e >= 0.4 and e < 0.6:
        stars.append(3)
    elif e >= 0.6 and e < 0.8:
        stars.append(4)
    else:
        stars.append(5)

df['vader_compound_stars'] = stars

df.tail(6)

Unnamed: 0,body,body_tok,pos_review,review_rating,vader_compound,vader_compound_binary,vader_compound_stars
13331,The product has been very good. I had used thi...,"['product', 'good', 'used', 'cell', 'phone', '...",1,5,0.938704,1,5
13332,For the price you really get a solid boost mot...,"['price', 'really', 'get', 'solid', 'boost', '...",1,5,0.886938,1,5
13333,This phone isn't kidding when it says military...,"['phone', 'kidding', 'says', 'military', 'spec...",1,4,0.086627,0,1
13334,Wouldn't know anything about the cell phone I ...,"['would', 'know', 'anything', 'cell', 'phone',...",0,1,0.390901,0,2
13335,I got this phone just as secondary cell phone....,"['got', 'phone', 'secondary', 'cell', 'phone',...",0,3,0.719352,1,4
13336,This is probably the wrost experience i had wi...,"['probably', 'wrost', 'experience', 'amazon', ...",0,1,0.891954,1,5
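The same binning can be written without an explicit loop. This is a sketch using `pd.cut` on made-up values rather than the notebook's data; the last edge is set to infinity so that a score of exactly 1.0 lands in the 5-star bin, matching the `else` branch above.

```python
import numpy as np
import pandas as pd

# Made-up normalized scores, chosen to sit on and around the bin edges
sample = pd.Series([0.0, 0.19, 0.2, 0.39, 0.4, 0.6, 0.79, 0.8, 1.0])

# right=False gives half-open bins [0, 0.2), [0.2, 0.4), ..., [0.8, inf)
star_bins = pd.cut(
    sample,
    bins=[0, 0.2, 0.4, 0.6, 0.8, np.inf],
    right=False,
    labels=[1, 2, 3, 4, 5],
).astype(int)

print(star_bins.tolist())
```

One difference from the loop: `pd.cut` leaves NaN scores unbinned rather than sending them to 5 stars, so the `astype(int)` call would fail on a column that contains NaN.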


In [None]:
model_evaluation(df['review_rating'], df['vader_compound_stars'])

Accuracy score: 0.48439169807075766
Classification report:
              precision    recall  f1-score   support

           1       0.61      0.24      0.34     10136
           2       0.13      0.23      0.17      3117
           3       0.11      0.24      0.15      3863
           4       0.18      0.24      0.20      7090
           5       0.76      0.69      0.72     29131

    accuracy                           0.48     53337
   macro avg       0.36      0.33      0.32     53337
weighted avg       0.57      0.48      0.51     53337

Confusion matrix 
 [[ 2431  2708  2855  1238   904]
 [  540   720   810   475   572]
 [  401   645   911   830  1076]
 [  220   512   910  1675  3773]
 [  374   815  2567  5276 20099]]


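## VADER on *body_tok*
Worse results: the evaluations in this section assign class 0 to every review in the binary case and 5 stars to every review in the multiclass case. That pattern is what the thresholds produce when *vader_compound* is NaN after normalization, since every comparison with NaN is False. This would happen if VADER returned the same score for every stringified token list (a likely cause, not verified here).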

In [None]:
# Reload the df without the new columns

df_train1 = pd.read_csv(filename1)
df_train2 = pd.read_csv(filename2)
df_test = pd.read_csv(filename3)
df_remaining = pd.read_csv(filename4)

frames = [df_train1, df_train2, df_test, df_remaining]
df = pd.concat(frames)

df = df[['body', 'body_tok', 'pos_review', 'review_rating']]

In [None]:
# Compute the compound scores
scores = []
for e in df['body_tok']:
    cp = vader.polarity_scores(e)
    scores.append(cp['compound'])

# Create the column with the compound scores
df['vader_compound'] = scores

# Normalize the data to [0, 1]
df['vader_compound'] = NormalizeData(df['vader_compound'])

# Treat reviews above 0.6 as positive,
# mirroring the split we applied to the stars
df['vader_compound_binary'] = [1 if e > 0.6 else 0 for e in df['vader_compound']]
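*body_tok* is read from the CSV as the string form of a Python list (for example `"['happy', 'way', 'phone']"`), so VADER receives brackets and quotes along with the words. A possible remedy, sketched here under the assumption that every entry is a valid list literal, is to parse the string and join the tokens before scoring. `tokens_to_text` is a hypothetical helper; the snippet also shows how a NaN score ends up as class 0 and as 5 stars.

```python
import ast

def tokens_to_text(body_tok):
    """Convert "['happy', 'way']" (or an actual list) into "happy way"."""
    tokens = ast.literal_eval(body_tok) if isinstance(body_tok, str) else body_tok
    return " ".join(tokens)

text = tokens_to_text("['happy', 'way', 'phone']")
print(text)

# Every comparison with NaN is False, so `e > 0.6` yields 0
# and the if/elif chain falls through to the final `else` (5 stars).
nan = float("nan")
binary_label = 1 if nan > 0.6 else 0
falls_through = not any(lo <= nan < lo + 0.2 for lo in (0.0, 0.2, 0.4, 0.6))
print(binary_label, falls_through)
```

The joined text could then be scored with `vader.polarity_scores(tokens_to_text(e))`. Whether that improves on the *body* results is not something this sketch establishes.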

#### Binary evaluation

In [None]:
model_evaluation(df['pos_review'], df['vader_compound_binary'])

Accuracy score: 0.3209029379230178
Classification report:
              precision    recall  f1-score   support

           0       0.32      1.00      0.49     17116
           1       0.00      0.00      0.00     36221

    accuracy                           0.32     53337
   macro avg       0.16      0.50      0.24     53337
weighted avg       0.10      0.32      0.16     53337

Confusion matrix 
 [[17116     0]
 [36221     0]]



#### Multiclass evaluation

In [None]:
stars = []

for e in df['vader_compound']:
    if e >= 0 and e < 0.2:
        stars.append(1)
    elif e >= 0.2 and e < 0.4:
        stars.append(2)
    elif e >= 0.4 and e < 0.6:
        stars.append(3)
    elif e >= 0.6 and e < 0.8:
        stars.append(4)
    else:
        stars.append(5)

df['vader_compound_stars'] = stars

In [None]:
model_evaluation(df['review_rating'], df['vader_compound_stars'])

Accuracy score: 0.546168700901813
Classification report:
              precision    recall  f1-score   support

           1       0.00      0.00      0.00     10136
           2       0.00      0.00      0.00      3117
           3       0.00      0.00      0.00      3863
           4       0.00      0.00      0.00      7090
           5       0.55      1.00      0.71     29131

    accuracy                           0.55     53337
   macro avg       0.11      0.20      0.14     53337
weighted avg       0.30      0.55      0.39     53337

Confusion matrix 
 [[    0     0     0     0 10136]
 [    0     0     0     0  3117]
 [    0     0     0     0  3863]
 [    0     0     0     0  7090]
 [    0     0     0     0 29131]]
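Because every review receives the same prediction here, accuracy reduces to the share of that class in the data. The check below uses only the support counts printed in the reports.

```python
# Support counts taken from the classification reports
total = 53337
five_star = 29131  # reviews with review_rating == 5
negative = 17116   # reviews with pos_review == 0

multiclass_accuracy = five_star / total  # predicting 5 stars for everything
binary_accuracy = negative / total       # predicting class 0 for everything

print(round(multiclass_accuracy, 4), round(binary_accuracy, 4))
```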



## Assessing the *pos_review* variable

We arbitrarily decided that a review is positive from 4 stars up. We use VADER to check whether reviews whose text is judged positive actually have at least 4 stars.

There are two obstacles to this kind of analysis:
*   The review text is not always consistent with the rating given (in our case, the stars): one user may write "fantastic" and give 4 stars, while another may write "good" and give 5.
*   VADER does not return a discrete variable (positive or negative); it returns a score expressed as a continuous variable from -1 to 1. VADER itself does not apply a threshold to decide whether a review is positive, so an arbitrary choice is needed there too.
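Since the cut-off is a free choice, one way to see how much it matters is to sweep several thresholds and compare accuracy against *pos_review*. The sketch below uses a handful of made-up (score, label) pairs, not the notebook's data, and `accuracy_at` is a hypothetical helper.

```python
def accuracy_at(threshold, scores, labels):
    """Accuracy when a compound score above `threshold` is labelled positive (1)."""
    predictions = [1 if s > threshold else 0 for s in scores]
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)

# Made-up compound scores and star-based labels, for illustration only
scores = [-0.82, -0.21, 0.10, 0.44, 0.63, 0.78, 0.88, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

for threshold in (0.0, 0.2, 0.5):
    print(threshold, accuracy_at(threshold, scores, labels))
```

On real data the same loop could run over `df['vader_compound']` and `df['pos_review']` to show how sensitive the agreement is to the chosen threshold.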



In [None]:
# Reload the df without the new columns

df = pd.concat(frames)  # concatenation

In [None]:
# Compute the compound scores
scores = []
for e in df['body']:
    cp = vader.polarity_scores(e)
    scores.append(cp['compound'])

# Create the column with the compound scores
df['vader_compound'] = scores

Earlier we split the (normalized) continuous variable *vader_compound* so that the lower 60% of its range is negative and the top 40% is positive, mirroring the split we applied to the stars.

We could try a less strict definition of a positive review and treat every text with a positive compound score as positive (which rules out the possibility of a neutral review).



In [None]:
# Treat reviews with a compound score above 0 as positive
df['vader_compound_binary'] = [1 if e > 0 else 0 for e in df['vader_compound']]

#### Evaluation and comparison

In [None]:
model_evaluation(df['pos_review'], df['vader_compound_binary'])

Accuracy score: 0.804582184974783
Classification report:
              precision    recall  f1-score   support

           0       0.72      0.65      0.68     17116
           1       0.84      0.88      0.86     36221

    accuracy                           0.80     53337
   macro avg       0.78      0.76      0.77     53337
weighted avg       0.80      0.80      0.80     53337

Confusion matrix 
 [[11067  6049]
 [ 4374 31847]]


The results are similar whether a review is considered positive in the top 40% of the *vader_compound* range (done earlier) or in the top 50% (done here). Note that in this case too the target variable is our *pos_review*.
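The two binary confusion matrices for *body* make the comparison concrete: lowering the cut-off trades false negatives for false positives while overall accuracy barely moves. The arithmetic below uses only the counts printed in this notebook; `summarize` is an illustrative helper.

```python
# Rows = true class, columns = predicted class
cm_top40 = [[12021, 5095], [5398, 30823]]  # normalized compound > 0.6
cm_top50 = [[11067, 6049], [4374, 31847]]  # raw compound > 0

def summarize(cm):
    """Return (accuracy, false positives, false negatives) for a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp), fp, fn

for name, cm in (("top 40%", cm_top40), ("top 50%", cm_top50)):
    accuracy, fp, fn = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f} FP={fp} FN={fn}")
```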