# Sentiment analysis of IMDb reviews
This notebook runs an experiment that compares a classifier trained on IMDb review data against an off-the-shelf solution from the nltk library on the sentiment analysis task.

In [25]:
import pandas as pd
import nltk
import csv
import re

nltk.download('stopwords') # Stopword list
nltk.download('vader_lexicon') # Lexicon used for the VADER polarity scores

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Dataset preprocessing
The data is not in a standard format that can be loaded directly into a DataFrame, so the file is opened manually and cleaned up before loading.

In [26]:
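inputs = list()
outputs = list()
with open('imdb.csv') as csv_file:
  # QUOTE_NONE treats quotes as ordinary characters, so commas inside a review also split the row
  reader = csv.reader(csv_file, delimiter=',', quoting=csv.QUOTE_NONE)
  header = next(reader)
  
  for row in reader:
    # The last field is the label; strip quotes and semicolons from it
    sentiment = row[-1].replace('\"', '').replace(';', '').replace('\'', '')
    # All remaining fields belong to the review text; rejoin them and strip quotes
    text = ','.join(row[:-1]).replace('\"', '').replace('\'', '')
    
    inputs.append(text)
    outputs.append(sentiment)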
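
To make the parsing above concrete, here is a minimal sketch on an in-memory sample. The sample line is made up; its layout (a quoted review followed by a label ending in `;;;;;;`) is an assumption inferred from the `sentiment;;;;;;` header name, not copied from `imdb.csv`.

```python
import csv
import io

# Made-up sample mimicking the assumed layout of imdb.csv
raw = (
    'review,sentiment;;;;;;\n'
    '"A film, with commas, and \'quotes\'",positive;;;;;;\n'
)

reader = csv.reader(io.StringIO(raw), delimiter=',', quoting=csv.QUOTE_NONE)
header = next(reader)
row = next(reader)

# With QUOTE_NONE the review is split at every comma, so the label is
# always the last field and the review is everything before it
sentiment = row[-1].replace('"', '').replace(';', '').replace("'", '')
text = ','.join(row[:-1]).replace('"', '').replace("'", '')

print(header)
print(len(row), 'fields ->', repr(text), repr(sentiment))
```

Note that this cleanup also removes apostrophes inside the review, which is why contractions show up as `Im` or `theres` in the loaded data.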

In [27]:
df_imdb = pd.DataFrame.from_dict(dict(zip(header, [inputs, outputs])))
df_imdb.rename({'review': 'Review', 'sentiment;;;;;;':'Sentiment'}, axis=1, inplace=True)
df_imdb

Unnamed: 0,Review,Sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy (...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,Im going to have to disagree with the previous...,negative


## Text preprocessing
A function is built to clean the text by removing HTML tags, punctuation, and symbols and by normalizing numbers. After tokenization, stopwords are removed, the text is lowercased, and words are reduced to their stems.

In [28]:
re_strip_html = re.compile('(<|</).+?(/>|>)') # Regular expression for HTML tags
re_strip_punct = re.compile('[^a-zA-Z0-9]') # Regular expression for punctuation and symbols
re_strip_numbers = re.compile('[0-9]') # Regular expression for digits (each digit is matched separately)

stopwords = nltk.corpus.stopwords.words('english')
stopwords = set(stopwords)

stemmer = nltk.stem.PorterStemmer()

In [29]:
def preprocess_reviews(
    sentence,
    re_strip_html,
    re_strip_punct,
    re_strip_numbers,
    stopwords,
    stemmer
):
  # Removes HTML tags, punctuation, and symbols, and replaces each digit 
  # with a <NUMBER> token
  sentence = re_strip_html.sub('', sentence)
  sentence = re_strip_punct.sub(' ', sentence)
  sentence = re_strip_numbers.sub(' <NUMBER> ', sentence)
  # Collapses repeated spaces between words, converts to lowercase, and 
  # filters out the stopwords
  sentence = ' '.join([ 
    word.lower() for word in sentence.split() 
    if word.lower() not in stopwords 
  ])
  # Splits the text and reduces each word to its stem
  sentence = ' '.join([ 
    stemmer.stem(word) for word in sentence.split() 
  ])
  return sentence
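
As a quick check of the cleaning regexes, the sketch below runs them on a made-up sentence, leaving out stopword removal and stemming so that it needs only the standard library. Because `[0-9]` matches one digit at a time, a four-digit year becomes four tokens, which is consistent with the repeated `<number>` visible in the cleaned reviews. The helper `clean` and the `[0-9]+` pattern are hypothetical and only shown for comparison; the notebook itself uses the per-digit pattern.

```python
import re

re_strip_html = re.compile('(<|</).+?(/>|>)')
re_strip_punct = re.compile('[^a-zA-Z0-9]')
re_digit = re.compile('[0-9]')     # pattern used in the notebook
re_number = re.compile('[0-9]+')   # hypothetical alternative: one token per number

def clean(sentence, number_pattern):
    # Same order as preprocess_reviews: tags, then symbols, then numbers
    sentence = re_strip_html.sub('', sentence)
    sentence = re_strip_punct.sub(' ', sentence)
    sentence = number_pattern.sub(' <NUMBER> ', sentence)
    # Lowercase and collapse repeated whitespace
    return ' '.join(sentence.lower().split())

sample = 'Great film!<br /><br />Released in 1994.'
per_digit = clean(sample, re_digit)
per_number = clean(sample, re_number)

print(per_digit)
print(per_number)
```

Whether one token per digit or one per number is preferable depends on the downstream features; the experiment below was run with the per-digit behavior.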

In [30]:
# Applies the text preprocessing and stores the result in a new DataFrame column
df_imdb = df_imdb.assign(Cleaned_Review = df_imdb.Review.apply(lambda x: preprocess_reviews(
    x,
    re_strip_html,
    re_strip_punct,
    re_strip_numbers,
    stopwords,
    stemmer
)))

df_imdb

Unnamed: 0,Review,Sentiment,Cleaned_Review
0,One of the other reviewers has mentioned that ...,positive,one review mention watch <number> oz episod yo...
1,A wonderful little production. <br /><br />The...,positive,wonder littl product film techniqu unassum old...
2,I thought this was a wonderful way to spend ti...,positive,thought wonder way spend time hot summer weeke...
3,Basically theres a family where a little boy (...,negative,basic there famili littl boy jake think there ...
4,Petter Matteis Love in the Time of Money is a ...,positive,petter mattei love time money visual stun film...
...,...,...,...
49995,I thought this movie did a down right good job...,positive,thought movi right good job wasnt creativ orig...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,bad plot bad dialogu bad act idiot direct anno...
49997,I am a Catholic taught in parochial elementary...,negative,cathol taught parochi elementari school nun ta...
49998,Im going to have to disagree with the previous...,negative,im go disagre previou comment side maltin one ...


## Experimentation
This experiment performs model and hyperparameter selection using a holdout split: 70% of the data is used for training and 30% for testing. Text is represented with tf-idf weights. The n-gram range is varied, and feature selection is applied to reduce dimensionality and control overfitting.
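
To illustrate what the tf-idf representation with n-grams computes, here is a small pure-Python sketch. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies by default; the three documents and the helpers `ngrams` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Made-up documents, already cleaned and stemmed
docs = [
    'good movi good act',
    'bad movi bad plot',
    'good plot',
]

def ngrams(tokens, n_min, n_max):
    # All contiguous word sequences of length n_min..n_max
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

def tfidf(docs, ngram_range=(1, 1)):
    tokenized = [ngrams(d.split(), *ngram_range) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

unigram_rows = tfidf(docs, (1, 1))
bigram_rows = tfidf(docs, (1, 2))

print(sorted(unigram_rows[0].items()))
print(sorted(bigram_rows[0]))
```

In the first document, `act` appears in fewer documents than `movi`, so it receives the larger weight even though both occur once. Moving from `(1, 1)` to `(1, 2)` adds terms such as `good act`, which is how the vocabulary grows and why a feature selection step becomes useful.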

In [31]:
# Check the proportion of each sentiment
# The classes are balanced, so no rebalancing is needed
# and accuracy can be used as the evaluation metric
df_imdb.Sentiment.value_counts()

negative    25000
positive    25000
Name: Sentiment, dtype: int64

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def build_features(df_train, df_test, n_gram_range):
  tf_idf_vectorizer = TfidfVectorizer(
      ngram_range=n_gram_range, # Build tokens from sequences of up to n words
      min_df=25 # Minimum number of reviews a term must appear in
  )

  x_train = tf_idf_vectorizer.fit_transform(df_train.Cleaned_Review)
  x_test = tf_idf_vectorizer.transform(df_test.Cleaned_Review)
  y_train = df_train.Sentiment
  y_test = df_test.Sentiment

  return x_train, y_train, x_test, y_test

def build_models(algorithm):
  model = None
  
  if algorithm == 'lr':
    model = GridSearchCV(
        LogisticRegression(n_jobs=-1),
        param_grid={
            'C': [0.01, 0.1, 1, 10, 100]
        },
        scoring='accuracy',
        n_jobs=-1
    )
  elif algorithm == 'svm':
    model = GridSearchCV(
        LinearSVC(),
        param_grid={
            'C': [0.01, 0.1, 1, 10, 100]
        },
        scoring='accuracy',
        n_jobs=-1
    ) 
  else:
    raise ValueError('Model not implemented')

  return model
  
df_train, df_test = train_test_split(df_imdb, train_size=0.70)
models = dict()
for n_gram_range in [(1,1), (1,2)]:
  feature_selector = SelectKBest(chi2, k=8000)
  
  x_train, y_train, x_test, y_test = build_features(df_train, df_test, n_gram_range)
  x_train = feature_selector.fit_transform(x_train, y_train)
  x_test = feature_selector.transform(x_test)

  for algorithm in ['lr', 'svm']:
    model = build_models(algorithm)
    model.fit(x_train, y_train)

    predicted_train = model.predict(x_train)
    predicted_test = model.predict(x_test)

    accuracy_train = accuracy_score(y_train, predicted_train)
    accuracy_test = accuracy_score(y_test, predicted_test)
    
    print('Algorithm: {}\nN-Gram: {}\nAccuracy Train: {}\nAccuracy Test: {}\n'.format(
        algorithm, 
        n_gram_range[1],
        accuracy_train,
        accuracy_test
    ))

Algorithm: lr
N-Gram: 1
Accuracy Train: 0.9158571428571428
Accuracy Test: 0.8909333333333334

Algorithm: svm
N-Gram: 1
Accuracy Train: 0.9170857142857143
Accuracy Test: 0.8920666666666667

Algorithm: lr
N-Gram: 2
Accuracy Train: 0.9461714285714286
Accuracy Test: 0.8991333333333333

Algorithm: svm
N-Gram: 2
Accuracy Train: 0.9479142857142857
Accuracy Test: 0.8991333333333333
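
In the loop above, the vectorizer and the feature selector are fitted once on the whole training split, and only the classifier is refitted inside the cross-validation run by `GridSearchCV`. A possible variation, sketched below, wraps all three steps in a scikit-learn `Pipeline` so that each fold refits them and the n-gram range is searched together with `C`. The toy texts, the step names, and `k=5` are assumptions chosen to keep the example small; they are not the settings used in the experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up, already cleaned reviews (4 positive, 4 negative)
texts = [
    'wonder film great act', 'love great stori', 'great fun wonder cast',
    'wonder stori love film', 'bad plot bad act', 'bore film terribl stori',
    'terribl act bad cast', 'bore plot terribl film',
]
labels = ['positive'] * 4 + ['negative'] * 4

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('select', SelectKBest(chi2, k=5)),  # k=5 only fits this toy vocabulary
    ('clf', LogisticRegression()),
])

search = GridSearchCV(
    pipeline,
    param_grid={
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'clf__C': [0.1, 1, 10],
    },
    scoring='accuracy',
    cv=2,  # small fold count because the toy dataset has only 8 rows
)
search.fit(texts, labels)

predictions = search.predict(['great wonder film', 'bad bore plot'])
print(search.best_params_)
print(predictions)
```

With this layout the raw cleaned text goes straight into `fit` and `predict`, so the separate `build_features` step would not be needed.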



## Evaluating the results
After the model selection experiment, we train a classifier with the best configuration. It predicts on the test set so that its predictions can later be compared with the results obtained from nltk.

In [36]:
df_train, df_test = train_test_split(df_imdb, train_size=0.70)
x_train, y_train, x_test, y_test = build_features(df_train, df_test, (1,2))

feature_selector = SelectKBest(chi2, k=8000)
x_train = feature_selector.fit_transform(x_train, y_train)
x_test = feature_selector.transform(x_test)

model = build_models('lr')
model.fit(x_train, y_train)

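# Predict on both splits with the model fitted in this cell
predicted_train = model.predict(x_train)
predicted_test = model.predict(x_test)

accuracy_train = accuracy_score(y_train, predicted_train)
accuracy_test = accuracy_score(y_test, predicted_test)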

df_test = df_test.assign(Model_Prediction = predicted_test)
df_test

Unnamed: 0,Review,Sentiment,Cleaned_Review,Model_Prediction
48802,This film stars Peter Lorre as an exceptionall...,positive,film star peter lorr except nice guy immigr am...,negative
11415,There are some great philosophical questions. ...,negative,great philosoph question purpos life happen di...,negative
42745,"OK, so this film is well acted. It has good di...",negative,ok film well act good direct simpl fact underm...,negative
49547,There are four great movie depicting the Vietn...,positive,four great movi depict vietnam war particular ...,positive
33723,John Hughes wrote a lot of great comedies in t...,negative,john hugh wrote lot great comedi <number> <num...,negative
...,...,...,...,...
10468,Plenty has been written about Mamets The House...,positive,plenti written mamet hous game good decid revi...,positive
38176,Ive tried to reconcile why so many bad reviews...,positive,ive tri reconcil mani bad review film vast maj...,positive
44623,Alain Resnais directs three parallel stories t...,negative,alain resnai direct three parallel stori fanta...,negative
1087,Maria Braun is an extraordinary woman presente...,positive,maria braun extraordinari woman present fulli ...,negative


In [37]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

def parse_sia_result(result):
  if result['pos'] > result['neg']:
    return 'positive'
  else:
    return 'negative'

df_test = df_test.assign(SIA_Prediction = df_test.Review.apply(lambda x: parse_sia_result(sia.polarity_scores(x))))
df_test

Unnamed: 0,Review,Sentiment,Cleaned_Review,Model_Prediction,SIA_Prediction
48802,This film stars Peter Lorre as an exceptionall...,positive,film star peter lorr except nice guy immigr am...,negative,positive
11415,There are some great philosophical questions. ...,negative,great philosoph question purpos life happen di...,negative,negative
42745,"OK, so this film is well acted. It has good di...",negative,ok film well act good direct simpl fact underm...,negative,positive
49547,There are four great movie depicting the Vietn...,positive,four great movi depict vietnam war particular ...,positive,negative
33723,John Hughes wrote a lot of great comedies in t...,negative,john hugh wrote lot great comedi <number> <num...,negative,positive
...,...,...,...,...,...
10468,Plenty has been written about Mamets The House...,positive,plenti written mamet hous game good decid revi...,positive,positive
38176,Ive tried to reconcile why so many bad reviews...,positive,ive tri reconcil mani bad review film vast maj...,positive,positive
44623,Alain Resnais directs three parallel stories t...,negative,alain resnai direct three parallel stori fanta...,negative,negative
1087,Maria Braun is an extraordinary woman presente...,positive,maria braun extraordinari woman present fulli ...,negative,negative
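
`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The rule above compares `pos` with `neg`; another option, sketched here as a hypothetical alternative, is to threshold the `compound` score. The score dictionaries and the zero threshold are made-up values for illustration, not outputs of the analyzer on the IMDb reviews.

```python
def parse_sia_result(result):
    # Rule used in the notebook: compare positive and negative proportions
    return 'positive' if result['pos'] > result['neg'] else 'negative'

def parse_sia_compound(result, threshold=0.0):
    # Hypothetical alternative: threshold the normalized compound score
    return 'positive' if result['compound'] > threshold else 'negative'

# Made-up dictionaries in the shape returned by polarity_scores()
examples = [
    {'neg': 0.10, 'neu': 0.70, 'pos': 0.20, 'compound': 0.62},
    {'neg': 0.15, 'neu': 0.70, 'pos': 0.15, 'compound': 0.00},
    {'neg': 0.12, 'neu': 0.78, 'pos': 0.10, 'compound': 0.31},
]

by_rule = [parse_sia_result(e) for e in examples]
by_compound = [parse_sia_compound(e) for e in examples]

print(by_rule)
print(by_compound)
```

Both rules label ties as negative. Which one scores better on these reviews would have to be measured; the results in this notebook were obtained with the `pos` versus `neg` rule.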


## Conclusion
Comparing the classifier trained for this sentiment analysis task with the off-the-shelf nltk solution, our solution performs clearly better: 89.38% accuracy on the test set versus 69.00% for `SentimentIntensityAnalyzer`.

In [38]:
print('Model accuracy score: %.2f' % (accuracy_score(df_test.Sentiment, df_test.Model_Prediction) * 100))
print('SIA accuracy score: %.2f' % (accuracy_score(df_test.Sentiment, df_test.SIA_Prediction) * 100))

Model accuracy score: 89.38
SIA accuracy score: 69.00
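
Beyond the overall accuracy, it can help to look at the rows where the two approaches disagree. The sketch below does this for the nine rows displayed in the table above, copied by hand; it is only an illustration of the comparison, and the numbers it prints describe those nine rows, not the full test set. The helper `accuracy` is a hypothetical name.

```python
# (index, Sentiment, Model_Prediction, SIA_Prediction) for the displayed rows
rows = [
    (48802, 'positive', 'negative', 'positive'),
    (11415, 'negative', 'negative', 'negative'),
    (42745, 'negative', 'negative', 'positive'),
    (49547, 'positive', 'positive', 'negative'),
    (33723, 'negative', 'negative', 'positive'),
    (10468, 'positive', 'positive', 'positive'),
    (38176, 'positive', 'positive', 'positive'),
    (44623, 'negative', 'negative', 'negative'),
    (1087, 'positive', 'negative', 'negative'),
]

def accuracy(rows, column):
    # Fraction of rows whose prediction in `column` matches the true label
    return sum(r[1] == r[column] for r in rows) / len(rows)

model_acc = accuracy(rows, 2)
sia_acc = accuracy(rows, 3)

# Rows where the trained model and the nltk analyzer give different labels
disagreements = [r for r in rows if r[2] != r[3]]
model_wins = sum(r[1] == r[2] for r in disagreements)
sia_wins = sum(r[1] == r[3] for r in disagreements)

print('Model: %.2f, SIA: %.2f' % (model_acc * 100, sia_acc * 100))
print('Disagreements: %d (model right: %d, SIA right: %d)'
      % (len(disagreements), model_wins, sia_wins))
```

On the full `df_test`, the same comparison could be done by filtering on `Model_Prediction != SIA_Prediction`.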
