# Trainee Project - Natural Language Processing


<img src="https://i.ibb.co/DtHQ3FG/802x265-Logo-GT.png" width="370">


### Sentiment analysis on a dataset of IMDB reviews.

# Part 2
Applying a lexicon-based method with VADER.

## Preprocessing

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv("/content/drive/My Drive/nlp-trainee/IMDB Dataset.csv")

# Preparing the dataset as in Part 1
# Dropping duplicate rows
data.drop_duplicates(inplace=True)
# Converting the labels to 0 (negative) and 1 (positive)
data['sentiment'] = pd.get_dummies(data['sentiment'])['positive']

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49582 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     49582 non-null  object
 1   sentiment  49582 non-null  uint8 
dtypes: object(1), uint8(1)
memory usage: 823.1+ KB


To use VADER, the only text preprocessing step is stripping the HTML tags. VADER uses cues such as punctuation and capitalization, so the text is otherwise left untouched.

In [6]:
from bs4 import BeautifulSoup

In [7]:
def remove_tags_html(texto):
    """(str) -> str
    Takes a string and strips any HTML tags it contains."""
    soup = BeautifulSoup(texto, "html.parser")
    sem_tags = soup.get_text(separator=" ")
    return sem_tags

In [8]:
data['review'] = data['review'].apply(remove_tags_html)
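To illustrate why the separator matters, here is a minimal stdlib-only sketch of the same idea. It swaps BeautifulSoup for `html.parser.HTMLParser`, so `TextExtractor` and `strip_tags` are hypothetical names used for illustration and are not part of the notebook. IMDB reviews often contain `<br />` line breaks; joining the text fragments with a space keeps the words on either side of a tag from fusing together.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text fragments found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Called by the parser for every run of plain text.
        self.fragments.append(data)


def strip_tags(text, separator=" "):
    """Return `text` without HTML tags, joining fragments with `separator`."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return separator.join(parser.fragments)


sample = "Great film.<br /><br />Loved it."
print(strip_tags(sample))                # fragments separated by a space
print(strip_tags(sample, separator=""))  # fragments fused together
```

With an empty separator the two sentences run together as `film.Loved`, which a word-level lexicon lookup would likely treat as a single unknown token.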

In [9]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. The filming te...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


## Applying VADER

In [11]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading https://files.pythonhosted.org/packages/76/fc/310e16254683c1ed35eeb97386986d6c00bc29df17ce280aed64d55537e9/vaderSentiment-3.3.2-py2.py3-none-any.whl (125kB)

In [13]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [15]:
analyser = SentimentIntensityAnalyzer()

def vader_scores_compound(texto):
  '''(str) -> float
  Takes a text, computes the VADER scores, and returns the compound value.'''

  score = analyser.polarity_scores(texto)
  return score['compound']
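As a rough intuition for what a lexicon-based compound score is, the sketch below sums per-word valences from a tiny made-up lexicon and squashes the sum into [-1, 1] with x / sqrt(x² + alpha), the same form of normalization that vaderSentiment applies with alpha = 15. `TOY_LEXICON`, its values, and `toy_compound` are assumptions for illustration only. The real VADER lexicon has thousands of entries and also applies rules for negation, intensifiers, punctuation, and capitalization, so this is not a reimplementation.

```python
import math

# Hypothetical mini-lexicon: word -> valence. Values are made up for illustration.
TOY_LEXICON = {"wonderful": 2.7, "great": 3.1, "boring": -1.3, "terrible": -2.1}


def toy_compound(text, alpha=15):
    """Sum word valences and normalize the total into the range [-1, 1]."""
    words = text.lower().split()
    total = sum(TOY_LEXICON.get(word.strip(".,!?"), 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


short = "A wonderful film."
longer = "A wonderful film, great cast, great script, wonderful ending."
print(toy_compound(short))
print(toy_compound(longer))  # more sentiment words push the score closer to 1
```

Because the sum grows with every sentiment-bearing word, long texts tend to saturate near -1 or 1, which may help explain why most compound scores for full reviews sit close to the extremes.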

In [17]:
data['compound score (vader)'] = data['review'].apply(vader_scores_compound)

In [18]:
data.head(10)

Unnamed: 0,review,sentiment,compound score (vader)
0,One of the other reviewers has mentioned that ...,1,-0.9916
1,A wonderful little production. The filming te...,1,0.967
2,I thought this was a wonderful way to spend ti...,1,0.9745
3,Basically there's a family where a little boy ...,0,-0.9213
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,0.9766
5,"Probably my all-time favorite movie, a story o...",1,0.9828
6,I sure would like to see a resurrection of a u...,1,0.9022
7,"This show was an amazing, fresh & innovative i...",0,0.873
8,Encouraged by the positive comments about this...,0,0.2362
9,If you like original gut wrenching laughter yo...,1,0.9507


In [19]:
def compound_to_label(score):
  '''(float) -> int
  Takes a compound score computed by VADER.
  If score >= 0, returns 1 (the positive label).
  If score < 0, returns 0 (negative).'''
  if score < 0:
    return 0
  return 1

In [20]:
data['vader_pred'] = data['compound score (vader)'].apply(compound_to_label)

In [21]:
data.head()

Unnamed: 0,review,sentiment,compound score (vader),vader_pred
0,One of the other reviewers has mentioned that ...,1,-0.9916,0
1,A wonderful little production. The filming te...,1,0.967,1
2,I thought this was a wonderful way to spend ti...,1,0.9745,1
3,Basically there's a family where a little boy ...,0,-0.9213,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,0.9766,1
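As a quick sanity check, the thresholding rule can be applied by hand to the ten rows printed by `data.head(10)` earlier. The lists below are copied from that output; `apply_threshold` is a hypothetical helper that mirrors `compound_to_label` with the cutoff exposed as a parameter.

```python
# Labels and compound scores copied from the data.head(10) output.
labels = [1, 1, 1, 0, 1, 1, 1, 0, 0, 1]
scores = [-0.9916, 0.967, 0.9745, -0.9213, 0.9766,
          0.9828, 0.9022, 0.873, 0.2362, 0.9507]


def apply_threshold(score, threshold=0.0):
    """Return 1 (positive) if score >= threshold, otherwise 0 (negative)."""
    return 1 if score >= threshold else 0


preds = [apply_threshold(s) for s in scores]
hits = sum(p == y for p, y in zip(preds, labels))
false_pos = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
false_neg = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

print(preds)                 # [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(hits / len(labels))    # 0.7
print(false_pos, false_neg)  # 2 1
```

Ten rows are far too few to draw conclusions from, but they already show negative reviews (rows 7 and 8) receiving positive compound scores.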


## Results

In [24]:
# Importing the metrics to be used
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [22]:
def mostra_metricas(y_true, y_pred):
  print("Confusion matrix:")
  print(confusion_matrix(y_true, y_pred))
  print("\nAccuracy: ", accuracy_score(y_true, y_pred))
  print("\n",classification_report(y_true, y_pred))

In [25]:
mostra_metricas(data['sentiment'], data['vader_pred'])

Confusion matrix:
[[13174 11524]
 [ 3413 21471]]

Accuracy:  0.6987414787624541

               precision    recall  f1-score   support

           0       0.79      0.53      0.64     24698
           1       0.65      0.86      0.74     24884

    accuracy                           0.70     49582
   macro avg       0.72      0.70      0.69     49582
weighted avg       0.72      0.70      0.69     49582



Accuracy was 69.9% and the macro-averaged F1 score was 69%. According to the confusion matrix, the model produced far more false positives (11,524) than false negatives (3,413), which indicates a tendency to predict a review as positive more often than negative.
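The figures in the classification report can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, so the four cells read TN, FP, FN, TP in order. The variable names are chosen for illustration.

```python
# Cells of the confusion matrix printed above (rows: true, columns: predicted).
tn, fp = 13174, 11524
fn, tp = 3413, 21471

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Positive class (label 1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class (label 0): the roles of the cells are mirrored
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

macro_f1 = (f1_pos + f1_neg) / 2

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"macro f1={macro_f1:.2f}")
```

The low recall for class 0 (about 0.53) corresponds to the 11,524 negative reviews that were labeled positive.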

Overall, the models from Part 1 achieved better results.