### Text classification for sentiment analysis

Dataset

Instructions:
- The goal of this assignment is to build a binary machine learning model for text classification. To do so, the [IMDb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset will be used, which consists of the text of positive and negative movie reviews
- Once trained, the model must provide a `predict` function that takes a string as a parameter and returns 1 or 0, where 1 means a positive review and 0 a negative review
- Pre-processing may be implemented as you see fit (e.g., stopword removal, word embedding, one-hot encoding, char encoding)
- A recurrent model (e.g., RNN, LSTM, GRU) is preferred for the classification step
- Document the code (briefly explain what each function does and add comments to relevant code sections)
- **Attention**: Once the final model is trained, save it in your project directory and add a cell at the end of the notebook containing a function that loads this file, together with a call to the `predict` function

Suggestions:
- Explore the dataset in the first cells of the notebook to better understand the problem, the data distribution, etc.
- After building the classification pipeline, it is advisable to run a hyperparameter search and compare the results obtained in different settings

Deadline:
- 01-08-2021 at 23:59 GMT-3

Preferred submission format:
- Post the link to the GitHub project on the course's Ava portal (or attach the project directly on the Ava portal)

luann.porfirio@gmail.com

### Import all libraries

In [82]:
import re
import nltk
import numpy as np
import pandas as pd

from torchtext import datasets
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from tensorflow.keras.models import load_model

### Load dataset

In [17]:
train_iter, test_iter = datasets.IMDB()
dataset_imdb = list(train_iter + test_iter)
df_imdb_raw = pd.DataFrame(data=dataset_imdb, columns=['sentiment', 'review'])

### Inspect Dataset

In [4]:
df_imdb_raw.shape

(50000, 2)

In [5]:
df_imdb_raw.head()

| | sentiment | review |
|---|---|---|
| 0 | neg | I rented I AM CURIOUS-YELLOW from my video sto... |
| 1 | neg | "I Am Curious: Yellow" is a risible and preten... |
| 2 | neg | If only to avoid making this type of film in t... |
| 3 | neg | This film was probably inspired by Godard's Ma... |
| 4 | neg | Oh, brother...after hearing about this ridicul... |


In [6]:
df_imdb_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  50000 non-null  object
 1   review     50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [7]:
df_imdb_raw.describe()

| | sentiment | review |
|---|---|---|
| count | 50000 | 50000 |
| unique | 2 | 49582 |
| top | neg | Loved today's show!!! It was a variety and not... |
| freq | 25000 | 5 |


In [8]:
df_imdb_raw.nunique()

sentiment        2
review       49582
dtype: int64


In [10]:
df_imdb_raw.tail()

| | sentiment | review |
|---|---|---|
| 49995 | pos | Just got around to seeing Monster Man yesterda... |
| 49996 | pos | I got this as part of a competition prize. I w... |
| 49997 | pos | I got Monster Man in a box set of three films ... |
| 49998 | pos | Five minutes in, i started to feel how naff th... |
| 49999 | pos | I caught this movie on the Sci-Fi channel rece... |


In [11]:
df_imdb_raw[df_imdb_raw.duplicated()]

| | sentiment | review |
|---|---|---|
| 168 | neg | I am not so much like Love Sick as I image. Fi... |
| 664 | neg | Holy freaking God all-freaking-mighty. This mo... |
| 701 | neg | The story and the show were good, but it was r... |
| 3070 | neg | I watched this movie when Joe Bob Briggs hoste... |
| 3591 | neg | I like Chris Rock, but I feel he is wasted in ... |
| ... | ... | ... |
| 49911 | pos | I watched Pola X because Scott Walker composed... |
| 49912 | pos | Leos Carax has made 3 great movies: Boys Meet ... |
| 49913 | pos | Leos Carax is brilliant and is one of the best... |
| 49914 | pos | I've tried to reconcile why so many bad review... |


The reviews need cleaning: stopwords and non-alphabetic characters must be removed, and the text converted to lowercase.

There are also duplicate entries.
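
The cleaning steps listed above can be sketched in plain Python. This is only an illustration: `clean_review` and the tiny `STOPWORDS` set are hypothetical names (the notebook uses NLTK's English stopword list), and the sketch lowercases before filtering stopwords, which is not the order used in the cells below.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself relies on NLTK's English stopword list.
STOPWORDS = {"the", "is", "a", "and", "it", "this"}

def clean_review(text, stopwords=STOPWORDS):
    """Return the lowercase, non-stopword tokens of a raw review."""
    text = re.sub(r"<.*?>", " ", text)       # drop HTML tags such as <br />
    text = re.sub(r"[^A-Za-z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lowercase before filtering
    return [w for w in tokens if w not in stopwords]

print(clean_review("This movie is GREAT!<br />The plot... 10/10"))
# ['movie', 'great', 'plot']
```

Lowercasing first means capitalized stopwords such as "This" and "The" are filtered as well.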


### Clean Dataset

In [18]:
df_cleaned = df_imdb_raw.drop_duplicates()

In [19]:
df_cleaned.describe()

| | sentiment | review |
|---|---|---|
| count | 49582 | 49582 |
| unique | 2 | 49582 |
| top | pos | OK... so... I really like Kris Kristofferson a... |
| freq | 24884 | 1 |


In [20]:
nltk.download('stopwords')
english_stops = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
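# Note: features and labels are taken from df_imdb_raw, so the duplicates dropped in df_cleaned are still included
X, y = df_imdb_raw['review'], df_imdb_raw['sentiment']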

In [27]:
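# Strip HTML tags such as <br />
X = X.replace({'<.*?>': ''}, regex = True)
# Replace every non-alphabetic character with a space
X = X.replace({'[^A-Za-z]': ' '}, regex = True)
# Split into words and drop stopwords (case-sensitive: capitalized stopwords such as "The" are kept)
X = X.apply(lambda review: [w for w in review.split() if w not in english_stops])
# Lowercase the remaining words
X = X.apply(lambda review: [w.lower() for w in review])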

In [28]:
# Encode the labels: positive = 1, negative = 0
y = y.replace('pos', 1)
y = y.replace('neg', 0)

In [31]:
# Hold out 30% of the reviews for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [32]:
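def get_max_length():
    """Return the mean review length (in tokens) of x_train, rounded up; it is used as the padding length."""
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))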

In [37]:
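# ENCODE REVIEW
# Build the vocabulary from the training reviews only (words are already lowercased, hence lower=False)
token = Tokenizer(lower=False)
token.fit_on_texts(x_train)
# Convert each review into a sequence of integer word indices
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

# Pad with zeros or truncate at the end so that every sequence has max_length items
x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')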

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[   55    27  2695 ...     0     0     0]
 [    8     3   183 ...   133   250 28352]
 [  105   457  1127 ...     0     0     0]
 ...
 [    2  9488  3360 ...     0     0     0]
 [  378   226   269 ...  6742   444  1257]
 [   39   689   338 ...   142  1285     0]] 

Encoded X Test
 [[   39  1733   514 ...     0     0     0]
 [  172  6195   334 ...     0     0     0]
 [    8     5    92 ...     0     0     0]
 ...
 [ 5286    46     3 ...     0     0     0]
 [    1   297  3586 ...     0     0     0]
 [    1   465 27807 ...     0     0     0]] 

Maximum review length:  130
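
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal pure-Python sketch of the same idea. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation, and the short `maxlen` is chosen only to keep the output readable.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to maxlen items and right-pad shorter ones with value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # 'post' truncation: keep the first maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))   # 'post' padding: zeros go at the end
    return padded

print(pad_post([[55, 27, 2695], [8, 3, 183, 133, 250, 28352]], maxlen=4))
# [[55, 27, 2695, 0], [8, 3, 183, 133]]
```

Because the padding length is the mean review length, reviews longer than the mean lose their final words.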


### Build Architecture

In [71]:
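# ARCHITECTURE
EMBED_DIM = 32   # size of each word embedding vector
LSTM_OUT = 64    # number of LSTM units

model = Sequential()
# Map each word index to a trainable EMBED_DIM-dimensional vector
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
# Summarize the whole sequence into a single LSTM_OUT-dimensional vector
model.add(LSTM(LSTM_OUT))
# Single sigmoid unit: estimated probability that the review is positive
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])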

print(model.summary())

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 130, 32)           2801248   
_________________________________________________________________
lstm_7 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 65        
Total params: 2,826,145
Trainable params: 2,826,145
Non-trainable params: 0
_________________________________________________________________
None
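
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and recovers the vocabulary size from the embedding row of the summary.

```python
EMBED_DIM = 32
LSTM_OUT = 64

embedding_params = 2801248                   # value reported in the summary
vocab_size = embedding_params // EMBED_DIM   # rows of the embedding matrix (total_words)

# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((EMBED_DIM + LSTM_OUT) * LSTM_OUT + LSTM_OUT)

# One output unit: LSTM_OUT weights plus one bias
dense_params = LSTM_OUT * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total_params)
# 87539 24832 65 2826145
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.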


### Train

In [72]:
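# Stop training when the training loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(
    monitor="loss",
    patience=2
)

# Save a checkpoint whenever the training accuracy improves
checkpoint = ModelCheckpoint(
    'models/imdb.{epoch:03d}-{accuracy:.4f}.hdf5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

# Write training logs for TensorBoard
logs = TensorBoard(log_dir='./logs')

callbacks = [ early_stop, checkpoint, logs ]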
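
The `patience=2` rule can be illustrated without Keras. The function below is a simplified model of the idea (stop after `patience` consecutive epochs without a new best loss, assuming no minimum delta); `stopping_epoch` and the loss values are hypothetical, and this is not the Keras implementation.

```python
def stopping_epoch(losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0      # new best value: reset the counter
        else:
            wait += 1                 # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss curves
print(stopping_epoch([0.50, 0.30, 0.25, 0.26, 0.27, 0.20]))  # 5
print(stopping_epoch([0.50, 0.40, 0.30]))                    # None
```

Since the monitored quantity here is the training loss, this rule says nothing about performance on unseen reviews.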

In [74]:
EPOCHS = 10

model.fit(x_train, y_train, batch_size = 128, epochs = EPOCHS, callbacks=callbacks)

Epoch 1/10
Epoch 00001: accuracy improved from -inf to 0.74814, saving model to models/imdb.001-0.7481.hdf5
Epoch 2/10
Epoch 00002: accuracy improved from 0.74814 to 0.92354, saving model to models/imdb.002-0.9235.hdf5
Epoch 3/10
Epoch 00003: accuracy improved from 0.92354 to 0.96189, saving model to models/imdb.003-0.9619.hdf5
Epoch 4/10
Epoch 00004: accuracy improved from 0.96189 to 0.97837, saving model to models/imdb.004-0.9784.hdf5
Epoch 5/10
Epoch 00005: accuracy improved from 0.97837 to 0.98574, saving model to models/imdb.005-0.9857.hdf5
Epoch 6/10
Epoch 00006: accuracy improved from 0.98574 to 0.99074, saving model to models/imdb.006-0.9907.hdf5
Epoch 7/10
Epoch 00007: accuracy did not improve from 0.99074
Epoch 8/10
Epoch 00008: accuracy improved from 0.99074 to 0.99123, saving model to models/imdb.008-0.9912.hdf5
Epoch 9/10
Epoch 00009: accuracy improved from 0.99123 to 0.99314, saving model to models/imdb.009-0.9931.hdf5
Epoch 10/10
Epoch 00010: accuracy did not improve from 0.99314


<tensorflow.python.keras.callbacks.History at 0x7faed080e350>

### Test

In [99]:
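# Threshold the sigmoid outputs at 0.5 to obtain 0/1 predictions
y_pred = (model.predict(x_test, batch_size = 128) > 0.5).astype("int32")

# Count the predictions that match the true labels
true = 0
for i, label in enumerate(y_test):
    if label == y_pred[i]:
        true += 1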

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 12888
Wrong Prediction: 2112
Accuracy: 85.92
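
Accuracy alone does not show which kind of error dominates. As a hedged sketch, the counts can be split into true/false positives and negatives in plain Python; `confusion_counts` and the toy labels are hypothetical, and only the last line uses numbers reported above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive and 0 = negative."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Toy labels for illustration only
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
print(confusion_counts(labels, preds))  # (2, 1, 2, 1)

# Accuracy recomputed from the counts reported above
reported_accuracy = 12888 / (12888 + 2112) * 100
print(round(reported_accuracy, 2))  # 85.92
```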


### Load model

In [78]:
# Load the checkpoint with the highest training accuracy (epoch 9)
loaded_model = load_model('models/imdb.009-0.9931.hdf5')

In [84]:
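def predict(model, review):
  """Clean a raw review string, encode it with the fitted tokenizer and return the model's sigmoid output (estimated probability of a positive review)."""
  # Remove every character that is not a letter or whitespace
  regex = re.compile(r'[^a-zA-Z\s]')
  review = regex.sub('', review)

  # Drop stopwords, then lowercase the remaining text
  words = review.split(' ')
  filtered = [w for w in words if w not in english_stops]
  filtered = ' '.join(filtered)
  filtered = [filtered.lower()]

  # Encode and pad with the same tokenizer and max_length used for training
  tokenize_words = token.texts_to_sequences(filtered)
  tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
  return model.predict(tokenize_words)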

In [91]:
test_review = 'This film is boring and ia hate it, I fell a sleep during the movie.'

In [92]:
result = predict(loaded_model, test_review)
print('Result: {}'.format(result))

Result: [[0.81380796]]
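
The instructions ask for `predict` to return 1 or 0, while the function above returns the raw sigmoid output. A minimal sketch of the missing step, assuming the same 0.5 threshold used in the Test section; `to_label` is a hypothetical helper name.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 1 (positive review) or 0 (negative review)."""
    return int(probability > threshold)

print(to_label(0.81380796))  # 1
print(to_label(0.12))        # 0
```

Applied to the nested array returned above, this would be called as `to_label(result[0][0])`; for the sample review it yields 1, that is, the model labels it as positive.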
