### Objective

This notebook is a summary of the previous one. Its goal is to make everything more concise.

This notebook performs sentiment analysis on the [Amazon](http://deepyeti.ucsd.edu/jianmo/amazon/index.html) review dataset. It is the first in a series of 10 NLP projects.

Sentiment analysis is a hard problem in general, but within NLP it is one of the simpler problems to solve.
In this project, I intend to:

    1 - Analyze the data.
    2 - Classify reviews as good or bad.
        2.1 - Apply a model using a recurrent neural network
        2.2 - Apply a model using LSTM
    

In [1]:
import nltk
import pandas as pd
import tensorflow as tf
from tqdm import tqdm
from IPython.display import clear_output
import numpy as np
import sklearn
from tensorflow.keras.callbacks import EarlyStopping

In [2]:
def equalize_samples(reader, data, samples_max_size, data_size):
    for _ in reader:
        validate = data["overall"].value_counts().sum()
        if validate == data_size:
            break
        for i in range(1,6):
            # Rows of the current chunk whose rating equals i
            aux = _[_["overall"] == i][data.columns]
            # .get avoids a KeyError when class i is not present in data yet
            curr_class_size = data["overall"].value_counts().get(i, 0)
            if curr_class_size + aux.shape[0] < samples_max_size:
                # add every row of this class from the chunk
                data = pd.concat([data, aux], axis = 0)
            elif curr_class_size < samples_max_size:
                # add only as many rows as needed to reach samples_max_size
                offset = curr_class_size + aux.shape[0] - samples_max_size
                data = pd.concat([data, aux[offset:]], axis = 0)
            else:
                clear_output(wait=True)
                print(data["overall"].value_counts())
                continue
            clear_output(wait=True)
            print(data["overall"].value_counts())
    return data

def process_sentence(sentence, padding=30):
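    # Build the stopword set once instead of reloading the list for every word.
    stopwords = set(nltk.corpus.stopwords.words("english"))
    stemer = nltk.stem.PorterStemmer()
    # Note: sentence[:padding] keeps the first `padding` characters (not tokens) before tokenizing.
    processed = nltk.word_tokenize(sentence[:padding])
    processed = [stemer.stem(word) for word in processed if word not in stopwords] + padding * ["<PAD>"]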
    return processed[:padding]

def make_vocabulary(corpus, padding=30, max_vocab_size=2000):
    vocabulary = {'<PAD>':0, '<UNK>':1}
    rvocabulary = {0:'<PAD>', 1:'<UNK>'}
    fvocabulary = {'<PAD>':0, '<UNK>':0}
    index = 2
    for sentence in tqdm(corpus):
        processed = process_sentence(sentence, padding=padding)
        for word in processed:
            if word not in vocabulary.keys():
                vocabulary[word] = index
                rvocabulary[index] = word
                fvocabulary[word] = 1
                index += 1
            else:
                fvocabulary[word] += 1
    fvocabulary = dict(sorted(fvocabulary.items(), key=lambda item: item[1], reverse=True))
    words_by_freq = list(fvocabulary.keys())[:max_vocab_size]
    index = 2
    aux = {'<PAD>':0, '<UNK>':1}
    for word in vocabulary.keys():
        if word in words_by_freq and word not in ['<PAD>', '<UNK>']:
            aux[word] = index
            index+=1
    vocabulary = aux
    return rvocabulary, vocabulary, fvocabulary

def tokenize(sentence, vocabulary, padding = 30):
    sentence = process_sentence(sentence, padding=padding)
    return [vocabulary[word] if word in vocabulary else vocabulary["<UNK>"] for word in sentence]

def detokenize(sentence, rvocabulary, padding=30):
    return [rvocabulary[token] for token in sentence]

class LogSoftmax(tf.keras.layers.Softmax):
    def __init__(self):
        super(LogSoftmax, self).__init__()
        
    def call(self, inputs):
        return tf.math.log(super(LogSoftmax, self).call(inputs))

#### The functions above

The functions above prepare the data. They are used below, before the model is run.
Sentences are processed by process_sentence, which drops words that are irrelevant to sentiment analysis (stopwords) and applies stemming so that only the approximate root of each word is kept.
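
The preprocessing idea described above can be illustrated without NLTK. The sketch below is not the notebook's code: `simple_process`, `build_vocabulary`, `encode`, and the tiny `STOPWORDS` set are hypothetical names introduced here, stemming is omitted, and a regular expression stands in for `nltk.word_tokenize`. It also truncates by tokens, whereas `process_sentence` slices the raw string to `padding` characters before tokenizing.

```python
import re

# Hypothetical, tiny stopword list used only for this illustration;
# the notebook itself uses nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "is", "it", "and", "this", "was"}

def simple_process(sentence, padding=8):
    """Lowercase, tokenize, drop stopwords, then pad/truncate to `padding` tokens."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Pad on the right, then cut to exactly `padding` tokens.
    return (kept + ["<PAD>"] * padding)[:padding]

def build_vocabulary(corpus, padding=8):
    """Map each token to an integer id; ids 0 and 1 are reserved."""
    vocabulary = {"<PAD>": 0, "<UNK>": 1}
    for sentence in corpus:
        for token in simple_process(sentence, padding):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def encode(sentence, vocabulary, padding=8):
    """Convert a sentence to a fixed-length list of ids; unseen tokens map to <UNK>."""
    unk = vocabulary["<UNK>"]
    return [vocabulary.get(t, unk) for t in simple_process(sentence, padding)]

corpus = ["This game is great", "The controller was broken"]
vocab = build_vocabulary(corpus, padding=4)
print(encode("great controller, terrible game", vocab, padding=4))  # → [3, 4, 1, 2]
```

The word "terrible" never appears in the toy corpus, so it is encoded as `<UNK>` (id 1), which mirrors what `tokenize` does for out-of-vocabulary words.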

#### Next, initialize_data is the function that prepares the data from a few parameters

This function returns the data split into training and test sets. It is also where the data preparation happens: the number of samples per class is equalized, and any classes you want to leave out are removed.
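
As a rough illustration of the two preparation steps (capping every class at the same size and dropping unwanted classes), here is a minimal sketch in plain Python so that it runs without pandas or the dataset. The function `balance_classes` and the toy rows are hypothetical; the notebook does the equivalent work with `equalize_samples` and a DataFrame filter.

```python
from collections import Counter

def balance_classes(rows, label_key, max_per_class, exclude=()):
    """Keep at most `max_per_class` rows per label, skipping excluded labels."""
    counts = Counter()
    kept = []
    for row in rows:
        label = row[label_key]
        if label in exclude or counts[label] >= max_per_class:
            continue  # class is excluded or already full
        counts[label] += 1
        kept.append(row)
    return kept

# Toy reviews standing in for the real dataset (hypothetical data).
ratings = [5, 5, 5, 1, 3, 1, 5, 1]
rows = [{"overall": r, "reviewText": f"review {i}"} for i, r in enumerate(ratings)]

balanced = balance_classes(rows, "overall", max_per_class=2, exclude={3})
print(Counter(row["overall"] for row in balanced))
```

With `max_per_class=2` and class 3 excluded, the result holds two reviews rated 5 and two rated 1, which is the same shape of outcome as calling `initialize_data` with `exclude_classes=[2, 3, 4]`.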

In [3]:
def initialize_data(chunksize, vocab_size, padding, proportion_train, exclude_classes = []):
    data_size = 5*chunksize

    reader = pd.read_json("./dataset/Video_Games.json", chunksize=chunksize, lines=True)

    data = []
    for _ in reader:
        data.append(_)
        break
    data = data[0]

    samples_max_size = chunksize

    data = equalize_samples(reader,data,samples_max_size,data_size)

    for i in exclude_classes:
        data = data[data.overall != i]
    
    data.dropna(subset=["reviewText"],  axis=0, inplace=True)
    clear_output(wait=True)
    print(data["overall"].value_counts())

    vocab_size=vocab_size
    padding=padding
    rvocabulary, vocabulary, fvocabulary = make_vocabulary(data["reviewText"].to_numpy(), padding=padding, max_vocab_size=vocab_size)
    x = data["reviewText"].to_numpy()
    x = np.array([tokenize(text, vocabulary, padding=padding) for text in tqdm(x)])
    print(x[:5])
    y = pd.get_dummies(data["overall"].to_numpy()).to_numpy() # One-hot encode the ratings (values in {0, 1})
    print(y[:5])
    return sklearn.model_selection.train_test_split(x, y, train_size=proportion_train)

In [4]:
vocab_size=8000
samples_size=5000
padding=50
xtrain, xtest, ytrain, ytest = initialize_data(samples_size, vocab_size, padding, .9, [2,3,4])

5    5000
1    4995
Name: overall, dtype: int64


100%|█████████████████████████████████████████████████████████████████████████████| 9995/9995 [00:22<00:00, 444.98it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 9995/9995 [00:23<00:00, 432.13it/s]

[[ 2  3  4  5  6  7  8  9  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [10 11 12 13 14  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [15 16 17 18  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [19 20 21 22 23  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 2 24 25 26  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]]
[[1 0]
 [1 0]
 [0 1]
 [0 1]
 [0 1]]





#### My Model

This function defines a single model pattern that every experiment shares. I did this because I am not testing the architecture, only the recurrent layer.
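
The shared pattern ends in a `LogSoftmax` layer, while the loss is created with `from_logits=True`, which means Keras applies a softmax to whatever the model outputs. The short derivation below is a sanity check rather than part of the original notebook; it suggests that the combination still yields the usual cross-entropy. The symbols are introduced only for this note: $z$ is the output of the `Dense` layer, $\ell$ the output of `LogSoftmax`, and $y$ the one-hot label.

$$
\ell_i = z_i - \log\sum_j e^{z_j}
\quad\Longrightarrow\quad
\operatorname{softmax}(\ell)_i
= \frac{e^{\ell_i}}{\sum_k e^{\ell_k}}
= \frac{e^{z_i} / \sum_j e^{z_j}}{1}
= \operatorname{softmax}(z)_i
$$

$$
\mathcal{L} = -\sum_i y_i \log \operatorname{softmax}(\ell)_i = -\sum_i y_i \log \operatorname{softmax}(z)_i
$$

The denominator is 1 because the exponentials of log-probabilities sum to one. Under this reading, the extra layer does not change the training loss; it only means that the model's raw outputs are log-probabilities instead of logits.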

At the end, I print the epoch (or at least the print order), the accuracy, and the loss.
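
When early stopping ends training before `epochs` is reached, the Keras `History.history` dictionary holds one entry per epoch that actually ran, so counting down from `epochs` labels positions rather than real epoch numbers. The sketch below shows one possible way to report the real numbers. The helper `last_epochs` and the `fake_history` values are hypothetical and only illustrate the indexing.

```python
def last_epochs(history, n=5):
    """Return (epoch, accuracy, val_accuracy, loss) for the last `n` epochs that ran."""
    total = len(history["loss"])  # epochs completed before training stopped
    rows = []
    for epoch in range(max(1, total - n + 1), total + 1):
        i = epoch - 1  # lists are 0-indexed, epochs are 1-indexed
        rows.append((epoch,
                     history["accuracy"][i],
                     history["val_accuracy"][i],
                     history["loss"][i]))
    return rows

# Hypothetical metrics for a run that early stopping ended after 3 epochs.
fake_history = {
    "accuracy": [0.60, 0.70, 0.75],
    "val_accuracy": [0.58, 0.66, 0.65],
    "loss": [0.68, 0.55, 0.50],
}

for epoch, acc, val_acc, loss in last_epochs(fake_history):
    print(f"Epoch:{epoch}, Accuracy:{acc}, Val_Accuracy:{val_acc}, Loss:{loss}")
```

Because the range is clamped with `max(1, ...)`, the helper also copes with runs shorter than `n` epochs instead of raising an `IndexError`.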

In [23]:
def my_model(name, output=2, embedding_dim=128,model_dim=128, epochs=50, patience=3):
    rlayer = None
    if name == "RNN":
        rlayer = tf.keras.layers.SimpleRNN(model_dim)
    if name == "LSTM":
        rlayer = tf.keras.layers.LSTM(model_dim)
    if name == "GRU":
        rlayer = tf.keras.layers.GRU(model_dim)
    rnn = tf.keras.models.Sequential(
        [
            # Reminder: vocab_size tokens plus 2 for <PAD> and <UNK>
            tf.keras.layers.Embedding(vocab_size + 2, embedding_dim, input_length=xtrain.shape[1]),
            rlayer,
            tf.keras.layers.Dense(output),
            LogSoftmax()
        ]
    )
    print(rnn.summary())

    rnn.compile(optimizer=tf.keras.optimizers.Adam(), 
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

    history = rnn.fit(
        x = xtrain,
        y = ytrain, 
        epochs=epochs, 
        batch_size=64,
        validation_data=(xtest,ytest),
        verbose=True,
        callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=patience)]
    )

    # history.history holds one list entry per epoch that actually ran
    for i in range(1,6):
        print("Epoca:{0}, Acurácia:{1}, Acurácia_Val:{2}, Perca:{3}".format(epochs-i+1,history.history["accuracy"][-i],history.history["val_accuracy"][-i], history.history["loss"][-i]))

In [8]:
my_model("RNN", embedding_dim = 512)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 512)           4097024   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 128)               82048     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258       
_________________________________________________________________
log_softmax_1 (LogSoftmax)   (None, 2)                 0         
Total params: 4,179,330
Trainable params: 4,179,330
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoca:50, Acurácia:0.8624791502952576, Acurácia_Val:0.7860000133514404, Perca:0.3510028123855591
Epoca:49, Acurácia:0.8667037487030029, Acurácia_Val:0.7919999957084656, Perca:0.35443

In [22]:
my_model("RNN",embedding_dim = 64, model_dim=256)

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 50, 64)            512128    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 50, 256)           82176     
_________________________________________________________________
lstm_9 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 514       
_________________________________________________________________
log_softmax_7 (LogSoftmax)   (None, 2)                 0         
Total params: 1,120,130
Trainable params: 1,120,130
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoca:50, Acurácia:0.7565314173698425, Acurácia_Va

In [48]:
my_model("GRU")

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_21 (Embedding)     (None, 50, 128)           1024256   
_________________________________________________________________
gru_2 (GRU)                  (None, 128)               99072     
_________________________________________________________________
dense_21 (Dense)             (None, 2)                 258       
_________________________________________________________________
log_softmax_18 (LogSoftmax)  (None, 2)                 0         
Total params: 1,123,586
Trainable params: 1,123,586
Non-trainable params: 0
_________________________________________________________________
None
Epoca:50, Acurácia:0.4950528144836426, Acurácia_Val:0.49000000953674316, Perca:0.6933028697967529
Epoca:49, Acurácia:0.4930517077445984, Acurácia_Val:0.49000000953674316, Perca:0.6933252215385437
Epoca:48, Acurácia:0.49749860167503357, 

#### Conclusion on binary classification
##### Considerations
The simplest model achieved the highest accuracy: 79%.
The model that uses LSTM reached 72% accuracy.
The model that uses GRU stalled and performed very poorly.

The following settings were considered:

```python
vocab_size=8000
samples_size=5000
padding=30
```


#### Now let's look at the results obtained with all 5 classes

In [9]:
vocab_size=3000
samples_size=3000
padding=40
xtrain, xtest, ytrain, ytest = initialize_data(samples_size, vocab_size, padding, .9, [])

4    3000
5    3000
2    3000
3    2999
1    2996
Name: overall, dtype: int64


100%|███████████████████████████████████████████████████████████████████████████| 14995/14995 [00:28<00:00, 517.44it/s]
100%|███████████████████████████████████████████████████████████████████████████| 14995/14995 [00:28<00:00, 517.17it/s]


[[ 2  3  4  5  6  7  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 9  5 10 11 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2 13 14 15  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 9 16 17 18 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2 20 21 13  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]
[[1 0 0 0 0]
 [0 0 1 0 0]
 [0 0 0 1 0]
 [1 0 0 0 0]
 [0 0 0 1 0]]


In [10]:
my_model("RNN", output=5, patience=5)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 40, 128)           384256    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 128)               32896     
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 645       
_________________________________________________________________
log_softmax_2 (LogSoftmax)   (None, 5)                 0         
Total params: 417,797
Trainable params: 417,797
Non-trainable params: 0
_________________________________________________________________
None
Epoca:50, Acurácia:0.7315301895141602, Acurácia_Val:0.3226666748523712, Perca:0.7267200350761414
Epoca:49, Acurácia:0.7199703454971313, Acurácia_Val:0.3580000102519989, Perca:0.7707535028457642
Epoca:48, Acurácia:0.7029269933700562, Acurácia

In [11]:
my_model("LSTM", output=5, patience=5)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 40, 128)           384256    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 645       
_________________________________________________________________
log_softmax_3 (LogSoftmax)   (None, 5)                 0         
Total params: 516,485
Trainable params: 516,485
Non-trainable params: 0
_________________________________________________________________
None
Epoca:50, Acurácia:0.19473879039287567, Acurácia_Val:0.21466666460037231, Perca:1.6100106239318848
Epoca:49, Acurácia:0.19881437718868256, Acurácia_Val:0.21466666460037231, Perca:1.6099098920822144
Epoca:48, Acurácia:0.19570210576057434, Acu

#### The other models perform much worse.

I was not able to tune or improve the results using GRU or LSTM. The best performer for this binary classification was the simple RNN.
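
One way to read the stalled runs is to compare them with a chance-level baseline. A classifier that always outputs a uniform distribution over k balanced classes has accuracy close to 1/k and cross-entropy ln k. The snippet below computes those reference values; `chance_baseline` is a hypothetical helper, and the comparison assumes the classes are roughly balanced, as the printed class counts suggest.

```python
import math

def chance_baseline(num_classes):
    """Accuracy and cross-entropy of a uniform guess over balanced classes."""
    return 1.0 / num_classes, math.log(num_classes)

for k in (2, 5):
    acc, loss = chance_baseline(k)
    print(f"{k} classes: accuracy ~ {acc:.3f}, loss ~ {loss:.4f}")
# → 2 classes: accuracy ~ 0.500, loss ~ 0.6931
# → 5 classes: accuracy ~ 0.200, loss ~ 1.6094
```

The GRU run on two classes (loss about 0.6933, accuracy about 0.49) and the LSTM run on five classes (loss about 1.610, accuracy about 0.19) both sit very close to these values, which is consistent with those models not having learned anything beyond a uniform guess.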
