### Objective

This experiment uses a multilingual model from TensorFlow Hub as the embedding. The central idea is simply to put a very simple network on top of that embedding.

In theory, this should perform better than the earlier attempts.

In the previous experiments, the RNN reached 33% validation accuracy on the 5 classes. That is terrible, considering that random guessing would get 20%. The LSTM did even worse, at exactly 20% validation accuracy, which means it failed to generalize.
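
The 20% chance level can be checked with a quick simulation. This is a minimal sketch that assumes perfectly balanced classes and uniform random guessing; `chance_accuracy` is a hypothetical helper, not part of the notebook. The printed value should land close to 0.20.

```python
import random

def chance_accuracy(num_classes, num_samples, seed=0):
    """Accuracy of uniform random guessing on balanced labels."""
    rng = random.Random(seed)
    labels = [i % num_classes for i in range(num_samples)]  # balanced classes
    guesses = [rng.randrange(num_classes) for _ in range(num_samples)]
    hits = sum(g == y for g, y in zip(guesses, labels))
    return hits / num_samples

acc = chance_accuracy(num_classes=5, num_samples=100_000)
print(f"chance accuracy: {acc:.3f}")
```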

Here we will see whether an existing embedding lets the model generalize better on this data.

In [1]:
import nltk
import numpy as np
import pandas as pd
import sklearn.model_selection  # a bare `import sklearn` does not load the submodule
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # unused directly; registers the ops the multilingual encoder needs
from IPython.display import clear_output
from tensorflow.keras.callbacks import EarlyStopping
from tqdm import tqdm

In [2]:
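# Standalone copy of the encoder for ad-hoc calls (not used by the Keras model below).
embed_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
# The same encoder wrapped as a Keras layer; hub.KerasLayer is frozen by default (trainable=False).
embed_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3", input_shape=[], dtype=tf.string)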

In [3]:
def equalize_samples(reader, data, samples_max_size, data_size):
    """Top up `data` from `reader` until each rating class holds `samples_max_size` rows."""
    for chunk in reader:
        # Stop once every class is full.
        if data["overall"].count() == data_size:
            break
        for rating in range(1, 6):
            # Rows of this chunk that belong to the current rating class.
            aux = chunk.loc[chunk["overall"] == rating, data.columns]
            # Counting with a mask avoids a KeyError when the class is still empty.
            curr_class_size = (data["overall"] == rating).sum()
            if curr_class_size + aux.shape[0] < samples_max_size:
                # Add the whole batch.
                data = pd.concat([data, aux], axis=0)
            elif curr_class_size < samples_max_size:
                # Add only as many rows as are needed to fill the class.
                offset = curr_class_size + aux.shape[0] - samples_max_size
                data = pd.concat([data, aux[offset:]], axis=0)
            clear_output(wait=True)
            print(data["overall"].value_counts())
    return data

def initialize_data(chunksize, proportion_train, exclude_classes=(), padding=40):
    # `padding` is only used by the commented-out tokenization step below.
    data_size = 5 * chunksize  # 5 rating classes, `chunksize` rows each

    reader = pd.read_json("./dataset/Video_Games.json", chunksize=chunksize, lines=True)

    # Seed the dataset with the first chunk; the reader keeps its position.
    data = next(iter(reader))

    samples_max_size = chunksize

    data = equalize_samples(reader, data, samples_max_size, data_size)

    for i in exclude_classes:
        data = data[data.overall != i]

    # Reassign instead of dropping in place on a filtered frame.
    data = data.dropna(subset=["reviewText"])
    clear_output(wait=True)
    print(data["overall"].value_counts())
    #data["reviewText"] = data["reviewText"].apply(lambda x: (nltk.wordpunct_tokenize(x) + ["<PAD>"] * padding)[:padding])
    x = data["reviewText"].to_numpy()
    print(x[:5])
    # One-hot encode the ratings as floats for the loss function.
    y = pd.get_dummies(data["overall"], dtype="float32").to_numpy()
    print(y[:5])
    return sklearn.model_selection.train_test_split(x, y, train_size=proportion_train) + [data]
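
The balancing idea behind `equalize_samples` can be shown without pandas. The sketch below is a simplified stand-in, not the notebook's implementation: it reads a stream of `(rating, text)` pairs and keeps the first `max_per_class` records of each rating, stopping once every class is full. The name `cap_per_class` and the toy `stream` are hypothetical.

```python
from collections import Counter

def cap_per_class(records, max_per_class, classes=range(1, 6)):
    """Keep at most `max_per_class` records per rating from a stream.

    `records` is any iterable of (rating, text) pairs (an assumed format).
    """
    counts = Counter()
    kept = []
    target = max_per_class * len(classes)
    for rating, text in records:
        if len(kept) == target:
            break  # every class is full, stop reading
        if rating in classes and counts[rating] < max_per_class:
            counts[rating] += 1
            kept.append((rating, text))
    return kept, counts

# Toy stream: three 5-star reviews, two 1-star reviews, one 3-star review.
stream = [(5, "great"), (5, "love it"), (1, "broken"),
          (5, "fun"), (3, "ok"), (1, "bad")]
kept, counts = cap_per_class(stream, max_per_class=2)
print(counts)  # the third 5-star review is dropped
```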

In [4]:
samples_size=5000
xtrain, xtest, ytrain, ytest, data = initialize_data(samples_size,.9, exclude_classes=[2,3,4])

5    5000
1    4995
Name: overall, dtype: int64
['I used to play this game years ago and loved it. I found this did not work on my computer even though it said it would work with Windows 7.'
 'The product description should state this clearly. The CD, the box, and the product description suggest that the game is compatible with all Macs. It is not.'
 'Choose your career which sets your money for the trip.  Then name how many and who will be traveling with you.  Before you leave town, you must go into town  choose wagons or Conestoga, animals and many supplies -watch your cash and your wagon weight!  On your journey you can talk with different people to make decisions about your next moves.  You also get to hunt, fish, & gather..Be careful of disease & rivers!'
 'It took a few hours to get this up and running on Windows 8 computer and Windows XP.  If you get an error go and download their patch.\n\n[...]\n\nJust the patch alone worked like a charm on Windows XP.  For Windows 8 I downloa

In [5]:
class LogSoftmax(tf.keras.layers.Softmax):
    """Softmax layer that returns log-probabilities."""
    def __init__(self):
        super().__init__()

    def call(self, inputs):
        # tf.nn.log_softmax is numerically safer than log(softmax(x)).
        return tf.nn.log_softmax(inputs, axis=self.axis)
        
    
def my_model(output=2, embedding_dim=128, epochs=50, patience=3):
    # Reads the global xtrain/ytrain/xtest/ytest; `embedding_dim` is unused.
    symbolic_input = tf.keras.layers.Input(shape=[], dtype=tf.string)
    embedding = embed_layer(symbolic_input)
    # No activations: the head is a stack of linear layers.
    #dense_n = tf.keras.layers.Dense(128)(embedding)
    dense_n = tf.keras.layers.Dense(64)(embedding)
    dense_n = tf.keras.layers.Dense(32)(dense_n)
    dense_1 = tf.keras.layers.Dense(output)(dense_n)
    pred = LogSoftmax()(dense_1)
    model = tf.keras.models.Model(inputs=[symbolic_input], outputs=pred)
    model.summary()  # summary() prints by itself and returns None

    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

    # No trailing comma after fit(): it would wrap the History object in a tuple.
    history = model.fit(
        x=xtrain,
        y=ytrain,
        epochs=epochs,
        batch_size=64,
        validation_data=(xtest, ytest),
        verbose=True,
        callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=patience)]
    )

    # Report the last epochs that actually ran; early stopping may end training before `epochs`.
    hist = history.history
    epochs_run = len(hist["loss"])
    for i in range(1, min(5, epochs_run) + 1):
        print("Epoch:{0}, Accuracy:{1}, Val_Accuracy:{2}, Loss:{3}".format(
            epochs_run - i + 1, hist["accuracy"][-i], hist["val_accuracy"][-i], hist["loss"][-i]))
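
The model ends in a log-softmax layer while the loss is configured with `from_logits=True`, which looks like a double normalization. It is harmless: log-softmax only subtracts a constant from every logit, and softmax ignores constant shifts. The plain-Python sketch below illustrates this on a made-up logit vector; it is not how Keras computes the loss internally, and all function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))  # log-sum-exp
    return [v - lse for v in z]

def cross_entropy_from_logits(one_hot, z):
    """Cross-entropy where `z` is treated as unnormalized logits."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(z)))

logits = [2.0, -1.0, 0.5]  # made-up example values
target = [1, 0, 0]

loss_raw = cross_entropy_from_logits(target, logits)
loss_log = cross_entropy_from_logits(target, log_softmax(logits))
print(loss_raw, loss_log)  # the two losses agree up to rounding
```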

In [6]:
my_model(output=2)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None,)]                 0         
_________________________________________________________________
keras_layer (KerasLayer)     (None, 512)               68927232  
_________________________________________________________________
dense (Dense)                (None, 64)                32832     
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
_________________________________________________________________
log_softmax (LogSoftmax)     (None, 2)                 0         
Total params: 68,962,210
Trainable params: 34,978
Non-trainable params: 68,927,232
____________________________________________
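
The parameter counts in the summary can be reproduced by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short check below uses the layer sizes from the summary; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [512, 64, 32, 2]  # encoder output, then the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
trainable = sum(per_layer)
print(per_layer, trainable)  # [32832, 2080, 66] 34978
```

Only these 34,978 parameters are trained; the 68,927,232 encoder parameters stay frozen.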

In [7]:
samples_size=3000
xtrain, xtest, ytrain, ytest, data = initialize_data(samples_size,.9, exclude_classes=[])
my_model(output=5)

4    3000
5    3000
2    3000
3    2999
1    2996
Name: overall, dtype: int64
['I used to play this game years ago and loved it. I found this did not work on my computer even though it said it would work with Windows 7.'
 'The game itself worked great but the story line videos would never play, the sound was fine but the picture would freeze and go black every time.'
 "I had to learn the hard way after ordering this for my MacBook Pro that this doesn't work unless you have MAC OS version 10.3 or less. I found that out after contact the Learning Company directly. They were very prompt in their response. However, I also have a laptop with Microsoft 7. This program loaded beautifully with the Microsoft base. So, if you have Microsoft 7 or 8, purchase and enjoy this game. Any mac systems will likely have issues."
 'The product description should state this clearly. The CD, the box, and the product description suggest that the game is compatible with all Macs. It is not.'
 'I would recommen

#### Using a pretrained embedding

There was no need for an LSTM or an RNN at all. The model performed far better: 93% accuracy on the binary problem and 54% accuracy on the 5-class problem.

The OOM above happened because I had a ton of other things open while I was working on something else at the same time.