## Lesson 1 - NLP

In this exercise we perform all the preprocessing needed for text sequences to be consumed by neural networks.

The task we explore is sentiment classification, using a dataset of restaurant reviews (YELP), product reviews (Amazon), and movie reviews (IMDB) [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences).

Our task is to analyze each review and classify it as "positive" or "negative".

First, let's explore the dataset:

In [1]:
import tensorflow as tf
tf.__version__  # You may need to install TensorFlow 2 before running this notebook

'2.1.0'

Our dataset has 3 columns:

- sentence: the review text
- label: 1 for positive text, 0 for negative
- source: yelp, amazon, or imdb


In [93]:
import pandas as pd
filepath_dict = {'yelp':   'data/sentiment/yelp_labelled.txt',
                 'amazon': 'data/sentiment/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
df = df.sample(frac=1).reset_index(drop=True)
df.head()


| | sentence | label | source |
|---|---|---|---|
| 0 | Same problem as others have mentioned. | 0 | amazon |
| 1 | The directing seems too pretentious. | 0 | imdb |
| 2 | I ate there twice on my last visit, and especi... | 1 | yelp |
| 3 | Pros:-Good camera - very nice pictures , also ... | 1 | amazon |
| 4 | not even a "hello, we will be right with you." | 0 | yelp |


Next, we split the dataset so that 15% of it is reserved for testing.

In [94]:
perc_train = 0.85
len_train = int(len(df)*perc_train)

dataset_train = df.iloc[0:len_train, :-1]
dataset_test = df.iloc[len_train:, :-1]

print(len(dataset_train))
print(len(dataset_test))

dataset_train.head()

2335
413


| | sentence | label |
|---|---|---|
| 0 | Same problem as others have mentioned. | 0 |
| 1 | The directing seems too pretentious. | 0 |
| 2 | I ate there twice on my last visit, and especi... | 1 |
| 3 | Pros:-Good camera - very nice pictures , also ... | 1 |
| 4 | not even a "hello, we will be right with you." | 0 |
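The split above slices a DataFrame that was shuffled with an unseeded `df.sample(frac=1)`, so the row counts are stable but the rows in each part change between runs. A minimal sketch of a reproducible variant on a toy DataFrame follows; the helper name `shuffle_and_split`, the seed value, and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the review dataset (hypothetical content).
toy = pd.DataFrame({
    "sentence": [f"review {i}" for i in range(20)],
    "label": [i % 2 for i in range(20)],
})

def shuffle_and_split(df, perc_train=0.85, seed=42):
    """Shuffle with a fixed seed, then slice into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    len_train = int(len(shuffled) * perc_train)
    # .copy() yields independent frames, so later column assignments
    # do not act on a view of the shuffled frame.
    train = shuffled.iloc[:len_train].copy()
    test = shuffled.iloc[len_train:].copy()
    return train, test

train, test = shuffle_and_split(toy)
print(len(train), len(test))
```

Calling `shuffle_and_split` twice with the same seed returns the same rows, which makes later comparisons between model variants easier to interpret.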


In [95]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [96]:
stopWords = set(stopwords.words('english'))

Now that the dataset is organized, we need to convert the text into a form that a neural network can read.

The first step is to build the vocabulary from the training set with the [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class.

This class performs several useful preprocessing routines, including:

- Removing punctuation.
- Limiting the vocabulary size through the `num_words` parameter, which discards uncommon words.
- Normalizing capitalization with `lower=True`.

Before using the class, however, we remove stopwords from the text.

Stopwords are words that serve a mostly syntactic role, that is, they usually contribute little to classifying the "sentiment" of a sentence (read more about stopwords [here](https://demacdolincoln.github.io/anotacoes-nlp/posts/pre-processamento-de-textos/#id2)).

In [97]:
# A manual stopword list is kept below (commented out) for reference; this notebook uses the NLTK list loaded into stopWords above
#stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]


# Add your code to remove all stopwords from every training example
# Note: the membership test is case-sensitive, so capitalized tokens such as "The" are not matched
dataset_train.loc[:,'sentence'] = dataset_train.loc[:,'sentence'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopWords]))
dataset_train.head()

| | sentence | label |
|---|---|---|
| 0 | Same problem others mentioned. | 0 |
| 1 | The directing seems pretentious. | 0 |
| 2 | I ate twice last visit, especially enjoyed sal... | 1 |
| 3 | Pros:-Good camera - nice pictures , also cool ... | 1 |
| 4 | even "hello, right you." | 0 |
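The output above still contains "The" and "I": the NLTK list is lowercase and the filter compares tokens exactly as written. A minimal sketch of a case-insensitive variant follows. It assumes a small hand-written stopword set in place of the NLTK list, and `remove_stopwords` is a hypothetical helper introduced only for illustration.

```python
import string

# Small illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOPWORDS = {"i", "the", "too", "there", "on", "my", "and", "as", "have"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a stopword."""
    kept = []
    for token in sentence.split():
        normalized = token.lower().strip(string.punctuation)
        if normalized not in stopwords:
            kept.append(token)  # keep the original token untouched
    return " ".join(kept)

print(remove_stopwords("The directing seems too pretentious."))
# directing seems pretentious.
```

Whichever variant is used, applying the same function to the training and the test sentences keeps the two sets consistent.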


Now we can build the vocabulary and encode the sentences.

In [98]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_vocab_size = 500   # Maximum vocabulary size
oov_token = '<OOV>'   # Token used when a word is not found in the vocabulary

tokenizer = Tokenizer(num_words=max_vocab_size, lower=True, oov_token = oov_token)
tokenizer.fit_on_texts(dataset_train.loc[:, 'sentence'])

The `word_index` attribute exposes the generated vocabulary, ordered from the most to the least frequent word. It lists every word seen during fitting; the `num_words` limit is applied only when texts are converted to sequences.

Next, we encode the training and test sets.

In [99]:
vocab_size = len(tokenizer.word_index)
tokenizer.word_index

{'<OOV>': 1,
 'i': 2,
 'the': 3,
 'it': 4,
 'good': 5,
 'great': 6,
 'this': 7,
 'movie': 8,
 'phone': 9,
 'film': 10,
 'one': 11,
 '0': 12,
 'food': 13,
 'like': 14,
 'place': 15,
 'service': 16,
 '1': 17,
 'time': 18,
 'bad': 19,
 'really': 20,
 'well': 21,
 'would': 22,
 'even': 23,
 'ever': 24,
 'best': 25,
 'quality': 26,
 'back': 27,
 'also': 28,
 'go': 29,
 'product': 30,
 'love': 31,
 "i've": 32,
 'work': 33,
 "it's": 34,
 'get': 35,
 'nice': 36,
 'made': 37,
 'not': 38,
 'never': 39,
 'recommend': 40,
 'works': 41,
 "i'm": 42,
 'very': 43,
 'much': 44,
 'first': 45,
 'all': 46,
 'sound': 47,
 'excellent': 48,
 'better': 49,
 'battery': 50,
 'way': 51,
 'if': 52,
 'pretty': 53,
 'could': 54,
 'headset': 55,
 'still': 56,
 'my': 57,
 'you': 58,
 'think': 59,
 'use': 60,
 'acting': 61,
 'see': 62,
 'and': 63,
 'make': 64,
 'got': 65,
 'a': 66,
 'but': 67,
 'worst': 68,
 '2': 69,
 '10': 70,
 'going': 71,
 'enough': 72,
 'we': 73,
 'everything': 74,
 'there': 75,
 'disappointed': 7

In [100]:
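dataset_train_sequences = tokenizer.texts_to_sequences(dataset_train.loc[:,'sentence'])
# Note: stopwords were removed from the training sentences only; the test sentences are encoded as-is
dataset_test_sequences = tokenizer.texts_to_sequences(dataset_test.loc[:,'sentence'])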
print(dataset_train_sequences[0:2])

[[1, 210, 1, 1], [3, 361, 396, 1]]
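Several tokens in the output above are encoded as 1, the index of `<OOV>`. With `num_words=500`, any word whose index is 500 or higher is replaced by the OOV token, which is the most likely reason here. The following pure-Python sketch mimics that behavior on a toy corpus. `build_word_index` and `encode` are hypothetical helpers written for illustration, not the Keras implementation, and they skip the punctuation filtering that `Tokenizer` performs.

```python
from collections import Counter

OOV_TOKEN = "<OOV>"

def build_word_index(texts):
    """Rank words by frequency; index 1 is reserved for OOV, 0 for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {OOV_TOKEN: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(texts, word_index, num_words):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov = word_index[OOV_TOKEN]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

corpus = ["good food good service", "bad food", "good phone"]
word_index = build_word_index(corpus)
print(word_index)
print(encode(["good bad movie"], word_index, num_words=4))
# [[2, 1, 1]]
```

In the example, "bad" is in the vocabulary but ranks beyond the `num_words` limit, and "movie" was never seen, so both map to 1.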


The last preprocessing step is to pad the sequences.

For this we use the [`pad_sequences`](https://keras.io/preprocessing/sequence/) function.

Its main arguments are:

- `maxlen`: length of the output sequences.
- `padding`: 'pre' to add zeros on the left, 'post' to add zeros on the right.
- `truncating`: 'pre' to drop words from the start of a sentence longer than `maxlen`, 'post' to drop them from the end.
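The effect of these arguments can be sketched in plain Python. `pad_one` is a hypothetical helper that handles a single sequence; it is not the Keras function, and it assumes `maxlen` is at least 1.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly maxlen items, truncating and padding as requested."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros on the left, 'post' on the right.
    return pad + seq if padding == "pre" else seq + pad

print(pad_one([1, 2, 3], maxlen=5, padding="pre"))              # [0, 0, 1, 2, 3]
print(pad_one([1, 2, 3], maxlen=5, padding="post"))             # [1, 2, 3, 0, 0]
print(pad_one([1, 2, 3, 4, 5, 6], maxlen=4, truncating="pre"))  # [3, 4, 5, 6]
print(pad_one([1, 2, 3, 4, 5, 6], maxlen=4, truncating="post")) # [1, 2, 3, 4]
```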

In [101]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 500  # Maximum sentence length (in tokens)
padding_type = 'post'
truncating_type = 'post'

dataset_train_sequences = pad_sequences(dataset_train_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
dataset_test_sequences = pad_sequences(dataset_test_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(len(dataset_train_sequences[0]))
print(len(dataset_train_sequences[1]))
print(dataset_train_sequences)

500
500
[[  1 210   1 ...   0   0   0]
 [  3 361 396 ...   0   0   0]
 [  2   1   1 ...   0   0   0]
 ...
 [  2 107 186 ...   0   0   0]
 [ 49 492   0 ...   0   0   0]
 [  1  19  10 ...   0   0   0]]


Now that the sentences are in a suitable format, we can train our model.

In [102]:
# Add your architecture here; the input has length maxlen and the output separates 2 classes (one sigmoid unit below)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D

embedding_dim = 16

In [109]:
model = Sequential()
# input_dim=512 covers every token index the tokenizer can emit (0 to max_vocab_size - 1 = 499)
model.add(Embedding(512, embedding_dim, input_length=maxlen))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [110]:
# Define your optimizer and your loss here
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_25"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, 500, 16)           8192      
_________________________________________________________________
conv1d_13 (Conv1D)           (None, 500, 32)           1568      
_________________________________________________________________
max_pooling1d_12 (MaxPooling (None, 250, 32)           0         
_________________________________________________________________
flatten_14 (Flatten)         (None, 8000)              0         
_________________________________________________________________
dense_24 (Dense)             (None, 128)               1024128   
_________________________________________________________________
dense_25 (Dense)             (None, 1)                 129       
Total params: 1,034,017
Trainable params: 1,034,017
Non-trainable params: 0
___________________________________________
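The parameter counts in the summary above can be checked by hand. The sketch below recomputes them from the layer settings used in the model; the variable names are chosen for illustration only.

```python
# Layer settings taken from the model definition above.
input_dim, embedding_dim, maxlen = 512, 16, 500
filters, kernel_size, pool_size, dense_units = 32, 3, 2, 128

# Embedding: one vector of size embedding_dim per token index.
embedding_params = input_dim * embedding_dim

# Conv1D: each filter has kernel_size * in_channels weights plus one bias.
conv_params = kernel_size * embedding_dim * filters + filters

# MaxPooling1D halves the sequence length; Flatten has no parameters.
flat_size = (maxlen // pool_size) * filters

# Dense layers: weights plus one bias per unit.
dense_hidden_params = flat_size * dense_units + dense_units
dense_out_params = dense_units * 1 + 1

total = embedding_params + conv_params + dense_hidden_params + dense_out_params
print(embedding_params, conv_params, flat_size, dense_hidden_params, dense_out_params)
# 8192 1568 8000 1024128 129
print(total)
# 1034017
```

Almost all parameters sit in the first dense layer, whose size grows linearly with `maxlen`.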

In [111]:
num_epochs = 20

train_seqs = dataset_train_sequences
train_labels = np.array(dataset_train.loc[:, 'label'])
test_seqs = dataset_test_sequences
test_labels = np.array(dataset_test.loc[:, 'label'])

print(len(train_seqs))
print(len(train_labels))
print(len(test_seqs))
print(len(test_labels))

model.fit(train_seqs,train_labels, epochs = num_epochs, validation_data=(test_seqs,test_labels), verbose=2 )

2335
2335
413
413
Train on 2335 samples, validate on 413 samples
Epoch 1/20
2335/2335 - 4s - loss: 0.6936 - accuracy: 0.5255 - val_loss: 0.6897 - val_accuracy: 0.6271
Epoch 2/20
2335/2335 - 3s - loss: 0.6122 - accuracy: 0.6809 - val_loss: 0.5367 - val_accuracy: 0.7288
Epoch 3/20
2335/2335 - 3s - loss: 0.4241 - accuracy: 0.8051 - val_loss: 0.5874 - val_accuracy: 0.7240
Epoch 4/20
2335/2335 - 3s - loss: 0.3575 - accuracy: 0.8385 - val_loss: 0.5822 - val_accuracy: 0.7312
Epoch 5/20
2335/2335 - 3s - loss: 0.3319 - accuracy: 0.8505 - val_loss: 0.5050 - val_accuracy: 0.7724
Epoch 6/20
2335/2335 - 3s - loss: 0.3046 - accuracy: 0.8587 - val_loss: 0.5214 - val_accuracy: 0.7603
Epoch 7/20
2335/2335 - 3s - loss: 0.2840 - accuracy: 0.8754 - val_loss: 0.6331 - val_accuracy: 0.7385
Epoch 8/20
2335/2335 - 3s - loss: 0.2679 - accuracy: 0.8835 - val_loss: 0.6149 - val_accuracy: 0.7506
Epoch 9/20
2335/2335 - 3s - loss: 0.2537 - accuracy: 0.8887 - val_loss: 0.6123 - val_accuracy: 0.7482
Epoch 10/20
2335/

<tensorflow.python.keras.callbacks.History at 0x2164d72b208>
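In the visible part of the log above (epochs 1 to 9; the rest is cut off), the training loss keeps falling while the validation loss stops improving, which suggests overfitting. A minimal sketch that locates the best epoch from the logged values follows; the lists are copied from the log and the variable names are illustrative.

```python
# Values copied from the visible part of the training log (epochs 1-9).
train_loss = [0.6936, 0.6122, 0.4241, 0.3575, 0.3319, 0.3046, 0.2840, 0.2679, 0.2537]
val_loss = [0.6897, 0.5367, 0.5874, 0.5822, 0.5050, 0.5214, 0.6331, 0.6149, 0.6123]

# Epoch numbers are 1-based, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])
# 5 0.505

# The training loss decreases at every logged epoch.
print(all(a > b for a, b in zip(train_loss, train_loss[1:])))
# True
```

Keras offers the `tf.keras.callbacks.EarlyStopping` callback (for example with `monitor='val_loss'`) to stop training automatically in this situation; whether it helps here would need to be tested.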

Let's check whether the classifications make sense.

In [112]:
or_test_sentences = ['very good movie', 'terrible taste', 'worst product ever']
# Encode the sentences
test_sentences = tokenizer.texts_to_sequences(or_test_sentences)
test_sentences = pad_sequences(test_sentences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(test_sentences)


[[ 43   5   8 ...   0   0   0]
 [ 82 266   0 ...   0   0   0]
 [ 68  30  24 ...   0   0   0]]


In [113]:
predictions = model.predict(test_sentences)
print(or_test_sentences)
print(predictions > 0.5)

['very good movie', 'terrible taste', 'worst product ever']
[[ True]
 [False]
 [False]]


Evaluate how the embedding dimension, the padding type, the vocabulary size, the maximum sentence length, etc. affect the quality of the model.
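One way to organize this evaluation is to enumerate the combinations to try before training anything. The sketch below only builds the list of configurations. The candidate values and the helper `iter_configs` are assumptions chosen for illustration, and the training and scoring of each configuration are left out.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter named above.
search_space = {
    "embedding_dim": [8, 16, 32],
    "padding_type": ["pre", "post"],
    "max_vocab_size": [500, 2000],
    "maxlen": [50, 500],
}

def iter_configs(space):
    """Yield one dict per combination of candidate values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))
# 24
print(configs[0])
# {'embedding_dim': 8, 'padding_type': 'pre', 'max_vocab_size': 500, 'maxlen': 50}
```

Each configuration requires rebuilding the tokenizer, the padded sequences, and the model, since `max_vocab_size` and `maxlen` change the shape of the inputs.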
