### Text classification for sentiment analysis

Dataset

Instructions:
- The goal of this assignment is to build a binary machine learning model for text classification. To do so, use the [IMDb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset, which consists of positive and negative movie reviews in text form.
- Once trained, the model must expose a `predict` function that takes a string as its parameter and returns 1 or 0, where 1 means a positive review and 0 a negative review.
- Preprocessing may be implemented however you prefer (e.g. stopword removal, word embeddings, one-hot encoding, character encoding).
- A recurrent model (e.g. RNN, LSTM, GRU) is preferred for the classification step.
- Document the code (briefly explain what each function does and comment the relevant parts of the code).
- **Attention**: once the final model is trained, save it in your project directory and add a cell at the end of the notebook containing a function that loads this file, together with a call to `predict`.

Suggestions:
- Explore the dataset in the first cells of the notebook to get a better understanding of the problem, the distribution of the data, and so on.
- After building the classification pipeline, run a hyperparameter search and compare the results obtained under different settings.
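
The hyperparameter search suggested above can be organised as a plain grid. The sketch below is only an illustration: the grid values and the `evaluate` function are hypothetical placeholders. In the notebook, `evaluate` would train a model with the given configuration and return its validation accuracy.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not tuned results
grid = {
    "n_hidden": [128, 256],
    "n_layers": [1, 2],
    "lr": [1e-3, 1e-4],
}

def evaluate(config):
    """Placeholder: stands in for training with `config` and returning validation accuracy."""
    # Dummy score so the sketch runs without a model; replace it with a real training run
    return 1.0 / (1.0 + abs(config["n_hidden"] - 256) + abs(config["n_layers"] - 2))

# Expand the grid into one dict per configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# Score every configuration and keep the best one
results = [(evaluate(cfg), cfg) for cfg in configs]
best_score, best_config = max(results, key=lambda item: item[0])
print(len(configs), best_config)
```

Keeping each configuration as a dict makes it easy to log the results side by side and compare runs afterwards.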

Submission deadline:
- 01-08-2021 at 23:59 GMT-3

Preferred submission format:
- Post the link to the project on GitHub in the course's Ava portal (or attach the project directly in the Ava portal)

luann.porfirio@gmail.com

In [1]:
!pip install torchtext



In [37]:
import numpy as np
import pandas as pd
import torch  # needed for torch.is_tensor and torch.optim used below
import torch.nn.functional as F

from torchtext import datasets
from torch import nn
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader, random_split

In [3]:
train_iter, test_iter = datasets.IMDB()

aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:02<00:00, 29.7MB/s]


In [4]:
dataset_imdb = list(train_iter + test_iter)

In [5]:
df_imdb_raw = pd.DataFrame(data=dataset_imdb, columns=['sentiment', 'review'])

In [6]:
df_imdb_raw.shape

(50000, 2)

In [7]:
df_imdb_raw.head()

   sentiment                                             review
0        neg  I rented I AM CURIOUS-YELLOW from my video sto...
1        neg  "I Am Curious: Yellow" is a risible and preten...
2        neg  If only to avoid making this type of film in t...
3        neg  This film was probably inspired by Godard's Ma...
4        neg  Oh, brother...after hearing about this ridicul...


In [8]:
df_imdb_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  50000 non-null  object
 1   review     50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [9]:
df_imdb_raw.describe()

        sentiment                                             review
count       50000                                              50000
unique          2                                              49582
top           neg  Loved today's show!!! It was a variety and not...
freq        25000                                                  5


In [10]:
df_imdb_raw.nunique()

sentiment        2
review       49582
dtype: int64

In [12]:
df_imdb_raw.tail()

       sentiment                                             review
49995        pos  Just got around to seeing Monster Man yesterda...
49996        pos  I got this as part of a competition prize. I w...
49997        pos  I got Monster Man in a box set of three films ...
49998        pos  Five minutes in, i started to feel how naff th...
49999        pos  I caught this movie on the Sci-Fi channel rece...


In [13]:
df_imdb_raw[df_imdb_raw.duplicated()]

       sentiment                                             review
168          neg  I am not so much like Love Sick as I image. Fi...
664          neg  Holy freaking God all-freaking-mighty. This mo...
701          neg  The story and the show were good, but it was r...
3070         neg  I watched this movie when Joe Bob Briggs hoste...
3591         neg  I like Chris Rock, but I feel he is wasted in ...
...          ...                                                ...
49911        pos  I watched Pola X because Scott Walker composed...
49912        pos  Leos Carax has made 3 great movies: Boys Meet ...
49913        pos  Leos Carax is brilliant and is one of the best...
49914        pos  I've tried to reconcile why so many bad review...


In [14]:
df_imdb_raw.drop_duplicates(inplace=True)

In [15]:
df_imdb_raw.describe()

        sentiment                                             review
count       49582                                              49582
unique          2                                              49582
top           pos  Sometimes, things should just not be made. And...
freq        24884                                                  1
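
Beyond the class counts reported by `describe()`, it can help to look at review lengths per class before choosing a sequence length. The following is a minimal sketch on a toy DataFrame with the same column names as `df_imdb_raw`; the rows are invented for illustration only.

```python
import pandas as pd

# Toy frame with the same columns as df_imdb_raw; the rows are invented for illustration
df = pd.DataFrame({
    "sentiment": ["neg", "pos", "pos", "neg"],
    "review": ["bad", "a great film", "loved it", "not good at all"],
})

# Number of reviews per class
class_counts = df["sentiment"].value_counts()

# Review length in characters, summarised per class
df["n_chars"] = df["review"].str.len()
length_stats = df.groupby("sentiment")["n_chars"].agg(["mean", "max"])

print(class_counts)
print(length_stats)
```

On the real data, the same two calls would show whether the classes remain balanced after removing duplicates and how long typical reviews are.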


Preprocessing

In [16]:
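def one_hot_encode(arr, n_labels):
    """One-hot encode an integer array; the result has shape (*arr.shape, n_labels)."""

    # Initialize the array with one row per element of arr
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)

    # Set the position of each label to 1
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.

    # Reshape back to the original shape plus the label axis
    one_hot = one_hot.reshape((*arr.shape, n_labels))

    return one_hot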
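
As a quick check of what this encoding produces, the same result can be obtained by indexing an identity matrix. The example below is a small sketch with made-up input values, not part of the original notebook.

```python
import numpy as np

# Two sequences of three integer-encoded characters, vocabulary of 4 symbols
arr = np.array([[0, 2, 1],
                [3, 3, 0]])
n_labels = 4

# Row i of the identity matrix is the one-hot vector for label i
one_hot = np.eye(n_labels, dtype=np.float32)[arr]

print(one_hot.shape)   # (2, 3, 4)
print(one_hot[0, 1])   # one-hot vector for label 2
```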

In [17]:
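def get_batches(arr, batch_size, seq_length):
    """Yield (x, y) batches of shape (batch_size, seq_length) from a 1-D encoded array.

    The targets y are the inputs x shifted one step to the left.
    """
    batch_size_total = batch_size * seq_length
    n_batches = len(arr)//batch_size_total

    # Keep only enough elements to fill complete batches
    arr = arr[:n_batches * batch_size_total]
    arr = arr.reshape((batch_size, -1))

    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n+seq_length]
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            # Last window: wrap around to the first column
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y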
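
The batching layout is easier to see on a tiny array. The sketch below reproduces the trimming, the reshape, and the one-step shift for the first window only, using made-up sizes.

```python
import numpy as np

batch_size, seq_length = 2, 3
arr = np.arange(14)          # stand-in for an encoded text of 14 characters

# Keep only full batches: 14 // (2 * 3) = 2 batches, i.e. 12 characters
n_batches = len(arr) // (batch_size * seq_length)
arr = arr[:n_batches * batch_size * seq_length].reshape((batch_size, -1))

# First window: inputs are columns 0..2, targets are the same window shifted by one
x = arr[:, 0:seq_length]
y = arr[:, 1:seq_length + 1]

print(arr)  # two rows: 0..5 and 6..11
print(x)    # rows [0 1 2] and [6 7 8]
print(y)    # rows [1 2 3] and [7 8 9]
```

Each row of the reshaped array is an independent stream of text, which is why the hidden state can be carried from one window to the next during training.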

Define the architecture

In [40]:
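class ImdbDataset(Dataset):
    """IMDb reviews encoded as integer character sequences, with binary labels."""

    def __init__(self, df_raw):
      # Work on a copy so the caller's DataFrame is not modified
      df_cleaned = df_raw.copy()

      # Build one character vocabulary shared by all reviews, so that a given
      # character always maps to the same integer
      self.chars = tuple(sorted(set(''.join(df_cleaned['review']))))
      self.int2char = dict(enumerate(self.chars))
      self.char2int = {ch: ii for ii, ch in self.int2char.items()}

      df_cleaned['review'] = df_cleaned['review'].apply(self.encode_review)

      self.imdb_frame = df_cleaned

      # Save target and predictors
      self.X = self.imdb_frame.drop('sentiment', axis=1)

      # LabelEncoder sorts the classes, so 'neg' -> 0 and 'pos' -> 1
      le = LabelEncoder()
      self.y = le.fit_transform(self.imdb_frame['sentiment'].values)

    def __len__(self):
      return len(self.imdb_frame)

    def __getitem__(self, idx):
      if torch.is_tensor(idx):
          idx = idx.tolist()

      features = self.X.iloc[idx].values
      target = self.y[idx]

      sample = [features, target]

      return sample

    def encode_review(self, x: str):
      """Map each character of a review to its integer index."""
      return np.array([self.char2int[ch] for ch in x])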

In [41]:
imdb_dataset = ImdbDataset(df_imdb_raw)
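
Reviews have different lengths, so the default `DataLoader` collation cannot stack them into one tensor. One common option is a custom `collate_fn` that pads each batch, combined with `random_split` for a train/validation split. The sketch below uses `ToySequences`, a hypothetical stand-in for `ImdbDataset`, and assumes index 0 is free to use as the padding value.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split

class ToySequences(Dataset):
    """Hypothetical stand-in for ImdbDataset: variable-length integer sequences with 0/1 labels."""

    def __init__(self):
        self.samples = [([1, 2, 3], 1), ([4, 5], 0), ([6], 1), ([7, 8, 9, 10], 0)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        seq, label = self.samples[idx]
        return torch.tensor(seq), label

def pad_collate(batch):
    """Pad the sequences in a batch to a common length and keep the true lengths."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    # Assumption: 0 is reserved for padding and is not a real character index
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

dataset = ToySequences()
train_set, val_set = random_split(dataset, [3, 1],
                                  generator=torch.Generator().manual_seed(0))
train_loader = DataLoader(train_set, batch_size=3, shuffle=False, collate_fn=pad_collate)

padded, lengths, labels = next(iter(train_loader))
print(padded.shape, lengths, labels)
```

With the real dataset, the character vocabulary would need to start at index 1 for the padding assumption to hold.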

In [20]:
class CharLSTM(nn.Module):
    """Character-level LSTM that outputs one score per vocabulary character at each time step."""

    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr

        # Character vocabulary and lookup tables in both directions
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}

        # LSTM arguments: input_size, hidden_size, num_layers
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)

        # Dropout applied to the LSTM output
        self.dropout = nn.Dropout(drop_prob)

        # Fully connected layer: n_hidden inputs, one output per character
        self.fc = nn.Linear(n_hidden, len(self.chars))


    def forward(self, x, hidden):
        """Run one batch through the network and return the scores and the new hidden state."""
        r_output, hidden = self.lstm(x, hidden)

        out = self.dropout(r_output)
        # Flatten to (batch_size * seq_length, n_hidden) before the linear layer
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)

        return out, hidden


    def init_hidden(self, batch_size):
        """Return zeroed hidden and cell states."""
        # Creates tensors of size n_layers x batch_size x n_hidden
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_())

        return hidden
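
The network above emits one score per vocabulary character at every time step, which suits next-character prediction. For the binary sentiment task described in the instructions, one option is to feed the final hidden state into a single-logit layer. The sketch below is an assumption about how that could look: the class name `SentimentLSTM`, the embedding layer, and all sizes are hypothetical and not taken from the notebook.

```python
import torch
from torch import nn

class SentimentLSTM(nn.Module):
    """Hypothetical binary classifier: embedding -> LSTM -> one logit per review."""

    def __init__(self, vocab_size, embed_dim=32, n_hidden=64, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, n_hidden, n_layers, batch_first=True)
        self.fc = nn.Linear(n_hidden, 1)

    def forward(self, x):
        # x: (batch, seq_len) tensor of integer character indices
        embedded = self.embedding(x)
        _, (h_n, _) = self.lstm(embedded)
        # h_n[-1] is the last layer's final hidden state, shape (batch, n_hidden)
        return self.fc(h_n[-1]).squeeze(1)

model = SentimentLSTM(vocab_size=100)
batch = torch.randint(0, 100, (4, 20))      # 4 fake reviews of 20 characters each
logits = model(batch)

# Binary cross-entropy on the raw logits; the targets are made-up labels
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1., 0., 1., 0.]))
print(logits.shape, loss.item())
```

Using a single logit with `BCEWithLogitsLoss` keeps the output directly comparable to the required 0/1 prediction after a sigmoid and a 0.5 threshold.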

In [21]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    """Train `net` on the 1-D encoded array `data`, reporting the validation loss every `print_every` steps."""
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # Split into training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encoding
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            # Detach the hidden state so gradients do not flow through earlier batches
            h = tuple([each.data for each in h])

            net.zero_grad()
            
            # Model output
            output, h = net(inputs, h)
            
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            
            # Clip gradients to limit exploding gradients
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            if counter % print_every == 0:
                # Evaluate on the validation data
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                
                    val_losses.append(val_loss.item())
                
                net.train() 
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

Training

In [22]:
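n_hidden=256
n_layers=2

# Character vocabulary: every distinct character that appears in the raw reviews
chars = tuple(sorted({ch for _, review in dataset_imdb for ch in review}))

net = CharLSTM(chars, n_hidden, n_layers)
print(net)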

In [None]:
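batch_size = 128
seq_length = 100
n_epochs = 110

# Assumption: `encoded` is the whole corpus as one 1-D array of character indices,
# which is the input format `train` expects
encoded = np.array([net.char2int[ch] for _, review in dataset_imdb for ch in review])

train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)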
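
The instructions ask for a final cell that loads the saved model and calls `predict`. The following is a minimal sketch of that cell under several assumptions: `TinyClassifier` is a hypothetical stand-in for the trained model, the vocabulary is assumed to be printable ASCII, and the file name `model.pt` is illustrative. The model here is untrained, so the printed label carries no meaning.

```python
import os
import tempfile

import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the trained model: embedding -> GRU -> one logit."""

    def __init__(self, vocab_size, n_hidden=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8)
        self.gru = nn.GRU(8, n_hidden, batch_first=True)
        self.fc = nn.Linear(n_hidden, 1)

    def forward(self, x):
        _, h_n = self.gru(self.embedding(x))
        return self.fc(h_n[-1]).squeeze(1)

# Assumed vocabulary: printable ASCII; index 0 is reserved for unknown characters
char2int = {chr(c): i + 1 for i, c in enumerate(range(32, 127))}
vocab_size = len(char2int) + 1

def load_model(path):
    """Rebuild the architecture and load the saved weights."""
    model = TinyClassifier(vocab_size)
    model.load_state_dict(torch.load(path))
    model.eval()
    return model

def predict(model, text):
    """Return 1 for a positive review and 0 for a negative one."""
    indices = [char2int.get(ch, 0) for ch in text]
    x = torch.tensor([indices], dtype=torch.long)
    with torch.no_grad():
        logit = model(x)
    return int(torch.sigmoid(logit).item() >= 0.5)

# Save an (untrained) model, read it back, and classify a string
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(TinyClassifier(vocab_size).state_dict(), path)
model = load_model(path)
print(predict(model, "A wonderful film."))
```

Saving only the `state_dict` keeps the file independent of the notebook's pickled class definitions, at the cost of having to rebuild the architecture before loading.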