# Exercise: Language Model with Self-Attention (efficient version)

This exercise is similar to the one from class 5, but we will now *efficiently* train a neural network with one or more self-attention layers to predict the next word of a text, given the previous words as input.

To do so, you must implement:
1. The causal attention mask. During training, it lets a single forward+backward pass through the network produce the losses for all input tokens (slide 117).
2. The PAD mask, which lets us use sequences of variable length in the same batch (slide 118).
3. Multiple heads.
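
As a rough illustration of how the first two items interact, the sketch below combines a causal mask with a PAD mask for a single sequence. It uses plain Python lists instead of tensors so it runs without PyTorch, and the function name `build_attention_mask` is hypothetical, not something defined in this notebook.

```python
def build_attention_mask(attention_mask):
    """Combine a causal mask with a padding mask for one sequence.

    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for <pad>.
    Returns a seq_len x seq_len matrix where entry [i][j] is 1 only if
    query position i may attend to key position j.
    """
    seq_len = len(attention_mask)
    return [
        # j <= i enforces causality; attention_mask[j] == 1 hides <pad> keys.
        [1 if (j <= i and attention_mask[j] == 1) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Hypothetical example: three real tokens followed by one <pad>.
mask = build_attention_mask([1, 1, 1, 0])
for row in mask:
    print(row)
```

Each row is one query position: it sees itself and earlier positions, and the padded last column is never visible.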

## Importing the packages

In [7]:
import collections
import itertools
import functools
import math
import os
import random
import re

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import DataLoader, Dataset
from tqdm.notebook import tqdm
from typing import List, Type

In [3]:
# Check which GPU we are using
!nvidia-smi

Mon Oct 17 19:23:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P8     7W /  N/A |    690MiB /  5944MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


# 1) Loading the dataset

First, we download the dataset:

In [5]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Loading the dataset

We will artificially create a training (80%) and validation (20%) split.

Note: Avoid looking at the test dataset as much as possible, so that you are not biased toward what will be tested. In real applications, the test dataset only becomes available in the future, that is, when users start testing your product.

In [6]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg

# Shuffle the training set before making the train/valid split.
random.shuffle(x_train)

n_train = int(0.8 * len(x_train))

x_valid = x_train[n_train:]
x_train = x_train[:n_train]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x in x_train[:3]:
    print(x[:100])

print('Last 3 training samples:')
for x in x_train[-3:]:
    print(x[:100])

print('First 3 validation samples:')
for x in x_valid[:3]:
    print(x[:100])

print('Last 3 validation samples:')
for x in x_valid[-3:]:
    print(x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
Whoever made this movie must have done it as a joke. I mean, this was the stupidest movie I think I 
This movie has beautiful scenery. Unfortunately it has no plot. In order to have a plot there must b
<br /><br />I have to admit to enjoying bad movies. I love them I watch all of them. Horror especial
Last 3 training samples:
There is no greater disservice to do to history than to misrepresent it. This takes the easiest and 
I watched this movie and the original Carlitos Way back to back. The difference between the two is d
I didn't mind all the walking. People really did walk places back then. It loaned an air of authenti
First 3 validation samples:
I went to see this movie twice within a week and can only sum it up in one word (which I normally do
I really did like this show, once upon a time. That is, until I realized all the faults in it. It's 
I recently found th

# 2) Tokenizer

In [8]:
class IMDBTokenizer():

  def __init__(self, max_tokens: int = 1000):
    self.max_tokens = max_tokens

  def __call__(self, text: str, padding: bool = False, truncation: bool = False, max_length: int = 50, len_multiple: int = None):
    tokens = self.encode(text)
    attention_mask = [1] * len(tokens)

    if truncation and len(tokens) > max_length:
      tokens = tokens[:max_length]
      attention_mask = attention_mask[:max_length]
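    if padding:
      missing_size = 0
      if len_multiple is not None:
        # Pad to a length of k * len_multiple + 1, so the sequence splits into
        # windows of len_multiple + 1 tokens that overlap by one token.
        if len(tokens) % len_multiple != 0:
          missing_size = len_multiple - len(tokens) % len_multiple + 1
        else:
          missing_size = 1
      else:
        # Otherwise pad up to max_length.
        if len(tokens) < max_length:
          missing_size = max_length - len(tokens)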

      tokens.extend([self.vocab['<pad>']] * missing_size)
      attention_mask.extend([0] * missing_size)
    
    return tokens, attention_mask

  def encode(self, text: str):
    tokens = self.tokenize(text)
    tokens = [self.vocab.get(t, self.vocab['<unk>']) for t in tokens]
    return tokens

  def decode(self, tokens: List[int]):
    decoder_dict = dict(zip(self.vocab.values(), self.vocab.keys()))
    texts = [decoder_dict[t] for t in tokens]
    
    def replace_fn(match):
      return match.group(0).replace(' ', '')

    return re.sub(r'\s[,.!?\']', replace_fn, ' '.join(texts))

  def create_vocab(self, corpus: List[str]):
    texts = [self.tokenize(t) for t in corpus]
    tokens = [t for tokens in texts for t in tokens]
    vocab = collections.Counter(tokens).most_common(self.max_tokens - 4)
    self.vocab = {v[0]:i+4 for i, v in enumerate(vocab)}
    self.vocab['<sos>'] = 0
    self.vocab['<eos>'] = 1
    self.vocab['<pad>'] = 2
    self.vocab['<unk>'] = 3
    self.vocab_size = len(self.vocab)
    self.eos_token_id = self.vocab['<eos>']
    self.pad_token_id = self.vocab['<pad>']
    self.unk_token_id = self.vocab['<unk>']

  def tokenize(self, text: str):
    
    #-----Optionally remove the HTML line breaks------------
    # Note: this replacement increases the model's error.
    # Most likely this is because <br /><br /> always appears in this exact
    # order, always producing the same sequence of 8 tokens:
    # '<', 'br', '/', '>', '<', 'br', '/', '>'
    # The model is therefore biased to always predict this sequence when part
    # of it appears (e.g. predicting / when the input sequence ends in <br),
    # which ends up lowering its error.
    #text = re.sub(re.escape('°'), ' ', text) # The symbol ° only appears 4 times in the dataset and will be used to represent <br /><br />  (line breaks)
    text = re.sub(re.escape('<br /><br />'), ' ', text)
    #------------------------------------------------------

    #Optionally ignore ' " \ | / and some others special characters
    #tokens = re.findall(r'\w+|[,.!?-]', text.lower())
    
    tokens = re.findall(r'<sos>|<eos>|\w+|[^\w\s]', text.lower())
    tokens = [token.lower() for token in tokens]
    return tokens



## 2.1) Testing the tokenizer

In [None]:
text = '<sos> Four divided by two is two. <eos>'
text2 = 'Words not in vocab'
corpus = ['Two plus two is four.', 'Four plus four is eigth!', 'Four divided by two is two.']

tokenizer = IMDBTokenizer()
tokenizer.create_vocab(corpus)
print(tokenizer.vocab)
assert tokenizer.encode(text) == [0, 5, 11, 12, 4, 6, 4, 8, 1]
assert tokenizer.encode(text2) == [tokenizer.unk_token_id]*4

assert tokenizer.decode([0, 5, 11, 12, 4, 6, 4, 8, 1]) == '<sos> four divided by two is two. <eos>'

{'two': 4, 'four': 5, 'is': 6, 'plus': 7, '.': 8, 'eigth': 9, '!': 10, 'divided': 11, 'by': 12, '<sos>': 0, '<eos>': 1, '<pad>': 2, '<unk>': 3}
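
To make the splitting rule easier to see in isolation, here is a small sketch that applies the same regular expression used in `IMDBTokenizer.tokenize` to two short strings. The standalone `tokenize` helper is only an illustration and skips the `<br /><br />` replacement step.

```python
import re


def tokenize(text):
    # Same pattern as IMDBTokenizer.tokenize: the special markers are tried
    # first, then runs of word characters, then any single non-space symbol.
    return re.findall(r'<sos>|<eos>|\w+|[^\w\s]', text.lower())


sentence_tokens = tokenize("<sos> It's good, isn't it? <eos>")
break_tokens = tokenize('<br />')

print(sentence_tokens)
print(break_tokens)
```

Contractions are split around the apostrophe, and a single `<br />` tag becomes four tokens, which is why an unreplaced `<br /><br />` yields eight.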


# 3) Dataset

In [9]:
class IMDBDataset(Dataset):
  def __init__(self, corpus: List[str], tokenizer: IMDBTokenizer, max_seq_length: int = 5, shifting_window: bool = False):
    
    self.tokenizer = tokenizer

    data = []

    for text in corpus:
      if shifting_window:
        tokens, attention_masks = self.tokenizer('<sos> ' + text + ' <eos>')
        data.extend([[tokens[i:i+max_seq_length+1], attention_masks[i:i+max_seq_length+1]] for i in range(len(tokens)-max_seq_length)])
      else:
        tokens, attention_masks = self.tokenizer('<sos> ' + text + ' <eos>', padding=True, len_multiple=max_seq_length)
        data.extend([[tokens[i:i+max_seq_length+1], attention_masks[i:i+max_seq_length+1]] for i in range(0, len(tokens) - max_seq_length, max_seq_length)])

    self.data = torch.IntTensor(data)

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    target_ids = self.data[index][0, 1:].long()

    # Targets that are <pad> or <unk> get -100, the default ignore_index of nn.CrossEntropyLoss.
    ignore_loss = (target_ids == self.tokenizer.pad_token_id) | (target_ids == self.tokenizer.unk_token_id)
    target_ids[ignore_loss] = -100

    return self.data[index][0, :-1], self.data[index][1, :-1], target_ids, self.data[index][1, 1:]


In [13]:
tokenizer = IMDBTokenizer(max_tokens=3)
tokenizer.create_vocab(x_train)

In [14]:
train_dataset = IMDBDataset(x_train[:1], tokenizer, max_seq_length=5, shifting_window=True)
print(train_dataset[0])

(tensor([0, 3, 3, 3, 3], dtype=torch.int32), tensor([1, 1, 1, 1, 1], dtype=torch.int32), tensor([-100, -100, -100, -100, -100]), tensor([1, 1, 1, 1, 1], dtype=torch.int32))


In [None]:
corpus = ['Two plus two is four.', 'Four plus four is eigth!', 'Four divided by two is two.']

tokenizer = IMDBTokenizer()
tokenizer.create_vocab(corpus)

dataset = IMDBDataset(corpus, tokenizer, max_seq_length=2)

assert len(dataset) == 13

tokens1 = torch.IntTensor(tokenizer.encode(corpus[0]))
print(dataset[1], tokens1)
for i in range(2):
  assert (dataset[i+1][0] == tokens1[i*2+1:i*2+3]).all()
  assert (dataset[i+1][2] == tokens1[i*2+2:i*2+4]).all()

# Check that targets made of unknown tokens are ignored by the loss
tokenizer = IMDBTokenizer(max_tokens=3)
tokenizer.create_vocab(corpus)

dataset = IMDBDataset(corpus, tokenizer, max_seq_length=2)
tokens2 = torch.IntTensor(tokenizer.encode(corpus[0]))
print(dataset[0], tokens2)

assert len(dataset) == 13
assert (dataset[1][0] == tokens2[2:4]).all()
assert (dataset[1][2] == torch.tensor([-100, -100])).all()

(tensor([7, 4], dtype=torch.int32), tensor([1, 1], dtype=torch.int32), tensor([4, 6]), tensor([1, 1], dtype=torch.int32)) tensor([4, 7, 4, 6, 5, 8], dtype=torch.int32)
(tensor([0, 3], dtype=torch.int32), tensor([1, 1], dtype=torch.int32), tensor([-100, -100]), tensor([1, 1], dtype=torch.int32)) tensor([3, 3, 3, 3, 3, 3], dtype=torch.int32)
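
The non-shifting mode of `IMDBDataset` can be pictured with the sketch below, which pads a token list to a length of `k * max_seq_length + 1` and cuts it into windows that overlap by one token. It works on plain lists, the helper name `chunk_sequence` is hypothetical, and `pad_id=2` is assumed to match the `<pad>` id used above.

```python
def chunk_sequence(tokens, max_seq_length, pad_id=2):
    """Pad `tokens` to a length of k * max_seq_length + 1 and split it into
    (inputs, targets) pairs, where targets are the inputs shifted by one."""
    tokens = list(tokens)
    remainder = len(tokens) % max_seq_length
    # Same rule as the tokenizer: always leave exactly one extra token.
    missing = 1 if remainder == 0 else max_seq_length - remainder + 1
    tokens += [pad_id] * missing
    windows = [
        tokens[i:i + max_seq_length + 1]
        for i in range(0, len(tokens) - max_seq_length, max_seq_length)
    ]
    return [(window[:-1], window[1:]) for window in windows]


# Token ids of '<sos> two plus two is four . <eos>' under the toy vocabulary.
pairs = chunk_sequence([0, 4, 7, 4, 6, 5, 8, 1], max_seq_length=2)
for inputs, targets in pairs:
    print(inputs, targets)
```

The second pair is `[7, 4]` with targets `[4, 6]`, matching `dataset[1]` printed above; the last target is the `<pad>` id, which the dataset later replaces with -100.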


# 4) Model

In [None]:
class MultiHeadSelfAttention(nn.Module):
  
  def __init__(self, heads: int, embedding_dim: int = 50, test: bool = False):
    super(MultiHeadSelfAttention, self).__init__()
    
    # When test is True, forward() also returns the attention weights.
    self.test = test
    
    self.heads = heads
    assert embedding_dim % heads == 0, "Embedding dimension must be divisible by the number of heads"

    self.query_proj = nn.Linear(embedding_dim, embedding_dim)
    self.key_proj = nn.Linear(embedding_dim, embedding_dim)
    self.value_proj = nn.Linear(embedding_dim, embedding_dim)
    self.out_proj = nn.Linear(embedding_dim, embedding_dim)
  
  def forward(self, input_embeddings, mask=None):
    batch_size, seq_len, _ = input_embeddings.shape

    # Project and split into heads:
    # (batch, seq_len, emb) -> (batch, heads, seq_len, emb // heads)
    Q = self.query_proj(input_embeddings).reshape(batch_size, seq_len, self.heads, -1).transpose(1, 2)
    K = self.key_proj(input_embeddings).reshape(batch_size, seq_len, self.heads, -1).transpose(1, 2)
    V = self.value_proj(input_embeddings).reshape(batch_size, seq_len, self.heads, -1).transpose(1, 2)

    # Attention scores: (batch, heads, seq_len, seq_len).
    # Note: the scores are not divided by sqrt(head_dim), unlike standard
    # scaled dot-product attention.
    scores = Q @ K.transpose(-2, -1)

    if mask is not None:
      # Masked positions get a very negative score, so softmax gives them ~0 weight.
      scores = scores.masked_fill(mask == 0, -1e8)

    attention_weights = torch.softmax(scores, dim=-1)
    
    E = attention_weights @ V

    # Merge the heads back: (batch, heads, seq_len, head_dim) -> (batch, seq_len, emb)
    E = E.transpose(1, 2).reshape(batch_size, seq_len, -1)

    E = self.out_proj(E)

    if self.test:
      return E, attention_weights
    else:
      return E

In [None]:
m = MultiHeadSelfAttention(5)
x = torch.arange(100).reshape(1,2,50).float()

# If the mask is fully visible, the results must be the same
mask = torch.tensor([[1, 1], [1, 1]])
mask = mask.view(1, 1, 2, 2).expand(1, 5, 2, 2)
assert torch.sum(torch.abs(m(x, mask=mask) - m(x))) < 1e-6

# If the mask is fully visible for the second element, but not for the first
mask = torch.tensor([[1, 0], [1, 1]])
mask = mask.view(1, 1, 2, 2).expand(1, 5, 2, 2)
assert torch.sum(torch.abs(m(x, mask=mask)[:, 0, :] - m(x)[:, 0, :])) > 1
assert torch.sum(torch.abs(m(x, mask=mask)[:, 1, :] - m(x)[:, 1, :])) < 1e-6

# If the mask is fully visible for the first element, but not for the second
mask = torch.tensor([[1, 1], [1, 0]])
mask = mask.view(1, 1, 2, 2).expand(1, 5, 2, 2)
assert torch.sum(torch.abs(m(x, mask=mask)[:, 0, :] - m(x)[:, 0, :])) < 1e-6
assert torch.sum(torch.abs(m(x, mask=mask)[:, 1, :] - m(x)[:, 1, :])) > 1
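
The effect of writing a very negative value into masked scores can be checked on a single row of scores. The sketch below is a plain-Python stand-in for the masked softmax step; `masked_softmax` and the example numbers are hypothetical.

```python
import math


def masked_softmax(scores, mask, fill=-1e8):
    """Softmax over one row of scores; positions with mask == 0 get `fill`."""
    masked = [s if m == 1 else fill for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# The third position has the highest raw score but is masked out.
weights = masked_softmax([2.0, 1.0, 3.0], [1, 1, 0])
print(weights)
```

The masked position ends up with weight 0 and the remaining weights still sum to 1, which is what the assertions in the cell above rely on.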

In [None]:
torch.set_printoptions(precision=2, sci_mode=False)
m = MultiHeadSelfAttention(5, test=True)
x = torch.arange(500).reshape(1,10,50).float()/500
mask = torch.tril(torch.ones(10, 10))
mask = mask.view(1, 1, 10, 10).expand(1, 5, 10, 10)
mask, m(x, mask=mask)[1][0,0,:,:]

(tensor([[[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
 
          [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1.,

In [None]:
class MyAttentionModel(nn.Module):

  def __init__(self, max_seq_length: int, vocab_size: int, embedding_dim: int = 50, heads: int = 5, eos_token_id: int = None):
    super(MyAttentionModel, self).__init__()

    self.generating = False

    self.eos_token_id = eos_token_id

    self.max_seq_length = max_seq_length

    self.heads = heads

    # Lower-triangular causal mask; a buffer moves with the model across devices but is never trained.
    self.register_buffer('causal_mask', torch.tril(torch.ones(max_seq_length, max_seq_length)))

    self.dropout = nn.Dropout(p=0.15)

    self.tokens_embeddings = nn.Embedding(vocab_size, embedding_dim)
    self.positional_embeddings = nn.Parameter(data=torch.normal(0, 0.1, size=(max_seq_length, embedding_dim)))
    
    self.self_attention = MultiHeadSelfAttention(heads=heads, embedding_dim=embedding_dim)

    self.feed_forward = nn.Sequential(
        nn.Linear(embedding_dim, 4*embedding_dim),
        nn.ReLU(),
        nn.Linear(4*embedding_dim, embedding_dim)
    )

    self.language_head = nn.Linear(embedding_dim, vocab_size)
  
  def forward(self, x, attention_mask=None):

    batch_size = x.shape[0]
    seq_len = x.shape[1]

    input_embeddings = self.tokens_embeddings(x) + self.positional_embeddings[:seq_len, :]

    if attention_mask is not None:
      # PAD mask over the keys, broadcast to (batch, heads, seq_len, seq_len),
      # then combined with the causal mask by element-wise product.
      attention_mask = attention_mask.reshape(batch_size, 1, 1, seq_len).expand(-1, self.heads, seq_len, -1)
      causal_mask = self.causal_mask[:seq_len, :seq_len].reshape(1, 1, seq_len, seq_len).expand(batch_size, self.heads, seq_len, seq_len)
      mask = attention_mask * causal_mask
    else:
      mask = self.causal_mask[:seq_len, :seq_len].reshape(1, 1, seq_len, seq_len).expand(batch_size, self.heads, seq_len, seq_len)

    E = self.self_attention(input_embeddings, mask=mask) 
    E = E + self.dropout(input_embeddings) # skip-connection

    y = self.feed_forward(E)
    y = y + self.dropout(E) # skip-connection
    
    if self.generating:
      logits = self.language_head(y[:, -1, :])
    else:
      logits = self.language_head(y)

    return logits

  def generate(self, x, attention_mask=None, max_length=10):
    # Greedy decoding for a single sequence (batch size 1).

    # Make sure the batch dimension is present
    if len(x.shape) < 2:
      x = x.reshape(1, -1)

    initial_input = x
    generated = []

    self.generating = True  # forward() now returns logits for the last position only
    try:
      with torch.no_grad():
        while True:
          # Keep at most max_seq_length tokens of context
          x = x[:, -self.max_seq_length:]
          if attention_mask is not None:
            attention_mask = attention_mask[:, -self.max_seq_length:]

          token_id = self(x, attention_mask=attention_mask).argmax(dim=-1)
          generated.append(token_id)

          # Stop at <eos> or once max_length + 1 tokens have been generated
          if len(generated) > max_length or token_id.item() == self.eos_token_id:
            break

          x = torch.hstack((x, token_id.reshape(1, 1)))
          if attention_mask is not None:
            attention_mask = torch.hstack((attention_mask, torch.ones_like(attention_mask[:, :1])))
    finally:
      self.generating = False
    
    return torch.cat((initial_input[0], torch.cat(generated)))

In [None]:
m = MyAttentionModel(13, 10, 5)
x = torch.LongTensor([[1, 2, 3], [1, 2, 3]])
print(m(x).shape)
m.generating = True
m(x).shape

torch.Size([2, 3, 10])


torch.Size([2, 10])

In [None]:
m.generate(torch.LongTensor([1, 2, 3, 2, 3, 7, 7]))

tensor([1, 2, 3, 2, 3, 7, 7, 8, 8, 8, 1, 1, 1, 1, 1, 0, 2, 1])
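
The generation loop is easier to follow without the tensor bookkeeping. The sketch below shows greedy decoding with a sliding context window, using a toy function in place of the model. Both `greedy_generate` and `toy_model` are hypothetical, and unlike `generate` above the sketch stops after exactly `max_new_tokens` tokens.

```python
def greedy_generate(next_token_fn, prompt, max_new_tokens, max_seq_length, eos_id=None):
    """Append the most likely next token until <eos> or the token budget."""
    context = list(prompt)
    generated = []
    while len(generated) < max_new_tokens:
        # The model only sees the last max_seq_length tokens.
        context = context[-max_seq_length:]
        token = next_token_fn(context)
        generated.append(token)
        if token == eos_id:
            break
        context.append(token)
    return list(prompt) + generated


def toy_model(context):
    # Stand-in for the network: predicts (last token + 1) modulo 10.
    return (context[-1] + 1) % 10


with_eos = greedy_generate(toy_model, [1, 2, 3], max_new_tokens=5, max_seq_length=4, eos_id=7)
without_eos = greedy_generate(toy_model, [1, 2, 3], max_new_tokens=5, max_seq_length=4)
print(with_eos)
print(without_eos)
```

With an end-of-sequence id the loop stops as soon as that token is produced; without one it runs until the budget is spent.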

# 5) Training

## 5.1) Hyperparameters

In [None]:
hyperparams = {
    'batch_size': 128,
    'learning_rate': 1e-3,
    'epochs': 40,
    'embedding_dim': 200,
    'vocab_size': 5000,
    'max_seq_length': 15
}

## 5.2) Datasets and Dataloaders

In [None]:
tokenizer = IMDBTokenizer(max_tokens=hyperparams['vocab_size'])
tokenizer.create_vocab(x_train)

train_dataset = IMDBDataset(x_train, tokenizer, max_seq_length=hyperparams['max_seq_length'])
valid_dataset = IMDBDataset(x_valid, tokenizer, max_seq_length=hyperparams['max_seq_length'], shifting_window=True)

train_loader = DataLoader(train_dataset, batch_size=hyperparams['batch_size'], shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=500)

print(f'Dataset sizes | Train: {len(train_dataset):,} - Valid: {len(valid_dataset):,}')
print(f'Number of batches | Train: {len(train_loader):,} - Valid: {len(valid_loader):,}')

Dataset sizes | Train: 385,277 - Valid: 1,315,154
Number of batches | Train: 3,010 - Valid: 2,631


## 5.3) Training and validation functions

In [None]:
def evaluate_only_last(model, valid_dataloader, len_valid, criterion):
  accuracy = 0
  acc_loss = 0
  model.eval()
  model.generating = True
  for i, (input_ids, attention_mask, output_ids, decoder_mask) in enumerate(valid_dataloader):
    input_ids, attention_mask, output_ids, decoder_mask = input_ids.to(device), attention_mask.to(device), output_ids.to(device), decoder_mask.to(device)
    
    with torch.no_grad():
      logits = model(input_ids, attention_mask=attention_mask)
    
    loss = criterion(logits, output_ids[:, -1])
    acc_loss += loss.item()
    preds = logits.argmax(dim=-1)
    batch_correct = (preds == output_ids[:, -1])
    accuracy += batch_correct.sum().item()

  model.generating = False
  return accuracy / len_valid, acc_loss / len(valid_dataloader)

In [None]:
def evaluate_sequence(model, valid_dataloader, len_valid, criterion):
  accuracy = 0
  acc_loss = 0
  model.eval()
  for i, (input_ids, attention_mask, output_ids, decoder_mask) in enumerate(valid_dataloader):
    input_ids, attention_mask, output_ids, decoder_mask = input_ids.to(device), attention_mask.to(device), output_ids.to(device), decoder_mask.to(device)
    
    with torch.no_grad():
      logits = model(input_ids, attention_mask=attention_mask)
    
    loss = criterion(logits.transpose(-2, -1), output_ids)
    acc_loss += loss.item()
    preds = logits.argmax(dim=-1)
    batch_correct = (preds == output_ids)
    accuracy += batch_correct.sum().item() / batch_correct.shape[1]

  return accuracy / len_valid, acc_loss / len(valid_dataloader)

In [None]:
def train(model, train_dataloader, optimizer, criterion, scheduler):
  acc_loss = 0
  model.train()
  for i, (input_ids, attention_mask, output_ids, decoder_mask) in enumerate(train_dataloader):
    input_ids, attention_mask, output_ids, decoder_mask = input_ids.to(device), attention_mask.to(device), output_ids.to(device), decoder_mask.to(device)
    
    logits = model(input_ids, attention_mask=attention_mask)
    loss = criterion(logits.transpose(-2, -1), output_ids)
    acc_loss += loss.item()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    scheduler.step()
  
    current_mean_loss = acc_loss / (i+1)
  
  return acc_loss / len(train_dataloader)

## 5.4) Testing the model's metrics

In [None]:
criterion = nn.CrossEntropyLoss()

initial_accuracy, initial_loss = 0, 0
for i in range(5):
  model = MyAttentionModel(hyperparams['max_seq_length'], tokenizer.vocab_size, embedding_dim=hyperparams['embedding_dim']).to(device)
  metrics = evaluate_only_last(model, valid_loader, len(valid_dataset), criterion)
  initial_accuracy += metrics[0] / 5
  initial_loss += metrics[1] / 5

initial_perplexity = np.exp(initial_loss)
print(f'Before training metrics | Loss: {initial_loss:.4f} - Perplexity: {initial_perplexity:.4f} - Accuracy: {initial_accuracy:.4f}')

# Check that the initial metrics are within the expected range
assert (0.9 * tokenizer.vocab_size) <= initial_perplexity <= (1.25 * tokenizer.vocab_size)
assert initial_accuracy <= (1.25 / tokenizer.vocab_size)

Before training metrics | Loss: 8.7016 - Perplexity: 6012.4105 - Accuracy: 0.0002
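
The bounds in the assertions above come from what a model with no knowledge would score. The short calculation below assumes a perfectly uniform prediction over the 5000-token vocabulary from the hyperparameters. An untrained network is only approximately uniform, which may explain why the measured values sit somewhat above these.

```python
import math

vocab_size = 5000  # same value as hyperparams['vocab_size']

# A uniform prediction assigns probability 1 / vocab_size to every token.
uniform_loss = -math.log(1 / vocab_size)   # cross-entropy in nats
uniform_perplexity = math.exp(uniform_loss)
uniform_accuracy = 1 / vocab_size

print(f'Loss: {uniform_loss:.4f} - Perplexity: {uniform_perplexity:.1f} - Accuracy: {uniform_accuracy:.4f}')
```

Perplexity is the exponential of the cross-entropy loss, so for a uniform guess it equals the vocabulary size.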


## 5.5) Training loop

In [None]:
model = MyAttentionModel(hyperparams['max_seq_length'], tokenizer.vocab_size, embedding_dim=hyperparams['embedding_dim'], eos_token_id=tokenizer.eos_token_id).to(device)

optim = torch.optim.AdamW(model.parameters(), lr=hyperparams['learning_rate'])
scheduler = torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=hyperparams['learning_rate'], steps_per_epoch=len(train_loader), epochs=hyperparams['epochs'],anneal_strategy='linear', pct_start=0.05)

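# `Report` is never imported in this notebook; it is assumed here to come from the torch_snippets package.
from torch_snippets import Report
log = Report(hyperparams['epochs'])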
best_loss = 100000

for e in tqdm(range(hyperparams['epochs'])):
  train_loss = train(model, train_loader, optim, criterion, scheduler)
  log.record(pos=e+0.99, train_loss=train_loss, train_perplexity=np.exp(train_loss), end='\r')
  
  valid_accuracy, valid_loss = evaluate_only_last(model, valid_loader, len(valid_dataset), criterion)
  log.record(pos=e+0.99, val_loss=valid_loss, val_perplexity=np.exp(valid_loss), val_acc=valid_accuracy, end='\r')
  
  if valid_loss < best_loss:
    torch.save({'model_state_dict': model.cpu().state_dict(), 'tokenizer': tokenizer, 'hyperparams': hyperparams}, 'model_best.pt')
    model.to(device)
    best_loss = valid_loss
  
  log.report_avgs(e+1)

  0%|          | 0/40 [00:00<?, ?it/s]

EPOCH: 1.000	train_loss: 5.389	train_perplexity: 218.899	val_loss: 4.859	val_perplexity: 128.833	val_acc: 0.166	(101.30s - 3950.52s remaining)
EPOCH: 2.000	train_loss: 4.773	train_perplexity: 118.229	val_loss: 4.621	val_perplexity: 101.612	val_acc: 0.185	(202.94s - 3855.92s remaining)
EPOCH: 3.000	train_loss: 4.628	train_perplexity: 102.357	val_loss: 4.516	val_perplexity: 91.433	val_acc: 0.193	(303.68s - 3745.41s remaining)
EPOCH: 4.000	train_loss: 4.539	train_perplexity: 93.598	val_loss: 4.456	val_perplexity: 86.147	val_acc: 0.198	(404.66s - 3641.92s remaining)
EPOCH: 5.000	train_loss: 4.478	train_perplexity: 88.047	val_loss: 4.420	val_perplexity: 83.112	val_acc: 0.202	(504.05s - 3528.38s remaining)
EPOCH: 6.000	train_loss: 4.431	train_perplexity: 84.034	val_loss: 4.387	val_perplexity: 80.367	val_acc: 0.205	(604.34s - 3424.57s remaining)
EPOCH: 7.000	train_loss: 4.393	train_perplexity: 80.845	val_loss: 4.363	val_perplexity: 78.519	val_acc: 0.207	(703.39s - 3315.99s remaining)
EPOCH: 8

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches(20, 9)
#ax.set_ylim([3.95, 4.15])
log.plot(['train_loss', 'val_loss'], ax=ax)

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(20, 9)
#ax.set_ylim([53, 65])
log.plot(['train_perplexity', 'val_perplexity'], ax=ax)

# 6) Analyzing the model

## 6.1) Loading the model

In [None]:
def load_model_and_tokenizer(file_name: str):
  checkpoint = torch.load(file_name)
  load_params = checkpoint['hyperparams']
  load_tokenizer = checkpoint['tokenizer']
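  # Pass eos_token_id so that generate() can stop at the end-of-sequence token.
  load_model = MyAttentionModel(load_params['max_seq_length'], load_tokenizer.vocab_size, embedding_dim=load_params['embedding_dim'], eos_token_id=load_tokenizer.eos_token_id)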
  load_model.load_state_dict(checkpoint['model_state_dict'])

  model_params = {k:v for k, v in load_params.items() if k in ['max_seq_length', 'vocab_size', 'embedding_dim']}
  print(f'Model loaded | {model_params}')

  return load_model, load_tokenizer, load_params['max_seq_length']

In [None]:
model_best, tokenizer_best, input_length = load_model_and_tokenizer('model_best.pt')

In [None]:
print(model_best)

## 6.2) Metrics on the test dataset

In [None]:
criterion = nn.CrossEntropyLoss()

test_dataset = IMDBDataset(x_test, tokenizer_best, max_seq_length=input_length)

test_loader = DataLoader(test_dataset, batch_size=500)

test_accuracy1, test_loss1 = evaluate_only_last(model_best.to(device), test_loader, len(test_dataset), criterion)

test_accuracy2, test_loss2 = evaluate_sequence(model_best.to(device), test_loader, len(test_dataset), criterion)

print(f'Final model scores (last token only) | Loss: {test_loss1:.5f} - Perplexity: {np.exp(test_loss1):.5f} - Accuracy: {test_accuracy1:.5f}')
print(f'Final model scores (all positions) | Loss: {test_loss2:.5f} - Perplexity: {np.exp(test_loss2):.5f} - Accuracy: {test_accuracy2:.5f}')

## 6.3) Text generation

In [None]:
model_best.cpu()
def generate_text(tknz: IMDBTokenizer, seed: str, gen_length: int = 50):
  # Encode the seed, generate greedily, and decode back to text.
  input_ids = torch.tensor(tknz.encode('<sos> ' + seed))
  pred = model_best.generate(input_ids, max_length=gen_length).tolist()
  out_text = tknz.decode(pred)

  return out_text

In [None]:
generate_text(tokenizer_best, 'The movie was really good, but')

In [None]:
generate_text(tokenizer_best, 'This was not the worst movie I\'ve')

In [None]:
generate_text(tokenizer_best, 'Today I was walking on the street and then')

In [None]:
generate_text(tokenizer_best, 'I walked my dog by the park today when')

## 6.4) Confusion matrix

In [None]:
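from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

valid_dataset = IMDBDataset(x_valid, tokenizer_best, max_seq_length=input_length)
# A moderate batch size keeps the (batch, seq_len, vocab_size) logits small in memory.
valid_loader = DataLoader(valid_dataset, batch_size=500)

model_best.eval()
preds, ground_truth = [], []
with torch.no_grad():
  for x, att, y, _ in valid_loader:
    logits = model_best(x, attention_mask=att)  # (batch, seq_len, vocab_size)
    keep = y != -100  # drop targets excluded from the loss (<pad> and <unk>)
    preds.append(logits.argmax(dim=-1)[keep])  # argmax over the vocabulary
    ground_truth.append(y[keep])

preds, ground_truth = torch.cat(preds), torch.cat(ground_truth)

# Fix the label set so that row/column i corresponds to token id i.
cfm = confusion_matrix(ground_truth.numpy(), preds.numpy(), labels=np.arange(tokenizer_best.vocab_size))
del preds, x, y, ground_truth, logits

In [None]:
num_labels = 35
first_id = 4  # ids 0-3 are the special tokens; ids from 4 on are ordered by frequency

filter_cfm = cfm[first_id:first_id+num_labels, first_id:first_id+num_labels]
# Normalize each row by its own total, guarding against empty rows.
filter_cfm = filter_cfm / np.maximum(filter_cfm.sum(axis=1, keepdims=True), 1)

# The vocab dict was filled in order of frequency, so its first keys are ids 4, 5, ...
labels = list(tokenizer_best.vocab.keys())[:num_labels]

disp = ConfusionMatrixDisplay(confusion_matrix=filter_cfm, display_labels=labels)
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
disp.plot(include_values=False, ax=ax, values_format='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.show()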
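
For the plot to read as "given the true token, how often was each token predicted", every row of the confusion matrix has to be divided by its own total. The sketch below shows that row-wise normalization on a tiny hand-made matrix, using plain lists; `normalize_rows` and the counts are hypothetical.

```python
def normalize_rows(matrix):
    """Divide every row by its own sum; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Rows are true labels, columns are predicted labels.
counts = [[8, 2], [1, 3], [0, 0]]
rates = normalize_rows(counts)
for row in rates:
    print(row)
```

Each non-empty row now sums to 1. Tokens that never occur as targets produce an all-zero row, which the guard keeps from turning into a division by zero.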