## Exercise: Language Model with Self-Attention

This exercise is similar to the one from the previous class, but we will now train a neural network *with self-attention* to predict the next word of a text, given the preceding words as input.

The self-attention layer must implement the following (see slide 34):
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Feed-forward layer (2-layer MLP)
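The three components listed above could be combined roughly as follows. This is only a sketch under several assumptions, since the referenced slides are not reproduced here: the class name `SelfAttentionLM`, a single attention head, scaling the scores by the square root of the embedding dimension, bias-free projections, no residual connections or layer normalization, and predicting the next word from the last position are all illustrative choices, not the course's reference solution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionLM(nn.Module):
    """Hypothetical single-head self-attention language model (a sketch)."""

    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        super().__init__()
        # Index 0 is reserved for unknown words, hence vocab_size + 1 rows.
        self.token_emb = nn.Embedding(vocab_size + 1, embedding_dim)
        # Learned position embeddings: one vector per context position.
        self.pos_emb = nn.Embedding(context_size, embedding_dim)
        # Linear projections W_Q, W_K, W_V, W_O (bias-free by assumption).
        self.W_Q = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.W_K = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.W_V = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.W_O = nn.Linear(embedding_dim, embedding_dim, bias=False)
        # Feed-forward part: a 2-layer MLP applied to every position.
        self.ff1 = nn.Linear(embedding_dim, hidden_dim)
        self.ff2 = nn.Linear(hidden_dim, embedding_dim)
        # Final projection to vocabulary logits.
        self.out = nn.Linear(embedding_dim, vocab_size + 1)

    def forward(self, inputs):
        # inputs: (batch, context) tensor of token ids
        positions = torch.arange(inputs.size(1), device=inputs.device)
        x = self.token_emb(inputs) + self.pos_emb(positions)      # (B, L, D)
        q, k, v = self.W_Q(x), self.W_K(x), self.W_V(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)    # (B, L, L)
        attn = F.softmax(scores, dim=-1)
        h = self.W_O(attn @ v)                                    # (B, L, D)
        h = self.ff2(F.relu(self.ff1(h)))                         # (B, L, D)
        # Assumption: use the last position to predict the next word.
        return self.out(h[:, -1, :])                              # (B, V + 1)


sketch_model = SelfAttentionLM(vocab_size=5000, embedding_dim=64,
                               context_size=9, hidden_dim=128)
tokens = torch.randint(1, 5001, (32, 9))  # fake batch of token ids
logits = sketch_model(tokens)
print(logits.shape)  # torch.Size([32, 5001])
```

No causal mask is used in this sketch because the target is the word after the whole context, so every context position may attend to every other one.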

Instructions:
- Two implementations of the self-attention layer are required: one using loops (inefficient, but easy to understand) and one using matrix operations (efficient, but harder to understand). Use slide 36 as a reference.

- Add an assert to check that both implementations produce exactly the same result.

- For training, use only the matrix implementation.
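The comparison requested above could look like the sketch below for a single sequence. The function names `attention_loop` and `attention_matrix`, the random weight matrices, and the scaling by the square root of the dimension are assumptions made for illustration; the slides may define the computation differently. Because the two versions sum floating-point numbers in a different order, the check here uses `torch.allclose` on `float64` tensors rather than bitwise equality.

```python
import math
import torch

torch.manual_seed(0)


def attention_loop(x, W_Q, W_K, W_V, W_O):
    """Self-attention over one sequence x of shape (L, D), using explicit loops."""
    L, D = x.shape
    outputs = []
    for i in range(L):
        q = x[i] @ W_Q                      # query for position i
        scores = []
        for j in range(L):
            k = x[j] @ W_K                  # key for position j
            scores.append((q @ k) / math.sqrt(D))
        probs = torch.softmax(torch.stack(scores), dim=0)
        new_x = torch.zeros(D, dtype=x.dtype)
        for j in range(L):
            v = x[j] @ W_V                  # value for position j
            new_x = new_x + probs[j] * v    # weighted sum of values
        outputs.append(new_x @ W_O)
    return torch.stack(outputs)             # (L, D)


def attention_matrix(x, W_Q, W_K, W_V, W_O):
    """Same computation expressed with matrix products."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / math.sqrt(x.shape[-1])   # (L, L)
    probs = torch.softmax(scores, dim=-1)
    return probs @ V @ W_O                      # (L, D)


L, D = 9, 64
x = torch.randn(L, D, dtype=torch.float64)
W_Q, W_K, W_V, W_O = (torch.randn(D, D, dtype=torch.float64) / math.sqrt(D)
                      for _ in range(4))

out_loop = attention_loop(x, W_Q, W_K, W_V, W_O)
out_matrix = attention_matrix(x, W_Q, W_K, W_V, W_O)
assert torch.allclose(out_loop, out_matrix)
```

A batched version would add a leading batch dimension and replace `K.T` with `K.transpose(-2, -1)`.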

In [1]:
import os
import sys
import random
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

In [2]:
# Global variables

# Vocabulary
vocab_size = 5000
context_size = 9

# Training
batch_size = 32
epochs = 10
lr = 0.01

# Model
embedding_dim = 64
hidden_dim = 128

## Download and load the dataset

In [3]:
# Check if download is necessary
if not os.path.exists("67724.txt.utf-8"):
    print("Downloading Gutenberg texts")

    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

In [4]:
with open("67724.txt.utf-8", "r", encoding="utf-8") as f:
    text = f.read()
with open("67725.txt.utf-8", "r", encoding="utf-8") as f:
    text += f.read()

paragraphs = text.split("\n\n")

len(paragraphs)

4969

In [5]:
# Checking the text
print(paragraphs[0])

The Project Gutenberg eBook of O Guarany: romance brazileiro, Vol. 1 (of 2)
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.


In [6]:
cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

# Print 5 random paragraphs
num_paragraphs = len(cleaned_paragraphs)
for i in range(0,5):
    idx = random.randrange(num_paragraphs)
    print(f"{cleaned_paragraphs[idx]}\n")

print("Number of paragraphs: " + str(num_paragraphs))

len(cleaned_paragraphs)

PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

O indio ajoelhou aos pés de Cecilia; sem animar-se a levantar os olhos para ella apresentou-lhe o cabaz de palha: abrindo a tampa, a menina assustou-se, mas sorrio; um enxame de beija-flôres esvoaçava dentro; alguns conseguirão escapar-se.

 X

--Sim! respondeu a menina tomando-lhe as mãos; Cecilia fica comtigo e não te deixará. Tu és rei destas florestas, destes campos, destas montanhas; tua irmã te acompanhará!

Então tratou de recuperar as forças que havia perdido, e tudo quanto a floresta lhe offerecia de saboroso nutriente servio a esse banquete da vida, em que o selvagem festejava a sua victoria sobre a morte e o veneno.

Number of paragraphs: 4892


4892

## Dataset analysis

In [7]:
# Count the words in the dataset
from collections import Counter
import re

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(r'\w+', text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

12603
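Since the corpus has more distinct words than `vocab_size`, it can help to check what fraction of all word occurrences the most frequent words cover. The helper below, `vocab_coverage`, is a hypothetical function shown on toy counts; it is not part of the original notebook.

```python
from collections import Counter


def vocab_coverage(word_counts, vocab_size):
    """Fraction of word occurrences covered by the `vocab_size` most frequent words."""
    total = sum(word_counts.values())
    covered = sum(count for _, count in word_counts.most_common(vocab_size))
    return covered / total


# Toy example: 10 occurrences in total, the top 2 words account for 8 of them.
toy_counts = Counter({"a": 5, "que": 3, "o": 1, "pery": 1})
coverage = vocab_coverage(toy_counts, 2)
print(f"Coverage: {coverage:.2f}")  # Coverage: 0.80
```

In the notebook, the same function could be called as `vocab_coverage(word_counts, vocab_size)`.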

## Building a vocabulary

In [8]:
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [9]:
print(vocab)

{'a': 1, 'que': 2, 'o': 3, 'de': 4, 'e': 5, 'se': 6, 'um': 7, 'do': 8, 'não': 9, 'uma': 10, 'da': 11, 'os': 12, 'com': 13, 'sua': 14, 'para': 15, 'seu': 16, 'pery': 17, 'as': 18, 'em': 19, 'no': 20, 'por': 21, 'ao': 22, 'como': 23, 'lhe': 24, 'd': 25, 'á': 26, 'tinha': 27, 'era': 28, 'cecilia': 29, 'na': 30, 'é': 31, 'sobre': 32, 'mas': 33, 'elle': 34, 'the': 35, 'dos': 36, 'indio': 37, 'me': 38, 'seus': 39, 'mais': 40, 'antonio': 41, 'quando': 42, 'alvaro': 43, 'disse': 44, 'das': 45, 'vos': 46, 'of': 47, 'ella': 48, 'olhos': 49, 'te': 50, 'senhora': 51, 'menina': 52, 'pela': 53, 'tu': 54, 'depois': 55, 'nos': 56, 'isabel': 57, 'havia': 58, 'gutenberg': 59, 'fidalgo': 60, 'casa': 61, 'estava': 62, 'ainda': 63, 'tempo': 64, 'já': 65, 'mariz': 66, 'project': 67, 'aventureiros': 68, 'momento': 69, 'loredano': 70, 'só': 71, 'mesmo': 72, 'italiano': 73, 'todos': 74, 'pelo': 75, 'vida': 76, 'sem': 77, 'dous': 78, 'to': 79, 'homem': 80, 'eu': 81, 'porque': 82, 'or': 83, 'meio': 84, 'you': 85

In [10]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(r'\w+', sentence.lower())]

encode_sentence(cleaned_paragraphs[20], vocab)

[0,
 139,
 4376,
 19,
 0,
 6,
 44,
 110,
 269,
 259,
 2662,
 10,
 1064,
 0,
 2,
 186,
 130,
 280,
 3,
 2257,
 6,
 0,
 1,
 2665]
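Going in the opposite direction, from token ids back to words, is easiest with an inverse dictionary built once. The names `build_inverse_vocab` and `decode_tokens` are illustrative assumptions, and the toy vocabulary stands in for the real `vocab`; index 0 is mapped to an unknown-word marker, matching the convention used by `encode_sentence`.

```python
def build_inverse_vocab(vocab, unknown_token="<UNKNOWN>"):
    """Map each index back to its word; index 0 stands for unknown words."""
    inverse = {index: word for word, index in vocab.items()}
    inverse[0] = unknown_token
    return inverse


def decode_tokens(tokens, inverse_vocab, unknown_token="<UNKNOWN>"):
    """Convert a list of token ids into a list of words."""
    return [inverse_vocab.get(token, unknown_token) for token in tokens]


# Toy vocabulary with the same layout as `vocab` (indices start at 1).
toy_vocab = {"a": 1, "que": 2, "o": 3}
toy_inverse = build_inverse_vocab(toy_vocab)
decoded = decode_tokens([3, 0, 1], toy_inverse)
print(decoded)  # ['o', '<UNKNOWN>', 'a']
```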

## Dataset class

In [11]:
# Dataset class
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    # Define your data here
    self.tokens = []
    self.targets = []
    self.removed = 0
    for paragraph in paragraphs:
      encoded = encode_sentence(paragraph, vocab)
      # Do not add examples with unknown tokens
      if 0 not in encoded:
        for i in range(len(encoded) - context):
          self.tokens.append(encoded[i:i+context])
          self.targets.append(encoded[i+context])
      else:
        self.removed+=1

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])
  
  def __removed__(self):
    return self.removed
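The sliding-window logic inside `CustomDataset.__init__` can be seen in isolation on a short list of token ids. The function name `sliding_windows` and the toy ids are assumptions for illustration only.

```python
def sliding_windows(encoded, context):
    """Split a token list into (context window, next token) pairs."""
    num_examples = len(encoded) - context
    inputs = [encoded[i:i + context] for i in range(num_examples)]
    targets = [encoded[i + context] for i in range(num_examples)]
    return inputs, targets


inputs, targets = sliding_windows([4, 8, 15, 16, 23, 42], context=3)
print(inputs)   # [[4, 8, 15], [8, 15, 16], [15, 16, 23]]
print(targets)  # [16, 23, 42]

# A paragraph with at most `context` tokens yields no examples at all.
short_inputs, short_targets = sliding_windows([4, 8, 15], context=3)
```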

In [12]:
# Train/Validation split
train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

print("Training samples: " + str(len(train_data)))
print("Validation samples: " + str(len(val_data)))

Training samples: 3913
Validation samples: 979


In [13]:
train_dataset = CustomDataset(train_data, vocab, context_size)
val_dataset = CustomDataset(val_data, vocab, context_size)

# Samples
print("Training samples:")
print(train_dataset[:5])
print()
print("Validation samples:")
print(val_dataset[:5])
print()
print("Removed training samples: " + str(train_dataset.removed))
print("Removed validation samples: " + str(val_dataset.removed))

Training samples:
(tensor([[4082,    2,   25,   41,    4,   66, 4083,  393, 3119],
        [   2,   25,   41,    4,   66, 4083,  393, 3119,   50],
        [  25,   41,    4,   66, 4083,  393, 3119,   50,    3],
        [  41,    4,   66, 4083,  393, 3119,   50,    3,    2],
        [   4,   66, 4083,  393, 3119,   50,    3,    2, 1336]]), tensor([  50,    3,    2, 1336,   21]))

Validation samples:
(tensor([[  31, 1602,  839,    1,  105, 1443,   21,   14,  117],
        [1602,  839,    1,  105, 1443,   21,   14,  117,   55],
        [ 839,    1,  105, 1443,   21,   14,  117,   55,    4],
        [   1,  105, 1443,   21,   14,  117,   55,    4,  953],
        [ 105, 1443,   21,   14,  117,   55,    4,  953, 2330]]), tensor([  55,    4,  953, 2330,  234]))

Removed training samples: 2561
Removed validation samples: 661


In [14]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

sample = next(iter(train_loader))
print(sample)

[tensor([[ 451,  173,   43,    4, 1078, 1990,    2,   46, 3505],
        [ 906,    6,   15, 1874,    3, 1558,    4,   88,  527],
        [   3,  130, 1607,    3,  905,    5, 1110,   23,   10],
        [   1,   52,  807,  720,   12,   39,   49,  787, 2812],
        [3762,   77,    3, 4918,  280,   13,    1,  781, 1035],
        [ 217,  368,    3, 1366,    2,   39,  140,  661,  525],
        [   2, 1026,   11,   51,    5,   17, 3978,    7,   71],
        [1258,  443, 1340,   13,   10, 1303,  779,    1,  311],
        [  85,  325,  252,  929,  109,   35,  463,  375,   85],
        [ 125,  665,    5,   22,   72,   64,  761,    5,  454],
        [  94,    7,  514,    4, 2061,   11,  177,    4,  714],
        [   3,   37,  808,    8,  535,    5,  495,    6,   22],
        [   1,  165,    4,   17,   17,    9,  569,    2,    1],
        [ 876,  877,  484,   79,   35,   67,   59,  559,  560],
        [  71,  562,   10,  149,    2,  153,  182,    9,   46],
        [   5, 1868,  228,  381,   17, 

## Model

In [15]:
class BengioModel(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size+1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        # Linear layer with Relu activation
        out = self.linear1(embeds)
        out = F.relu(out)
        # Second layer
        out = self.linear2(out)
        return out

In [16]:
model = BengioModel(vocab_size, embedding_dim, context_size, hidden_dim)

In [17]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

print(input.shape)
print(target.shape)

torch.Size([32, 9])
torch.Size([32])


In [18]:
output = model(input)

In [19]:
output.argmax(dim=1)

tensor([3089,  777, 4517,  178, 3280, 1065, 1214, 4583, 3269, 1696,  346, 4055,
         781,  313, 1636, 3203,  360, 4517, 3777, 1448, 2141,  452, 3063, 3992,
        4490, 2067, 4317, 1478, 4295, 1396,  792,  398])

In [20]:
target

tensor([  34,    4,   41,    4, 1023,  918,    7,  125,   25, 2632, 3496, 2893,
           1,  319,    1,    4,    2,  906,  141,    6,  950, 1622,    2, 1571,
         302, 1189,  499,  126,   13,  326,   41, 3841])

## Training

In [21]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [22]:
# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr)

model.to(device)

BengioModel(
  (embeddings): Embedding(5001, 64)
  (linear1): Linear(in_features=576, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=5001, bias=True)
)

In [23]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage:
total_params = count_parameters(model)
print(f'The model has a total of {total_params:,} parameters.')

The model has a total of 1,039,049 parameters.


In [24]:
# Initial Perplexity and Loss
# Before training
model.eval()

loss = 0
perp = 0

with torch.no_grad():
    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        loss += criterion(outputs, targets).item()

loss /= len(train_loader)
perp = torch.exp(torch.tensor(loss))

print(f'Initial Loss: {loss:.4f}')
print(f'Initial Perplexity: {perp:.4f}')

Initial Loss: 8.5382
Initial Perplexity: 5106.0981
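As a sanity check on these initial numbers, a model that assigns equal probability to all 5001 output classes (`vocab_size + 1`) would have a cross-entropy of ln(5001), about 8.5174, and a perplexity of 5001. The measured initial loss of 8.5382 is close to that value, which is what one would expect from an untrained network. The short calculation below reproduces the baseline; the variable names are illustrative.

```python
import math

num_classes = 5000 + 1                   # vocab_size + 1 (index 0 = unknown)
uniform_loss = math.log(num_classes)     # cross-entropy of a uniform prediction
uniform_perplexity = math.exp(uniform_loss)

print(f"Uniform loss: {uniform_loss:.4f}")
print(f"Uniform perplexity: {uniform_perplexity:.1f}")
```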


In [25]:
# Training Loop

for epoch in range(epochs):
  model.train()

  # Metrics
  epoch_loss = 0
  epoch_correct = 0
  epoch_samples = 0

  for inputs, targets in train_loader:
        inputs = inputs.to(device)  # Move input data to the device
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Loss
        epoch_loss += loss.item()

        # Predicted
        _, predicted = torch.max(outputs, 1)
        epoch_correct += (predicted == targets).sum().item()
        epoch_samples += targets.size(0)

  # Calculate average loss and accuracy for epoch
  avg_loss = epoch_loss / len(train_loader)
  acc = epoch_correct / epoch_samples

  # Perplexity
  perp = torch.exp(torch.tensor(avg_loss))

  # Print epoch statistics (accuracy is a fraction in [0, 1], not a percentage)
  print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}, Perplexity: {perp:.4f}')


Epoch [1/10], Loss: 7.6150, Accuracy: 0.03, Perplexity: 2028.4803
Epoch [2/10], Loss: 6.2619, Accuracy: 0.07, Perplexity: 524.1955
Epoch [3/10], Loss: 5.5638, Accuracy: 0.11, Perplexity: 260.8195
Epoch [4/10], Loss: 5.0556, Accuracy: 0.16, Perplexity: 156.8942
Epoch [5/10], Loss: 4.6801, Accuracy: 0.20, Perplexity: 107.7773
Epoch [6/10], Loss: 4.4508, Accuracy: 0.23, Perplexity: 85.6932
Epoch [7/10], Loss: 4.3014, Accuracy: 0.26, Perplexity: 73.8021
Epoch [8/10], Loss: 4.1567, Accuracy: 0.28, Perplexity: 63.8601
Epoch [9/10], Loss: 4.1041, Accuracy: 0.30, Perplexity: 60.5870
Epoch [10/10], Loss: 4.0702, Accuracy: 0.29, Perplexity: 58.5697


## Evaluation

In [30]:
import torch.nn.functional as F

model.eval()

loss_sum = 0
total_sum = 0
correct_sum = 0

with torch.no_grad():
    for inputs, targets in val_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = F.cross_entropy(outputs, targets)
        loss_sum += loss

        # Get the predicted labels
        _, predicted = torch.max(outputs, 1)

        total_sum += targets.size(0)
        correct_sum += (predicted == targets).sum().item()

# Calculate accuracy
acc = 100 * correct_sum/total_sum

# Calculate average perplexity
average_loss = loss_sum / len(val_loader)
average_perplexity = torch.exp(average_loss)

print(f'Test Accuracy: {acc:.2f}%')
print(f'Average Perplexity: {average_perplexity:.2f}')

print(len(val_loader))

Test Accuracy: 3.64%
Average Perplexity: 47981.93
51
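The evaluation cell averages the per-batch mean losses, so a smaller final batch counts as much as a full one. The difference is usually small, but a sample-weighted average is a possible refinement. The sketch below uses made-up batch losses and sizes, and the helper name `weighted_perplexity` is hypothetical.

```python
import math


def weighted_perplexity(batch_mean_losses, batch_sizes):
    """Perplexity from per-batch mean losses, weighted by batch size."""
    total_loss = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    mean_loss = total_loss / sum(batch_sizes)
    return math.exp(mean_loss), mean_loss


# Toy numbers: one full batch of 32 samples and a final batch of 8 samples.
losses = [2.0, 4.0]
sizes = [32, 8]

perplexity, weighted_mean = weighted_perplexity(losses, sizes)
unweighted_mean = sum(losses) / len(losses)

print(f"Unweighted mean loss: {unweighted_mean:.2f}")
print(f"Weighted mean loss:   {weighted_mean:.2f}")
```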


## Usage example

In [27]:
# Code adapted from Cesar Bastos's implementation
from colorama import Fore, Style

text = cleaned_paragraphs
model.to(device)
def generate_text(model, vocab, text, max_length, context_size):
    words = []
    # Ensure there are enough words for at least one sequence
    while len(words) < context_size:
        random_number = random.randrange(len(text))  # any valid paragraph index
        words = encode_sentence(text[random_number], vocab)
        if not words:
            words = []
            continue  # Skip if the sentence cannot be encoded
        words = words[:context_size]
        #print(words)
        if any(token == 0 for token in words):
            words = []
            continue  # Skip if any token is zero (assuming 0 is a special token)
        context = words

    print(f"Frase: {text[random_number]}")
    print(words)

    for _ in range(max_length):
        words_tensor = torch.tensor(context[-context_size:], dtype=torch.long).unsqueeze(0).to(device)
        logits = model(words_tensor)
        probs = F.softmax(logits, dim=1)
        next_token = torch.multinomial(probs, num_samples=1)
        context.append(next_token.item())
        print(context)
    frase = []
    for i in context:  # Thanks to Ramon Abilio
        word = next((word for word, code in vocab.items() if code == i), "<UNKNOWN>")
        frase.append(word)

    print(f"{Fore.BLUE}{frase[:context_size]}{Style.RESET_ALL} {Fore.RED}{frase[-max_length:]}{Style.RESET_ALL} ")

context_size = 9
max_length= 10
generate_text(model, vocab, text, max_length, context_size)

Frase: Apenas concluio, a altivez do guerreiro desappareceu; ficou timido e modesto; já não era mais do que um barbaro em face de creaturas civilisadas, cuja superioridade de educação o seu instincto reconhecia.
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13, 4]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13, 4, 43]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13, 4, 43, 1]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13, 4, 43, 1, 213]
[116, 2110, 1, 1019, 8, 819, 737, 322, 2017, 1, 77, 4584, 177, 13, 4, 43, 1, 213, 418]
['apenas', 'concluio', 'a', 'altivez', 'do', 'guerreiro', 'd