## Exercício: Modelo de Linguagem (Bengio 2003) - MLP + Embeddings

Neste exercício iremos treinar uma rede neural similar a do Bengio 2003 para prever a próxima palavra de um texto, data as palavras anteriores como entrada. Esta tarefa é chamada de "Modelagem da Linguagem".

Portanto, você deve implementar o modelo de linguagem inspirado no artigo do Bengio, para prever a próxima palavra usando rede com embeddings e duas camadas.
Sugestão de alguns parâmetros:
* context_size = 9
* max_vocab_size = 3000
* embedding_dim = 64
* usar pontuação no vocabulário
* descartar qualquer contexto ou target que não esteja no vocabulário
* É esperado conseguir uma perplexidade da ordem de 50.
* Procurem fazer asserts para garantir que partes do seu programa estão testadas

Este enunciado não é fixo, podem mudar qualquer um dos parâmetros acima, mas procurem conseguir a perplexidade esperada ou menor.

Gerem alguns frases usando um contexto inicial e depois deslocando o contexto e prevendo a próxima palavra gerando frases compridas para ver se está gerando texto plausível.

Algumas dicas:
- Inclua caracteres de pontuação (ex: `.` e `,`) no vocabulário.
- Deixe tudo como caixa baixa (lower-case).
- A escolha do tamanho do vocabulario é importante: ser for muito grande, fica difícil para o modelo aprender boas representações. Se for muito pequeno, o modelo apenas conseguirá gerar textos simples.
- Remova qualquer exemplo de treino/validação/teste que tenha pelo menos um token desconhecido (ou na entrada ou na saída).
- Durante a depuração, faça seu dataset ficar bem pequeno, para que a depuração seja mais rápida e não precise de GPU. Somente ligue a GPU quando o seu laço de treinamento já está funcionando
- Não deixe para fazer esse exercício na véspera. Ele é trabalhoso.

Procure por `TODO` para entender onde você precisa inserir o seu código.

## Faz download e carrega o dataset

In [1]:
!wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
!wget https://www.gutenberg.org/ebooks/67725.txt.utf-8
!pip install colorama

--2024-03-13 20:23:45--  https://www.gutenberg.org/ebooks/67724.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.gutenberg.org/cache/epub/67724/pg67724.txt [following]
--2024-03-13 20:23:45--  http://www.gutenberg.org/cache/epub/67724/pg67724.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/67724/pg67724.txt [following]
--2024-03-13 20:23:46--  https://www.gutenberg.org/cache/epub/67724/pg67724.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 372908 (364K) [text/plain]
Saving to: ‘67724.txt.utf-8.13’


2024-03-13 20:23:46 (1.56 MB/s) - ‘67724.txt.u

In [2]:
text = open("67724.txt.utf-8","r").read()
text += open("67725.txt.utf-8","r").read()

paragraphs = text.split("\n\n")
len(paragraphs)

4969

In [3]:
import random

cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

# lowercase
cleaned_paragraphs = [paragraph.lower() for paragraph in paragraphs]

# Print 5 random paragraphs
num_paragraphs = len(cleaned_paragraphs)
for i in range(0,5):
    idx = random.randrange(num_paragraphs)
    print(f"{cleaned_paragraphs[idx]}\n")

print("Number of paragraphs: " + str(num_paragraphs))

--sim.

--não ha aqui culpados, sr. d. antonio de mariz, disse o italiano
animando-se progressivamente; ha homens que são tratados como cães;
que são sacrificados a um capricho vosso, e que estão resolvidos a
reivindicarem os seus fóros de homens e de christãos!

só quem tem viajado nos sertões e visto esses cardos gigantes, cujas
largas palmas crivadas de espinhos se entrelação estreitamente
formando uma alta muralha de alguns pés de grossura, poderá fazer
idéa da barreira impenetravel que cercava por todos os lados as pessoas
cuja voz pery ouvia sem distinguir as palavras.

e loredano dizendo esta palavra assentou a mão sobre um seixo que havia
ao lado.

a confiança que tinha, e com razão, no caracter de d. antonio
tranquillisava-o completamente; sabia que em caso algum o fidalgo
abriria um testamento que lhe fôra dado em deposito.

Number of paragraphs: 4969


## Análise do dataset

In [4]:
# Conta as palavras no dataset
from collections import Counter
import re

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(r'\w+', text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

12603

## Criando um vocabulário

In [5]:
vocab_size = 3000
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [6]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(r'\w+', sentence.lower())]

encode_sentence(cleaned_paragraphs[20], vocab)

[2660]

## Classe do dataset

In [7]:
# Dataset class

import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    # Define your data here
    self.tokens = []
    self.targets = []
    for paragraph in paragraphs:
      encoded = encode_sentence(paragraph, vocab)
      for i in range(len(encoded) - context):
        self.tokens.append(encoded[i:i+context])
        self.targets.append(encoded[i+context])

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])

In [8]:
from sklearn.model_selection import train_test_split

train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

In [9]:
context_size = 9

train_dataset = CustomDataset(train_data, vocab, context_size)
val_dataset = CustomDataset(val_data, vocab, context_size)

# Samples
print(train_dataset[0])
print(val_dataset[0])

(tensor([ 12,  68,   0,   1, 102,   5,   9,   0,   0]), tensor(2))
(tensor([ 42,  70,  20,  84, 568, 716,   0,   1, 293]), tensor(8))


In [10]:
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
sample = next(iter(train_loader))

## Model

In [11]:
import torch.nn as nn
import torch.nn.functional as F

class BengioModel(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size+1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        # Linear layer with Relu activation
        out = self.linear1(embeds)
        out = F.relu(out)
        # Second layer
        out = self.linear2(out)
        return out

In [12]:
embedding_dim = 64
hidden_dim = 128
model = BengioModel(vocab_size, embedding_dim, context_size, hidden_dim)

In [13]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

print(input.shape)
print(target.shape)

torch.Size([128, 9])
torch.Size([128])


In [14]:
output = model(input)

In [15]:
output.argmax(dim=1)

tensor([ 465, 1550,  655, 2727, 2901, 2278,  998, 2926, 1802, 2469, 2411, 2007,
        1469, 1806, 2453, 2926, 1484,  737, 2411,  695, 2989, 2999, 2576, 2771,
        2926, 1643,  696, 1176, 1389,  754, 2662, 1384, 1484, 2397, 2453, 2762,
        1176, 1802, 1802, 1802, 1720, 2989,  859, 1537, 2504, 1752, 1720,  927,
        2926, 2695, 2758,  254, 1864, 1550, 2901, 1427,  342,   73, 2563, 2926,
        2242,  695, 1802, 1802,  388, 2737, 1462,   73, 2894, 2771, 1677, 2677,
        1872, 2926,  342, 1537,  939, 2759, 2182, 2284, 1677, 1025, 2236, 1034,
        1155, 1883, 1856,  313,  313, 2453, 1176, 2878,  338, 1404, 1785, 1722,
        1752,  334, 1389, 2282,   75, 1518,  121, 2926, 2005, 2926, 2453, 1484,
        1484, 1738, 2759, 1802, 2685, 1802,  616, 1720, 1176,  906,  939, 2453,
        2989, 1806,  705,  938,  363,  711,  711, 1759])

In [16]:
target

tensor([   4,    4,   13,   10,   37,   11, 1053,  845,   70,   90,  401,    8,
        1095,   67,    0,    6,    7,  105,   75,    4,   42,    3,    0,    8,
           3, 1697,    1,  808,    5, 1371,    6,   13,   32,   83,   54, 1083,
          14,    7,    0,    4,    5,    2, 1057,    0, 1406,    4,  116,  692,
           4,    0,    5,    4,   21,  206,    0,    1,    0,   23,    3,    1,
        1324,    9,  958,    5,    0,    0, 2429, 2837,    1,   10, 2782,    0,
          27,   69,  168,    2,  379,    1,    2,    9,   16,  236, 2994,    5,
        1676,    0,    1,    3, 1589,  103,    8,    2,    2,  469,   41, 1460,
          11,  440,    8, 1505, 1048,   14,   61, 1222,    2, 2685,    5,    0,
           0,    1,    0,   12,  110,  273,  817,    1,    4,  270,    4,    0,
        2218,   53,  134, 2225, 2789,  136,  307,   49])

## Training

In [17]:
# Verifica se há uma GPU disponível e define o dispositivo para GPU se possível, caso contrário, usa a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [18]:
epochs = 10

# Learning rate
lr = 0.01

# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr)

model.to(device)

BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=3001, bias=True)
)

In [19]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Exemplo de uso:
total_params = count_parameters(model)
print(f'O modelo tem um total de {total_params:,} parâmetros.')

O modelo tem um total de 653,049 parâmetros.


In [20]:
# Initial Perplexity and Loss
# Before training
model.eval()

loss = 0
perp = 0

with torch.no_grad():
    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        loss += criterion(outputs, targets).item()

loss /= len(train_loader)
perp = torch.exp(torch.tensor(loss))

print(f'Initial Loss: {loss:.4f}')
print(f'Initial Perplexity: {perp:.4f}')

Initial Loss: 8.0086
Initial Perplexity: 3006.7974


In [21]:
# Training Loop

for epoch in range(epochs):
  model.train()

  # Metrics
  epoch_loss = 0
  epoch_correct = 0
  epoch_samples = 0

  for inputs, targets in train_loader:
        inputs = inputs.to(device)  # Move input data to the device
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Loss
        epoch_loss += loss.item()

        # Predicted
        _, predicted = torch.max(outputs, 1)
        epoch_correct += (predicted == targets).sum().item()
        epoch_samples += targets.size(0)

  # Calculate average loss and accuracy for epoch
  avg_loss = epoch_loss / len(train_loader)
  acc = epoch_correct / epoch_samples

  # Perplexity
  perp = torch.exp(torch.tensor(avg_loss))

  # Print epoch statistics
  print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}%, Perplexity: {perp:.4f}')


Epoch [1/10], Loss: 5.9605, Accuracy: 0.14%, Perplexity: 387.7994
Epoch [2/10], Loss: 5.3782, Accuracy: 0.15%, Perplexity: 216.6312
Epoch [3/10], Loss: 5.1707, Accuracy: 0.16%, Perplexity: 176.0393
Epoch [4/10], Loss: 5.0128, Accuracy: 0.17%, Perplexity: 150.3320
Epoch [5/10], Loss: 4.8967, Accuracy: 0.18%, Perplexity: 133.8424
Epoch [6/10], Loss: 4.7999, Accuracy: 0.19%, Perplexity: 121.5022
Epoch [7/10], Loss: 4.7320, Accuracy: 0.20%, Perplexity: 113.5183
Epoch [8/10], Loss: 4.6710, Accuracy: 0.21%, Perplexity: 106.8021
Epoch [9/10], Loss: 4.6319, Accuracy: 0.21%, Perplexity: 102.7060
Epoch [10/10], Loss: 4.5889, Accuracy: 0.21%, Perplexity: 98.3830


## Avaliação

In [22]:
import torch.nn.functional as F

model.eval()

loss_sum = 0
total_sum = 0
correct_sum = 0

with torch.no_grad():
    for inputs, targets in val_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = F.cross_entropy(outputs, targets)
        loss_sum += loss

        # Get the predicted labels
        _, predicted = torch.max(outputs, 1)

        total_sum += targets.size(0)
        correct_sum += (predicted == targets).sum().item()

# Calculate accuracy
acc = 100 * correct_sum/total_sum

# Calculate average perplexity
average_loss = loss_sum / len(val_loader)
average_perplexity = torch.exp(torch.tensor(average_loss))

print(f'Test Accuracy: {acc}%')
print(f'Average Perplexity: {average_perplexity}')

Test Accuracy: 15.575232121994713%
Average Perplexity: 340.0611267089844


  average_perplexity = torch.exp(torch.tensor(average_loss))


## Exemplo de uso

In [23]:
# Código adaptado da implementação do Cesar Bastos
from colorama import Fore, Style

text = cleaned_paragraphs
model.to(device)
def generate_text(model, vocab, text, max_length, context_size):
    words = []
    # Ensure there are enough words for at least one sequence
    while len(words) < context_size:
        random_number = random.randint(1, 4891)
        words = encode_sentence(text[random_number], vocab)
        if not words:
            words = []
            continue  # Skip if the sentence cannot be encoded
        words = words[:context_size]
        #print(words)
        if any(token == 0 for token in words):
            words = []
            continue  # Skip if any token is zero (assuming 0 is a special token)
        context = words

    print(f"Frase: {cleaned_paragraphs[random_number]}")
    print(words)

    for _ in range(max_length):
        words_tensor = torch.tensor(context[-context_size:], dtype=torch.long).unsqueeze(0).to(device)
        logits = model(words_tensor)
        probs = F.softmax(logits, dim=1)
        next_token = torch.multinomial(probs, num_samples=1)
        context.append(next_token.item())
        print(context)
    frase = []
    for i in context: ##Agradecimentos a Ramon Abilio
        word = next((word for word, code in vocab.items() if code == i), "<UNKNOWN>")
        frase.append(word)

    print(f"{Fore.BLUE}{frase[:context_size]}{Style.RESET_ALL} {Fore.RED}{frase[-max_length:]}{Style.RESET_ALL} ")

context_size = 9
max_length= 10
generate_text(model, vocab, text, max_length, context_size)

Frase: era o melhor leito que podia ter a menina no meio do deserto; puxou a
canôa, alcatifou o fundo com as folhas macias das palmeiras, e, tomando
cecilia nos braços, deitou-a no seu berço.
[28, 3, 506, 344, 2, 108, 152, 1, 52]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0, 0]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0, 0, 11]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0, 0, 11, 1991]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0, 0, 11, 1991, 4]
[28, 3, 506, 344, 2, 108, 152, 1, 52, 1, 674, 5, 24, 0, 0, 11, 1991, 4, 17]
[34m['era', 'o', 'melhor', 'leito', 'que', 'podia', 'ter', 'a', 'menina'][0m [31m['a', 'partio', 'e', 'lhe', '<UNKNOWN>', '<UNKNOWN>', 'da', 'escapou', 'de', 'pery'][0m 
