## Exercise: Language Model (Bengio 2003) - MLP + Embeddings

In this exercise we will train a neural network similar to the one in Bengio et al. (2003) to predict the next word of a text, given the previous words as input. This task is called "language modeling".

You should therefore implement a language model inspired by Bengio's paper that predicts the next word using a network with embeddings and two layers.
Some suggested parameters:
* context_size = 9
* max_vocab_size = 3000
* embedding_dim = 64
* include punctuation in the vocabulary
* discard any context or target that is not in the vocabulary
* A perplexity on the order of 50 is expected.
* Try to write asserts to make sure parts of your program are tested.

This specification is not fixed: you may change any of the parameters above, but try to reach the expected perplexity or lower.
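
Perplexity is the exponential of the average negative log-likelihood (the mean cross-entropy, in nats) that the model assigns to the correct next tokens. A minimal sketch in plain Python follows; the helper name `perplexity` is an assumption for illustration and is not part of the notebook:

```python
import math

def perplexity(log_probs):
    """Perplexity from the natural-log probabilities assigned to the correct tokens."""
    mean_nll = -sum(log_probs) / len(log_probs)  # mean cross-entropy in nats
    return math.exp(mean_nll)

# A model that always gives the correct token probability 1/50 has perplexity 50.
example = perplexity([math.log(1 / 50)] * 10)
print(round(example, 4))
```

Read this way, a perplexity of 50 means the model is on average as uncertain as if it were choosing uniformly among 50 words.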

Generate some sentences by starting from an initial context, then repeatedly predicting the next word and shifting the context, producing long sentences to check whether the generated text is plausible.
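
The generation procedure described above (predict a word, append it, shift the context window by one position) can be sketched independently of any trained model. Here `predict_next` is a hypothetical stand-in for the network: it receives the last `context_size` tokens and returns the next one.

```python
def generate(initial_context, predict_next, num_tokens):
    """Autoregressive generation with a fixed-size sliding context window."""
    context_size = len(initial_context)
    tokens = list(initial_context)
    for _ in range(num_tokens):
        window = tokens[-context_size:]      # the most recent context_size tokens
        tokens.append(predict_next(window))  # the prediction joins the next context
    return tokens

# Toy predictor (an assumption for the demo): sum of the window modulo 10.
def toy_predictor(window):
    return sum(window) % 10

print(generate([1, 2, 3], toy_predictor, 4))  # [1, 2, 3, 6, 1, 0, 7]
```

With a real model, `predict_next` would run the network on the window and either take the most likely token or sample from the predicted distribution.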

Some tips:
- Include punctuation characters (e.g. `.` and `,`) in the vocabulary.
- Convert everything to lower case.
- The choice of vocabulary size matters: if it is too large, it becomes hard for the model to learn good representations. If it is too small, the model will only be able to generate simple texts.
- Remove any training/validation/test example that contains at least one unknown token (in either the input or the output).
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Only turn on the GPU once your training loop is working.
- Do not leave this exercise for the night before. It is laborious.

Search for `TODO` to find where you need to insert your code.

In [1]:
import os
import sys
import random
import torch.nn as nn
import torch.nn.functional as F
import time
from sklearn.model_selection import train_test_split

In [2]:
# Global variables

# Vocabulary
vocab_size = 5000
context_size = 5
pattern = r'\w+|[,;.:!?\']'

# Training
batch_size = 128
epochs = 10
lr = 0.1

# Model
embedding_dim = 256
hidden_dim = 128

In [3]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    %pip install colorama

    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_2_3"
    os.chdir(project_folder)
    !ls -la

## Download and load the dataset

In [4]:
# Check if download is necessary
if not os.path.exists("67724.txt.utf-8"):
    print("Downloading Gutenberg texts")

    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

In [5]:
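# Read both volumes with an explicit encoding and close the files afterwards
with open("67724.txt.utf-8", "r", encoding="utf-8") as f:
    text = f.read()
with open("67725.txt.utf-8", "r", encoding="utf-8") as f:
    text += f.read()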

paragraphs = text.split("\n\n")

len(paragraphs)

4969

In [6]:
# Checking the text
print(paragraphs[0])

The Project Gutenberg eBook of O Guarany: romance brazileiro, Vol. 1 (of 2)
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.


In [7]:
cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

# Print 5 random paragraphs
num_paragraphs = len(cleaned_paragraphs)
for i in range(0,5):
    idx = random.randrange(num_paragraphs)
    print(f"{cleaned_paragraphs[idx]}\n")

print("Number of paragraphs: " + str(num_paragraphs))

len(cleaned_paragraphs)

O indio hesitou de novo:

Tinha então, sempre em sonho, um desses assomos de colera de rainha offendida, que fazia arquear as sobrancelhas louras, e bater sobre a relva a ponta de um pézinho de menina.

--Elles me temem, dizes tu; mas desde o momento em que se julgarem offendidos por mim soffrerão tudo para vingar-se.

Parecia que devião ser seis horas da tarde, e que o dia cahindo envolvia a terra nas sombras pardacentas do occaso.

O escudeiro, que depois de sua conversa com mestre Nunes tinha adormecido, fôra despertado de repente pelas imprecações e gritos que soltavão os aventureiros quando a agua começou a invadir as esteiras em que esta vão deitados.

Number of paragraphs: 4892


4892

## Dataset analysis

In [8]:
# Count the words in the dataset
from collections import Counter
import re

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(pattern, text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

12610
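
To make the tokenization concrete, the sketch below applies the same regular expression used for `pattern` to a fragment of one corpus sentence. Words are matched by `\w+` and each listed punctuation mark becomes its own token; characters outside both alternatives (such as `-`) are dropped. The helper name `tokenize` is an assumption for illustration.

```python
import re

pattern = r'\w+|[,;.:!?\']'  # same expression as the notebook's global `pattern`

def tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return re.findall(pattern, text.lower())

print(tokenize("--Elles me temem, dizes tu;"))
# ['elles', 'me', 'temem', ',', 'dizes', 'tu', ';']
```

One consequence is that hyphenated forms such as `vingar-se` are split into two separate tokens, `vingar` and `se`.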

## Building a vocabulary

In [9]:
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [10]:
print(f"Most Frequent Words: {most_frequent_words[:10]}")
print(f"Vocabulary Size: {len(vocab)}")

Most Frequent Words: ['.', ',', 'a', 'que', 'o', 'de', 'e', 'se', ';', 'um']
Vocabulary Size: 5000


In [11]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(pattern, sentence.lower())]

print(cleaned_paragraphs[20])
print(encode_sentence(cleaned_paragraphs[20], vocab))

 Publicando este livro em 1857, se disse ser aquella primeira edição uma prova typographica, que algum dia talvez o autor se dispuzesse a rever.
[0, 146, 4383, 23, 0, 2, 8, 50, 117, 276, 266, 2669, 13, 1071, 0, 2, 4, 193, 137, 287, 5, 2264, 8, 0, 3, 2672, 1]
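
The id convention matters later: ids start at 1 and id 0 is reserved for out-of-vocabulary tokens, which is why the embedding table in the model has `vocab_size + 1` rows. A toy illustration with a made-up token list (not corpus data) follows; `toy_vocab` and `encode` are names chosen for this example only:

```python
from collections import Counter

tokens = ["a", "a", "a", "b", "b", "c"]   # made-up tokens, for illustration only
toy_vocab_size = 2                        # keep only the 2 most frequent tokens

counts = Counter(tokens)
most_frequent = [tok for tok, _ in counts.most_common(toy_vocab_size)]
toy_vocab = {tok: i for i, tok in enumerate(most_frequent, 1)}  # ids start at 1

def encode(seq, vocab):
    """Map tokens to ids; anything outside the vocabulary becomes 0."""
    return [vocab.get(tok, 0) for tok in seq]

print(toy_vocab)                           # {'a': 1, 'b': 2}
print(encode(["a", "c", "b"], toy_vocab))  # [1, 0, 2]
```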


## Dataset class

In [12]:
# Dataset class
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    self.paragraphs = paragraphs
    self.vocab = vocab
    self.context = context
    self.tokens, self.targets = self.setup()

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])
  
  def setup(self):
    tokens = []
    targets = []
    for paragraph in self.paragraphs:
      encoded = encode_sentence(paragraph, self.vocab)
      
      # If paragraph is smaller than the context, skip it.
      if len(encoded) < self.context + 1:
          continue

      for i in range(len(encoded) - self.context):
        tks = encoded[i:i+self.context]
        tgt = encoded[i+self.context]
        # Only add if there are no unknown tokens in both context and target.
        bad_token = 0
        if not (bad_token in tks or tgt == bad_token):
          tokens.append(tks)
          targets.append(tgt)
    return tokens, targets
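
Following the suggestion to guard the pipeline with asserts, the window construction can be checked on a tiny hand-made sequence. `make_windows` below is a hypothetical standalone version of the loop in `setup`, assuming id 0 marks an unknown token:

```python
def make_windows(encoded, context, unk=0):
    """Return (contexts, targets), skipping any window that touches an unknown token."""
    contexts, targets = [], []
    for i in range(len(encoded) - context):
        ctx, tgt = encoded[i:i + context], encoded[i + context]
        if unk not in ctx and tgt != unk:
            contexts.append(ctx)
            targets.append(tgt)
    return contexts, targets

contexts, targets = make_windows([5, 3, 0, 7, 8, 9, 4], context=2)
assert contexts == [[7, 8], [8, 9]]
assert targets == [9, 4]
# A sequence shorter than context + 1 yields no examples.
assert make_windows([1, 2], context=2) == ([], [])
```

A single unknown token removes every window that overlaps it, so with a context of size `c` it can discard up to `c + 1` examples.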


In [13]:
# Train/Validation split
train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

train_dataset = CustomDataset(train_data, vocab, context_size)
val_dataset = CustomDataset(val_data, vocab, context_size)

# Counting all Samples
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print()
print(f"Training dataset samples: {len(train_dataset)}")
print(f"Validation dataset samples: {len(val_dataset)}")

Training samples: 3913
Validation samples: 979

Training dataset samples: 59646
Validation dataset samples: 16215


In [14]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)  # no need to shuffle for evaluation

sample = next(iter(train_loader))
print(sample)

[tensor([[   5,  328,    4, 1239,    5],
        [1716,  332,  116,   41, 2215],
        [1455,    2,   13, 2871,    2],
        [ 641,    2,   41,   74,   66],
        [   5, 1017,  657,    1,  384],
        [   5,   97,   23,  810,    6],
        [  40,    5, 2087, 1220,    8],
        [  10,  389,    9,   39,   12],
        [  19, 4941,    6,  999,    4],
        [ 215,    2,  562,    3, 1046],
        [ 344,   42,   45, 3599,  507],
        [ 124,   23,   48,   10,  320],
        [3630,   11, 3017, 1084,    3],
        [   7, 1207,  290,   22,  837],
        [  13, 2255, 1287,    4, 2506],
        [4687,   46,  112,   25,    1],
        [  21,  740,   15,  584,   13],
        [   2,    5,  378,    2,    5],
        [ 507,    6,   17,   83, 4467],
        [   2,   62,   11,    4,    8],
        [4309,    6,  152,    4,    8],
        [  15,   75, 1560,    2, 1594],
        [4215,    8,   27,   19,  496],
        [1347,   16,   13, 1310,  786],
        [   5,  250,  956,    9,    7],

## Model

In [15]:
class BengioModel(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.relu = torch.nn.ReLU()
        # Tanh non-linearity as described in the article
        #self.tanh = torch.nn.Tanh()
        self.linear2 = nn.Linear(h, vocab_size+1)
        # Log-softmax turns the scores into log-probabilities
        self.logSoftMax = torch.nn.LogSoftmax(dim=1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        # Linear layer
        out = self.linear1(embeds)
        #out = self.tanh(out)
        out = self.relu(out)
        # Second layer
        out = self.linear2(out)
        # Log-softmax output (log-probabilities over the vocabulary)
        out = self.logSoftMax(out)
        return out

In [16]:
model = BengioModel(vocab_size, embedding_dim, context_size, hidden_dim)

In [17]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

print(input.shape)
print(target.shape)

torch.Size([128, 5])
torch.Size([128])


In [18]:
output = model(input)
print(output)
print(output.shape)

tensor([[-8.7467, -8.4369, -8.5924,  ..., -8.6832, -8.8771, -8.2656],
        [-8.7607, -8.6259, -8.8695,  ..., -8.6298, -9.1078, -8.6166],
        [-8.5961, -8.5810, -8.3906,  ..., -8.7745, -8.6335, -8.8381],
        ...,
        [-8.6303, -8.3665, -8.4601,  ..., -8.9469, -8.9174, -8.3805],
        [-8.9053, -8.6383, -8.6528,  ..., -8.4610, -8.8756, -8.5364],
        [-8.8552, -8.6704, -8.5204,  ..., -8.7545, -8.9742, -7.8670]],
       grad_fn=<LogSoftmaxBackward0>)
torch.Size([128, 5001])


In [19]:
output.argmax(dim=1)

tensor([4832, 4297,  663,  975, 2307, 1127,  100, 2795,  827, 4506,  824, 3160,
        3115, 2586, 3217, 4213,  891, 4221,   25, 4191, 4946, 1958, 2692, 3828,
        3119, 2736, 4236, 3566, 4429, 4228,  314, 3347,  511, 2377,  784,  524,
        4429, 4698, 1847, 4889, 2264,   15, 4429, 2997, 4278, 1898, 1010, 3098,
        4429, 3251,  865, 1847, 1898, 3612, 4476,  100, 1657, 2795, 2000, 3734,
        1847, 4436, 1847,  105, 2377, 1791, 1676, 2377, 2201, 1847,  174, 3924,
        2377, 2393, 3560, 1127, 1335,  511,  585, 2260, 1102, 1127, 2569, 4946,
        4838, 2377, 2245, 4148, 4946, 2812,  100, 4429, 2377, 4125, 4544, 4204,
        4946, 1996, 2384, 1898,  100, 4946, 3612, 1780, 1791, 4429, 2377, 3986,
         464,   25,  762, 4892,  131, 2377, 1333, 4893,  131,  100, 4889, 3625,
        3986, 1255, 1333, 3353, 2586, 1780, 2377, 3625])

In [20]:
target

tensor([ 452,   27,  226,   19,  508,    5,   26,    3,    5,    3,    5,   14,
           4,    4,  122,    2,    2,  510, 1439,   13,  689,    2,  137,  315,
         564,    4,   27,    7, 1289,    1,   29,  860, 3469,    5,    4,    8,
        4571, 1758, 1485,  180,    2, 1744,   29, 3808,    8,  514,   24,  118,
         663,    5, 2281,  375,  873,  153,    2,    4,    9,   10, 2194,    7,
        2168,    1,    4,  545,  357, 3059,    3,    1,   70,   15, 1459,   12,
         321,    6,   67,  784,  571,  110,   17, 2497, 3137,  405,  109,  183,
           3,    1,   46,   45,  245,   14,  114, 3900,    1,    5,    6, 2107,
           4,  207,  954,    6,    3,   42,    8,    5,    2,    8,   18,  134,
           7,    3,   32,   16,  717,  405,    1,  304,  361,  118,    8,  672,
           9,   40,   10,    2,   25,   28,    1,    7])

## Training

In [21]:
# Check whether a GPU is available and use it if so; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [22]:
# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
#optimizer = torch.optim.Adam(model.parameters(), lr)
optimizer = torch.optim.SGD(model.parameters(), lr)

model.to(device)

BengioModel(
  (embeddings): Embedding(5001, 256)
  (linear1): Linear(in_features=1280, out_features=128, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=128, out_features=5001, bias=True)
  (logSoftMax): LogSoftmax(dim=1)
)
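
Note that the model ends in `LogSoftmax` while `nn.CrossEntropyLoss` applies a log-softmax of its own. The loss is still correct because log-softmax is idempotent: log-probabilities already sum to one after exponentiation, so the second normalization subtracts zero. A plain-Python sketch of that property (no PyTorch involved; the scores are arbitrary):

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

scores = [2.0, -1.0, 0.5]   # arbitrary example scores
once = log_softmax(scores)
twice = log_softmax(once)

# Applying log-softmax a second time leaves the values unchanged (up to rounding).
print(max(abs(a - b) for a, b in zip(once, twice)))
```

Using `nn.NLLLoss` on the model's output, or removing the final `LogSoftmax` and keeping `nn.CrossEntropyLoss`, would avoid the redundant computation.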

In [23]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage example:
total_params = count_parameters(model)
print(f'The model has a total of {total_params:,} parameters.')

The model has a total of 2,089,353 parameters.
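
The reported total can be reproduced by hand from the layer shapes: the embedding table has `(vocab_size + 1) * embedding_dim` weights, and each linear layer has `in_features * out_features` weights plus `out_features` biases.

```python
vocab_size, context_size = 5000, 5
embedding_dim, hidden_dim = 256, 128
rows = vocab_size + 1  # +1 for the unknown-token id 0

embedding = rows * embedding_dim                                   # nn.Embedding weights
linear1 = context_size * embedding_dim * hidden_dim + hidden_dim   # weights + biases
linear2 = hidden_dim * rows + rows                                 # weights + biases

total = embedding + linear1 + linear2
print(f"{total:,}")  # 2,089,353
```

Most of the parameters sit in the embedding table and the output layer, both of which grow linearly with the vocabulary size.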


In [24]:
# Initial Perplexity and Loss
# Before training
model.eval()

loss = 0
perp = 0

with torch.no_grad():
    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        loss += criterion(outputs, targets).item()

loss /= len(train_loader)
perp = torch.exp(torch.tensor(loss))

print(f'Initial Loss: {loss:.4f}')
print(f'Initial Perplexity: {perp:.4f}')

Initial Loss: 8.5447
Initial Perplexity: 5139.2554
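
A quick sanity check on these numbers: a model that spreads probability uniformly over all 5001 output classes has cross-entropy ln(5001) and perplexity exactly 5001. The initial values reported above are close to that baseline, as expected for randomly initialized weights.

```python
import math

num_classes = 5000 + 1                  # vocab_size + 1 output classes
uniform_loss = math.log(num_classes)    # cross-entropy of a uniform prediction, in nats
uniform_perplexity = math.exp(uniform_loss)

print(f"{uniform_loss:.4f} {uniform_perplexity:.1f}")
```

An initial loss far above this baseline would usually point to a bug in the loss computation or to poorly scaled initial weights.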


In [25]:
# Training Loop
model.train()
for epoch in range(epochs):

  epoch_start = time.time()
  # Metrics
  epoch_loss = 0
  epoch_correct = 0
  epoch_samples = 0

  for inputs, targets in train_loader:
        inputs = inputs.to(device)  # Move input data to the device
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Loss
        epoch_loss += loss.item()

        # Predicted
        _, predicted = torch.max(outputs, 1)
        epoch_correct += (predicted == targets).sum().item()
        epoch_samples += targets.size(0)

  # Calculate average loss and accuracy for epoch
  avg_loss = epoch_loss / len(train_loader)
  acc = epoch_correct / epoch_samples

  # Perplexity
  perp = torch.exp(torch.tensor(avg_loss))

  epoch_end = time.time()
  epoch_time = epoch_end - epoch_start
  # Print epoch statistics
  # Print epoch statistics (acc is a fraction between 0 and 1, not a percentage)
  print(f'Epoch [{epoch+1}/{epochs}], Time:{epoch_time:.2f}, Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}, Perplexity: {perp:.4f}')


Epoch [1/10], Time:2.76, Loss: 6.4389, Accuracy: 0.11, Perplexity: 625.7480
Epoch [2/10], Time:2.84, Loss: 5.5617, Accuracy: 0.15, Perplexity: 260.2597
Epoch [3/10], Time:2.83, Loss: 5.1835, Accuracy: 0.18, Perplexity: 178.3056
Epoch [4/10], Time:2.83, Loss: 4.8998, Accuracy: 0.20, Perplexity: 134.2661
Epoch [5/10], Time:2.83, Loss: 4.6607, Accuracy: 0.22, Perplexity: 105.7100
Epoch [6/10], Time:2.86, Loss: 4.4458, Accuracy: 0.24, Perplexity: 85.2665
Epoch [7/10], Time:2.89, Loss: 4.2432, Accuracy: 0.26, Perplexity: 69.6273
Epoch [8/10], Time:2.85, Loss: 4.0552, Accuracy: 0.28, Perplexity: 57.6966
Epoch [9/10], Time:2.99, Loss: 3.8732, Accuracy: 0.30, Perplexity: 48.0979
Epoch [10/10], Time:2.94, Loss: 3.6974, Accuracy: 0.31, Perplexity: 40.3413


## Evaluation

In [26]:
model.eval()

loss_sum = 0
total_sum = 0
correct_sum = 0
eval_round = 0

loss = 0
perp = 0

with torch.no_grad():
    for inputs, targets in val_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        outputs = model(inputs)
        loss = criterion(outputs, targets)      
        loss_sum += loss

        # Get the predicted labels
        _, predicted = torch.max(outputs, 1)

        total_sum += targets.size(0)
        correct_sum += (predicted == targets).sum().item()
        eval_round += 1

# Calculate accuracy
acc = 100 * correct_sum / total_sum

# Calculate average perplexity
average_loss = loss_sum / len(val_loader)
average_perplexity = torch.exp(average_loss)

# These metrics are computed on the validation split
print(f'Validation Accuracy: {acc:.2f}%')
print(f'Average Loss: {average_loss:.2f}')
print(f'Average Perplexity: {average_perplexity:.2f}')

Validation Accuracy: 19.61%
Average Loss: 5.31
Average Perplexity: 202.43
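
One caveat about the averaging used here: dividing the sum of per-batch mean losses by `len(val_loader)` gives the last, usually smaller, batch the same weight as a full one. The difference is typically small, but a sample-weighted mean is exact. A sketch with made-up batch losses and sizes (not values from this run):

```python
def batch_mean(losses):
    """Unweighted mean of per-batch mean losses."""
    return sum(losses) / len(losses)

def sample_weighted_mean(losses, sizes):
    """Mean loss per sample: each batch mean is weighted by its batch size."""
    return sum(l * n for l, n in zip(losses, sizes)) / sum(sizes)

losses = [2.0, 2.0, 5.0]   # made-up per-batch mean losses
sizes = [128, 128, 1]      # the last batch holds a single sample

print(batch_mean(losses))                   # 3.0
print(sample_weighted_mean(losses, sizes))  # about 2.0117
```

In the evaluation loop this amounts to accumulating `loss.item() * targets.size(0)` and dividing by the total number of samples.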


## Usage example

In [27]:
# Code adapted from Cesar Bastos's implementation
from colorama import Fore, Style

text = cleaned_paragraphs
model.to(device)
def generate_text(model, vocab, text, max_length, context_size):
    context = []
    # Sample paragraphs until one yields a full context with no unknown tokens (id 0)
    while len(context) < context_size:
        random_number = random.randrange(len(text))
        words = encode_sentence(text[random_number], vocab)[:context_size]
        if len(words) == context_size and 0 not in words:
            context = words

    print(f"Sentence: {text[random_number]}")
    print(context)

    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            words_tensor = torch.tensor(context[-context_size:], dtype=torch.long).unsqueeze(0).to(device)
            # The model returns log-probabilities; softmax maps them back to probabilities
            log_probs = model(words_tensor)
            probs = F.softmax(log_probs, dim=1)
            # Sample the next token instead of always taking the most likely one
            next_token = torch.multinomial(probs, num_samples=1)
            context.append(next_token.item())
            print(context)

    # Map ids back to words (thanks to Ramon Abilio)
    id_to_word = {code: word for word, code in vocab.items()}
    frase = [id_to_word.get(i, "<UNKNOWN>") for i in context]

    print(f"{Fore.BLUE}{frase[:context_size]}{Style.RESET_ALL} {Fore.RED}{frase[-max_length:]}{Style.RESET_ALL} ")


max_length= 10
generate_text(model, vocab, text, max_length, context_size)

Sentence: Nesse momento os Aymorés preparavão settas inflammaveis para incendiar a casa de D. Antonio de Mariz; não podendo vencer o inimigo pelas armas, contavão destrui-lo pelo fogo.
[231, 76, 15, 217, 4265]
[231, 76, 15, 217, 4265, 151]
[231, 76, 15, 217, 4265, 151, 103]
[231, 76, 15, 217, 4265, 151, 103, 2288]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235, 1]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235, 1, 49]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235, 1, 49, 2]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235, 1, 49, 2, 39]
[231, 76, 15, 217, 4265, 151, 103, 2288, 36, 235, 1, 49, 2, 39, 26]
['nesse', 'momento', 'os', 'aymorés', 'preparavão'] ['this', 'ou', 'cortina', 'na', 'alguns', '.', 'alvaro', ',', 'mas', 'por']
