## Exercise: Language Model with Self-Attention

This exercise is similar to the one from the previous class, but we now train a neural network *with self-attention* to predict the next word of a text, given the preceding words as input.

The self-attention layer must implement (see slide 34):
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Feed-forward layer (2-layer MLP)

Instructions:
- Two implementations of the self-attention layer are required: one using loops (inefficient, but easy to understand) and one using matrix operations (efficient, but harder to understand). Use slide 36 as a reference.

- Add an assert to ensure that the two implementations produce exactly the same result.

- For training, use only the matrix implementation.
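
The two required implementations can be illustrated without any framework. The sketch below is a minimal pure-Python version, assuming identity projections (WQ = WK = WV = WO = I) and toy 2-dimensional inputs; the helper names `softmax`, `attention_loop`, `matmul`, and `attention_matrix` are hypothetical. Because the two versions may perform floating-point operations in a different order, the comparison uses a small tolerance rather than bit-for-bit equality.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_loop(seq):
    # One query at a time: score it against every key, then mix the values.
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in seq]
        probs = softmax(scores)
        e = [0.0] * len(q)
        for v, p in zip(seq, probs):
            for j, vj in enumerate(v):
                e[j] += p * vj
        out.append(e)
    return out

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_matrix(seq):
    # softmax(X X^T) X, computed with whole-matrix operations.
    transposed = [list(col) for col in zip(*seq)]
    scores = matmul(seq, transposed)
    probs = [softmax(row) for row in scores]
    return matmul(probs, seq)

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loop_out = attention_loop(seq)
matrix_out = attention_matrix(seq)
same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(loop_out, matrix_out)
    for a, b in zip(row_a, row_b)
)
assert same, "loop and matrix attention disagree"
```

In PyTorch the same check is usually written with `torch.allclose`, which also compares within a tolerance.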

## Imports

In [1]:
import os
import sys
import random
import time
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

## Global Variables and Initialization

In [2]:
# Global variables

# Vocabulary
vocab_size = 5000
context_size = 5
pattern = r'\w+|[,;.:!?\']'

# Training
batch_size = 128
epochs = 10
lr = 0.1

# Model
embedding_dim = 256
hidden_dim = 128

In [3]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    %pip install colorama

    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_2_3"
    os.chdir(project_folder)
    !ls -la

## Download and load the dataset

In [4]:
# Check if download is necessary
if not os.path.exists("67724.txt.utf-8"):
    print("Downloading Gutenberg texts")

    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

In [5]:
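# Read both volumes with an explicit encoding and close the files afterwards
with open("67724.txt.utf-8", "r", encoding="utf-8") as f:
    text = f.read()
with open("67725.txt.utf-8", "r", encoding="utf-8") as f:
    text += f.read()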

paragraphs = text.split("\n\n")

len(paragraphs)

4969

In [6]:
# Checking the text
print(paragraphs[0])

The Project Gutenberg eBook of O Guarany: romance brazileiro, Vol. 1 (of 2)
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.


In [7]:
cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

# Print 5 random paragraphs
num_paragraphs = len(cleaned_paragraphs)
for i in range(0,5):
    idx = random.randrange(num_paragraphs)
    print(f"{cleaned_paragraphs[idx]}\n")

print("Number of paragraphs: " + str(num_paragraphs))

len(cleaned_paragraphs)

VERME E FLOR

--E porque não confessais? Não vos mereço confiança? Tendes em mim um amigo.

Voltou o rosto e continuou a pensar em sua senhora, e a rever a sua imagem; debalde a menina selvagem, lhe apresentava um lindo fructo, um alimento, um vinho saboroso; elle não lhe dava attenção.

--Deus tenha sua alma!

Passeava pois embalando-se de novo nas suas esperanças, quando Martim Vaz, sahindo do alpendre, chegou-se a elle.

Number of paragraphs: 4892


4892

## Dataset analysis

In [8]:
# Count the words in the dataset
from collections import Counter
import re

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(pattern, text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

12610

## Building a vocabulary

In [9]:
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [10]:
print(f"Most Frequent Words: {most_frequent_words[:10]}")
print(f"Vocabulary Size: {len(vocab)}")

Most Frequent Words: ['.', ',', 'a', 'que', 'o', 'de', 'e', 'se', ';', 'um']
Vocabulary Size: 5000


In [11]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(pattern, sentence.lower())]

print(cleaned_paragraphs[20])
print(encode_sentence(cleaned_paragraphs[20], vocab))

 Publicando este livro em 1857, se disse ser aquella primeira edição uma prova typographica, que algum dia talvez o autor se dispuzesse a rever.
[0, 146, 4383, 23, 0, 2, 8, 50, 117, 276, 266, 2669, 13, 1071, 0, 2, 4, 193, 137, 287, 5, 2264, 8, 0, 3, 2672, 1]
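
As a small illustration of the encoding step, the sketch below applies the same regular expression to a toy vocabulary (the three-entry `toy_vocab` and the sample sentence are made up for the example). Words missing from the vocabulary map to index 0, which is why zeros appear in the output above.

```python
import re

pattern = r'\w+|[,;.:!?\']'

def encode_sentence(sentence, vocab):
    # Unknown tokens fall back to index 0.
    return [vocab.get(word, 0) for word in re.findall(pattern, sentence.lower())]

toy_vocab = {'.': 1, ',': 2, 'a': 3}  # hypothetical vocabulary
tokens = re.findall(pattern, "A casa, a flor.".lower())
encoded = encode_sentence("A casa, a flor.", toy_vocab)
print(tokens)   # ['a', 'casa', ',', 'a', 'flor', '.']
print(encoded)  # [3, 0, 2, 3, 0, 1]
```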


## Dataset class

In [12]:
# Dataset class
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    self.paragraphs = paragraphs
    self.vocab = vocab
    self.context = context
    self.tokens, self.targets = self.setup()

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])
  
  def setup(self):
    tokens = []
    targets = []
    for paragraph in self.paragraphs:
      encoded = encode_sentence(paragraph, self.vocab)
      
      # If paragraph is smaller than the context, skip it.
      if len(encoded) < self.context + 1:
          continue

      for i in range(len(encoded) - self.context):
        tks = encoded[i:i+self.context]
        tgt = encoded[i+self.context]
        # Only add if there are no unknown tokens in both context and target.
        bad_token = 0
        if not (bad_token in tks or tgt == bad_token):
          tokens.append(tks)
          targets.append(tgt)
    return tokens, targets


In [13]:
# Train/Validation split
train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

train_dataset = CustomDataset(train_data, vocab, context_size)
val_dataset = CustomDataset(val_data, vocab, context_size)

# Counting all Samples
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print()
print(f"Training dataset samples: {len(train_dataset)}")
print(f"Validation dataset samples: {len(val_dataset)}")

Training samples: 3913
Validation samples: 979

Training dataset samples: 59646
Validation dataset samples: 16215
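
To make the sample counts more concrete, the following sketch reproduces the sliding-window logic of `setup` on a toy encoded paragraph (the token ids are invented and the function name `make_windows` is hypothetical). A paragraph with n tokens yields at most n - context windows, and any window whose context or target contains the unknown token 0 is discarded.

```python
def make_windows(encoded, context, bad_token=0):
    # Returns (context, target) pairs that contain no unknown token.
    pairs = []
    for i in range(len(encoded) - context):
        tks = encoded[i:i + context]
        tgt = encoded[i + context]
        if bad_token in tks or tgt == bad_token:
            continue
        pairs.append((tks, tgt))
    return pairs

encoded = [5, 8, 2, 9, 0, 4, 7, 3, 6]  # hypothetical ids; 0 is the unknown token
pairs = make_windows(encoded, context=3)
for tks, tgt in pairs:
    print(tks, '->', tgt)
```

In this toy case a single unknown token removes four of the six candidate windows, which suggests why the dataset sizes are sensitive to `vocab_size` and `context_size`.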


In [14]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

sample = next(iter(train_loader))
print(sample)

[tensor([[ 287,  395,   59,  387,  124],
        [2495,  579,  125,  454,   20],
        [  55,   11,   80, 2766,   13],
        [ 874,   84, 1841,   22,  150],
        [ 629,   17,   57,   24,  221],
        [  98,   40,  590,    9,  533],
        [1052,  557,    3,  299,    2],
        [ 904,   34,   88,   52, 3936],
        [ 206,  201,  414,  269, 2001],
        [   9,    7, 2685,    6,  325],
        [   3,  481,   34,    3, 2002],
        [   2,    7,   62,    6,   97],
        [1606,   84, 4720,    8,   38],
        [  16,    3,   17, 1387,    7],
        [  12,  367,    2,  171,    3],
        [   4, 1439,   16,    5,  221],
        [1488,    1,    5,  980,    6],
        [   2,    5,  826, 1856,    2],
        [ 134,   61,   10,   78, 4170],
        [   2, 1154,   45,  212, 3997],
        [ 596,   12,   56,  327,   46],
        [ 145,    2,    4, 1098,   87],
        [   3, 2614,  381,   11, 1231],
        [1077,   42,  572,    9,   17],
        [1114,    3,   13,  415,    6],

## Model

#### Self-attention layer implementations (loop and matrix)
#### Positional Encoding:

In [15]:
# Positional Embedding - as described in "Attention is All You Need"
class PositionalEncoding(nn.Module):
    def __init__(self, max_sequence, embedding_dim):
        super().__init__()
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.positional_encoding = torch.zeros(max_sequence, embedding_dim, device=device)
        position = torch.arange(0, max_sequence, device=device).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2, device=device) * (-math.log(10000.0) / embedding_dim))
        self.positional_encoding[:, 0::2] = torch.sin(position * div_term)
        self.positional_encoding[:, 1::2] = torch.cos(position * div_term)
        self.positional_encoding = self.positional_encoding.unsqueeze(0)

    def forward(self, x):
        _, seq_length, _ = x.size()
        positional_encoding = self.positional_encoding[:, :seq_length, :]
        positional_encoding = positional_encoding.to(x.device)
        # Position encoding is added to the input embeddings.
        return x + positional_encoding   
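
For reference, the sinusoidal encoding computed by the class above can be written out for a single position with the standard library alone. This is a sketch assuming an even `embedding_dim`; the function name `sinusoidal_encoding` is hypothetical.

```python
import math

def sinusoidal_encoding(position, embedding_dim):
    # Even indices use sin, odd indices use cos; each pair shares one frequency.
    encoding = [0.0] * embedding_dim
    for i in range(0, embedding_dim, 2):
        freq = math.exp(i * (-math.log(10000.0) / embedding_dim))
        encoding[i] = math.sin(position * freq)
        encoding[i + 1] = math.cos(position * freq)
    return encoding

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
pe1 = sinusoidal_encoding(1, 4)
print(pe1)
```

Position 0 always encodes to alternating 0 and 1, and later pairs of dimensions oscillate more slowly, so each position receives a distinct pattern.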

#### Loop implementation

In [16]:
# Adapted from Ramon's implementation, which is very elegant and translates the class slide well.

class SelfAttention_Loop(nn.Module):
  def __init__(self, embedding_dim, vocab_size):
    super(SelfAttention_Loop, self).__init__()

    self.WQ = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WK = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WV = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WO = nn.Linear(embedding_dim, vocab_size, bias=False)
    
  def setProjections(self, WQ, WK, WV, WO):
    self.WQ = WQ
    self.WK = WK
    self.WV = WV
    self.WO = WO

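  def forward(self, seq):
    # seq: (seq_len, embedding_dim); processes a single, unbatched sequence.
    E = []
    for Xq in seq:
        q = self.WQ(Xq)
        scores = []
        for Xk in seq:
            k = self.WK(Xk)
            # Unscaled dot-product score between query and key (both 1-D).
            score = torch.dot(q, k)
            scores.append(score)

        # torch.stack keeps the scores in the autograd graph;
        # torch.tensor(scores) would detach them from WQ and WK.
        scores_tensor = torch.stack(scores)
        probs         = scores_tensor.softmax(dim=-1)

        # Attention-weighted sum of the value vectors.
        e = 0
        for xv, p in zip(seq, probs):
            v = self.WV(xv)
            e += v * p

        e = self.WO(e)
        E.append(e)

    return torch.stack(E)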

#### Matrix implementation

In [17]:
# Matrix Implementation
class SelfAttention_Matrix(nn.Module):
  def __init__(self, embedding_dim, vocab_size):
    super().__init__()

    self.WQ = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WK = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WV = nn.Linear(embedding_dim, embedding_dim, bias=False)
    self.WO = nn.Linear(embedding_dim, vocab_size, bias=False)

  def setProjections(self, WQ, WK, WV, WO):
    self.WQ = WQ
    self.WK = WK
    self.WV = WV
    self.WO = WO

  def forward(self, inputs):
    # Linear projections
    Q = self.WQ(inputs)
    K = self.WK(inputs)
    V = self.WV(inputs)

    scores = torch.matmul(Q, K.transpose(-2, -1))
    probs = F.softmax(scores, dim=-1)
    new_embedding = torch.matmul(probs, V)
    # Projection in WO
    new_embedding = self.WO(new_embedding)
    return new_embedding
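
In matrix form, the forward pass above can be summarized as follows, where $X$ is the input with one row per token and each $W$ denotes the projection applied by the corresponding `nn.Linear` (the symbol names are chosen here for illustration). Note that the scores are not divided by $\sqrt{d_k}$, unlike the scaled dot-product attention of "Attention Is All You Need".

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V, \qquad
E = \operatorname{softmax}\!\left(Q K^{\top}\right) V \, W_O
```

The softmax is taken row by row, so each row of $\operatorname{softmax}(Q K^{\top})$ holds the attention weights of one query over all keys.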

### Testing the attention layers

In [18]:
# Test data
tst_dim = 5
tst_vocab = 10
data = torch.randint(0, tst_vocab, (tst_dim,))
embedding = nn.Embedding(tst_vocab, tst_dim)
embeds = embedding(data)

# Projections (need to be the same for this test)
WQ = nn.Linear(tst_dim, tst_dim, bias=False)
WK = nn.Linear(tst_dim, tst_dim, bias=False)
WV = nn.Linear(tst_dim, tst_dim, bias=False)
WO = nn.Linear(tst_dim, tst_vocab, bias=False)

# Loop
attn_loop = SelfAttention_Loop(tst_dim, tst_vocab)
attn_loop.setProjections(WQ, WK, WV, WO)
embeds_attn_loop = attn_loop(embeds)
# Matrix
attn_matrix = SelfAttention_Matrix(tst_dim, tst_vocab)
attn_matrix.setProjections(WQ, WK, WV, WO)
embeds_attn_matrix = attn_matrix(embeds)

print("Loop Embeds:")
print(embeds_attn_loop)

print("Matrix Embeds:")
print(embeds_attn_matrix)

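# Check results
print()
print(f'Loop and Matrix results are the same: {torch.allclose(embeds_attn_loop, embeds_attn_matrix)}')
# The exercise asks for an assert on the equivalence of both implementations
assert torch.allclose(embeds_attn_loop, embeds_attn_matrix), "Loop and matrix attention outputs differ"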

Loop Embeds:
tensor([[-0.1534, -0.5238, -0.1658, -0.4192, -0.2767,  0.2976, -0.2233, -0.3732,
          0.4384,  0.3550],
        [-0.1318, -0.5553, -0.0287, -0.4341, -0.3143,  0.0538, -0.4738, -0.6360,
          0.6172,  0.4791],
        [-0.1143, -0.5497, -0.0061, -0.4440, -0.3028,  0.0635, -0.4849, -0.6406,
          0.6213,  0.4742],
        [-0.1667, -0.5236, -0.1385, -0.4182, -0.2710,  0.2276, -0.2522, -0.4099,
          0.4575,  0.4058],
        [-0.2452, -0.5927, -0.1866, -0.3741, -0.3874,  0.0058, -0.3854, -0.5867,
          0.5769,  0.5096]], grad_fn=<StackBackward0>)
Matrix Embeds:
tensor([[-0.1534, -0.5238, -0.1658, -0.4192, -0.2767,  0.2976, -0.2233, -0.3732,
          0.4384,  0.3550],
        [-0.1318, -0.5553, -0.0287, -0.4341, -0.3143,  0.0538, -0.4738, -0.6360,
          0.6172,  0.4791],
        [-0.1143, -0.5497, -0.0061, -0.4440, -0.3028,  0.0635, -0.4849, -0.6406,
          0.6213,  0.4742],
        [-0.1667, -0.5236, -0.1385, -0.4182, -0.2710,  0.2276, -0.2522, -

### Model implementations (with and without attention, and with position embeddings)

In [19]:
class BengioModel(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.relu = torch.nn.ReLU()
        self.linear2 = nn.Linear(h, vocab_size+1)
        # Softmax to scale outputs
        self.logSoftMax = torch.nn.LogSoftmax(dim=1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        # Linear layer
        out = self.linear1(embeds)
        out = self.relu(out)
        # Second layer
        out = self.linear2(out)
        # Softmax output
        out = self.logSoftMax(out)
        return out

In [20]:
class BengioModel_SelfAttentionMatrix(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel_SelfAttentionMatrix, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.attention = SelfAttention_Matrix(embedding_dim, vocab_size)        
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.relu = torch.nn.ReLU()
        self.linear2 = nn.Linear(h, vocab_size+1)
        # Softmax to scale outputs
        self.logSoftMax = torch.nn.LogSoftmax(dim=1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
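        # unbind followed by stack on the same dim leaves the tensor unchanged
        x = torch.stack(torch.unbind(embeds, dim=1), dim=1)
        # Self-attention layer
        attention  = self.attention(x)
        # Flatten embeddings
        # NOTE: this flattens `embeds`, not `attention`, so the attention output
        # never reaches the MLP; only its batch size is used here.
        embeds = embeds.view(attention.size(0), -1)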
        # Linear layer
        out = self.linear1(embeds)
        out = self.relu(out)
        # Second layer
        out = self.linear2(out)
        # Softmax output
        out = self.logSoftMax(out)
        return out

In [21]:
class BengioModel_SelfAttentionMatrix_PosEncoding(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(BengioModel_SelfAttentionMatrix_PosEncoding, self).__init__()
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        self.posencoding = PositionalEncoding(context_size, embedding_dim)
        self.attention = SelfAttention_Matrix(embedding_dim, vocab_size)        
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.relu = torch.nn.ReLU()
        self.linear2 = nn.Linear(h, vocab_size+1)
        # Softmax to scale outputs
        self.logSoftMax = torch.nn.LogSoftmax(dim=1)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        embeds_pos = self.posencoding(embeds)
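        # unbind followed by stack on the same dim leaves the tensor unchanged
        x = torch.stack(torch.unbind(embeds_pos, dim=1), dim=1)
        # Self-attention layer
        attention  = self.attention(x)
        # Flatten embeddings
        # NOTE: this flattens the original `embeds`, so neither the positional
        # encoding nor the attention output reaches the MLP; only the batch
        # size of `attention` is used here.
        embeds = embeds.view(attention.size(0), -1)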
        # Linear layer
        out = self.linear1(embeds)
        out = self.relu(out)
        # Second layer
        out = self.linear2(out)
        # Softmax output
        out = self.logSoftMax(out)
        return out

## Model training and evaluation functions

In [22]:
def count_parameters(model):
    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f'The model has a total of {total_params:,} parameters.')

In [23]:
def initial_eval(model):
    # Initial Perplexity and Loss
    # Before training
    model.eval()

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in train_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss += criterion(outputs, targets).item()

    loss /= len(train_loader)
    perp = torch.exp(torch.tensor(loss))

    print(f'Initial Loss: {loss:.4f}')
    print(f'Initial Perplexity: {perp:.4f}')

In [24]:
def train(model, criterion, optimizer):
      # Training Loop
      model.train()
      for epoch in range(epochs):

            epoch_start = time.time()
            # Metrics
            epoch_loss = 0
            epoch_correct = 0
            epoch_samples = 0

            for inputs, targets in train_loader:
                  inputs = inputs.to(device)  # Move input data to the device
                  targets = targets.to(device)

                  # Forward pass
                  outputs = model(inputs)
                  loss = criterion(outputs, targets)

                  # Backward pass and optimization
                  optimizer.zero_grad()
                  loss.backward()
                  optimizer.step()

                  # Loss
                  epoch_loss += loss.item()

                  # Predicted
                  _, predicted = torch.max(outputs, 1)
                  epoch_correct += (predicted == targets).sum().item()
                  epoch_samples += targets.size(0)

            # Calculate average loss and accuracy for epoch
            # (acc is a fraction in [0, 1], although the log line below appends '%')
            avg_loss = epoch_loss / len(train_loader)
            acc = epoch_correct / epoch_samples

            # Perplexity
            perp = torch.exp(torch.tensor(avg_loss))

            epoch_end = time.time()
            epoch_time = epoch_end - epoch_start
            # Print epoch statistics
            print(f'Epoch [{epoch+1}/{epochs}], Time:{epoch_time:.2f}, Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}%, Perplexity: {perp:.4f}')


In [25]:
def eval(model, criterion):
    model.eval()

    loss_sum = 0
    total_sum = 0
    correct_sum = 0
    eval_round = 0

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)      
            loss_sum += loss

            # Get the predicted labels
            _, predicted = torch.max(outputs, 1)

            total_sum += targets.size(0)
            correct_sum += (predicted == targets).sum().item()
            eval_round += 1

    # Calculate accuracy
    acc = 100 * correct_sum / total_sum

    # Calculate average perplexity
    average_loss = loss_sum / len(val_loader)
    average_perplexity = torch.exp(average_loss)

    print(f'Test Accuracy: {acc:.2f}%')
    print(f'Average Loss: {average_loss:.2f}')
    print(f'Average Perplexity: {average_perplexity:.2f}')
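
One detail of the function above is that it averages the per-batch mean losses, so a smaller final batch carries the same weight as a full one. The sketch below, with invented loss values and batch sizes, contrasts that with a sample-weighted average; how much the two differ on this dataset was not measured.

```python
import math

# Hypothetical (mean loss, batch size) pairs; the last batch is smaller.
batches = [(5.0, 128), (5.2, 128), (6.5, 16)]

# What dividing the summed batch losses by the number of batches computes.
mean_of_batch_means = sum(loss for loss, _ in batches) / len(batches)
# Alternative: weight each batch mean by its number of samples.
sample_weighted = sum(loss * n for loss, n in batches) / sum(n for _, n in batches)

print(f'mean of batch means: {mean_of_batch_means:.4f}')
print(f'sample-weighted:     {sample_weighted:.4f}')
print(f'perplexities: {math.exp(mean_of_batch_means):.1f} vs {math.exp(sample_weighted):.1f}')
```

Because perplexity is the exponential of the average loss, small differences in the average are amplified in the reported perplexity.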

In [26]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Model evaluation


### 1. Without an attention layer

In [27]:
model = BengioModel(vocab_size, embedding_dim, context_size, hidden_dim)
print("Model without Self Attention:")
print()
count_parameters(model)

# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr)

model.to(device)

print()
print("Training Start")
print()
train(model, criterion, optimizer)

print()
print("Evaluation Start")
print()
eval(model, criterion)

Model without Self Attention:

The model has a total of 2,089,353 parameters.
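
The reported total can be checked by hand from the layer shapes, using the global values `vocab_size = 5000`, `embedding_dim = 256`, `context_size = 5`, and `hidden_dim = 128`. The short sketch below does the arithmetic; the variable names `embeddings`, `linear1`, and `linear2` mirror the model's layers, and `total` is only for illustration.

```python
vocab_size, embedding_dim, context_size, hidden_dim = 5000, 256, 5, 128

embeddings = (vocab_size + 1) * embedding_dim                      # lookup table
linear1 = context_size * embedding_dim * hidden_dim + hidden_dim   # weights + bias
linear2 = hidden_dim * (vocab_size + 1) + (vocab_size + 1)         # weights + bias

total = embeddings + linear1 + linear2
print(f'{total:,}')  # 2,089,353
```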



Training Start

Epoch [1/10], Time:1.39, Loss: 6.4188, Accuracy: 0.11%, Perplexity: 613.2775
Epoch [2/10], Time:1.28, Loss: 5.5662, Accuracy: 0.15%, Perplexity: 261.4394
Epoch [3/10], Time:1.26, Loss: 5.1897, Accuracy: 0.17%, Perplexity: 179.4220
Epoch [4/10], Time:1.26, Loss: 4.9069, Accuracy: 0.20%, Perplexity: 135.2157
Epoch [5/10], Time:1.29, Loss: 4.6648, Accuracy: 0.22%, Perplexity: 106.1455
Epoch [6/10], Time:1.27, Loss: 4.4467, Accuracy: 0.24%, Perplexity: 85.3485
Epoch [7/10], Time:1.28, Loss: 4.2453, Accuracy: 0.26%, Perplexity: 69.7777
Epoch [8/10], Time:1.31, Loss: 4.0522, Accuracy: 0.28%, Perplexity: 57.5264
Epoch [9/10], Time:1.34, Loss: 3.8720, Accuracy: 0.30%, Perplexity: 48.0380
Epoch [10/10], Time:1.35, Loss: 3.6994, Accuracy: 0.32%, Perplexity: 40.4228

Evaluation Start

Test Accuracy: 19.33%
Average Loss: 5.30
Average Perplexity: 200.13


### 2. With an attention layer

In [28]:
model_attn = BengioModel_SelfAttentionMatrix(vocab_size, embedding_dim, context_size, hidden_dim)
print("Model with Self Attention:")
print()
count_parameters(model_attn)

# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model_attn.parameters(), lr)

model_attn.to(device)

print()
print("Training Start")
print()
train(model_attn, criterion, optimizer)

print()
print("Evaluation Start")
print()
eval(model_attn, criterion)

Model with Self Attention:

The model has a total of 3,571,729 parameters.

Training Start

Epoch [1/10], Time:1.58, Loss: 6.4633, Accuracy: 0.11%, Perplexity: 641.1990
Epoch [2/10], Time:1.56, Loss: 5.5873, Accuracy: 0.15%, Perplexity: 267.0125
Epoch [3/10], Time:1.63, Loss: 5.2036, Accuracy: 0.18%, Perplexity: 181.9296
Epoch [4/10], Time:1.57, Loss: 4.9173, Accuracy: 0.20%, Perplexity: 136.6310
Epoch [5/10], Time:1.56, Loss: 4.6772, Accuracy: 0.22%, Perplexity: 107.4699
Epoch [6/10], Time:1.58, Loss: 4.4617, Accuracy: 0.24%, Perplexity: 86.6343
Epoch [7/10], Time:1.58, Loss: 4.2589, Accuracy: 0.26%, Perplexity: 70.7296
Epoch [8/10], Time:1.60, Loss: 4.0686, Accuracy: 0.28%, Perplexity: 58.4754
Epoch [9/10], Time:1.56, Loss: 3.8870, Accuracy: 0.30%, Perplexity: 48.7653
Epoch [10/10], Time:1.57, Loss: 3.7089, Accuracy: 0.32%, Perplexity: 40.8083

Evaluation Start

Test Accuracy: 19.03%
Average Loss: 5.37
Average Perplexity: 214.53


### 3. With an attention layer and positional embeddings
#### Model description:

In [29]:
model_attn_pos = BengioModel_SelfAttentionMatrix_PosEncoding(vocab_size, embedding_dim, context_size, hidden_dim)
print("Model with Self Attention and Positional Encodings:")
print()
count_parameters(model_attn_pos)
print()
print("Model:")
model_attn_pos

Model with Self Attention and Positional Encodings:

The model has a total of 3,571,729 parameters.

Model:


BengioModel_SelfAttentionMatrix_PosEncoding(
  (embeddings): Embedding(5001, 256)
  (posencoding): PositionalEncoding()
  (attention): SelfAttention_Matrix(
    (WQ): Linear(in_features=256, out_features=256, bias=True)
    (WK): Linear(in_features=256, out_features=256, bias=True)
    (WV): Linear(in_features=256, out_features=256, bias=True)
    (WO): Linear(in_features=256, out_features=5000, bias=True)
  )
  (linear1): Linear(in_features=1280, out_features=128, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=128, out_features=5001, bias=True)
  (logSoftMax): LogSoftmax(dim=1)
)

#### Initial perplexity:

In [30]:
# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model_attn_pos.parameters(), lr)

model_attn_pos.to(device)
print()
initial_eval(model_attn_pos)


Initial Loss: 8.5343
Initial Perplexity: 5086.5068
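
A useful sanity check: a model that assigns equal probability to every class has cross-entropy ln(N) and perplexity N. With the 5001 output classes of `linear2`, the sketch below gives values close to the initial loss and perplexity reported above, which is consistent with an untrained model predicting nearly uniformly. The variable names are only for illustration.

```python
import math

num_classes = 5000 + 1  # vocab_size + 1 outputs (index 0 is the unknown token)

uniform_loss = math.log(num_classes)
uniform_perplexity = math.exp(uniform_loss)

print(f'Uniform loss: {uniform_loss:.4f}')              # 8.5174
print(f'Uniform perplexity: {uniform_perplexity:.1f}')  # 5001.0
```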


#### Training and evaluation:

In [31]:
print()
print("Training Start")
print()
train(model_attn_pos, criterion, optimizer)

print()
print("Evaluation Start")
print()
eval(model_attn_pos, criterion)




Training Start

Epoch [1/10], Time:1.69, Loss: 6.4338, Accuracy: 0.11%, Perplexity: 622.5213
Epoch [2/10], Time:1.63, Loss: 5.5800, Accuracy: 0.15%, Perplexity: 265.0782
Epoch [3/10], Time:1.69, Loss: 5.2051, Accuracy: 0.17%, Perplexity: 182.2067
Epoch [4/10], Time:1.76, Loss: 4.9174, Accuracy: 0.20%, Perplexity: 136.6441
Epoch [5/10], Time:1.75, Loss: 4.6756, Accuracy: 0.22%, Perplexity: 107.2996
Epoch [6/10], Time:1.64, Loss: 4.4577, Accuracy: 0.24%, Perplexity: 86.2894
Epoch [7/10], Time:1.57, Loss: 4.2533, Accuracy: 0.26%, Perplexity: 70.3357
Epoch [8/10], Time:1.57, Loss: 4.0603, Accuracy: 0.28%, Perplexity: 57.9890
Epoch [9/10], Time:1.56, Loss: 3.8789, Accuracy: 0.30%, Perplexity: 48.3722
Epoch [10/10], Time:1.55, Loss: 3.7000, Accuracy: 0.31%, Perplexity: 40.4471

Evaluation Start

Test Accuracy: 20.34%
Average Loss: 5.30
Average Perplexity: 199.77
