## Exercise: LoRA

- A didactic exercise for understanding how to fine-tune large models with few resources.
- Apply it to the earlier sentiment-analysis exercise or to the second exercise, the language model, with a 3000-word vocabulary, a chosen embedding size, and 2 layers, trained in the usual way (measure the training time per epoch).
- Modify your model to apply LoRA to the embedding and to the 2 layers, then fine-tune it, that is, continue the previous training, keeping in mind that the original matrices stay frozen and weight updates are applied only to the LoRA matrices. Measure the training time per epoch.
- Finally, replace the original model's weights with the new weights computed as W + LoRA.
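
The last step can be written compactly. As a sketch, assuming the shape convention used in the model code further down ($B$ has shape $d_{in} \times r$, $A$ has shape $r \times d_{out}$, and a layer computes $xW$ with $W$ of shape $d_{in} \times d_{out}$), the adapted layer and the merged weight are:

$$
h = xW + \frac{\alpha}{r}\, x B A, \qquad W_{\text{merged}} = W + \frac{\alpha}{r}\, B A
$$

Because $A$ is initialized to zero in this notebook, $BA = 0$ when fine-tuning starts, so the adapted layer initially produces the same output as the frozen one.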

In [1]:
import os
import sys
import random
import time
import re
import math
from collections import Counter
from sklearn.model_selection import train_test_split

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Global variables

# Vocabulary
vocab_size = 3000
context_size = 9
pattern = r'\w+|[,;.:!?\']'

# Training
batch_size = 32
epochs = 25
lr = 0.01

# Model
embedding_dim = 64
hidden_dim = 200
dropout_rate = 0.2

# LoRA parameters
lora_r = 1        # Rank of the low-rank update matrices
lora_alpha = 2    # Scaling factor
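
For a sense of scale, the sketch below counts how many parameters a rank-`lora_r` update needs for each of the three weight matrices, compared with the full matrices. It uses only the values set in this cell and the layer shapes of the model defined later in the notebook; `layers` and `lora_param_count` are illustrative names, not part of the notebook.

```python
# Sizes taken from the configuration cell.
vocab_rows = 3000 + 1          # vocab_size + 1 (index 0 is the unknown token)
embedding_dim = 64
context_size = 9
hidden_dim = 200
lora_r, lora_alpha = 1, 2

# (rows, cols) of each weight matrix that receives a LoRA update.
layers = {
    "embeddings": (vocab_rows, embedding_dim),
    "linear1": (context_size * embedding_dim, hidden_dim),
    "linear2": (hidden_dim, vocab_rows),
}

def lora_param_count(rows, cols, rank):
    """Parameters in B (rows x rank) plus A (rank x cols)."""
    return rank * (rows + cols)

full_total = sum(rows * cols for rows, cols in layers.values())
lora_total = sum(lora_param_count(rows, cols, lora_r) for rows, cols in layers.values())
scaling = lora_alpha / lora_r   # factor applied to the B @ A update

print(f"full weight matrices: {full_total:,}")   # 907,464
print(f"LoRA matrices:        {lora_total:,}")   # 7,042
print(f"scaling:              {scaling}")        # 2.0
```

Adding the two bias vectors (200 + 3,001) to the 907,464 weights gives the 910,665 parameters that the notebook reports for the base model.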

In [3]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    %pip install colorama

    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_2_3"
    os.chdir(project_folder)
    !ls -la

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = 'cpu'
device

device(type='cuda')

## Download and load the dataset

In [4]:
# Check if download is necessary
if not os.path.exists("67724.txt.utf-8"):
    print("Downloading Gutenberg texts")

    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

### Cleaning the main text

In [5]:
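# Read both books with an explicit encoding and close the files afterwards.
with open("67724.txt.utf-8", "r", encoding="utf-8") as f:
    text_1 = f.read()
with open("67725.txt.utf-8", "r", encoding="utf-8") as f:
    text_2 = f.read()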

def clean_text(text):
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
    text_start = text.find(start_marker)
    text_end = text.find(end_marker)

    text_content= text[text_start:text_end].replace('\r','')
    paragraphs = []
    for paragraph in text_content.split("\n\n"):
        paragraph = paragraph.replace('\n', ' ').strip()
        # Validation of length and index lines
        if (len(paragraph) > 10 and '....' not in paragraph):
            paragraphs.append(paragraph)
    return paragraphs

cleaned_paragraphs = clean_text(text_1)+clean_text(text_2)
print(f'Number of paragraphs: {len(cleaned_paragraphs)}')

Number of paragraphs: 4596


## Dataset analysis

In [6]:
# Count the words in the dataset
def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(pattern, text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

11875

## Building the vocabulary

In [7]:
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [8]:
print(f"Most Frequent Words: {most_frequent_words[:10]}")
print(f"Vocabulary Size: {len(vocab)}")

Most Frequent Words: [',', '.', 'a', 'que', 'o', 'de', 'e', 'se', ';', 'um']
Vocabulary Size: 3000
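
The id assignment can be checked on a toy corpus. This is a minimal sketch with made-up words: ids start at 1, so id 0 stays free for any word outside the vocabulary, which is the convention the encoding function in the next cell relies on.

```python
from collections import Counter

# Toy corpus: "a" appears 3 times, "b" twice, "c" and "d" once each.
word_counts = Counter("a b a c b a d".split())

# Keep only the 2 most frequent words and number them from 1.
top_words = [word for word, _ in word_counts.most_common(2)]
vocab = {word: i for i, word in enumerate(top_words, 1)}

# Words outside the vocabulary fall back to id 0.
ids = [vocab.get(word, 0) for word in ["a", "d", "b"]]

print(vocab)  # {'a': 1, 'b': 2}
print(ids)    # [1, 0, 2]
```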


#### Encoding / decoding sentences

In [9]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(pattern, sentence.lower())]

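def decode_sentence(encoded_sentence, vocab):
    # Invert the vocabulary once instead of scanning it for every token.
    index_to_word = {code: word for word, code in vocab.items()}
    # Index 0 (and any other unmapped index) decodes to "<UNK>".
    return [index_to_word.get(index, "<UNK>") for index in encoded_sentence]
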

seq = cleaned_paragraphs[20]
spc = ' '
encoded = encode_sentence(seq, vocab)
decoded = decode_sentence(encoded, vocab)

print(f'Original Seq: {seq}')
print(f'Encoded: {encoded}')
print(f'Decoded: {decoded}')
print(f'Reconstructed Seq: {spc.join(decoded)}')

Original Seq: Ahi, o _Paquequer_ lança-se rapido sobre o seu leito, e atravessa as florestas como o tapir, espumando, deixando o pello esparso pelas pontas de rochedo, e enchendo a solidão com o estampido de sua carreira. De repente, falta-lhe o espaço, foge-lhe a terra; o soberbo rio recúa um momento para concentrar as suas forças e precipita-se de um só arremesso, como o tigre sobre a presa.
Encoded: [235, 1, 5, 723, 0, 8, 762, 38, 5, 19, 324, 1, 7, 0, 23, 634, 28, 5, 2447, 1, 0, 1, 763, 5, 1776, 0, 269, 1065, 6, 486, 1, 7, 2448, 3, 687, 16, 5, 2449, 6, 17, 1777, 2, 6, 240, 1, 522, 29, 5, 612, 1, 0, 29, 3, 128, 9, 5, 1577, 99, 0, 10, 72, 18, 0, 23, 87, 591, 7, 2450, 8, 6, 10, 74, 0, 1, 28, 5, 592, 38, 3, 979, 2]
Decoded: ['ahi', ',', 'o', '_paquequer_', '<UNK>', 'se', 'rapido', 'sobre', 'o', 'seu', 'leito', ',', 'e', '<UNK>', 'as', 'florestas', 'como', 'o', 'tapir', ',', '<UNK>', ',', 'deixando', 'o', 'pello', '<UNK>', 'pelas', 'pontas', 'de', 'rochedo', ',', 'e', 'enchendo', 'a', 's
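
The tokenization behind this output can be reproduced in isolation. The sketch below applies the notebook's `pattern` to a shortened version of the sentence: hyphens are not matched, so `lança-se` becomes two tokens, while underscores count as word characters and stay attached.

```python
import re

# Same pattern as in the global variables cell: words, or a single punctuation mark.
pattern = r'\w+|[,;.:!?\']'

text = "Ahi, o _Paquequer_ lança-se rapido!"
tokens = re.findall(pattern, text.lower())

print(tokens)
# ['ahi', ',', 'o', '_paquequer_', 'lança', 'se', 'rapido', '!']
```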

## Dataset class

In [10]:
# Dataset class
class BagOfWordsDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    self.paragraphs = paragraphs
    self.vocab = vocab
    self.context = context
    self.tokens, self.targets = self.setup()

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])
  
  def setup(self):
    tokens = []
    targets = []
    for paragraph in self.paragraphs:
      encoded = encode_sentence(paragraph, self.vocab)
      
      # If paragraph is smaller than the context, skip it.
      if len(encoded) < self.context + 1:
          continue

      for i in range(len(encoded) - self.context):
        tks = encoded[i:i+self.context]
        tgt = encoded[i+self.context]
        # Only add if there are no unknown tokens in both context and target.
        bad_token = 0
        if not (bad_token in tks or tgt == bad_token):
          tokens.append(tks)
          targets.append(tgt)
    return tokens, targets
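
The windowing done in `setup` is easier to follow on a short sequence. The sketch below repeats the same logic outside the class; `make_windows` is a hypothetical helper name used only for this illustration.

```python
def make_windows(encoded, context, unknown=0):
    """Return (context window, next token) pairs that contain no unknown token."""
    samples = []
    for i in range(len(encoded) - context):
        window = encoded[i:i + context]
        target = encoded[i + context]
        # Skip the sample if the unknown id appears in the window or as the target.
        if unknown in window or target == unknown:
            continue
        samples.append((window, target))
    return samples

encoded = [5, 3, 8, 0, 7, 2, 9, 4]
samples = make_windows(encoded, context=2)
for window, target in samples:
    print(window, "->", target)
# [5, 3] -> 8
# [7, 2] -> 9
# [2, 9] -> 4
```

A single unknown token removes `context + 1` candidate samples, which is why the dataset ends up with far fewer samples than the token count would suggest.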


In [11]:
# Train/Validation split
train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

train_dataset = BagOfWordsDataset(train_data, vocab, context_size)
val_dataset = BagOfWordsDataset(val_data, vocab, context_size)

# Counting all Samples
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print()
print(f"Training dataset samples: {len(train_dataset)}")
print(f"Validation dataset samples: {len(val_dataset)}")

Training samples: 3676
Validation samples: 920

Training dataset samples: 24360
Validation dataset samples: 5851


In [12]:
tst_loader = DataLoader(train_dataset, batch_size = 1, shuffle=True)
sample = next(iter(tst_loader))
print(sample)

[tensor([[ 27, 742,  10,  90,  18, 141,   8,   3,  17]]), tensor([144])]


In [13]:
# Train/val loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

## Model (modified to take the boolean parameter *apply_LoRA*)
#### When the parameter is enabled, the weights of the three base-model layers (embedding, linear_1 and linear_2) are frozen for *low-rank adaptation*, and only the LoRA-specific matrices are trained.

In [14]:
class BengioModel(torch.nn.Module):
    def __init__(self, apply_LoRA):
        super(BengioModel, self).__init__()
        self.apply_LoRA = apply_LoRA
        self.vocab_size = vocab_size

        # LoRA parameters
        self.lora_alpha = lora_alpha
        self.lora_r = lora_r
        self.scaling = self.lora_alpha / self.lora_r
        
        # Embeddings layer
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        if (self.apply_LoRA):
            # Freeze weights
            print("Freezing Embeddings")
            self.embeddings.weight.requires_grad = False

            # LoRA on embeddings layer
            self.embeddings_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, embedding_dim))
            self.embeddings_lora_B = torch.nn.Parameter(torch.empty(vocab_size+1, self.lora_r))
            torch.nn.init.zeros_(self.embeddings_lora_A)
            torch.nn.init.normal_(self.embeddings_lora_B)

        # First Linear Layer
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim, bias=True)
        if (self.apply_LoRA):
            # Freeze weights
            print("Freezing Layer 1")
            self.linear1.weight.requires_grad = False

            # LoRA on first linear layer
            self.linear1_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, hidden_dim))
            self.linear1_lora_B = torch.nn.Parameter(torch.empty(context_size*embedding_dim, self.lora_r))
            torch.nn.init.zeros_(self.linear1_lora_A)
            torch.nn.init.normal_(self.linear1_lora_B)

        self.tanh = torch.nn.Tanh()
        self.dropout = torch.nn.Dropout(dropout_rate)

        # Second Linear Layer
        self.linear2 = nn.Linear(hidden_dim, vocab_size+1, bias=True)
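        if (self.apply_LoRA):
            # Freeze weights
            print("Freezing Layer 2")
            # Without this line linear2.weight stays trainable, which is what the
            # 610,443 trainable parameters reported further down reflect.
            self.linear2.weight.requires_grad = False
            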
            # LoRA on second linear layer
            self.linear2_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, vocab_size+1))
            self.linear2_lora_B = torch.nn.Parameter(torch.empty(hidden_dim, self.lora_r))
            torch.nn.init.zeros_(self.linear2_lora_A)
            torch.nn.init.normal_(self.linear2_lora_B)

    def forward(self, inputs):
        # Embeddings
        embeds = self.embeddings(inputs)
        if (self.apply_LoRA):
            one_hot = torch.nn.functional.one_hot(inputs, self.vocab_size+1).to(torch.float32)
            embeddings_LoRA = torch.matmul(one_hot,
                                           torch.matmul(self.embeddings_lora_B, self.embeddings_lora_A))
            embeddings_LoRA = embeddings_LoRA * self.scaling
            embeds = embeds + embeddings_LoRA

        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        
        # First linear layer
        out = self.linear1(embeds)
        if (self.apply_LoRA):
            linear1_lora_out = torch.matmul(embeds,
                                            torch.matmul(self.linear1_lora_B, self.linear1_lora_A))
            linear1_lora_out = linear1_lora_out * self.scaling
            out = out + linear1_lora_out
        
        activation = self.tanh(out)
        activation = self.dropout(activation)

        # Second linear layer
        out = self.linear2(activation)
        if (self.apply_LoRA):
            linear2_lora_out = torch.matmul(activation,
                                            torch.matmul(self.linear2_lora_B, self.linear2_lora_A))
            linear2_lora_out = linear2_lora_out * self.scaling
            out = out + linear2_lora_out

        return out
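
The one-hot product in `forward` builds a `(batch, context, vocab_size+1)` tensor only to select rows of `embeddings_lora_B`. A possible alternative, shown here as a sketch on small made-up sizes, is to look the rows up directly with `torch.nn.functional.embedding` and then project with `A`; both formulations give the same update.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_rows, embedding_dim, rank = 11, 4, 1          # toy sizes, not the notebook's
lora_B = torch.randn(num_rows, rank)
lora_A = torch.randn(rank, embedding_dim)
inputs = torch.randint(0, num_rows, (3, 5))       # (batch, context) token ids

# One-hot formulation, as in forward().
one_hot = F.one_hot(inputs, num_rows).to(torch.float32)
delta_one_hot = one_hot @ (lora_B @ lora_A)

# Lookup formulation: select the rows of B first, then project with A.
delta_lookup = F.embedding(inputs, lora_B) @ lora_A

print(delta_lookup.shape)                            # torch.Size([3, 5, 4])
print(torch.allclose(delta_one_hot, delta_lookup))   # True
```

The lookup avoids materializing the one-hot tensor, which may account for part of the longer forward time seen in the LoRA training logs, although that was not measured here.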

In [15]:
model = BengioModel(apply_LoRA=False)

#### Basic model test

In [16]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

print(input.shape)
print(target.shape)

output = model(input)
pred = output.argmax(dim=1)

print(pred)
print(target)

torch.Size([32, 9])
torch.Size([32])
tensor([1825, 2830, 2202, 2294,  999,  158,  114, 2817,  107, 1391,  594, 1727,
        2816, 2676, 1107, 1001,  188, 1861, 2980, 2211, 1684,  721, 1706, 2508,
        2633, 2442,  921, 2086,  587, 1426, 1626, 1048])
tensor([ 328,   24,   14, 2607,    2,   38,    4,   26,  128,    7,    4,   46,
         453,   14,    1, 2688, 1122,    5,    5,    6,    4,    5,   34,    4,
          43,    6,    9,  407,  313,   30,    6,    3])


## Training

### Model training and evaluation functions

#### Function to count the model's parameters

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage:
total_params = count_parameters(model)
print(f'The model has a total of {total_params:,} trainable parameters.')

The model has a total of 910,665 trainable parameters.


#### Function for the model's initial evaluation

In [18]:
def init_eval(model):
    # Initial Perplexity and Loss
    # Before training
    model.eval()

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in train_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss += criterion(outputs, targets).item()

    loss /= len(train_loader)
    perp = torch.exp(torch.tensor(loss))

    print(f'Initial Loss: {loss:.4f}')
    print(f'Initial Perplexity: {perp:.4f}')
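
Perplexity here is the exponential of the mean cross-entropy loss. A useful reference point, sketched below with the standard library only, is the loss of a model that assigns equal probability to all `vocab_size + 1` classes; an untrained model should start close to it.

```python
import math

num_classes = 3000 + 1                  # vocab_size + 1 output classes

# Cross-entropy of a uniform prediction: -log(1 / num_classes).
uniform_loss = math.log(num_classes)
# Perplexity is exp(loss), so a uniform guess has perplexity equal to the class count.
uniform_perplexity = math.exp(uniform_loss)

print(f"{uniform_loss:.4f}")            # 8.0067
print(f"{uniform_perplexity:.1f}")      # 3001.0
```

The initial losses recorded later in the notebook (8.0377 and 8.0337) sit just above this baseline, as expected for randomly initialized weights.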

#### Function to train the model

In [19]:
def train(model):
      # Training Loop
      model.train()
      for epoch in range(epochs):

            epoch_start = time.time()
            # Metrics
            epoch_loss = 0
            epoch_correct = 0
            epoch_samples = 0
            
            # Training times
            forward_time = 0

            for inputs, targets in train_loader:
                  inputs = inputs.to(device)  # Move input data to the device
                  targets = targets.to(device)

                  # Forward pass
                  forward_start = time.time()
                  outputs = model(inputs)
                  forward_time += (time.time() - forward_start)

                  loss = criterion(outputs, targets)

                  # Backward pass and optimization
                  optimizer.zero_grad()
                  loss.backward()

                  optimizer.step()

                  # Loss
                  epoch_loss += loss.item()

                  # Predicted
                  predicted = outputs.argmax(dim=1)
                  epoch_correct += (predicted == targets).sum().item()
                  epoch_samples += targets.size(0)

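            # Calculate average loss and accuracy (as a percentage) for the epoch.
            # The logs recorded in this notebook printed the unscaled fraction, so 0.04% there means 4%.
            avg_loss = epoch_loss / len(train_loader)
            acc = 100 * epoch_correct / epoch_samples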

            # Perplexity
            perp = torch.exp(torch.tensor(avg_loss))

            epoch_end = time.time()
            epoch_time = epoch_end - epoch_start
            
            # Print epoch statistics
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}%, Perplexity: {perp:.4f}')
            print(f'Training Times Epoch: {epoch_time:.2f}, Forward Pass: {forward_time:.2f}, Backward Pass: {epoch_time-forward_time:.2f}')
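
Two caveats apply to the timings printed above. CUDA kernels are launched asynchronously, so `time.time()` around `model(inputs)` may capture little more than the launch cost, and the "Backward Pass" figure is simply the epoch time minus the forward time, so it also includes data loading, the loss computation, and the optimizer step. A possible way to time a single call more faithfully is sketched below; `timed_call` is a hypothetical helper, not part of the notebook.

```python
import time
import torch

def timed_call(fn, *args):
    """Run fn(*args) and return (result, seconds), waiting for pending GPU work."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # finish earlier kernels before starting the clock
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # wait for the kernels launched by fn
    return result, time.perf_counter() - start

# Toy layer and batch; on a CPU-only machine the synchronize calls are skipped.
layer = torch.nn.Linear(8, 4)
batch = torch.randn(32, 8)
outputs, seconds = timed_call(layer, batch)
print(outputs.shape, f"{seconds:.6f}s")
```

Synchronizing on every batch slows training slightly, so it is best kept for measurement runs.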


#### Function to evaluate on the validation set

In [20]:
def eval(model):
    model.eval()

    loss_sum = 0
    total_sum = 0
    correct_sum = 0
    eval_round = 0

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)      
            loss_sum += loss

            # Get the predicted labels
            predicted = outputs.argmax(dim=1)

            total_sum += targets.size(0)
            correct_sum += (predicted == targets).sum().item()
            eval_round += 1

    # Calculate accuracy
    acc = 100 * correct_sum / total_sum

    # Calculate average perplexity
    average_loss = loss_sum / len(val_loader)
    average_perplexity = torch.exp(average_loss)

    print(f'Test Accuracy: {acc:.2f}%')
    print(f'Average Loss: {average_loss:.2f}')
    print(f'Average Perplexity: {average_perplexity:.2f}')

### Training and evaluation (without *low-rank adaptation*)

In [21]:
# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr)

model.to(device)
print(model)
count_parameters(model)

BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=200, bias=True)
  (tanh): Tanh()
  (dropout): Dropout(p=0.2, inplace=False)
  (linear2): Linear(in_features=200, out_features=3001, bias=True)
)


910665

In [22]:
print("Base Model - No LoRA")
print()
print("Initial Evaluation")
print()
init_eval(model)
print()
print("Training the Model")
print()
train(model)
print()
print("Evaluation on the Validation Dataset")
eval(model)

Base Model - No LoRA

Initial Evaluation

Initial Loss: 8.0377
Initial Perplexity: 3095.4580

Training the Model

Epoch [1/25], Loss: 7.6686, Accuracy: 0.04%, Perplexity: 2140.1570
Training Times Epoch: 2.93, Forward Pass: 0.37, Backward Pass: 2.56
Epoch [2/25], Loss: 6.5710, Accuracy: 0.09%, Perplexity: 714.0520
Training Times Epoch: 3.02, Forward Pass: 0.38, Backward Pass: 2.64


Epoch [3/25], Loss: 6.1775, Accuracy: 0.10%, Perplexity: 481.7852
Training Times Epoch: 2.98, Forward Pass: 0.38, Backward Pass: 2.60
Epoch [4/25], Loss: 5.9759, Accuracy: 0.11%, Perplexity: 393.8247
Training Times Epoch: 3.42, Forward Pass: 0.37, Backward Pass: 3.05
Epoch [5/25], Loss: 5.8366, Accuracy: 0.12%, Perplexity: 342.6158
Training Times Epoch: 3.16, Forward Pass: 0.37, Backward Pass: 2.79
Epoch [6/25], Loss: 5.7329, Accuracy: 0.12%, Perplexity: 308.8754
Training Times Epoch: 3.41, Forward Pass: 0.39, Backward Pass: 3.02
Epoch [7/25], Loss: 5.6434, Accuracy: 0.12%, Perplexity: 282.4118
Training Times Epoch: 3.01, Forward Pass: 0.35, Backward Pass: 2.66
Epoch [8/25], Loss: 5.5714, Accuracy: 0.13%, Perplexity: 262.8052
Training Times Epoch: 3.30, Forward Pass: 0.35, Backward Pass: 2.95
Epoch [9/25], Loss: 5.5036, Accuracy: 0.14%, Perplexity: 245.5740
Training Times Epoch: 3.15, Forward Pass: 0.35, Backward Pass: 2.81
Epoch [10/25], Loss: 5.4441, Accuracy: 0.14%, Perplexity: 231.

### Training and evaluation (with *low-rank adaptation*)

In [23]:
model_LoRA = BengioModel(apply_LoRA=True)

# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model_LoRA.parameters(), lr)

model_LoRA.to(device)
print(model_LoRA)
count_parameters(model_LoRA)

Freezing Embeddings
Freezing Layer 1
Freezing Layer 2
BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=200, bias=True)
  (tanh): Tanh()
  (dropout): Dropout(p=0.2, inplace=False)
  (linear2): Linear(in_features=200, out_features=3001, bias=True)
)


610443
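
As written, `BengioModel(apply_LoRA=True)` starts from freshly initialized base weights, whereas the exercise asks to continue from the previously trained model. One way to do that, sketched here on a small stand-in module, is to copy the trained weights with `load_state_dict(..., strict=False)`, which tolerates the extra LoRA parameters. `TinyModel` is hypothetical and only mirrors the structure of the notebook's model.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for the notebook's model: one linear layer, optional LoRA matrices."""
    def __init__(self, apply_lora=False, rank=1):
        super().__init__()
        self.linear = nn.Linear(4, 3)
        if apply_lora:
            self.linear.weight.requires_grad = False
            self.lora_B = nn.Parameter(torch.randn(4, rank))
            self.lora_A = nn.Parameter(torch.zeros(rank, 3))

base = TinyModel()                      # plays the role of the trained base model
adapted = TinyModel(apply_lora=True)    # fresh weights plus LoRA matrices

# strict=False copies every matching key and reports the ones it could not find.
result = adapted.load_state_dict(base.state_dict(), strict=False)

print(sorted(result.missing_keys))                              # ['lora_A', 'lora_B']
print(torch.equal(adapted.linear.weight, base.linear.weight))   # True
print(adapted.linear.weight.requires_grad)                      # False
```

In the notebook the equivalent call would be `model_LoRA.load_state_dict(model.state_dict(), strict=False)`, placed before the optimizer is created; the copy does not change the `requires_grad` flags.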

In [24]:
print("Base Model - With LoRA")
print()
print("Initial Evaluation")
print()
init_eval(model_LoRA)
print()
print("Training the Model")
print()
train(model_LoRA)
print()
print("Evaluation on the Validation Dataset")
eval(model_LoRA)

Base Model - With LoRA

Initial Evaluation



Initial Loss: 8.0337
Initial Perplexity: 3083.0664

Training the Model

Epoch [1/25], Loss: 7.5687, Accuracy: 0.03%, Perplexity: 1936.6826
Training Times Epoch: 3.93, Forward Pass: 0.85, Backward Pass: 3.08
Epoch [2/25], Loss: 6.9522, Accuracy: 0.04%, Perplexity: 1045.4222
Training Times Epoch: 3.90, Forward Pass: 0.83, Backward Pass: 3.08
Epoch [3/25], Loss: 6.7645, Accuracy: 0.04%, Perplexity: 866.5404
Training Times Epoch: 3.84, Forward Pass: 0.82, Backward Pass: 3.02
Epoch [4/25], Loss: 6.6441, Accuracy: 0.04%, Perplexity: 768.2319
Training Times Epoch: 3.90, Forward Pass: 0.83, Backward Pass: 3.07
Epoch [5/25], Loss: 6.4999, Accuracy: 0.05%, Perplexity: 665.0814
Training Times Epoch: 3.89, Forward Pass: 0.81, Backward Pass: 3.08
Epoch [6/25], Loss: 6.3723, Accuracy: 0.05%, Perplexity: 585.4117
Training Times Epoch: 3.82, Forward Pass: 0.82, Backward Pass: 2.99
Epoch [7/25], Loss: 6.3059, Accuracy: 0.05%, Perplexity: 547.7678
Training Times Epoch: 3.82, Forward Pass: 0.81, Backward
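
The notebook stops before the final step of the exercise, replacing the original weights with W + LoRA. A minimal sketch of that merge for one linear layer follows, using toy sizes and the same `(in, rank)` / `(rank, out)` shapes as the model's LoRA matrices. The names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features, rank, alpha = 6, 4, 1, 2
scaling = alpha / rank

linear = nn.Linear(in_features, out_features)
lora_B = torch.randn(in_features, rank)
lora_A = torch.randn(rank, out_features)
x = torch.randn(3, in_features)

with torch.no_grad():
    # Output of the adapted layer, computed as in forward().
    expected = linear(x) + scaling * (x @ (lora_B @ lora_A))

    # Fold the update into the stored weight. nn.Linear keeps its weight as
    # (out_features, in_features), so the (in, out) product B @ A is transposed.
    linear.weight += scaling * (lora_B @ lora_A).T

    merged = linear(x)

print(torch.allclose(expected, merged, atol=1e-5))   # True
```

For the embedding layer no transpose is needed, because `nn.Embedding.weight` has shape `(num_embeddings, embedding_dim)`, the same layout as `embeddings_lora_B @ embeddings_lora_A`. After merging, the LoRA branches in `forward` must be disabled, or the update would be applied twice.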


## Sentence generation

In [25]:
# Encode a fixed prompt and keep its last context_size tokens

seq = "O indio atravessou a sala, e collocando-se"
encoded = encode_sentence(seq, vocab)
context = torch.tensor(encoded[-context_size:], dtype=torch.long, device=device)

new_tokens = 10
model.eval()  # Disable dropout while sampling
with torch.no_grad():
    for _ in range(new_tokens):
        # Slide the window so each prediction sees the tokens generated so far
        window = context[-context_size:].unsqueeze(0)
        output = model(window)
        probs = F.softmax(output, dim=1)
        next_token = torch.multinomial(probs, num_samples=1).squeeze()
        context = torch.cat([context, next_token.reshape(1)], dim=0)

' '.join(decode_sentence(context.tolist(), vocab))

'o indio atravessou a sala , e <UNK> se continuava tua o no ella parecia face no do passasse'