## Exercício: LoRA

- Exercício didático para entender a técnica de fazer ajuste fino em modelos grandes usando poucos recursos
- Aplicar no pré exercício de análise de sentimento ou no segundo exercício, e modelo de linguagem, com vocabulário de 3000 palavras, embedding size e 2 camadas, treinados da forma usual (medir tempo de treinamento/época)
- Modificar o seu modelo para adotar a técnica do LoRA no embedding e nas 2 camadas, e fazer o ajuste-fino, isto é, continuar o treinamento anterior, lembrando que as matrizes originais ficarão congeladas e o ajuste dos pesos serão apenas aplicados nas matrizes do LoRA. Medir o tempo de treinamento/época.
- Por último, substituir o modelo original, com os novos pesos calculados pelo W + LoRA.

In [1]:
import os
import sys
import random
import time
import re
import math
from collections import Counter
from sklearn.model_selection import train_test_split

# Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Global variables

# Vocabulary
vocab_size = 3000
context_size = 9
pattern = r'\w+|[,;.:!?\']'

# Training
batch_size = 32
epochs = 10
lr = 0.05

# Model
embedding_dim = 64
hidden_dim = 200
dropout_rate = 0.2

# LoRA parameters
lora_r = 1         # Rank adaptation
lora_alpha = 4     # Scaling factor

In [3]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    %pip install colorama

    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_2_3"
    os.chdir(project_folder)
    !ls -la

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = 'cpu'
device

'cpu'

## Faz download e carrega o dataset

In [4]:
# Check if download is necessary
if not os.path.exists("67724.txt.utf-8"):
    print("Downloading Gutenberg texts")

    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

### Limpeza do texto principal

In [5]:
text_1 = open("67724.txt.utf-8","r").read()
text_2 = open("67725.txt.utf-8","r").read()

def clean_text(text):
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
    text_start = text.find(start_marker)
    text_end = text.find(end_marker)

    text_content= text[text_start:text_end].replace('\r','')
    paragraphs = []
    for paragraph in text_content.split("\n\n"):
        paragraph = paragraph.replace('\n', ' ').strip()
        # Validation of length and index lines
        if (len(paragraph) > 10 and '....' not in paragraph):
            paragraphs.append(paragraph)
    return paragraphs

cleaned_paragraphs = clean_text(text_1)+clean_text(text_2)
print(f'Number of paragraphs: {len(cleaned_paragraphs)}')

Number of paragraphs: 4596


## Análise do dataset

In [6]:
# Conta as palavras no dataset
def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(pattern, text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

11875

## Criando um vocabulário

In [7]:
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [8]:
print(f"Most Frequent Words: {most_frequent_words[:10]}")
print(f"Vocabulary Size: {len(vocab)}")

Most Frequent Words: [',', '.', 'a', 'que', 'o', 'de', 'e', 'se', ';', 'um']
Vocabulary Size: 3000


#### Codificação / Decodificação das sentenças

In [9]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(pattern, sentence.lower())]

def decode_sentence(encoded_sentence, vocab):
    words = []
    for index in encoded_sentence:
        word = next((word for word, code in vocab.items() if code == index), "<UNK>")
        words.append(word)

    return words

seq = cleaned_paragraphs[20]
spc = ' '
encoded = encode_sentence(seq, vocab)
decoded = decode_sentence(encoded, vocab)

print(f'Original Seq: {seq}')
print(f'Encoded: {encoded}')
print(f'Decoded: {decoded}')
print(f'Reconstructed Seq: {spc.join(decoded)}')

Original Seq: Ahi, o _Paquequer_ lança-se rapido sobre o seu leito, e atravessa as florestas como o tapir, espumando, deixando o pello esparso pelas pontas de rochedo, e enchendo a solidão com o estampido de sua carreira. De repente, falta-lhe o espaço, foge-lhe a terra; o soberbo rio recúa um momento para concentrar as suas forças e precipita-se de um só arremesso, como o tigre sobre a presa.
Encoded: [235, 1, 5, 723, 0, 8, 762, 38, 5, 19, 324, 1, 7, 0, 23, 634, 28, 5, 2447, 1, 0, 1, 763, 5, 1776, 0, 269, 1065, 6, 486, 1, 7, 2448, 3, 687, 16, 5, 2449, 6, 17, 1777, 2, 6, 240, 1, 522, 29, 5, 612, 1, 0, 29, 3, 128, 9, 5, 1577, 99, 0, 10, 72, 18, 0, 23, 87, 591, 7, 2450, 8, 6, 10, 74, 0, 1, 28, 5, 592, 38, 3, 979, 2]
Decoded: ['ahi', ',', 'o', '_paquequer_', '<UNK>', 'se', 'rapido', 'sobre', 'o', 'seu', 'leito', ',', 'e', '<UNK>', 'as', 'florestas', 'como', 'o', 'tapir', ',', '<UNK>', ',', 'deixando', 'o', 'pello', '<UNK>', 'pelas', 'pontas', 'de', 'rochedo', ',', 'e', 'enchendo', 'a', 's

## Classe do dataset

In [10]:
# Dataset class
class BagOfWordsDataset(Dataset):
  def __init__(self, paragraphs, vocab, context):
    self.paragraphs = paragraphs
    self.vocab = vocab
    self.context = context
    self.tokens, self.targets = self.setup()

  def __len__(self):
    return len(self.tokens)

  def __getitem__(self, idx):
    return torch.tensor(self.tokens[idx]), torch.tensor(self.targets[idx])
  
  def setup(self):
    tokens = []
    targets = []
    for paragraph in self.paragraphs:
      encoded = encode_sentence(paragraph, self.vocab)
      
      # If paragraph is smaller than the context, skip it.
      if len(encoded) < self.context + 1:
          continue

      for i in range(len(encoded) - self.context):
        tks = encoded[i:i+self.context]
        tgt = encoded[i+self.context]
        # Only add if there are no unknown tokens in both context and target.
        bad_token = 0
        if not (bad_token in tks or tgt == bad_token):
          tokens.append(tks)
          targets.append(tgt)
    return tokens, targets


In [11]:
# Train/Validation split
train_data, val_data = train_test_split(cleaned_paragraphs, test_size=0.2, random_state=18)

train_dataset = BagOfWordsDataset(train_data, vocab, context_size)
val_dataset = BagOfWordsDataset(val_data, vocab, context_size)

# Counting all Samples
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print()
print(f"Training dataset samples: {len(train_dataset)}")
print(f"Validation dataset samples: {len(val_dataset)}")

Training samples: 3676
Validation samples: 920

Training dataset samples: 24360
Validation dataset samples: 5851


In [12]:
tst_loader = DataLoader(train_dataset, batch_size = 1, shuffle=True)
sample = next(iter(tst_loader))
print(sample)

[tensor([[ 259,  115,    4,  277,  149, 1607,    7,  258,  548]]), tensor([354])]


In [13]:
# Train/val loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

## Modelo (Modificado habilitar ou desabilitar *low-rank adaptation*)
#### Se o parâmetro é ativado, as três camandas do modelo base (embedding, linear_1 e linear_2) terão seu pesos congelados para a utilização de *low-rank adaptation*, com apenas as camadas específicas da LoRA sendo treinadas.
#### Quando o parâmetro é desativado, as três camadas voltam a ser treinadas.

In [14]:
class BengioModel(torch.nn.Module):
    def __init__(self):
        super(BengioModel, self).__init__()
        self.LoRA_enabled = False # Default
        self.vocab_size = vocab_size

        # LoRA parameters
        self.lora_alpha = lora_alpha
        self.lora_r = lora_r
        self.scaling = self.lora_alpha / self.lora_r
        
        # Embeddings layer
        self.embeddings = nn.Embedding(vocab_size+1, embedding_dim)
        # First Linear Layer
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim, bias=True)
        # Activation and Dropout
        self.tanh = torch.nn.Tanh()
        self.dropout = torch.nn.Dropout(dropout_rate)
        # Second Linear Layer
        self.linear2 = nn.Linear(hidden_dim, vocab_size+1, bias=True)

        # LoRA Matrixes
        # LoRA on embeddings layer
        self.embeddings_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, embedding_dim))
        self.embeddings_lora_B = torch.nn.Parameter(torch.empty(vocab_size+1, self.lora_r))
        torch.nn.init.zeros_(self.embeddings_lora_A)
        torch.nn.init.normal_(self.embeddings_lora_B)
        # LoRA on first linear layer
        self.linear1_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, hidden_dim))
        self.linear1_lora_B = torch.nn.Parameter(torch.empty(context_size*embedding_dim, self.lora_r))
        torch.nn.init.zeros_(self.linear1_lora_A)
        torch.nn.init.normal_(self.linear1_lora_B)
        # LoRA on second linear layer
        self.linear2_lora_A = torch.nn.Parameter(torch.empty(self.lora_r, vocab_size+1))
        self.linear2_lora_B = torch.nn.Parameter(torch.empty(hidden_dim, self.lora_r))
        torch.nn.init.zeros_(self.linear2_lora_A)
        torch.nn.init.normal_(self.linear2_lora_B)
            
    def forward(self, inputs):
        # Embeddings
        embeds = self.embeddings(inputs)
        if (self.LoRA_enabled):
            one_hot = torch.nn.functional.one_hot(inputs, self.vocab_size+1).to(torch.float32)
            embeddings_LoRA = torch.matmul(one_hot,
                                           torch.matmul(self.embeddings_lora_B, self.embeddings_lora_A))
            embeddings_LoRA = embeddings_LoRA * self.scaling
            embeds = embeds + embeddings_LoRA

        # Flatten embeddings
        embeds = embeds.view(embeds.size(0), -1)
        
        # First linear layer
        out = self.linear1(embeds)
        if (self.LoRA_enabled):
            linear1_lora_out = torch.matmul(embeds,
                                            torch.matmul(self.linear1_lora_B, self.linear1_lora_A))
            linear1_lora_out = linear1_lora_out * self.scaling
            out = out + linear1_lora_out
        
        activation = self.tanh(out)
        activation = self.dropout(activation)

        # Second linear layer
        out = self.linear2(activation)
        if (self.LoRA_enabled):
            linear2_lora_out = torch.matmul(activation,
                                            torch.matmul(self.linear2_lora_B, self.linear2_lora_A))
            linear2_lora_out = linear2_lora_out * self.scaling
            out = out + linear2_lora_out

        return out
    
    def enable_LoRA(self):
        self.enable_LoRA = True
        # Freeze base model parameters
        print("Freezing Embeddings")
        self.embeddings.weight.requires_grad = False
        print("Freezing Layer 1")
        self.linear1.weight.requires_grad = False
        print("Freezing Layer 2")
        self.linear2.weight.requires_grad = False

    def disable_LoRA(self):
        self.enable_LoRA = False
        print("Unfreezing Embeddings")
        self.embeddings.weight.requires_grad = True
        print("Unfreezing Layer 1")
        self.linear1.weight.requires_grad = True
        print("Unfreezing Layer 2")
        self.linear2.weight.requires_grad = True

In [15]:
model = BengioModel()

#### Teste básico do modelo

In [16]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

print(input.shape)
print(target.shape)

output = model(input)
pred = output.argmax(dim=1)

print(pred)
print(target)

torch.Size([32, 9])
torch.Size([32])
tensor([2039,  628,   35, 2011, 1872, 1989,  544, 1778, 2088,  647,  712, 2314,
        2107, 2727, 2338,  752,  700,  589,  228,  774, 1614,  353,  585, 2728,
        1963,  170, 2265, 1152,   81,  949, 2479, 2564])
tensor([ 101,   29, 1646,    8, 1048,   95,  185,  152,  186,  154,   23,   68,
           4,  134,    3,    6,   11,   19,    3,    1,  272,   17,   21, 1113,
         823,    3,  279,   10,    5,  146,   89,    3])


## Training

### Funções de Treinamento e Avaliação do Modelo

#### Função para Contagem de Parâmetros do Modelo

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Exemplo de uso:
total_params = count_parameters(model)
print(f'O modelo tem um total de {total_params:,} parâmetros.')

O modelo tem um total de 917,707 parâmetros.


#### Função para Avaliação Inicial do Modelo

In [18]:
def init_eval(model):
    # Initial Perplexity and Loss
    # Before training
    model.eval()

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in train_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss += criterion(outputs, targets).item()

    loss /= len(train_loader)
    perp = torch.exp(torch.tensor(loss))

    print(f'Initial Loss: {loss:.4f}')
    print(f'Initial Perplexity: {perp:.4f}')

#### Função para Treinamento do Modelo

In [19]:
def train(model):
      # Training Loop
      model.train()
      for epoch in range(epochs):

            epoch_start = time.time()
            # Metrics
            epoch_loss = 0
            epoch_correct = 0
            epoch_samples = 0
            
            # Training times
            forward_time = 0

            for inputs, targets in train_loader:
                  inputs = inputs.to(device)  # Move input data to the device
                  targets = targets.to(device)

                  # Forward pass
                  forward_start = time.time()
                  outputs = model(inputs)
                  forward_time += (time.time() - forward_start)

                  loss = criterion(outputs, targets)

                  # Backward pass and optimization
                  optimizer.zero_grad()
                  loss.backward()

                  optimizer.step()

                  # Loss
                  epoch_loss += loss.item()

                  # Predicted
                  predicted = outputs.argmax(dim=1)
                  epoch_correct += (predicted == targets).sum().item()
                  epoch_samples += targets.size(0)

            # Calculate average loss and accuracy for epoch
            avg_loss = epoch_loss / len(train_loader)
            acc = epoch_correct / epoch_samples

            # Perplexity
            perp = torch.exp(torch.tensor(avg_loss))

            epoch_end = time.time()
            epoch_time = epoch_end - epoch_start
            
            # Print epoch statistics
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}, Accuracy: {acc:.2f}%, Perplexity: {perp:.4f}')
            print(f'Training Times Epoch: {epoch_time:.2f}, Forward Pass: {forward_time:.2f}, Backward Pass: {epoch_time-forward_time:.2f}')


#### Função para Avaliação na Base de Validação

In [20]:
def eval(model):
    model.eval()

    loss_sum = 0
    total_sum = 0
    correct_sum = 0
    eval_round = 0

    loss = 0
    perp = 0

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)      
            loss_sum += loss

            # Get the predicted labels
            predicted = outputs.argmax(dim=1)

            total_sum += targets.size(0)
            correct_sum += (predicted == targets).sum().item()
            eval_round += 1

    # Calculate accuracy
    acc = 100 * correct_sum / total_sum

    # Calculate average perplexity
    average_loss = loss_sum / len(val_loader)
    average_perplexity = torch.exp(average_loss)

    print(f'Test Accuracy: {acc:.2f}%')
    print(f'Average Loss: {average_loss:.2f}')
    print(f'Average Perplexity: {average_perplexity:.2f}')

### Treinamento e Avaliação (Sem utilizar *low-rank adaptation*)

In [21]:
# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr)

model.to(device)
print(model)
count_parameters(model)

BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=200, bias=True)
  (tanh): Tanh()
  (dropout): Dropout(p=0.2, inplace=False)
  (linear2): Linear(in_features=200, out_features=3001, bias=True)
)


917707

In [22]:
print("Base Model - No LoRA")
print()
print("Initial Evaluation")
print()
init_eval(model)
print()
print("Training the Model")
print()
train(model)
print()
print("Evaluation on the Validation Dataset")
eval(model)

Base Model - No LoRA

Initial Evaluation

Initial Loss: 8.0456
Initial Perplexity: 3120.1672

Training the Model

Epoch [1/10], Loss: 6.5887, Accuracy: 0.07%, Perplexity: 726.8450
Training Times Epoch: 1.90, Forward Pass: 0.32, Backward Pass: 1.58
Epoch [2/10], Loss: 5.7143, Accuracy: 0.11%, Perplexity: 303.1589
Training Times Epoch: 1.97, Forward Pass: 0.31, Backward Pass: 1.65
Epoch [3/10], Loss: 5.4331, Accuracy: 0.13%, Perplexity: 228.8650
Training Times Epoch: 1.86, Forward Pass: 0.30, Backward Pass: 1.56
Epoch [4/10], Loss: 5.2079, Accuracy: 0.14%, Perplexity: 182.7099
Training Times Epoch: 1.90, Forward Pass: 0.30, Backward Pass: 1.60
Epoch [5/10], Loss: 5.0174, Accuracy: 0.16%, Perplexity: 151.0185
Training Times Epoch: 1.90, Forward Pass: 0.29, Backward Pass: 1.61
Epoch [6/10], Loss: 4.8456, Accuracy: 0.17%, Perplexity: 127.1756
Training Times Epoch: 1.81, Forward Pass: 0.28, Backward Pass: 1.53
Epoch [7/10], Loss: 4.6833, Accuracy: 0.18%, Perplexity: 108.1297
Training Times E

### Treinamento e Avaliação (Utilizando *low-rank adaptation*)

In [23]:
model.enable_LoRA()
print(model)
count_parameters(model)

Freezing Embeddings
Freezing Layer 1
Freezing Layer 2
BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=200, bias=True)
  (tanh): Tanh()
  (dropout): Dropout(p=0.2, inplace=False)
  (linear2): Linear(in_features=200, out_features=3001, bias=True)
)


10243

In [24]:
print("Base Model - LoRA enabled")
print()
print()
print("Keep On training the model")
print()
train(model)
print()
print("Evaluation on the Validation Dataset")
eval(model)

Base Model - LoRA enabled


Keep On training the model

Epoch [1/10], Loss: 3.9395, Accuracy: 0.28%, Perplexity: 51.3925
Training Times Epoch: 1.52, Forward Pass: 0.31, Backward Pass: 1.21
Epoch [2/10], Loss: 3.9353, Accuracy: 0.28%, Perplexity: 51.1750
Training Times Epoch: 1.57, Forward Pass: 0.32, Backward Pass: 1.25
Epoch [3/10], Loss: 3.9329, Accuracy: 0.28%, Perplexity: 51.0559
Training Times Epoch: 1.62, Forward Pass: 0.34, Backward Pass: 1.27
Epoch [4/10], Loss: 3.9330, Accuracy: 0.28%, Perplexity: 51.0612
Training Times Epoch: 1.56, Forward Pass: 0.26, Backward Pass: 1.30
Epoch [5/10], Loss: 3.9324, Accuracy: 0.28%, Perplexity: 51.0268
Training Times Epoch: 1.56, Forward Pass: 0.30, Backward Pass: 1.27
Epoch [6/10], Loss: 3.9324, Accuracy: 0.28%, Perplexity: 51.0268
Training Times Epoch: 1.55, Forward Pass: 0.31, Backward Pass: 1.24
Epoch [7/10], Loss: 3.9298, Accuracy: 0.28%, Perplexity: 50.8949
Training Times Epoch: 1.54, Forward Pass: 0.30, Backward Pass: 1.25
Epoch [8/10],

In [25]:
model.disable_LoRA()
print(model)
count_parameters(model)

Unfreezing Embeddings
Unfreezing Layer 1
Unfreezing Layer 2
BengioModel(
  (embeddings): Embedding(3001, 64)
  (linear1): Linear(in_features=576, out_features=200, bias=True)
  (tanh): Tanh()
  (dropout): Dropout(p=0.2, inplace=False)
  (linear2): Linear(in_features=200, out_features=3001, bias=True)
)


917707

In [26]:
print("Base Model - LoRA disabled")
print()
print()
print("Keep On training the model")
print()
train(model)
print()
print("Evaluation on the Validation Dataset")
eval(model)

Base Model - LoRA disabled


Keep On training the model

Epoch [1/10], Loss: 4.0624, Accuracy: 0.25%, Perplexity: 58.1121
Training Times Epoch: 2.07, Forward Pass: 0.33, Backward Pass: 1.74
Epoch [2/10], Loss: 3.9246, Accuracy: 0.26%, Perplexity: 50.6333
Training Times Epoch: 2.09, Forward Pass: 0.34, Backward Pass: 1.75
Epoch [3/10], Loss: 3.7804, Accuracy: 0.28%, Perplexity: 43.8331
Training Times Epoch: 2.12, Forward Pass: 0.35, Backward Pass: 1.76
Epoch [4/10], Loss: 3.6445, Accuracy: 0.30%, Perplexity: 38.2621
Training Times Epoch: 2.08, Forward Pass: 0.33, Backward Pass: 1.74
Epoch [5/10], Loss: 3.5130, Accuracy: 0.31%, Perplexity: 33.5504
Training Times Epoch: 2.02, Forward Pass: 0.32, Backward Pass: 1.70
Epoch [6/10], Loss: 3.3775, Accuracy: 0.33%, Perplexity: 29.2964
Training Times Epoch: 2.04, Forward Pass: 0.34, Backward Pass: 1.70
Epoch [7/10], Loss: 3.2467, Accuracy: 0.35%, Perplexity: 25.7041
Training Times Epoch: 2.03, Forward Pass: 0.33, Backward Pass: 1.70
Epoch [8/10]

## 

## Geração de Sentenças

In [27]:
# Get a random sentence of context size

seq = "O indio atravessou a sala, e collocando-se"
inputs = encode_sentence(seq, vocab)
inputs = inputs[len(inputs)-context_size:len(inputs)]
inputs = torch.tensor([inputs])

new_tokens = 10
with torch.no_grad():
    context = torch.tensor(inputs, dtype=torch.long).squeeze().to(device)
    
    for _ in range(new_tokens):

        output = model(torch.tensor(inputs).to(device))
        probs = F.softmax(output, dim=1)
        next_token = torch.multinomial(probs, num_samples=1).squeeze()       
        context = torch.cat([context, next_token.reshape(1)], dim=0)
        
' '.join(decode_sentence(context.tolist(), vocab))

'o indio atravessou a sala , e <UNK> se braço tremula de o a sentia para tinha passava a'