## LoRA: Low-Rank Adaptation of Large Language Models

### Introduction to LoRA

LoRA adapts a pre-trained language model by adding low-rank matrices to selected weight matrices, which reduces the number of parameters that need to be updated. This saves memory and computation, which makes it well suited to large models. Only the low-rank parameters are trained while the rest of the model stays frozen, so the model can be adapted to a specific task without fine-tuning every weight.
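
To make the saving concrete, here is a minimal sketch of the parameter arithmetic for a single weight matrix. The helper name `lora_param_counts` is hypothetical; the shapes (768 x 768 with rank 8) are the ones used for BERT's attention projections later in this notebook.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full_update_params, lora_params) for one d_in x d_out weight."""
    full = d_in * d_out                # updating every entry of the weight matrix
    lora = d_in * rank + rank * d_out  # A is d_in x rank, B is rank x d_out
    return full, lora


full, lora = lora_param_counts(768, 768, 8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 589824 12288 2.08%
```

For this shape, the two low-rank matrices hold about 2% of the parameters that a full update of the same matrix would touch.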


In [12]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

### 1. Implement LoRA from scratch on a BERT model

In this section, we will implement Low-Rank Adaptation (LoRA) on a BERT model from scratch to better understand the concept and its benefits. We will use the `transformers` library to load the pre-trained BERT model and then modify its attention layers to include low-rank matrices. We will then train the modified BERT model on a downstream task to observe the efficiency of LoRA compared to standard fine-tuning.

In [26]:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

In [34]:
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [28]:
print_trainable_parameters(model)

trainable params: 109482240 || all params: 109482240 || trainable%: 100.00


In [40]:
# Step 2: Define the LoRA class
class LoRA(nn.Module):
    def __init__(self, original_layer, rank=8):
        super(LoRA, self).__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.in_features = original_layer.in_features
        self.out_features = original_layer.out_features

        # Low-rank matrices A and B
        self.A = nn.Parameter(torch.zeros(self.in_features, rank))
        self.B = nn.Parameter(torch.zeros(rank, self.out_features))

        self.reset_parameters()

    def reset_parameters(self):
        # A gets a Gaussian init; B stays at zero, so the adapter
        # contributes nothing until training updates B
        nn.init.normal_(self.A)

    def forward(self, x):
        # The output is the original layer output plus the low-rank adaptation
        lora_output = torch.matmul(x, self.A)
        lora_output = torch.matmul(lora_output, self.B)
        return self.original_layer(x) + lora_output
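
Because `B` starts at zero, the wrapped layer should initially behave exactly like the original one. The sketch below checks that idea in plain Python on tiny made-up matrices (no PyTorch, bias omitted); the helpers `matmul` and `add` and all the values are illustrative assumptions, not part of the notebook.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


x = [[1.0, 2.0, 3.0]]                       # one input row, d_in = 3
W = [[0.5, -1.0], [0.0, 2.0], [1.0, 1.0]]   # frozen weight, 3 x 2
A = [[0.3], [-0.7], [1.2]]                  # 3 x 1, stands in for the random init
B = [[0.0, 0.0]]                            # 1 x 2, zero init

base = matmul(x, W)                             # original layer output
adapted = add(base, matmul(matmul(x, A), B))    # original output plus x @ A @ B
print(adapted == base)                          # True
```

Once training moves `B` away from zero, the added term `x @ A @ B` is a rank-1 correction here (rank 8 in the notebook's `LoRA` class).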

In [41]:
# Step 3: Apply LoRA to BERT's attention layers (query, key, value projections)
# Loop through each layer of BERT and replace query, key, and value with LoRA
for layer in model.encoder.layer:
    layer.attention.self.query = LoRA(layer.attention.self.query)
    layer.attention.self.key = LoRA(layer.attention.self.key)
    layer.attention.self.value = LoRA(layer.attention.self.value)

Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], requires_grad=True)
Parameter containing:
tensor([[-2.9342e-01,  2.6796e+00,  6.6422e-01,  ..., -7.5667e-01,
         -1.1129e+00,  5.1736e-02],
        [-1.3903e+00, -2.3914e+00,  1.1809e+00,  ..., -1.7479e+00,
         -7.5823e-01,  1.1960e+00],
        [-9.9527e-01, -1.4658e+00,  3.8077e-01,  ..., -1.4435e-02,
          6.3799e-01,  7.2490e-01],
        ...,
        [-4.2331e-01, -3.2786e-01,  2.1527e-01,  ...,  1.2649e-01,
          6.9582e-01,  5.4613e-01],
        [ 7.5443e-01,  3.7856e-02,  4.5712e-01,  ..., -8.7931e-02,
         -1.0759e+00, -7.4353e-01],
        [-6.1606e-01, -4.2132e-01, -1.1763e+00,  ..., -1.3303e+00,
         -2.4752e-03, -1.4127e+00]], requires_grad=True)
Parameter containing:
tensor([

In [None]:
# Step 4: Freeze all parameters except the LoRA parameters

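# Freeze every parameter, including the wrapped BERT projections
for param in model.parameters():
    param.requires_grad = False
# Unfreeze only the low-rank matrices A and B.
# layer.parameters() would also unfreeze original_layer's weight and bias
# (that is what gives the 19.74% trainable share recorded later),
# so select A and B explicitly.
for layer in model.modules():
    if isinstance(layer, LoRA):
        layer.A.requires_grad = True
        layer.B.requires_grad = True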

In [None]:
# Step 5: Define a simple training loop
# Example data
texts = ["Example sentence one.", "Example sentence two."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

In [None]:
# Training parameters
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
loss_fn = nn.MSELoss()  # Placeholder loss function for demonstration

# Dummy target (e.g., embedding from a pre-trained model) for demonstration
target = torch.rand((inputs['input_ids'].size(0), model.config.hidden_size))

In [29]:
# Training loop
model.train()
for epoch in range(3):  # Small number of epochs for demonstration
    optimizer.zero_grad()
    
    # Forward pass through the model
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Pooling for sentence representation

    # Compute loss and backpropagate
    loss = loss_fn(embeddings, target)  # Loss between embeddings and dummy target
    loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")


Epoch 1, Loss: 0.47616657614707947
Epoch 2, Loss: 0.433392196893692
Epoch 3, Loss: 0.4243190586566925


In [None]:
# Evaluation
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    print("Final Embeddings:", embeddings)

Final Embeddings: tensor([[ 0.1188, -0.0775, -0.3016,  ...,  0.3205,  0.0283, -0.2064],
        [ 0.1355, -0.1366, -0.2743,  ...,  0.3305,  0.0543, -0.2159]])

In [30]:
print_trainable_parameters(model)

trainable params: 21703680 || all params: 109924608 || trainable%: 19.74
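
The 19.74% reported above is higher than the adapter matrices alone would give. The arithmetic below uses only numbers printed in this notebook. It suggests the figure corresponds to the wrapped query, key, and value `nn.Linear` weights and biases being trainable together with `A` and `B`. That happens when every parameter of each `LoRA` module is unfrozen, since `LoRA.parameters()` also yields the parameters of `original_layer`.

```python
hidden, rank = 768, 8
wrapped = 12 * 3  # 12 encoder layers x (query, key, value)

lora_per_module = hidden * rank + rank * hidden  # A and B
base_per_module = hidden * hidden + hidden       # nn.Linear weight + bias

total = 109_482_240 + wrapped * lora_per_module  # original BERT + added adapters
with_base = wrapped * (lora_per_module + base_per_module)
lora_only = wrapped * lora_per_module

print(total, with_base, lora_only)  # 109924608 21703680 442368
print(f"{100 * with_base / total:.2f}% vs {100 * lora_only / total:.2f}%")  # 19.74% vs 0.40%
```

With only `A` and `B` trainable, the count would be 442,368 parameters, about 0.40% of the model.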


### 2. Implement LoRA with Hugging Face PEFT

In [20]:
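from peft import LoraConfig, get_peft_model

# Load a fresh BERT so PEFT wraps the original nn.Linear projections,
# not the hand-written LoRA modules from section 1
# (copy(model) is a shallow copy and would share those modules)
base_model = AutoModel.from_pretrained(model_name)

config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
)

# get_peft_model injects the adapters and returns a PeftModel wrapper
lora_model = get_peft_model(base_model, config)
lora_model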

PeftModel(
  (base_model): LoraModel(
    (model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=32, bias=False)
                  )
                  (lora_B): Modu

In [21]:
print_trainable_parameters(lora_model)

trainable params: 1179648 || all params: 110661888 || trainable%: 1.07
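
As a sanity check on the count above, the sketch below works backwards from the reported 1,179,648 trainable parameters. The per-module shapes come from the printed `lora_A` layer (768 to 32). That the adapters sit on the query and value projections is an assumption here (PEFT's default targets for BERT, as far as I know), since the printed model is truncated before it shows the other projections.

```python
hidden, r, lora_alpha = 768, 32, 32

per_module = hidden * r + r * hidden      # lora_A (768 -> 32) plus lora_B (32 -> 768)
adapted_modules = 1_179_648 // per_module  # how many Linear layers received an adapter

print(per_module, adapted_modules)  # 49152 24
print(lora_alpha / r)               # 1.0, the factor PEFT applies to the low-rank update
```

Twenty-four adapted modules is consistent with two projections in each of the 12 encoder layers.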
