## LoRA: Low-Rank Adaptation of Large Language Models

### **Introduction to LoRA**

LoRA adapts a pre-trained language model by adding low-rank matrices to some of its weight matrices, which reduces the number of parameters that need to be updated. This saves memory and computation, making it well suited to large models. Only the low-rank parameters are trained, while the rest of the model remains frozen. In this way, we can adapt the model to a specific task without fine-tuning all of its weights.
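
As a sketch in symbols, the idea can be written as follows. The notation is an assumption of ours and does not come from the notebook's code: the frozen linear layer is written as `x W_0 + b`, with `W_0` of shape `d_in x d_out`, and `r` is the chosen rank.

```latex
h = x W_0 + b + x A B,
\qquad A \in \mathbb{R}^{d_{in} \times r},\quad
B \in \mathbb{R}^{r \times d_{out}},\quad
r \ll \min(d_{in}, d_{out})
```

The product `A B` has rank at most `r` and costs `r * (d_in + d_out)` trainable parameters instead of `d_in * d_out`. For a 768x768 projection with `r = 8`, that is 12,288 instead of 589,824 weights, roughly 2%.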


In this exercise, we will manually add the low-rank (trainable) matrices to an existing model (frozen). We will use BERT. 

In [1]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification



In [2]:
# From https://github.com/huggingface/peft/issues/41#issuecomment-1404611868
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

### 1. Implement LoRA from scratch on a BERT model

In this section, we will implement Low-Rank Adaptation (LoRA) on a BERT model from scratch to better understand the concept and its benefits. We will use the `transformers` library to load the pre-trained BERT model and then modify its attention layers to include low-rank matrices. 

We will then train the modified BERT model on a downstream task to observe the efficiency of LoRA compared to standard fine-tuning. We will reuse the training pipeline from `lab03` to fine-tune BERT for sentiment classification on the IMDB dataset, so that we can easily compare the results with the previous ones.

In [3]:
model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
print_trainable_parameters(model)

trainable params: 109483778 || all params: 109483778 || trainable%: 100.00


We create a `LoRA` class, which inherits from `nn.Module`, the base class for all neural network modules in PyTorch. The constructor takes an `original_layer` (e.g., a linear layer from BERT) and a `rank` parameter that determines the rank of the low-rank matrices. It initializes two low-rank matrices `A` and `B`, which will be used for the adaptation. The dimensions of these matrices are determined by the input and output features of the original layer.

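`A` and `B` are initialized so that their product is zero, following the idea in the original LoRA paper (https://arxiv.org/abs/2106.09685): at the start of training, the adapted layer behaves exactly like the original one. Note that the roles are swapped with respect to the paper, which draws the down-projection from a Gaussian and sets the up-projection to zero; here the down-projection `A` is zero and the up-projection `B` is random.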

The `forward` method defines how the input `x` is processed through the LoRA layer. The input is multiplied by the low-rank matrix `A` to create a low-rank representation, which is then multiplied by the low-rank matrix `B` to obtain the adapted output. Finally, the output of the original layer is combined with the LoRA output.

In [5]:
class LoRA(nn.Module):
    def __init__(self, original_layer, rank=8):
        super(LoRA, self).__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.in_features = original_layer.in_features
        self.out_features = original_layer.out_features

        # Initialize the Low-rank matrices A and B
        self.A = nn.Parameter(torch.zeros(self.in_features, rank))
        self.B = nn.Parameter(torch.randn(size=(rank, self.out_features)))

    def forward(self, x):
        # The output is the original layer output plus the low-rank adaptation

        # LoRA output
        lora_output = torch.matmul(x, self.A)
        lora_output = torch.matmul(lora_output, self.B)

        # layer output, which combines the original output with the LoRA one 
        return self.original_layer(x) + lora_output
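
As a quick sanity check, the sketch below wraps a stand-alone `nn.Linear` (a stand-in for one BERT projection, not the real model) and checks that the wrapped layer reproduces the original output at initialization. The class is a compact copy of the one above, repeated only so that the snippet runs on its own; the tensor shapes and the seed are arbitrary choices.

```python
import torch
import torch.nn as nn


class LoRA(nn.Module):
    """Compact copy of the wrapper above, so this snippet is self-contained."""

    def __init__(self, original_layer, rank=8):
        super().__init__()
        self.original_layer = original_layer
        self.A = nn.Parameter(torch.zeros(original_layer.in_features, rank))
        self.B = nn.Parameter(torch.randn(rank, original_layer.out_features))

    def forward(self, x):
        return self.original_layer(x) + x @ self.A @ self.B


torch.manual_seed(0)
base = nn.Linear(768, 768)       # stand-in for a BERT projection
wrapped = LoRA(base, rank=8)
x = torch.randn(4, 16, 768)      # (batch, sequence, hidden)

# With A = 0 the low-rank term vanishes, so both outputs coincide.
with torch.no_grad():
    same_at_init = torch.allclose(wrapped(x), base(x))
print("same output at init:", same_at_init)

# One backward pass shows which matrix receives a gradient first.
wrapped(x).sum().backward()
grad_A = wrapped.A.grad.abs().sum().item()
grad_B = wrapped.B.grad.abs().sum().item()
print("A receives gradient:", grad_A > 0)
print("B gradient is zero: ", grad_B == 0)
```

Because `x @ A` is zero at initialization, the gradient with respect to `B` is zero on the very first step; `B` only starts to move once `A` has been updated.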

We can choose a list of modules that we want to adapt with LoRA. We will update all linear layers found in the transformer architecture of BERT. 

In [6]:

for layer in model.bert.encoder.layer:
    layer.attention.self.query = LoRA(layer.attention.self.query)
    layer.attention.self.key = LoRA(layer.attention.self.key)
    layer.attention.self.value = LoRA(layer.attention.self.value)
    layer.attention.output.dense = LoRA(layer.attention.output.dense)
    layer.intermediate.dense = LoRA(layer.intermediate.dense)
    layer.output.dense = LoRA(layer.output.dense)

Next, we need to make sure that the model is frozen. We simply freeze all layers by going through the parameters of the model and setting their `requires_grad` attribute to `False`.

Then, only for LoRA modules, we unfreeze `A` and `B`. Note that this leaves the newly initialized classification head frozen as well.

In [7]:
# Freeze all parameters except the LoRA parameters

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False 
    
# Unfreeze only LoRA parameters
for layer in model.modules():
    if isinstance(layer, LoRA):
        layer.A.requires_grad = True
        layer.B.requires_grad = True

Remember, we are adding low-rank matrices for:
- query, key, value, and attention output: each of these 4 layers (768x768) gets two matrices with 768 * 8 parameters (=> 768 * 8 * 2 * 4 = 49,152)
- the two feed-forward layers (768x3072 and 3072x768): each gets one 768x8 and one 3072x8 matrix => 2 * (768 * 8 + 3072 * 8) = 61,440

Repeating this for 12 layers should give us 12 * (49,152 + 61,440) = 1,327,104 trainable parameters.

We can easily verify if that's the case.

In [8]:
print_trainable_parameters(model)

trainable params: 1327104 || all params: 110810882 || trainable%: 1.20
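
The same count can be reproduced from the layer shapes alone. The sketch below assumes the `bert-base-uncased` dimensions (hidden size 768, intermediate size 3072, 12 encoder layers); the helper name `lora_params` is ours.

```python
def lora_params(d_in, d_out, rank):
    """Parameters of one adapter: A is (d_in, rank), B is (rank, d_out)."""
    return d_in * rank + rank * d_out


hidden, intermediate, rank, num_layers = 768, 3072, 8, 12

# query, key, value and attention output are all hidden -> hidden
attention = 4 * lora_params(hidden, hidden, rank)

# feed-forward block: hidden -> intermediate, then intermediate -> hidden
feed_forward = (lora_params(hidden, intermediate, rank)
                + lora_params(intermediate, hidden, rank))

total = num_layers * (attention + feed_forward)
print(attention, feed_forward, total)  # 49152 61440 1327104
```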


From this point onwards, the classic training pipeline can be applied to the model!

In [9]:
from datasets import load_dataset

# Load a sentiment analysis dataset
dataset = load_dataset('imdb')
train_dataset = dataset['train'].shuffle(seed=42).select(range(2000))
test_dataset = dataset['test'].shuffle(seed=42).select(range(1000))

In [10]:
from sklearn.metrics import accuracy_score

# Function to compute accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}

In [11]:
# Tokenize the dataset
def tokenize_function(sample):
    return tokenizer(sample['text'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 2000/2000 [00:05<00:00, 362.75 examples/s]
Map: 100%|██████████| 1000/1000 [00:02<00:00, 374.79 examples/s]


In [12]:
from transformers import Trainer, TrainingArguments

batch_size = 32
num_train_epochs = 1

learning_rate = 2e-4
weight_decay = 0.01

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="steps",
    eval_steps=10,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=10,  # Log every 10 steps
)

# Initialize the Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

In [13]:
# Evaluate the model
results = trainer.evaluate()
print(f"Accuracy on the validation set: {results['eval_accuracy']:.4f}")



Accuracy on the validation set: 0.5130


In [14]:
results = trainer.train()



Step,Training Loss,Validation Loss,Accuracy
10,0.6691,0.568748,0.776
20,0.5543,0.446899,0.856
30,0.4413,0.420437,0.856




In [15]:
# Evaluate the model
results = trainer.evaluate()
print(f"Accuracy on the validation set: {results['eval_accuracy']:.4f}")



Accuracy on the validation set: 0.8640


### 2. Using LoRA with Hugging Face Transformers

Instead of implementing Low-Rank Adaptation (LoRA) from scratch, we can apply it to a model automatically with Hugging Face's **PEFT** library and its `LoraConfig` class.

In [16]:
model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**PEFT (Parameter-Efficient Fine-Tuning)** is a framework within Hugging Face's ecosystem designed to enable efficient fine-tuning of large language models. PEFT supports various parameter-efficient techniques, including LoRA, Prefix Tuning, and Adapter Layers, to adapt pre-trained models to specific tasks without requiring extensive training or memory resources.

We create a PEFT configuration object (in this case, `LoraConfig`, since we want to apply LoRA). We specify some parameters:
- `r`: the rank of the low-rank matrices.
- `lora_alpha`: a scaling parameter used in LoRA.
- `lora_dropout`: the dropout rate for the dropout layers introduced in LoRA.
- `target_modules`: a list of module names that we want to adapt with LoRA. In this case, we adapt all linear layers found in the transformer architecture of BERT (the names of the modules can be found by inspecting the model).
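
As a sketch of how these settings enter an adapted layer, a LoRA layer configured this way computes roughly the following. The notation is an assumption of ours: `W_0` and `b` are the frozen weights, `A` and `B` are the low-rank matrices in the row-vector convention `x A B`, and `\alpha` stands for `lora_alpha`.

```latex
h = x W_0 + b + \frac{\alpha}{r}\,\mathrm{dropout}(x)\, A B,
\qquad \frac{\alpha}{r} = \frac{32}{8} = 4
\quad \text{for } r = 8,\ \alpha = 32
```

The values `r = 8` and `lora_alpha = 32` are the ones used in the configuration below. Dropout is only active in training mode, so at inference time the adapter term reduces to a scaled `x A B`.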

We instantiate a `PeftModelForSequenceClassification` model. 

In [17]:
from peft import LoraConfig, get_peft_model, PeftModelForSequenceClassification

config = LoraConfig(
    r=8,
    lora_alpha=32, 
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],
)

peft_model = PeftModelForSequenceClassification(model, peft_config=config)

In [18]:
type(peft_model)

peft.peft_model.PeftModelForSequenceClassification

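If we look into the model, we find that, indeed, some extra layers have been added (e.g., `lora_A`, `lora_B`). They play the same role as the `A` and `B` matrices we implemented from scratch in the previous section, although PEFT additionally applies dropout to the adapter input and scales the adapter output by `lora_alpha / r`.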

In [19]:
peft_model.bert.encoder.layer[0]

BertLayer(
  (attention): BertAttention(
    (self): BertSdpaSelfAttention(
      (query): lora.Linear(
        (base_layer): Linear(in_features=768, out_features=768, bias=True)
        (lora_dropout): ModuleDict(
          (default): Dropout(p=0.1, inplace=False)
        )
        (lora_A): ModuleDict(
          (default): Linear(in_features=768, out_features=8, bias=False)
        )
        (lora_B): ModuleDict(
          (default): Linear(in_features=8, out_features=768, bias=False)
        )
        (lora_embedding_A): ParameterDict()
        (lora_embedding_B): ParameterDict()
        (lora_magnitude_vector): ModuleDict()
      )
      (key): lora.Linear(
        (base_layer): Linear(in_features=768, out_features=768, bias=True)
        (lora_dropout): ModuleDict(
          (default): Dropout(p=0.1, inplace=False)
        )
        (lora_A): ModuleDict(
          (default): Linear(in_features=768, out_features=8, bias=False)
        )
        (lora_B): ModuleDict(
          (defa

We see that the number of trainable parameters is close to what we expected: slightly higher than the 1,327,104 obtained with the from-scratch implementation.

In [20]:
peft_model.print_trainable_parameters()

trainable params: 1,340,930 || all params: 110,824,708 || trainable%: 1.2100
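
The small gap to the from-scratch count (1,340,930 versus 1,327,104) can be accounted for with a short calculation. The sketch below rests on two assumptions that the printed output does not confirm directly: that the `"dense"` entry in `target_modules` also matches `bert.pooler.dense`, and that `PeftModelForSequenceClassification` keeps a trainable copy of the classification head.

```python
hidden, rank, num_labels = 768, 8, 2

# LoRA parameters in the 12 encoder layers (same as the from-scratch count)
encoder_lora = 1_327_104

# Assumption: "dense" also matches the 768 -> 768 pooler layer
pooler_lora = hidden * rank + rank * hidden

# Assumption: the classification head (weight + bias) stays trainable
classifier_head = hidden * num_labels + num_labels

total_trainable = encoder_lora + pooler_lora + classifier_head
print(pooler_lora, classifier_head, total_trainable)  # 12288 1538 1340930
```

Under these assumptions the total matches the number reported by `print_trainable_parameters()`.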


Much like before, we can now run the training! We will reuse some of the objects already created for the previous part (e.g., functions to compute metrics, datasets, training arguments). 

In [21]:
from transformers import Trainer, TrainingArguments

# Initialize the Trainer object
trainer = Trainer(
    model=peft_model,
    args=training_args, # we will recycle the same training arguments as before!
    train_dataset=train_dataset, # also datasets, 
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics # and compute_metrics
)

In [22]:
trainer.train()



Step,Training Loss,Validation Loss,Accuracy
10,0.6772,0.640212,0.689
20,0.626,0.572183,0.759
30,0.5418,0.516855,0.769




TrainOutput(global_step=32, training_loss=0.6135382913053036, metrics={'train_runtime': 41.9036, 'train_samples_per_second': 47.729, 'train_steps_per_second': 0.764, 'total_flos': 534460784640000.0, 'train_loss': 0.6135382913053036, 'epoch': 1.0})
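
The `global_step=32` in the output above can be related to the dataset and batch size with a small calculation. This is only a sketch: the number of devices is not stated anywhere in the notebook, so the value 2 below is an assumption chosen because it is consistent with the logged step count, and the helper name `steps_per_epoch` is ours.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))


single_device = steps_per_epoch(2000, 32, num_devices=1)
two_devices = steps_per_epoch(2000, 32, num_devices=2)  # assumed setup
print(single_device, two_devices)  # 63 32
```

With `eval_steps=10`, a 32-step run is evaluated at steps 10, 20, and 30, which matches the three rows in the table above.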

In [23]:
# Evaluate the model
results = trainer.evaluate()
print(f"Accuracy on the validation set: {results['eval_accuracy']:.4f}")



Accuracy on the validation set: 0.7750
