## Week 6: LoRA (Low Rank Adaptation)

Welcome to this week's lab. We will be learning about [LoRA](https://arxiv.org/abs/2106.09685) which has been quite popular lately!

Objectives in this lab are as follow:

1. Better understanding of LoRA through code implementation.
2. Minimal implementation of LoRA for training.

Have fun! 😊

In [1]:
import copy
import math
from typing import Optional, List

import pandas as pd
import datasets

import torch
import torch.nn as nn
import torch.nn.functional as F

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

### Helper Function

In [36]:
def generate(tokenizer, model, system_prompt, user_prompt, device='cuda', max_new_tokens=256):
    # Format our input to model's instruction-following format
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Tokenize input text
    input_ids = tokenizer([text], return_tensors="pt").to(device).input_ids
    # Pass the input to model
    generated_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
    )
    # Parse output
    output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return output

# For training dataset
def convert_to_qwen_format(data, tokenizer):
    messages = [
        {"role": "user", "content": data['user']},
        {"role": "assistant", "content": data['assistant']}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return text

def tokenize(x, tokenizer):
    return tokenizer(x['text'])

def prepare_dataset(datapath, tokenizer):
    df = pd.read_csv(datapath)
    df['text'] = df.apply(lambda x: convert_to_qwen_format(x, tokenizer), axis=1)
    train_data = datasets.Dataset.from_pandas(df)
    train_dataset = train_data.map(lambda x: tokenize(x, tokenizer), batched=True, batch_size=16, remove_columns=df.columns.to_list())
    return train_dataset

### LoRA Basics

<img src="https://docs.adapterhub.ml/_images/lora.png" alt="lora-diagram" width="200"/>

Denote each hidden layer in our model as $h = Wx$, where $x$ and $W$ is the input and the weight of the layer.

LoRA modules reparametrize $h$, so that $h = Wx + \frac{\alpha}{r}BAx$.

$A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$

In [3]:
# Suppose we have the following model
class MyModel(nn.Module):

    def __init__(
        self,
    ):
        super().__init__()
        
        self.W0 = nn.Linear(256, 128)
        self.W1 = nn.Linear(128, 64)
        self.W2 = nn.Linear(64, 1)

    def forward(
        self,
        x,
    ):
        h0 = self.W0(x)
        h1 = self.W1(h0)
        h2 = self.W2(h1)
        return h2

In [4]:
# Say we want to inject LoRA modules in the hidden layers
# in layer "W1". How we were about to do that?
class MyModelWithLoRA(nn.Module):

    def __init__(
        self,
        input_size, r, out_size, a
    ):
        super().__init__()
        
        self.W0 = nn.Linear(256, 128)
        self.W1 = nn.Linear(128, 64)
        self.W2 = nn.Linear(64, 1)

        self.input_size = input_size
        self.a = a
        self.r = r
        self.k = out_size
        self.A = nn.Linear(input_size, r)
        self.B = nn.Linear(r, out_size)

    def forward(
        self,
        x,
    ):
        h0 = self.W0(x)
        h1 = ...
        h2 = self.W2(h1)
        return h2

In [5]:
# Forward-pass test
my_model = MyModel()
my_model_w_lora = MyModelWithLoRA(input_size=..., r=..., out_size=..., a=...)

with torch.no_grad():
    x = torch.randn([2, 256])
    
    print(my_model(x).shape, my_model_w_lora(x).shape)

torch.Size([2, 1]) torch.Size([2, 1])


### Our Base Model

In [21]:
# Let's prepare our base model first.
# We will use an instruction-following LLM, Qwen, as our model.
device = "cuda" 
model_name = "Qwen/Qwen1.5-0.5B-Chat" 

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
_ = model.to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
# As our model is instruction-following
# we can perform prompt / instruction-based text generation as follows:
system_prompt = "You are a helpful AI companion."
user_prompt = "What is the difference between anxiousness and discomfort? Answer it concisely."
print(generate(system_prompt, user_prompt))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


system
You are a helpful AI companion.
user
What is the difference between anxiousness and discomfort? Answer it concisely.
assistant
Anxiety and discomfort can be defined as two different emotions that are often associated with feeling nervous or uneasy about something. Anxiety is characterized by feelings of unease, fear, or worry about the future, while discomfort is characterized by physical symptoms such as pain, pain in the joints, and difficulty moving. While anxiety can sometimes be related to certain situations or events, discomfort is more likely to arise when an individual is facing something that they feel uncomfortable with or uncertain about.


In [8]:
# Let's examine the model's architecture
# our base model is a transformer-based decoder-only LLM 
# in this type of model, it is common to inject the LoRA (or other adapter)
# into its "embedding" layers or linear layers within the transformer blocks.
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(i

### LoRA for Embedding Layer

Reminder: LoRA modules reparametrize $h$, so that $h = Wx + \frac{\alpha}{r}BAx$.

$A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$

In [15]:
# Let's implement a LoRA module for embedding layer
class EmbeddingLoRA(nn.Module):
    
    def __init__(
        self,
        input_size, r, out_size, a,
        W,
    ):
        super().__init__()

        # Original weights
        self.W = W

        # Defining LoRA layers
        self.d = input_size
        self.a = a
        self.r = r
        self.k = out_size
        self.A = ...
        self.B = ...

        self.use_lora = True
        
    def forward(
        self,
        x,
    ):
        # Forward-pass
        h =  self.W(x)
        if self.use_lora:
            ...
        return h

    def save_lora_weights(
        self,  
        savepath,
    ):
        torch.save({"A": A.state_dict(), "B": B.state_dict()}, savepath)

    def load_weights(
        self,
        loadpath,
    ):
        weights = torch.load(loadpath, map_location="cpu").to(self.device)
        self.A.load_state_dict(weights["A"])
        self.B.load_state_dict(weights["B"])

In [22]:
# Create LoRA embedding layer
lora_embedding = EmbeddingLoRA(
    input_size=...,
    k=...,
    out_size=...,
    a=...,
    W=...,
)
_ = lora_embedding.to(device)

In [23]:
# Inject LoRA embedding to base model
model.model.embed_tokens = lora_embedding

In [24]:
# Inspect model
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): EmbeddingLoRA(
      (W): Embedding(151936, 1024)
      (A): Embedding(151936, 128)
      (B): Linear(in_features=128, out_features=1024, bias=True)
    )
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Q

In [None]:
# Freeze all layers 
for params in model.parameters():
    params.requires_grad = False
# Freeze all layers but LoRA
for name, params in model.model.embed_tokens.named_parameters():
    if "W" not in name:
        params.requires_grad = True

total_params = sum(param.numel() for param in model.parameters())
trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
print(f"Total Params    : {total_params:,}")
print(f"Trainable Params: {trainable_params:,}")

In [None]:
# Let's try to train the model

# Fetch the dataset
train_dataset = prepare_dataset("mbzuai.csv")

# Trainer
training_args = TrainingArguments(
    output_dir="./test-mbz-data",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    save_total_limit=1,
    learning_rate=1e-4
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=train_dataset,
)

trainer.train()

In [39]:
# Generate some text
system_prompt = "You are a helpful AI companion."
user_prompt = "What is mbzuai?"
print(generate(system_prompt, user_prompt))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


system
You are a helpful AI companion.
user
What is mbzuai?
assistant
mbzuai is a popular web scraping tool that allows users to extract data from websites using their web browsers. It can be used for various purposes, such as data collection, website analysis, and user profiling.


Seems like the result is not quite good. Let's explore injecting LoRA into Linear layers!

### LoRA for Linear Layer

In [40]:
# Let's implement a LoRA module for linear layer
class LinearLoRA(nn.Module):
    
    def __init__(
        self,
        input_size, r, out_size, a,
        W,
    ):
        super().__init__()

        # Original weights
        self.W = W

        # Defining LoRA layers
        self.input_size = input_size
        self.a = a
        self.r = r
        self.out_size = out_size
        self.A = ...
        self.B = ...

        self.use_lora = True
        
    def forward(
        self,
        x,
    ):
        # Forward-pass
        h =  self.W(x)
        if self.use_lora:
            ...
        return h

    def save_lora_weights(
        self,  
        savepath,
    ):
        torch.save({"A": A.state_dict(), "B": B.state_dict()}, savepath)

    def load_weights(
        self,
        loadpath,
    ):
        weights = torch.load(loadpath, map_location="cpu").to(self.device)
        self.A.load_state_dict(weights["A"])
        self.B.load_state_dict(weights["B"])

In [41]:
# Injecting linear LoRA recusively to each layers
for idx in range(len(model.model.layers)):
    ...

In [43]:
# Now, our model has been injected with more LoRAs
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): EmbeddingLoRA(
      (W): Embedding(151936, 1024)
      (A): Embedding(151936, 128)
      (B): Linear(in_features=128, out_features=1024, bias=True)
    )
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): LinearLoRA(
            (W): Linear(in_features=1024, out_features=1024, bias=True)
            (A): Linear(in_features=1024, out_features=128, bias=True)
            (B): Linear(in_features=128, out_features=1024, bias=True)
          )
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_feature

In [44]:
# Freeze all layers 
for params in model.parameters():
    params.requires_grad = False
# Freeze all layers but LoRA
for name, params in model.model.embed_tokens.named_parameters():
    if "W" not in name:
        params.requires_grad = True
for idx in range(len(model.model.layers)):
    ...

total_params = sum(param.numel() for param in model.parameters())
trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
print(f"Total Params    : {total_params:,}")
print(f"Trainable Params: {trainable_params:,}")

Total Params    : 489,886,720
Trainable Params: 25,899,008


In [55]:
# Let's try to train the model

# Fetch the dataset
train_dataset = prepare_dataset("mbzuai.csv")

# Trainer
training_args = TrainingArguments(
    output_dir="./test-mbz-data",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_device_train_batch_size=16,
    save_total_limit=1,
    learning_rate=1e-4
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=train_dataset,
)

trainer.train()

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Step,Training Loss


TrainOutput(global_step=100, training_loss=4.149049682617187, metrics={'train_runtime': 19.2214, 'train_samples_per_second': 65.032, 'train_steps_per_second': 5.203, 'total_flos': 149886152644608.0, 'train_loss': 4.149049682617187, 'epoch': 50.0})

In [59]:
# Generate some text
system_prompt = "You are a helpful AI companion."
user_prompt = "What is MBZUAI?"
print(generate(system_prompt, user_prompt))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


system
You are a helpful AI companion.
user
What is MBZUAI?
assistant
MBZUAI stands for Membangun Anak Kewangan Uji (Malaysian的意思是“儿童教育中心”。它是一个非营利组织，旨在为当地贫困家庭的孩子提供免费的教育机会。它的目标是在贫困地区建立一个学习环境，并确保每个孩子都有接受良好教育的机会。


### Merging LoRA Weights

We can merge the LoRA weights with the original pretrained weights.

As,  $h = Wx + \frac{\alpha}{r}BAx = (W + \frac{\alpha}{r}BA) x$


In [None]:
# Try to create the function to merge weights
# and integrate it to the code

def merge_weights(self):
    pass