<h1 align="center" style="color:green;font-size: 3em;">
Implementing Fine-tuning Techniques</h1>

Implementing various fine-tuning methods as described in different papers, specifically LoRA and IA3.

Part 2:

In this notebook, we will:
- Inject LoRA adapters into our model
- Fine-tune our LoRA adapters

### Install dependencies

In [1]:
%pip install datasets -q

Note: you may need to restart the kernel to use updated packages.


### Import Libraries

In [1]:
# importing required libraries
import torch
import torch.nn as nn
import collections
import random
import numpy as np
import math
import matplotlib.pyplot as plt
import warnings

from torch.optim import AdamW
from typing import List
from torch.nn import functional as F
from tqdm import tqdm
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    T5ForSequenceClassification,
    T5Tokenizer,
    Trainer,
    TrainingArguments,
)
from torch.utils.data import DataLoader

warnings.simplefilter("ignore")
print(torch.__version__)

2.6.0+cu124


In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### LoRA Adapters

In this section, we will implement LoRA (Low-Rank Adaptation) and inject it into our causal model. Specifically, we will inject LoRA into the **key, query, and value** matrices of each transformer block.

Recall from the LoRA paper that LoRA makes fine-tuning more efficient by freezing the pretrained weights instead of retraining them. For each adapted weight matrix, it trains two much smaller matrices, A and B, whose product is a low-rank update added to the frozen weight. This greatly reduces the number of trainable parameters and the optimizer memory they require, and the paper reports quality on par with full fine-tuning on its benchmarks.
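In the paper's notation, with a frozen pretrained weight $W_0 \in \mathbb{R}^{d \times k}$, the adapted forward pass can be written as below. The factor $\alpha / r$ corresponds to `self.scaling` in the class defined later in this notebook.

$$
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ receive gradients; $W_0$ stays fixed.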

For more information, read the [paper](https://arxiv.org/pdf/2106.09685).

By using LoRA in our causal model, we aim to achieve efficient fine-tuning with minimal computational cost, focusing on the key, query, and value matrices within each transformer block.
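To get a feel for the savings, here is a rough back-of-the-envelope sketch. The helper name `lora_param_count` and the rank `r = 8` are assumptions made for illustration; 768 is the width of the OPT-125m projections printed later in this notebook.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r, in), B is (out, r)."""
    return r * in_features + out_features * r

# Assumed sizes for illustration: a 768x768 projection and rank r = 8
d = 768
full = d * d                        # weights in the dense projection (bias excluded)
lora = lora_param_count(d, d, r=8)  # weights in A and B together
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

Under these assumptions, each adapted projection trains about 2% as many weights as the dense matrix it augments.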

### LoRA class

First, let's implement the LoRA class based on how it is defined in the paper.

In [3]:
class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x

class LoRAAdapter(nn.Module, LoRALayer):
    def __init__(
        self,
        existing_layer: nn.Module,
        in_features,
        out_features,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        **kwargs
    ):
        nn.Module.__init__(self)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout)
        self.existing_layer = existing_layer


        self.r = r # Rank of LoRA Adapter
        if r > 0:
            self.lora_A = nn.Parameter(torch.randn(r, in_features))
            self.lora_B = nn.Parameter(torch.zeros(out_features, r))
            self.scaling = self.lora_alpha / self.r

        self.reset_parameters()

    # Initialize A with a random Gaussian and B with zeros, as in the paper, so B @ A = 0 at the start
    def reset_parameters(self):
        if self.r > 0:
            nn.init.normal_(self.lora_A, mean=0.0, std=1.0)
            nn.init.zeros_(self.lora_B)


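    def train(self, mode: bool = True):
        # Delegate to nn.Module.train so the adapter itself, its dropout, and the
        # wrapped layer all switch mode; it returns self, as the nn.Module API expects
        return super().train(mode)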


    def forward(self, x: torch.Tensor):
        # Output of the frozen pretrained layer, computed on the unmodified input
        result = self.existing_layer(x)
        if self.r > 0:
            # Compute the LoRA branch in the adapter's own dtype instead of hard-coding bfloat16
            lora_x = x.to(self.lora_A.dtype)

            # LoRA output: (x @ A^T @ B^T) * scaling; matmul broadcasts over the leading
            # (batch_size, seq_len) dimensions, so no reshaping is needed
            lora_out = (lora_x @ self.lora_A.T @ self.lora_B.T) * self.scaling

            # Optional dropout on the LoRA branch
            lora_out = self.lora_dropout(lora_out)

            # Add the LoRA update to the existing layer's output
            result = result + lora_out.to(result.dtype)
        # If r is zero, this is just the existing layer's output
        return result
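A quick sanity check of the zero initialization: because `lora_B` starts at zero, the LoRA branch contributes nothing before training, so the adapted output should equal the base layer's output. The sketch below reproduces the same arithmetic with plain tensors so it runs on its own; the sizes and the names `base`, `alpha`, and `adapted` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for this check
in_features, out_features, r, alpha = 16, 32, 4, 8

base = nn.Linear(in_features, out_features)     # stands in for a frozen projection
lora_A = torch.randn(r, in_features)            # Gaussian init, as in the class above
lora_B = torch.zeros(out_features, r)           # zero init, so B @ A == 0
scaling = alpha / r

x = torch.randn(2, 5, in_features)              # (batch, seq_len, in_features)
lora_out = (x @ lora_A.T @ lora_B.T) * scaling  # (batch, seq_len, out_features)
adapted = base(x) + lora_out

print(lora_out.shape)
print(torch.equal(adapted, base(x)))
```

The same check can be run against a `LoRAAdapter` instance wrapping an `nn.Linear` right after construction.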

### Inject into the model

Recall that in LoRA we freeze the pre-trained model and train only the adapter weights `lora_A` and `lora_B`.


Here we use the function `mark_only_lora_as_trainable` so that only those weights require gradients.

In [4]:
def mark_only_lora_as_trainable(model: nn.Module) -> None:
    # Freeze all parameters in the model
    for param in model.parameters():
      param.requires_grad = False

    # Enable gradients only for LoRA parameters
    for name, param in model.named_parameters():
      if "lora_A" in name or "lora_B" in name:
        param.requires_grad = True
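One way to confirm the freezing worked is to count total and trainable parameters. The sketch below applies the same name-based rule to a toy model; `ToyAdapter`, `freeze_all_but_lora`, and `count_parameters` are hypothetical names introduced only for this illustration and are not part of the notebook.

```python
import torch
import torch.nn as nn

class ToyAdapter(nn.Module):
    """Hypothetical stand-in for an adapted layer; only the parameter names matter here."""
    def __init__(self, d: int = 8, r: int = 2):
        super().__init__()
        self.existing_layer = nn.Linear(d, d)
        self.lora_A = nn.Parameter(torch.randn(r, d))
        self.lora_B = nn.Parameter(torch.zeros(d, r))

def freeze_all_but_lora(model: nn.Module) -> None:
    # Same rule as mark_only_lora_as_trainable: select parameters by name
    for name, param in model.named_parameters():
        param.requires_grad = "lora_A" in name or "lora_B" in name

def count_parameters(model: nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

toy = nn.Sequential(ToyAdapter(), ToyAdapter())
freeze_all_but_lora(toy)
total, trainable = count_parameters(toy)
print(total, trainable)  # 208 64
```

Each toy adapter holds 72 frozen weights in its linear layer and 32 trainable weights in `lora_A` and `lora_B`, which gives the totals printed above.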

Finally, we want to write the code that will inject the LoRA adapters into our causal model.


`match_submodules`: Returns the names of all submodules in a model whose dotted name contains the specified key as a substring.

`get_submodule`: Retrieves a specific submodule from a model based on its name.

`replace_submodule`: Replaces a specific submodule in a model with a new module at a given path.


`inject_adapter`: Replaces all submodules in a model that match any string in a list with a new module created by an adapter function.

In [5]:
def match_submodules(model: nn.Module, key:str) -> List[str]:
  matching_layers = []
  for name, module in model.named_modules():
    if key in name:
      matching_layers.append(name)
  return matching_layers

def get_submodule(model: nn.Module, module_name:str):
    return model.get_submodule(module_name)

def replace_submodule(model: nn.Module, module_path: str, new_module):
  modules = module_path.split('.')
  parent_module = model
  for sub in modules[:-1]:
    parent_module = getattr(parent_module, sub)
  setattr(parent_module, modules[-1], new_module)

def inject_adapter(model: nn.Module, match_on: List[str], adapter_fn):
  for key in match_on:
    matching_layers = match_submodules(model, key)
    for module_path in matching_layers:
      current_module = get_submodule(model, module_path)
      new_module = adapter_fn(current_module) # New LoRA module
      new_module = new_module.to(current_module.weight.device) # Move to gpu
      replace_submodule(model, module_path, new_module) # Replace
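The replacement pattern can be tried on a small model before touching the real one. The sketch below is a simplified variant, not the notebook's exact helpers: `Wrapped`, `ToyAttention`, and `ToyModel` are hypothetical, and it matches on the name suffix rather than on any substring, which is an assumption made here to avoid also matching children of an already wrapped layer such as `q_proj.existing_layer`.

```python
import torch.nn as nn

class Wrapped(nn.Module):
    """Hypothetical adapter stand-in that simply wraps the original layer."""
    def __init__(self, existing_layer: nn.Module):
        super().__init__()
        self.existing_layer = existing_layer

    def forward(self, x):
        return self.existing_layer(x)

class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.k_proj = nn.Linear(8, 8)
        self.out_proj = nn.Linear(8, 8)

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = ToyAttention()

toy = ToyModel()

# Collect the matches first, then replace, so the module tree is not mutated while iterating
targets = [name for name, _ in toy.named_modules() if name.endswith(("q_proj", "k_proj"))]
for path in targets:
    parent_path, _, child = path.rpartition(".")
    parent = toy.get_submodule(parent_path)  # an empty path returns the model itself
    setattr(parent, child, Wrapped(getattr(parent, child)))

print(targets)
print(type(toy.block.q_proj).__name__, type(toy.block.out_proj).__name__)
```

After the loop, `q_proj` and `k_proj` are wrapped while `out_proj` is left untouched, which mirrors what `inject_adapter` does for each key in `match_on`.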

### Evaluation on a benchmark

Next, we want to inject the LoRA adapter into the causal model we defined earlier. Let's also check how many parameters this model has and how many of them are trainable.


Re-initialize the causal model and check the model architecture.

In [6]:
# Re-initialize the causal model
causal_model_name = "facebook/opt-125m"
causal_model = AutoModelForCausalLM.from_pretrained(causal_model_name, torch_dtype=torch.bfloat16, device_map="auto")
causal_tokenizer = AutoTokenizer.from_pretrained(causal_model_name)

# Check the model architecture
causal_model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,)