# Laymanz Notebooks: Parameter Efficient Fine-Tuning
Author: Ambrose Ling

**What is this notebook about?**

In this notebook, we will go over some of the most fundamental ideas behind parameter efficient fine-tuning, how they work, and why they have been a major advancement in the fields of computer vision and generative artificial intelligence. We hope that you walk away capable of leveraging these techniques to fine-tune your own language models or diffusion models.

**What do I need to set up my environment?**

All of our notebooks use only NumPy and PyTorch, plus Matplotlib for visualizations. We will not use any other third-party libraries for model development, optimization or anything like that.

**How is this notebook structured?**
1.
2.
3.


**Covered papers in this notebook**
* LoRA
* GaLoRE
* DoRA


(will do after finishing)

# What is Parameter Efficient Fine Tuning

Parameter Efficient Fine-Tuning (PEFT) refers to methods that reduce the cost of fine-tuning neural network models.
Businesses, engineers, and researchers may want to fine-tune state of the art language models or vision models to perform very specialized tasks for them.

However, the main bottleneck is that existing fine-tuning methods may involve full network training on these large models, which requires a lot of computational resources.

If I want to fine-tune a model for multiple different tasks, that cost is paid once per task, which is very time consuming and expensive.

## How to estimate the computational resources required for your model?
- Let's assume that your model $\theta$ has 1 billion parameters
- The way we would calculate the memory requirement for fine-tuning such a model is as follows
    - Parameters: 1B $\times$ 4 bytes (32 bits) = 4 GB
    - Gradients: 1B $\times$ 4 bytes (32 bits) = 4 GB
    - Optimizer States: 1B $\times$ 4 bytes (32 bits) $\times$ 2 = 8 GB (depending on the optimizer, Adam saves 2 moment statistics per parameter)
    - Activations: can vary depending on the model
    - Total: about 16 GB before counting activations
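The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the helper name `estimate_finetuning_memory_gb` is made up for this notebook, it assumes 32-bit values and an Adam-style optimizer with 2 states per parameter, and it ignores activations entirely.

```python
def estimate_finetuning_memory_gb(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough memory estimate in GB (10^9 bytes), excluding activations."""
    params = num_params * bytes_per_value                               # model weights
    grads = num_params * bytes_per_value                                # one gradient per weight
    optimizer = num_params * bytes_per_value * optimizer_states_per_param  # e.g. Adam's 2 moments
    to_gb = 1e9
    return {
        "parameters": params / to_gb,
        "gradients": grads / to_gb,
        "optimizer_states": optimizer / to_gb,
        "total": (params + grads + optimizer) / to_gb,
    }

estimate = estimate_finetuning_memory_gb(1_000_000_000)
for name, gb in estimate.items():
    print(f"{name}: {gb:.1f} GB")
```

Changing `bytes_per_value` to 2 gives a feel for how half-precision storage would change the picture, under the same simplifying assumptions.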


In [1]:
#import necessary libraries
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

In [2]:
linear = nn.Linear(10,20)

In [3]:
type(linear)

torch.nn.modules.linear.Linear

## Low Rank Adaptation Fine Tuning (LoRA)

**What is LoRA?**
LoRA stands for Low Rank Adaptation. LoRA parametrizes the weight update matrix $\Delta W$ as a low rank matrix. The authors hypothesize 
that the weight updates learned during fine-tuning are intrinsically low rank.

**What is the goal of LoRA?**

The main goal of LoRA is to enable fine-tuning of primarily Large Language Models (and also Diffusion Models) in a parameter efficient way,
meaning without having to fine-tune or update all the weights of our model. This is desirable because it reduces the computational cost of fine-tuning 
Large Language Models or Diffusion Models on downstream tasks.

**How does it work?**
* Assume we have a pretrained model $\theta$ whose linear layers have weights $W_0 \in R^{d \times k}$
* When we fine-tune our model, we freeze $W_0$, which prevents the pretrained weights from receiving gradient updates
* LoRA decomposes the weight update matrix $\Delta W \in R^{d \times k}$ into 2 matrices $B \in R^{d \times r}$ and $A \in R^{r \times k}$, so that $\Delta W = BA$
* We select the rank $r$, which should be much smaller than $k$ and $d$
* The paper showed that fine-tuning performance using a small $r$ is on par with using a much larger $r$
* This suggests that increasing $r$ does not help the weight update matrix cover more meaningful subspaces
* The update formula during fine-tuning is as follows:
    - $W_{updated}x$ = $W_0x$ + $\Delta Wx$
    - $W_{updated}x$ = $W_0x$ + $\frac{\alpha}{r}(B \cdot A)x$
    - where $r$ refers to the rank of the low rank parametrization and $\alpha$ is a constant scaling factor
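To make the shapes and the parameter savings concrete, here is a small sketch. It assumes the convention $W_0 \in R^{d \times k}$, $B \in R^{d \times r}$, $A \in R^{r \times k}$ with $\Delta W = \frac{\alpha}{r} BA$; the sizes `d`, `k`, `r` and the value of `alpha` are arbitrary numbers picked for illustration, not values from the paper.

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 128, 4      # hypothetical layer size and rank
alpha = 8                 # hypothetical scaling constant

W_0 = torch.randn(d, k)   # frozen pretrained weight
B = torch.randn(d, r)     # trainable (random here only so the rank check is meaningful)
A = torch.randn(r, k)     # trainable

delta_W = (alpha / r) * (B @ A)          # (d, r) @ (r, k) -> (d, k)
x = torch.randn(k)                       # a single input vector
y = W_0 @ x + delta_W @ x                # W_0 x + (alpha / r) B A x

full_params = d * k                      # parameters updated by full fine-tuning
lora_params = r * (d + k)                # parameters updated by LoRA
update_rank = torch.linalg.matrix_rank(delta_W).item()

print(f"full: {full_params}, LoRA: {lora_params}, ratio: {lora_params / full_params:.3f}")
print(f"rank of delta_W: {update_rank} (at most r = {r})")
```

The update matrix has the same shape as $W_0$, but its rank can never exceed $r$, and only $r(d + k)$ values need to be trained instead of $dk$.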


In [None]:
# Let's analyze the dimensionality of the weight matrices
d = 100 # rows 
k = 200 # columns

r = 64 # the chosen low rank (r < min(d, k))
W_0 = torch.randn(d,k) # frozen pretrained weight

A = torch.randn(r,k)
B = torch.randn(d,r)

x = torch.randn(k,k) # k input column vectors, each of size k

delta_W = B @ A # (d,r) @ (r,k) = (d,k)

assert W_0.shape == delta_W.shape

y = (W_0 + delta_W) @ x # (d,k) @ (k,k) = (d,k)


## Why does it work?
- It works because there exists an **intrinsic dimension**
    - Intrinsic dimension means the **minimum number of parameters required to achieve good performance**
    - It also means the **lowest dimensional subspace** we can optimize our objective function in
    - We fine-tune the model using this formula:
    $$
    \theta^D = \theta_0^D + P(\theta^d)
    $$
    where:
    * D is some higher dimension, or the original dimensionality of the pretrained parameters
    * d is some lower dimension, the dimensionality we want to perform optimization in
    * P is the projection function or matrix that transforms the lower dimensional parameters to the original dimensionality
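The reparametrization above can be sketched in a few lines. This is a toy illustration only: `P` is taken to be a fixed random matrix (an assumption made for this sketch), the sizes `D` and `d` are arbitrary, and the objective is a made-up squared error.

```python
import torch

torch.manual_seed(0)
D, d = 1000, 10                       # hypothetical full and reduced dimensionalities

theta_0 = torch.randn(D)              # frozen pretrained parameters, theta_0^D
P = torch.randn(D, d) / d ** 0.5      # fixed random projection (assumption for this sketch)
theta_d = torch.zeros(d, requires_grad=True)   # the only trainable parameters

def full_parameters():
    # theta^D = theta_0^D + P(theta^d)
    return theta_0 + P @ theta_d

target = torch.randn(D)
loss = ((full_parameters() - target) ** 2).mean()  # toy objective
loss.backward()

print(theta_d.grad.shape)   # the gradient lives in the d-dimensional subspace
```

Because `theta_d` starts at zero, the model initially equals the pretrained one, and all optimization happens over `d` values rather than `D`.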


<center><img src="https://miro.medium.com/v2/resize:fit:1400/1*Ckp6US9r8iDrEP9jW3m3VA.png" ></center>

In [None]:
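class LoRA(nn.Module):
    def __init__(self,rank,scale,layer):
        super().__init__()
        self.rank = rank
        self.scale = scale
        self.layer = layer # the pretrained nn.Linear layer we want to adapt
        # Freeze the pretrained weights W_0 (and bias)
        for p in self.layer.parameters():
            p.requires_grad = False
        # a has shape (r, k) and b has shape (d, r), taken from the wrapped layer
        self.a = nn.Parameter(torch.empty(self.rank,layer.in_features))
        self.b = nn.Parameter(torch.empty(layer.out_features,self.rank))
        nn.init.normal_(self.a)
        nn.init.zeros_(self.b) # b starts at zero so the update is zero at the start
    def forward(self,x):
        # x has shape (batch, k): W_0 x + (scale / rank) * B A x
        delta = (x @ self.a.T) @ self.b.T
        return self.layer(x) + delta * (self.scale/self.rank)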
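
The zero initialization of `b` has a useful consequence that can be checked directly: before any training, the adapted layer gives exactly the same output as the pretrained layer. The sketch below is self-contained and uses plain tensors rather than the class above; the layer size, `rank` and `scale` are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 20)            # stands in for a pretrained layer
for p in layer.parameters():
    p.requires_grad = False          # freeze W_0 and the bias

rank, scale = 4, 8                   # hypothetical rank r and scaling alpha
A = nn.Parameter(torch.randn(rank, layer.in_features))    # (r, k), random init
B = nn.Parameter(torch.zeros(layer.out_features, rank))   # (d, r), zero init

x = torch.randn(3, 10)               # batch of 3 inputs
adapted = layer(x) + (x @ A.T @ B.T) * (scale / rank)

# B is zero, so the low rank update contributes nothing yet
print(torch.allclose(adapted, layer(x)))

trainable = A.numel() + B.numel()
frozen = sum(p.numel() for p in layer.parameters())
print(trainable, frozen)
```

Only `A` and `B` receive gradients, so the number of trainable values is set by the rank and not by the size of the frozen layer.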




In [None]:
# SVD decomposition & LoRA

In [None]:
# Applying LoRA to a pretrained language model



## Gradient Low Rank Adaptation Fine Tuning (GaLoRE)

**What is GaLoRE?**
GaLoRE is a recently released paper that aims to tackle certain limitations of vanilla LoRA.
GaLoRE stands for gradient low-rank projection, a training method that still allows you to 
train with full parameters but is more **memory efficient**.


**Some of the main problems with LoRA**
- LoRA does not always reach performance comparable to full-parameter fine-tuning
    - cause 1: the LoRA reparametrization changes the training dynamics
    - cause 2: the optimal weight matrices may not be low rank

**How does it work?**
- GaLoRE is used to fine-tune pretrained language models
- GaLoRE approximates the gradient matrices as low rank rather than the parameter matrices (gradient matrices have been shown to have a slowly changing low-rank structure)
- We compute 2 **projection matrices** $P \in R^{m \times r}$ and $Q \in R^{n \times r}$ to project gradient matrices to **low rank form**, $P^TGQ$
- The projection matrices are updated only very infrequently
- GaLoRE aims to reduce the memory taken by the first-order and second-order gradient statistics kept by the optimizer
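
The projection step can be sketched as follows. Here `P` and `Q` are taken from the top-$r$ singular vectors of the gradient, which is one natural choice and an assumption of this sketch. The gradient `G` is built to be exactly rank $r$ so that the effect is easy to see; real gradients are only approximately low rank. The sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
m, n, r = 64, 128, 4                  # hypothetical matrix size and rank

# A synthetic gradient matrix that is exactly rank r
G = torch.randn(m, r) @ torch.randn(r, n)

U, S, Vh = torch.linalg.svd(G, full_matrices=False)
P = U[:, :r]                          # (m, r) top-r left singular vectors
Q = Vh[:r, :].T                       # (n, r) top-r right singular vectors

G_low = P.T @ G @ Q                   # (r, r) low rank form P^T G Q
G_back = P @ G_low @ Q.T              # (m, n) projected back to the original space

rel_error = ((G - G_back).norm() / G.norm()).item()
print(G_low.shape, G_back.shape)
print(f"relative reconstruction error: {rel_error:.2e}")
```

Optimizer statistics kept for `G_low` need only $r \times r$ entries here, compared with $m \times n$ for the full gradient.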

## Deep dive into how it works

In full rank training, we represent the weight update rule as:
$$
W_T = W_0 + \eta \sum_{t=0}^{T-1} \tilde{G_t} = W_0 + \eta \sum_{t=0}^{T-1} \rho_t (G_t)
$$

* $W_T$ is the weight matrix after $T$ training steps
* $W_0$ is the initial weight matrix
* $\eta$ is the learning rate
* $G_t$ is the gradient matrix at step $t$ (the paper defines it as the negative gradient, which is why the update is added)
* $\tilde{G_t}$ is the final processed gradient matrix to be added to the weight matrix
* $\rho_t$ is a stateful gradient regularizer (the optimizer)


In low-rank training, we represent the weight update rule as:
$$
W_T = W_0 + B_{T}A_{T}
$$
* $B \in R^{m \times r}$ and $A \in R^{r \times n}$ are the learnable low rank adaptors

In GaLoRE, we represent the weight update rule as:
$$
W_T = W_0 + \eta \sum_{t=0}^{T-1} \tilde{G_t}, \quad \tilde{G_t} = P_t\rho_t(P_t^TG_tQ_t)Q_t^T
$$
* $P_t \in R^{m \times r}$ and $Q_t \in R^{n \times r}$ are projection matrices

## Memory consumption comparison
In full rank training:
- Adam: $M,V \in R^{m \times n}$  
- Gradient matrix: $G \in R^{m \times n}$
- Weight matrix: $W \in R^{m \times n}$

Total (weights + optimizer states, excluding the gradient): $3mn$

In LoRA training:
- Adam: $M,V \in R^{m \times r}$ for $B$, $M,V \in R^{r \times n}$ for $A$
- Gradient matrices: $G_B \in R^{m \times r}$, $G_A \in R^{r \times n}$ (the frozen $W$ needs no gradient)
- Weight matrix: $W \in R^{m \times n}$, $B \in R^{m \times r}$, $A \in R^{r \times n}$

Total (weights + optimizer states, excluding the gradients): $mn + 3mr + 3nr$

In GaLoRE training:
- Adam: $M,V \in R^{r \times n}$  
- Projected gradient matrix: $P^TG \in R^{r \times n}$
- Weight matrix: $W \in R^{m \times n}$
- Projection matrices: $P \in R^{m \times r}$

Total (weights + optimizer states including $P$, excluding the gradient): $mn + mr + 2nr$
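
To get a feel for the numbers, the counts can be evaluated for a concrete layer. This sketch counts stored values for weights plus optimizer states only (gradients and activations are left out), and assumes $B \in R^{m \times r}$, $A \in R^{r \times n}$ for LoRA and a single projection $P \in R^{m \times r}$ with Adam states of size $r \times n$ for GaLoRE. The function name and the layer size are made up for illustration.

```python
def memory_elements(m, n, r):
    """Stored values (weights + optimizer states) for one m x n weight matrix."""
    full_rank = m * n + 2 * m * n                          # W, plus Adam's M and V
    lora = (m * n + m * r + n * r) + 2 * (m * r + n * r)   # W, B, A, plus M and V for B and A
    galore = m * n + m * r + 2 * n * r                     # W, P, plus M and V of size r x n
    return {"full_rank": full_rank, "lora": lora, "galore": galore}

counts = memory_elements(m=4096, n=4096, r=128)
for name, count in counts.items():
    print(f"{name}: {count:,} values ({count * 4 / 1e9:.2f} GB in float32)")
```

For a single matrix the frozen weight dominates both low rank methods, so the saving comes almost entirely from the optimizer states.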


**Original Adam Optimizer Algorithm**
<center><a href="https://ibb.co/JK5nkmd"><img src="https://i.ibb.co/xDJh5Xs/Screenshot-2024-06-05-at-1-32-18-PM.png" alt="Screenshot-2024-06-05-at-1-32-18-PM" border="0"></a></center>

In [2]:
# Training code for the Adam Optimizer

from torch.optim import Optimizer
import torch
import math

class Adam(Optimizer):
    def  __init__(self,params,lr = 1e-6,betas = (0.99,0.9),eps=1e-6,weight_decay=0.0,bias_correction=True):
        # weight_decay is stored but not applied in this simplified version
        defaults = dict(lr=lr,betas=betas,weight_decay=weight_decay,eps=eps,bias_correction=bias_correction)
        super().__init__(params,defaults=defaults)
    def step(self,closure=None):
        # Optionally re-evaluate the loss
        loss = closure() if closure is not None else None
        # Iterate through all the parameters
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                step_size = group['lr']
                # state is a dictionary that holds the optimizer statistics for each parameter
                if len(state) ==0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p.data)
                    state['exp_avg_sqr'] = torch.zeros_like(p.data)
                state['step'] += 1 # count this update (a step of 0 would divide by zero in the bias correction)
                exp_avg = state['exp_avg'] #first moment estimate
                exp_avg_sqr = state['exp_avg_sqr'] # second moment estimate
                beta1,beta2 = group['betas'] # get betas
                exp_avg.mul_(beta1).add_(grad,alpha=1-beta1) #update biased first moment estimate
                exp_avg_sqr.mul_(beta2).addcmul_(grad,grad,value=1-beta2) # update biased second moment estimate
                denom = exp_avg_sqr.sqrt().add_(group['eps'])

                # If there is bias correction
                if group['bias_correction']:
                    bias_corrected_first_moment = 1 - beta1 ** state['step']
                    bias_corrected_second_moment = 1 - beta2 ** state['step']
                    step_size = step_size * math.sqrt(bias_corrected_second_moment) / bias_corrected_first_moment

                # Weight update
                p.data.addcdiv_(exp_avg,denom,value=-step_size)
        return loss
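
It can help to trace a single Adam update by hand on one scalar parameter. The sketch below follows the textbook form of the algorithm shown in the figure (bias-correct both moments, then divide), which differs slightly from the cell above in where `eps` is added. The defaults used here are the ones suggested in the Adam paper, and the function name, starting value and gradient are made up for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # biased first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # biased second moment
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=5.0, m=m, v=v, t=1)
print(param)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of how large the gradient is.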

**GaLoRE+Adam Optimizer Algorithm**
<center><a href="https://ibb.co/nDD9Sj7"><img src="https://i.ibb.co/gDDNCJS/Screenshot-2024-06-05-at-4-02-38-PM.png" alt="Screenshot-2024-06-05-at-4-02-38-PM" border="0"></a></center>

In [1]:
# Training code for the GaLoRE Optimizer (Ambrose's implementation)
# Assumes every parameter is a 2D weight matrix of shape (m, n) and that rank <= min(m, n)

from torch.optim import Optimizer
import torch
import math

class GaLoREOptimizer(Optimizer):
    def  __init__(self,params,lr = 1e-6,betas = (0.99,0.9),eps=1e-6,weight_decay=0.0,rank=32,subspace_freq=10,lora_scale=1.0,bias_correction=True):
        # weight_decay is stored but not applied in this simplified version
        defaults = dict(lr=lr,betas=betas,weight_decay=weight_decay,eps=eps,bias_correction=bias_correction)
        self.subspace_freq = subspace_freq # corresponds to T, subspace change frequency
        self.lora_rank = rank # corresponds to r, the rank of the gradient projection
        self.scaling_factor = lora_scale
        super(GaLoREOptimizer,self).__init__(params,defaults=defaults)
    def step(self,closure=None):
        # Optionally re-evaluate the loss
        loss = closure() if closure is not None else None
        # Iterate through all the parameters
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                #p.data.shape (512,784)
                m = grad.shape[0] # 512
                n = grad.shape[-1] # 784
                r = self.lora_rank
                state = self.state[p]
                step_size = group['lr']
                # state is a dictionary that holds the optimizer statistics for each parameter
                if len(state) ==0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros(r,n,device=grad.device,dtype=grad.dtype)
                    state['exp_avg_sqr'] = torch.zeros(r,n,device=grad.device,dtype=grad.dtype)
                    state['projection'] = torch.zeros(m,r,device=grad.device,dtype=grad.dtype)
                # Every subspace_freq steps, recompute the projection from the current gradient
                if state['step'] % self.subspace_freq == 0:
                    U,_,_ = torch.linalg.svd(grad,full_matrices=False)
                    #                      m x r
                    state['projection'] = U[:,:r] #state['projection'].shape (512,32)
                state['step'] += 1 # count this update (a step of 0 would divide by zero in the bias correction)

                # Project the gradient matrix to low rank (compact space)
                r_t = state['projection'].T @ grad # (32,512) @ (512,784) = (32, 784) = r x n
                # Exponential moving average of **low-rank projection** of gradient values
                exp_avg = state['exp_avg'] #first moment estimate
                # Exponential moving average of the square of the **low-rank projection** of gradient values
                exp_avg_sqr = state['exp_avg_sqr'] # second moment estimate
                beta1,beta2 = group['betas'] # get betas
                exp_avg.mul_(beta1).add_(r_t,alpha=1-beta1) #update biased first moment estimate
                exp_avg_sqr.mul_(beta2).addcmul_(r_t,r_t,value=1-beta2) # update biased second moment estimate
                denom = exp_avg_sqr.sqrt().add_(group['eps']) # denom.shape (32,784)

                # If there is bias correction
                if group['bias_correction']:
                    bias_corrected_first_moment = 1 - beta1 ** state['step']
                    bias_corrected_second_moment = 1 - beta2 ** state['step']
                    step_size = step_size * math.sqrt(bias_corrected_second_moment) / bias_corrected_first_moment
                n_t = exp_avg / denom 

                # Project the low-rank gradient matrix back to the original vector space
                #                               m x r           r x n
                g_t = self.scaling_factor * state['projection'] @ n_t
                # Weight update
                p.data.add_(-step_size*g_t)
        return loss

## Weight-Decomposed Low Rank Adaptation Fine Tuning (DoRA)

**What is DoRA?**


**How does it work?**

## Q-LoRA (Quantized Fine-Tuning) 
 
**What is Q-LoRA?**

**How does it work?**

