# LoRA (Low-Rank Adaptation of Large Language Models)

https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive

https://lightning.ai/pages/community/article/lora-llm/

https://alexbaron.es/deep%20learning/jax/lora_benchmark/

Lora example for stable diffusion: https://huggingface.co/blog/lora

Paper: https://arxiv.org/abs/2106.09685

# 1 - Introduction

LoRA is a popular, lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those.

This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs) that are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speed up training.

# 2 - Background concepts

Before diving into LoRA, let's review some fundamental linear algebra concepts.

## 2.1 - Minor of a matrix

The **minor** of a particular element of a matrix is the determinant of the submatrix obtained by deleting the row and column in which that element lies.

For example, for a given matrix $A$, the minor of $a_{12}$ is the determinant of what remains after excluding the first row and the second column of the matrix.

$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13}\\
a_{21} & a_{22} & a_{23}\\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
$$

The minor of the element $a_{12}$ is as follows:

$$
M_{12} = \begin{vmatrix}
a_{21} & a_{23}\\
a_{31} & a_{33}
\end{vmatrix}
$$

Similarly, we can take the minor of every element and collect them into the matrix of minors $M$ of the given matrix $A$:

$$
M = \begin{bmatrix}
M_{11} & M_{12} & M_{13}\\
M_{21} & M_{22} & M_{23}\\
M_{31} & M_{32} & M_{33}
\end{bmatrix}
$$

### How to find the minor of a matrix

There are three simple steps to find the minor of an element of a matrix.

1. Identify and exclude the row and the column that contain the particular element.
2. Form a smaller matrix with the remaining elements.
3. Compute the determinant of that smaller matrix. This value is the minor $M_{xy}$.

In order to generate the matrix of minors $M$, you repeat these steps for each element of the original matrix $A$.

For example:

$$
M_{11} = \begin{vmatrix}
a_{22} & a_{23}\\
a_{32} & a_{33}
\end{vmatrix} = a_{22} \cdot a_{33} - a_{23} \cdot a_{32}
$$

$$
M_{23} = \begin{vmatrix}
a_{11} & a_{12}\\
a_{31} & a_{32}
\end{vmatrix} = a_{11} \cdot a_{32} - a_{12} \cdot a_{31}
$$

etc.
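The steps above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `submatrix`, `det2`, and `matrix_of_minors` are made up for this example, the code handles 3 x 3 matrices only, and the numeric matrix is an arbitrary choice.

```python
def submatrix(a, i, j):
    """Delete row i and column j (0-based indices) from matrix a."""
    return [[v for c, v in enumerate(row) if c != j]
            for r, row in enumerate(a) if r != i]


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def matrix_of_minors(a):
    """Matrix of minors of a 3x3 matrix: entry (i, j) is det(submatrix(a, i, j))."""
    return [[det2(submatrix(a, i, j)) for j in range(3)] for i in range(3)]


# Arbitrary example matrix (an assumption for illustration).
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

M = matrix_of_minors(A)
print(submatrix(A, 0, 1))  # submatrix behind M_12: [[4, 6], [7, 9]]
print(M[0][1])             # M_12 = 4*9 - 6*7 = -6
```

Note that the code uses 0-based indices, so $M_{12}$ in the text corresponds to `M[0][1]`.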

## 2.2 - Matrix rank

The rank of a matrix is the dimension of the vector space generated by its columns, which is given by the number of linearly independent columns (or rows) in a given matrix. It can be proven that the number of independent columns (known as *column rank*) is always equal to the number of independent rows (called *row rank*). Hence, for a matrix $A$ with $m$ rows and $n$ columns (represented $A_{mn}$)

$$
\text{rank}(A) \leq \text{min}(m, n)
$$

**An intuitive understanding of Matrix Rank:** For the purposes of this notebook, the rank of a matrix can be perceived as the dimension of the feature space it represents. In this context, a low-rank matrix of a certain size encapsulates fewer features (or a lower-dimensional feature space) than a full-rank matrix of the same dimensions.

### Properties

1. The rank of a matrix is constrained by the minimum of its number of rows and columns
2. The rank of the product of two matrices is constrained by the minimum of their individual ranks. Given matrices $A$ and $B$ whose product $AB$ is defined, then
    $$
    \text{rank}(AB) \leq \text{min}(\text{rank}(A), \text{rank}(B))
    $$
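To see the second property in action, here is a small sketch in plain Python. The `rank` helper (Gaussian elimination with exact fractions) and the two example matrices are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination, using exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # Look for a row at or below position r with a non-zero entry in this column.
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate the entries below the pivot.
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 0], [0, 1], [1, 1]]  # 3x2 matrix with rank 2
B = [[1, 2, 3], [2, 4, 6]]    # 2x3 matrix with rank 1
AB = matmul(A, B)             # 3x3 matrix

print(rank(A), rank(B), rank(AB))  # 2 1 1
```

Although `AB` is a 3 x 3 matrix, its rank cannot exceed the rank of the weaker factor, which is 1 here.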

### How to Find the Rank of a Matrix? 


The rank of a matrix $A$, written $\rho(A)$, can be found using three methods:

* **Minor method**
* Using echelon form
* Using normal form

Let's study how to find the rank of a matrix using the **minor method**.

* If $A$ is a square matrix, find its determinant. If $\text{det}(A) \neq 0$, then the rank of $A$ equals the order of $A$.
* If $\text{det}(A) = 0$, or if $A$ is not a square matrix, check whether any minor of the largest remaining order is non-zero. If such a minor exists, then the rank of $A$ equals the order of that minor.
* If all minors of that order are zero, repeat the previous step with minors whose order is 1 less, and continue until a non-zero minor is found.

**Example:** Find the rank $\rho(A)$ of the matrix $A = \begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6\\
7 & 8 & 9
\end{bmatrix}$

**Solution:**

A is a square matrix and so we can find its determinant.

$\text{det}(A) = 1 (45-48) - 2 (36 - 42) + 3 (32-35) = -3 + 12 - 9 = 0$

So $\rho(A) \neq$ order of the matrix. i.e., $\rho(A) \neq 3$. Now, we will see whether we can find any non-zero minor of order 2.

$M_{33} = \begin{vmatrix}
a_{11} & a_{12}\\
a_{21} & a_{22}
\end{vmatrix} = \begin{vmatrix}
1 & 2\\
4 & 5
\end{vmatrix} = 5 - 8 = -3 \neq 0$

So there exists a minor of order 2 (or 2 x 2) which is non-zero, so $\rho(A) = 2$.
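The minor method can also be written as a brute-force search, which is a handy way to double-check the worked example. The sketch below is an assumption-laden illustration: `det` and `rank_by_minors` are hypothetical helper names, and the approach is only practical for tiny matrices because the number of minors grows very quickly.

```python
from itertools import combinations


def det(m):
    """Determinant by Laplace expansion along the first row (tiny matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, v in enumerate(m[0]):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * v * det(sub)
    return total


def rank_by_minors(a):
    """Return the largest k for which some k x k minor of a is non-zero."""
    n_rows, n_cols = len(a), len(a[0])
    for k in range(min(n_rows, n_cols), 0, -1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                minor = det([[a[r][c] for c in cols] for r in rows])
                if minor != 0:
                    return k
    return 0  # only the zero matrix has no non-zero minor


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

print(det(A))             # 0, so the rank is below 3
print(rank_by_minors(A))  # 2
```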

## 2.3 - Types of matrices according to rank

### Full-rank matrix

A matrix $A_{mn}$ is called a **full-rank matrix** if $\text{rank}(A) = \text{min}(m, n)$. For example:

$$
\text{rank}(\begin{bmatrix}
1 & 2 & 3\\
1 & 1 & 0\\
3 & 0 & 4
\end{bmatrix}) = 3
$$

### Rank-deficient matrix

The opposite of a full rank matrix is a **rank-deficient matrix**, i.e. $\text{rank}(A) < \text{min}(m, n)$. For example:

$$
\text{rank}(\begin{bmatrix}
1 & 2 & 3\\
2 & 4 & 6
\end{bmatrix}) = 1
$$

**Low-rank matrix**: A rank-deficient matrix $A_{mn}$ is called a low-rank matrix if its rank is significantly lower (no fixed threshold) than the minimum number of rows and columns. Mathematically, $\text{rank}(A) \ll \text{min}(m, n)$.

## 2.4 - Rank decomposition

Rank decomposition or factorization is a powerful tool in linear algebra that involves expressing a matrix as the product of other matrices. There are various forms, each useful for different applications. [It can be proven that every (finite) matrix has a rank decomposition.](https://en.wikipedia.org/wiki/Rank_factorization#Existence)

**What it does:**

* Breaks down a matrix $A$ ($m \times n$) into a product of other matrices: $A = XY$.
* $X$ and $Y$ can take different forms depending on the specific decomposition used.
* The key idea is that $X$ and $Y$ capture different aspects of the information stored in $A$ in a more interpretable way.

**Types of decompositions:**

* **LU decomposition:** Expresses A as the product of a lower triangular matrix (L) and an upper triangular matrix (U). Useful for solving linear systems and inverting matrices.
* **QR decomposition:** Expresses A as the product of an orthogonal matrix (Q) and an upper triangular matrix (R). Useful for solving least squares problems and analyzing projections.
* **Singular Value Decomposition (SVD):** Expresses A as the product of three matrices: U (orthogonal), Σ (diagonal with singular values), and V^T (transpose of another orthogonal matrix). Provides insights into the inherent dimensionality and important features of the data encoded in A.
* **Full rank decomposition:** Expresses an $m \times n$ matrix A of rank $r$ as the product of two generally non-square matrices, X ($m \times r$) and Y ($r \times n$), where both have the same rank as A. Useful for representing low-rank approximations of A.

**Benefits:**

* Simplifies analysis and calculations related to the original matrix.
* Reveals hidden structure and relationships within the data.
* Enables compression and efficient storage of large matrices.
* Used in various applications like image processing, signal processing, machine learning, and data analysis.

**Things to remember:**

* Decompositions are not unique in general, meaning there can be multiple ways to express A as a product.
* Each type of decomposition has its own strengths and weaknesses, suited for specific problems.
* Understanding the concept and choosing the right decomposition requires knowledge of linear algebra and the problem at hand.
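As a concrete illustration of a full rank decomposition, the sketch below rebuilds the rank-1 matrix from section 2.3 from two thin factors and compares storage costs. The helper names `matmul` and `factored_size`, as well as the 1000 x 1000 sizing example, are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Rank-1 example: a 2x3 matrix written as (2x1) times (1x3).
X = [[1],
     [2]]
Y = [[1, 2, 3]]
A = matmul(X, Y)
print(A)  # [[1, 2, 3], [2, 4, 6]]


def factored_size(m, n, r):
    """Number of entries stored by an (m x r) and an (r x n) factor."""
    return r * (m + n)


# Storage comparison for a hypothetical 1000 x 1000 matrix of rank 8.
m, n, r = 1000, 1000, 8
print(m * n)                   # 1000000 entries for the full matrix
print(factored_size(m, n, r))  # 16000 entries for the two factors
```

The saving grows as $r$ shrinks relative to $m$ and $n$, which is the property LoRA relies on later.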

# 3 - LoRA: Low-Rank Adaptation of Large (Language) Models

LoRA is an efficient finetuning technique proposed by Microsoft researchers to adapt large models to specific tasks and datasets. While [the paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685) uses GPT-3 as the test case and focuses on language models and NLP tasks, this technique is quite generalizable, as we will see below. It can be applied to various models in multiple contexts.

LoRA proposes to decompose the weight changes, $\Delta W$, into a lower-rank representation. (To be technically correct, LoRA does not decompose the matrices directly, but it learns the values of the decomposed matrices via backpropagation - this is a nitpicky detail that will make sense later).

Before we take a closer look at LoRA, let's briefly explain the training procedure during regular finetuning. So, what are the weight changes $\Delta W$? Suppose $W$ represents the weight matrix in a given neural network layer. Then, using regular backpropagation, we can obtain the weight update $\Delta W$, which is typically calculated as the negative gradient of the loss $L$ with respect to $W$ times the learning rate $\alpha$:

$$ \Delta W = \alpha (-\nabla_{W} L) $$

Then, when we have $\Delta W$, we can update the original weights as follows: $W' = W + \Delta W$.

<table>
    <tr>
        <td><img src="./images_1/regular_finetuning.png" width="600"/></td>
    </tr>
</table>

Alternatively, we can keep the weight update matrix separate and compute the outputs as follows:

$$ h = Wx + \Delta Wx $$

where $x$ represents the inputs, as illustrated below:

<table>
    <tr>
        <td><img src="./images_1/regular_finetuning_alternative.png" width="400"/></td>
    </tr>
</table>
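The two views of regular finetuning give the same output, which a tiny numeric sketch makes explicit. All values below (the weights, the gradient, and the learning rate) are made up for illustration, and `matvec` is a hypothetical helper.

```python
def matvec(w, x):
    """Matrix-vector product for a matrix given as nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


W = [[1.0, 2.0],
     [3.0, 4.0]]          # toy weight matrix
grad = [[2.0, -4.0],
        [0.0, 6.0]]       # made-up gradient of the loss with respect to W
lr = 0.5                  # learning rate

# Weight update: negative gradient times the learning rate.
delta_W = [[-lr * g for g in row] for row in grad]

# View 1: fold the update into the weights, then apply them.
W_updated = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta_W)]
x = [1.0, 1.0]
h_merged = matvec(W_updated, x)

# View 2: keep the update separate and add the two outputs.
h_separate = [a + b for a, b in zip(matvec(W, x), matvec(delta_W, x))]

print(h_merged)    # [4.0, 4.0]
print(h_separate)  # [4.0, 4.0]
```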

So, when we train fully connected (i.e., "dense") layers in a neural network, the weight matrices usually have full rank, which is a technical term meaning that a matrix does not have any linearly dependent (i.e., redundant) rows or columns. In contrast to full rank, low rank means that the matrix has redundant rows or columns.

The LoRA authors point out that while the weights of a pretrained model have full rank on the pretrained tasks, there is a low intrinsic dimension when these models are adapted to a new task, i.e., we don't need to adapt all parameters to adapt the model to the new task. This means we can decompose the weight update matrix for the adapted task into lower-dimensional (smaller) matrices without losing too much important information.

For example, suppose $\Delta W$ is the weight update for an $A \times B$ weight matrix. Then, we can decompose the weight update matrix into two smaller matrices: $\Delta W = W_{A} W_{B}$, where $W_{A}$ is an $A \times r$-dimensional matrix, and $W_{B}$ is an $r \times B$-dimensional matrix. Here, we keep the original weight $W$ frozen and only train the new matrices $W_{A}$ and $W_{B}$. This, in a nutshell, is the LoRA method, which is illustrated in the figure below.

<table>
    <tr>
        <td><img src="./images_1/lora_diagram.png" width="400"/></td>
    </tr>
</table>

#### Choosing the rank

Note that $r$, in the figure above, is a hyperparameter here that we can use **to specify the rank of the low-rank matrices used for adaptation**. A smaller $r$ leads to a simpler low-rank matrix, which results in fewer parameters to learn during adaptation.

A smaller $r$ can lead to faster training and potentially reduced computational requirements. However, the capacity of the low-rank matrix to capture task-specific information decreases. This may result in lower adaptation quality, and the model might not perform as well on the new task compared to a higher $r$.

In summary, choosing $r$ in LoRA involves a trade-off between model complexity, adaptation capacity, and the risk of underfitting or overfitting. It's thus important to experiment with different $r$ values to find the right balance to achieve the desired performance on the new task.
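To get a feel for how $r$ drives the number of trainable parameters, the sketch below counts them for a single square layer. The layer size of 768 and the list of ranks are assumptions picked for illustration, and `lora_params` is a hypothetical helper.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters of the two low-rank factors (d_in x r and r x d_out)."""
    return r * (d_in + d_out)


d_in = d_out = 768           # assumed layer size
full = d_in * d_out          # parameters updated by regular finetuning
print(f"full finetuning: {full} parameters")

for r in (1, 2, 4, 8, 16, 64):
    n = lora_params(d_in, d_out, r)
    print(f"r={r:>2}: {n:>6} trainable parameters ({100 * n / full:.2f}% of full)")
```

For this layer size, $r = 8$ gives 12,288 trainable parameters against 589,824 for the full matrix, roughly 2%.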

## 3.1 - Hypothesis

Many previous works have shown that over-parametrized large models reside on a low intrinsic dimension. The main idea behind LoRA is that the change in weights during model adaptation also has a low intrinsic rank/dimension. Concretely, if $W_{nk}$ represents the weights of a single layer and $\Delta W_{nk}$ represents the change of weights during model adaptation, the authors propose that $\Delta W_{nk}$ is a low-rank matrix:

$$
\text{rank}(\Delta W_{n \times k}) \ll \text{min}(n, k)
$$

#### Why does this make sense?

Large models are trained to capture the general representation of their domain (language for LLMs, audio + language for models like Whisper, and vision for image generation models). These models capture a variety of features which allow them to be used for diverse tasks with reasonable zero-shot accuracy. 

However, **when adapting such a model to a specific task or dataset, only a few "features" need to be emphasized or re-learnt**. This means that **the updated matrix ($\Delta W_{n \times k}$) can be a low-rank matrix**.

## 3.2 - Method

LoRA constrains the rank of the update matrix $\Delta W_{nk}$ using its rank decomposition. It represents $\Delta W_{nk}$ as the product of two low-rank matrices $B_{nr}$ and $A_{rk}$, where $r \ll \text{min}(n, k)$. This implies that the forward pass of the layer, originally $Wx$, is modified to $Wx + BAx$, where $\Delta W_{nk} = BA$.
* $A$ is initialized using random Gaussian initialization.
* $B$ is initially set to 0.

Therefore, $BA = 0$ at the start of training. The update $BA$ is additionally scaled with a factor of $\alpha / r$.
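The initialization and scaling described above can be checked with a small dependency-free sketch. Everything concrete here is an assumption made for illustration: the tiny dimensions, the value of `alpha`, the standard deviation of the Gaussian, the random stand-in for the pretrained weights, and the helper names `matvec` and `lora_forward`.

```python
import random


def matvec(m, x):
    """Matrix-vector product for a matrix given as nested lists."""
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x).

    A x is evaluated first, so the full n x k product B A is never formed.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


random.seed(0)
n, k, r, alpha = 3, 4, 2, 4.0  # toy sizes and scaling (assumed values)

# Random stand-in for the frozen pretrained weights W (n x k).
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
# A (r x k) gets a random Gaussian initialization, B (n x r) starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
h0 = lora_forward(W, A, B, x, alpha, r)
print(h0 == matvec(W, x))  # True: with B = 0 the layer behaves like the original
```

Once training moves $B$ away from zero, the second term starts to contribute, and `alpha / r` controls how strongly.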

## 3.3 - Implementation

The implementation of LoRA is relatively straightforward. We can think of it as a modified forward pass for the fully connected layers in an LLM. In pseudocode, this looks as follows:

```python
import math

import torch
import torch.nn as nn

input_dim = 768  # e.g., the hidden size of the pre-trained model
output_dim = 768  # e.g., the output size of the layer
rank = 8  # the rank 'r' for the low-rank adaptation
alpha = 1.0  # scaling factor (1 is the usual default, see the note after this code)

# Frozen pretrained weight with shape input_dim x output_dim.
# A random tensor is used as a placeholder here; in practice W comes from the pretrained network.
W = torch.randn(input_dim, output_dim)

W_A = nn.Parameter(torch.empty(input_dim, rank))  # LoRA weight A (trainable)
W_B = nn.Parameter(torch.empty(rank, output_dim))  # LoRA weight B (trainable)

# Initialization of LoRA weights: A is random, B is zero, so W_A @ W_B starts at 0
nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))
nn.init.zeros_(W_B)


def regular_forward_matmul(x, W):
    h = x @ W
    return h


def lora_forward_matmul(x, W, W_A, W_B):
    h = x @ W  # regular matrix multiplication
    # Add the scaled low-rank update; (x @ W_A) @ W_B avoids building the full-size product
    h = h + (x @ W_A @ W_B) * alpha
    return h
```


In the pseudocode above, `alpha` is a scaling factor that adjusts the magnitude of the low-rank adaptation before it is added to the original model output. This balances the pretrained model's knowledge and the new task-specific adaptation. By default, `alpha` is usually set to 1. Also note that while $W_{A}$ is initialized to small random weights, $W_{B}$ is initialized to 0 so that $\Delta W = W_{A} W_{B} = 0$ at the beginning of the training, meaning we begin the training with the original weights.

## 3.4 - Benefits

**Reduction in parameter space**

* LoRA reduces the parameter space required for fine-tuning by representing the update to a layer's parameters as a product of lower-rank matrices.

**Computational efficiency during inference**

* Inference remains efficient: the low-rank product can be added to the frozen weight matrix once, so only the addition of two matrices is required, resulting in minimal computational overhead.

**Memory savings**

* By using separate lightweight matrices for each task, memory usage is optimized, allowing for efficient representation of multiple tasks without significant memory overhead. For example, having separate $W_{q}^{t}$ matrices for each task reduces memory requirements.

**Reduced gradient computations**

* With fewer parameters to fine-tune, the number of gradients to compute is significantly reduced, leading to faster optimization. For instance, in a scenario with $e = 100$, $h=100$, $r=2$, instead of tuning $e \times h = 10000$ original parameters, only $er + rh = 400$ parameters need to be optimized.
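Two of the benefits above can be checked with a short sketch: folding the low-rank update into the frozen weights leaves the output unchanged, and the parameter count shrinks as described. The toy matrices and the helper names `matmul` and `matvec` are assumptions for illustration, and the scaling factor is left out (treated as 1) to keep the arithmetic exact.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, x):
    """Matrix-vector product for a matrix given as nested lists."""
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in m]


W = [[1, 2, 0],
     [0, 1, 3]]      # frozen 2x3 weights (toy values)
B = [[1],
     [2]]            # 2x1 low-rank factor
A = [[1, 0, -1]]     # 1x3 low-rank factor
x = [1, 2, 3]

# Training-time view: two separate paths whose outputs are added.
h_two_paths = [a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Inference-time view: merge the update into the weights once, then do one product.
BA = matmul(B, A)
W_merged = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]
h_merged = matvec(W_merged, x)

print(h_two_paths, h_merged)  # [3, 7] [3, 7]

# Parameter count for the e = 100, h = 100, r = 2 scenario.
e, h, r = 100, 100, 2
print(e * h, e * r + r * h)   # 10000 400
```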

## 3.5 - Disadvantages

However, not everything is rosy. For small models, the computational overhead of the LoRA matrix multiplications might not compensate for the fact that you are computing fewer gradients (thanks to Pedro Cuenca and Younes Belkada, authors of the great blogpost explaining the method, https://huggingface.co/blog/lora, for pointing me towards this).

## 3.6 - How good is it in practice?

How good is LoRA in practice, and how does it compare to full finetuning and other parameter-efficient approaches? According to the [LoRA paper](https://arxiv.org/abs/2106.09685), models using LoRA perform slightly better than models using [Adapters](https://arxiv.org/abs/2110.07280), [prompt tuning](https://arxiv.org/abs/2104.08691), or prefix tuning across several task-specific benchmarks.

In some cases, LoRA performs even better than finetuning all layers, as shown in the annotated table from the LoRA paper below (ROUGE is a metric for evaluating summarization and translation quality, I explained it in more detail [here](https://twitter.com/rasbt/status/1639625228622917632?s=20)).

<table>
    <tr>
        <td><img src="./images_1/lora_performance.png" width="700"/></td>
    </tr>
</table>

Here, it’s worth noting that LoRA is orthogonal to the other finetuning methods, meaning **it can also be combined with prefix tuning and adapters, for example**.

# 4 - LoRA in the wild

With the recent explosion of large foundation and generative AI models, the open-source community has welcomed LoRA with open arms due to its ability to allow low-resource practitioners to adapt large models. Two major uses are "Instruct-tuning LLMs" and "Finetuning Diffusion models".

## 4.1 - Instruct-tuning LLMs

The core idea here is simple. Create a dataset of instructions and responses (either using manual curation or ChatGPT) and use LoRA to finetune a pre-trained large language model using this dataset. This method produces models that are reasonably adept at following instructions and answering questions like humans. Interested readers can check out models such as [Alpaca-LoRA](https://github.com/tloen/alpaca-lora) and [Vicuna](https://vicuna.lmsys.org/).

## 4.2 - Finetuning Diffusion models

Before the launch of ChatGPT and other LLMs like LLaMA, LoRA was primarily used to tune Stable Diffusion to adapt the style of generated images. The LoRA weights can then be used and shared in a plug-and-play fashion, switching them out when a different image generation style is necessary.

As seen before, the main draw of this technique is its parameter and compute efficiency. A testimony to the popularity of this method in the generative AI community is the existence of the [Lora Library](https://huggingface.co/lora-library) where people can share their LoRA files!