# LoRA (Low-Rank Adaptation of Large Language Models)

https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive

https://lightning.ai/pages/community/article/lora-llm/

https://alexbaron.es/deep%20learning/jax/lora_benchmark/

LoRA example for Stable Diffusion: https://huggingface.co/blog/lora

Paper: https://arxiv.org/abs/2106.09685

# 1 - Introduction

LoRA is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those.

This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs) that are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speed up training.

# 2 - Background concepts

Before diving into LoRA, let's review some fundamental linear algebra concepts.

## 2.1 - Minor of a matrix

The **minor** of a particular element of a matrix is the determinant of the submatrix obtained by deleting the row and column in which that element lies.

For example, for a given matrix $A$, the minor of $a_{12}$ is computed from the part of the matrix that remains after excluding the first row and the second column.

$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13}\\
a_{21} & a_{22} & a_{23}\\
a_{31} & a_{32} & a_{33}
\end{bmatrix}
$$

The minor of the element $a_{12}$ is as follows:

$$
M_{12} = \begin{vmatrix}
a_{21} & a_{23}\\
a_{31} & a_{33}
\end{vmatrix} = a_{21} * a_{33} - a_{23} * a_{31}
$$

Similarly, we can take the minors of the matrix and form a minor matrix $M$ of the given matrix $A$ as:

$$
M = \begin{bmatrix}
M_{11} & M_{12} & M_{13}\\
M_{21} & M_{22} & M_{23}\\
M_{31} & M_{32} & M_{33}
\end{bmatrix}
$$

### How to find the minor of a matrix

There are three simple steps to find the minor of the matrix.

1. Identify and exclude the row and the column which contains the particular element within the matrix.
2. Form a smaller matrix with the remaining elements, to represent the minor of the particular element of the matrix.
3. Find the determinant of the "minor matrix" $M_{xy}$

In order to generate the matrix of minors $M$, you repeat these steps for each element of the original matrix $A$.

For example:

$$
M_{11} = \begin{vmatrix}
a_{22} & a_{23}\\
a_{32} & a_{33}
\end{vmatrix} = a_{22} * a_{33} - a_{23} * a_{32}
$$

$$
M_{23} = \begin{vmatrix}
a_{11} & a_{12}\\
a_{31} & a_{32}
\end{vmatrix} = a_{11} * a_{32} - a_{12} * a_{31}
$$

etc.
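The three steps above can be sketched in plain Python. This is only an illustration: the helper names (`det2`, `submatrix`, `matrix_of_minors`) and the concrete $3 \times 3$ matrix with entries 1 to 9 are assumptions chosen for the example, and the code handles the $3 \times 3$ case only.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def submatrix(a, row, col):
    """Steps 1 and 2: delete the given row and column (0-based) from a."""
    return [
        [value for j, value in enumerate(r) if j != col]
        for i, r in enumerate(a)
        if i != row
    ]


def matrix_of_minors(a):
    """Step 3 for every element of a 3x3 matrix: entry (i, j) is the minor M_ij."""
    return [[det2(submatrix(a, i, j)) for j in range(3)] for i in range(3)]


# Illustrative matrix (an assumption for this sketch).
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

M = matrix_of_minors(A)
print(submatrix(A, 0, 1))  # submatrix behind M_12: [[4, 6], [7, 9]]
print(M[0][1])             # M_12 = 4*9 - 6*7 = -6
print(M)                   # full matrix of minors
```

Note that Python indices are 0-based, so `M[0][1]` corresponds to $M_{12}$.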

## 2.2 - Matrix rank

The rank of a matrix is the dimension of the vector space generated by its columns, which is given by the number of linearly independent columns (or rows) in a given matrix. It can be proven that the number of independent columns (known as *column rank*) is always equal to the number of independent rows (called *row rank*). Hence, for a matrix $A$ with $m$ rows and $n$ columns (represented $A_{mn}$)

$$
\text{rank}(A) \leq \text{min}(m, n)
$$

**An intuitive understanding of matrix rank:** For the purposes of this notebook, the rank of a matrix can be perceived as the dimension of the feature space it represents. In this context, a low-rank matrix of a certain size encapsulates fewer features (or a lower-dimensional feature space) than a full-rank matrix of the same dimensions.

### Properties

1. The rank of a matrix is constrained by the minimum of its number of rows and columns
2. The rank of the product of two matrices is constrained by the minimum of their individual ranks. Given matrices $A$ and $B$ (with shapes such that $AB$ is defined) with $\text{rank}(A) = r_A$ and $\text{rank}(B) = r_B$, then
    $$
    \text{rank}(AB) \leq \text{min}(r_A, r_B)
    $$
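
The second property can be checked numerically on a small case. The sketch below is an assumption-laden illustration rather than a reference implementation: it computes the rank by reducing to row echelon form with exact fractions, and the matrices `A` and `B` and the helper names `rank` and `matmul` are made up for the example.

```python
from fractions import Fraction


def rank(a):
    """Rank via Gaussian elimination to row echelon form, using exact fractions."""
    rows = [[Fraction(v) for v in row] for row in a]
    n_rows, n_cols = len(rows), len(rows[0])
    pivot_row = 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a non-zero entry in this column.
        pivot = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return pivot_row  # number of pivots found


def matmul(x, y):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2, rank 2
B = [[1, 2, 3], [2, 4, 6]]    # 2 x 3, rank 1
AB = matmul(A, B)             # 3 x 3

print(rank(A), rank(B), rank(AB))  # 2 1 1
```

Even though `AB` is a $3 \times 3$ matrix, its rank cannot exceed the smaller of the two factor ranks.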

### How to find the rank of a matrix

The rank of a matrix $\rho(A)$ can be found using three methods:

* **Minor method**
* Using echelon form
* Using normal form

Let's study how to find the rank of a matrix using the **minor method**.

* If $A$ is a square matrix, find its determinant. If $\text{det}(A) \neq 0$, then the rank of $A$ equals the order of $A$.
* If $\text{det}(A) = 0$, or if $A$ is not a square matrix, check whether any minor of the maximum possible order is non-zero. If such a minor exists, then the rank of $A$ equals the order of that minor.
* If all the minors of that order are zero, repeat the previous step with minors of order one less, and continue until a non-zero minor is found.

**Example:** Find the rank of the matrix $\rho(A)$, where $A = \begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6\\
7 & 8 & 9
\end{bmatrix}$

**Solution:**

A is a square matrix and so we can find its determinant.

$\text{det}(A) = 1 (45-48) - 2 (36 - 42) + 3 (32-35) = -3 + 12 - 9 = 0$

So $\rho(A) \neq$ order of the matrix. i.e., $\rho(A) \neq 3$. Now, we will see whether we can find any non-zero minor of order 2.

$M_{33} = \begin{vmatrix}
a_{11} & a_{12}\\
a_{21} & a_{22}
\end{vmatrix} = 1 * 5 - 2 * 4 = 5 - 8 = -3 \neq 0$

So there exists a minor of order 2 (or 2 x 2) which is non-zero, so $\rho(A) = 2$.
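
The minor method can be sketched as a brute-force search in plain Python. The function names `det` and `rank_by_minors` are assumptions for this illustration. The search is exact for integer entries but grows very quickly with matrix size, so it is only meant for small examples like the ones in these notes.

```python
from itertools import combinations


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(sub)
    return total


def rank_by_minors(a):
    """Largest k such that some k x k minor of a is non-zero."""
    n_rows, n_cols = len(a), len(a[0])
    # Start from the maximum possible order and work downwards.
    for k in range(min(n_rows, n_cols), 0, -1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[a[i][j] for j in cols] for i in rows]
                if det(sub) != 0:
                    return k
    return 0  # every entry is zero


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(det(A))             # 0, so the rank is less than 3
print(rank_by_minors(A))  # 2
```

With floating-point entries, the `!= 0` test would need to be replaced by a tolerance check.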

## 2.3 - Types of matrices according to rank

### Full-rank matrix

A matrix $A_{mn}$ is called a **full-rank matrix** if $\text{rank}(A) = \text{min}(m, n)$. For example:

$$
\text{rank}(\begin{bmatrix}
1 & 2 & 3\\
1 & 1 & 0\\
3 & 0 & 4
\end{bmatrix}) = 3
$$

### Rank-deficient matrix

The opposite of a full rank matrix is a **rank-deficient matrix**, i.e. $\text{rank}(A) < \text{min}(m, n)$. For example:

$$
\text{rank}(\begin{bmatrix}
1 & 2 & 3\\
2 & 4 & 6
\end{bmatrix}) = 1
$$

**Low-rank matrix**: A rank-deficient matrix $A_{mn}$ is called a low-rank matrix if its rank is significantly lower (no fixed threshold) than the minimum number of rows and columns. Mathematically, $\text{rank}(A) \ll \text{min}(m, n)$.

## 2.4 - Rank decomposition

Rank decomposition or factorization is a powerful tool in linear algebra that involves expressing a matrix as the product of other matrices. There are various forms, each useful for different applications. [It can be proven that every (finite) matrix has a rank decomposition.](https://en.wikipedia.org/wiki/Rank_factorization#Existence)

**What it does:**

* Breaks down a matrix $A$ (m x n) into a product of other matrices: $A = X * Y$.
* $X$ and $Y$ can take different forms depending on the specific decomposition used.
* The key idea is that $X$ and $Y$ capture different aspects of the information stored in $A$ in a more interpretable way.

**Types of decompositions:**

* **LU decomposition:** Expresses A as the product of a lower triangular matrix (L) and an upper triangular matrix (U). Useful for solving linear systems and inverting matrices.
* **QR decomposition:** Expresses A as the product of an orthogonal matrix (Q) and an upper triangular matrix (R). Useful for solving least squares problems and analyzing projections.
* **Singular Value Decomposition (SVD):** Expresses A as the product of three matrices: U (orthogonal), Σ (diagonal with singular values), and V^T (transpose of another orthogonal matrix). Provides insights into the inherent dimensionality and important features of the data encoded in A.
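* **Full rank decomposition (rank factorization):** Expresses A (m x n, with rank r) as the product of two generally non-square matrices, X (m x r) and Y (r x n), both of which have the same rank r as A. Useful for representing low-rank matrices and low-rank approximations of A compactly.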

**Benefits:**

* Simplifies analysis and calculations related to the original matrix.
* Reveals hidden structure and relationships within the data.
* Enables compression and efficient storage of large matrices.
* Used in various applications like image processing, signal processing, machine learning, and data analysis.

**Things to remember:**

* Decompositions are not unique in general, meaning there can be multiple ways to express A as a product.
* Each type of decomposition has its own strengths and weaknesses, suited for specific problems.
* Understanding the concept and choosing the right decomposition requires knowledge of linear algebra and the problem at hand.
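
As a small illustration of a full rank decomposition, the rank-deficient example from section 2.3 can be written as the product of a $2 \times 1$ and a $1 \times 3$ matrix. The sketch below uses plain Python lists. The factors `X`, `Y`, `X2`, `Y2` and the helper `matmul` are assumptions picked by hand for this example, not the output of a general algorithm.

```python
def matmul(x, y):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


# Rank-1 matrix from section 2.3 (m = 2 rows, n = 3 columns).
A = [[1, 2, 3], [2, 4, 6]]

# One possible factorization: X is m x r and Y is r x n, with r = rank(A) = 1.
X = [[1], [2]]
Y = [[1, 2, 3]]
print(matmul(X, Y) == A)  # True

# The factorization is not unique: rescaling X and Y in opposite
# directions gives another valid pair.
X2 = [[2], [4]]
Y2 = [[0.5, 1.0, 1.5]]
print(matmul(X2, Y2) == A)  # True

# Storage comparison: m * n entries for A versus r * (m + n) for the factors.
m, n, r = 2, 3, 1
print(m * n, r * (m + n))  # 6 5
```

The saving is tiny here, but it grows quickly when $r$ is much smaller than both $m$ and $n$.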

# 3 - LoRA: Low-Rank Adaptation of Large (Language) Models

LoRA is an efficient finetuning technique proposed by Microsoft researchers to adapt large models to specific tasks and datasets. While [the paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685) uses GPT-3 as the test case and focuses on language models and NLP tasks, this technique is quite generalizable, as we will see below. It can be applied to various models in multiple contexts.

## 3.1 - Hypothesis

Many previous works have shown that over-parametrized large models reside on a low intrinsic dimension. The main idea behind LoRA is that the change in weights during model adaptation also has a low intrinsic rank/dimension. Concretely, if $W_{nk}$ represents the weights of a single layer and $\Delta W_{nk}$ represents the change of weights during model adaptation, the authors propose that $\Delta W_{nk}$ is a low-rank matrix:

$$
\text{rank}(\Delta W_{n \times k}) \ll \text{min}(n, k)
$$

----

#### Why does this make sense?

Large models are trained to capture the general representation of their domain (language for LLMs, audio + language for models like Whisper, and vision for image generation models). These models capture a variety of features which allow them to be used for diverse tasks with reasonable zero-shot accuracy. 

However, **when adapting such a model to a specific task or dataset, only a few "features" need to be emphasized or re-learnt**. This means that **the updated matrix ($\Delta W_{n \times k}$) can be a low-rank matrix**.

----

LoRA arises from the idea that large models are overparametrized, so the parameters of any layer can be expressed as a combination drawn from a lower-rank parameter space. Take, for example, an attention layer, and in particular the query weight matrix $W_{q}$ of dimensions $e \times h$ (embedding dimension by hidden dimension).

Instead of fine-tuning the $W_{q}$ matrix itself, LoRA proposes to freeze that matrix and fine-tune two new matrices of dimensions $e \times r$ and $r \times h$, with $r$ being a low rank value (e.g. 2).

Then, at inference time, the computation cost is the same as for a fully fine-tuned model: the product of the two small matrices can be added to $W_{q}$ once, so only the single merged matrix needs to be kept in memory.

Another interesting point is savings in memory. You can have your big model with $W_{q}$ and then have one $W_{q}^{t}$ for each one of your tasks, so each task is represented by a separate, addable, lightweight matrix.

You also have to compute far fewer gradients. For $e = 100$, $h = 100$, $r = 2$, the original matrix has $e \times h = 10000$ parameters, whereas you would be tuning just $er + rh = 400$ parameters.
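
The parameter count and the merge at inference time can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the tiny sizes ($e = 3$, $h = 3$, $r = 1$), the made-up integer weights, and the names `A`, `B`, `matmul`, `add` and `lora_trainable_params`. The sketch also leaves out details from the paper such as the scaling factor applied to the update.

```python
def matmul(x, y):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def lora_trainable_params(e, h, r):
    """Number of trainable parameters in the two factors (e x r and r x h)."""
    return e * r + r * h


print(lora_trainable_params(100, 100, 2))  # 400, versus 100 * 100 = 10000 for W_q

# Tiny made-up example with e = 3, h = 3, r = 1.
W_q = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen pretrained weights, e x h
A = [[1], [0], [2]]                      # trainable factor, e x r
B = [[2, 1, 0]]                          # trainable factor, r x h
x = [[1, 2, 3]]                          # one input row vector, 1 x e

# During training: W_q stays frozen and the low-rank path is added on top.
y_unmerged = add(matmul(x, W_q), matmul(matmul(x, A), B))

# For inference: fold the update into the weights once, then use one matmul.
W_merged = add(W_q, matmul(A, B))
y_merged = matmul(x, W_merged)

print(y_unmerged, y_merged)  # [[24, 9, 5]] [[24, 9, 5]]
```

Because the merged matrix has the same shape as $W_{q}$, a separate pair of small factors can be stored per task and merged into the base weights only when that task is served.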

## 3.2 - Method

## 3.3 - Benefits

## 3.4 - Disadvantages

However, not everything is rosy. For small models, the computational overhead of the LoRA matrix multiplications might not compensate for the fact that you are computing fewer gradients (thanks to Pedro Cuenca and Younes Belkada, authors of the great blog post explaining the method, https://huggingface.co/blog/lora, who pointed me towards this).