# Explanation

All state-of-the-art language models since BERT have depended on pre-training on internet-scale datasets to acquire the majority of their knowledge. Because this pre-training does not target any specific task, the fine-tuning stage is critical to making these models useful in production.

The most effective assistant models, like the InstructGPT series, also depend heavily on fine-tuning to make the models behave like assistants and be helpful and aligned.

Because fine-tuning is so central, methods that reduce its compute and memory cost are valuable.

Initially, fine-tuning was performed by training the base model on a new dataset and updating all of its parameters, which is quite expensive for large models.

[Adapters](https://arxiv.org/abs/1902.00751) introduced a far more efficient approach to fine-tuning where new "adapter" layers were added inside each block of the transformer, and then only these layers were trained during fine-tuning.

As a result, a much smaller number of parameters needed to be trained, making the fine-tuning stage much more efficient.

However, by nature of introducing new parameters and non-linearities into the model, adapters can cause unpredictable shifts in the model behavior (as the original architecture's representation spaces were meant for the architecture without adapters).

Because of this, the ideal fine-tuning method would be both _parameter efficient_ and also introduce _no new layers or parameters_ into the final model, so it couldn't hurt the existing quality of the model.

**LoRA** is the solution to this challenge. It introduces a fine-tuning approach that's both parameter efficient, and fully compatible with existing architectures.

### Intuition

Instead of adding new layers to the network, LoRA works by learning an update to the weight matrices that are already there.

If we did this the standard way (by updating all the parameters during training), it would not achieve any efficiency over standard fine-tuning. But LoRA achieves fine-tuning of all the parameters in the network without training all the parameters directly.

Instead, LoRA trains a lower-dimensional representation of the change to the parameters in the network. This means that it has fewer parameters to train, but also fewer degrees of freedom to train on, so it can't make changes as large or as varied as the pre-training stage can.

Empirically, it turns out that this fine-tuning method achieves high quality results despite the simplification, which suggests that the nature of changes made to models during the fine-tuning stage doesn't require modification of all the parameters.

### Math

For any pre-trained weight matrix $W_0$, LoRA works by computing a small change $\Delta W$ to that weight matrix, so that the effective fine-tuned weight matrix is $W_0 + \Delta W$. For an input $x$, the layer's output $h$ becomes:

$$h = W_0x + \Delta Wx = W_0x + BAx$$

During fine-tuning, we fix the values of the original weight matrix $W_0$ and only modify the parameters of the $\Delta W$ matrix. This preserves the original pre-trained model, which is convenient during inference, where we may want to switch between different fine-tunes of the same base model.

Instead of modifying the parameters of $\Delta W$ directly, which would result in training the same number of parameters as the original network, we instead train parameters from a low-rank decomposition of $\Delta W$ via $BA$, which has a far smaller number of parameters to train.

Importantly, the total number of parameters in $B$ and $A$ combined is much smaller than the number of parameters in $\Delta W$.

This allows us to train a far smaller number of parameters, which is parameter efficient, while still being able to fine-tune the whole model.
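
To make the saving concrete, here is a small sketch that counts trainable parameters for a dense update versus its low-rank factorization. It assumes $\Delta W$ has shape $d \times k$, with $B$ of shape $d \times r$ and $A$ of shape $r \times k$. The function names and the example sizes are illustrative assumptions, not values from the paper.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense update Delta W of shape (d, k)."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


# Illustrative sizes only (assumed, not taken from the paper).
d, k, r = 4096, 4096, 8
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(f"dense update: {full:,} params")
print(f"low-rank update: {lora:,} params ({full // lora}x fewer)")
```

The low-rank count grows linearly in $d$ and $k$, while the dense count grows with their product, so the gap widens as the layers get larger.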

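To make sure that the fine-tuned matrix starts off identical to the original pre-trained matrix, we initialize $B$ to zero and $A$ to small random values (the paper uses a random Gaussian), so that $BA = 0$ at the start of training, and then let them improve from there. Initializing both matrices to zero would not work: the gradient of each one is proportional to the other, so neither would ever move.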
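
A small NumPy sketch of the initialization described in the LoRA paper (Gaussian $A$, zero $B$), showing that the layer's output is unchanged before any training happens. The sizes, the seed, and the scale used for $A$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2  # toy sizes, chosen only for illustration

W0 = rng.normal(size=(d, k))             # stand-in for a frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # random Gaussian init (scale is an assumption)
B = np.zeros((d, r))                     # zero init, so B @ A is exactly zero

x = rng.normal(size=(k,))
h_base = W0 @ x                # output of the original layer
h_lora = W0 @ x + B @ (A @ x)  # output with the untrained LoRA branch added

print(np.allclose(h_base, h_lora))
```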

This has an interesting implication - it appears that the modifications made to a model during fine-tuning are fundamentally "low-rank" modifications to the parameters - in other words, the weights change along only a small number of directions rather than in every dimension.

The LoRA paper also explores how this update appears to amplify abilities that the model already learned during pre-training but that the pre-trained weights do not emphasize.



# My Notes

## 📜 [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685)

> As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.

> We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each
> layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

Seems to be building on adapters to make fine-tuning large models like GPT-3 feasible at scale.

> We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA.

> More importantly, these methods [of fine-tuning models by extending model depth or reducing the model’s usable sequence] often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

> We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.

> LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen.

LoRA lets the same base model be used for different tasks, makes training more efficient, introduces no inference latency, and can be combined with many previous optimization methods like prefix-tuning.

### Problem Statement

> One of the main drawbacks for full fine-tuning is that for _each_ downstream task, we learn a _different_ set of parameters $\Delta \Phi$ whose dimension $|\Delta \Phi|$ equals $|\Phi_0|$. Thus, if the pre-trained model is large, storing and deploying many independent instances of fine-tuned models can be challenging, if at all feasible.

> In the subsequent sections, we propose to use a low-rank representation to encode $\Delta \Phi$ that is both compute- and memory-efficient.

### Aren’t Existing Solutions Good Enough?

> There is no direct ways to bypass the extra compute in adapter layers.

> We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameter.

### Our Method

**1. Low-Rank-Parameterized Update Matrices**

> For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.

> For $h = W_0x$, our modified forward pass yields:

$$
h = W_0x + \Delta Wx = W_0x + BAx
$$

Instead of optimizing a completely new matrix $\Delta W$ with the same $d \times k$ shape as the original matrix $W_0$, we can factor it as a low-rank decomposition $\Delta W = BA$, where the dimensions of $B$ and $A$ are $d \times r$ and $r \times k$ respectively. The product still has dimension $d \times k$, but we only need to optimize $r(d + k)$ parameters instead of $dk$. In the square case $d = k$, that is $2rd$ parameters instead of $d^2$, which is a massive reduction when $r \ll d$.
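
A quick NumPy check of the shapes involved, as a sketch with assumed toy sizes for the square case: the product $BA$ has the full $d \times d$ shape, but its rank can never exceed $r$, and the two factors hold far fewer numbers than the product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # toy sizes (assumed)

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
delta_W = B @ A  # same shape as a d x d weight matrix

print(delta_W.shape)                   # full-size update matrix
print(np.linalg.matrix_rank(delta_W))  # never more than r
print(B.size + A.size, delta_W.size)   # 2*r*d trained values vs d*d
```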

> LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full-rank during adaptation.

> In other words, as we increase the number of trainable parameters, training LoRA roughly converges to training the original model.

> When deployed in production, we can explicitly compute and store $W = W_0 + BA$.

> When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then adding a different $B'A'$, a quick operation with very little memory overhead.

> Critically, this guarantees that we do not introduce any additional latency during inference compared to a fine-tuned model by construction.
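
The merge and task-switch steps quoted above can be sketched in a few lines of NumPy. This is an illustration under assumed toy sizes, with random matrices standing in for a pre-trained weight and two hypothetical task adapters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2  # toy sizes (assumed)

W0 = rng.normal(size=(d, k))  # frozen pre-trained weight
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # hypothetical adapter, task 1
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # hypothetical adapter, task 2
x = rng.normal(size=(k,))

# Merge task 1: inference is a single matmul, the same cost as the base layer.
W = W0 + B1 @ A1
assert np.allclose(W @ x, W0 @ x + B1 @ (A1 @ x))

# Switch to task 2: subtract the old product, add the new one.
W = W - B1 @ A1 + B2 @ A2
assert np.allclose(W, W0 + B2 @ A2)
print("merged and switched without storing a second copy of W0")
```

Only the small $B$ and $A$ factors need to be stored per task; the merged matrix is rebuilt from them on demand.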

**2. Applying LoRA to Transformer**

> In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters.

> We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency.

> The most significant benefit comes from the reduction in memory and storage usage.
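
As a rough back-of-the-envelope check on the storage saving, the sketch below counts LoRA parameters when only the query and value projections are adapted. The configuration (96 layers, hidden size 12288) is the commonly cited GPT-3 175B shape and is an assumption here, as is storing each parameter in 2 bytes (FP16). The helper name is hypothetical.

```python
def lora_param_count(n_layers: int, d_model: int, r: int, n_adapted: int) -> int:
    """Trainable parameters when `n_adapted` square (d_model x d_model)
    matrices per layer each get a rank-r pair B (d x r) and A (r x d)."""
    return n_layers * n_adapted * 2 * d_model * r


# Assumed GPT-3 175B-like shape; adapting W_q and W_v only.
n_layers, d_model = 96, 12288
for r in (1, 4, 8):
    n = lora_param_count(n_layers, d_model, r, n_adapted=2)
    print(f"r={r}: {n:,} trainable params, ~{n * 2 / 1e6:.1f} MB in FP16")
```

Under these assumptions, a per-task checkpoint is a few tens of megabytes, compared with hundreds of gigabytes for a full copy of a model this size.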

### Understanding the Low-Rank Updates

> (1) Given a parameter budget constraint, which subset of weight matrices in a pre-trained Transformer should we adapt to maximize downstream performance?
> (2) Is the “optimal” adaptation matrix $\Delta W$ _really rank-deficient?_ If so, what is a good rank to use in practice?
> (3) What is the connection between $\Delta W$ and W? Does $\Delta W$ highly correlate with W? How large is $\Delta W$ comparing to W?

**1. Which Weight Matrices in Transformer Should We Apply LoRA To?**

![Screenshot 2024-05-16 at 1.55.14 PM.png](../../images/Screenshot_2024-05-16_at_1.55.14_PM.png)

**2. What is the Optimal Rank $r$ for LoRA?**

![Screenshot 2024-05-16 at 1.56.08 PM.png](../../images/Screenshot_2024-05-16_at_1.56.08_PM.png)

> We argue that increasing r does not cover a more meaningful subspace, which suggests that a low-rank adaptation matrix is sufficient.
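
The comparison behind this claim is a normalized overlap between the top singular directions of two learned matrices. A minimal sketch, assuming the measure is $\phi = \lVert U_i^\top U_j \rVert_F^2 / \min(i, j)$ over orthonormal bases. The function name is illustrative, and random matrices stand in for trained $A$ matrices, so the second number only shows what unrelated subspaces look like.

```python
import numpy as np


def subspace_similarity(A1, A2, i, j):
    """Overlap in [0, 1] between the top-i and top-j right singular
    directions of A1 and A2 (1 = same subspace, 0 = orthogonal)."""
    _, _, Vt1 = np.linalg.svd(A1, full_matrices=False)
    _, _, Vt2 = np.linalg.svd(A2, full_matrices=False)
    U1, U2 = Vt1[:i].T, Vt2[:j].T  # columns are orthonormal directions
    return np.linalg.norm(U1.T @ U2, "fro") ** 2 / min(i, j)


rng = np.random.default_rng(0)
A_small = rng.normal(size=(8, 128))   # stand-in for the A of a rank-8 adapter
A_large = rng.normal(size=(64, 128))  # stand-in for the A of a rank-64 adapter

print(subspace_similarity(A_small, A_small, 4, 4))  # identical subspaces
print(subspace_similarity(A_small, A_large, 4, 4))  # unrelated random subspaces
```

A high overlap between the top directions of a small-rank and a large-rank adapter is what would indicate that the extra rank is not capturing a more meaningful subspace.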

**3. How Does the Adaptation Matrix $\Delta W$ Compare to $W$?**

> This suggests that the low-rank adaptation matrix potentially _amplifies the important features for specific downstream tasks that were learned but not emphasized in the general pre-training model._

This is the core intuition behind why LoRA works. It’s an attempt at understanding why the transformation done by fine-tuning is inherently low rank.

In practice, in the paper's SVD-based comparison of $\Delta W$ and $W$, it appears that LoRA's effect is to amplify directions that are present in $W$ but not emphasized there, strengthening existing representations that are particularly relevant to specific tasks.
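
One way to quantify this is sketched below, under the assumption that the comparison projects $W$ onto the top-$r$ singular directions of $\Delta W$ and compares Frobenius norms. Random matrices stand in for real weights, so the printed numbers only demonstrate the computation. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4  # toy sizes (assumed)

W = rng.normal(size=(d, k))                                  # stand-in for pre-trained W
delta_W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))  # stand-in for a learned BA

# Top-r left and right singular directions of the update.
U, _, Vt = np.linalg.svd(delta_W, full_matrices=False)
U_r, V_r = U[:, :r], Vt[:r].T

# How much of W already lies in the subspace that delta_W writes to.
W_in_subspace = np.linalg.norm(U_r.T @ W @ V_r, "fro")

# Ratio of the update's norm to W's norm inside that same subspace.
amplification = np.linalg.norm(delta_W, "fro") / W_in_subspace
print(W_in_subspace, np.linalg.norm(W, "fro"), amplification)
```

A small projected norm together with a large ratio would mean the update boosts directions that $W$ barely uses, which is the pattern described above.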

### Conclusion

> We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters.

> While we focused on Transformer language models, the proposed principles are generally applicable to any neural networks with dense layers.

This finding is potentially valuable for fine-tuning and transfer learning with neural networks in general, not only for language models.


# Implementation

I've attempted to create a toy example here just to see how LoRA works in code (it's not plugged into an actual model, but I mainly want to demonstrate what the decomposition actually looks like in practice in terms of trainable parameters).

The example below is my attempt at a super simple version of some of the modules in the [official LoRA implementation](https://github.com/microsoft/LoRA/tree/main).

In [None]:
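import torch.nn as nn
import torch.nn.functional as F

# This simple class just stores some metadata about any layer that uses LoRA
class LoRA():
    def __init__(self, r, merge_weights):
        # r specifies the rank of this adaptation - if r is set to 0, then LoRA won't do anything
        self.r = r
        # Tracks whether BA is currently folded into the frozen weight matrix
        self.merged = False
        # If True, BA is merged into the weights whenever the layer is put in eval mode
        self.merge_weights = merge_weights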

class Linear(nn.Linear, LoRA):
    # LoRA implemented in a dense layer
    def __init__(self, in_dim, out_dim, r=0, merge_weights=True):
        nn.Linear.__init__(self, in_dim, out_dim)
        LoRA.__init__(self, r, merge_weights)

        # Assuming LoRA is active, we create the trainable B and A matrices
        if r > 0:
            # Importantly, the inner dimension of the matrices is determined by the rank "r".
            # With in_dim and out_dim as the outer dimensions, B @ A has the same
            # (out_dim, in_dim) shape as the original weight matrix
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_dim)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_dim, r)))
            # A gets small random values while B stays at zero, so B @ A starts at zero
            # but gradients can still flow (the std here is an arbitrary choice)
            nn.init.normal_(self.lora_A, std=0.02)
            # Freezing the pre-trained weight matrix, and we only update A and B during training
            self.weight.requires_grad = False


    def train(self, mode=True):
        nn.Linear.train(self, mode)
        if mode:
            # During training mode, we make sure that the weights are not merged into the
            # original weight matrix
            if self.merge_weights and self.merged:
                if self.r > 0:
                    # detach() keeps this bookkeeping step out of the autograd graph
                    self.weight.data -= (self.lora_B @ self.lora_A).detach()
                self.merged = False
        else:
            # During evaluation, we merge BA into W to make the fine-tuned weight matrix
            if self.merge_weights and not self.merged:
                if self.r > 0:
                    self.weight.data += (self.lora_B @ self.lora_A).detach()
                self.merged = True
        # nn.Module.train returns the module, so we do the same (e.g. for layer.eval())
        return self

    def forward(self, x):
        if self.r > 0 and not self.merged:
            # During feed-forward, we first compute the original result of weights W
            result = F.linear(x, self.weight, bias=self.bias)
            # And then we add on the small change induced by LoRA: x @ A^T @ B^T,
            # which applies BA to each row of x and has shape (..., out_dim)
            result += x @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)
            return result
        else:
            return F.linear(x, self.weight, bias=self.bias)