LoRA is a low-rank technique for fine-tuning models. To follow it, it helps to understand what rank means for a matrix and what intrinsic dimensionality is.

In the context of linear algebra and LoRA (Low-Rank Adaptation), the concept of **rank** plays a crucial role in understanding how model parameters are optimized and adapted efficiently. Here's a breakdown:

### Rank in Linear Algebra:
- **Rank** refers to the number of linearly independent rows or columns in a matrix. In simpler terms, it indicates the dimensionality of the space spanned by the matrix's rows or columns.
- For a matrix \( A \in \mathbb{R}^{m \times n} \), the rank is the maximum number of linearly independent vectors you can extract from either the rows or the columns of \( A \). If all rows (or columns) are linearly independent, the rank is \( \min(m, n) \); otherwise, it's lower.
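
As a small illustration of the definition above, the sketch below computes the rank of two toy matrices with NumPy's `np.linalg.matrix_rank`. The matrices are made up for this example; the first has a row that is the sum of the other two, so its rank falls below \( \min(m, n) \).

```python
import numpy as np

# The third row is the sum of the first two, so only two rows are independent.
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
    [1.0, 1.0, 5.0],
])
rank_A = np.linalg.matrix_rank(A)

# The identity matrix has fully independent rows, so its rank is min(m, n) = 3.
I3 = np.eye(3)
rank_I = np.linalg.matrix_rank(I3)

print(rank_A, rank_I)  # 2 3
```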

### Rank in LoRA (Low-Rank Adaptation):
LoRA is a technique used in fine-tuning large language models (LLMs) where the idea is to adapt model weights using **low-rank matrices**. Here's how rank applies in LoRA:

1. **Low-Rank Matrix Decomposition**:
   - In large models, the full parameter matrices (like weight matrices in transformers) are very large and dense, making fine-tuning all parameters computationally expensive.
   - LoRA uses a technique where instead of updating the entire weight matrix \( W \), it approximates the change in \( W \) using the product of two smaller matrices, \( A \in \mathbb{R}^{m \times r} \) and \( B \in \mathbb{R}^{r \times n} \), where \( r \) is much smaller than \( m \) and \( n \) (with \( r \) representing the rank). This is called **low-rank decomposition**.
   
2. **Why Low-Rank?**:
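   - The LoRA paper hypothesizes that the weight updates learned during adaptation have a low "intrinsic rank": although the update to \( W \) has \( m \times n \) entries, it lies close to a much lower-dimensional subspace of the full parameter space. This builds on earlier work on intrinsic dimensionality, which found that pretrained language models can be fine-tuned effectively by optimizing only a small number of parameters in a low-dimensional reparameterization. LoRA exploits this by using low-rank matrices \( A \) and \( B \) to model these updates, significantly reducing the number of parameters that need to be fine-tuned.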
   - As a result, LoRA introduces a low-rank structure to reduce the computational burden without sacrificing much model performance.

3. **Application in LLM Fine-tuning**:
   - Instead of fine-tuning the entire weight matrix \( W \), LoRA keeps \( W \) frozen and trains only the smaller matrices \( A \) and \( B \). The adapted weight is \( W + \Delta W \), where \( \Delta W = AB \).
   - The **rank** \( r \) controls the complexity and expressiveness of the adaptation. The update has \( r(m + n) \) trainable parameters instead of \( mn \), so a higher rank allows more expressive updates at the cost of more parameters and computation, while a lower rank is cheaper but more constrained.
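
The decomposition described in the list above can be sketched numerically. This is a toy illustration, not LoRA's training procedure: the sizes `m`, `n`, `r` and the random factors are assumptions chosen only to show that the product of an \( m \times r \) and an \( r \times n \) matrix is a full-size update whose rank is at most \( r \), stored with far fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # illustrative sizes; r plays the role of the LoRA rank

A = rng.standard_normal((m, r))  # thin factor, m x r
B = rng.standard_normal((r, n))  # thin factor, r x n
delta_W = A @ B                  # full m x n update built from the two factors

full_params = m * n              # parameters in a dense update
lora_params = m * r + r * n      # parameters in the two factors
update_rank = np.linalg.matrix_rank(delta_W)

print(delta_W.shape, full_params, lora_params, update_rank)
```

With these sizes the dense update has 3072 entries while the two factors hold 448, and the rank of `delta_W` cannot exceed `r`.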

In summary, **rank** in linear algebra refers to the number of linearly independent vectors in a matrix, and in LoRA, this concept is used to simplify and optimize the fine-tuning of large language models by using low-rank matrix approximations for weight updates.

In [4]:
# https://pytorch.org/torchtune/stable/tutorials/lora_finetune.html 
# https://arxiv.org/pdf/2106.09685 (LoRA: Low-Rank Adaptation of Large Language Models)
# https://arxiv.org/pdf/2012.13255 (INTRINSIC DIMENSIONALITY)

from torchtune.models.llama2 import llama2_7b, lora_llama2_7b

In [5]:
# Build Llama2 without any LoRA layers
base_model = llama2_7b()

# The default settings for lora_llama2_7b will match those for llama2_7b
# We just need to define which layers we want LoRA applied to.
# Within each self-attention, we can choose from ["q_proj", "k_proj", "v_proj", and "output_proj"].
# We can also set apply_lora_to_mlp=True or apply_lora_to_output=True to apply LoRA to other linear
# layers outside of the self-attention.
lora_model = lora_llama2_7b(lora_attn_modules=["q_proj", "v_proj"])

In [6]:

print(base_model.layers[0].attn)

MultiHeadAttention(
  (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (output_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (pos_embeddings): RotaryPositionalEmbeddings()
)


In [7]:
print(lora_model.layers[0].attn)

MultiHeadAttention(
  (q_proj): LoRALinear(
    (dropout): Identity()
    (lora_a): Linear(in_features=4096, out_features=8, bias=False)
    (lora_b): Linear(in_features=8, out_features=4096, bias=False)
  )
  (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (v_proj): LoRALinear(
    (dropout): Identity()
    (lora_a): Linear(in_features=4096, out_features=8, bias=False)
    (lora_b): Linear(in_features=8, out_features=4096, bias=False)
  )
  (output_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (pos_embeddings): RotaryPositionalEmbeddings()
)
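
To make the printed structure more concrete, here is a minimal NumPy sketch of what a LoRA-wrapped linear layer computes. It is not torchtune's implementation: the toy sizes, the `alpha` scaling factor, and the helper `lora_forward` are hypothetical, and the initialization (small random `lora_a`, zero `lora_b`) follows the scheme described in the LoRA paper. Inputs are treated as row vectors, so `lora_a` maps the input down to `rank` dimensions and `lora_b` maps it back up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4  # toy sizes; the printout above shows 4096, 4096, 8
alpha = 8.0                    # hypothetical scaling factor, not shown in the printout

W = rng.standard_normal((d_in, d_out))              # frozen base weight
lora_a = 0.01 * rng.standard_normal((d_in, rank))   # trainable, small random init
lora_b = np.zeros((rank, d_out))                    # trainable, zero init

def lora_forward(x):
    """Base projection plus the scaled low-rank correction."""
    base = x @ W
    update = (x @ lora_a) @ lora_b
    return base + (alpha / rank) * update

x = rng.standard_normal((2, d_in))
y = lora_forward(x)

# Because lora_b starts at zero, the adapted layer initially matches the base layer.
same_as_base = np.allclose(y, x @ W)
print(y.shape, same_as_base)
```

Starting `lora_b` at zero means the model's outputs are unchanged at the start of fine-tuning, and only the two small factors need gradients.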


In [8]:
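# Assuming that base_model already has the pretrained Llama2 weights,
# this will directly load them into your LoRA model without any conversion necessary.
# strict=False is needed because the LoRA weights do not exist in the base model's
# state dict; they are reported as missing_keys in the output below.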
lora_model.load_state_dict(base_model.state_dict(), strict=False)

_IncompatibleKeys(missing_keys=['layers.0.attn.q_proj.lora_a.weight', 'layers.0.attn.q_proj.lora_b.weight', 'layers.0.attn.v_proj.lora_a.weight', 'layers.0.attn.v_proj.lora_b.weight', 'layers.1.attn.q_proj.lora_a.weight', 'layers.1.attn.q_proj.lora_b.weight', 'layers.1.attn.v_proj.lora_a.weight', 'layers.1.attn.v_proj.lora_b.weight', 'layers.2.attn.q_proj.lora_a.weight', 'layers.2.attn.q_proj.lora_b.weight', 'layers.2.attn.v_proj.lora_a.weight', 'layers.2.attn.v_proj.lora_b.weight', 'layers.3.attn.q_proj.lora_a.weight', 'layers.3.attn.q_proj.lora_b.weight', 'layers.3.attn.v_proj.lora_a.weight', 'layers.3.attn.v_proj.lora_b.weight', 'layers.4.attn.q_proj.lora_a.weight', 'layers.4.attn.q_proj.lora_b.weight', 'layers.4.attn.v_proj.lora_a.weight', 'layers.4.attn.v_proj.lora_b.weight', 'layers.5.attn.q_proj.lora_a.weight', 'layers.5.attn.q_proj.lora_b.weight', 'layers.5.attn.v_proj.lora_a.weight', 'layers.5.attn.v_proj.lora_b.weight', 'layers.6.attn.q_proj.lora_a.weight', 'layers.6.attn.q_p

In [11]:
from torchtune.modules.peft._utils import get_adapter_params, set_trainable_params

# Collect the parameters that belong to the LoRA adapters (lora_a and lora_b weights)
lora_params = get_adapter_params(lora_model)

# Mark only the adapter parameters as trainable; all other parameters are frozen
set_trainable_params(lora_model, lora_params)

# Count total and trainable parameters
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(
  f"""
  {total_params} total params,
  {trainable_params} trainable params,
  {(100.0 * trainable_params / total_params):.2f}% of all params are trainable.
  """
)


  6742609920 total params,
  4194304 trainable params,
  0.06% of all params are trainable.
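
The trainable count printed above can be checked by hand. The sketch below redoes the arithmetic using the dimensions from the earlier printout (4096-wide projections, rank 8, LoRA on `q_proj` and `v_proj`). The layer count of 32 is an assumption based on the standard Llama2 7B architecture; it does not appear in the notebook output.

```python
# Dimensions read from the printed LoRALinear modules.
hidden_dim = 4096
rank = 8
adapted_modules_per_layer = 2  # q_proj and v_proj

# Assumption: Llama2 7B has 32 transformer layers (not shown in the printout).
num_layers = 32

# Each adapted projection adds lora_a (4096 x 8) and lora_b (8 x 4096).
params_per_module = hidden_dim * rank + rank * hidden_dim
trainable = num_layers * adapted_modules_per_layer * params_per_module

total = 6742609920  # total parameter count printed above
percent = 100.0 * trainable / total

print(trainable)          # 4194304
print(f"{percent:.2f}%")  # 0.06%
```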
  
