#### **What is LoRA?**

LoRA is a technique that fine-tunes large language models by freezing the original weights and learning small low-rank matrices instead. This makes training faster, cheaper, and memory-efficient.

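Full fine-tuning updates every weight in the model, which is expensive at this scale. LoRA avoids that by freezing the base model and training only two small matrices (A and B) for each selected layer.
This can cut the number of trainable parameters by 100–1000× or more, depending on the rank and on which layers are adapted.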

For a pretrained weight matrix $$ W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}} $$, LoRA learns an additive update:

$$
W' = W + \Delta W
$$

LoRA constrains the update to be a low-rank product:

$$
\Delta W = B A
$$

Where:

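$$
A \in \mathbb{R}^{r \times d_{\text{in}}} \quad \text{(down-projection)}
$$

$$
B \in \mathbb{R}^{d_{\text{out}} \times r} \quad \text{(up-projection)}
$$

with rank $$ r \ll \min(d_{\text{in}}, d_{\text{out}}) $$.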
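To make the savings concrete, the sketch below counts trainable parameters for a single layer. The function name `lora_param_counts` and the example sizes (a 4096 × 4096 projection with r = 8) are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # every entry of W is trainable
    lora = r * d_in + d_out * r  # entries of A (r x d_in) plus B (d_out x r)
    return full, lora


# Illustrative sizes only (assumed, not taken from a particular model).
full, lora = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r=8):       {lora:,} parameters")
print(f"reduction:        {full / lora:.0f}x")
```

For this layer the adapter trains 65,536 parameters instead of 16,777,216, a 256× reduction. The ratio grows as the layer gets wider or the rank gets smaller.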

#### **Are A and B randomly initialized?**

**A** and **B** are initialized as:

- A: random (Gaussian)  
- B: zeros

This ensures:

$$
\Delta W = B A = 0
$$

So at the beginning, LoRA does **not change the model at all**.

The model slowly learns useful updates inside A and B.
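The effect of this initialization can be checked numerically. The following is a minimal NumPy sketch; the matrix sizes and the standard deviation of 0.01 for A are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 10, 2  # small illustrative sizes (assumed)

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # A: small random Gaussian (std assumed)
B = np.zeros((d_out, r))                    # B: zeros

delta_W = B @ A                             # shape (d_out, d_in), all zeros
x = rng.normal(size=d_in)

base_out = W @ x                            # output of the original layer
lora_out = (W + delta_W) @ x                # output with the untrained adapter

print("delta_W is all zeros:", not delta_W.any())
print("outputs match:", np.allclose(base_out, lora_out))
```

Because B is zero, the product BA is zero no matter what A contains, so the adapted layer starts out identical to the base layer.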

🟧 **3. How LoRA Learns**

During fine-tuning:

- The base weight \( W \) stays **frozen**
- Only \( A \) and \( B \) receive gradients
- The rank \( r \) controls the learning capacity

$$
\text{Larger } r \;\Rightarrow\; \text{learns more patterns}
$$

$$
\text{Smaller } r \;\Rightarrow\; \text{cheaper and less risk of overfitting}
$$
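The training behavior above can be sketched with hand-written gradients. This is a toy example under stated assumptions: a single input and target, a squared-error loss, plain gradient descent, and illustrative sizes and learning rate. Real fine-tuning would use an autodiff framework and a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, lr = 8, 16, 4, 0.01  # illustrative sizes and step size (assumed)

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)      # trainable, random init
B = np.zeros((d_out, r))                            # trainable, zero init
W_before = W.copy()

x = rng.normal(size=d_in)        # one toy input
target = rng.normal(size=d_out)  # one toy target


def loss_and_grads(A, B):
    """Squared-error loss and its gradients with respect to A and B only."""
    u = A @ x                        # adapter hidden vector, shape (r,)
    err = W @ x + B @ u - target     # residual, shape (d_out,)
    loss = 0.5 * float(err @ err)
    grad_B = np.outer(err, u)        # dL/dB, shape (d_out, r)
    grad_A = np.outer(B.T @ err, x)  # dL/dA, shape (r, d_in)
    return loss, grad_A, grad_B


initial_loss, _, _ = loss_and_grads(A, B)
for _ in range(10):
    _, grad_A, grad_B = loss_and_grads(A, B)
    A = A - lr * grad_A  # only A and B are updated
    B = B - lr * grad_B
    # W is never touched: no gradient is computed for it
final_loss, _, _ = loss_and_grads(A, B)

print("loss before:", initial_loss)
print("loss after: ", final_loss)
print("W unchanged:", np.array_equal(W, W_before))
```

One detail worth noticing: because B starts at zero, the first gradient for A is also zero. B moves first, and A begins to receive a useful signal once B has left zero.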

###### When applied to GPT-3 175B, LoRA reduced trainable parameters by 10,000x and GPU memory requirements by 3x compared to full fine-tuning.

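🟦 **6. When to Merge?**

Merging folds the learned update into the base weights, so the model stores $$ W' = W + \Delta W $$ directly and no longer needs separate A and B matrices.

**Merge when you want:**
- faster inference (no extra adapter computation per layer)  
- to export the model for deployment  
- to avoid carrying separate adapter files  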

**Don't merge if:**
- you want to keep multiple adapters for different tasks  
- you want composable LoRA modules ("stacking" adapters)
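As a rough illustration of what merging does, the sketch below compares an unmerged layer (base path plus adapter path) with a merged one. The sizes are assumed, the "trained" A and B are random stand-ins, and `scale` stands in for the α/r factor described in the hyperparameter section.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 10, 2  # illustrative sizes (assumed)
scale = 2.0                # stands in for alpha / r

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))  # random stand-ins for trained values
x = rng.normal(size=d_in)


def forward_unmerged(x):
    """Base path plus a separate adapter path (two extra small matmuls)."""
    return W @ x + scale * (B @ (A @ x))


W_merged = W + scale * (B @ A)  # fold the adapter into the weight once


def forward_merged(x):
    """Single matmul, same cost as the original layer."""
    return W_merged @ x


print("same output:", np.allclose(forward_unmerged(x), forward_merged(x)))
```

Both paths give the same output, so merging is purely a deployment choice. Keeping W and (A, B) separate is what makes it possible to swap adapters on top of one shared base model.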

🟫 **7. Where Is LoRA Applied?**

Typically applied to:

- Attention **query/key/value** projections  
- Sometimes **feed-forward** layers  
- Not usually applied to **embeddings** or **layer norms**
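In practice, the target layers are usually picked by name. The sketch below filters a list of module names; the names (`q_proj`, `k_proj`, `v_proj`, and so on) and the helper `select_lora_targets` are hypothetical and follow one common naming style, so a real model may use different names.

```python
# Hypothetical module names in a common "q_proj / k_proj / v_proj" style;
# real models may use different names, so inspect your own model first.
module_names = [
    "embed_tokens",
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]

ATTENTION_TARGETS = ("q_proj", "k_proj", "v_proj")


def select_lora_targets(names, targets=ATTENTION_TARGETS):
    """Keep modules whose last name component matches one of the targets."""
    return [n for n in names if n.split(".")[-1] in targets]


selected = select_lora_targets(module_names)
for name in selected:
    print("adapt:", name)
```

Passing a longer `targets` tuple (for example, adding the feed-forward names) widens the set of adapted layers; embeddings and layer norms are left out simply by not listing them.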


🟦 **LoRA Hyperparameters: α (alpha) and Dropout**

LoRA has three important hyperparameters:

- **rank (r)**: the inner dimension shared by A and B  
- **alpha (α)**: scaling factor applied to the update  
- **dropout**: regularization to prevent overfitting  

You already understand rank.  
Now let's break down **alpha** and **dropout**.

---

## 🟩 1. What Is LoRA Alpha?

LoRA alpha controls how strong the LoRA update is.

LoRA produces:

$$
\Delta W = BA
$$

But before adding it to the model, LoRA scales it:

$$
W' = W + \frac{\alpha}{r} (BA)
$$

So **alpha is a multiplier** on the LoRA update.
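A small NumPy sketch of the scaled forward pass, assuming illustrative sizes and random stand-ins for trained A and B. It checks that doubling alpha doubles the adapter's contribution to the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 10, 4  # illustrative sizes (assumed)

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))  # random stand-ins for trained values
x = rng.normal(size=d_in)


def lora_forward(x, alpha):
    """Compute W'x = Wx + (alpha / r) * B(Ax) without forming W' explicitly."""
    return W @ x + (alpha / r) * (B @ (A @ x))


base = W @ x
delta_16 = lora_forward(x, alpha=16) - base  # adapter contribution, alpha = 16
delta_32 = lora_forward(x, alpha=32) - base  # adapter contribution, alpha = 32

print("doubling alpha doubles the update:", np.allclose(delta_32, 2 * delta_16))
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` avoids building the full d_out × d_in update matrix during training.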

### ✔️ Intuition for Alpha

- Higher α → **stronger adaptation**  
- Lower α → **softer adaptation**  

Since LoRA uses a small rank (e.g., \( r = 4 \) or \( r = 8 \)),  
**alpha compensates for the small size.**

Typical values:

- alpha = 16, 32, 64 

**Rule of thumb:**

- α ≈ 2r as a common starting point
- α up to 8r for harder tasks that need stronger adaptation

### 📌 Why divide by r?

To keep the update scale stable.

- Larger \( r \) → \( BA \) sums more rank-one terms → the update tends to become stronger  
- Dividing by \( r \) normalizes it

This keeps the update magnitude roughly comparable across ranks, so α and the learning rate need less retuning when \( r \) changes.
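The arithmetic is simple enough to tabulate. The helper `effective_scale` below is a hypothetical name for the α/r multiplier; the ranks shown are illustrative.

```python
def effective_scale(alpha: float, r: int) -> float:
    """Multiplier applied to BA before it is added to W."""
    return alpha / r


# Fixed alpha: the multiplier shrinks as the rank grows.
for r in (4, 8, 16, 32):
    print(f"alpha=16, r={r:>2} -> scale {effective_scale(16, r)}")

# alpha = 2r rule of thumb: the multiplier stays at 2 for every rank.
for r in (4, 8, 16, 32):
    print(f"alpha={2 * r:>2}, r={r:>2} -> scale {effective_scale(2 * r, r)}")
```

With alpha held at 16, the multiplier goes 4.0, 2.0, 1.0, 0.5 as r doubles. Tying alpha to r (for example alpha = 2r) instead keeps the multiplier constant, which is one way to read the rule of thumb above.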

---

## 🟧 2. What Is LoRA Dropout?

LoRA dropout randomly zeroes out **inputs to the LoRA adapter** during training.

It does **not** affect:

- the base model \( W \)  
- inference-time behavior  

It only affects training of \( A \) and \( B \), preventing overfitting.

Example:

- `dropout = 0.1` → each element of the adapter's input is zeroed with probability 0.1 at every training step.
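A minimal NumPy sketch of where the dropout mask sits. It assumes "inverted" dropout (surviving elements are rescaled by 1/(1 - p), as common deep learning frameworks do), illustrative sizes, and random stand-ins for trained A and B; `scale` stands in for α/r.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, p = 6, 10_000, 4, 0.1  # illustrative sizes (assumed)
scale = 2.0                            # stands in for alpha / r

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)
B = rng.normal(size=(d_out, r))  # random stand-ins for trained values
x = rng.normal(size=d_in)


def dropout(x, p, training):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x, np.ones(x.shape, dtype=bool)
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p), keep


def lora_forward(x, training):
    x_adapter, keep = dropout(x, p, training)  # only the adapter input is masked
    base = W @ x                               # base path always sees the full x
    return base + scale * (B @ (A @ x_adapter)), keep


train_out, keep = lora_forward(x, training=True)
eval_out, _ = lora_forward(x, training=False)
no_dropout_out = W @ x + scale * (B @ (A @ x))

print("fraction of adapter inputs dropped:", 1.0 - keep.mean())
print("eval output equals no-dropout output:", np.allclose(eval_out, no_dropout_out))
```

The base path `W @ x` never sees the mask, and with `training=False` the mask is skipped entirely, which matches the two "does not affect" points above.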

### ✔️ Why Use LoRA Dropout?

Even though LoRA trains **very few parameters**, it can still overfit quickly, especially with:

- small datasets  
- easy tasks  
- high learning rates  
- large alpha  

Dropout improves generalization.

---

## 🟩 Recommended Values

| Setting | When to Use |
|--------|--------------|
| **dropout = 0.0** | Large dataset (100k+), stable training |
| **dropout = 0.05–0.1** | Small datasets (<10k) |
| **dropout = 0.15–0.3** | Very small datasets (<1k), strong regularization |

---

## 🟪 Summary Table

| Hyperparameter | Purpose | What happens when increased? |
|----------------|---------|------------------------------|
| **rank (r)** | adapter capacity | more expressive, more params |
| **alpha (α)** | strength of LoRA update | stronger effect on base model |
| **dropout** | regularization | less overfitting, but slower learning |
