🧭 Our Roadmap to Mastery

1. What is Fine-Tuning and Why It's Expensive

2. PEFT - Parameter-Efficient Fine-Tuning

3. LoRA - Low-Rank Adaptation

  - Deep Dive into LoRA

    - Linear Layers & Matrix Decomposition

    - The core idea of LoRA

    - Implementation walkthrough

    - Pros & Limitations

4. Comparison: LoRA vs Other PEFT methods

5. Advanced: QLoRA, AdaLoRA, LoRA with Transformers

6. Hands-On: Fine-tuning a HuggingFace model using LoRA
___

## 🔹 1. What is Fine-Tuning and Why It's Expensive?

🧠 What is Fine-Tuning?

Fine-tuning is a process where you take a pre-trained model (like BERT, GPT, or ViT) and adjust its parameters slightly so that it performs well on a specific downstream task — such as classification, QA, or summarization.

- Pre-trained models are trained on large, generic datasets (e.g., Wikipedia, Common Crawl).

- Fine-tuning adapts the model to a smaller, task-specific dataset (e.g., product classification, sentiment analysis).

🧪 Think of it like customizing a car — the base car is great, but you add features for your specific needs (GPS, baby seat, etc.).

🔍 Example:

- Pretrained BERT → Fine-tune on your product description classification task.

- Pretrained ViT → Fine-tune on your fashion image categorization.

⚙️ What Happens Under the Hood?

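All model parameters (sometimes hundreds of millions) are updated through backpropagation using your task-specific data. Training therefore has to hold a gradient for every parameter, plus optimizer state (Adam keeps two extra values per parameter), so memory use is several times the size of the model itself.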

🤔 Real-World Problem

You want to adapt an LLM like GPT or BERT to 50 different tasks, but:

- You don’t have enough GPU resources.

- You don’t want to retrain the whole model each time.

- You want to reuse the base model and store only small "adapters" for each task.

🔄 This is where PEFT comes in!

Rather than updating all parameters, PEFT updates only a small number of parameters (often <1%) and freezes the rest.
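
To make the storage side of this concrete, here is a back-of-the-envelope sketch for the 50-task scenario above. Every number in it is an assumption chosen for illustration: a 110M-parameter base model (roughly BERT-base sized), 4 bytes per parameter, and adapters that hold 0.5% of the base parameters.

```python
# Back-of-the-envelope storage comparison for serving many tasks.
# All numbers below are illustrative assumptions, not measurements.
BASE_PARAMS = 110_000_000      # roughly BERT-base sized (assumed)
BYTES_PER_PARAM = 4            # fp32 storage (assumed)
NUM_TASKS = 50
ADAPTER_FRACTION = 0.005       # adapters hold 0.5% of the base parameters (assumed)


def gigabytes(num_params: float) -> float:
    """Storage for num_params parameters, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


# Full fine-tuning: one complete model copy per task.
full_ft_gb = NUM_TASKS * gigabytes(BASE_PARAMS)

# PEFT: one shared frozen base model plus one small adapter per task.
adapter_params = BASE_PARAMS * ADAPTER_FRACTION
peft_gb = gigabytes(BASE_PARAMS) + NUM_TASKS * gigabytes(adapter_params)

print(f"Full fine-tuning: {full_ft_gb:.2f} GB")
print(f"PEFT:             {peft_gb:.2f} GB")
```

Under these assumptions the shared base model dominates the PEFT total, and each extra task adds only a small adapter.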
___

## 🔹 2. PEFT – Parameter-Efficient Fine-Tuning

🧠 What is PEFT?

PEFT (Parameter-Efficient Fine-Tuning) is a set of techniques that adapts large pre-trained models for new tasks by only training a small subset of parameters.

🔧 Instead of fine-tuning the entire engine, you’re just tweaking a few components to fit your task.

🚀 Why PEFT?

    Traditional Fine-Tuning          | PEFT
    ------------------------------------------------------------------
    Updates all parameters           | Updates only a small subset
    High memory + compute cost       | Low memory + fast training
    Large model copy per task        | Small “adapter” per task
    Difficult multi-task scaling     | Easy multi-task deployment

🛠️ Core Idea Behind PEFT

- Freeze the original model weights

- Add/train only a few lightweight modules

- This allows reuse of the base model, and you just plug in the task-specific pieces.

📚 Types of PEFT Methods
Let’s look at some of the most common and impactful PEFT approaches:

    Method                        | Core Idea
    ------------------------------------------------------------------------------------
    Adapter Tuning                | Add small bottleneck layers between transformer layers
    LoRA (Low-Rank Adaptation)    | Inject low-rank matrices into weight updates
    Prefix Tuning                 | Prepend learnable prefix vectors to the attention keys/values in each layer
    Prompt Tuning                 | Prepend learnable soft tokens to the input embeddings; the model itself is unchanged
    BitFit                        | Only fine-tune bias terms
    IA³ / HyperLoRA               | Learn small scaling vectors that rescale intermediate activations (IA³)

🧾 TL;DR of PEFT

- Motivation: Avoid retraining massive models.

- Strategy: Freeze the main model → Add small trainable modules.

- Result: Huge savings in time, compute, and storage — without sacrificing much performance.
___

## 🔹 3. LoRA – Low-Rank Adaptation

🧠 What Problem Does LoRA Solve?

Let’s recall:

In transformers, linear layers (the attention projections $W_q$, $W_k$, $W_v$, $W_o$ and the feed-forward layers) hold most of the parameters, millions per layer.
In standard fine-tuning, we update all of them.

💡 LoRA’s idea:

Don't update the full weight matrix. Instead, inject a low-rank decomposition to simulate the change in weights — saving parameters, compute, and memory.

🧮 Step-by-Step: LoRA Math Intuition
Let’s break it down:

🧱 1. Standard Linear Layer
A transformer layer has a linear transformation:

$$ y = Wx $$

where:
- $W \in \mathbb{R}^{d_{out} \times d_{in}}$
- $x \in \mathbb{R}^{d_{in}}$

🧱 2. In Full Fine-Tuning:

You update $W$ directly → too many parameters.

💡 3. In LoRA:

Keep $W$ frozen, and add a low-rank adapter:

$$W' = W + \Delta W \quad \text{where} \quad \Delta W = BA$$

- $A \in \mathbb{R}^{r \times d_{in}}$

- $B \in \mathbb{R}^{d_{out} \times r}$

- $r \ll \min(d_{in}, d_{out})$ → Low-rank matrices!

Then:

$$y = (W + BA)x = Wx + BAx$$

🔑 Only train $A$ and $B$, keep $W$ frozen.
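
The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training implementation. Two details are assumptions that do not appear in the formulas above: the update is scaled by `alpha / r` (the convention behind the `lora_alpha` setting), and $B$ starts at zero so the adapted layer initially behaves exactly like the frozen one. The sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16      # illustrative sizes; alpha is an assumed scaling hyperparameter

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so that BA = 0 at the start
scaling = alpha / r                          # assumed scaling convention


def lora_forward(x, W, A, B, scaling):
    """y = Wx + scaling * B(Ax); the full d_out x d_in matrix BA is never built."""
    return W @ x + scaling * (B @ (A @ x))


x = rng.normal(size=d_in)

# At initialization B is zero, so the adapted layer matches the frozen layer.
y_init = lora_forward(x, W, A, B, scaling)
starts_at_base = np.allclose(y_init, W @ x)

# With a non-zero B (standing in for a trained adapter), the two-step product
# gives the same result as applying the merged weight W + scaling * BA.
B_trained = rng.normal(scale=0.01, size=(d_out, r))
y_lora = lora_forward(x, W, A, B_trained, scaling)
y_merged = (W + scaling * (B_trained @ A)) @ x
matches_merged = np.allclose(y_lora, y_merged)

print(starts_at_base, matches_merged)
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra cost proportional to $r$, which is the point of the low-rank form.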

🔢 How Efficient Is This?

If $r = 8$ and $W$ is 768×768 (589,824 parameters, about 590K):

- Fine-tuning $W$ → 589,824 parameters

- LoRA ($B$ and $A$) → 8 × 768 + 768 × 8 = 12,288 parameters

That's 48× fewer trainable parameters for this layer!
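
The same arithmetic can be wrapped in a small helper to see how the saving changes with the rank. The function name `lora_param_counts` is made up for this sketch.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d_out x d_in linear layer."""
    full = d_out * d_in            # every entry of W is trainable
    lora = r * d_in + d_out * r    # A is r x d_in, B is d_out x r
    return full, lora


for r in (1, 4, 8, 16, 64):
    full, lora = lora_param_counts(768, 768, r)
    print(f"r={r:>2}: full={full:,}  lora={lora:,}  ratio={full / lora:.0f}x")
```

The LoRA count grows linearly with $r$, so doubling the rank halves the saving.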

⚙️ LoRA Injection Points

LoRA is typically applied to query and value projections in transformer blocks:

- $W_q, W_v$ → fine-tuned with LoRA

- Everything else (e.g., $W_k$, $W_o$, FFN) → frozen

This keeps the memory needed for gradients and optimizer state small and makes training fast.

✅ Benefits of LoRA

Benefit | Description
--------|-----------------------------------------
Efficiency | Tiny number of trainable parameters
Speed | Fast training with fewer gradients
Storage | Only store LoRA adapters (small files)
Plug-n-Play | Easily load/unload adapters for tasks
Modularity | Multiple tasks can reuse the same base model
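
The plug-and-play and modularity rows can be illustrated with plain matrices. In this sketch, two hypothetical task adapters share one base weight. An adapter is folded into the weight for inference and removed again to recover the base. The task names and sizes are invented for illustration. The PEFT library offers comparable functionality (for example `merge_and_unload()`), but this sketch does not use it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                               # small illustrative sizes
W_base = rng.normal(size=(d, d))           # shared frozen base weight

# Two hypothetical task adapters, each stored as a (B, A) pair.
adapters = {
    "sentiment": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "topic": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}


def merge(W, adapter):
    """Fold the low-rank update into the weight: W + BA."""
    B, A = adapter
    return W + B @ A


def unmerge(W_merged, adapter):
    """Remove a previously merged adapter to recover the base weight."""
    B, A = adapter
    return W_merged - B @ A


W_sentiment = merge(W_base, adapters["sentiment"])         # serve task 1
W_restored = unmerge(W_sentiment, adapters["sentiment"])   # back to the base
W_topic = merge(W_restored, adapters["topic"])             # serve task 2

restored_ok = np.allclose(W_restored, W_base)
stored_per_task = sum(m.size for m in adapters["sentiment"])
print(restored_ok, W_sentiment.shape, stored_per_task, W_base.size)
```

The merged weight has the same shape as the base weight, so a merged adapter adds no extra parameters at inference time, while only the small `(B, A)` pair has to be stored per task.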

⚠️ Any Trade-offs?

- Slight loss in performance for some tasks if 𝑟 is too low.

- Requires framework support (like 🤗 Transformers or PEFT lib).

## 🔹 4. LoRA vs Other PEFT Methods
(📊 When to Use What?)

To choose the right PEFT technique, you need to understand:

- How each one works

- Their strengths and limitations

- Their use cases

Let’s break them down clearly.

🧠 Summary of Common PEFT Techniques

| PEFT Method        | Core Idea                                  | Param Count   | Key Use Cases             |
| ------------------ | ------------------------------------------ | ------------- | ------------------------- |
| **LoRA**           | Inject low-rank updates into linear layers | 🔸 Small      | NLP, CV, LLMs             |
| **Adapter Tuning** | Add bottleneck layers between layers       | 🔸 Small–Med  | Multilingual, NLP         |
| **Prefix Tuning**  | Add learnable prefix vectors at each layer | 🔹 Very Small | Generation tasks          |
| **Prompt Tuning**  | Train soft tokens prepended to input       | 🔹 Very Small | Prompt-based tasks        |
| **BitFit**         | Train only bias terms                      | 🔹 Smallest   | Simple tasks, diagnostics |

🔍 Let’s Compare Them One by One:

✅ LoRA

**How it works:** Decomposes updates into low-rank matrices → applied to linear layers.

**Pros:**

- Highly efficient for transformers

- Doesn’t change architecture

- Works well with attention-heavy models

**Cons:**

- Slight complexity in implementation

Best for: LLMs, ViTs, when attention modules dominate.

✅ Adapter Tuning

**How it works:** Inserts small trainable MLP layers (bottlenecks) between frozen layers.

**Pros:**

- Strong performance in multilingual and multi-task learning

- Good generalization

**Cons:**

- Adds more parameters than LoRA

- Slightly larger memory footprint

Best for: NLP, multilingual tasks, earlier transformer models (like BERT).
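
A bottleneck adapter of this kind can be sketched as a down-projection, a non-linearity, an up-projection, and a residual connection. The sizes, the ReLU activation, the missing bias terms, and the zero initialization of the up-projection are all simplifying assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 768, 64          # illustrative sizes (assumed)

W_down = rng.normal(scale=0.02, size=(bottleneck, d_model))  # trainable down-projection
W_up = np.zeros((d_model, bottleneck))                       # trainable up-projection, zero init (assumed)


def adapter(h, W_down, W_up):
    """Bottleneck adapter with a residual connection: h + W_up @ relu(W_down @ h)."""
    z = np.maximum(W_down @ h, 0.0)    # project down, apply ReLU
    return h + W_up @ z                # project back up and add the skip connection


h = rng.normal(size=d_model)           # hidden state coming out of a frozen sub-layer
out = adapter(h, W_down, W_up)

adapter_params = W_down.size + W_up.size
lora_r8_params = 8 * d_model + d_model * 8   # LoRA with r=8 on one 768x768 matrix, for comparison
print(adapter_params, lora_r8_params)
```

With a zero up-projection the adapter starts as the identity function. Unlike a LoRA update, the non-linearity means the adapter cannot be folded into an existing weight matrix, so it remains an extra step at inference time.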

✅ Prefix Tuning

**How it works:** Adds trainable vectors (prefixes) to each transformer block, typically prepended to the keys and values of its attention layer.

**Pros:**

- Ultra-lightweight

- Keeps model architecture unchanged

**Cons:**

- Best for generation tasks only (e.g., summarization, translation)

- Weaker for classification

Best for: GPT-style autoregressive models.
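
One common way to realize prefix tuning, assumed here, is to prepend trainable vectors to the keys and values of each attention layer while the rest of the model stays frozen. The sketch below shows a single attention head with illustrative sizes. It is a simplification, not a full implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_prefix(Q, K, V, P_k, P_v):
    """Single-head attention with trainable prefixes prepended to keys and values."""
    K_ext = np.concatenate([P_k, K], axis=0)       # (prefix_len + seq_len, d)
    V_ext = np.concatenate([P_v, V], axis=0)       # (prefix_len + seq_len, d)
    scores = Q @ K_ext.T / np.sqrt(Q.shape[-1])    # (seq_len, prefix_len + seq_len)
    return softmax(scores) @ V_ext                 # (seq_len, d)


rng = np.random.default_rng(0)
seq_len, prefix_len, d = 5, 3, 16                  # illustrative sizes
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))   # produced by frozen projections
P_k = rng.normal(size=(prefix_len, d))             # trainable prefix keys
P_v = rng.normal(size=(prefix_len, d))             # trainable prefix values

out = attention_with_prefix(Q, K, V, P_k, P_v)
print(out.shape)
```

The output keeps the original sequence length. The prefixes only change what each token can attend to.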

✅ Prompt Tuning

**How it works:** Adds a set of soft (learnable) tokens to the input prompt.

**Pros:**

- Lightweight

- Minimal parameter footprint

**Cons:**

- Needs good initialization

- Can be task-specific

Best for: Simple prompt-driven tasks, often in few-shot (low-data) setups.
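
As a rough sketch of the mechanism, soft tokens are extra embedding vectors placed in front of the real token embeddings. The vocabulary size, embedding width, and token ids below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 32           # illustrative sizes (assumed)
num_soft_tokens = 4

embedding_table = rng.normal(size=(vocab_size, d_model))    # frozen
soft_prompt = rng.normal(size=(num_soft_tokens, d_model))   # the only trainable tensor


def embed_with_soft_prompt(input_ids, embedding_table, soft_prompt):
    """Look up the token embeddings and prepend the learnable soft tokens."""
    token_embeds = embedding_table[input_ids]                # (seq_len, d_model)
    return np.concatenate([soft_prompt, token_embeds], axis=0)


input_ids = np.array([12, 7, 431])       # made-up token ids
model_input = embed_with_soft_prompt(input_ids, embedding_table, soft_prompt)
trainable_params = soft_prompt.size
print(model_input.shape, trainable_params)
```

The trainable parameter count depends only on the number of soft tokens and the embedding width, not on the depth of the model.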

✅ BitFit

**How it works:** Trains only the bias terms in the model; all weight matrices stay frozen.

**Pros:**

- Dead simple

- Tiny memory cost

**Cons:**

- Often underperforms on complex tasks

Best for: Diagnostics, toy experiments.
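
The BitFit selection rule is simple enough to show on a toy parameter listing. The parameter names and shapes below are hypothetical and stand in for one transformer layer. In a real PyTorch model the same rule would be applied by setting `requires_grad` per parameter.

```python
# Hypothetical parameter names and shapes for one transformer layer (illustration only).
param_shapes = {
    "attention.query.weight": (768, 768),
    "attention.query.bias": (768,),
    "attention.value.weight": (768, 768),
    "attention.value.bias": (768,),
    "ffn.dense.weight": (3072, 768),
    "ffn.dense.bias": (3072,),
}


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


# BitFit rule: a parameter is trainable only if it is a bias term.
trainable = {name: name.endswith(".bias") for name in param_shapes}

trainable_count = sum(numel(s) for n, s in param_shapes.items() if trainable[n])
total_count = sum(numel(s) for s in param_shapes.values())
print(f"trainable: {trainable_count:,} / {total_count:,} "
      f"({100 * trainable_count / total_count:.2f}%)")
```

Even in this small example the biases are a tiny fraction of the layer, which is why BitFit is so cheap and also why it has limited capacity.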

| Situation                         | Is LoRA a good fit?             |
| --------------------------------- | ------------------------------- |
| LLMs or ViT-based models?         | ✅ Yes                           |
| Memory/Compute constrained?       | ✅ Yes                           |
| Inference performance matters?    | ✅ Yes                           |
| Want task-specific adapters?      | ✅ Yes                           |
| Need extreme miniaturization?     | ❌ Go with Prompt/Prefix Tuning  |
| Working with encoder-only models? | ✅ LoRA or Adapters              |


🧾 TL;DR Summary

| If You Want...                      | Use This PEFT                     |
| ----------------------------------- | --------------------------------- |
| Balance of efficiency + performance | ✅ **LoRA**                        |
| Simple tuning + very low memory     | ✅ **BitFit** or **Prompt Tuning** |
| Strong multilingual performance     | ✅ **Adapters**                    |
| Fast inference + low storage        | ✅ **LoRA**                        |
| Smallest per-task storage (e.g., mobile/edge) | ✅ **Prompt Tuning**     |


🧠 Mastery Insight

In modern LLM workflows, LoRA is the go-to method due to its:

- High efficiency

- Reusability

- Ease of plugging into HuggingFace and PEFT pipelines

## 🔹 5. Advanced PEFT: QLoRA, AdaLoRA & Transformers Integration
These techniques push LoRA further: by reducing memory (QLoRA), by allocating rank adaptively across layers (AdaLoRA), or by integrating easily into production workflows.

🧠 5.1 QLoRA – Quantized LoRA

❓ What is QLoRA?

QLoRA = Quantized Model + LoRA adapters

It allows you to fine-tune a massive model (the QLoRA paper reports a 65B-parameter model on a single 48GB GPU) with:

- 4-bit quantization of base model

- LoRA adapters on top

✅ This enables low-memory fine-tuning with little to no loss in quality compared with 16-bit fine-tuning.

⚙️ Key Ingredients of QLoRA

| Component            | Role                                    |
| -------------------- | --------------------------------------- |
| 4-bit Quantization   | Compress base model weights with the NF4 data type (less VRAM) |
| LoRA Adapters        | Learn task-specific deltas              |
| Double Quantization  | Quantize the quantization constants too, saving more memory |
| Paged Optimizers     | Page optimizer state to CPU memory during memory spikes |


🔬 How It Works:

1. Load a pretrained model in 4-bit quantized format.

2. Freeze all base weights.

3. Inject LoRA adapters.

4. Train only LoRA (as usual).

5. Save just the adapters.

⚡ The base model is compressed AND untouched → memory-efficient + reusable!
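
To give a feel for the quantization step, here is a deliberately simplified sketch of block-wise quantization to 4-bit-sized integers. It uses plain symmetric absmax rounding, which is not the NF4 scheme QLoRA actually uses, and it stores the codes in `int8` because NumPy has no 4-bit type. The block size and the weight matrix are illustrative assumptions.

```python
import numpy as np


def quantize_blockwise(w, block_size=64, max_int=7):
    """Symmetric absmax quantization: each block maps to integers in [-max_int, max_int]."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / max_int   # one scale per block
    scales[scales == 0] = 1.0                                      # guard against all-zero blocks
    q = np.clip(np.round(blocks / scales), -max_int, max_int).astype(np.int8)
    return q, scales


def dequantize_blockwise(q, scales, shape):
    """Recover approximate float weights from the integer codes and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)   # stand-in for a frozen weight matrix

q, scales = quantize_blockwise(W)
W_hat = dequantize_blockwise(q, scales, W.shape)
max_error = float(np.abs(W - W_hat).max())
print(q.min(), q.max(), max_error)
```

The 15 integer values from -7 to 7 fit in 4 bits. The frozen weights stay in this compressed form and are dequantized when needed for the forward pass, while the LoRA matrices are trained in higher precision.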

🧠 Why Is QLoRA Important?

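- You can fine-tune LLaMA-7B on one consumer GPU (24GB); the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU.

- Dramatic memory savings: 4-bit weights take roughly a quarter of the memory of 16-bit weights.

- Reaches quality comparable to 16-bit fine-tuning while training very few parameters.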
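
A rough estimate of the memory needed for the weights alone shows why 4-bit quantization matters. This sketch ignores activations, LoRA parameters, optimizer state, and quantization constants, so real usage is higher. The helper name is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, n in [("7B", 7e9), ("65B", 65e9)]:
    fp16 = weight_memory_gb(n, 16)
    int4 = weight_memory_gb(n, 4)
    print(f"{name}: 16-bit = {fp16:.1f} GB, 4-bit = {int4:.1f} GB")
```

Going from 16 bits to 4 bits per weight divides the weight memory by four, before any overhead is counted.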

🔹 5.2 AdaLoRA – Adaptive LoRA

❓ What is AdaLoRA?

Instead of keeping the rank r constant, AdaLoRA dynamically adjusts it during training.

💡 Some layers benefit from higher capacity; others do not. AdaLoRA learns which layers matter more and allocates ranks intelligently.

🧠 How It Works

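1. Start every LoRA layer at an initial rank that is higher than the final budget allows.

2. Use a rank allocator that scores the importance of each rank component and prunes the least important ones during training.

3. Converge to a compressed set of LoRA adapters whose ranks differ from layer to layer.

🔄 The model learns where to focus its capacity, which can lead to better performance with the same parameter budget.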
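
The allocation idea can be sketched as a global top-k selection over per-component importance scores. This is a toy illustration only. The scores below are made up, whereas the actual method derives importance during training and prunes gradually on a schedule. The layer names and the function `allocate_ranks` are hypothetical.

```python
import numpy as np


def allocate_ranks(importance_per_layer, budget):
    """Keep the `budget` most important rank components across all layers.

    importance_per_layer: dict of layer name -> 1-D array, one score per rank component.
    Returns a dict of layer name -> boolean mask of the components that are kept.
    Ties at the threshold are all kept, so the budget can be exceeded in that case.
    """
    all_scores = np.concatenate(list(importance_per_layer.values()))
    threshold = np.sort(all_scores)[::-1][budget - 1]   # score of the budget-th largest component
    return {name: scores >= threshold for name, scores in importance_per_layer.items()}


# Made-up importance scores for three LoRA layers, each starting at rank 4.
importance = {
    "layer0.query": np.array([0.90, 0.40, 0.05, 0.01]),
    "layer0.value": np.array([0.70, 0.60, 0.50, 0.30]),
    "layer1.query": np.array([0.20, 0.02, 0.01, 0.00]),
}

masks = allocate_ranks(importance, budget=6)
final_ranks = {name: int(mask.sum()) for name, mask in masks.items()}
print(final_ranks)
```

With these scores the budget of 6 components is spent entirely on the first two layers, and the third layer is pruned to rank 0.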

🔬 Benefits of AdaLoRA

| Benefit            | Description                                 |
| ------------------ | ------------------------------------------- |
| Smarter LoRA       | Learns to use rank more effectively         |
| Higher performance | Better than fixed-rank LoRA in many tasks   |
| Compression-aware  | Automatically compresses unimportant layers |


🔹 5.3 LoRA in HuggingFace & PEFT Workflows

✅ HuggingFace PEFT Integration

You can use LoRA in 🤗 PEFT with just a few lines:

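```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor (the update is scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],   # layer names depend on the model architecture
    lora_dropout=0.1,
    task_type="SEQ_CLS"  # or CAUSAL_LM, TOKEN_CLS, etc.
)

# base_model is an already-loaded 🤗 Transformers model whose attention
# projections are named q_proj / v_proj
model = get_peft_model(base_model, config)
```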
The library:

- Automatically freezes base model

- Adds LoRA

- Makes saving/loading adapters easy

🧠 When to Use Each?

| Scenario                            | Use This                              |
| ----------------------------------- | ------------------------------------- |
| You have a small GPU (<= 24GB)      | ✅ QLoRA                               |
| You want the best accuracy for a fixed parameter budget | ✅ AdaLoRA          |
| You want a production-friendly flow | ✅ PEFT + LoRA                         |
| You want multi-task support         | ✅ LoRA or Adapters                    |
| You need edge/mobile deployment     | ❌ Avoid full LoRA (use Prompt Tuning) |


✅ Milestone Unlocked: Advanced LoRA

You now understand:

- 🔷 QLoRA: Fine-tune huge models with small VRAM

- 🔷 AdaLoRA: Learn dynamic ranks

- 🔷 How to use LoRA in real-world pipelines

## 🔹 6. Practical Project: Fine-Tune a Model with LoRA (End-to-End)
We'll start with a text classification task — simple but powerful — and use LoRA to fine-tune a transformer on it.

🗂️ Project: Sentiment Classification using LoRA

| Task | Sentiment Analysis (binary: pos/neg) |
|---------------------|--------------------------|
| Dataset | IMDB / Yelp / SST-2 (small but real) |
| Model | bert-base-uncased |
| Library Stack | HuggingFace Transformers + Datasets + PEFT |
| PEFT Method | LoRA |
| Training Setup | 1 GPU (or CPU if needed, slower) |

```python
# 🔧 Step 1: Install Required Libraries
# If you're working in Colab or on your local machine, install:
# (bitsandbytes is only needed for 4-bit, QLoRA-style loading; this project does not use it)

!pip install transformers datasets peft accelerate bitsandbytes
```

```python
# 🧩 Step 2: Load Dataset
# Let’s use the HuggingFace datasets library:

from datasets import load_dataset

# Load SST-2 (Stanford Sentiment Treebank)
dataset = load_dataset("glue", "sst2")

# SST-2 is binary: label = 1 (positive), 0 (negative)
```

```python
# 🧠 Step 3: Tokenize Data
from transformers import AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # With batched=True, batch["sentence"] is a list of sentences
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
# Trainer expects the target column to be called "labels"
encoded = encoded.rename_column("label", "labels")
encoded.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
```

```python
# 🏗️ Step 4: Load Base Model & Apply LoRA

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # for BERT
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

# Inject LoRA into model
model = get_peft_model(model, config)

# Print total vs trainable params
model.print_trainable_parameters()
```
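
As a sanity check on the printed number, the LoRA part of the trainable parameters can be worked out by hand. The sketch below relies on the `bert-base-uncased` architecture (12 layers, hidden size 768) and the `r=8`, query/value configuration used in this step.

```python
# Architecture of bert-base-uncased: 12 transformer layers, hidden size 768.
num_layers, hidden, r = 12, 768, 8
targets_per_layer = 2                        # "query" and "value"

lora_per_matrix = r * hidden + hidden * r    # A (r x hidden) + B (hidden x r)
lora_total = num_layers * targets_per_layer * lora_per_matrix
print(f"LoRA parameters: {lora_total:,}")
```

The figure reported by `print_trainable_parameters()` may be somewhat larger than this, since for sequence classification the classification head is typically trained as well.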

```python
# 🏋️ Step 5: Train the LoRA Model
# We’ll use 🤗 Trainer for simplicity:

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./lora-sst2",
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)

trainer.train()
```
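
As configured, evaluation reports only the loss. A metric function along the following lines could be passed to the `Trainer` through its `compute_metrics` argument to also report accuracy. This is a minimal sketch that assumes the predictions arrive as a `(logits, labels)` pair of NumPy arrays. The logits and labels at the bottom are made up to exercise the function.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn a (logits, labels) pair into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # pick the higher-scoring class
    return {"accuracy": float((predictions == labels).mean())}


# Quick check with made-up logits for four examples.
fake_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
fake_labels = np.array([0, 1, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, add `compute_metrics=compute_metrics` when constructing the `Trainer`.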

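```python
# 💾 Step 6: Save the LoRA Adapter (Not Full Model)

# Saves the adapter weights and config only, not the full BERT model
# (for sequence classification this typically includes the small classification head)
model.save_pretrained("lora_adapter_sst2")
```

Later, you can load this adapter back onto a freshly loaded base BERT model with `PeftModel.from_pretrained(base_model, "lora_adapter_sst2")`.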

✅ What You’ve Learned from This Practical:

| Concept             | You’ve Done It! ✅ |
| ------------------- | ----------------- |
| Load a dataset      | ✅                 |
| Tokenize it         | ✅                 |
| Load base model     | ✅                 |
| Add LoRA adapters   | ✅                 |
| Train only adapters | ✅                 |
| Save LoRA only      | ✅                 |
