🧭 Our Roadmap to Mastery

1. What is Fine-Tuning and Why It's Expensive

2. PEFT - Parameter-Efficient Fine-Tuning

3. LoRA - Low-Rank Adaptation

  - Deep Dive into LoRA

    - Linear Layers & Matrix Decomposition

    - The core idea of LoRA

    - Implementation walkthrough

    - Pros & Limitations

4. Comparison: LoRA vs Other PEFT methods

5. Advanced: QLoRA, AdaLoRA, LoRA with Transformers

6. Hands-On: Fine-tuning a HuggingFace model using LoRA
___

## 🔹 1. What is Fine-Tuning and Why It's Expensive?

🧠 What is Fine-Tuning?

Fine-tuning is a process where you take a pre-trained model (like BERT, GPT, or ViT) and adjust its parameters slightly so that it performs well on a specific downstream task, such as classification, QA, or summarization.

- Pre-trained models are trained on large, generic datasets (e.g., Wikipedia, Common Crawl).

- Fine-tuning adapts the model to a smaller, task-specific dataset (e.g., product classification, sentiment analysis).

🧪 Think of it like customizing a car: the base car is great, but you add features for your specific needs (GPS, baby seat, etc.).

🔍 Example:

- Pretrained BERT → Fine-tune on your product description classification task.

- Pretrained ViT → Fine-tune on your fashion image categorization.

⚙️ What Happens Under the Hood?

In full fine-tuning, all model parameters (often hundreds of millions, and billions for LLMs) are updated through backpropagation using your task-specific data. This is where the cost comes from: training has to keep the weights, a gradient for every weight, and the optimizer state in memory, and each task ends up with its own full-size copy of the model.
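To put a rough number on that cost, here is a back-of-the-envelope sketch. It assumes fp32 values and an Adam-style optimizer that keeps two extra values per parameter; the function name and the 110M parameter count (roughly the size of BERT-base) are illustrative assumptions, and activation memory is ignored.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough training-memory estimate for full fine-tuning.

    Assumptions (not tied to any specific framework): fp32 values and an
    Adam-style optimizer storing two moment estimates per parameter.
    Activations and framework overhead are ignored.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = 2 * num_params * bytes_per_value
    return (weights + gradients + optimizer_state) / 1e9


# ~110M parameters is roughly the size of BERT-base.
print(f"{full_finetune_memory_gb(110_000_000):.2f} GB")
```

Under these assumptions every parameter costs 16 bytes during training, four times what it costs just to store the weights.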

🤔 Real-World Problem

You want to adapt a large pre-trained model like GPT or BERT to 50 different tasks, but:

- You don't have enough GPU resources.

- You don't want to retrain the whole model each time.

- You want to reuse the base model and store only small "adapters" for each task.

🔄 This is where PEFT comes in!

Rather than updating all parameters, PEFT updates only a small number of parameters (often <1%) and freezes the rest.
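The storage side of the 50-task problem can be sketched with simple arithmetic. The numbers below are illustrative assumptions (a 110M-parameter base model, adapters of 300K parameters each, fp32 storage), not measurements.

```python
def storage_gb(num_tasks, base_params, adapter_params, bytes_per_value=4):
    """Compare on-disk storage: one full model copy per task vs. one shared
    base model plus a small adapter per task. fp32 storage is assumed."""
    full_copies = num_tasks * base_params * bytes_per_value
    shared_base = (base_params + num_tasks * adapter_params) * bytes_per_value
    return full_copies / 1e9, shared_base / 1e9


# Illustrative numbers: 110M-parameter base, adapters at roughly 0.3% of that.
full, peft = storage_gb(num_tasks=50, base_params=110_000_000, adapter_params=300_000)
print(f"full copies: {full:.1f} GB, base + adapters: {peft:.2f} GB")
```

With these assumptions, 50 full copies need 22 GB while one base model plus 50 adapters needs about 0.5 GB.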
___

## 🔹 2. PEFT - Parameter-Efficient Fine-Tuning

🧠 What is PEFT?

PEFT (Parameter-Efficient Fine-Tuning) is a set of techniques that adapt large pre-trained models to new tasks by training only a small subset of parameters.

🔧 Instead of rebuilding the entire engine, you're just tweaking a few components to fit your task.

🚀 Why PEFT?

    Traditional Fine-Tuning         | PEFT
    ------------------------------------------------------------------
    Updates all parameters          | Updates only a small subset
    High memory + compute cost      | Low memory + fast training
    Large model copy per task       | Small "adapter" per task
    Difficult multi-task scaling    | Easy multi-task deployment

🛠️ Core Idea Behind PEFT

- Freeze the original model weights

- Add/train only a few lightweight modules

- This allows reuse of the base model, and you just plug in the task-specific pieces.

📚 Types of PEFT Methods

Let's look at some of the most common and impactful PEFT approaches:

    Method                        | Core Idea
    ------------------------------------------------------------------------------------
    Adapter Tuning                | Add small bottleneck layers between transformer layers
    LoRA (Low-Rank Adaptation)    | Inject low-rank matrices into weight updates
    Prefix Tuning                 | Add learnable prefix vectors to the attention of every layer
    Prompt Tuning                 | Add learnable soft prompt tokens to the input embeddings only
    BitFit                        | Only fine-tune bias terms
    IA³                           | Learn small vectors that rescale keys, values, and feed-forward activations

🧾 TL;DR of PEFT

- Motivation: Avoid retraining massive models.

- Strategy: Freeze the main model → Add small trainable modules.

- Result: Huge savings in time, compute, and storage, without sacrificing much performance.
___

## 🔹 3. LoRA - Low-Rank Adaptation

🧠 What Problem Does LoRA Solve?

Let's recall:

In transformers, the linear layers (the attention projections $W_q$, $W_k$, $W_v$, $W_o$ and the feed-forward layers) hold most of the parameters, millions per layer.
In standard fine-tuning, we update all of them.

💡 LoRA's idea:

Don't update the full weight matrix. Instead, learn a low-rank decomposition that represents the change in weights, saving parameters, compute, and memory.

🧮 Step-by-Step: LoRA Math Intuition

Let's break it down:

🧱 1. Standard Linear Layer

A transformer layer has a linear transformation:

$$ y = Wx $$

where:
- $W \in \mathbb{R}^{d_{out} \times d_{in}}$
- $x \in \mathbb{R}^{d_{in}}$

🧱 2. In Full Fine-Tuning:

You update $W$ directly → too many parameters.

💡 3. In LoRA:

Keep $W$ frozen, and add a low-rank adapter:

$$ W' = W + \Delta W \quad \text{where} \quad \Delta W = BA $$

- $A \in \mathbb{R}^{r \times d_{in}}$

- $B \in \mathbb{R}^{d_{out} \times r}$

- $r \ll \min(d_{in}, d_{out})$ → Low-rank matrices!

Then:

$$ y = (W + BA)x = Wx + BAx $$

🔑 Only train $A$ and $B$, keep $W$ frozen.
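The forward pass above can be checked on a toy example. This is a minimal pure-Python sketch: the helper names (`matvec`, `matmul`, `lora_forward`) and the tiny matrices are made up for illustration, and a real implementation would use a tensor library.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply matrices X (n x k) and Y (k x m)."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def lora_forward(W, A, B, x):
    """y = Wx + B(Ax): the frozen path plus the low-rank path."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


# Toy sizes: d_out = 3, d_in = 2, r = 1.
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # frozen, d_out x d_in
A = [[0.5, -0.5]]                           # trainable, r x d_in
B = [[2.0], [0.0], [1.0]]                   # trainable, d_out x r
x = [4.0, 2.0]

# Same result as materializing W' = W + BA first.
BA = matmul(B, A)
W_prime = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
assert lora_forward(W, A, B, x) == matvec(W_prime, x)

# With B = 0, the adapted layer reproduces the frozen layer exactly.
B_zero = [[0.0], [0.0], [0.0]]
assert lora_forward(W, A, B_zero, x) == matvec(W, x)
```

The second check is why LoRA implementations commonly initialize $B$ to zero: training then starts from exactly the pre-trained behavior. Note also that computing $B(Ax)$ never builds the full $d_{out} \times d_{in}$ update matrix.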

🔢 How Efficient Is This?

If $r = 8$ and $W$ is 768×768 (589,824 parameters):

- Fine-tuning $W$ → 589,824 parameters

- LoRA ($B$ and $A$) → 8 × 768 + 768 × 8 = 12,288 parameters

That's 48× fewer trainable parameters for this layer!
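The same arithmetic as a small helper, so you can try other ranks. The function name is an illustrative choice, not part of any library.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora, ratio) trainable-parameter counts for one layer."""
    full = d_out * d_in            # every entry of W
    lora = r * d_in + d_out * r    # A is r x d_in, B is d_out x r
    return full, lora, full / lora


for r in (4, 8, 16):
    full, lora, ratio = lora_param_counts(768, 768, r)
    print(f"r={r:2d}: full={full:,}  lora={lora:,}  ratio={ratio:.0f}x")
```

Doubling the rank doubles the adapter size, so $r$ is the main knob for trading capacity against cost.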

⚙️ LoRA Injection Points

LoRA is typically applied to the query and value projections in transformer blocks:

- $W_q, W_v$ → fine-tuned with LoRA

- Everything else (e.g., $W_k$, $W_o$, FFN, embeddings) → frozen

This keeps the number of trainable parameters, and with it the gradient and optimizer memory, very small, which also speeds up training.
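To see what this choice means for a whole model, the sketch below counts LoRA parameters for a BERT-base-like shape (12 layers, hidden size 768). The function name is hypothetical, and it assumes every adapted projection is square (hidden × hidden).

```python
def lora_trainable_params(num_layers, hidden_size, r, modules_per_layer=2):
    """LoRA parameters when square (hidden x hidden) projections are adapted.

    modules_per_layer=2 corresponds to adapting only W_q and W_v.
    """
    per_module = r * hidden_size + hidden_size * r
    return num_layers * modules_per_layer * per_module


# BERT-base-like shape: 12 layers, hidden size 768.
qv_only = lora_trainable_params(num_layers=12, hidden_size=768, r=8)
all_attn = lora_trainable_params(num_layers=12, hidden_size=768, r=8, modules_per_layer=4)
print(qv_only, all_attn)
```

Adapting only $W_q$ and $W_v$ gives 294,912 LoRA parameters here; adapting all four attention projections doubles that, which is still a small fraction of a model with roughly 110M parameters.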

✅ Benefits of LoRA

Benefit | Description
--------|-----------------------------------------
Efficiency | Tiny number of trainable parameters
Speed | Fast training with fewer gradients
Storage | Only store LoRA adapters (small files)
Plug-n-Play | Easily load/unload adapters for tasks
Modularity | Multiple tasks can reuse the same base model

⚠️ Any Trade-offs?

- Slight loss in performance for some tasks if $r$ is too low.

- Requires framework support (like 🤗 Transformers or the PEFT library).

## 🔹 4. LoRA vs Other PEFT Methods
(📊 When to Use What?)

To choose the right PEFT technique, you need to understand:

- How each one works

- Their strengths and limitations

- Their use cases

Let's break them down clearly.

🧠 Summary of Common PEFT Techniques

| PEFT Method        | Core Idea                                  | Param Count   | Key Use Cases             |
| ------------------ | ------------------------------------------ | ------------- | ------------------------- |
| **LoRA**           | Inject low-rank updates into linear layers | 🔸 Small      | NLP, CV, LLMs             |
| **Adapter Tuning** | Add bottleneck layers between layers       | 🔸 Small-Med  | Multilingual, NLP         |
| **Prefix Tuning**  | Add learnable vectors at every layer       | 🔹 Very Small | Generation tasks          |
| **Prompt Tuning**  | Train soft tokens prepended to input       | 🔹 Very Small | Prompt-based tasks        |
| **BitFit**         | Train only bias terms                      | 🔹 Smallest   | Simple tasks, diagnostics |
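The "Param Count" column can be made concrete with rough formulas. The sketch below is an illustration under stated assumptions: a 12-layer model with hidden size 768, LoRA with $r = 8$ on two projections per layer, one bottleneck adapter of width 64 per layer, and 20 soft prompt tokens. All of these settings are hypothetical choices, and Prefix Tuning and BitFit are left out because their counts depend more heavily on architecture details.

```python
def lora_params(num_layers, d, r, modules_per_layer=2):
    # A (r x d) plus B (d x r) for each adapted projection.
    return num_layers * modules_per_layer * (2 * r * d)


def adapter_params(num_layers, d, bottleneck, adapters_per_layer=1):
    # Down-projection (d x m weights + m biases) plus up-projection (m x d + d biases).
    per_adapter = 2 * d * bottleneck + bottleneck + d
    return num_layers * adapters_per_layer * per_adapter


def prompt_tuning_params(num_virtual_tokens, d):
    # One trainable embedding vector per soft token, at the input only.
    return num_virtual_tokens * d


counts = {
    "LoRA (r=8, q+v)": lora_params(12, 768, r=8),
    "Adapter (bottleneck=64)": adapter_params(12, 768, bottleneck=64),
    "Prompt tuning (20 tokens)": prompt_tuning_params(20, 768),
}
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {n:>10,}")
```

With these particular settings the ordering matches the table: prompt tuning is smallest, LoRA sits in the middle, and adapters are largest. Different ranks or bottleneck widths can change the picture.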

🔍 Let's Compare Them One by One:

✅ LoRA

**How it works:** Decomposes weight updates into low-rank matrices → applied to linear layers.

**Pros:**

- Highly efficient for transformers

- Doesn't change the architecture: the update can be merged into $W$ after training, so inference adds no extra latency

- Works well with attention-heavy models

**Cons:**

- Slight complexity in implementation

Best for: LLMs, ViTs, when attention modules dominate.

✅ Adapter Tuning

**How it works:** Inserts small trainable MLP layers (bottlenecks) between frozen layers.

**Pros:**

- Strong performance in multilingual and multi-task learning

- Good generalization

**Cons:**

- Adds more parameters than LoRA

- Slightly larger memory footprint

Best for: NLP, multilingual tasks, earlier transformer models (like BERT).
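As a rough picture of what such a bottleneck looks like, here is a pure-Python sketch: project down, apply a nonlinearity, project back up, and add the result to the input. The ReLU activation, the helper names, and the toy sizes are assumptions for illustration; real adapter variants differ in placement and details.

```python
def linear(W, b, x):
    """Compute Wx + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, vi) for vi in v]


def adapter(x, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: x + up(relu(down(x))). Only these weights train."""
    hidden = relu(linear(W_down, b_down, x))
    delta = linear(W_up, b_up, hidden)
    return [xi + di for xi, di in zip(x, delta)]


# Toy sizes: hidden size 4, bottleneck size 2.
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.5, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0]]     # 2 x 4
b_down = [0.0, 0.0]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]  # 4 x 2
b_up = [0.0, 0.0, 0.0, 0.0]

print(adapter(x, W_down, b_down, W_up, b_up))
```

Because of the residual connection, an adapter whose up-projection is all zeros leaves the frozen model's output unchanged. Unlike a merged LoRA update, the adapter stays in the forward pass at inference time.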

✅ Prefix Tuning

**How it works:** Prepends trainable vectors (prefixes) to the attention keys and values in each transformer block.

**Pros:**

- Ultra-lightweight

- Keeps model architecture unchanged

**Cons:**

- Best for generation tasks only (e.g., summarization, translation)

- Weaker for classification

Best for: GPT-style autoregressive models.

✅ Prompt Tuning

**How it works:** Adds a set of soft (learnable) tokens to the input prompt.

**Pros:**

- Lightweight

- Minimal parameter footprint

**Cons:**

- Needs good initialization

- Can be task-specific

Best for: Simple prompt-driven tasks, often in zero-/few-shot setups.
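Mechanically, prompt tuning only changes what the frozen model receives as input. The sketch below shows that step with plain lists; the function names and toy embeddings are hypothetical, and in practice the soft prompt would be a trainable tensor.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence the frozen model actually sees.

    soft_prompt: trainable vectors, shape (num_virtual_tokens, d)
    token_embeddings: frozen embeddings of the real input, shape (seq_len, d)
    """
    return [list(v) for v in soft_prompt] + [list(v) for v in token_embeddings]


def extend_attention_mask(mask, num_virtual_tokens):
    """Soft-prompt positions are always attended to."""
    return [1] * num_virtual_tokens + list(mask)


soft_prompt = [[0.1, 0.2], [0.3, 0.4]]                     # 2 virtual tokens, d = 2
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # 2 real tokens + 1 padding
mask = [1, 1, 0]

inputs = prepend_soft_prompt(soft_prompt, token_embeddings)
full_mask = extend_attention_mask(mask, len(soft_prompt))
print(len(inputs), full_mask)
```

The only trainable values are the entries of `soft_prompt`, which is why the parameter footprint is so small.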

✅ BitFit

**How it works:** Trains only the bias terms in the model.

**Pros:**

- Dead simple

- Tiny memory cost

**Cons:**

- Often underperforms on complex tasks

Best for: Diagnostics, toy experiments.
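Selecting the bias terms can be sketched as a simple name filter. The parameter names and shapes below are hypothetical, and picking parameters by a name ending in "bias" is an assumption about the naming scheme, not a library API.

```python
def bitfit_trainable(param_shapes):
    """Map each parameter name to True (train) or False (freeze).

    BitFit trains bias terms only; selecting by a name ending in "bias"
    is an assumption about the naming scheme.
    """
    return {name: name.endswith("bias") for name in param_shapes}


def count(shape):
    """Number of values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


# Hypothetical parameter names and shapes for two 768-wide projections.
param_shapes = {
    "layer.0.query.weight": (768, 768),
    "layer.0.query.bias": (768,),
    "layer.0.value.weight": (768, 768),
    "layer.0.value.bias": (768,),
}
flags = bitfit_trainable(param_shapes)
trainable = sum(count(s) for n, s in param_shapes.items() if flags[n])
total = sum(count(s) for s in param_shapes.values())
print(trainable, total)
```

In a framework like PyTorch, the same flags would typically be applied by switching gradient tracking on or off per parameter.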

| Situation                         | Is LoRA a good fit?             |
| --------------------------------- | ------------------------------- |
| LLMs or ViT-based models?         | ✅ Yes                           |
| Memory/Compute constrained?       | ✅ Yes                           |
| Inference performance matters?    | ✅ Yes                           |
| Want task-specific adapters?      | ✅ Yes                           |
| Need extreme miniaturization?     | ❌ Go with Prompt/Prefix Tuning  |
| Working with encoder-only models? | ✅ LoRA or Adapters              |


🧾 TL;DR Summary

| If You Want...                          | Use This PEFT                     |
| --------------------------------------- | --------------------------------- |
| Balance of efficiency + performance     | ✅ **LoRA**                        |
| Simple tuning + very low memory         | ✅ **BitFit** or **Prompt Tuning** |
| Strong multilingual performance         | ✅ **Adapters**                    |
| Fast inference + low storage            | ✅ **LoRA**                        |
| Smallest per-task footprint (mobile/edge) | ✅ **Prompt Tuning**             |


🧠 Mastery Insight

In modern LLM workflows, LoRA is the go-to method due to its:

- High efficiency

- Reusability

- Ease of plugging into HuggingFace and PEFT pipelines

## 🔹 5. Advanced PEFT: QLoRA, AdaLoRA & Transformers Integration

These techniques push the limits of LoRA even further, either by reducing memory (QLoRA), by allocating adapter capacity adaptively (AdaLoRA), or by integrating easily into production workflows.

🧠 5.1 QLoRA - Quantized LoRA

❓ What is QLoRA?

QLoRA = Quantized Model + LoRA adapters

It allows you to fine-tune a massive model (like 65B) on a single GPU with:

- 4-bit quantization of the base model

- LoRA adapters on top

✅ This enables low-memory fine-tuning with little to no loss in quality.

⚙️ Key Ingredients of QLoRA

| Component                | Role                                                           |
| ------------------------ | -------------------------------------------------------------- |
| 4-bit Quantization (NF4) | Compress the frozen base model weights (less VRAM)             |
| LoRA Adapters            | Learn task-specific deltas in higher precision                 |
| Double Quantization      | Quantize the quantization constants themselves to save more    |
| Paged Optimizers         | Page optimizer state to CPU memory during memory spikes        |


🔬 How It Works:

1. Load a pretrained model in 4-bit quantized format.

2. Freeze all base weights.

3. Inject LoRA adapters.

4. Train only LoRA (as usual).

5. Save just the adapters.

⚡ The base model is compressed AND untouched → memory-efficient + reusable!
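To get a feel for what 4-bit storage does to the weights, here is a simplified sketch of blockwise quantization. It uses a uniform grid with one scale per block, which is an assumption made for readability: QLoRA itself uses the NF4 data type, whose levels are not evenly spaced. The function names and the tiny block size are illustrative.

```python
def quantize_block(values, levels=16):
    """Absmax quantization of one block to signed integer codes.

    With levels=16 each code fits in 4 bits. Codes lie in [-7, 7] here;
    this uniform grid is a simplification (QLoRA uses the NF4 data type).
    """
    max_code = levels // 2 - 1                      # 7 for 4 bits
    largest = max(abs(v) for v in values)
    scale = largest / max_code if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    return [c * scale for c in codes]


def quantize(weights, block_size=4):
    """Quantize a flat weight list block by block, one scale per block."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]


weights = [0.7, -0.3, 0.04, 0.1, 7.0, -3.0, 1.2, 0.0]
packed = quantize(weights)
restored = [v for codes, scale in packed for v in dequantize_block(codes, scale)]

for w, r in zip(weights, restored):
    print(f"{w:6.2f} -> {r:6.2f}")
```

Each block keeps its own scale, so the block of small weights is not flattened by the large values in the other block. The rounding error that remains is what the LoRA adapters, trained in higher precision, have to work around.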

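🧠 Why Is QLoRA Important?

- You can fine-tune models like LLaMA-7B on one consumer GPU (24GB); the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU.

- Dramatic memory savings: 4-bit base weights take about a quarter of the memory of 16-bit weights

- Reported to preserve 16-bit fine-tuning quality while training very few parameters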
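The memory claim is easy to sanity-check for the weights alone. This sketch ignores activations, optimizer state, and the small overhead of quantization constants, so treat the numbers as a lower bound; the function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone (no activations, no optimizer)."""
    return num_params * bits_per_param / 8 / 1e9


for name, n in [("7B", 7e9), ("65B", 65e9)]:
    fp16 = weight_memory_gb(n, 16)
    int4 = weight_memory_gb(n, 4)
    print(f"{name}: 16-bit {fp16:.1f} GB -> 4-bit {int4:.1f} GB")
```

At 4 bits, the weights of a 65B model drop from about 130 GB to about 32.5 GB, which is what makes a single large GPU feasible.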

🔹 5.2 AdaLoRA - Adaptive LoRA

❓ What is AdaLoRA?

Instead of keeping the rank $r$ constant, AdaLoRA dynamically adjusts it during training.

💡 Some layers benefit from higher capacity; others do not. AdaLoRA learns which layers matter more and allocates ranks intelligently.

🧠 How It Works

1. Start with a higher initial rank for all LoRA layers.

2. Use a rank allocator that scores the importance of each rank component and prunes the least important ones over time.

3. Converge to a compressed set of LoRA adapters that fits a target rank budget.

🔄 The model learns where to focus its capacity, which can lead to better performance with the same budget.
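The budget idea can be illustrated with a toy allocator. This is a deliberately simplified sketch, not the AdaLoRA algorithm: the importance scores are made up, whereas the real method derives them during training, and the function name is hypothetical.

```python
def allocate_ranks(importance, budget):
    """Keep the `budget` most important rank components across all layers.

    importance: {layer_name: [score per rank component]}
    Returns {layer_name: number of components kept}.
    """
    scored = [
        (score, layer, idx)
        for layer, scores in importance.items()
        for idx, score in enumerate(scores)
    ]
    scored.sort(reverse=True)
    kept = {layer: 0 for layer in importance}
    for _, layer, _ in scored[:budget]:
        kept[layer] += 1
    return kept


# Hypothetical importance scores; every layer starts with rank 4.
importance = {
    "layer0.query": [0.90, 0.80, 0.10, 0.05],
    "layer0.value": [0.70, 0.60, 0.50, 0.40],
    "layer1.query": [0.20, 0.03, 0.02, 0.01],
}
print(allocate_ranks(importance, budget=6))
```

With a total budget of 6, the allocator keeps all four components in one layer, two in another, and none in the third, instead of spreading rank 2 evenly everywhere.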

🔬 Benefits of AdaLoRA

| Benefit            | Description                                 |
| ------------------ | ------------------------------------------- |
| Smarter LoRA       | Learns to use rank more effectively         |
| Higher performance | Better than fixed-rank LoRA in many tasks   |
| Compression-aware  | Automatically compresses unimportant layers |


🔹 5.3 LoRA in HuggingFace & PEFT Workflows

✅ HuggingFace PEFT Integration

You can use LoRA in 🤗 PEFT with just a few lines:

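```
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Any model whose attention layers are named q_proj / v_proj works here;
# facebook/opt-125m is used only as a small example.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-125m", num_labels=2
)

config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # module names depend on the architecture
    lora_dropout=0.1,
    task_type="SEQ_CLS"  # or CAUSAL_LM, TOKEN_CLS, etc.
)

model = get_peft_model(base_model, config)
```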
The library:

- Automatically freezes base model

- Adds LoRA

- Makes saving/loading adapters easy

🧠 When to Use Each?

| Scenario                                       | Use This                              |
| ---------------------------------------------- | ------------------------------------- |
| You have a small GPU (<= 24GB)                 | ✅ QLoRA                               |
| You want the most from a fixed parameter budget | ✅ AdaLoRA                            |
| You want a production-friendly flow            | ✅ PEFT + LoRA                         |
| You want multi-task support                    | ✅ LoRA or Adapters                    |
| You need edge/mobile deployment                | ❌ Avoid full LoRA (use Prompt Tuning) |


✅ Milestone Unlocked: Advanced LoRA

You now understand:

- 🔷 QLoRA: Fine-tune huge models with small VRAM

- 🔷 AdaLoRA: Learn dynamic ranks

- 🔷 How to use LoRA in real-world pipelines

## 🔹 6. Practical Project: Fine-Tune a Model with LoRA (End-to-End)

We'll start with a text classification task, simple but powerful, and use LoRA to fine-tune a transformer on it.

🗂️ Project: Sentiment Classification using LoRA

| Item | Choice |
|---------------------|--------------------------|
| Task | Sentiment Analysis (binary: pos/neg) |
| Dataset | SST-2 from GLUE (small but real; IMDB or Yelp would also work) |
| Model | bert-base-uncased |
| Library Stack | HuggingFace Transformers + Datasets + PEFT |
| PEFT Method | LoRA |
| Training Setup | 1 GPU (or CPU if needed, slower) |

# 🔧 Step 1: Install Required Libraries
# If you're working in Colab or on your local machine, install:

!pip install transformers datasets peft accelerate bitsandbytes

# 🧩 Step 2: Load Dataset
# Let's use the HuggingFace datasets library:

from datasets import load_dataset

# Load SST-2 (Stanford Sentiment Treebank)
dataset = load_dataset("glue", "sst2")

# SST-2 is binary: label = 1 (positive), 0 (negative)

# 🧠 Step 3: Tokenize Data
from transformers import AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(example):
    return tokenizer(example["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
encoded = encoded.rename_column("label", "labels")
encoded.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# 🏗️ Step 4: Load Base Model & Apply LoRA

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # for BERT
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

# Inject LoRA into model
model = get_peft_model(model, config)

# Print total vs trainable params
model.print_trainable_parameters()

# 🏋️ Step 5: Train the LoRA Model
# We'll use 🤗 Trainer for simplicity:

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./lora-sst2",
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"]
)

trainer.train()
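As written, the Trainer setup reports only the evaluation loss. If you also want accuracy, a metric function along these lines could be passed to `Trainer` through its `compute_metrics` argument. This is a minimal sketch that avoids extra dependencies; the made-up logits at the bottom are only there to show the expected input shape.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); works on nested lists or arrays."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Quick check with made-up logits for four examples.
fake_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]]
fake_labels = [1, 0, 0, 0]
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, add `compute_metrics=compute_metrics` when constructing the `Trainer`, before calling `trainer.train()`.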

# 💾 Step 6: Save the LoRA Adapter (Not Full Model)

# Only LoRA adapter weights
model.save_pretrained("lora_adapter_sst2")

Later, you can plug this adapter into the base BERT model easily!
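Loading the adapter back could look like the sketch below. It assumes the adapter was saved to `lora_adapter_sst2` as in Step 6 and that the same `bert-base-uncased` base model is used; the helper name `load_lora_classifier` is made up for this example.

```python
def load_lora_classifier(base_name="bert-base-uncased",
                         adapter_dir="lora_adapter_sst2",
                         num_labels=2):
    """Rebuild the fine-tuned model: frozen base + saved LoRA adapter."""
    # Imported lazily so this helper can be defined without the libraries installed.
    from transformers import AutoModelForSequenceClassification
    from peft import PeftModel

    base = AutoModelForSequenceClassification.from_pretrained(
        base_name, num_labels=num_labels
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # inference mode: disables dropout
    return model
```

Calling `model = load_lora_classifier()` downloads the base model if needed and attaches the adapter. Because the base weights are untouched, the same base model can be combined with a different adapter directory for each task.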

✅ What You've Learned from This Practical:

| Concept             | You've Done It! ✅ |
| ------------------- | ----------------- |
| Load a dataset      | ✅                 |
| Tokenize it         | ✅                 |
| Load base model     | ✅                 |
| Add LoRA adapters   | ✅                 |
| Train only adapters | ✅                 |
| Save LoRA only      | ✅                 |
