# 🧠 Mastering Parameter-Efficient Fine-Tuning (PEFT)

Focusing on:

## 🔧 LoRA (Low-Rank Adaptation)

### ✔️ Problem
Many LLMs have **billions of parameters**, making them expensive to fine-tune. These models are made up of many layers (e.g., 32 decoder layers in transformers), each including:

- 🤝 **Self-Attention**: Learns relationships between words  
- 🧠 **MLP (Multi-Layer Perceptron)**: Learns complex patterns  
- 🌊 **SiLU Activation**: A smooth activation function used inside the MLP  
- 🧽 **LayerNorm** (RMSNorm in Llama models): Keeps values stable during training  

### ✔️ Solution
LoRA allows fine-tuning by **adding small trainable matrices** to the frozen base model weights.

- ❄️ Freeze original model weights  
- 🧩 Only train small adapter matrices  
- 📉 Drastically reduces trainable parameters  
- 🔁 LoRA can be applied to most transformer models, and adapters can be swapped on the same base model  

### ✔️ How it works
Weight updates are **decomposed into low-rank matrices** (A & B).
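
As a rough illustration of the decomposition: for a frozen weight `W` of shape `d_out × d_in`, LoRA trains `B` (`d_out × r`) and `A` (`r × d_in`) and uses `W + (alpha / r) · B·A` as the effective weight. The sketch below uses plain Python lists and toy sizes; the function names `matmul`, `lora_effective_weight` and `lora_param_counts` are hypothetical helpers for this note, not part of any library.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A); W itself is never modified."""
    scale = alpha / r
    delta = matmul(B, A)  # same shape as W, but rank <= r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Compare parameters in a full update (d_out * d_in) vs. LoRA (r * (d_out + d_in))."""
    return d_out * d_in, r * (d_out + d_in)


# Toy example: a 2x3 frozen weight with a rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]        # 2 x 1
A = [[0.5, 0.0, 1.0]]     # 1 x 3
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff)

# A 4096 x 4096 projection with r = 32
full, lora = lora_param_counts(4096, 4096, 32)
print(f"full update: {full:,} params, LoRA: {lora:,} params")
```

For the 4096 × 4096 case, the adapter has 64× fewer trainable parameters than a full update of the same matrix.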

---

## ⚙️ QLoRA

### ✔️ Problem
Fine-tuning large models = **high memory & compute cost**.

### ✔️ Solution  
QLoRA combines two major tricks:

1. 🧮 **Quantization** – Reduces memory (e.g., 16-bit → 4-bit = 75% less memory)  
2. 🧩 **LoRA** – Enables large model tuning on consumer GPUs
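
The 75% figure can be checked with back-of-the-envelope arithmetic. This sketch counts weight storage only and assumes a round 8 billion parameters; a real quantized model usually reports somewhat more, since some layers stay in higher precision and quantization constants take extra space. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumed round figure for an 8B model

sizes = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")

saving = 1 - sizes[4] / sizes[16]
print(f"16-bit -> 4-bit saving: {saving:.0%}")
```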

### ✔️ How it works

- 🧊 Load quantized base model (lower bit precision)  
- ❄️ Freeze base model weights  
- 💡 Apply **full-precision LoRA adapters** for fine-tuning  
- 🧠 Adapters learn effectively while base stays efficient

---

## 🔬 Hyperparameters

### ✔️ Problem  
Optimal fine-tuning requires **careful hyperparameter tuning**.

### ✔️ Solution  
Hyperparameters are tunable **training settings** — not part of the model architecture.

### ✔️ How it works – Key QLoRA Hyperparameters:

- `r` (rank): 🧮 Size of A & B matrices (e.g., 5, 8, 16)  
- `lora_alpha`: 🔧 Scaling factor for updates (e.g., 16, 32, 64)  
- `lora_dropout`: 🎲 Dropout probability applied to the input of the LoRA layers during training (e.g., 0.05, 0.1)  
- `target_modules`: 🎯 Where to apply LoRA (e.g., `q_proj`, `v_proj`)  
- `bias`: ⚖️ Bias handling (e.g., `"none"`, `"lora_only"`, `"all"`)
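
A minimal sketch of how these settings might be collected, using the `r`, `lora_alpha` and `target_modules` values from this notebook's constants. The `lora_dropout` value is an assumption picked from the example range above, and `task_type="CAUSAL_LM"` is assumed for a decoder-only model. The dict is plain Python so it runs without extra packages; with `peft` installed, the same keys can be passed to `LoraConfig`.

```python
# Hyperparameters gathered in one place (values partly assumed, see the note above)
lora_kwargs = dict(
    r=32,                     # rank of the A and B matrices
    lora_alpha=64,            # scaling numerator
    lora_dropout=0.1,         # assumed value for illustration
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",              # do not train any bias terms
    task_type="CAUSAL_LM",    # assumed: causal language modelling
)

# With standard LoRA scaling, the adapter update is multiplied by lora_alpha / r
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

# With peft installed, the config object would be built like this:
# from peft import LoraConfig
# lora_config = LoraConfig(**lora_kwargs)
```

Keeping `lora_alpha` at twice `r`, as here, gives a scaling factor of 2; changing `r` without adjusting `lora_alpha` also changes how strongly the adapter update is weighted.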

### ✔️ How hyperparameters affect performance:

- ⚡ Want faster training? → Increase `learning_rate` (carefully)  
- 🛡️ Overfitting? → Add `lora_dropout`, reduce `num_train_epochs`  
- 📉 Underfitting? → Increase `r`, `lora_alpha`, or `num_train_epochs`  
- 💾 Low on memory? → Reduce `r` or `batch_size`, and increase `gradient_accumulation_steps` to keep the same effective batch size
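
The memory trade-off in the last point comes down to simple arithmetic: a smaller per-step batch needs less GPU memory, and accumulating gradients over several steps restores the same effective batch size. The helper name `effective_batch_size` is hypothetical; in `TrainingArguments` the corresponding settings are `per_device_train_batch_size` and `gradient_accumulation_steps`.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Baseline: 16 examples processed in a single step
baseline = effective_batch_size(per_device_batch_size=16, gradient_accumulation_steps=1)

# Low-memory variant: 4 examples per step, gradients accumulated over 4 steps
low_memory = effective_batch_size(per_device_batch_size=4, gradient_accumulation_steps=4)

print(baseline, low_memory)
```

Both setups update the weights on the same number of examples, but the second holds only a quarter of the activations in memory at any one time, at the cost of more forward/backward passes per update.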

---

## 🎯 Benefits

✔️ Drastically reduces memory & compute needs  
✔️ Enables scalable, low-cost fine-tuning of LLMs  
✔️ Great for task-specific adaptation (e.g., domain-specific QA, LCMS lab data)

---

## 🚀 Use PEFT to:

- 🧪 Customize LLMs for **niche domains**  
- 💻 Deploy powerful models in **resource-constrained environments**  
- 🔁 Fine-tune **multiple adapters** for different tasks or use cases  

---

## 🗂️ Summary Table

| **Concept**         | **What It Does**                                 | **Relationship**                          |
|---------------------|--------------------------------------------------|-------------------------------------------|
| 🧠 **PEFT**         | Efficient fine-tuning strategy                   | LoRA and QLoRA are PEFT methods           |
| 🧩 **LoRA**         | Adds small adapter matrices (A & B)              | A specific PEFT method                    |
| 🧮 **QLoRA**        | LoRA + 4-bit quantized base model                | A memory-efficient extension of LoRA      |
| ⚙️ **Hyperparameters** | Control training behavior                         | Used in both LoRA & QLoRA setups           |


In [None]:
# ------------------------------------ Packages ----------------------------------
!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0
!pip install -q datasets peft

In [None]:
# ------------------------------------ Imports ----------------------------------
import os
import re
import math
from tqdm import tqdm
from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, set_seed
from peft import LoraConfig, PeftModel
from datetime import datetime

In [None]:
# Constants

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
FINETUNED_MODEL = "ed-donner/pricer-2024-09-13_13.04.39"

# Hyperparameters for QLoRA Fine-Tuning

LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj"]

In [None]:
# Log in to HuggingFace
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# ------------------------------------ Load Base Model (No Quantization, initial test) ----------------------------------
# Baseline for comparison: no quantization_config is passed, so the weights are loaded unquantized
print("\n📦 Loading Unquantized Base Model...")
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
print(f"🧠 Unquantized Base Model Memory Footprint: {base_model.get_memory_footprint() / 1e9:.2f} GB")

Restart your session!

To free the GPU memory held by the previous model before loading the next one, go to Runtime >> Restart session and re-run the initial cells (installs, imports, constants, and HuggingFace login).

In [None]:
# ------------------------------------ Load 8-bit Quantized Model (first optimization) ----------------------------------
print("\n📦 Loading 8-bit Quantized Base Model...")
quant_config = BitsAndBytesConfig(load_in_8bit=True)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
print(f"🧠 8-bit Quantized Base Model Memory footprint: {base_model.get_memory_footprint() / 1e9:,.1f} GB")

In [None]:
# ------------------------------------ Load 4-bit Quantized Model (QLoRA) ----------------------------------
print("\n📦 Loading 4-bit Quantized Base Model (QLoRA)...")
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"  # NormalFloat4 format: more accurate for LLMs
)
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, quantization_config=quant_config, device_map="auto")
print(f"🧠 4-bit Quantized Base Model (QLoRA) Memory Footprint: {base_model.get_memory_footprint() / 1e9:,.2f} GB")

In [None]:
# ------------------------------------ Load Fine-Tuned LoRA Adapters ----------------------------------
print("\n📦 Loading Fine-Tuned LoRA Adapter...")
fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL)
print(f"🧠 Fine-Tuned Model Memory Footprint: {fine_tuned_model.get_memory_footprint() / 1e9:.2f} GB")

In [None]:
# ------------------------------------ LoRA Parameter Analysis ----------------------------------
print()
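# Estimate LoRA adapter parameter counts for each attention projection
# LoRA introduces 2 matrices per target module: A (r x in_features) and B (out_features x r), with r = 32
# q_proj and o_proj map 4096 -> 4096; k_proj and v_proj map 4096 -> 1024 (grouped-query attention)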
lora_q_proj = 4096 * 32 + 4096 * 32
lora_k_proj = 4096 * 32 + 1024 * 32
lora_v_proj = 4096 * 32 + 1024 * 32
lora_o_proj = 4096 * 32 + 4096 * 32

# Total parameters for one transformer block (layer)
lora_layer = lora_q_proj + lora_k_proj + lora_v_proj + lora_o_proj

# Total layers in LLaMA 8B = 32 transformer blocks
params = lora_layer * 32

# Estimate total adapter size in MB (4 bytes per FP32 parameter)
size = (params * 4) / 1_000_000
print(f"\n📊 LoRA Adapter Params: {params:,}")
print(f"💾 Approx. LoRA Adapter Size: {size:.1f} MB")
