# 07 - LoRA & Parameter-Efficient Fine-tuning

**Fine-tune large models on consumer hardware.**

## Learning Objectives

By the end of this notebook, you will:
- Understand parameter-efficient methods
- Implement LoRA fine-tuning
- Use the Hugging Face PEFT library
- Train models on limited hardware

## Table of Contents

1. [Why PEFT?](#why)
2. [LoRA Explained](#lora)
3. [Setting Up PEFT](#setup)
4. [Training with LoRA](#training)
5. [Inference](#inference)
6. [Exercises](#exercises)
7. [Checkpoint](#checkpoint)

In [None]:
# GUIDED: Setup
import os
import sys
from pathlib import Path

sys.path.append(str(Path.cwd().parent))

from dotenv import load_dotenv
load_dotenv(Path.cwd().parent / ".env")

# Check for GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    print("Using Apple MPS")
else:
    print("No GPU found; running on CPU")

---
## 1. Why PEFT? <a id='why'></a>

### Full Fine-tuning Problems:
- **Memory**: Weights, gradients, and optimizer state for every parameter must fit in GPU memory
- **Storage**: A full copy of the model is saved for each task
- **Time**: Every parameter is updated at every training step

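Approximate figures for illustration; real numbers depend on model size, precision, optimizer, and batch size:

```
Method              Trainable params   GPU memory   Storage per task
────────────────────────────────────────────────────────────────────
Full fine-tuning    100%               ~48GB        ~14GB
LoRA (r=8)          0.1%               ~8GB         ~20MB
QLoRA               0.1%               ~4GB         ~20MB
```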
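
The storage figures in the table above can be sanity-checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it assumes a model of roughly 7B parameters stored in 16-bit precision, 32 transformer layers with hidden size 4096, and rank-8 LoRA adapters on two square projection matrices per layer, saved in 32-bit precision. None of these specifics come from the table; they are typical values chosen because they land near its numbers.

```python
# Back-of-envelope storage estimate (all model dimensions are assumptions).
BYTES_FP16 = 2
BYTES_FP32 = 4

total_params = 7_000_000_000   # assumed model size
num_layers = 32                # assumed number of transformer layers
hidden_size = 4096             # assumed hidden dimension
adapted_per_layer = 2          # e.g. query and value projections
rank = 8

# Full fine-tuning stores every weight again for each task.
full_checkpoint_bytes = total_params * BYTES_FP16

# Each adapted [hidden x hidden] matrix gets B [hidden x r] and A [r x hidden].
params_per_adapter = 2 * hidden_size * rank
lora_params = num_layers * adapted_per_layer * params_per_adapter
lora_checkpoint_bytes = lora_params * BYTES_FP32

print(f"Full checkpoint: {full_checkpoint_bytes / 1e9:.1f} GB")
print(f"LoRA params:     {lora_params:,}")
print(f"LoRA checkpoint: {lora_checkpoint_bytes / 1e6:.1f} MB")
print(f"Trainable share: {100 * lora_params / total_params:.2f}%")
```

Under these assumptions the adapter is a few million parameters, which is why one base model can serve many tasks with only a small file per task.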

### PEFT Methods:
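- **LoRA**: Low-Rank Adaptation; trains small low-rank matrices alongside frozen weights
- **QLoRA**: Quantized LoRA; LoRA on top of a quantized (typically 4-bit) base model
- **Prefix Tuning**: Trainable prefix vectors prepended at each attention layer
- **Adapters**: Small bottleneck layers inserted between frozen layers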

---
## 2. LoRA Explained <a id='lora'></a>

Instead of updating all weights, LoRA adds small trainable matrices:

```
Original:  W (frozen)
LoRA:      W + BA  (where B and A are small matrices)

If W is [4096 x 4096] = 16M parameters
With rank r=8:
  B: [4096 x 8] = 32K parameters
  A: [8 x 4096] = 32K parameters
  Total: 64K (0.4% of original)
```
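
The arithmetic above generalizes to any weight shape and rank. A minimal sketch (the helper name `lora_param_count` is made up for this illustration):

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one [d_out x d_in] weight matrix."""
    # B is [d_out x rank] and A is [rank x d_in]
    return d_out * rank + rank * d_in

d = 4096
full = d * d
added = lora_param_count(d, d, rank=8)
print(f"Full matrix: {full:,}")
print(f"LoRA adds:   {added:,} ({100 * added / full:.2f}% of the matrix)")
```

Because the count grows linearly with the rank, doubling `r` doubles the adapter size while `W` itself stays untouched.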

In [None]:
# GUIDED: Implement a minimal LoRA layer to illustrate the concept
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Simple LoRA implementation for demonstration."""
    
    def __init__(self, original_layer: nn.Linear, rank: int = 8, alpha: float = 16):
        super().__init__()
        self.original = original_layer
        self.original.requires_grad_(False)  # Freeze original
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
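        # LoRA matrices. The diagram above uses column-vector notation (W + BA);
        # this code multiplies row vectors (x @ A @ B), so the shapes are transposed.
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))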
        
        self.scaling = alpha / rank
    
    def forward(self, x):
        # Original output + LoRA adjustment
        original_output = self.original(x)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return original_output + lora_output

# Example
original = nn.Linear(1024, 1024)
lora = LoRALayer(original, rank=8)

original_params = sum(p.numel() for p in original.parameters())
lora_params = sum(p.numel() for p in lora.parameters() if p.requires_grad)

print(f"Original layer parameters: {original_params:,}")
print(f"LoRA trainable parameters: {lora_params:,}")
print(f"Reduction: {100 * (1 - lora_params/original_params):.1f}%")
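
One detail of the initialization is easy to miss: because one factor starts at zero, the wrapped layer initially produces exactly the same output as the original. The sketch below checks this on a tiny layer and inspects the gradients after one backward pass. It rebuilds the low-rank path inline with hypothetical names (`base`, `A`, `B`) rather than reusing `LoRALayer`, so it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(16, 16)
base.requires_grad_(False)                       # frozen original weights

rank, alpha = 4, 8
A = nn.Parameter(torch.randn(16, rank) * 0.01)   # small random init
B = nn.Parameter(torch.zeros(rank, 16))          # zero init
scaling = alpha / rank

x = torch.randn(3, 16)
out = base(x) + (x @ A @ B) * scaling

# With B == 0 the adapter contributes nothing yet.
print("Same as base output:", torch.equal(out, base(x)))

out.sum().backward()
print("base.weight.grad is None:", base.weight.grad is None)
print("B receives gradient:", bool(B.grad.abs().sum() > 0))
print("A gradient is zero at step 0:", bool((A.grad == 0).all()))
```

The frozen weight gets no gradient at all. At the very first step only `B` receives a non-zero gradient, since the gradient of `A` is multiplied by the still-zero `B`; once `B` has moved away from zero, both factors are updated.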

---
## 3. Setting Up PEFT <a id='setup'></a>

In [None]:
# GUIDED: Load a base model with PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load a small model for demonstration
model_name = "facebook/opt-125m"  # Small model for demo

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"  # Requires the `accelerate` package
)

print(f"Loaded {model_name}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# GUIDED: Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                     # Rank
    lora_alpha=16,           # Scaling factor
    lora_dropout=0.1,        # Dropout for regularization
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Apply LoRA to model
peft_model = get_peft_model(model, lora_config)

# Check trainable parameters
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())

print(f"Trainable parameters: {trainable:,} ({100*trainable/total:.2f}%)")
peft_model.print_trainable_parameters()
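
`target_modules` picks layers by name. A useful mental model, which is an assumption about PEFT's matching rule and worth confirming in its documentation, is that a module is adapted when its dotted name equals a list entry or ends with `.` followed by that entry. The sketch below imitates that rule in plain Python on a hand-written list of names patterned after one OPT decoder layer; the helper `is_targeted` and the name list are illustrative and not part of PEFT. For a real model, `model.named_modules()` yields the actual names.

```python
def is_targeted(module_name: str, target_modules: list[str]) -> bool:
    """Illustrative suffix match; mimics how a list of targets is assumed to be applied."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )

# Hand-written names patterned after one OPT decoder layer (illustrative).
module_names = [
    "model.decoder.layers.0.self_attn.q_proj",
    "model.decoder.layers.0.self_attn.k_proj",
    "model.decoder.layers.0.self_attn.v_proj",
    "model.decoder.layers.0.self_attn.out_proj",
    "model.decoder.layers.0.fc1",
    "model.decoder.layers.0.fc2",
]

targets = ["q_proj", "v_proj"]
adapted = [name for name in module_names if is_targeted(name, targets)]
for name in adapted:
    print("LoRA applied to:", name)
```

Adding more names to the list (for example the remaining attention projections) increases the trainable parameter count accordingly.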

---
## 4. Training with LoRA <a id='training'></a>

In [None]:
# GUIDED: Prepare training data
from datasets import Dataset

# Simple training examples
training_data = [
    {"text": "Question: What is Python? Answer: Python is a programming language."},
    {"text": "Question: What is AI? Answer: AI stands for Artificial Intelligence."},
    {"text": "Question: What is ML? Answer: ML stands for Machine Learning."},
    {"text": "Question: What is NLP? Answer: NLP stands for Natural Language Processing."},
]

dataset = Dataset.from_list(training_data)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize, remove_columns=["text"])
print(f"Dataset size: {len(tokenized_dataset)}")
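
Padding every example to `max_length` means most positions hold pad tokens, which should not contribute to the loss. In causal language modeling the labels are a copy of the input IDs, and padded positions are conventionally set to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on made-up token IDs (the IDs and `PAD_ID` are hypothetical); in this notebook the data collator passed to the `Trainer` is expected to perform the equivalent step.

```python
IGNORE_INDEX = -100   # ignored by torch.nn.CrossEntropyLoss by default
PAD_ID = 1            # hypothetical pad token id

def make_causal_lm_labels(input_ids: list[int], pad_id: int = PAD_ID) -> list[int]:
    """Copy input ids to labels, masking padding so it is excluded from the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Made-up token ids, padded on the right to length 8.
input_ids = [2, 45, 310, 77, 9, PAD_ID, PAD_ID, PAD_ID]
labels = make_causal_lm_labels(input_ids)
print(labels)
```

One caveat: when a tokenizer reuses its end-of-sequence token as the pad token, this kind of masking also hides real end-of-sequence tokens from the loss.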

In [None]:
# GUIDED: Set up training
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=3e-4,
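    logging_steps=1,   # Log every step; this tiny run has only a handful of steps
    save_strategy="epoch",
    warmup_steps=1,    # Keep warmup shorter than the run, or the peak learning rate is never reached
    report_to="none",  # Disable external loggers such as wandb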
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Not masked LM, we're doing causal LM
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("Trainer configured!")
print("Uncomment the next cell to train (takes a few minutes)")
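
Before training, it helps to know how many optimizer steps the run will take, since settings such as `warmup_steps` and `logging_steps` are measured in steps. A rough calculation, assuming a single device and no gradient accumulation (both are assumptions, and the helper name is made up):

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int,
                          grad_accum: int = 1, num_devices: int = 1) -> int:
    """Approximate number of optimizer updates for a simple training run."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=4, batch_size=2, epochs=3)
print(f"Total optimizer steps: {steps}")
```

For the four examples used here that is only a handful of steps, so warmup and logging intervals should be chosen well below that number.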

In [None]:
# GUIDED: Train the model (uncomment to run)
# trainer.train()

# Save the LoRA weights
# peft_model.save_pretrained("./lora_output/final")

print("Training demonstration (uncomment to actually train)")

---
## 5. Inference <a id='inference'></a>

In [None]:
# GUIDED: Load and use LoRA model
from peft import PeftModel

# Load base model
# base_model = AutoModelForCausalLM.from_pretrained(model_name)

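# Load LoRA weights on top of the base model
# model_with_lora = PeftModel.from_pretrained(base_model, "./lora_output/final")
# model_with_lora.eval()

# Generate (the prompt ends with "Answer:" to match the training format)
# inputs = tokenizer("Question: What is Python? Answer:", return_tensors="pt").to(base_model.device)
# outputs = model_with_lora.generate(**inputs, max_new_tokens=50)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))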

print("Inference demonstration (uncomment after training)")

In [None]:
# GUIDED: Merge LoRA weights (for deployment)
# merged_model = model_with_lora.merge_and_unload()
# merged_model.save_pretrained("./merged_model")

print("""Merging LoRA weights:
- Creates a single model with LoRA weights merged
- No additional latency at inference
- Larger file size (full model)
- Good for production deployment
""")

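Merging adds no inference latency because the low-rank update can be folded into the original weight matrix once, after which the layer is an ordinary linear layer again. The sketch below demonstrates the underlying algebra on a small `nn.Linear` with randomly filled factors. It is a hand-rolled illustration of the idea, not PEFT's `merge_and_unload` implementation, and all variable names are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

in_features, out_features, rank, alpha = 32, 24, 4, 8
scaling = alpha / rank

base = nn.Linear(in_features, out_features)
A = torch.randn(in_features, rank) * 0.1    # stands in for a trained factor
B = torch.randn(rank, out_features) * 0.1   # stands in for a trained factor

x = torch.randn(5, in_features)

with torch.no_grad():
    # Unmerged: base output plus the scaled low-rank path.
    unmerged = base(x) + (x @ A @ B) * scaling

    # Merged: fold the update into a copy of the weight.
    # nn.Linear stores weight as [out, in], so the update is transposed.
    merged = nn.Linear(in_features, out_features)
    merged.weight.copy_(base.weight + scaling * (A @ B).T)
    merged.bias.copy_(base.bias)

    print("Max difference:", (merged(x) - unmerged).abs().max().item())
    print("Outputs match:", torch.allclose(merged(x), unmerged, atol=1e-5))
```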
---
## 6. Exercises <a id='exercises'></a>

### Exercise 1: Experiment with Rank

Compare different LoRA ranks (4, 8, 16, 32).

In [None]:
# TODO: Create models with different ranks and compare parameter counts

# Your code here:


### Exercise 2: Train for Classification

Adapt the code for a text classification task.

In [None]:
# TODO: Use LoRA for sentiment classification

# Your code here:


---
## 7. Checkpoint <a id='checkpoint'></a>

Before moving on, verify:

- [ ] You understand why PEFT is useful
- [ ] You know how LoRA works
- [ ] You can configure and train with PEFT
- [ ] You can load and use LoRA models

### Next Steps

In the next notebook, we'll cover **Evaluation & Testing** - building robust test suites for AI systems!