# Colab 2: LoRA Parameter-Efficient Fine-tuning with TinyLlama
This notebook demonstrates LoRA fine-tuning using the SAME model and dataset as Colab 1 for direct comparison.

## Key Differences from Colab 1:
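- LoRA trains only small adapter matrices (~1-5% of parameters); the base weights stay frozen
- Lower GPU memory usage, since frozen weights need no gradients or optimizer state
- Faster training
- Small adapter files (tens of MB rather than the full ~2 GB model)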
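
To see where the trainable-parameter share and the adapter size come from, the sketch below counts LoRA parameters by hand. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, MLP size 5632, 22 layers, 256-wide key/value projections) and are not stated in this notebook, so check `model.config` for the real values. It also assumes the adapter is saved as 32-bit floats. The helper names are illustrative only.

```python
# Assumed TinyLlama-1.1B shapes (not taken from this notebook; verify with model.config)
HIDDEN = 2048
INTERMEDIATE = 5632
NUM_LAYERS = 22
KV_DIM = 256  # assumed: 4 key/value heads x 64 dims (grouped-query attention)

# (in_features, out_features) for each projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


def adapter_size_mib(num_params, bytes_per_param=4):
    """Rough file size, assuming every adapter weight is stored as float32."""
    return num_params * bytes_per_param / 1024 / 1024


n_lora = lora_param_count(rank=16)
print(f"LoRA parameters at r=16: {n_lora:,}")
print(f"Share of a ~1.1B-parameter model: {100 * n_lora / 1.1e9:.2f}%")
print(f"Estimated adapter size: {adapter_size_mib(n_lora):.1f} MiB")
```

Under these assumptions, r=16 gives about 12.6M trainable parameters and an adapter of roughly 48 MiB. Doubling the rank doubles both numbers.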

In [None]:
# Install Unsloth (same as Colab 1)
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers trl peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

In [None]:
# Model configuration (SAME as Colab 1)
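max_seq_length = 2048
dtype = None  # None lets Unsloth pick float16 or bfloat16 automatically
load_in_4bit = True  # 4-bit quantized base weights; the LoRA adapters train on top of them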

# Load TinyLlama model (SAME as Colab 1)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/tinyllama",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# IMPORTANT: This is the KEY DIFFERENCE - we use get_peft_model for LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)

# Print trainable parameters (far fewer than in full fine-tuning)
# Note: with load_in_4bit=True the quantized weights are stored packed, so
# total_params undercounts the base model and the percentage reads high.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print("LoRA Fine-tuning Mode:")
print(f"Trainable params: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total params: {total_params:,}")
print("\nOnly LoRA adapter parameters will be updated!")

## Test Model BEFORE Training
Let's see the baseline performance:

In [None]:
# Same Alpaca prompt format
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Same test function
def test_model(model, tokenizer, instruction, input_text=""):
    FastLanguageModel.for_inference(model)  # switch Unsloth to its inference mode
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding: a temperature setting is ignored unless do_sample=True,
    # and deterministic output keeps the before/after comparison fair.
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the text after the first "### Response:" marker
    _, _, answer = response.partition("### Response:")
    return answer.strip()

print("=" * 50)
print("BEFORE LoRA TRAINING:")
print("=" * 50)
print("\nQ: What is 2+2?")
print("A:", test_model(model, tokenizer, "What is 2+2?"))
print("\nQ: Name three colors")
print("A:", test_model(model, tokenizer, "Name three colors"))
print("\nQ: Write a haiku about coding")
print("A:", test_model(model, tokenizer, "Write a haiku about coding"))

In [None]:
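# Prepare the SAME dataset as Colab 1
EOS_TOKEN = tokenizer.eos_token  # appended so the model learns where a response ends

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Without EOS the fine-tuned model tends to keep generating past the answer
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}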

# Load SAME dataset subset
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:5000]")
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Training on {len(dataset)} samples (same as Colab 1)")

In [None]:
# Training configuration for LoRA
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Can use larger batch with LoRA!
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=200,  # Same steps as Colab 1 for comparison
        learning_rate=2e-4,  # Higher LR works with LoRA
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,  # Same seed for reproducibility
        output_dir="outputs_lora_finetuning",
        save_steps=50,
        report_to="none",
    ),
)

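# test_model() switched the model to inference mode; switch back before training
FastLanguageModel.for_training(model)
model.config.use_cache = False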

In [None]:
# Start training
import time
print("Starting LoRA fine-tuning...")
print("This only updates LoRA adapter weights (~1-5% of parameters).")
print("Watch how fast it trains compared to full fine-tuning!\n")

start_time = time.time()
trainer_stats = trainer.train()
training_time = time.time() - start_time

print(f"\nTraining completed in {training_time:.1f} seconds!")
print(f"Final loss: {trainer_stats.training_loss:.4f}")

## Test Model AFTER LoRA Training

In [None]:
# Re-enable cache
model.config.use_cache = True

print("=" * 50)
print("AFTER LoRA TRAINING:")
print("=" * 50)
print("\nQ: What is 2+2?")
print("A:", test_model(model, tokenizer, "What is 2+2?"))
print("\nQ: Name three colors")
print("A:", test_model(model, tokenizer, "Name three colors"))
print("\nQ: Write a haiku about coding")
print("A:", test_model(model, tokenizer, "Write a haiku about coding"))
print("\nQ: Explain photosynthesis in simple terms")
print("A:", test_model(model, tokenizer, "Explain photosynthesis in simple terms"))
print("\nQ: What is the capital of France?")
print("A:", test_model(model, tokenizer, "What is the capital of France?"))

In [None]:
# Save ONLY the LoRA adapter (not the full model!)
model.save_pretrained("tinyllama_lora_adapter")
tokenizer.save_pretrained("tinyllama_lora_adapter")
print("LoRA adapter saved!")

# Check adapter size (MUCH smaller than full model!)
import os
adapter_size = 0
for root, dirs, files in os.walk("tinyllama_lora_adapter"):
    for f in files:
        if f.endswith(('.bin', '.safetensors')):
            adapter_size += os.path.getsize(os.path.join(root, f))

print(f"LoRA adapter size: {adapter_size / 1024 / 1024:.2f} MB")
print("Compare this with the full 16-bit model, which is ~2000 MB!")

In [None]:
# Optional: Merge LoRA weights with base model
print("\nMerging LoRA adapter with base model...")
model.save_pretrained_merged("tinyllama_lora_merged", tokenizer, save_method="merged_16bit")
print("Merged model saved!")

## Comparison: LoRA vs Full Fine-tuning

| Aspect | Full Fine-tuning (Colab 1) | LoRA (Colab 2) |
|--------|---------------------------|----------------|
| Trainable Parameters | 100% (~1.1B) | ~1-5% (~12.6M at r=16) |
| GPU Memory Usage (approx.) | High (~8GB) | Low (~2GB) |
| Training Speed | Slower | 3-5x Faster |
| Saved Checkpoint Size | Full model (~2GB) | Adapter only (~50MB) |
| Performance | Best | Very Good |
| Risk of Overfitting | Higher | Lower |

### LoRA Parameters Explained:
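- **r (rank)**: Rank of the adapter matrices; controls capacity (typically 4-128, higher = more capacity and more trainable parameters)
- **lora_alpha**: Scaling factor; the adapter update is multiplied by lora_alpha / r (typically set to r or 2*r)
- **target_modules**: Which layers receive LoRA adapters (here, all attention and MLP projections)
- **lora_dropout**: Dropout applied to the adapter input for regularization (0-0.1; this notebook uses 0)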
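
How these parameters fit together can be sketched on a toy layer in plain Python. This is an illustration of the low-rank idea, not Unsloth's or PEFT's implementation: the sizes are tiny, the helper names (`matmul`, `add_scaled`, `lora_forward`) are made up for this example, and "training" is faked by filling `B` with random values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(3407)
d_out, d_in, r, lora_alpha = 6, 8, 2, 2  # toy sizes, not TinyLlama's
scale = lora_alpha / r                   # the adapter update is scaled by lora_alpha / r

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero
x = [[random.gauss(0, 1)] for _ in range(d_in)]                        # one input column vector


def lora_forward(W, A, B, x, scale):
    """Base output plus the scaled low-rank correction B @ (A @ x)."""
    return add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)


# With B at zero the adapter contributes nothing, so training starts from the base model.
unchanged = lora_forward(W, A, B, x, scale) == matmul(W, x)

# Stand-in for training: move B away from zero.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]

# Merging folds the adapter into a single dense matrix: W + scale * (B @ A).
W_merged = add_scaled(W, matmul(B, A), scale)
y_adapter = lora_forward(W, A, B, x, scale)
y_merged = matmul(W_merged, x)
same_output = all(math.isclose(p[0], q[0], abs_tol=1e-9)
                  for p, q in zip(y_adapter, y_merged))

full_params = d_out * d_in        # weights updated by full fine-tuning
lora_params = r * (d_in + d_out)  # weights updated by LoRA
print(f"Adapter is a no-op at initialization: {unchanged}")
print(f"Merged weight reproduces adapter output: {same_output}")
print(f"Trainable weights: full={full_params}, LoRA={lora_params}")
```

The gap between `full_params` and `lora_params` is small for a 6x8 toy matrix but grows quickly with layer width, which is why the trainable share of a real model is so low.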

### When to Use LoRA:
- Limited GPU memory
- Need to train multiple task-specific adapters
- Want to preserve base model capabilities
- Quick experimentation and iteration

### When to Use Full Fine-tuning:
- Maximum performance is critical
- Completely new domain/language
- Have abundant GPU resources
- Don't need original model capabilities

### Video Recording Points:
1. Compare parameter counts (1-5% vs 100%)
2. Show training speed difference
3. Compare adapter size vs full model
4. Demonstrate similar quality outputs
5. Explain the low-rank decomposition concept
6. Show how to swap LoRA adapters
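
For point 6, one possible way to swap adapters is through the PEFT library directly rather than through Unsloth. The sketch below is a rough outline under that assumption. The helper name `load_and_swap_adapters` and the adapter names `"first"` and `"second"` are made up for illustration, and a second adapter directory would have to come from a separate training run that this notebook does not perform.

```python
def load_and_swap_adapters(base_model_name, first_adapter_dir, second_adapter_dir):
    """Attach two saved LoRA adapters to one base model and switch between them."""
    # Imported here so that defining this helper does not require the libraries.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_name)

    # Wrap the base model with the first adapter, then register the second one.
    model = PeftModel.from_pretrained(base, first_adapter_dir, adapter_name="first")
    model.load_adapter(second_adapter_dir, adapter_name="second")

    model.set_adapter("first")   # forward passes now use the first adapter
    model.set_adapter("second")  # same base weights, different adapter
    return model
```

A call might look like `load_and_swap_adapters("unsloth/tinyllama", "tinyllama_lora_adapter", "another_lora_adapter")`, where the last directory is hypothetical. Because only the small adapter weights change, the base model is loaded once and shared.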