# 04 - Model Fine-Tuning with QLoRA

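This notebook fine-tunes Llama 3.2 1B on CPU using a QLoRA-style setup: LoRA (Low-Rank Adaptation) adapters trained on top of a quantized, frozen base model. The base model is loaded in 8-bit here, whereas the original QLoRA recipe uses 4-bit quantization.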

## What we'll do:
1. Load the prepared dataset
2. Load the base model with 8-bit quantization
3. Configure LoRA for parameter-efficient fine-tuning
4. Set up training configuration (optimized for CPU)
5. Train the model and monitor training progress
6. Evaluate, save, and test the fine-tuned model

## Expected Training Time:
- **~12-24 hours** on Intel MacBook Pro 2018 (CPU-only)
- You can pause and resume training if needed

## 1. Setup and Imports

In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    TrainerCallback,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    PeftModel,
)
from trl import SFTTrainer
from datasets import load_dataset
import json
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

## 2. Configuration

Key parameters optimized for CPU training.

In [None]:
# Model configuration
MODEL_ID = "meta-llama/Llama-3.2-1B"
OUTPUT_DIR = "../models/llama-3.2-1b-brd-extraction"
FINAL_MODEL_DIR = "../models/final/llama-3.2-1b-brd-final"

# LoRA configuration (optimized for CPU)
LORA_CONFIG = {
    "r": 8,  # Rank - lower for faster training
    "lora_alpha": 16,  # Scaling factor
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
}

# Training configuration (CPU-optimized)
TRAINING_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,  # Kept at 1 to limit memory use on CPU
    "gradient_accumulation_steps": 32,  # Effective batch size = 1 x 32 = 32
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
    "warmup_steps": 100,
    "logging_steps": 10,
    "save_steps": 100,
    "save_total_limit": 3,
}

print("Configuration:")
print(f"  Model: {MODEL_ID}")
print(f"  LoRA Rank: {LORA_CONFIG['r']}")
print(f"  Epochs: {TRAINING_CONFIG['num_train_epochs']}")
print(f"  Effective Batch Size: {TRAINING_CONFIG['per_device_train_batch_size'] * TRAINING_CONFIG['gradient_accumulation_steps']}")
print(f"  Max Sequence Length: {TRAINING_CONFIG['max_seq_length']}")
print("\n✓ Configuration set")
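The arithmetic behind these settings can be sketched as follows. The dataset size of 1,000 samples is a made-up placeholder, and the simple ceiling used here is an approximation: the step count the trainer reports may differ slightly depending on the library version.

```python
import math


def estimate_optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps, epochs):
    """Rough estimate of optimizer updates for a single-device run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = max(1, math.ceil(num_samples / effective_batch))
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = estimate_optimizer_steps(
    num_samples=1000,          # hypothetical dataset size
    per_device_batch_size=1,
    grad_accum_steps=32,
    epochs=3,
)
print(f"Effective batch size: {effective_batch}")
print(f"Estimated optimizer steps: {total_steps}")
```

If the estimate comes out below `warmup_steps` or `save_steps` (both 100 here), the run would finish while still warming up or without writing an intermediate checkpoint, so those values may need lowering for small datasets.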

## 3. Load Dataset

In [None]:
# Load prepared datasets
data_files = {
    "train": "../data/processed/train.json",
    "validation": "../data/processed/validation.json",
}

dataset = load_dataset("json", data_files=data_files)

print("Dataset loaded:")
print(f"  Training samples: {len(dataset['train'])}")
print(f"  Validation samples: {len(dataset['validation'])}")
print("\n✓ Dataset ready")
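The trainer in section 8 reads a single `text` field from each record. The sketch below shows one way a raw record might be rendered into that field, reusing the prompt template from the quick test in section 12. The input keys `brd` and `fields` and the helper `format_record` are hypothetical; the actual preprocessing that produced `train.json` is not shown in this notebook.

```python
import json

PROMPT_TEMPLATE = """### Instruction:
Extract the project estimation fields from the following Business Requirements Document.
Return a JSON object with these exact fields: effort_hours (number), timeline_weeks (number), cost_usd (number).
Return ONLY the JSON object, no additional text.

### Input:
{brd}

### Output:
{output}"""


def format_record(record):
    """Render one raw record into the single "text" field used for training."""
    target = json.dumps(record["fields"])
    return {"text": PROMPT_TEMPLATE.format(brd=record["brd"].strip(), output=target)}


# Hypothetical raw record, for illustration only
example = {
    "brd": "Build a customer portal. 2 developers for 10 weeks, 400 hours, budget $50,000.",
    "fields": {"effort_hours": 400, "timeline_weeks": 10, "cost_usd": 50000},
}
formatted = format_record(example)
print(formatted["text"])
```

With the `datasets` library, a function of this shape could be applied to every record through `dataset.map(format_record)`.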

## 4. Load Model and Tokenizer with 8-bit Quantization

In [None]:
print("Loading model with 8-bit quantization...")
print("This may take a few minutes...\n")

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✓ Model loaded")
print(f"  Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"  Total parameters: {model.num_parameters() / 1e6:.0f}M")
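As a rough cross-check of the reported footprint, weight memory scales with the parameter count times the bytes per parameter. The sketch below assumes about 1.24 billion parameters, which is an approximation; the value printed by `get_memory_footprint()` will differ because some layers (for example embeddings and normalization layers) are typically kept in higher precision.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in GB."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 1.24e9  # approximate parameter count, an assumption

for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name:>10}: {weight_memory_gb(NUM_PARAMS, bytes_per_param):.2f} GB")
```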

## 5. Prepare Model for Training with LoRA

In [None]:
print("Preparing model for training...\n")

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=LORA_CONFIG["r"],
    lora_alpha=LORA_CONFIG["lora_alpha"],
    target_modules=LORA_CONFIG["target_modules"],
    lora_dropout=LORA_CONFIG["lora_dropout"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
print("Trainable Parameters:")
print("=" * 60)
model.print_trainable_parameters()
print("=" * 60)
print("\n✓ Model prepared with LoRA")
print("  Only ~0.5-2% of parameters will be trained!")
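The trainable-parameter count can be estimated by hand: for a frozen weight of shape `out x in`, LoRA adds two matrices with `r * (in + out)` parameters in total. The sketch below applies this to the seven target modules. The layer dimensions are assumptions about Llama 3.2 1B and should be checked against `model.config`; `model.print_trainable_parameters()` remains the authoritative figure.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * (in_features + out_features)


# Assumed Llama 3.2 1B dimensions; verify against model.config before relying on them.
HIDDEN = 2048
INTERMEDIATE = 8192
KV_DIM = 512      # assumed 8 key/value heads x head_dim 64
NUM_LAYERS = 16
R = 8             # matches LORA_CONFIG["r"]

# (in_features, out_features) for each targeted projection
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total: {total:,}")
print(f"Adapter size in fp32: {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the adapters hold about 5.6 million parameters, roughly half a percent of the base model, which also explains why the saved adapter files are only tens of megabytes.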

## 6. Configure Training Arguments

In [None]:
# Create output directory
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Training arguments (optimized for CPU)
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=TRAINING_CONFIG["num_train_epochs"],
    per_device_train_batch_size=TRAINING_CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=TRAINING_CONFIG["gradient_accumulation_steps"],
    learning_rate=TRAINING_CONFIG["learning_rate"],
    warmup_steps=TRAINING_CONFIG["warmup_steps"],
    logging_steps=TRAINING_CONFIG["logging_steps"],
    save_steps=TRAINING_CONFIG["save_steps"],
    save_total_limit=TRAINING_CONFIG["save_total_limit"],
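    evaluation_strategy="steps",  # Named eval_strategy in newer transformers releases
    eval_steps=100,  # load_best_model_at_end needs save_steps to be a round multiple of eval_steps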
    fp16=False,  # CPU doesn't support fp16
    bf16=False,  # Use fp32 on CPU
    gradient_checkpointing=True,  # Save memory
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=0.3,
    report_to="none",  # Change to "tensorboard" if you want logging
    logging_dir=f"{OUTPUT_DIR}/logs",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

print("Training Arguments:")
print("=" * 60)
print(f"  Output Directory: {OUTPUT_DIR}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch Size: {training_args.per_device_train_batch_size}")
print(f"  Gradient Accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective Batch Size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning Rate: {training_args.learning_rate}")
print(f"  Warmup Steps: {training_args.warmup_steps}")
print(f"  Gradient Checkpointing: {training_args.gradient_checkpointing}")
print("=" * 60)
print("\n✓ Training arguments configured")
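To see what `warmup_steps` and `lr_scheduler_type="cosine"` do together, here is a minimal standalone sketch of a linear-warmup, cosine-decay schedule. It is an illustration of the general shape, not the library's own code, and the total of 1,000 steps is a hypothetical value.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 1000  # hypothetical total number of optimizer steps

for step in (0, 50, 100, 550, 1000):
    lr = cosine_lr_with_warmup(step, base_lr=2e-4, warmup_steps=100, total_steps=TOTAL_STEPS)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

The learning rate peaks at the end of warmup and is halved at the midpoint of the decay phase, so most of the large updates happen early in training.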

## 7. Create Custom Callback for Progress Tracking

In [None]:
class ProgressCallback(TrainerCallback):
    """Custom callback to track and display training progress."""
    
    def __init__(self):
        self.training_start = None
        self.best_eval_loss = float('inf')
    
    def on_train_begin(self, args, state, control, **kwargs):
        self.training_start = datetime.now()
        print("\n" + "="*80)
        print("TRAINING STARTED")
        print(f"Start Time: {self.training_start.strftime('%Y-%m-%d %H:%M:%S')}")
        print("="*80 + "\n")
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            # Calculate progress
            total_steps = state.max_steps
            current_step = state.global_step
            progress = (current_step / total_steps) * 100
            
            # Time elapsed
            elapsed = datetime.now() - self.training_start
            
            # Display key metrics
            if 'loss' in logs:
                print(f"Step {current_step}/{total_steps} ({progress:.1f}%) | "
                      f"Loss: {logs['loss']:.4f} | "
                      f"Elapsed: {str(elapsed).split('.')[0]}")
            
            if 'eval_loss' in logs:
                eval_loss = logs['eval_loss']
                if eval_loss < self.best_eval_loss:
                    self.best_eval_loss = eval_loss
                    print(f"  ★ New best eval loss: {eval_loss:.4f}")
    
    def on_train_end(self, args, state, control, **kwargs):
        elapsed = datetime.now() - self.training_start
        print("\n" + "="*80)
        print("TRAINING COMPLETED")
        print(f"Total Time: {str(elapsed).split('.')[0]}")
        print(f"Best Eval Loss: {self.best_eval_loss:.4f}")
        print("="*80 + "\n")

print("✓ Progress callback created")

## 8. Initialize Trainer

In [None]:
print("Initializing trainer...\n")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    args=training_args,
    max_seq_length=TRAINING_CONFIG["max_seq_length"],
    dataset_text_field="text",
    packing=False,  # Safer for structured output tasks
    callbacks=[ProgressCallback()],
)

print("✓ Trainer initialized")
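# args.max_steps stays at -1 unless set explicitly, so estimate the step count from the dataset size
effective_batch = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
steps_per_epoch = max(1, len(dataset["train"]) // effective_batch)
estimated_steps = steps_per_epoch * int(training_args.num_train_epochs)
print(f"\nEstimated training steps: {estimated_steps}")
print("Estimated time on CPU: 12-24 hours\n")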

## 9. Start Training

**Important Notes:**
- Training will take 12-24 hours on CPU
- You can interrupt and resume from last checkpoint
- Monitor the loss - it should decrease over time
- Checkpoints are saved every 100 steps

**To resume from checkpoint:**
```python
trainer.train(resume_from_checkpoint=True)
```
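`resume_from_checkpoint=True` expects at least one checkpoint to exist in the output directory. A small helper along these lines could pick the newest one explicitly, assuming checkpoints follow the `checkpoint-<step>` directory naming. `find_latest_checkpoint` is a hypothetical name, not part of any library.

```python
import re
from pathlib import Path


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by numeric step so that checkpoint-1200 sorts after checkpoint-300
    return max(candidates, key=lambda item: item[0])[1]


# Prints None when the directory does not exist or holds no checkpoints yet
print(find_latest_checkpoint("../models/llama-3.2-1b-brd-extraction"))
```

The result could then be passed on as `trainer.train(resume_from_checkpoint=str(path))` when a checkpoint was found, falling back to a plain `trainer.train()` otherwise. The `transformers` library also provides `get_last_checkpoint` in `transformers.trainer_utils` for the same purpose.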

In [None]:
print("Starting training...")
print("This will take approximately 12-24 hours on CPU.")
print("You can interrupt the kernel (or press Ctrl+C in a terminal) and resume later.\n")

# Start training
train_result = trainer.train()

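# Save training metrics and trainer state
# (save_state writes trainer_state.json to OUTPUT_DIR, which the loss plot in section 13 reads)
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()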

print("\n✓ Training completed successfully!")

## 10. Evaluate on Validation Set

In [None]:
print("Evaluating on validation set...\n")

eval_metrics = trainer.evaluate()

print("Validation Metrics:")
print("=" * 60)
for key, value in eval_metrics.items():
    print(f"  {key}: {value:.4f}")
print("=" * 60)

trainer.log_metrics("eval", eval_metrics)
trainer.save_metrics("eval", eval_metrics)

print("\n✓ Evaluation complete")

## 11. Save Final Model

In [None]:
print("Saving final model...\n")

# Create final model directory
Path(FINAL_MODEL_DIR).mkdir(parents=True, exist_ok=True)

# Save model and tokenizer
trainer.save_model(FINAL_MODEL_DIR)
tokenizer.save_pretrained(FINAL_MODEL_DIR)

# Save training configuration
training_info = {
    "base_model": MODEL_ID,
    "training_date": datetime.now().isoformat(),
    "lora_config": LORA_CONFIG,
    "training_config": TRAINING_CONFIG,
    "final_metrics": eval_metrics,
    "training_samples": len(dataset["train"]),
    "validation_samples": len(dataset["validation"]),
}

with open(f"{FINAL_MODEL_DIR}/training_info.json", "w") as f:
    json.dump(training_info, f, indent=2)

print(f"✓ Model saved to: {FINAL_MODEL_DIR}")
print("✓ Training info saved")
print("\nFiles saved:")
print("  - Model weights (LoRA adapters)")
print("  - Tokenizer")
print("  - Training configuration and final metrics (training_info.json)")

## 12. Quick Test of Fine-tuned Model

In [None]:
print("Testing fine-tuned model...\n")

# Sample BRD for testing
test_brd = """Business Requirements Document
Project: Customer Portal Development

We need to build a web-based customer portal for our e-commerce platform. The portal will allow customers to view order history, track shipments, and manage their accounts. The project requires 2 full-stack developers working for 10 weeks. Total estimated effort is 400 hours with a budget of $50,000.
"""

prompt = f"""### Instruction:
Extract the project estimation fields from the following Business Requirements Document.
Return a JSON object with these exact fields: effort_hours (number), timeline_weeks (number), cost_usd (number).
Return ONLY the JSON object, no additional text.

### Input:
{test_brd}

### Output:
"""

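# Generate (switch to eval mode and disable gradient tracking for inference)
model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )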

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
output = generated_text.split("### Output:")[-1].strip()

print("Test BRD:")
print("="*80)
print(test_brd)
print("="*80)

print("\nExtracted JSON:")
print("="*80)
print(output)
print("="*80)

print("\n✓ Generation completed. Check that the output above is valid JSON.")
print("\nNote: For production use, integrate with Pydantic AI (see notebook 06)")
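Since the prompt asks for a JSON object with three numeric fields, the raw output can be checked before it is trusted. The following is a minimal sketch using only the standard library; `parse_estimate` is a hypothetical helper, and it tolerates trailing text after the first JSON object, which small models sometimes produce.

```python
import json

REQUIRED_FIELDS = ("effort_hours", "timeline_weeks", "cost_usd")


def parse_estimate(raw_output):
    """Extract the first JSON object from model output and check its fields.

    Returns a dict with the three required numeric fields, or None if the
    output cannot be parsed or a field is missing or non-numeric.
    """
    start = raw_output.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        obj, _ = json.JSONDecoder().raw_decode(raw_output[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for field in REQUIRED_FIELDS:
        value = obj.get(field)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    return {field: obj[field] for field in REQUIRED_FIELDS}


sample = '{"effort_hours": 400, "timeline_weeks": 10, "cost_usd": 50000}\n### Instruction:'
print(parse_estimate(sample))
print(parse_estimate("I could not find any estimates."))
```

In the cell above, this could be called as `parse_estimate(output)`; a `None` result signals that the generation needs a retry or manual review.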

## 13. Visualize Training Loss

In [None]:
import matplotlib.pyplot as plt
import json

# Load training logs
try:
    with open(f"{OUTPUT_DIR}/trainer_state.json", "r") as f:
        state = json.load(f)
    
    # Extract loss history
    train_loss = [(log['step'], log['loss']) for log in state['log_history'] if 'loss' in log]
    eval_loss = [(log['step'], log['eval_loss']) for log in state['log_history'] if 'eval_loss' in log]
    
    # Plot
    plt.figure(figsize=(12, 5))
    
    # Training loss
    plt.subplot(1, 2, 1)
    if train_loss:
        steps, losses = zip(*train_loss)
        plt.plot(steps, losses, 'b-', linewidth=2)
        plt.title('Training Loss', fontsize=14, fontweight='bold')
        plt.xlabel('Steps')
        plt.ylabel('Loss')
        plt.grid(True, alpha=0.3)
    
    # Evaluation loss
    plt.subplot(1, 2, 2)
    if eval_loss:
        steps, losses = zip(*eval_loss)
        plt.plot(steps, losses, 'r-', linewidth=2, marker='o')
        plt.title('Validation Loss', fontsize=14, fontweight='bold')
        plt.xlabel('Steps')
        plt.ylabel('Loss')
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(f"{FINAL_MODEL_DIR}/training_curves.png", dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✓ Training curves saved")
except FileNotFoundError:
    print("Training state not found. This is normal if training was just started.")

## Summary

### What we've done:
- ✓ Loaded Llama 3.2 1B with 8-bit quantization
- ✓ Configured QLoRA for parameter-efficient fine-tuning
- ✓ Set up CPU-optimized training configuration
- ✓ Trained the model on BRD extraction task
- ✓ Evaluated on validation set
- ✓ Saved fine-tuned model and configuration
- ✓ Tested the fine-tuned model
- ✓ Visualized training progress

### Training Techniques Used:
- **QLoRA**: Low-rank adapters trained on top of a quantized, frozen base model
- **8-bit Quantization**: Reduced memory footprint for CPU training
- **Gradient Accumulation**: Simulated larger batch sizes
- **Gradient Checkpointing**: Reduced memory usage
- **Cosine Learning Rate Schedule**: Smooth convergence
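To illustrate why gradient accumulation simulates a larger batch, the toy example below compares the gradient of a full batch of 32 samples with the sum of 32 single-sample gradients, each scaled by `1/32`. It uses a one-parameter linear model with squared error and made-up data, so it only demonstrates the principle, not the trainer's implementation.

```python
import math


def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def grad_full_batch(w, batch):
    """Gradient of the mean loss over the whole batch in one pass."""
    return sum(grad_single(w, x, y) for x, y in batch) / len(batch)


def grad_accumulated(w, batch, accum_steps):
    """Accumulate scaled single-sample gradients, as with batch size 1."""
    accumulated = 0.0
    for x, y in batch[:accum_steps]:
        accumulated += grad_single(w, x, y) / accum_steps
    return accumulated


# Made-up data following y = 3x + 1
batch = [(float(i), 3.0 * i + 1.0) for i in range(1, 33)]
w = 0.5

full = grad_full_batch(w, batch)
accumulated = grad_accumulated(w, batch, accum_steps=32)
print(f"Full-batch gradient:  {full:.6f}")
print(f"Accumulated gradient: {accumulated:.6f}")
print(f"Match: {math.isclose(full, accumulated, rel_tol=1e-9)}")
```

The two values agree up to floating-point rounding, which is why a per-device batch size of 1 with 32 accumulation steps behaves like a batch of 32 for each optimizer update.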

### Model Performance:
- Only **~0.5-2%** of parameters were trained (LoRA adapters)
- Final validation loss: *[see metrics above]*
- Model can now extract structured JSON from BRDs

### Next Steps:
Move on to `05_evaluation.ipynb` for comprehensive model evaluation.

### Notes:
- Training on CPU is slow but effective for 1B models
- LoRA adapters are small (~10-50 MB) and portable
- You can share just the adapters, not the full model
- For inference, load base model + LoRA adapters
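Loading the base model together with the saved adapters might look like the sketch below. `load_finetuned` is a hypothetical helper; it assumes the adapters and tokenizer were saved to the same directory, as in section 11, and it loads the base model without quantization for simplicity.

```python
def load_finetuned(base_model_id, adapter_dir, merge=False):
    """Load the base model, then attach the saved LoRA adapters."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed or any weights to be downloaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    if merge:
        # Fold the adapter weights into the base weights for plain inference
        model = model.merge_and_unload()
    model.eval()
    return model, tokenizer


# Example call (downloads the base model, so it is left commented out):
# model, tokenizer = load_finetuned(
#     "meta-llama/Llama-3.2-1B",
#     "../models/final/llama-3.2-1b-brd-final",
# )
```

Keeping the adapters separate (`merge=False`) preserves the option of swapping adapters on one base model, while merging produces a single standalone model.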