# QLoRA Diagnostic Analysis - Part 1: Baseline LoRA (16-bit) with Unsloth

## Objective
Establish baseline performance for standard 16-bit LoRA on GPT-2 Medium (355M parameters) using the **Unsloth library** (as recommended by the TA).

## Key Questions
1. What is the memory requirement for 16-bit LoRA fine-tuning?
2. How does performance scale with different ranks (r ∈ {2, 4, 8, 16})?
3. What is the training efficiency (time per step)?
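
For question 3, time per step can be measured with a small wall-clock helper. The sketch below is illustrative only: `time_per_step` and `dummy_step` are hypothetical names, not part of `src/training.py`, and the dummy step stands in for a real forward/backward/optimizer step.

```python
import time


def time_per_step(step_fn, num_steps, warmup_steps=2):
    """Return the mean wall-clock seconds per call of step_fn."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for _ in range(warmup_steps):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return elapsed / num_steps


def dummy_step():
    # Hypothetical stand-in for one training step.
    sum(i * i for i in range(10_000))


avg = time_per_step(dummy_step, num_steps=20)
print(f"Average time/step: {avg:.6f}s")
```

When timing real GPU work, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels are launched asynchronously and the Python call can return before the work finishes.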

## Why Unsloth?
- **Faster training** than standard PEFT (Unsloth advertises roughly 2x; not verified in this notebook)
- **Optimized memory usage**
- **Simpler API** - handles quantization automatically
- **Recommended by the TA** for this project

---

## 1. Environment Setup

In [None]:
# Install Unsloth (optimized for Colab)
!pip install unsloth -q
!pip install datasets matplotlib seaborn pandas numpy scikit-learn tqdm -q

In [None]:
# Import utilities
import sys
import os
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Make the project's src/ folder importable.
# Upload src/ to Colab or clone the repository from GitHub first.
sys.path.append('../src')

# Project helpers, imported as top-level modules from src/
from model_utils import load_gpt2_unsloth, setup_gpt2_lora, clear_memory
from training import prepare_alpaca_dataset, run_experiment_unsloth
from visualization import create_results_table

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 2. Configuration

In [None]:
# Experimental configuration
MODEL_NAME = "unsloth/gpt2-medium"  # Unsloth's optimized GPT-2 Medium
NUM_SAMPLES = 1000  # Small dataset for quick diagnostic experiments
MAX_STEPS = 200  # Training steps per experiment
BATCH_SIZE = 4
LEARNING_RATE = 2e-4

# Ranks to test
RANKS_TO_TEST = [2, 4, 8, 16]

# Output directory
OUTPUT_DIR = "./results_baseline_lora"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Framework: Unsloth (optimized)")
print(f"  Training samples: {NUM_SAMPLES}")
print(f"  Max steps: {MAX_STEPS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Ranks to test: {RANKS_TO_TEST}")
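
A quick check of how much data each run sees under this configuration. This is a sketch that assumes no gradient accumulation and a single device; `GRAD_ACCUM_STEPS` is a hypothetical name introduced here, and the real value depends on how `run_experiment_unsloth` configures the trainer.

```python
NUM_SAMPLES = 1000
MAX_STEPS = 200
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 1  # assumption: not set anywhere in this notebook

# Samples consumed over the whole run, and the fraction of the dataset covered
samples_seen = MAX_STEPS * BATCH_SIZE * GRAD_ACCUM_STEPS
epochs = samples_seen / NUM_SAMPLES

print(f"Samples seen per run: {samples_seen}")
print(f"Epochs per run: {epochs:.2f}")
```

Under these assumptions each run covers less than one full pass over the 1000 samples, so the results describe early-training behavior rather than converged performance.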

## 3. Run Baseline LoRA Experiments

Train LoRA (16-bit) with different ranks using Unsloth's optimized implementation.
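
To see how the adapter size changes across the ranks tested, the sketch below counts trainable LoRA parameters. It assumes adapters are attached only to the fused attention projection `c_attn` (1024 to 3072) in each of GPT-2 Medium's 24 layers. The modules that `setup_gpt2_lora` actually targets are not shown in this notebook, so treat the absolute numbers as illustrative.

```python
def lora_param_count(rank, d_in, d_out, num_layers):
    """Trainable parameters when one d_in x d_out matrix per layer gets a LoRA adapter.

    Each adapter adds A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out) * num_layers


D_MODEL = 1024     # GPT-2 Medium hidden size
NUM_LAYERS = 24    # GPT-2 Medium transformer blocks

# Assumption: only c_attn (d_model -> 3 * d_model) is adapted.
counts = {
    r: lora_param_count(r, D_MODEL, 3 * D_MODEL, NUM_LAYERS)
    for r in [2, 4, 8, 16]
}

for r, n in counts.items():
    print(f"r={r:>2}: {n:,} trainable LoRA parameters")
```

The count grows linearly with the rank, and even at r=16 it stays well under 1% of the 355M base parameters, so peak memory is expected to depend only weakly on the rank.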

In [None]:
# Store results
results_list = []

for rank in RANKS_TO_TEST:
    print(f"\n{'='*80}")
    print(f"Running LoRA (16-bit) with rank r={rank}")
    print(f"{'='*80}\n")
    
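    try:
        # run_experiment_unsloth is the helper imported in Section 1; the keyword
        # arguments below are assumed to match its signature in src/training.py.
        result, model, tokenizer = run_experiment_unsloth(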
            model_name=MODEL_NAME,
            quantization="16bit",
            rank=rank,
            num_samples=NUM_SAMPLES,
            max_steps=MAX_STEPS,
            batch_size=BATCH_SIZE,
            learning_rate=LEARNING_RATE,
            output_dir=OUTPUT_DIR
        )
        
        results_list.append(result)
        
        # Clean up to free memory
        del model
        del tokenizer
        clear_memory()
        
    except Exception as e:
        print(f"❌ Error with rank {rank}: {e}")
        import traceback
        traceback.print_exc()
        continue

print(f"\n✓ Finished: {len(results_list)}/{len(RANKS_TO_TEST)} experiments succeeded")

## 4. Results Analysis

In [None]:
# Create results table
results_df = create_results_table(
    results_list,
    save_path=f"{OUTPUT_DIR}/baseline_lora_results.csv"
)

print("\n📊 BASELINE LoRA RESULTS (Unsloth)")
print("="*80)
display(results_df)

## 5. Visualize Results

In [None]:
# Memory usage by rank
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.bar(results_df['rank'], results_df['peak_memory_mb'], color='#3498db', alpha=0.7)
ax1.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Peak GPU Memory (MB)', fontsize=12, fontweight='bold')
ax1.set_title('Baseline LoRA: Memory Usage (Unsloth)', fontsize=14, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)

ax2.bar(results_df['rank'], results_df['time_per_step'], color='#2ecc71', alpha=0.7)
ax2.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Time per Step (s)', fontsize=12, fontweight='bold')
ax2.set_title('Baseline LoRA: Training Speed (Unsloth)', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/baseline_metrics.png", dpi=300)
plt.show()

print(f"\n✓ Average memory: {results_df['peak_memory_mb'].mean():.2f} MB")
print(f"✓ Average time/step: {results_df['time_per_step'].mean():.3f}s")

## 6. Key Findings

### Memory Usage (TODO: Fill after running)
- Rank 2: [TODO] MB
- Rank 4: [TODO] MB
- Rank 8: [TODO] MB
- Rank 16: [TODO] MB

### Training Speed
- Average time/step: [TODO]s
- **Note**: Unsloth is reported to be ~2x faster than standard PEFT; this notebook has no PEFT run to confirm that figure

### Observations
- [TODO: Document trends]
- [TODO: Compare to expected values]
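
As one reference point for the expected values, the memory needed just to hold the base weights in 16-bit can be computed directly. This is a rough floor, not a prediction: it ignores activations, gradients and optimizer state for the adapters, and the CUDA context, so measured peak memory should come out above it.

```python
def weight_memory_mb(num_params, bits_per_param):
    """Memory in MB (1 MB = 1024**2 bytes) to store num_params weights."""
    return num_params * bits_per_param / 8 / 1024**2


NUM_PARAMS = 355_000_000  # GPT-2 Medium, as stated in the Objective

weight_mb = weight_memory_mb(NUM_PARAMS, bits_per_param=16)
print(f"16-bit weights alone: {weight_mb:.0f} MB")
```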

---

**Next**: Proceed to `02_qlora_implementation.ipynb` to compare with 4-bit QLoRA

## 7. Save Results

In [None]:
# Save for next notebook
import pickle

with open(f"{OUTPUT_DIR}/baseline_results.pkl", 'wb') as f:
    pickle.dump(results_list, f)

print(f"✓ Results saved to {OUTPUT_DIR}/baseline_results.pkl")
print("\n🎉 Baseline LoRA experiments complete!")
print("📝 Proceed to notebook 02_qlora_implementation.ipynb")
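
The next notebook can read the saved file back with `pickle.load`. The round trip below is a minimal sketch: `load_results` is a hypothetical helper, and the placeholder records only mimic the `rank` field used in the plots above, not real measurements.

```python
import os
import pickle
import tempfile


def load_results(path):
    """Load a list of result records written with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip demo in a temporary directory with placeholder records.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "baseline_results.pkl")
    placeholder = [{"rank": 2}, {"rank": 4}]  # hypothetical, not real results

    with open(path, "wb") as f:
        pickle.dump(placeholder, f)

    loaded = load_results(path)

print([record["rank"] for record in loaded])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code. This also assumes the objects in `results_list` are picklable; if they hold model or tensor references, saving the CSV from Section 4 may be the safer hand-off.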