# QLoRA Diagnostic Analysis - Part 1: Baseline LoRA (16-bit) with Unsloth

## Objective
Establish baseline performance for standard LoRA at 16-bit precision on GPT-2 Medium (355M parameters), using the **Unsloth** library (as recommended by the TA).

## Key Questions
1. What is the memory requirement for 16-bit LoRA fine-tuning?
2. How does performance scale with different ranks (r ∈ {2, 4, 8, 16})?
3. What is the training efficiency (time per step)?

---

## 1. Environment Setup

### 1.1 Install Unsloth

In [None]:
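%%capture
# Install Unsloth - optimized LoRA/QLoRA library
# Note: %%capture hides all output of this cell, including the print below.
# Remove it temporarily to see the compute capability or pip errors.
import torch

# Check GPU compute capability (only defined when a CUDA device is present)
if torch.cuda.is_available():
    major_version, minor_version = torch.cuda.get_device_capability()
    print(f"GPU Compute Capability: {major_version}.{minor_version}")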

# Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
!pip install -q datasets matplotlib seaborn pandas numpy scikit-learn tqdm

### 1.2 Import Libraries

In [None]:
# Import utilities
import sys
import os
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Import Unsloth
from unsloth import FastLanguageModel

# Add src to path (upload src/ folder to Colab first)
sys.path.append('../src')

# Import custom modules with clean names
from model_utils import load_gpt2_unsloth, setup_gpt2_lora, clear_memory, get_model_memory_usage
from training import prepare_alpaca_dataset, run_experiment_unsloth
from visualization import create_results_table

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
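
The helpers imported from `model_utils` live in `src/`, which is not shown here. As a rough idea of what a memory-cleanup helper and a memory probe of this kind usually do, here is a minimal sketch. The function names `clear_memory_sketch` and `gpu_memory_mb_sketch` are hypothetical stand-ins, not the project's actual implementations, and the `torch` import is guarded so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def clear_memory_sketch():
    """Hypothetical stand-in for model_utils.clear_memory."""
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Return cached blocks to the driver and reset the peak counter
        # so the next experiment starts from a clean measurement.
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()


def gpu_memory_mb_sketch():
    """Hypothetical probe: currently allocated GPU memory in MB (0.0 without CUDA)."""
    if torch is None or not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 1024**2


clear_memory_sketch()
print(f"Allocated GPU memory: {gpu_memory_mb_sketch():.1f} MB")
```

Resetting the peak counter between runs matters for this notebook because each rank is compared on peak memory. Without a reset, a later run could report the peak of an earlier one.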

## 2. Configuration

In [None]:
# Experimental configuration
MODEL_NAME = "gpt2-medium"  # 355M parameters
NUM_SAMPLES = 1000  # Small dataset for quick diagnostic experiments
MAX_STEPS = 200  # Training steps per experiment
BATCH_SIZE = 4
LEARNING_RATE = 2e-4

# Ranks to test
RANKS_TO_TEST = [2, 4, 8, 16]

# Output directory
OUTPUT_DIR = "./results_baseline_lora"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Library: Unsloth (optimized)")
print(f"  Quantization: 16-bit (baseline)")
print(f"  Training samples: {NUM_SAMPLES}")
print(f"  Max steps: {MAX_STEPS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Ranks to test: {RANKS_TO_TEST}")
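
It helps to check how much of the dataset these settings actually cover. The sketch below computes the number of samples seen and the fraction of an epoch. The `grad_accum_steps` parameter is an assumption (set to 1 here), since the configuration above does not specify gradient accumulation, and `training_coverage` is a hypothetical helper name.

```python
import math


def training_coverage(num_samples, max_steps, batch_size, grad_accum_steps=1):
    """Summarize how much data a fixed-step run consumes."""
    effective_batch = batch_size * grad_accum_steps
    samples_seen = max_steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / num_samples,
        "steps_per_epoch": math.ceil(num_samples / effective_batch),
    }


# Values copied from the configuration cell; grad_accum_steps=1 is assumed.
coverage = training_coverage(num_samples=1000, max_steps=200, batch_size=4)
for key, value in coverage.items():
    print(f"{key}: {value}")
```

Under that assumption each run sees 800 of the 1,000 examples, i.e. 0.8 of an epoch, so the reported training loss reflects a short diagnostic run rather than a converged model.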

## 3. Run Baseline LoRA Experiments

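We train 16-bit LoRA adapters with Unsloth at each rank in `RANKS_TO_TEST`. Each run returns its own model and tokenizer, which are deleted afterwards to free GPU memory before the next rank.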
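
Before running the experiments, it is useful to estimate how the adapter size should scale with rank. LoRA adds two low-rank factors of shapes `d_in × r` and `r × d_out` to each adapted weight matrix, so each matrix contributes `r * (d_in + d_out)` trainable parameters. The sketch below applies this to GPT-2 Medium (24 layers, hidden size 1024). It assumes that only the fused attention projection `c_attn` (1024 → 3072) is adapted. The actual target modules are set inside `setup_gpt2_lora`, which is not shown here, so adjust `TARGET_SHAPES` to match.

```python
# GPT-2 Medium architecture constants.
HIDDEN_SIZE = 1024
NUM_LAYERS = 24
BASE_PARAMS = 355_000_000  # approximate count quoted in the Objective

# Assumption: LoRA is applied only to c_attn (hidden -> 3 * hidden) in every layer.
TARGET_SHAPES = [(HIDDEN_SIZE, 3 * HIDDEN_SIZE)]


def lora_trainable_params(rank, target_shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return per_layer * num_layers


def weights_mib(num_params, bytes_per_param=2):
    """Size of the weights alone in MiB (2 bytes per parameter at 16-bit)."""
    return num_params * bytes_per_param / 1024**2


for r in [2, 4, 8, 16]:
    n = lora_trainable_params(r)
    print(f"r={r:>2}: {n:>9,} trainable params "
          f"({100 * n / BASE_PARAMS:.3f}% of base), "
          f"adapter weights ~{weights_mib(n):.2f} MiB")

print(f"Frozen base weights at 16-bit: ~{weights_mib(BASE_PARAMS):.0f} MiB")
```

Under this assumption the adapter grows linearly with rank but stays far smaller than the frozen base weights. Peak memory may therefore change only slightly across ranks, since base weights, activations, and optimizer state for the adapter are likely to dominate.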

In [None]:
# Store results
results_list = []

for rank in RANKS_TO_TEST:
    print(f"\n{'='*80}")
    print(f"Running LoRA (16-bit) with rank r={rank} using Unsloth")
    print(f"{'='*80}\n")
    
    try:
        result, model, tokenizer = run_experiment_unsloth(
            model_name=MODEL_NAME,
            load_in_4bit=False,  # 16-bit LoRA
            rank=rank,
            num_samples=NUM_SAMPLES,
            max_steps=MAX_STEPS,
            batch_size=BATCH_SIZE,
            learning_rate=LEARNING_RATE,
            output_dir=OUTPUT_DIR
        )
        
        results_list.append(result)
        
        # Clean up to free memory
        del model
        del tokenizer
        clear_memory()
        
    except Exception as e:
        print(f"❌ Error with rank {rank}: {e}")
        import traceback
        traceback.print_exc()
        continue

print(f"\n✓ Finished: {len(results_list)}/{len(RANKS_TO_TEST)} experiments succeeded")
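
The analysis below relies on `peak_memory_mb` and `time_per_step` fields in each result, but `run_experiment_unsloth` itself lives in `src/` and is not shown. The following sketch shows one common way to collect such metrics around a training loop. The `measure_steps` helper and its `step_fn` argument are hypothetical, and the `torch` import is guarded so the snippet runs without a GPU (in which case peak memory is reported as NaN).

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # allows the sketch to run without PyTorch
    torch = None
    HAS_CUDA = False


def measure_steps(step_fn, num_steps):
    """Run step_fn num_steps times; return mean time per step and peak GPU memory."""
    if HAS_CUDA:
        torch.cuda.reset_peak_memory_stats()  # start peak tracking from zero
        torch.cuda.synchronize()              # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    if HAS_CUDA:
        torch.cuda.synchronize()              # wait for queued kernels to complete
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2 if HAS_CUDA else float("nan")
    return {"time_per_step": elapsed / num_steps, "peak_memory_mb": peak_mb}


# Dummy step for illustration; a real step would run forward, backward, and optimizer.
stats = measure_steps(lambda: sum(range(1000)), num_steps=10)
print(stats)
```

Synchronizing before reading the clock matters because CUDA kernels launch asynchronously. Without it, the measured time can reflect kernel launch overhead rather than actual compute.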

## 4. Results Analysis

### 4.1 Create Results Table

In [None]:
# Create comprehensive results table
results_df = create_results_table(
    results_list,
    save_path=f"{OUTPUT_DIR}/baseline_lora_results.csv"
)

print("\n📊 BASELINE LoRA RESULTS (Unsloth)")
print("="*80)
display(results_df)

### 4.2 Memory Usage Analysis

In [None]:
# Plot memory usage by rank
plt.figure(figsize=(10, 6))
plt.bar(results_df['rank'], results_df['peak_memory_mb'], color='#3498db', alpha=0.7, edgecolor='black')
plt.xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
plt.ylabel('Peak GPU Memory (MB)', fontsize=12, fontweight='bold')
plt.title('Baseline LoRA (16-bit) with Unsloth: Memory Usage by Rank', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.xticks(results_df['rank'])

# Add value labels on bars
for idx, row in results_df.iterrows():
    plt.text(row['rank'], row['peak_memory_mb'] + 50, 
             f"{row['peak_memory_mb']:.0f}", 
             ha='center', fontsize=10)

plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/baseline_memory_by_rank.png", dpi=300, bbox_inches='tight')
plt.show()

print(f"\n📊 Memory Statistics:")
print(f"  Average: {results_df['peak_memory_mb'].mean():.2f} MB")
print(f"  Min (r={results_df.loc[results_df['peak_memory_mb'].idxmin(), 'rank']:.0f}): {results_df['peak_memory_mb'].min():.2f} MB")
print(f"  Max (r={results_df.loc[results_df['peak_memory_mb'].idxmax(), 'rank']:.0f}): {results_df['peak_memory_mb'].max():.2f} MB")
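
To judge whether memory scales linearly with rank, a least-squares line through the four measurements gives a slope (MB per unit of rank) and an R² value. The sketch below is a plain-Python fit. `linear_fit` is a hypothetical helper, and the demo uses synthetic numbers that are exactly linear by construction, not measurements.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    xs = [float(x) for x in xs]
    ys = [float(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # R^2 is undefined when y is constant (e.g. memory identical at every rank).
    r2 = float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# Synthetic demo: y = 2x + 1 exactly, so the fit should recover slope 2, intercept 1.
demo_ranks = [2, 4, 8, 16]
demo_values = [2 * r + 1 for r in demo_ranks]
print(linear_fit(demo_ranks, demo_values))

# In the notebook, after the experiments have run:
# slope, intercept, r2 = linear_fit(results_df['rank'], results_df['peak_memory_mb'])
```

With only four points, a high R² is weak evidence. The intercept is still informative, as it approximates the rank-independent cost (base weights and activations), while the slope approximates the per-rank overhead.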

### 4.3 Training Efficiency

In [None]:
# Create dual plot: time per step and training loss
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Time per step
ax1.bar(results_df['rank'], results_df['time_per_step'], color='#2ecc71', alpha=0.7, edgecolor='black')
ax1.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Time per Step (seconds)', fontsize=12, fontweight='bold')
ax1.set_title('Training Speed by Rank', fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
ax1.set_xticks(results_df['rank'])

# Training loss
ax2.plot(results_df['rank'], results_df['training_loss'], 
         marker='o', linewidth=2.5, markersize=10, color='#e74c3c', alpha=0.8)
ax2.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Training Loss', fontsize=12, fontweight='bold')
ax2.set_title('Training Loss by Rank', fontweight='bold')
ax2.grid(alpha=0.3)
ax2.set_xticks(results_df['rank'])

plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/baseline_efficiency.png", dpi=300, bbox_inches='tight')
plt.show()

print(f"\n⚡ Training Efficiency:")
print(f"  Average time per step: {results_df['time_per_step'].mean():.3f}s")
print(f"  Average training loss: {results_df['training_loss'].mean():.4f}")
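
Time per step is easier to compare across setups when converted to throughput. The sketch below derives samples per second and the projected training time for one run. `throughput_summary` is a hypothetical helper, and the 0.5 s value is a placeholder for illustration, not a measurement.

```python
def throughput_summary(time_per_step, batch_size, max_steps):
    """Convert a per-step time into throughput and total run time."""
    return {
        "samples_per_second": batch_size / time_per_step,
        "total_train_seconds": max_steps * time_per_step,
    }


# Placeholder timing (0.5 s per step) with BATCH_SIZE=4 and MAX_STEPS=200.
print(throughput_summary(time_per_step=0.5, batch_size=4, max_steps=200))

# In the notebook, per rank:
# for _, row in results_df.iterrows():
#     print(row['rank'], throughput_summary(row['time_per_step'], BATCH_SIZE, MAX_STEPS))
```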

## 5. Key Findings

### Results Summary

Fill in after running experiments:

**Memory Usage (Unsloth-optimized):**
- Rank 2: ______ MB
- Rank 4: ______ MB
- Rank 8: ______ MB
- Rank 16: ______ MB

**Training Speed:**
- Average time per step: ______ s

**Training Loss:**
- Best rank (lowest loss): r = ______
- Loss at r=8: ______

**Observations:**
- [Document trends - does memory scale linearly with rank?]
- [Note Unsloth performance - is it faster than expected?]
- [Identify optimal rank for baseline comparison]

---

**Next Steps:**
- Proceed to Part 2: Implement QLoRA (4-bit) with Unsloth and compare results

## 6. Save Results for Next Notebook

In [None]:
# Save results for comparison in subsequent notebooks
with open(f"{OUTPUT_DIR}/baseline_results.pkl", 'wb') as f:
    pickle.dump(results_list, f)

print(f"✓ Results saved to {OUTPUT_DIR}/baseline_results.pkl")
print(f"✓ CSV saved to {OUTPUT_DIR}/baseline_lora_results.csv")
print(f"✓ Plots saved to {OUTPUT_DIR}/")
print("\n🎉 Baseline LoRA experiments complete!")
print("📝 Proceed to notebook 02_qlora_implementation.ipynb")
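
The next notebook can reload the pickled list with the mirror-image call. The sketch below shows the save and load round trip on a temporary file. The helper names are hypothetical, and the demo record only mirrors the field names used in the plots above, with `None` as placeholder values rather than results.

```python
import os
import pickle
import tempfile


def save_results(results, path):
    """Write a list of result records to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(results, f)


def load_results(path):
    """Read results back; only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder record: field names follow the columns used above, values are not results.
demo_results = [
    {"rank": 2, "peak_memory_mb": None, "time_per_step": None, "training_loss": None}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "baseline_results.pkl")
    save_results(demo_results, demo_path)
    restored = load_results(demo_path)

print(restored == demo_results)
```

In Part 2 the equivalent call would be `load_results("./results_baseline_lora/baseline_results.pkl")`, assuming both notebooks run from the same working directory. If the result records hold custom classes rather than plain dictionaries, those classes must be importable when unpickling.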