# QLoRA Diagnostic Analysis - Part 2: QLoRA (4-bit) Implementation with Unsloth

## Objective
Implement QLoRA with 4-bit NF4 quantization using **Unsloth** and compare against the 16-bit LoRA baseline from Part 1.
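
To make "4-bit quantization" concrete, here is a minimal pure-Python sketch of blockwise absmax quantization. It is a simplified stand-in, not the NF4 scheme itself: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a normal distribution. The function names, the block size of 64, and the random test weights are illustrative assumptions, not part of Unsloth or this project's `src/` code.

```python
import random

LEVELS = 16  # 4 bits -> 2**4 representable codes


def quantize_block(values):
    """Map a block of floats to 4-bit integer codes plus one absmax scale."""
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    step = 2 * absmax / (LEVELS - 1)             # spacing between adjacent levels
    codes = [round((v + absmax) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Reconstruct approximate floats from the codes and the stored scale."""
    step = 2 * absmax / (LEVELS - 1)
    return [c * step - absmax for c in codes]


random.seed(0)
# Hypothetical block of 64 small weights (illustrative values only)
block = [random.gauss(0.0, 0.02) for _ in range(64)]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"codes in [{min(codes)}, {max(codes)}], max abs error = {max_error:.5f}")
```

Each weight is stored as one of 16 codes, and the only full-precision value kept per block is the scale. The reconstruction error is bounded by half the level spacing, which is why per-block scaling matters: one outlier only degrades its own block.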

## Key Questions
1. How much memory does 4-bit quantization save compared to 16-bit?
2. Does QLoRA preserve performance (comparable training loss)?
3. Which of the tested ranks (2, 4, 8, 16) works best for QLoRA?
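
As a rough reference point for the first question, the weight storage of the 355M-parameter `gpt2-medium` can be estimated by hand. The sketch below is back-of-envelope arithmetic only: it assumes every parameter is quantized and ignores quantization constants, activations, adapter weights, and optimizer state, so measured peak memory will differ. The helper name `weight_memory_mb` is hypothetical.

```python
def weight_memory_mb(num_params: int, bits_per_param: float) -> float:
    """Storage for the weights alone, in MiB (1 MiB = 1024**2 bytes)."""
    return num_params * bits_per_param / 8 / 1024**2


NUM_PARAMS = 355_000_000  # gpt2-medium, as used in this notebook

fp16_mb = weight_memory_mb(NUM_PARAMS, 16)
nf4_mb = weight_memory_mb(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_mb:.1f} MB")
print(f" 4-bit weights: {nf4_mb:.1f} MB")
print(f"Upper bound on weight-memory reduction: {(1 - nf4_mb / fp16_mb) * 100:.0f}%")
```

The 75% figure is an upper bound on the saving for the weights alone. For a model this small, activations and framework overhead can make up a large share of peak memory, so the measured reduction may be noticeably lower.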

---

## 1. Environment Setup

### 1.1 Install Unsloth

In [None]:
%%capture
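# Install Unsloth, an optimized LoRA/QLoRA library.
# Note: %%capture suppresses all output from this cell, including the print below.
import torch

# Check the GPU compute capability (not the CUDA version)
major_version, minor_version = torch.cuda.get_device_capability()
print(f"GPU Compute Capability: {major_version}.{minor_version}")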

# Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
!pip install -q datasets matplotlib seaborn pandas numpy scikit-learn tqdm

### 1.2 Import Libraries

In [None]:
# Import utilities
import sys
import os
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Import Unsloth
from unsloth import FastLanguageModel

# Add src to path (upload src/ folder to Colab first)
sys.path.append('../src')

# Import custom modules with clean names
from model_utils import load_gpt2_unsloth, setup_gpt2_lora, clear_memory
from training import prepare_alpaca_dataset, run_experiment_unsloth
from visualization import create_results_table

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

## 2. Configuration

In [None]:
# Experimental configuration
MODEL_NAME = "gpt2-medium"  # 355M parameters
NUM_SAMPLES = 1000  # Match baseline
MAX_STEPS = 200
BATCH_SIZE = 4
LEARNING_RATE = 2e-4

# Ranks to test (match baseline)
RANKS_TO_TEST = [2, 4, 8, 16]

# Output directory
OUTPUT_DIR = "./results_qlora"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Library: Unsloth (optimized)")
print(f"  Quantization: 4-bit NF4 (QLoRA)")
print(f"  Training samples: {NUM_SAMPLES}")
print(f"  Max steps: {MAX_STEPS}")
print(f"  Ranks to test: {RANKS_TO_TEST}")

## 3. Load Baseline Results for Comparison

In [None]:
# Load baseline LoRA results
try:
    with open('../results_baseline_lora/baseline_results.pkl', 'rb') as f:
        baseline_results = pickle.load(f)
    print(f"✓ Loaded {len(baseline_results)} baseline results")
    baseline_df = pd.DataFrame(baseline_results)
    print("\nBaseline Summary:")
    display(baseline_df[['rank', 'peak_memory_mb', 'time_per_step', 'training_loss']])
except FileNotFoundError:
    print("⚠️  Baseline results not found. Run 01_baseline_lora.ipynb first.")
    baseline_results = None
    baseline_df = None

## 4. Run QLoRA Experiments

Train QLoRA (4-bit quantized base + high-precision adapters) with different ranks using Unsloth.
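
To see how the rank controls adapter size, the number of trainable LoRA parameters can be counted by hand: a rank-`r` adapter on a `d_in x d_out` weight adds `r * (d_in + d_out)` parameters. The sketch below assumes adapters are attached only to the fused attention projection (`c_attn`, 1024 to 3072) in each of GPT-2 medium's 24 layers. That target-module choice is an assumption for illustration; the actual modules are set in `setup_gpt2_lora`, which is not shown here.

```python
HIDDEN_SIZE = 1024             # GPT-2 medium embedding width
NUM_LAYERS = 24                # GPT-2 medium transformer blocks
RANKS_TO_TEST = [2, 4, 8, 16]  # same ranks as the configuration cell


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_params_for_rank(rank: int) -> int:
    """Total adapter parameters, assuming c_attn is the only target module."""
    per_layer = lora_params(HIDDEN_SIZE, 3 * HIDDEN_SIZE, rank)
    return per_layer * NUM_LAYERS


for rank in RANKS_TO_TEST:
    total = adapter_params_for_rank(rank)
    print(f"r={rank:>2}: {total:,} trainable adapter parameters")
```

Under this assumption the adapter size grows linearly with rank and stays far below 1% of the 355M base parameters even at r=16, so differences in peak memory across ranks are expected to be small compared with the saving from quantizing the base model.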

In [None]:
# Store results
qlora_results_list = []

for rank in RANKS_TO_TEST:
    print(f"\n{'='*80}")
    print(f"Running QLoRA (4-bit) with rank r={rank} using Unsloth")
    print(f"{'='*80}\n")
    
    try:
        result, model, tokenizer = run_experiment_unsloth(
            model_name=MODEL_NAME,
            load_in_4bit=True,  # QLoRA: 4-bit quantization
            rank=rank,
            num_samples=NUM_SAMPLES,
            max_steps=MAX_STEPS,
            batch_size=BATCH_SIZE,
            learning_rate=LEARNING_RATE,
            output_dir=OUTPUT_DIR
        )
        
        qlora_results_list.append(result)
        
        # Clean up
        del model
        del tokenizer
        clear_memory()
        
    except Exception as e:
        print(f"❌ Error with rank {rank}: {e}")
        import traceback
        traceback.print_exc()
        continue

print(f"\n✓ QLoRA experiments finished: {len(qlora_results_list)}/{len(RANKS_TO_TEST)} ranks succeeded")

## 5. Results Analysis

### 5.1 Create Results Table

In [None]:
# Create QLoRA results table
qlora_df = create_results_table(
    qlora_results_list,
    save_path=f"{OUTPUT_DIR}/qlora_results.csv"
)

print("\n📊 QLORA RESULTS (Unsloth)")
print("="*80)
display(qlora_df)

### 5.2 Compare LoRA vs QLoRA

In [None]:
if baseline_results:
    # Combine results
    combined_df = pd.concat([baseline_df, qlora_df], ignore_index=True)

    # Index both tables by rank so each rank can be looked up directly
    lora_by_rank = baseline_df.set_index('rank')
    qlora_by_rank = qlora_df.set_index('rank')

    # Build one comparison row per rank that completed in both runs
    rows = []
    for rank in RANKS_TO_TEST:
        if rank not in lora_by_rank.index or rank not in qlora_by_rank.index:
            print(f"⚠️  Rank {rank} missing from LoRA or QLoRA results. Skipping.")
            continue
        lora_mem = lora_by_rank.loc[rank, 'peak_memory_mb']
        qlora_mem = qlora_by_rank.loc[rank, 'peak_memory_mb']
        rows.append({
            'rank': rank,
            'lora_memory_mb': lora_mem,
            'qlora_memory_mb': qlora_mem,
            # Percentage of the 16-bit peak memory saved by 4-bit loading
            'memory_reduction_%': (lora_mem - qlora_mem) / lora_mem * 100,
            'lora_loss': lora_by_rank.loc[rank, 'training_loss'],
            'qlora_loss': qlora_by_rank.loc[rank, 'training_loss'],
        })
    comparison = pd.DataFrame(rows)

    print("\n🔋 MEMORY COMPARISON: LoRA vs QLoRA (Unsloth)")
    print("="*80)
    display(comparison)

    if not comparison.empty:
        print(f"\n✨ Average memory reduction: {comparison['memory_reduction_%'].mean():.2f}%")
        print("Note: this compares peak GPU memory only; see Section 5.4 for training speed.")

    # Save comparison
    os.makedirs('../results/tables', exist_ok=True)
    comparison.to_csv('../results/tables/memory_comparison.csv', index=False)
else:
    print("⚠️  Baseline results not available. Skipping comparison.")
    comparison = None

### 5.3 Visualize Memory Comparison

In [None]:
if baseline_results:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x = np.arange(len(RANKS_TO_TEST))
    width = 0.35
    
    lora_data = baseline_df.sort_values('rank')
    qlora_data = qlora_df.sort_values('rank')
    
    bars1 = ax.bar(x - width/2, lora_data['peak_memory_mb'], width, 
                   label='LoRA (16-bit)', color='#3498db', alpha=0.8, edgecolor='black')
    bars2 = ax.bar(x + width/2, qlora_data['peak_memory_mb'], width,
                   label='QLoRA (4-bit)', color='#e74c3c', alpha=0.8, edgecolor='black')
    
    ax.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
    ax.set_ylabel('Peak GPU Memory (MB)', fontsize=12, fontweight='bold')
    ax.set_title('Memory Usage Comparison: LoRA vs QLoRA (Unsloth Optimized)', 
                 fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels([f'r={r}' for r in RANKS_TO_TEST])
    ax.legend(fontsize=11)
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels above every bar in both series
    for bar in (*bars1, *bars2):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.0f}', ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    os.makedirs('../results/figures', exist_ok=True)
    plt.savefig('../results/figures/memory_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()

### 5.4 Training Efficiency Comparison

In [None]:
if baseline_results:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Sort by rank so the bar order matches the x-axis tick labels
    lora_data = baseline_df.sort_values('rank')
    qlora_data = qlora_df.sort_values('rank')

    # Time per step
    x = np.arange(len(RANKS_TO_TEST))
    width = 0.35

    ax1.bar(x - width/2, lora_data['time_per_step'], width,
            label='LoRA (16-bit)', color='#3498db', alpha=0.8, edgecolor='black')
    ax1.bar(x + width/2, qlora_data['time_per_step'], width,
            label='QLoRA (4-bit)', color='#e74c3c', alpha=0.8, edgecolor='black')
    ax1.set_xlabel('Rank', fontweight='bold')
    ax1.set_ylabel('Time per Step (s)', fontweight='bold')
    ax1.set_title('Training Speed Comparison (Unsloth)')
    ax1.set_xticks(x)
    ax1.set_xticklabels([f'r={r}' for r in RANKS_TO_TEST])
    ax1.legend()
    ax1.grid(axis='y', alpha=0.3)
    
    # Training loss
    ax2.plot(baseline_df['rank'], baseline_df['training_loss'], 
             marker='o', linewidth=2.5, markersize=10, label='LoRA (16-bit)', color='#3498db')
    ax2.plot(qlora_df['rank'], qlora_df['training_loss'],
             marker='s', linewidth=2.5, markersize=10, label='QLoRA (4-bit)', color='#e74c3c')
    ax2.set_xlabel('Rank', fontweight='bold')
    ax2.set_ylabel('Training Loss', fontweight='bold')
    ax2.set_title('Training Loss Comparison')
    ax2.legend()
    ax2.grid(alpha=0.3)
    ax2.set_xticks(RANKS_TO_TEST)
    
    plt.tight_layout()
    plt.savefig('../results/figures/training_efficiency.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Ratio of mean step times: values above 1 mean QLoRA steps are faster than LoRA steps
    avg_speedup = baseline_df['time_per_step'].mean() / qlora_df['time_per_step'].mean()
    print(f"\n⚡ Average speedup of QLoRA over LoRA: {avg_speedup:.2f}x (>1 means QLoRA is faster)")

## 6. Key Findings

### Fill in after running experiments:

**Memory Reduction (with Unsloth optimization):**
- Average reduction: _____%
- Rank 8: LoRA _____ MB → QLoRA _____ MB

**Performance:**
- Training loss comparable: [YES/NO]
- Time per step: [FASTER/SLOWER/SIMILAR]
- Loss difference at r=8: ______
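
One way to make the "comparable" judgment above reproducible is to fix a tolerance before looking at the numbers. The sketch below uses a relative tolerance of 5%, which is an arbitrary assumption rather than an established threshold, and the two loss values are hypothetical placeholders, not results from this notebook.

```python
def relative_loss_gap(lora_loss: float, qlora_loss: float) -> float:
    """Signed gap relative to the LoRA loss; positive means QLoRA is worse."""
    return (qlora_loss - lora_loss) / lora_loss


def is_comparable(lora_loss: float, qlora_loss: float, rel_tol: float = 0.05) -> bool:
    """True when the QLoRA loss is within rel_tol of the LoRA loss."""
    return abs(relative_loss_gap(lora_loss, qlora_loss)) <= rel_tol


# Hypothetical placeholder values; replace with the r=8 rows of the comparison table
example_lora_loss = 2.00
example_qlora_loss = 2.06

gap = relative_loss_gap(example_lora_loss, example_qlora_loss)
verdict = "YES" if is_comparable(example_lora_loss, example_qlora_loss) else "NO"
print(f"Relative loss gap: {gap:+.1%} -> comparable: {verdict}")
```

Because each configuration is trained once for 200 steps, run-to-run noise may be of similar size to the gap, so a single comparison like this is suggestive rather than conclusive.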

**Unsloth Benefits Observed:**
- Speedup vs baseline: _____x
- Additional memory savings: _____%

**Observations:**
- [Document trends]
- [Note any unexpected behavior]
- [Compare to theoretical expectations]

---

**Next Steps:**
- Proceed to Part 3: Diagnostic analysis (hypothesis testing, failure modes)

## 7. Save Results

In [None]:
# Save QLoRA results
with open(f"{OUTPUT_DIR}/qlora_results.pkl", 'wb') as f:
    pickle.dump(qlora_results_list, f)

print(f"✓ Results saved to {OUTPUT_DIR}/qlora_results.pkl")
print(f"✓ CSV saved to {OUTPUT_DIR}/qlora_results.csv")
if comparison is not None:
    # The comparison table and plots are only produced when baseline results were loaded
    print("✓ Comparison saved to ../results/tables/memory_comparison.csv")
    print("✓ Plots saved to ../results/figures/")
print("\n🎉 QLoRA experiments complete!")
print("📝 Proceed to notebook 03_diagnostic_analysis.ipynb")