# QLoRA Fine-Tuning: Qwen2.5-7B on AG News

This notebook demonstrates **QLoRA (Quantized LoRA)** fine-tuning using a **4-bit quantized base model** for maximum memory efficiency and faster training on memory-bandwidth-limited hardware.

## Overview

| Aspect | Details |
|--------|---------|
| **Model** | Qwen/Qwen2.5-7B-Instruct (4-bit quantized) |
| **Method** | QLoRA (4-bit base + LoRA adapters) |
| **Framework** | HuggingFace + PEFT + TRL + bitsandbytes |
| **Dataset** | AG News (120K train, 7.6K test) |
| **Task** | 4-class text classification |
| **Expected Time** | ~4-6 hours |
| **Memory** | ~8-12 GB |

## Why QLoRA is Faster on DGX Spark

DGX Spark's bottleneck is **memory bandwidth** (~273 GB/s). QLoRA addresses this by shrinking the base-model weights that must be read from memory on every step:

| Aspect | LoRA (BF16) | QLoRA (4-bit) | Benefit |
|--------|-------------|---------------|---------|
| Model weights | 14 GB | **3.5 GB** | 4x smaller |
| Memory bandwidth/iter | High | **4x lower** | Faster forward pass |
| GPU memory | ~25 GB | **~10 GB** | More headroom |
| Training time | ~12 hours | **~4-6 hours** | 2-3x faster |
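
The weight-size figures in the table follow from simple arithmetic: parameter count times bits per parameter. The sketch below reproduces them and, as a rough lower bound, estimates how long one full read of the weights takes at the ~273 GB/s quoted above. It assumes a round 7 billion parameters and ignores activations, quantization constants, and dequantization cost. `weight_memory_gb` and `min_read_time_ms` are illustrative helper names, not part of any library.

```python
# Rough arithmetic behind the weight-size rows above.
# Assumptions: a round 7e9 parameters and 273 GB/s of memory bandwidth.
N_PARAMS = 7e9
BANDWIDTH_GB_PER_S = 273.0


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_read_time_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time needed to stream the weights from memory once."""
    return size_gb / bandwidth_gb_per_s * 1e3


bf16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

for name, size_gb in (("BF16", bf16_gb), ("4-bit", int4_gb)):
    read_ms = min_read_time_ms(size_gb, BANDWIDTH_GB_PER_S)
    print(f"{name:>5}: {size_gb:.1f} GB of weights, at least {read_ms:.1f} ms per full read")
```

Real step times are higher than this bound, since each training step also moves activations, gradients, and optimizer state.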

## Trade-offs

| Aspect | Impact |
|--------|--------|
| Speed | **2-3x faster** (less memory to transfer) |
| Memory | **60% less** GPU memory |
| Quality | Slight degradation (~1-2% accuracy loss typical) |
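
The quality cost comes from rounding each weight to one of 16 levels. The toy example below illustrates the effect with block-wise absmax quantization onto uniform integer levels. This is a deliberate simplification: the NF4 format used later in this notebook places its levels non-uniformly, and the helper names and the sample weight distribution here are assumptions made for illustration only.

```python
import random


def quantize_block(values, bits=4):
    """Absmax-quantize one block to signed integer codes; returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit codes
    absmax = max(abs(v) for v in values) or 1.0    # avoid a zero scale
    scale = absmax / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


random.seed(0)
# Hypothetical block of 64 weights drawn from a narrow normal distribution.
block = [random.gauss(0.0, 0.02) for _ in range(64)]

codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(f"scale per code step: {scale:.6f}")
print(f"max round-trip error: {max_err:.6f} (bounded by scale / 2)")
```

The round-trip error per weight is at most half a quantization step, which is why the accuracy loss is usually small but not zero.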

## Prerequisites

This notebook must run inside the PEFT Docker container:
```bash
./start_docker.sh start peft
# Then open http://localhost:8889
```

## 1. Environment Setup and Verification

In [None]:
import torch
import os

print("=" * 60)
print("Environment Verification - QLoRA")
print("=" * 60)

# Check CUDA
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Compute Capability: {torch.cuda.get_device_capability(0)}")
    
    # Memory info
    try:
        total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU Memory: {total_mem:.1f} GB")
    except Exception:
        print("GPU Memory: Unified memory system (DGX Spark)")
else:
    raise RuntimeError("CUDA not available!")

# Check bitsandbytes (required for 4-bit quantization)
try:
    import bitsandbytes as bnb
    print(f"\nbitsandbytes version: {bnb.__version__}")
    print("✓ 4-bit quantization available")
except ImportError:
    raise RuntimeError("bitsandbytes not installed! Required for QLoRA.")

# Check working directory
print(f"\nWorking directory: {os.getcwd()}")
print(f"Dataset available: {os.path.exists('/fine-tuning-dense/datasets/train.jsonl')}")

## 2. Configuration

QLoRA uses the same LoRA configuration but with a **4-bit quantized base model**.
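
To get a feel for what this LoRA configuration trains, the sketch below estimates the number of adapter parameters. Each adapted linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The layer shapes and layer count are assumptions, taken to match the published Qwen2.5-7B configuration; check `model.config` after loading before relying on them.

```python
# Assumed architecture values for Qwen2.5-7B; verify against model.config.
HIDDEN = 3584
KV_DIM = 512            # assumed: 4 key/value heads x head dimension 128
INTERMEDIATE = 18944
NUM_LAYERS = 28

# (in_features, out_features) for each targeted module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def total_lora_params(rank: int, num_layers: int = NUM_LAYERS) -> int:
    """Sum adapter parameters over all targeted modules and layers."""
    per_layer = sum(lora_params(i, o, rank) for i, o in MODULE_SHAPES.values())
    return per_layer * num_layers


estimated_trainable = total_lora_params(rank=16)
lora_scale = 32 / 16    # lora_alpha / r, the factor applied to the adapter output

print(f"Estimated trainable LoRA parameters: {estimated_trainable:,}")
print(f"LoRA scaling factor (alpha / r): {lora_scale}")
```

The estimate can be compared with the figure that `model.print_trainable_parameters()` reports in section 4.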

In [None]:
# =============================================================================
# Model Configuration
# =============================================================================
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 512  # AG News articles are short (~120 tokens avg)
LOAD_IN_4BIT = True   # *** KEY DIFFERENCE: QLoRA uses 4-bit quantization ***

# =============================================================================
# LoRA Configuration (same as LoRA)
# =============================================================================
LORA_R = 16           # LoRA rank
LORA_ALPHA = 32       # LoRA scaling factor
LORA_DROPOUT = 0.05   # Small dropout for regularization (QLoRA benefits from this)

# Target modules for Qwen2.5
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj",      # MLP
]

# =============================================================================
# Training Configuration
# =============================================================================
BATCH_SIZE = 16       # Can use larger batch with 4-bit (less memory)
GRADIENT_ACCUMULATION_STEPS = 2  # Effective batch size = 32
LEARNING_RATE = 2e-4
NUM_EPOCHS = 1
WARMUP_RATIO = 0.03
WEIGHT_DECAY = 0.01

# =============================================================================
# Output Configuration
# =============================================================================
OUTPUT_DIR = "./adapters/qwen7b-ag-news-qlora"
LOGGING_STEPS = 50
SAVE_STEPS = 500

# =============================================================================
# Dataset Paths
# =============================================================================
TRAIN_DATA_PATH = "/fine-tuning-dense/datasets/train.jsonl"

print("QLoRA Configuration loaded!")
print(f"  Model: {MODEL_NAME}")
print(f"  4-bit quantization: {LOAD_IN_4BIT}")
print(f"  LoRA rank: {LORA_R}")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Output: {OUTPUT_DIR}")

## 3. Load Model with 4-bit Quantization

The key to QLoRA is loading the base model in 4-bit precision using `BitsAndBytesConfig`.

**Memory savings (weights only, approximate)**: 14 GB (BF16) → 3.5 GB (4-bit)
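
The 3.5 GB figure counts only the 4-bit codes. Block-wise quantization also stores scaling constants, and double quantization (`bnb_4bit_use_double_quant`) compresses those constants in turn. The sketch below works through the per-parameter overhead, assuming the block sizes described in the QLoRA paper (64 weights per 32-bit constant, and 256 constants per second-level block when the constants are re-quantized to 8 bits). These sizes are assumptions about library defaults and are worth verifying for your bitsandbytes version.

```python
WEIGHT_BITS = 4
BLOCK_SIZE = 64            # assumed: weights sharing one scaling constant
SECOND_BLOCK_SIZE = 256    # assumed: constants sharing one second-level constant


def bits_per_param(double_quant: bool) -> float:
    """Effective storage per weight, including quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return WEIGHT_BITS + 32 / BLOCK_SIZE
    # 8-bit first-level constants plus 32-bit second-level constants.
    return WEIGHT_BITS + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)


plain_bits = bits_per_param(double_quant=False)
double_bits = bits_per_param(double_quant=True)

N_PARAMS = 7e9  # assumption: a round 7 billion quantized parameters
for label, bits in (("single quantization", plain_bits), ("double quantization", double_bits)):
    print(f"{label}: {bits:.3f} bits/param, about {N_PARAMS * bits / 8 / 1e9:.2f} GB")
```

Layers that are left unquantized, such as embeddings and normalization weights, add to these totals, so the measured footprint after loading will be somewhat larger.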

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("Loading model with 4-bit quantization (QLoRA)...")

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 - best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_use_double_quant=True,      # Double quantization for more savings
)

print("  Quantization config:")
print(f"    - 4-bit type: {bnb_config.bnb_4bit_quant_type}")
print(f"    - Compute dtype: {bnb_config.bnb_4bit_compute_dtype}")
print(f"    - Double quantization: {bnb_config.bnb_4bit_use_double_quant}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # Requires the flash-attn package; use "sdpa" if it is not installed
)

# Check memory usage
if torch.cuda.is_available():
    mem_used = torch.cuda.memory_allocated() / 1e9
    print(f"\n✓ Model loaded in 4-bit!")
    print(f"  GPU memory used: {mem_used:.2f} GB")
    print(f"  (vs ~14 GB for BF16 - {100*(1 - mem_used/14):.0f}% savings)")

## 4. Apply LoRA Adapters

With QLoRA, we need to prepare the 4-bit model for training before applying LoRA.
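
As a reminder of what the adapters compute, here is a minimal pure-Python sketch of a LoRA forward pass for one linear layer: the frozen weight `W` produces the base output, and the trainable pair `A`, `B` adds a low-rank correction scaled by `alpha / r`. The matrices are toy values chosen for illustration, and the example uses `alpha=2, r=1` so that the scale matches the 32/16 ratio configured above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]   # frozen base weight (2 x 3)
A = [[0.1, 0.2, 0.3]]                      # rank x in_features
B_init = [[0.0], [0.0]]                    # out_features x rank, zero at initialization
x = [1.0, 2.0, 3.0]

h_base = matvec(W, x)
h_start = lora_forward(W, A, B_init, x, alpha=2, r=1)

B_trained = [[1.0], [0.0]]                 # hypothetical values after some training
h_later = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print("base output:       ", h_base)
print("LoRA at init:      ", h_start)
print("LoRA after update: ", h_later)
```

Because `B` starts at zero, the adapted model reproduces the base model exactly at the beginning of training and departs from it only as `A` and `B` are updated.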

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

print("Preparing 4-bit model for training...")

# Prepare the quantized model for training
# This enables gradient checkpointing and casts certain layers to float32
model = prepare_model_for_kbit_training(model)

print("Applying LoRA adapters...")

# Configure LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

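# Count parameters
# Note: bitsandbytes packs two 4-bit weights into each stored byte, so numel()
# undercounts the quantized base weights and the percentage printed below is inflated.
# model.print_trainable_parameters() (called at the end of this cell) corrects for this.
def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total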

trainable, total = count_parameters(model)
print(f"\nQLoRA Configuration:")
print(f"  Base model: 4-bit quantized")
print(f"  LoRA rank: {LORA_R}")
print(f"  LoRA alpha: {LORA_ALPHA}")
print(f"\nParameter Count:")
print(f"  Trainable: {trainable:,} ({100*trainable/total:.2f}%)")
print(f"  Total: {total:,}")

model.print_trainable_parameters()

## 5. Load and Format Dataset

In [None]:
from datasets import load_dataset

print(f"Loading dataset from: {TRAIN_DATA_PATH}")

# Load the JSONL dataset
dataset = load_dataset("json", data_files=TRAIN_DATA_PATH, split="train")

print(f"  Total examples: {len(dataset):,}")

# Format with chat template
def formatting_prompts_func(examples):
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

print("Applying chat template...")
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    num_proc=4,
    desc="Formatting",
)

print(f"✓ Dataset formatted: {len(formatted_dataset):,} examples")

## 6. Configure Training

QLoRA can use **larger batch sizes** due to lower memory usage.
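
Batch size, gradient accumulation, and sequence packing together determine how many optimizer steps one epoch takes. The sketch below estimates that count with and without packing. The 120K example count comes from the overview table; the average of 120 tokens per example is an assumption (the chat template and system prompt add tokens on top of the raw article), so treat the packed figure as a rough guide only.

```python
import math


def optimizer_steps(num_sequences: int, batch_size: int, grad_accum: int) -> int:
    """Optimizer steps per epoch for a given number of training sequences."""
    return math.ceil(num_sequences / (batch_size * grad_accum))


def packed_sequences(num_examples: int, avg_tokens: int, max_length: int) -> int:
    """Approximate number of fixed-length sequences after packing."""
    return math.ceil(num_examples * avg_tokens / max_length)


NUM_EXAMPLES = 120_000   # AG News training split
AVG_TOKENS = 120         # assumption: average tokens per formatted example

steps_unpacked = optimizer_steps(NUM_EXAMPLES, batch_size=16, grad_accum=2)
n_packed = packed_sequences(NUM_EXAMPLES, AVG_TOKENS, max_length=512)
steps_packed = optimizer_steps(n_packed, batch_size=16, grad_accum=2)

print(f"Effective batch size: {16 * 2}")
print(f"Steps per epoch without packing: {steps_unpacked:,}")
print(f"Packed sequences (approx.): {n_packed:,}")
print(f"Steps per epoch with packing (approx.): {steps_packed:,}")
```

With packing enabled, each step processes far more real tokens, so settings such as `SAVE_STEPS` and `LOGGING_STEPS` cover more data per interval than the unpacked estimate suggests.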

In [None]:
from trl import SFTTrainer, SFTConfig

# Enable cuDNN benchmark
torch.backends.cudnn.benchmark = True

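# Estimate optimizer steps (before packing). With packing=True, short examples are
# concatenated into MAX_SEQ_LENGTH-token sequences, so the step count reported by
# the trainer will be lower than this estimate.
total_steps = (len(formatted_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)) * NUM_EPOCHS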
print(f"Training configuration:")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Estimated steps: {total_steps:,}")

# SFT Configuration
sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    
    # Training
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    
    # Optimizer - use paged adamw for 4-bit training
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    optim="paged_adamw_8bit",  # Better for QLoRA
    
    # Precision
    bf16=True,
    fp16=False,
    
    # Sequence handling
    max_length=MAX_SEQ_LENGTH,
    packing=True,
    
    # Logging
    logging_steps=LOGGING_STEPS,
    logging_first_step=True,
    
    # Checkpointing
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    
    # Performance
    dataloader_num_workers=4,
    
    # Misc
    seed=42,
    report_to="none",
)

print("✓ SFTConfig created")

## 7. Train

Expected time: **~4-6 hours** (2-3x faster than LoRA due to 4-bit quantization).

In [None]:
# Create trainer
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=formatted_dataset,
    args=sft_config,
)

print("Trainer created!")
print(f"\nStarting QLoRA training...")
print("=" * 60)

In [None]:
import time

start_time = time.time()

# Train!
trainer_stats = trainer.train()

elapsed_time = time.time() - start_time
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

print("\n" + "=" * 60)
print("QLoRA Training Complete!")
print("=" * 60)
print(f"\nTraining time: {int(hours)}h {int(minutes)}m {int(seconds)}s")
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Total steps: {trainer_stats.global_step}")

## 8. Save the QLoRA Adapter

The saved adapter is the same size as a regular LoRA adapter (~200 MB), because only the LoRA weights are saved, not the quantized base model.
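
A quick estimate shows why the adapter is so small compared with the base model. The parameter count used here is an assumption (rank-16 adapters on all seven target modules of a 28-layer model with the Qwen2.5-7B layer shapes); substitute the number that `model.print_trainable_parameters()` reported in section 4. Tokenizer files saved alongside the adapter add a few more megabytes.

```python
# Assumed trainable parameter count; replace with the value reported in section 4.
TRAINABLE_PARAMS = 40_370_176


def adapter_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the adapter weights in MB (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


fp32_mb = adapter_size_mb(TRAINABLE_PARAMS, 4)   # adapter stored in float32
bf16_mb = adapter_size_mb(TRAINABLE_PARAMS, 2)   # adapter stored in bfloat16

print(f"Adapter weights in FP32: ~{fp32_mb:.0f} MB")
print(f"Adapter weights in BF16: ~{bf16_mb:.0f} MB")
```

The cell below prints the actual file sizes, which is the number to trust.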

In [None]:
# Save the adapter
adapter_path = f"{OUTPUT_DIR}/final"

print(f"Saving QLoRA adapter to: {adapter_path}")

model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

# Check saved files
saved_files = os.listdir(adapter_path)
total_size = sum(os.path.getsize(os.path.join(adapter_path, f)) for f in saved_files)

print(f"\nSaved files:")
for f in sorted(saved_files):
    size = os.path.getsize(os.path.join(adapter_path, f))
    print(f"  {f}: {size / 1e6:.2f} MB")

print(f"\nTotal adapter size: {total_size / 1e6:.2f} MB")
print(f"✓ QLoRA adapter saved!")

## 9. Quick Evaluation

Test the QLoRA fine-tuned model on sample articles.
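
The quick test below scores a response as correct if the expected label appears anywhere in the generated text. Since the system prompt asks for a JSON object with a `category` field, a stricter check is possible. The sketch below shows one way to do it; `parse_category` and `VALID_CATEGORIES` are hypothetical names introduced for illustration and are not used elsewhere in this notebook.

```python
import json

VALID_CATEGORIES = ("World", "Sports", "Business", "Sci/Tech")


def parse_category(response: str):
    """Return the canonical category from a JSON response, or None if invalid."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object found in the text
    try:
        payload = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    category = payload.get("category")
    if not isinstance(category, str):
        return None
    # Case-insensitive match against the four known labels.
    for label in VALID_CATEGORIES:
        if label.lower() == category.strip().lower():
            return label
    return None


samples = [
    '{"category": "Sports"}',
    'Sure! {"category": "sci/tech"}',
    "Business",
    '{"category": "Weather"}',
]
for text in samples:
    print(f"{text!r} -> {parse_category(text)}")
```

Strict parsing also catches formatting failures, such as free-text answers or unknown labels, that substring matching would silently accept or miss.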

In [None]:
# Set to eval mode
model.eval()

SYSTEM_PROMPT = """You are a news article classifier. Categorize articles into one of four categories:
- World: Politics, government, international affairs
- Sports: Athletic events, games, teams
- Business: Companies, markets, finance
- Sci/Tech: Technology, scientific research

Respond with a JSON object containing only the category field."""

test_articles = [
    ("The Federal Reserve announced a quarter-point interest rate cut.", "Business"),
    ("Scientists at CERN discovered a new subatomic particle.", "Sci/Tech"),
    ("The Lakers defeated the Celtics 112-108 in overtime.", "Sports"),
    ("The UN Security Council voted to impose new sanctions.", "World"),
]

print("Testing QLoRA fine-tuned model:")
print("=" * 60)

correct = 0
for article, expected in test_articles:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Classify: {article}"},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=False,
    ).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=50,
            do_sample=False,  # Greedy decoding; temperature is unused in this mode
            pad_token_id=tokenizer.pad_token_id,
        )
    
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    is_correct = expected.lower() in response.lower()
    if is_correct:
        correct += 1
    
    print(f"\n{expected}: {'✓' if is_correct else '✗'}")
    print(f"  Response: {response.strip()[:80]}")

print(f"\n{'=' * 60}")
print(f"Quick test accuracy: {correct}/{len(test_articles)} ({100*correct/len(test_articles):.0f}%)")

## 10. Conclusions

*To be filled after running*

### Training Results

| Metric | QLoRA | LoRA | Full Fine-Tuning |
|--------|-------|------|------------------|
| Training Time | TBD | ~12h | ~10h |
| Final Loss | TBD | TBD | ~0.45 |
| GPU Memory | ~10 GB | ~25 GB | ~70 GB |
| Saved Weights | ~200 MB (adapter) | ~200 MB (adapter) | ~14 GB (full model) |

### Performance Comparison

| Method | Expected Accuracy | Training Time |
|--------|------------------|---------------|
| Full Fine-Tuning | 88.33% | ~10 hours |
| LoRA | ~85-88% | ~12 hours |
| **QLoRA** | ~83-86% | **~4-6 hours** |

### Key Observations

1. **Training speed**: TBD
2. **Memory efficiency**: TBD  
3. **Quality vs LoRA**: TBD