# QLoRA Fine-Tuning with Unsloth: Qwen2.5-7B on AG News

This notebook demonstrates **QLoRA (Quantized LoRA)** fine-tuning using **Unsloth's FastLanguageModel** with a 4-bit quantized base model.

## Overview

| Aspect | Details |
|--------|---------|
| **Model** | unsloth/Qwen2.5-7B-Instruct (4-bit) |
| **Method** | QLoRA (4-bit base + LoRA adapters) |
| **Framework** | Unsloth + TRL + bitsandbytes |
| **Dataset** | AG News (120K train, 7.6K test) |
| **Task** | 4-class text classification |
| **Expected Time** | ~6-8 hours |
| **Memory** | ~8-12 GB |

## Base Model Performance (Target to Beat)

| Metric | Base Model | Target |
|--------|------------|--------|
| **Accuracy** | 78.76% | >85% |
| **F1 (macro)** | 77.97% | >82% |
| **Sci/Tech F1** | 62.06% | >75% |
| **Business Precision** | 63.66% | >75% |

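Macro F1 weights every class equally, so a weak class such as Sci/Tech pulls the average down even when overall accuracy looks acceptable. The sketch below shows how the per-class and macro figures relate; the function names and the six toy labels are hypothetical and are not taken from the AG News evaluation.

```python
from collections import Counter

LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

# Toy data (hypothetical): one Sci/Tech article is misread as Business.
y_true = ["World", "Sports", "Business", "Sci/Tech", "Sci/Tech", "Business"]
y_pred = ["World", "Sports", "Business", "Business", "Sci/Tech", "Business"]

scores = per_class_f1(y_true, y_pred, LABELS)
macro = macro_f1(y_true, y_pred, LABELS)
print(scores)
print(f"macro F1: {macro:.4f}")
```

A single Sci/Tech-to-Business confusion lowers Sci/Tech recall and Business precision at the same time, which is the pattern the two weakest rows of the table suggest.
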
## QLoRA vs LoRA

| Aspect | LoRA (16-bit) | QLoRA (4-bit) |
|--------|---------------|---------------|
| Model weights | 14 GB | 3.5 GB |
| Memory usage | ~25 GB | ~10 GB |
| Speed | Faster | Slower (dequantization) |
| Quality | Baseline | ~1-2% accuracy loss |

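The weight sizes in the table follow from simple arithmetic: parameters times bits per parameter. The sketch below assumes a nominal 7 billion parameters and counts weights only; it ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so measured usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NOMINAL_PARAMS = 7e9  # assumption: the nominal "7B" figure, not an exact count

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
four_bit_gb = weight_memory_gb(NOMINAL_PARAMS, 4)
savings = 1 - four_bit_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:  {four_bit_gb:.1f} GB")
print(f"Weight savings: {savings:.0%}")
```

Under these assumptions the weights shrink by a factor of four; the gap in total memory usage is smaller because activations and optimizer state do not shrink with the base weights.
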
## Prerequisites

```bash
./start_docker.sh start finetune
# Then open http://localhost:8888
```

## 1. Environment Setup

In [None]:
import torch
import os

print("=" * 60)
print("Environment Verification - QLoRA")
print("=" * 60)

print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Compute Capability: {torch.cuda.get_device_capability(0)}")
    try:
        total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU Memory: {total_mem:.1f} GB")
    except Exception:
        print("GPU Memory: Unified memory system (DGX Spark)")
else:
    raise RuntimeError("CUDA not available!")

# Check bitsandbytes
try:
    import bitsandbytes as bnb
    print(f"\nbitsandbytes version: {bnb.__version__}")
    print("✓ 4-bit quantization available")
except ImportError:
    raise RuntimeError("bitsandbytes not installed!")

print(f"\nWorking directory: {os.getcwd()}")
print(f"Dataset available: {os.path.exists('/fine-tuning-dense/datasets/train.jsonl')}")

## 2. Configuration

In [None]:
# =============================================================================
# Model Configuration
# =============================================================================
MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"  # Unsloth optimized version
MAX_SEQ_LENGTH = 512
LOAD_IN_4BIT = True  # QLoRA uses 4-bit quantization

# =============================================================================
# LoRA Configuration
# =============================================================================
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0  # 0 uses Unsloth's optimized path; other values work but run slower

TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

# =============================================================================
# Training Configuration
# =============================================================================
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4
NUM_EPOCHS = 1
WARMUP_RATIO = 0.03
WEIGHT_DECAY = 0.01

# =============================================================================
# Output Configuration
# =============================================================================
OUTPUT_DIR = "./adapters/qwen7b-ag-news-qlora"
LOGGING_STEPS = 50
SAVE_STEPS = 500

TRAIN_DATA_PATH = "/fine-tuning-dense/datasets/train.jsonl"

print("QLoRA Configuration loaded!")
print(f"  Model: {MODEL_NAME}")
print(f"  4-bit quantization: {LOAD_IN_4BIT}")
print(f"  LoRA rank: {LORA_R}, alpha: {LORA_ALPHA}")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Output: {OUTPUT_DIR}")

## 3. Load Model with 4-bit Quantization

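`FastLanguageModel.from_pretrained` with `load_in_4bit=True` loads the frozen base weights in 4-bit precision. This is the quantized half of QLoRA; the trainable LoRA adapters are attached in the next section.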

In [None]:
from unsloth import FastLanguageModel

print("Loading model with 4-bit quantization (QLoRA)...")
print(f"  Model: {MODEL_NAME}")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=LOAD_IN_4BIT,  # QLoRA: 4-bit quantized base model
    full_finetuning=False,
)

# Check memory
mem_used = torch.cuda.memory_allocated() / 1e9

print(f"\n✓ Model loaded in 4-bit!")
print(f"  GPU memory used: {mem_used:.2f} GB")
print(f"  (vs ~14 GB for BF16: ~{100 * (1 - mem_used / 14):.0f}% savings)")

## 4. Apply LoRA with Unsloth Optimizations

In [None]:
print("Applying LoRA with Unsloth optimizations...")

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    target_modules=TARGET_MODULES,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,  # 0 uses Unsloth's optimized path
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)

print(f"\n✓ LoRA applied!")
model.print_trainable_parameters()

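The number that `print_trainable_parameters()` reports can be cross-checked by hand: each adapted linear layer gains two low-rank matrices, A (rank x in_features) and B (out_features x rank). The sketch below assumes the layer shapes from the published Qwen2.5-7B config (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128); these values are not stated in this notebook, so verify them against `model.config` before relying on the result.

```python
# Assumed Qwen2.5-7B shapes; check against model.config in the live notebook.
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128 (grouped-query attention)

# (in_features, out_features) for each targeted projection in one decoder layer
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=PROJ_SHAPES, num_layers=NUM_LAYERS):
    """Trainable LoRA parameters: rank * (in + out) per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(16)
print(f"Estimated trainable LoRA parameters at r=16: {total:,}")
```

The count scales linearly with the rank, so doubling `LORA_R` doubles the adapter size. `LORA_ALPHA` does not change the parameter count; it only scales the adapter output.
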
## 5. Load Training Dataset

In [None]:
from datasets import load_dataset

print(f"Loading dataset from: {TRAIN_DATA_PATH}")

dataset = load_dataset("json", data_files=TRAIN_DATA_PATH, split="train")

print(f"\nDataset loaded:")
print(f"  Total examples: {len(dataset):,}")
print(f"  Columns: {dataset.column_names}")

print(f"\nSample entry:")
sample = dataset[0]
for msg in sample["messages"]:
    role = msg["role"]
    content = msg["content"]
    if len(content) > 80:
        content = content[:80] + "..."
    print(f"  [{role}]: {content}")

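The loading cell assumes every JSONL record has a `messages` list whose items carry `role` and `content`. A quick structural check before training can catch malformed rows early. The sketch below is hypothetical: the allowed role set and the rule that the final message is the assistant's answer are assumptions about this dataset, not something the notebook states.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption about the dataset

def validate_record(record: dict) -> list:
    """Return a list of problems found in one chat-format record (empty if OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

# Two hypothetical JSONL lines: one well-formed, one missing the answer.
good_line = json.dumps({"messages": [
    {"role": "user", "content": "Classify: Stocks rallied on Friday."},
    {"role": "assistant", "content": "{\"category\": \"Business\"}"},
]})
bad_line = json.dumps({"messages": [
    {"role": "user", "content": "Classify: Stocks rallied on Friday."},
]})

good_problems = validate_record(json.loads(good_line))
bad_problems = validate_record(json.loads(bad_line))
print(good_problems)
print(bad_problems)
```

In the notebook, the same function could be applied to a sample of `dataset` rows rather than to hand-written lines.
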
## 6. Format Dataset

In [None]:
def formatting_prompts_func(examples):
    """Format examples using the tokenizer's chat template."""
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

print("Applying chat template to dataset...")
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    num_proc=4,
    desc="Formatting",
)

print(f"\nFormatted dataset columns: {formatted_dataset.column_names}")
print(f"\nSample (first 400 chars):")
print(formatted_dataset[0]["text"][:400])

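For orientation, Qwen2.5 chat templates produce a ChatML-style layout in which each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. The function below is a simplified approximation for reading the printed sample, not the tokenizer's real template (which also handles a default system prompt and tool calls); `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate the ChatML layout: one <|im_start|>...<|im_end|> block per turn."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        text += "<|im_start|>assistant\n"
    return text

example = [
    {"role": "system", "content": "You are a news article classifier."},
    {"role": "user", "content": "Classify: The Lakers won in overtime."},
    {"role": "assistant", "content": '{"category": "Sports"}'},
]

rendered = render_chatml(example)
print(rendered)
```

Training text uses `add_generation_prompt=False` because the assistant turn is already present in the data; the open assistant turn is only needed when asking the model to generate.
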
## 7. Configure Training

In [None]:
from trl import SFTTrainer, SFTConfig

# Upper-bound estimate: with packing=True several short examples share one
# sequence, so the actual number of optimizer steps will be lower.
total_steps = (len(formatted_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)) * NUM_EPOCHS

print(f"Training configuration:")
print(f"  Total examples: {len(formatted_dataset):,}")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Estimated total steps (upper bound before packing): {total_steps:,}")

sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    optim="adamw_8bit",
    bf16=True,
    fp16=False,
    max_length=MAX_SEQ_LENGTH,
    packing=True,
    logging_steps=LOGGING_STEPS,
    logging_first_step=True,
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    dataloader_num_workers=4,
    seed=42,
    report_to="none",
)

print("\n✓ SFTConfig created!")

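Because `packing=True` concatenates short examples into sequences of up to `MAX_SEQ_LENGTH` tokens, the number of optimizer steps depends on token counts rather than example counts. The sketch below gives a rough estimate for both cases. The 120,000 figure comes from the overview table; the average of 100 tokens per formatted example is a hypothetical placeholder, and real packing is not perfectly efficient.

```python
import math

def estimate_optimizer_steps(num_sequences, batch_size, grad_accum, epochs=1):
    """Rough optimizer-step count for a given number of training sequences."""
    effective_batch = batch_size * grad_accum
    return math.ceil(num_sequences / effective_batch) * epochs

NUM_EXAMPLES = 120_000   # AG News training set size from the overview
MAX_SEQ_LENGTH = 512
AVG_TOKENS = 100         # hypothetical; measure this on the formatted dataset

# Without packing: one example per sequence.
steps_unpacked = estimate_optimizer_steps(NUM_EXAMPLES, batch_size=4, grad_accum=4)

# With ideal packing: total tokens divided into full-length sequences.
packed_sequences = math.ceil(NUM_EXAMPLES * AVG_TOKENS / MAX_SEQ_LENGTH)
steps_packed = estimate_optimizer_steps(packed_sequences, batch_size=4, grad_accum=4)

print(f"Steps without packing: {steps_unpacked:,}")
print(f"Steps with ideal packing: {steps_packed:,}")
```

The step count matters for `SAVE_STEPS` and `LOGGING_STEPS`: if packing cuts the run to a small number of steps, a save interval of 500 produces only a few checkpoints.
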
## 8. Create Trainer and Start Training

In [None]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=formatted_dataset,
    args=sft_config,
)

print("✓ Trainer created!")
print(f"\nStarting QLoRA training...")
print("=" * 60)

In [None]:
import time

start_time = time.time()

trainer_stats = trainer.train()

elapsed_time = time.time() - start_time
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

print("\n" + "=" * 60)
print("QLoRA Training Complete!")
print("=" * 60)
print(f"\nTraining time: {int(hours)}h {int(minutes)}m {int(seconds)}s")
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Total steps: {trainer_stats.global_step}")

## 9. Save the QLoRA Adapter

In [None]:
adapter_path = f"{OUTPUT_DIR}/final"

print(f"Saving QLoRA adapter to: {adapter_path}")

model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

saved_files = os.listdir(adapter_path)
total_size = sum(os.path.getsize(os.path.join(adapter_path, f)) for f in saved_files)

print(f"\nSaved files:")
for f in sorted(saved_files):
    size = os.path.getsize(os.path.join(adapter_path, f))
    print(f"  {f}: {size / 1e6:.2f} MB")

print(f"\nTotal adapter size: {total_size / 1e6:.2f} MB")
print(f"\n✓ QLoRA adapter saved!")

## 10. Quick Evaluation

In [None]:
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = """You are a news article classifier. Categorize into: World, Sports, Business, or Sci/Tech.
Respond with JSON: {"category": "<category>"}"""

test_articles = [
    ("The Federal Reserve announced a quarter-point interest rate cut.", "Business"),
    ("Scientists at CERN discovered a new subatomic particle.", "Sci/Tech"),
    ("The Lakers defeated the Celtics 112-108 in overtime.", "Sports"),
    ("The UN Security Council voted to impose new sanctions.", "World"),
]

print("Testing QLoRA fine-tuned model:")
print("=" * 60)

correct = 0
for article, expected in test_articles:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Classify: {article}"},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)
    
    # Greedy decoding; temperature only applies when sampling, so it is omitted.
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=50,
        do_sample=False,
    )
    
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    is_correct = expected.lower() in response.lower()
    if is_correct:
        correct += 1
    
    print(f"\nArticle: {article[:50]}...")
    print(f"Expected: {expected}")
    print(f"Response: {response.strip()}")
    print(f"Status: {'✓' if is_correct else '✗'}")

print(f"\n" + "=" * 60)
print(f"Quick test accuracy: {correct}/{len(test_articles)} ({100*correct/len(test_articles):.0f}%)")

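The quick test above scores a response as correct if the expected label appears anywhere in the text, which also accepts answers that list several categories. Since the system prompt asks for JSON, a stricter check can parse the `category` field first and fall back to substring matching only when exactly one label appears. The helper below is a hypothetical sketch of that idea, not part of the notebook's evaluation.

```python
import json
import re

CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]

def parse_category(response: str):
    """Return the canonical category named in a model response, or None."""
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("category")
        except json.JSONDecodeError:
            value = None
        if isinstance(value, str):
            for category in CATEGORIES:
                if value.strip().lower() == category.lower():
                    return category
    # Fallback: accept free text only when it names exactly one category.
    hits = [c for c in CATEGORIES if c.lower() in response.lower()]
    return hits[0] if len(hits) == 1 else None

samples = [
    '{"category": "Sci/Tech"}',
    'Sure! {"category": "business"}',
    "This article is about sports.",
    "It could be World or Sports.",
]
parsed = [parse_category(s) for s in samples]
print(parsed)
```

Returning `None` for ambiguous or malformed output lets an evaluation loop count format failures separately from wrong labels.
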
## Conclusions

*To be filled after running the notebook*

### Training Results

| Metric | Value |
|--------|-------|
| Training Time | TBD |
| Final Loss | TBD |
| Total Steps | TBD |
| Adapter Size | TBD |
| GPU Memory | TBD |

### Performance Comparison

| Metric | HuggingFace QLoRA | Unsloth QLoRA |
|--------|-------------------|----------------|
| Tokens/sec | ~527 | TBD |
| Training time | ~18h | TBD |