# LoRA Fine-Tuning: Qwen2.5-7B on AG News

This notebook demonstrates **LoRA (Low-Rank Adaptation)** fine-tuning using **HuggingFace Transformers + PEFT + TRL**.

## Overview

| Aspect | Details |
|--------|---------|
| **Model** | Qwen/Qwen2.5-7B-Instruct (cached) |
| **Method** | LoRA (16-bit base model) |
| **Framework** | HuggingFace + PEFT + TRL |
| **Dataset** | AG News (120K train, 7.6K test) |
| **Task** | 4-class text classification |
| **Expected Time** | ~2-4 hours |
| **Memory** | ~20-25 GB |

## LoRA vs Full Fine-Tuning

| Aspect | LoRA | Full Fine-Tuning |
|--------|------|------------------|
| Parameters trained | ~40M (0.5%) | 7.6B (100%) |
| Memory usage | ~20-25 GB | ~70 GB |
| Training time | 2-4 hours | ~10 hours |
| Output size | ~200 MB | ~14 GB |

## Prerequisites

This notebook must run inside the PEFT Docker container:
```bash
./start_docker.sh start peft
# Then open http://localhost:8889
```

**Note**: Uses the already-cached `Qwen/Qwen2.5-7B-Instruct` model - no download required.

## 1. Environment Setup and Verification

First, let's verify we're running in the correct environment with GPU access.

In [1]:
import torch
import os

print("=" * 60)
print("Environment Verification")
print("=" * 60)

# Check CUDA
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Compute Capability: {torch.cuda.get_device_capability(0)}")
    
    # Memory info (may show N/A on unified memory systems)
    try:
        total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU Memory: {total_mem:.1f} GB")
    except Exception:
        print("GPU Memory: Unified memory system (DGX Spark)")
else:
    raise RuntimeError("CUDA not available! Make sure you're running in the Docker container.")

# Check working directory
print(f"\nWorking directory: {os.getcwd()}")
print(f"Dataset available: {os.path.exists('/fine-tuning-dense/datasets/train.jsonl')}")

Environment Verification

PyTorch version: 2.10.0a0+b558c986e8.nv25.11
CUDA available: True
CUDA version: 13.0
GPU: NVIDIA GB10
GPU Compute Capability: (12, 1)
GPU Memory: 128.5 GB

Working directory: /fine-tuning
Dataset available: True


## 2. Configuration

Define all hyperparameters and settings for LoRA fine-tuning.

In [2]:
# =============================================================================
# Model Configuration
# =============================================================================
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 512  # AG News articles are short (~120 tokens avg)
LOAD_IN_4BIT = False  # LoRA uses 16-bit base model (set True for QLoRA)

# =============================================================================
# LoRA Configuration
# =============================================================================
LORA_R = 16           # LoRA rank (higher = more capacity, more memory)
LORA_ALPHA = 32       # LoRA scaling factor (typically 2x rank)
LORA_DROPOUT = 0.0    # Dropout for LoRA layers (0 disables it)

# Target modules for Qwen2.5 - attention and MLP projections
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj",      # MLP (optional, more capacity)
]

# =============================================================================
# Training Configuration
# =============================================================================
BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4  # Effective batch size = 32
LEARNING_RATE = 2e-4             # Higher than full fine-tuning (only training adapters)
NUM_EPOCHS = 1
WARMUP_RATIO = 0.03
WEIGHT_DECAY = 0.01

# =============================================================================
# Output Configuration
# =============================================================================
OUTPUT_DIR = "./adapters/qwen7b-ag-news-lora"
LOGGING_STEPS = 50
SAVE_STEPS = 500

# =============================================================================
# Dataset Paths (inside Docker container)
# =============================================================================
TRAIN_DATA_PATH = "/fine-tuning-dense/datasets/train.jsonl"

print("Configuration loaded!")
print(f"  Model: {MODEL_NAME}")
print(f"  LoRA rank: {LORA_R}")
print(f"  Target modules: {TARGET_MODULES}")
print(f"  Batch size: {BATCH_SIZE} x {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Output: {OUTPUT_DIR}")

Configuration loaded!
  Model: Qwen/Qwen2.5-7B-Instruct
  LoRA rank: 16
  Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
  Batch size: 8 x 4 = 32
  Learning rate: 0.0002
  Output: ./adapters/qwen7b-ag-news-lora
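
To make the roles of `LORA_R` and `LORA_ALPHA` concrete, the following is a minimal pure-Python sketch of the LoRA update for a single linear layer, assuming the standard formulation `y = W x + (alpha / r) * B (A x)`. The helper names (`matvec`, `lora_forward`) and the toy 2x2 weights are illustrative assumptions, not part of PEFT or of this notebook.

```python
# Toy LoRA forward pass for one linear layer (illustrative; not PEFT's implementation).
# y = W x + (alpha / r) * B (A x), where A is (r x d_in) and B is (d_out x r).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, alpha):
    """Frozen base output plus the scaled low-rank correction."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical toy values: d_in = d_out = 2, rank r = 1, alpha = 2, so alpha / r = 2,
# the same ratio as LORA_ALPHA / LORA_R = 32 / 16 in the configuration.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
x = [2.0, 4.0]

# B starts at zero, so an untrained adapter leaves the base output unchanged.
B_init = [[0.0], [0.0]]
print(lora_forward(W, A, B_init, x, r=1, alpha=2))

# Once B is non-zero, the low-rank correction is scaled by alpha / r.
B_trained = [[1.0], [-1.0]]
print(lora_forward(W, A, B_trained, x, r=1, alpha=2))
```

Because only the ratio `alpha / r` enters the forward pass, raising the rank without raising alpha shrinks the relative weight of the adapter update.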


## 3. Load Model from Cache

Load the model using standard HuggingFace transformers. This uses the **already cached** model at `~/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/`.

**Note**: We're using standard HuggingFace loading instead of Unsloth's `FastLanguageModel` to avoid downloading Unsloth's separate model version.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading model from HuggingFace cache...")
print(f"  Model: {MODEL_NAME}")
print(f"  This uses the already-cached model (no download needed)")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,  # Newer transformers releases prefer `dtype=` (see the deprecation warning in the output)
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # Use Flash Attention 2
)

print(f"\nModel loaded: {MODEL_NAME}")
print(f"Model dtype: {model.dtype}")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Pad token: {tokenizer.pad_token}")

Loading model from HuggingFace cache...
  Model: Qwen/Qwen2.5-7B-Instruct
  This uses the already-cached model (no download needed)


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [01:14<00:00, 18.61s/it]



Model loaded: Qwen/Qwen2.5-7B-Instruct
Model dtype: torch.bfloat16
Tokenizer vocab size: 151665
Pad token: <|endoftext|>


## 4. Apply LoRA Adapters with PEFT

Add LoRA adapters using the PEFT library. Only these small adapter weights (~0.5% of all parameters) are trained; the base model stays frozen.

In [4]:
from peft import LoraConfig, get_peft_model

print("Applying LoRA adapters with PEFT...")

# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()

# Configure LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Count trainable parameters
def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

trainable, total = count_parameters(model)
print(f"\nLoRA Configuration:")
print(f"  Rank (r): {LORA_R}")
print(f"  Alpha: {LORA_ALPHA}")
print(f"  Target modules: {TARGET_MODULES}")
print(f"\nParameter Count:")
print(f"  Trainable: {trainable:,} ({100*trainable/total:.2f}%)")
print(f"  Total: {total:,}")
print(f"  Frozen: {total - trainable:,} ({100*(total-trainable)/total:.2f}%)")

# Print trainable modules
model.print_trainable_parameters()

Applying LoRA adapters with PEFT...

LoRA Configuration:
  Rank (r): 16
  Alpha: 32
  Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']

Parameter Count:
  Trainable: 40,370,176 (0.53%)
  Total: 7,655,986,688
  Frozen: 7,615,616,512 (99.47%)
trainable params: 40,370,176 || all params: 7,655,986,688 || trainable%: 0.5273
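
The trainable count printed above can be reproduced by hand: a LoRA adapter on a linear layer mapping `d_in` to `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the layer dimensions from the published Qwen2.5-7B configuration (hidden size 3584, MLP intermediate size 18944, 28 layers, key/value projection width 512). These dimensions are not printed by this notebook, so treat them as assumptions.

```python
# Reproduce the LoRA trainable-parameter count by hand.
# ASSUMPTION: dimensions come from the published Qwen2.5-7B config, not from this notebook.
HIDDEN = 3584          # hidden size
INTERMEDIATE = 18944   # MLP intermediate size
KV_DIM = 512           # 4 key/value heads x head dimension 128
NUM_LAYERS = 28
LORA_R = 16

# (d_in, d_out) for each targeted projection in one transformer layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Parameters in the A (r x d_in) and B (d_out x r) matrices."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, LORA_R) for d_in, d_out in MODULE_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS

print(f"Per layer: {per_layer:,}")
print(f"Total trainable: {total_trainable:,}")
```

Under these assumptions the total equals the 40,370,176 trainable parameters that PEFT reports. The three MLP projections account for about three quarters of that count, which is why restricting `TARGET_MODULES` to the attention projections shrinks the adapter considerably.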


## 5. Load Training Dataset

Load the AG News dataset prepared for fine-tuning (in chat format).

In [5]:
from datasets import load_dataset

print(f"Loading dataset from: {TRAIN_DATA_PATH}")

# Load the JSONL dataset
dataset = load_dataset("json", data_files=TRAIN_DATA_PATH, split="train")

print(f"\nDataset loaded:")
print(f"  Total examples: {len(dataset):,}")
print(f"  Columns: {dataset.column_names}")

# Show a sample
print(f"\nSample entry:")
sample = dataset[0]
for msg in sample["messages"]:
    role = msg["role"]
    content = msg["content"][:100] + "..." if len(msg["content"]) > 100 else msg["content"]
    print(f"  [{role}]: {content}")

Loading dataset from: /fine-tuning-dense/datasets/train.jsonl

Dataset loaded:
  Total examples: 120,000
  Columns: ['messages']

Sample entry:
  [system]: You are a news article classifier. Your task is to categorize news articles into exactly one of four...
  [user]: Classify the following news article:

Thirst, Fear and Bribes on Desert Escape from Africa  AGADEZ, ...
  [assistant]: {"category":"World"}
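
The preprocessing script that produced `train.jsonl` is not part of this notebook. As a rough sketch, a record of the shape shown above could be built from a raw AG News row as follows. The helper `to_chat_record`, the shortened system prompt, and the label order (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech, as in the commonly distributed AG News dataset) are assumptions for illustration.

```python
import json

# ASSUMPTION: label ids follow the commonly distributed AG News ordering.
LABELS = ["World", "Sports", "Business", "Sci/Tech"]

# Shortened placeholder; the full prompt appears in the evaluation section.
SYSTEM_PROMPT = "You are a news article classifier."

def to_chat_record(text, label_id):
    """Build one chat-format training record (hypothetical helper)."""
    # Compact separators give the same style as the sample: {"category":"World"}
    answer = json.dumps({"category": LABELS[label_id]}, separators=(",", ":"))
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify the following news article:\n\n{text}"},
            {"role": "assistant", "content": answer},
        ]
    }

record = to_chat_record("Thirst, Fear and Bribes on Desert Escape from Africa", 0)
print(json.dumps(record))  # one line of a JSONL file
```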


## 6. Format Dataset for Training

Apply the chat template to convert messages into the format expected by the model.

In [6]:
def formatting_prompts_func(examples):
    """Format examples using the tokenizer's chat template."""
    texts = []
    for messages in examples["messages"]:
        # Apply chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

# Apply formatting
print("Applying chat template to dataset...")
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    num_proc=4,
    desc="Formatting",
)

print(f"\nFormatted dataset:")
print(f"  Columns: {formatted_dataset.column_names}")

# Show formatted sample
print(f"\nFormatted sample (first 500 chars):")
print(formatted_dataset[0]["text"][:500])

Applying chat template to dataset...

Formatted dataset:
  Columns: ['messages', 'text']

Formatted sample (first 500 chars):
<|im_start|>system
You are a news article classifier. Your task is to categorize news articles into exactly one of four categories:

- World: News about politics, government, elections, diplomacy, conflicts, and public affairs (domestic or international)
- Sports: News about athletic events, games, players, teams, coaches, tournaments, and championships
- Business: News about companies, markets, finance, economy, trade, corporate activities, and business services
- Sci/Tech: News about technolog
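
The formatted sample follows the ChatML layout, where each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. The sketch below is a simplified approximation of that layout, not the tokenizer's actual template, which lives in the tokenizer configuration and handles more cases. `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering of a list of {"role", "content"} dicts."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model to start answering.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a news article classifier."},
    {"role": "user", "content": "Classify the following news article:\n\nExample text"},
    {"role": "assistant", "content": '{"category":"World"}'},
]

# Training text: all three turns, no open assistant turn at the end.
print(render_chatml(messages))

# Inference prompt: drop the answer and leave the assistant turn open.
print(render_chatml(messages[:2], add_generation_prompt=True))
```

This mirrors the difference between `add_generation_prompt=False` during training, where the answer is part of the text, and `add_generation_prompt=True` at inference time.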


## 7. Configure Training

Set up the SFTTrainer with optimized settings for DGX Spark.

In [7]:
from trl import SFTTrainer, SFTConfig

# Enable cuDNN benchmark for consistent input sizes
torch.backends.cudnn.benchmark = True

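# Estimate optimizer steps from the raw example count
# (with packing=True the actual number is typically lower, since short examples share a sequence)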
total_steps = (len(formatted_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)) * NUM_EPOCHS
print(f"Training configuration:")
print(f"  Total examples: {len(formatted_dataset):,}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Gradient accumulation: {GRADIENT_ACCUMULATION_STEPS}")
print(f"  Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Estimated total steps: {total_steps:,}")

# SFT Configuration
# Note: newer TRL releases use 'max_length' instead of 'max_seq_length'
sft_config = SFTConfig(
    # Output
    output_dir=OUTPUT_DIR,
    
    # Training
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    
    # Optimizer
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    optim="adamw_8bit",  # Memory-efficient optimizer
    
    # Precision
    bf16=True,
    fp16=False,
    
    # Sequence handling
    max_length=MAX_SEQ_LENGTH,  # Called max_seq_length in older TRL releases
    packing=True,  # Pack multiple short examples into each sequence (30-40% speedup)
    
    # Logging
    logging_steps=LOGGING_STEPS,
    logging_first_step=True,
    
    # Checkpointing
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    
    # Performance
    dataloader_num_workers=4,
    
    # Misc
    seed=42,
    report_to="none",  # Disable wandb/tensorboard
)

print("\nSFTConfig created successfully!")

Training configuration:
  Total examples: 120,000
  Batch size: 8
  Gradient accumulation: 4
  Effective batch size: 32
  Epochs: 1
  Estimated total steps: 3,750

SFTConfig created successfully!
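
The step estimate above counts raw examples, while `packing=True` merges several short examples into one sequence of up to `max_length` tokens. The sketch below compares the two estimates and derives the warmup length. The average of 250 tokens per formatted example is a hypothetical value chosen for illustration, perfect packing is assumed (so the packed figure is a lower bound), and warmup steps are assumed to be rounded up.

```python
import math

def estimate_steps(num_examples, batch_size, grad_accum, epochs=1):
    """Optimizer steps when every example is its own sequence."""
    return (num_examples // (batch_size * grad_accum)) * epochs

def estimate_packed_steps(num_examples, avg_tokens, max_length,
                          batch_size, grad_accum, epochs=1):
    """Lower-bound estimate when examples are packed into full sequences."""
    packed_sequences = math.ceil(num_examples * avg_tokens / max_length)
    return (packed_sequences // (batch_size * grad_accum)) * epochs

unpacked = estimate_steps(120_000, batch_size=8, grad_accum=4)

# ASSUMPTION: 250 tokens per formatted example is a made-up average, not a measurement.
packed = estimate_packed_steps(120_000, avg_tokens=250, max_length=512,
                               batch_size=8, grad_accum=4)

warmup = math.ceil(unpacked * 0.03)  # WARMUP_RATIO = 0.03

print(f"Unpacked steps: {unpacked:,}")
print(f"Packed steps (rough lower bound): {packed:,}")
print(f"Warmup steps for the unpacked estimate: {warmup}")
```

The step count that the trainer reports once training starts is the authoritative number; this estimate only indicates the order of magnitude to expect.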


## 8. Create Trainer and Start Training

This will take approximately 2-4 hours depending on GPU utilization.

In [9]:
# Create the trainer
# Note: TRL 0.12+ uses 'processing_class' instead of 'tokenizer'
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=formatted_dataset,
    args=sft_config,
)

print("Trainer created!")
print(f"\nStarting training...")
print("=" * 60)

Tokenizing train dataset: 100%|██████████| 120000/120000 [00:41<00:00, 2909.14 examples/s]
Packing train dataset: 100%|██████████| 120000/120000 [00:00<00:00, 259816.34 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.


Trainer created!

Starting training...


In [None]:
import time

start_time = time.time()

# Train!
trainer_stats = trainer.train()

elapsed_time = time.time() - start_time
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

print("\n" + "=" * 60)
print("Training Complete!")
print("=" * 60)
print(f"\nTraining time: {int(hours)}h {int(minutes)}m {int(seconds)}s")
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Total steps: {trainer_stats.global_step}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
1,2.9202


## 9. Save the LoRA Adapter

Save only the adapter weights (not the full model). This will be ~200 MB instead of ~14 GB.

In [None]:
# Save the LoRA adapter
adapter_path = f"{OUTPUT_DIR}/final"

print(f"Saving LoRA adapter to: {adapter_path}")

# Save only the adapter weights with PEFT's save_pretrained
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

# Check the saved files
import os
saved_files = os.listdir(adapter_path)
total_size = sum(os.path.getsize(os.path.join(adapter_path, f)) for f in saved_files)

print(f"\nSaved files:")
for f in sorted(saved_files):
    size = os.path.getsize(os.path.join(adapter_path, f))
    print(f"  {f}: {size / 1e6:.2f} MB")

print(f"\nTotal adapter size: {total_size / 1e6:.2f} MB")
print(f"\n✓ LoRA adapter saved successfully!")
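
The expected adapter size can be sanity-checked from the parameter counts printed in section 4. The sketch below is a back-of-the-envelope estimate: whether the adapter is stored in 32-bit or 16-bit precision depends on the PEFT settings, so both cases are shown, and tokenizer files add a few more megabytes on top. The helper `size_mb` is hypothetical.

```python
def size_mb(num_params, bytes_per_param):
    """Raw tensor storage in megabytes (10^6 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e6

TRAINABLE_PARAMS = 40_370_176   # adapter parameters, from the printed count
FROZEN_PARAMS = 7_615_616_512   # base model parameters, from the printed count

adapter_fp32_mb = size_mb(TRAINABLE_PARAMS, 4)   # if stored as float32
adapter_bf16_mb = size_mb(TRAINABLE_PARAMS, 2)   # if stored as bfloat16
base_bf16_gb = size_mb(FROZEN_PARAMS, 2) / 1e3   # full model in bfloat16

print(f"Adapter (float32):  {adapter_fp32_mb:.1f} MB")
print(f"Adapter (bfloat16): {adapter_bf16_mb:.1f} MB")
print(f"Base model (bfloat16): {base_bf16_gb:.1f} GB")
```

Either way the adapter is roughly two orders of magnitude smaller than the base model, which is the point of saving it separately.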

## 10. Quick Evaluation

Test the fine-tuned model on a few examples to verify it works correctly.

In [None]:
# Set model to evaluation mode for inference
model.eval()

# System prompt for classification
SYSTEM_PROMPT = """You are a news article classifier. Your task is to categorize news articles into exactly one of four categories:

- World: News about politics, government, elections, diplomacy, conflicts, and public affairs (domestic or international)
- Sports: News about athletic events, games, players, teams, coaches, tournaments, and championships
- Business: News about companies, markets, finance, economy, trade, corporate activities, and business services
- Sci/Tech: News about technology products, software, hardware, scientific research, gadgets, and tech innovations

Respond with a JSON object containing only the category field."""

# Test articles
test_articles = [
    ("The Federal Reserve announced a quarter-point interest rate cut, signaling confidence in the economy.", "Business"),
    ("Scientists at CERN discovered a new subatomic particle that could revolutionize our understanding of physics.", "Sci/Tech"),
    ("The Lakers defeated the Celtics 112-108 in overtime, with LeBron James scoring 35 points.", "Sports"),
    ("The United Nations Security Council voted to impose new sanctions on North Korea.", "World"),
]

print("Testing fine-tuned model:")
print("=" * 60)

for article, expected in test_articles:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Classify the following news article:\n\n{article}"},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=False,
    ).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=50,
            do_sample=False,  # Greedy decoding; temperature is ignored without sampling
            pad_token_id=tokenizer.pad_token_id,
        )
    
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    
    # Check if correct
    is_correct = expected.lower() in response.lower()
    status = "✓" if is_correct else "✗"
    
    print(f"\nArticle: {article[:60]}...")
    print(f"Expected: {expected}")
    print(f"Response: {response.strip()}")
    print(f"Status: {status}")
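
The substring check above also accepts a response that mentions several categories or that is not valid JSON. A stricter check, sketched below, parses the response and compares the `category` field exactly. `parse_category` and `VALID_CATEGORIES` are hypothetical names introduced for illustration.

```python
import json

VALID_CATEGORIES = {"World", "Sports", "Business", "Sci/Tech"}

def parse_category(response):
    """Return the predicted category, or None if the response is malformed."""
    try:
        payload = json.loads(response.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    category = payload.get("category")
    if isinstance(category, str) and category in VALID_CATEGORIES:
        return category
    return None

print(parse_category('{"category":"World"}'))
print(parse_category("This could be World or Sports"))
print(parse_category('{"category":"Weather"}'))
```

Counting a `None` result as an error keeps formatting failures visible instead of folding them into the accuracy figure.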

## 11. Next Steps

### Deploy with vLLM

To serve the LoRA adapter with vLLM, update `docker-compose-qwen7b.yml`:

```yaml
command: >
  vllm serve Qwen/Qwen2.5-7B-Instruct
  --host 0.0.0.0
  --port 8000
  --enable-lora
  --lora-modules lora-agnews=/checkpoints/qwen7b-ag-news-lora/final
```

Then start the server:
```bash
./start_docker.sh start qwen7b
```
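
Once the server is running, the adapter can be queried through vLLM's OpenAI-compatible chat endpoint by passing the adapter name from `--lora-modules` as the model. The sketch below only builds the request, so it runs without a server. The host `localhost`, the helper `build_chat_request`, and the shortened prompt are assumptions.

```python
import json
import urllib.request

def build_chat_request(article, system_prompt, base_url="http://localhost:8000"):
    """Build a POST request for the chat completions endpoint (hypothetical helper)."""
    payload = {
        "model": "lora-agnews",  # adapter name registered via --lora-modules
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Classify the following news article:\n\n{article}"},
        ],
        "temperature": 0.0,
        "max_tokens": 50,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request(
    "The Lakers defeated the Celtics 112-108 in overtime.",
    "You are a news article classifier.",  # shortened placeholder prompt
)
print(request.full_url)

# Sending the request requires the server to be up:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```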

### Full Evaluation

For comprehensive evaluation on the test set, create an evaluation notebook similar to `full_finetuning_performance.ipynb`.
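
As a starting point for such a notebook, the sketch below aggregates per-category and overall accuracy from `(expected, predicted)` pairs. The function names and the four sample pairs are made up for illustration; a prediction of `None` stands for a response that could not be parsed.

```python
from collections import Counter

def per_class_accuracy(pairs):
    """Map each expected label to a (correct, total) tuple."""
    correct, total = Counter(), Counter()
    for expected, predicted in pairs:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {label: (correct[label], total[label]) for label in total}

def overall_accuracy(pairs):
    """Fraction of pairs where the prediction matches the expected label."""
    if not pairs:
        return 0.0
    return sum(1 for expected, predicted in pairs if expected == predicted) / len(pairs)

# Hypothetical results, not output from this notebook.
pairs = [
    ("World", "World"),
    ("World", "Business"),
    ("Sports", "Sports"),
    ("Sci/Tech", None),  # unparseable response counts as wrong
]

for label, (hits, count) in per_class_accuracy(pairs).items():
    print(f"{label}: {hits}/{count}")
print(f"Overall: {overall_accuracy(pairs):.2%}")
```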

### Compare with Full Fine-Tuning

| Metric | Full Fine-Tuning | LoRA (Expected) |
|--------|------------------|-----------------|
| Accuracy | 88.33% | ~85-88% |
| Training Time | ~10 hours | ~2-4 hours |
| Output Size | ~14 GB | ~200 MB |
| Memory Used | ~70 GB | ~20-25 GB |

## Conclusions

*To be filled after running the notebook*

### Training Results

| Metric | Value |
|--------|-------|
| Training Time | TBD |
| Final Loss | TBD |
| Total Steps | TBD |
| Adapter Size | TBD |

### Quick Test Results

| Category | Correct/Total |
|----------|---------------|
| World | TBD |
| Sports | TBD |
| Business | TBD |
| Sci/Tech | TBD |

### Key Observations

1. **Training speed**: TBD
2. **Memory usage**: TBD
3. **Quality**: TBD