# Full Fine-Tuning: Qwen2.5-7B on AG News

This notebook performs **full fine-tuning** (updating all ~7.6B parameters) of the Qwen2.5-7B-Instruct model on the AG News classification dataset.

## What is Full Fine-Tuning?

Unlike LoRA/QLoRA, which train only small adapter layers, **full fine-tuning** updates all model parameters:

| Aspect | Full Fine-Tuning | LoRA |
|--------|------------------|------|
| Parameters updated | ~7.6B (100%) | ~70M (~1%) |
| Memory required | ~60-70 GB | ~16-24 GB |
| Training time | 8-15 hours | 2-4 hours |
| Output size | ~14 GB | ~100-500 MB |
| Risk of overfitting | Higher | Lower |
| Potential improvement | Higher | Moderate |
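
The memory figure for full fine-tuning can be sanity-checked with simple arithmetic. The sketch below assumes that weights, gradients, and both AdamW moment buffers are all held in BF16 (2 bytes per value). It ignores activations, temporary buffers, and allocator overhead, so treat the result as a rough lower bound. The function name and its parameters are illustrative and not part of the notebook.

```python
def estimate_training_memory_gb(num_params, bytes_per_value=2, optimizer_states=2):
    """Rough lower bound on training-state memory in GB (decimal).

    Counts one copy each of weights and gradients plus `optimizer_states`
    per-parameter buffers (AdamW keeps two: first and second moments).
    Assumption: every tensor uses `bytes_per_value` bytes per element.
    """
    copies = 2 + optimizer_states  # weights + gradients + optimizer buffers
    return num_params * bytes_per_value * copies / 1e9


# Parameter count as printed by the model-loading cell of this notebook
QWEN_7B_PARAMS = 7_615_616_512

bf16_estimate = estimate_training_memory_gb(QWEN_7B_PARAMS)
print(f"All-BF16 training state: ~{bf16_estimate:.1f} GB")
```

If the optimizer states were kept in FP32 instead, the total would be noticeably larger, which is one reason to read the table's figure as approximate.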

## Environment

**This notebook is designed to run inside the NVIDIA PyTorch Docker container** (`nvcr.io/nvidia/pytorch:25.11-py3`) which provides:
- Native sm_120/121 CUDA kernels (optimized for Blackwell GB10)
- Transformer Engine 2.9+ with FP8 support
- Flash Attention 2 compiled for Blackwell
- Triton with optimized kernels

## Prerequisites

Before running this notebook:

1. **Start the training container**:
   ```bash
   cd ~/Projects/xiaohui-agentic-playground/6-open-source
   ./start_docker.sh start finetune
   ```

2. **Open Jupyter** at http://localhost:8888

3. **Confirm the training data is prepared** at `datasets/train.jsonl` (120K samples)

## Expected Results

- **Base model accuracy**: 78.63%
- **Target accuracy**: 85-92% (with fine-tuning)
- **Training time**: ~8-15 hours for 1 epoch (on DGX Spark)

## 1. Environment Setup & Pre-flight Checks

First, let's verify that:
- GPU is available and has sufficient memory
- Required libraries are installed
- Training data exists

In [1]:
# Pre-flight checks
import torch
import os
from pathlib import Path

print("=" * 60)
print("PRE-FLIGHT CHECKS")
print("=" * 60)

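# Check Docker environment (heuristic: a CUDA toolkit directory can also exist outside the container)
print(f"\n[1] Docker Environment:")
if os.path.exists("/usr/local/cuda-13.0") or os.path.exists("/usr/local/cuda"):
    print("    ✓ CUDA toolkit found (consistent with the NVIDIA PyTorch container)")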
    # Check for Transformer Engine
    try:
        import transformer_engine as te
        print(f"    ✓ Transformer Engine: {te.__version__}")
    except ImportError:
        print("    ⚠ Transformer Engine not found (optional)")
    # Check for Triton
    try:
        import triton
        print(f"    ✓ Triton: {triton.__version__}")
    except ImportError:
        print("    ⚠ Triton not found")
else:
    print("    ⚠ Not in NVIDIA container - may have suboptimal performance")
    print("    ⚠ Start with: ./start_docker.sh start finetune")

# Check GPU
print(f"\n[2] GPU Availability:")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    compute_cap = torch.cuda.get_device_capability(0)
    print(f"    ✓ CUDA available: {gpu_name}")
    print(f"    ✓ Compute capability: sm_{compute_cap[0]}{compute_cap[1]}")
    print(f"    ✓ Total memory: {gpu_memory:.1f} GB")
    
    # Check current memory usage
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    print(f"    ✓ Currently allocated: {allocated:.1f} GB")
    print(f"    ✓ Currently reserved: {reserved:.1f} GB")
else:
    print("    ✗ CUDA NOT available - cannot proceed!")

# Check training data
print(f"\n[3] Training Data:")
train_file = Path("datasets/train.jsonl")
if train_file.exists():
    size_mb = train_file.stat().st_size / 1e6
    with open(train_file) as f:
        num_lines = sum(1 for _ in f)
    print(f"    ✓ Found: {train_file}")
    print(f"    ✓ Size: {size_mb:.1f} MB")
    print(f"    ✓ Examples: {num_lines:,}")
else:
    print(f"    ✗ Training data not found at {train_file}")

# Check output directory
print(f"\n[4] Output Directory:")
output_dir = Path("checkpoints/qwen7b-ag-news-full")
output_dir.mkdir(parents=True, exist_ok=True)
print(f"    ✓ Will save to: {output_dir}")

print("\n" + "=" * 60)
print("PRE-FLIGHT CHECKS COMPLETE")
print("=" * 60)

PRE-FLIGHT CHECKS

[1] GPU Availability:
    ✓ CUDA available: NVIDIA GB10
    ✓ Total memory: 128.5 GB
    ✓ Currently allocated: 0.0 GB
    ✓ Currently reserved: 0.0 GB

[2] Training Data:
    ✓ Found: datasets/train.jsonl
    ✓ Size: 154.9 MB
    ✓ Examples: 120,000

[3] Output Directory:
    ✓ Will save to: checkpoints/qwen7b-ag-news-full

PRE-FLIGHT CHECKS COMPLETE


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


In [2]:
# Library versions - Docker container already has these installed
import transformers
import trl
import accelerate

print("Library versions:")
print(f"  torch: {torch.__version__}")
print(f"  transformers: {transformers.__version__}")
print(f"  trl: {trl.__version__}")
print(f"  accelerate: {accelerate.__version__}")

# Check for Transformer Engine (optional, for FP8)
try:
    import transformer_engine as te
    print(f"  transformer_engine: {te.__version__}")
except ImportError:
    print("  transformer_engine: not installed")

# Check CUDA version
print(f"\nCUDA info:")
print(f"  CUDA version (torch): {torch.version.cuda}")
print(f"  cuDNN version: {torch.backends.cudnn.version()}")



Library versions:
  transformers: 5.0.0
  trl: 0.27.1
  accelerate: 1.12.0
  torch: 2.10.0+cu128


## 2. Configuration

Define all hyperparameters and settings for training.

### Key Hyperparameters Explained:

| Parameter | Value | Explanation |
|-----------|-------|-------------|
| `learning_rate` | 2e-5 | Lower than LoRA (1e-4) because we're updating all params |
| `num_train_epochs` | 1 | Full fine-tuning often needs fewer epochs |
| `per_device_train_batch_size` | 4 | Increased from 2 (Docker has better memory management) |
| `gradient_accumulation_steps` | 8 | Effective batch size = 4 × 8 = 32 |
| `gradient_checkpointing` | True | Trades compute for memory (essential!) |
| `bf16` | True | BFloat16 precision for speed + stability |

### Expected Training Time

On DGX Spark with NVIDIA Docker container: **~8-15 hours** for 1 epoch on 120K examples.
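
To make this estimate easier to check while training runs, the arithmetic below converts it into an implied time per optimizer step and an implied throughput. It assumes the effective batch size of 32 from the table above and the 120K-example dataset. The helper name is illustrative.

```python
def implied_speed(num_examples, effective_batch, hours):
    """Return (seconds per optimizer step, samples per second) for one epoch."""
    steps = num_examples // effective_batch
    seconds = hours * 3600
    return seconds / steps, num_examples / seconds


for hours in (8, 15):
    sec_per_step, samples_per_sec = implied_speed(120_000, 32, hours)
    print(f"{hours:>2} h -> {sec_per_step:.1f} s/step, {samples_per_sec:.2f} samples/s")
```

If the observed step time falls well outside this range, the estimate probably does not apply to the current setup.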

In [3]:
# =============================================================================
# CONFIGURATION
# =============================================================================

# Model
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
OUTPUT_DIR = "checkpoints/qwen7b-ag-news-full"

# Data
TRAIN_FILE = "datasets/train.jsonl"
MAX_SEQ_LENGTH = 512  # Our data is short (~120 tokens avg)

# Training hyperparameters
LEARNING_RATE = 2e-5
NUM_EPOCHS = 1
BATCH_SIZE = 4  # Per device (increased from 2 - Docker has better memory management)
GRADIENT_ACCUMULATION_STEPS = 8  # Effective batch = 32
WARMUP_RATIO = 0.03
WEIGHT_DECAY = 0.01
MAX_GRAD_NORM = 1.0

# Checkpointing
SAVE_STEPS = 500
LOGGING_STEPS = 50
SAVE_TOTAL_LIMIT = 2  # Keep only 2 checkpoints to save disk space

# Memory optimization
USE_GRADIENT_CHECKPOINTING = True
USE_BF16 = True

# Performance optimization (Docker container)
ENABLE_TORCH_COMPILE = False  # Set True for potential 10-20% speedup (experimental)

print("Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Output: {OUTPUT_DIR}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE} × {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS} effective")
print(f"  Max sequence length: {MAX_SEQ_LENGTH}")
print(f"  torch.compile: {ENABLE_TORCH_COMPILE}")

Configuration:
  Model: Qwen/Qwen2.5-7B-Instruct
  Output: checkpoints/qwen7b-ag-news-full
  Learning rate: 2e-05
  Epochs: 1
  Batch size: 2 × 16 = 32 effective
  Max sequence length: 512


## 3. Load and Prepare Dataset

Load the prepared training data and format it for the SFTTrainer.

In [4]:
from datasets import load_dataset

print("Loading training dataset...")
dataset = load_dataset("json", data_files=TRAIN_FILE, split="train")

print(f"\nDataset loaded:")
print(f"  Examples: {len(dataset):,}")
print(f"  Features: {dataset.features}")

# Show a sample
print(f"\nSample example:")
sample = dataset[0]
for msg in sample["messages"]:
    role = msg["role"]
    content = msg["content"][:100] + "..." if len(msg["content"]) > 100 else msg["content"]
    print(f"  [{role}]: {content}")

Loading training dataset...

Dataset loaded:
  Examples: 120,000
  Features: {'messages': List({'role': Value('string'), 'content': Value('string')})}

Sample example:
  [system]: You are a news article classifier. Your task is to categorize news articles into exactly one of four...
  [user]: Classify the following news article:

Thirst, Fear and Bribes on Desert Escape from Africa  AGADEZ, ...
  [assistant]: {"category":"World"}


## 4. Load Base Model

Load the Qwen2.5-7B-Instruct model in BF16 precision.

**Important**: We enable `gradient_checkpointing` to reduce memory usage. This trades compute for memory by recomputing activations during backward pass instead of storing them.

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"  Set pad_token to eos_token: {tokenizer.pad_token}")

print(f"\nTokenizer loaded:")
print(f"  Vocab size: {tokenizer.vocab_size:,}")
print(f"  Model max length: {tokenizer.model_max_length:,}")

Loading tokenizer...

Tokenizer loaded:
  Vocab size: 151,643
  Model max length: 131,072


In [6]:
print("Loading model (this may take a few minutes)...")
print(f"  Model: {MODEL_NAME}")
print(f"  Precision: BF16")
print(f"  Gradient checkpointing: {USE_GRADIENT_CHECKPOINTING}")

# Determine best attention implementation
# Docker container has Flash Attention 2 compiled for Blackwell
try:
    import flash_attn
    attn_impl = "flash_attention_2"
    print(f"  Attention: Flash Attention 2 (optimized for Blackwell)")
except ImportError:
    attn_impl = "sdpa"
    print(f"  Attention: SDPA (Flash Attention 2 not available)")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.bfloat16,  # `torch_dtype` is the older, deprecated name; use it on older transformers releases
    device_map="auto",
    trust_remote_code=True,
    attn_implementation=attn_impl,
)

# Enable gradient checkpointing to save memory
if USE_GRADIENT_CHECKPOINTING:
    model.gradient_checkpointing_enable()
    print("  ✓ Gradient checkpointing enabled")

# Optional: torch.compile for potential speedup
if ENABLE_TORCH_COMPILE:
    print("  Compiling model with torch.compile (this may take a few minutes)...")
    model = torch.compile(model)
    print("  ✓ Model compiled")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nModel loaded:")
print(f"  Total parameters: {total_params:,} ({total_params/1e9:.2f}B)")
print(f"  Trainable parameters: {trainable_params:,} ({trainable_params/1e9:.2f}B)")
print(f"  Trainable %: {100 * trainable_params / total_params:.2f}%")

# Check memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"\nGPU memory after model load: {allocated:.1f} GB")

Loading model (this may take a few minutes)...
  Model: Qwen/Qwen2.5-7B-Instruct
  Precision: BF16
  Gradient checkpointing: True


Loading weights: 100%|██████████| 339/339 [01:11<00:00,  4.74it/s, Materializing param=model.norm.weight]                              


  ✓ Gradient checkpointing enabled

Model loaded:
  Total parameters: 7,615,616,512 (7.62B)
  Trainable parameters: 7,615,616,512 (7.62B)
  Trainable %: 100.00%

GPU memory after model load: 15.2 GB


## 5. Training Configuration Preview

Preview the training configuration. The actual configuration will be set via `SFTConfig` in the next section.

In [7]:
# Calculate expected training steps for reference
# Note: We'll use SFTConfig in the next cell which combines TrainingArguments + SFT options
num_training_steps = (len(dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)) * NUM_EPOCHS
warmup_steps = int(WARMUP_RATIO * num_training_steps)  # 3% warmup

print("Training Configuration Preview:")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE} × {GRADIENT_ACCUMULATION_STEPS} = {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS} effective")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Warmup steps: {warmup_steps}")
print(f"  Total training steps: ~{num_training_steps:,}")
print(f"  Checkpoint every: {SAVE_STEPS} steps")
print(f"\n  Expected training time: ~8-15 hours (DGX Spark with Docker)")

Training Configuration Preview:
  Output directory: checkpoints/qwen7b-ag-news-full
  Epochs: 1
  Batch size: 2 × 16 = 32 effective
  Learning rate: 2e-05
  Warmup steps: 112
  Total training steps: ~3,750
  Checkpoint every: 500 steps


## 6. Initialize SFTTrainer

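We use TRL's `SFTTrainer` (Supervised Fine-Tuning Trainer) with `SFTConfig`, which handles:
- Chat template formatting
- **Loss masking**: `assistant_only_loss=True` would restrict the loss to assistant responses, but it requires a chat template with `{% generation %}` markers. Qwen2.5's template lacks them, so this notebook sets `assistant_only_loss=False` and trains on full sequences
- Efficient data collation
- All training hyperparameters in one config object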
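
As an illustration of the chat template formatting step, the sketch below renders a `messages` list into the ChatML layout that Qwen2.5 uses. It is a simplified stand-in, not the tokenizer's actual Jinja template, which also handles details such as default system prompts and tool calls. In the notebook itself, `SFTTrainer` applies the real template. The function name is hypothetical.

```python
def render_chatml(messages):
    """Simplified ChatML rendering: one <|im_start|>...<|im_end|> block per message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Placeholder conversation in the same shape as the training data
example = [
    {"role": "user", "content": "Classify the following news article:\n\n..."},
    {"role": "assistant", "content": '{"category":"World"}'},
]
print(render_chatml(example))
```

With assistant-only loss, only the tokens of the assistant block would contribute to the loss. With full-sequence loss, every token in the rendered string does.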

In [8]:
from trl import SFTTrainer, SFTConfig

# In TRL 0.27+, SFTConfig's assistant_only_loss replaces DataCollatorForCompletionOnlyLM.
# When enabled, it masks the loss so that only assistant responses are trained on (it is disabled below).

print("Initializing SFTTrainer...")

# Create SFTConfig with all training parameters
sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    
    # Training duration
    num_train_epochs=NUM_EPOCHS,
    
    # Batch size
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    
    # Optimizer
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_steps=int(WARMUP_RATIO * (len(dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))),  # 3% warmup
    max_grad_norm=MAX_GRAD_NORM,
    optim="adamw_torch",
    
    # Precision
    bf16=USE_BF16,
    
    # Checkpointing
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=SAVE_TOTAL_LIMIT,
    
    # Logging
    logging_steps=LOGGING_STEPS,
    logging_first_step=True,
    report_to="none",
    
    # Memory optimization
    gradient_checkpointing=USE_GRADIENT_CHECKPOINTING,
    
    # Note: assistant_only_loss=True requires tokenizer chat template with {% generation %} keyword
    # Qwen2.5's template doesn't support this, so we train on full sequences
    # This is acceptable since our assistant responses are very short (just category JSON)
    assistant_only_loss=False,
    
    # Sequence length (max_length in TRL 0.27+)
    max_length=MAX_SEQ_LENGTH,
    
    # Other
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=sft_config,
    train_dataset=dataset,
)

print("\n✓ SFTTrainer initialized")
print(f"  Training examples: {len(trainer.train_dataset):,}")
print(f"  Loss computed on: {'assistant only' if sft_config.assistant_only_loss else 'full sequence'}")

# Final memory check
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"  GPU memory before training: {allocated:.1f} GB")

Initializing SFTTrainer...

✓ SFTTrainer initialized
  Training examples: 120,000
  Loss computed on: full sequence
  GPU memory before training: 15.2 GB


## 7. Train!

Now we start the training process. This will take **~8-15 hours** for 1 epoch on 120K examples on DGX Spark.

**What to monitor:**
- `loss` should decrease over time
- GPU memory usage (watch for OOM errors)
- Training speed (samples/second)

**Checkpoints are saved automatically** every 500 steps, so if training is interrupted, you can resume from the last checkpoint.
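
A minimal sketch of resuming after an interruption: the helper below looks for the highest-numbered `checkpoint-<step>` directory, which is the naming the Hugging Face `Trainer` uses for step checkpoints. The helper name is hypothetical. `Trainer.train` also accepts `resume_from_checkpoint=True`, which asks it to locate the last checkpoint in the output directory itself, so the helper is mainly useful for inspecting what would be resumed.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the highest-step `checkpoint-<step>` directory, or None if there is none."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for path in Path(output_dir).glob("checkpoint-*"):
        match = pattern.match(path.name)
        if match and path.is_dir():
            # Compare by numeric step so that checkpoint-1000 ranks above checkpoint-500
            found.append((int(match.group(1)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[0])[1]


# Usage inside the notebook (assumes `trainer` and `OUTPUT_DIR` from the earlier cells):
# checkpoint = latest_checkpoint(OUTPUT_DIR)
# trainer.train(resume_from_checkpoint=str(checkpoint) if checkpoint else None)
```

Because `SAVE_TOTAL_LIMIT` keeps only the two most recent checkpoints, at most the last 500 to 1,000 steps of work are at stake when resuming.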

**Note**: The DGX Spark's unified memory architecture (273 GB/s bandwidth) is the primary bottleneck for full fine-tuning. The NVIDIA Docker container provides optimized kernels for sm_121, but memory bandwidth remains the limiting factor.

In [None]:
import time

print("=" * 60)
print("STARTING TRAINING")
print("=" * 60)
print(f"\nModel: {MODEL_NAME}")
print(f"Dataset: {len(dataset):,} examples")
print(f"Epochs: {NUM_EPOCHS}")
print(f"Expected time: ~8-15 hours (DGX Spark with Docker)")
print("\nCheckpoints will be saved to:", OUTPUT_DIR)
print("\n" + "=" * 60)

start_time = time.time()

# Train!
trainer.train()

end_time = time.time()
training_time = (end_time - start_time) / 3600  # Convert to hours

print("\n" + "=" * 60)
print("TRAINING COMPLETE")
print("=" * 60)
print(f"\nTotal training time: {training_time:.2f} hours")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


STARTING TRAINING

Model: Qwen/Qwen2.5-7B-Instruct
Dataset: 120,000 examples
Epochs: 1
Expected time: 3-5 hours

Checkpoints will be saved to: checkpoints/qwen7b-ag-news-full



Step,Training Loss
1,2.894846
50,1.491878
100,0.480344
150,0.474743


KeyboardInterrupt: 


## 8. Save the Fine-Tuned Model

Save the complete fine-tuned model (all weights, ~14 GB).

In [None]:
print("Saving fine-tuned model...")

# Save the model
final_model_path = f"{OUTPUT_DIR}/final"
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

print(f"\n✓ Model saved to: {final_model_path}")

# Check size
import subprocess
result = subprocess.run(["du", "-sh", final_model_path], capture_output=True, text=True)
print(f"✓ Model size: {result.stdout.strip()}")

## 9. Quick Evaluation

Let's do a quick sanity check by classifying a few examples with the fine-tuned model.

In [None]:
# Quick evaluation on a few examples
print("=" * 60)
print("QUICK EVALUATION")
print("=" * 60)

# Test examples (one from each category)
test_examples = [
    {"text": "President Biden announces new climate policy at UN summit", "expected": "World"},
    {"text": "Lakers defeat Celtics 115-108 in overtime thriller", "expected": "Sports"},
    {"text": "Apple stock rises 5% after strong quarterly earnings report", "expected": "Business"},
    {"text": "Google releases new AI model that can generate realistic images", "expected": "Sci/Tech"},
]

# System prompt (same as training)
system_prompt = """You are a news article classifier. Your task is to categorize news articles into exactly one of four categories:

- World: News about politics, government, elections, diplomacy, conflicts, and public affairs (domestic or international)
- Sports: News about athletic events, games, players, teams, coaches, tournaments, and championships
- Business: News about companies, markets, finance, economy, trade, corporate activities, and business services
- Sci/Tech: News about technology products, software, hardware, scientific research, gadgets, and tech innovations

Rules:
- Focus on the PRIMARY topic of the article
- Ignore HTML artifacts (like #39; or &lt;b&gt;) - they are formatting errors
- If an article is truncated, classify based on the available content
- When a topic spans multiple categories, choose the one that best represents the main focus"""

model.eval()

for example in test_examples:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the following news article:\n\n{example['text']}"}
    ]
    
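    # return_dict=True yields input_ids and attention_mask regardless of the library's default return type
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,  # Greedy decoding; temperature has no effect when sampling is off
            pad_token_id=tokenizer.pad_token_id,
        )
    
    # Decode only the newly generated tokens (skip the prompt)
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)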
    
    print(f"\nArticle: {example['text'][:60]}...")
    print(f"Expected: {example['expected']}")
    print(f"Predicted: {response.strip()}")
    print("-" * 40)
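
Since the training targets are JSON objects of the form `{"category":"World"}`, the generated text can be checked programmatically rather than by eye. The sketch below is one way to do that, assuming the model emits a single JSON object and nothing else. The helper name and the `VALID_CATEGORIES` constant are illustrative.

```python
import json

# The four AG News categories named in the system prompt
VALID_CATEGORIES = {"World", "Sports", "Business", "Sci/Tech"}


def parse_category(response):
    """Extract the predicted category from a JSON response, or None if it is malformed."""
    try:
        parsed = json.loads(response.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    category = parsed.get("category")
    return category if category in VALID_CATEGORIES else None


print(parse_category('{"category":"World"}'))
print(parse_category("I think this is about sports."))
```

Treating malformed or out-of-vocabulary responses as `None` keeps them countable as errors instead of silently dropping them.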

## 10. Next Steps

Now that training is complete:

### 1. Full Evaluation
Run the fine-tuned model through the same evaluation as the base model:
- Use `base_model_performance.ipynb` as a template
- Load the fine-tuned model from `checkpoints/qwen7b-ag-news-full/final`
- Compare accuracy, F1, confusion matrix
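
For reference, accuracy and macro F1 can be computed without extra dependencies. The sketch below is a plain-Python illustration of the two metrics, not the code from `base_model_performance.ipynb`. The function name is hypothetical, and a label with no true or predicted instances is given an F1 of 0 by assumption.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred, labels):
    """Return (accuracy, macro-averaged F1) over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    accuracy = sum(tp.values()) / len(y_true)
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


labels = ["World", "Sports", "Business", "Sci/Tech"]
y_true = ["World", "Sports", "Business", "Sci/Tech"]
y_pred = ["World", "Sports", "Business", "Business"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred, labels)
print(f"accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Macro F1 weights each category equally, so a weak class such as Sci/Tech pulls the score down more than it affects overall accuracy.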

### 2. Serve with vLLM
To serve the fine-tuned model:
```bash
# Update docker-compose-qwen7b.yml to point to fine-tuned model:
# command: vllm serve /checkpoints/qwen7b-ag-news-full/final ...

# Then start the server:
./start_docker.sh start qwen7b
```

### 3. Compare Results

| Metric | Base Model | Fine-Tuned | Improvement |
|--------|------------|------------|-------------|
| Accuracy | 78.63% | TBD | TBD |
| F1 (macro) | 77.80% | TBD | TBD |
| Sci/Tech Recall | 46.37% | TBD | TBD |

In [None]:
# Final summary
print("=" * 60)
print("TRAINING SUMMARY")
print("=" * 60)
print(f"""
Model:              {MODEL_NAME}
Dataset:            {len(dataset):,} examples
Training time:      {training_time:.2f} hours
Output directory:   {OUTPUT_DIR}

Next steps:
1. Run full evaluation on test set
2. Compare with base model results
3. Serve with vLLM if results are good
""")