# QLoRA Test: Quantization Method Comparison - Qwen3-4B-Thinking

Compares different quantization methods for QLoRA training to understand memory/quality tradeoffs.

**Quantization methods tested:**
- **4-bit NF4**: QLoRA standard, 4-bit NormalFloat with double quantization
- **8-bit INT8**: bitsandbytes 8-bit quantization
- **FP16**: Unquantized baseline (16-bit half-precision floating point)

**Measurements:**
- Peak GPU memory usage (MB)
- Model loading time
- Training time
- Final loss
- Inference quality with thinking traces

**Expected outcomes:**
- 4-bit: ~70% memory reduction vs FP16
- 8-bit: ~50% memory reduction vs FP16
- Quality degradation: minimal (<2% loss difference)
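
These expected reductions follow from the number of bits stored per weight. A rough sketch of that arithmetic, assuming a nominal 4 billion parameters (inferred from the model name; the exact count differs) and counting base weights only, not LoRA adapters, optimizer state, activations, or quantization constants:

```python
# Back-of-envelope estimate of base-weight memory per precision.
# ASSUMPTION: a nominal 4e9 parameters; the real count differs slightly.
NOMINAL_PARAMS = 4_000_000_000

BITS_PER_WEIGHT = {"fp16": 16, "8bit": 8, "4bit_nf4": 4}

def weight_memory_gb(num_params: int, bits: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

def reduction_vs_fp16(bits: int) -> float:
    """Percent reduction in weight memory relative to 16-bit storage."""
    return 100 * (1 - bits / 16)

for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_memory_gb(NOMINAL_PARAMS, bits)
    print(f"{name}: {gb:.1f} GB weights, {reduction_vs_fp16(bits):.0f}% reduction vs FP16")
```

Weights alone give 75% and 50%. The reduction measured for the whole process is usually smaller, because adapters, optimizer state, activations, and the CUDA context are not quantized.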

**Important:** This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [None]:
# Environment Setup
import os
import time
from dotenv import load_dotenv
load_dotenv()

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
import gc

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

In [None]:
# Benchmark Helper Functions
import subprocess

def measure_gpu_memory():
    """Measure current GPU memory usage in MB using nvidia-smi"""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
            capture_output=True, text=True
        )
        return int(result.stdout.strip().split('\n')[0])
    except (OSError, ValueError):
        # nvidia-smi is missing or returned unparseable output
        return 0

def count_parameters(model):
    """Count trainable vs total parameters"""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return {
        "trainable": trainable,
        "total": total,
        "pct": 100 * trainable / total
    }

def cleanup_memory():
    """Force garbage collection and clear CUDA cache"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

print("Benchmark functions defined.")
print(f"Initial GPU memory: {measure_gpu_memory()} MB")

In [None]:
# Create minimal synthetic instruction dataset with thinking content (5 samples)
from datasets import Dataset

# Same dataset across all quantization methods for fair comparison
synthetic_data = [
    {
        "instruction": "What is machine learning?",
        "thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
        "response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
    },
    {
        "instruction": "Explain Python in one sentence.",
        "thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
        "response": "Python is a high-level programming language known for its readability and versatility."
    },
    {
        "instruction": "What is a neural network?",
        "thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
        "response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
    },
    {
        "instruction": "Define supervised learning.",
        "thinking": "What makes supervised learning 'supervised'? It's the labels! The data has known outputs. How do I explain this clearly? Focus on the training process with labeled data.",
        "response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."
    },
    {
        "instruction": "What is gradient descent?",
        "thinking": "This is a bit technical. What's the intuition behind gradient descent? It's like walking downhill to find the lowest point. The gradient tells us which direction is 'down'. Keep it conceptual.",
        "response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters in the direction of steepest descent."
    },
]

print(f"Synthetic dataset prepared: {len(synthetic_data)} samples")

In [None]:
# Quantization Configuration

# Note: For Unsloth, we use different model variants for different quantizations
QUANT_CONFIGS = {
    "4bit_nf4": {
        "model_name": "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
        "load_in_4bit": True,
        "description": "4-bit NF4 (QLoRA standard)",
    },
    "8bit": {
        "model_name": "unsloth/Qwen3-4B-Thinking-2507",
        "load_in_4bit": False,
        "load_in_8bit": True,
        "description": "8-bit INT8 (BitsAndBytes)",
    },
    "fp16": {
        "model_name": "unsloth/Qwen3-4B-Thinking-2507",
        "load_in_4bit": False,
        "load_in_8bit": False,
        "description": "FP16 (unquantized baseline)",
    },
}

print("Quantization configurations:")
for name, config in QUANT_CONFIGS.items():
    print(f"  - {name}: {config['description']}")

In [None]:
# Quantization Comparison Loop
from trl import SFTTrainer, SFTConfig

results = []

for quant_name, quant_config in QUANT_CONFIGS.items():
    print(f"\n{'='*60}")
    print(f"Testing: {quant_config['description']}")
    print(f"{'='*60}")
    
    # Cleanup and measure baseline
    cleanup_memory()
    mem_baseline = measure_gpu_memory()
    
    # Load model with timing
    print(f"Loading model: {quant_config['model_name'].split('/')[-1]}...")
    load_start = time.time()
    
    try:
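        load_kwargs = {
            "max_seq_length": 512,
            "dtype": None,
            # Pass both flags explicitly: Unsloth defaults load_in_4bit to True,
            # so omitting it would not give a true 8-bit or FP16 run.
            "load_in_4bit": quant_config.get("load_in_4bit", False),
            "load_in_8bit": quant_config.get("load_in_8bit", False),
        }
        
        if not (load_kwargs["load_in_4bit"] or load_kwargs["load_in_8bit"]):
            # FP16 - no quantization
            load_kwargs["dtype"] = torch.float16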
        
        model, tokenizer = FastLanguageModel.from_pretrained(
            quant_config["model_name"],
            **load_kwargs
        )
        
        load_time = time.time() - load_start
        mem_after_load = measure_gpu_memory()
        
        print(f"Load time: {load_time:.1f}s, Memory: {mem_after_load} MB")
        
        # Apply LoRA
        print(f"Applying LoRA (r=16, alpha=16)...")
        model = FastLanguageModel.get_peft_model(
            model,
            r=16,
            lora_alpha=16,
            lora_dropout=0,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                            "gate_proj", "up_proj", "down_proj"],
            bias="none",
            use_gradient_checkpointing="unsloth",
            random_state=42,
        )
        
        mem_after_lora = measure_gpu_memory()
        params = count_parameters(model)
        print(f"Trainable: {params['trainable']:,} ({params['pct']:.2f}%)")
        
        # Format dataset
        def format_conversation(sample):
            assistant_content = f"<think>\n{sample['thinking']}\n</think>\n\n{sample['response']}"
            messages = [
                {"role": "user", "content": sample["instruction"]},
                {"role": "assistant", "content": assistant_content}
            ]
            return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
        
        dataset = Dataset.from_list(synthetic_data)
        dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])
        
        # Training config
        sft_config = SFTConfig(
            output_dir=f"outputs_qlora_quant_think/{quant_name}",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=1,
            max_steps=3,
            warmup_steps=1,
            learning_rate=2e-4,
            logging_steps=1,
            fp16=not is_bf16_supported(),
            bf16=is_bf16_supported(),
            optim="adamw_8bit",
            weight_decay=0.01,
            max_seq_length=512,
            seed=42,
            report_to="none",
        )
        
        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=dataset,
            dataset_text_field="text",
            args=sft_config,
        )
        
        # Train with timing
        print(f"Training (3 steps)...")
        train_start = time.time()
        trainer_stats = trainer.train()
        train_time = time.time() - train_start
        
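        # train_loss is the mean over all steps; use the last logged step loss as the final loss
        step_losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
        final_loss = step_losses[-1] if step_losses else trainer_stats.metrics.get('train_loss', 0)
        # nvidia-smi reports memory in use right now (including PyTorch's cached blocks),
        # which approximates the peak but can understate it
        mem_peak = measure_gpu_memory()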
        
        print(f"Train time: {train_time:.1f}s, Final loss: {final_loss:.4f}")
        print(f"Peak memory: {mem_peak} MB")
        
        # Quick inference test
        FastLanguageModel.for_inference(model)
        messages = [{"role": "user", "content": "What is deep learning?"}]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.6,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )
        
        response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        
        # Store results
        results.append({
            "quantization": quant_name,
            "description": quant_config["description"],
            "load_time_sec": load_time,
            "train_time_sec": train_time,
            "model_memory_mb": mem_after_load - mem_baseline,
            "peak_memory_mb": mem_peak,
            "final_loss": final_loss,
            "sample_response": response[:150],
            "success": True,
        })
        
    except Exception as e:
        print(f"ERROR: {e}")
        results.append({
            "quantization": quant_name,
            "description": quant_config["description"],
            "success": False,
            "error": str(e),
        })
    
    finally:
        # Cleanup
        try:
            del model, tokenizer, trainer, dataset
        except NameError:
            # Some names are unbound if loading or training failed early
            pass
        cleanup_memory()

print(f"\n{'='*60}")
print("All quantization methods tested!")
print(f"{'='*60}")

In [None]:
# Results Visualization
import pandas as pd
import matplotlib.pyplot as plt

# Filter successful results
successful_results = [r for r in results if r.get('success', False)]

if successful_results:
    df = pd.DataFrame(successful_results)
    
    print("\n" + "="*60)
    print("Quantization Comparison Results")
    print("="*60)
    print(df[["quantization", "peak_memory_mb", "train_time_sec", "final_loss"]].to_string(index=False))
    
    # Calculate memory savings vs FP16
    if "fp16" in df["quantization"].values:
        fp16_mem = df[df["quantization"] == "fp16"]["peak_memory_mb"].values[0]
        print(f"\nMemory savings vs FP16 ({fp16_mem} MB):")
        for _, row in df.iterrows():
            if row["quantization"] != "fp16":
                savings = 100 * (1 - row["peak_memory_mb"] / fp16_mem)
                print(f"  {row['quantization']}: {savings:.1f}% reduction")
    
    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    colors = ['#2ecc71', '#3498db', '#e74c3c']  # green, blue, red
    
    # Plot 1: Peak GPU Memory
    bars1 = axes[0].bar(df['quantization'], df['peak_memory_mb'], color=colors[:len(df)])
    axes[0].set_title('Peak GPU Memory', fontsize=12)
    axes[0].set_xlabel('Quantization Method')
    axes[0].set_ylabel('Memory (MB)')
    axes[0].tick_params(axis='x', rotation=45)
    for bar, val in zip(bars1, df['peak_memory_mb']):
        axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                     f'{val:.0f}', ha='center', va='bottom')
    
    # Plot 2: Training Time
    bars2 = axes[1].bar(df['quantization'], df['train_time_sec'], color=colors[:len(df)])
    axes[1].set_title('Training Time (3 steps)', fontsize=12)
    axes[1].set_xlabel('Quantization Method')
    axes[1].set_ylabel('Seconds')
    axes[1].tick_params(axis='x', rotation=45)
    for bar, val in zip(bars2, df['train_time_sec']):
        axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                     f'{val:.1f}s', ha='center', va='bottom')
    
    # Plot 3: Final Loss
    bars3 = axes[2].bar(df['quantization'], df['final_loss'], color=colors[:len(df)])
    axes[2].set_title('Final Training Loss', fontsize=12)
    axes[2].set_xlabel('Quantization Method')
    axes[2].set_ylabel('Loss')
    axes[2].tick_params(axis='x', rotation=45)
    for bar, val in zip(bars3, df['final_loss']):
        axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                     f'{val:.4f}', ha='center', va='bottom')
    
    plt.tight_layout()
    os.makedirs('outputs_qlora_quant_think', exist_ok=True)  # savefig does not create directories
    plt.savefig('outputs_qlora_quant_think/quantization_comparison.png', dpi=150)
    plt.show()
    
    print("\nVisualization saved to outputs_qlora_quant_think/quantization_comparison.png")
else:
    print("No successful results to visualize.")

## Analysis and Key Findings

### Memory Usage

| Method | Typical Memory | Savings vs FP16 |
|--------|---------------|------------------|
| 4-bit NF4 | ~3-4 GB | ~70-75% |
| 8-bit INT8 | ~5-6 GB | ~50% |
| FP16 | ~10-12 GB | baseline |
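
The savings column uses the same formula as the results cell: one minus the ratio to the FP16 figure. A small sketch of that calculation, using illustrative midpoints of the ranges in the table rather than measurements from this notebook (the function name is hypothetical):

```python
# Percent memory saved relative to a baseline measurement.
# The numbers below are ILLUSTRATIVE midpoints of the ranges in the table,
# not measurements from this notebook.
def savings_pct(memory_gb: float, baseline_gb: float) -> float:
    """Return the percent reduction of memory_gb relative to baseline_gb."""
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return 100 * (1 - memory_gb / baseline_gb)

typical_gb = {"4bit_nf4": 3.5, "8bit": 5.5, "fp16": 11.0}
baseline = typical_gb["fp16"]

for name, gb in typical_gb.items():
    if name != "fp16":
        print(f"{name}: {savings_pct(gb, baseline):.1f}% reduction")
```

Because the result depends on the baseline, savings should only be compared across runs measured on the same GPU and with the same training settings.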

### Training Performance

**4-bit NF4:**
- Fastest loading (pre-quantized model)
- May have slightly slower per-step training due to dequantization
- Enables training on smaller GPUs

**8-bit INT8:**
- Middle ground memory usage
- Better precision than 4-bit
- Good balance for quality-constrained applications

**FP16:**
- Highest memory usage
- Fastest per-step training (no dequantization overhead)
- Best for maximum quality when GPU memory allows
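
The dequantization overhead mentioned above comes from converting stored low-bit codes back to floating point before they are used in a matrix multiply. A toy sketch of that round trip, assuming simple symmetric absmax quantization with evenly spaced levels. This is not the NF4 codebook or the bitsandbytes implementation, and the function names are hypothetical:

```python
# Toy symmetric absmax quantization of one block of weights.
# ASSUMPTION: evenly spaced integer levels; real NF4 uses a non-uniform
# codebook, and bitsandbytes performs these steps in GPU kernels.
from typing import List, Tuple

def quantize_block(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integer codes plus one per-block scale."""
    max_code = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / max_code if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]

def max_abs_error(weights: List[float], bits: int) -> float:
    """Largest round-trip error for one block at the given bit width."""
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))

block = [0.70, -0.32, 0.10, 0.02]
for bits in (4, 8):
    print(f"{bits}-bit: max abs error = {max_abs_error(block, bits):.4f}")
```

The sketch also illustrates the precision point in the 8-bit list: with more levels available, the round-trip error on the same block is smaller.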

### Quality Comparison

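Final loss differences between quantization methods are expected to be small (<2%). With only 3 training steps on 5 samples, this notebook checks that training runs under each method rather than measuring quality, so whether the thinking behavior of Qwen3-4B-Thinking is preserved should be confirmed by inspecting the sampled response for each method.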
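
The <2% threshold can be checked mechanically once losses are recorded. A minimal sketch, assuming FP16 is the reference; the loss values are hypothetical placeholders, not results from this notebook, and the helper name is made up for illustration:

```python
# Check whether each method's final loss is within a tolerance of the baseline.
# The loss values below are HYPOTHETICAL placeholders, not notebook results.
def relative_loss_diff_pct(loss: float, baseline_loss: float) -> float:
    """Signed percent difference of loss relative to baseline_loss."""
    if baseline_loss == 0:
        raise ValueError("baseline_loss must be non-zero")
    return 100 * (loss - baseline_loss) / baseline_loss

hypothetical_losses = {"fp16": 2.000, "8bit": 2.010, "4bit_nf4": 2.030}
TOLERANCE_PCT = 2.0  # the <2% threshold stated in this section

baseline = hypothetical_losses["fp16"]
for name, loss in hypothetical_losses.items():
    diff = relative_loss_diff_pct(loss, baseline)
    verdict = "within" if abs(diff) < TOLERANCE_PCT else "outside"
    print(f"{name}: {diff:+.2f}% ({verdict} tolerance)")
```

In this notebook the inputs would come from the `final_loss` field of each entry in `results`.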

### Recommendations

| GPU Memory | Recommended | Notes |
|------------|-------------|-------|
| <8 GB | 4-bit NF4 | Only viable option |
| 8-16 GB | 4-bit NF4 or 8-bit | Choose based on quality needs |
| >16 GB | FP16 or 8-bit | Maximum quality possible |
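
The table can be encoded as a small lookup for use in setup scripts. A sketch under the assumption that exactly 8 GB and exactly 16 GB fall in the middle row, which the table does not specify; the function name is hypothetical:

```python
# Encode the recommendation table as a lookup.
# ASSUMPTION: exactly 8 GB and exactly 16 GB are placed in the middle row;
# the table does not say how boundary values are handled.
def recommend_quantization(gpu_memory_gb: float) -> str:
    """Return the table's recommendation for a GPU with the given memory."""
    if gpu_memory_gb <= 0:
        raise ValueError("gpu_memory_gb must be positive")
    if gpu_memory_gb < 8:
        return "4-bit NF4"
    if gpu_memory_gb <= 16:
        return "4-bit NF4 or 8-bit"
    return "FP16 or 8-bit"

for gb in (6, 12, 24):
    print(f"{gb} GB -> {recommend_quantization(gb)}")
```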

### Key Insight
4-bit NF4 quantization (QLoRA) provides an excellent trade-off: ~70% memory reduction with minimal quality loss, making it the go-to choice for most fine-tuning scenarios.

In [None]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)