# Lab 3.1: GPTQ Post-Training Quantization - Environment Setup

**Goal:** Prepare the environment for GPTQ quantization experiments.

**You will learn to:**
- Verify GPU and CUDA compatibility for quantization
- Install AutoGPTQ, Optimum, and related libraries
- Check ExLlama kernel support for accelerated inference
- Load a baseline FP16 model to establish performance benchmarks

---

## Why Environment Verification Matters

**GPTQ quantization has specific hardware and software requirements**:
- **GPU Memory**: The quantization process first loads the FP16 model (~13.5 GB of weights for Llama-2-7B, ~15 GB with runtime overhead)
- **CUDA Compute Capability**: ExLlama acceleration requires Ampere+ architecture (SM 8.0+)
- **Library Versions**: AutoGPTQ requires transformers>=4.35.0, torch>=2.0.0
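
The FP16 memory requirement is dominated by the model weights, and their size follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 6.74 billion parameters for Llama-2-7B (an approximation, not a value read from the checkpoint) and uses a hypothetical helper, `estimate_weight_gb`. It counts weights only and ignores activations, the CUDA context, and the scale and zero-point metadata that quantized formats add.

```python
def estimate_weight_gb(num_params: float, bits_per_param: int) -> float:
    """Return the approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

# Assumed parameter count for Llama-2-7B (approximate).
LLAMA2_7B_PARAMS = 6.74e9

fp16_gb = estimate_weight_gb(LLAMA2_7B_PARAMS, bits_per_param=16)
int4_gb = estimate_weight_gb(LLAMA2_7B_PARAMS, bits_per_param=4)

print(f"FP16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {int4_gb:.2f} GB (before quantization metadata)")
```

The gap between the weight-only estimate and the memory actually needed at run time is why the hardware check in Step 1 looks for 16 GB or more.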

**Time investment**: 5-10 minutes (one-time setup)

---
## Step 1: Hardware Verification

First, let's verify GPU availability and specifications. GPTQ quantization **requires a CUDA-capable GPU**.

In [None]:
# Check NVIDIA GPU status
!nvidia-smi

In [None]:
import torch

print("=" * 60)
print("GPU Configuration Check")
print("=" * 60)

# PyTorch and CUDA versions
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    # GPU details
    gpu_id = 0
    gpu_props = torch.cuda.get_device_properties(gpu_id)
    
    print(f"\n✅ GPU Detected:")
    print(f"   Name: {torch.cuda.get_device_name(gpu_id)}")
    print(f"   Total Memory: {gpu_props.total_memory / 1e9:.2f} GB")
    print(f"   Compute Capability: SM {gpu_props.major}.{gpu_props.minor}")
    
    # Check if ExLlama is supported (Ampere+ = SM 8.0+)
    if gpu_props.major >= 8:
        print(f"   ✅ ExLlama Support: YES (SM {gpu_props.major}.{gpu_props.minor} >= 8.0)")
    else:
        print(f"   ⚠️  ExLlama Support: NO (SM {gpu_props.major}.{gpu_props.minor} < 8.0)")
        print(f"      → Will use slower fallback kernels")
    
    # Memory recommendation
    if gpu_props.total_memory / 1e9 >= 16:
        print(f"   ✅ Memory: Sufficient for quantization (>= 16GB)")
    else:
        print(f"   ⚠️  Memory: Limited (<16GB). May need CPU offloading.")
else:
    print("\n❌ No GPU detected!")
    print("   GPTQ quantization requires a CUDA GPU.")
    print("   Please run on a machine with NVIDIA GPU.")

print("=" * 60)

---
## Step 2: Install Quantization Libraries

We'll install the core libraries for GPTQ quantization:

- **auto-gptq**: AutoGPTQ, a widely used GPTQ implementation (imported as `auto_gptq`)
- **optimum**: Hugging Face optimization toolkit (includes GPTQ integration)
- **transformers**: Version 4.35.0 or later, which includes quantization support
- **accelerate**: Device placement and CPU offloading (required for `device_map="auto"`)
- **datasets**: For loading calibration data

**Note**: This may take 3-5 minutes. We use `-q` for quiet mode.
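
Before installing, it can be useful to see which of these packages are already present. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical, and the names in `PACKAGES` are pip distribution names taken from the list above (plus `torch`).

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name: str):
    """Return the installed version string of a pip distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# pip distribution names, not import names: "auto-gptq" is imported as auto_gptq.
PACKAGES = ["auto-gptq", "optimum", "transformers", "accelerate", "datasets", "torch"]

for name in PACKAGES:
    found = installed_version(name)
    status = found if found is not None else "not installed"
    print(f"{name:<14} {status}")
```

Because this reads package metadata instead of importing the packages, it runs quickly and does not fail when a package is installed but broken.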

In [None]:
# Install core quantization libraries
!pip install -q auto-gptq optimum  # GPTQ quantization engine
!pip install -q "transformers>=4.35.0"  # Quoted so the shell does not treat > as a redirect
!pip install -q accelerate datasets  # Training and data utilities

print("✅ Installation complete!")

---
## Step 3: Verify Library Versions

Let's verify all libraries are installed correctly with compatible versions.
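
Versions have to be compared numerically, component by component. A plain string comparison ranks `"4.9.0"` above `"4.35.0"`, which is why the cell below relies on `packaging.version`. The following standard-library sketch illustrates the same idea with two hypothetical helpers, `release_tuple` and `meets_minimum`. It is deliberately simplified: local suffixes such as `+cu118` and pre-release tags are dropped, not ordered.

```python
import re

def release_tuple(version_string: str) -> tuple:
    """Parse the leading 'X.Y.Z' part of a version into a tuple of ints."""
    public = version_string.split("+")[0]  # Drop local suffix such as '+cu118'
    match = re.match(r"\d+(\.\d+)*", public)
    if match is None:
        raise ValueError(f"Unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(current: str, required: str) -> bool:
    """Return True if current >= required, comparing numeric components."""
    cur, req = release_tuple(current), release_tuple(required)
    width = max(len(cur), len(req))
    # Pad with zeros so that '4.35' and '4.35.0' compare as equal.
    cur += (0,) * (width - len(cur))
    req += (0,) * (width - len(req))
    return cur >= req

print("4.9.0" >= "4.35.0")                    # String comparison: misleading
print(meets_minimum("4.9.0", "4.35.0"))       # Numeric comparison
print(meets_minimum("2.1.0+cu118", "2.0.0"))  # Local suffix ignored
```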

In [None]:
import transformers
import accelerate
import datasets
import optimum

# The pip package is named auto-gptq; the import name is auto_gptq
try:
    import auto_gptq
    gptq_version = auto_gptq.__version__
except ImportError:
    gptq_version = "Not installed or import failed"

print("=" * 60)
print("Library Version Check")
print("=" * 60)
print(f"PyTorch:      {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"Accelerate:   {accelerate.__version__}")
print(f"Datasets:     {datasets.__version__}")
print(f"Optimum:      {optimum.__version__}")
print(f"AutoGPTQ:     {gptq_version}")
print("=" * 60)

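# Version checks
from packaging import version

def check_version(name, current, required):
    """Print a pass/warn line and return True if current >= required."""
    ok = version.parse(current) >= version.parse(required)
    if ok:
        print(f"✅ {name}: {current} >= {required}")
    else:
        print(f"⚠️  {name}: {current} < {required} (may cause issues)")
    return ok

results = [
    check_version("Transformers", transformers.__version__, "4.35.0"),
    check_version("PyTorch", torch.__version__.split("+")[0], "2.0.0"),
]

# Report success only if every check actually passed
if all(results):
    print("\n✅ All libraries verified!")
else:
    print("\n⚠️  Some version requirements are not met. See the warnings above.")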

---
## Step 4: Check ExLlama Kernel Support

ExLlama is a highly optimized CUDA kernel for GPTQ inference. It provides:
- **20-50% faster inference** compared to default kernels
- **Lower memory usage** through optimized matrix multiplication

**Requirements**:
- NVIDIA GPU with Compute Capability >= 8.0 (Ampere+)
- Examples: A100, A10G, RTX 3090/4090, RTX A6000
- **Not supported**: V100 (SM 7.0), T4 (SM 7.5)
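
The requirement above reduces to a comparison on the (major, minor) compute capability pair. The sketch below uses a hypothetical helper, `meets_sm_requirement`; the SM 8.0 threshold is the one stated in this lab, not a value queried from any library.

```python
def meets_sm_requirement(major: int, minor: int, required=(8, 0)) -> bool:
    """Compare a compute capability against a minimum, as (major, minor) tuples."""
    return (major, minor) >= required

# Compute capabilities of some GPUs named in this lab.
EXAMPLES = {
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "RTX 3090": (8, 6),
}

for name, (major, minor) in EXAMPLES.items():
    verdict = "ExLlama kernels" if meets_sm_requirement(major, minor) else "fallback kernels"
    print(f"{name:<9} SM {major}.{minor} -> {verdict}")
```

Comparing tuples avoids encoding the capability as a single integer. In a live session, the pair can be read with `torch.cuda.get_device_capability(0)`.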

In [None]:
# Check if ExLlama can be used
if torch.cuda.is_available():
    gpu_props = torch.cuda.get_device_properties(0)
    compute_capability = gpu_props.major * 10 + gpu_props.minor
    
    print("=" * 60)
    print("ExLlama Kernel Support Check")
    print("=" * 60)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Compute Capability: SM {gpu_props.major}.{gpu_props.minor} ({compute_capability})")
    print()
    
    if compute_capability >= 80:
        print("✅ ExLlama is SUPPORTED!")
        print("   → Expected inference speedup: 20-50%")
        print("   → Use: GPTQConfig(bits=4, use_exllama=True)")
    else:
        print("⚠️  ExLlama is NOT supported on this GPU")
        print(f"   → Your GPU: SM {gpu_props.major}.{gpu_props.minor} (need >= 8.0)")
        print("   → Will use standard CUDA kernels (slower)")
        print("   → Use: GPTQConfig(bits=4, use_exllama=False)")
    
    print("=" * 60)
else:
    print("❌ No GPU available for ExLlama check")

---
## Step 5: Load Baseline Model (FP16)

Let's load the baseline **Llama-2-7B** model in FP16 precision. This will:
- Establish a performance baseline for comparison
- Verify model loading works correctly
- Measure FP16 memory usage

**Note**: This requires **~15GB GPU memory**. If you encounter an out-of-memory (OOM) error:
- Use a smaller model (e.g., replace `meta-llama/Llama-2-7b-hf` with `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- Enable CPU offloading: `device_map="auto"` with `max_memory={0: "10GB", "cpu": "30GB"}`
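
The `max_memory` mapping gives each device a budget, and layers that do not fit on the GPU are placed on the CPU. The sketch below derives such a budget from the total GPU memory while leaving some headroom. The helper name `build_max_memory`, the 20% headroom, and the 30 GiB CPU budget are illustrative assumptions, not recommended values.

```python
def build_max_memory(gpu_total_bytes: int, headroom: float = 0.2, cpu_budget_gib: int = 30) -> dict:
    """Build a max_memory mapping for device_map="auto".

    Reserves a fraction of GPU 0 for activations and the CUDA context,
    and caps how much CPU RAM may hold offloaded layers.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_gib = int(gpu_total_bytes * (1.0 - headroom) / 1024**3)
    return {0: f"{usable_gib}GiB", "cpu": f"{cpu_budget_gib}GiB"}

# Example: a 12 GiB GPU. In the notebook, the total would come from
# torch.cuda.get_device_properties(0).total_memory.
max_memory = build_max_memory(12 * 1024**3)
print(max_memory)
```

The result can be passed as `max_memory=max_memory` together with `device_map="auto"` when calling `from_pretrained`. Offloaded layers run noticeably slower, so this is a fallback for small GPUs and not a default.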

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import gc

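# Model configuration
# meta-llama/Llama-2-7b-hf is a gated repository: accept the license on
# Hugging Face and authenticate (e.g., `huggingface-cli login`) before loading.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
# Alternative for limited GPU memory: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"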

print("=" * 60)
print(f"Loading Baseline Model: {MODEL_NAME}")
print("=" * 60)
print("⏳ This may take 1-3 minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("✅ Tokenizer loaded")

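# Load model in FP16
# trust_remote_code is omitted: Llama is supported natively by transformers,
# so there is no need to execute custom code from the model repository.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # FP16 precision
    device_map="auto",          # Automatic device placement (uses accelerate)
)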

print("✅ Model loaded in FP16")

# Memory usage
if torch.cuda.is_available():
    memory_allocated = torch.cuda.memory_allocated() / 1e9
    print(f"\n📊 GPU Memory Usage:")
    print(f"   Allocated: {memory_allocated:.2f} GB")
    print(f"   Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Model info
num_params = sum(p.numel() for p in model_fp16.parameters())
print(f"\n📝 Model Info:")
print(f"   Parameters: {num_params / 1e9:.2f}B")
print(f"   Precision: FP16 (2 bytes/param)")
print(f"   Estimated size: {num_params * 2 / 1e9:.2f} GB")

print("\n" + "=" * 60)
print("✅ Baseline model ready for quantization!")
print("=" * 60)

---
## Step 6: Test Baseline Inference

Let's perform a quick inference test to verify the model works correctly.
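
A single timed call, as in the cell below, also measures one-time costs such as CUDA kernel loading, so it tends to overstate steady-state latency. For more stable numbers, a small timing helper with warm-up runs and repeats can be used. The sketch below is one way to write it; `time_call` is a hypothetical name, and the workload is a pure-Python stand-in so the example runs without a GPU.

```python
import statistics
import time

def time_call(fn, warmup: int = 1, repeats: int = 3) -> dict:
    """Run fn() several times and return latency statistics in seconds.

    Warm-up runs are executed but not timed, so one-time setup costs
    do not distort the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
        "runs": len(samples),
    }

# Stand-in workload; replace with a function that runs generation.
stats = time_call(lambda: sum(range(100_000)), warmup=1, repeats=5)
print(f"mean {stats['mean'] * 1e3:.2f} ms over {stats['runs']} runs")
```

When timing GPU work this way, the timed function should end with `torch.cuda.synchronize()` so that asynchronous kernels have finished before the clock stops.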

In [None]:
# Test prompt
prompt = "The future of artificial intelligence is"

print("=" * 60)
print("Baseline Inference Test (FP16)")
print("=" * 60)
print(f"Prompt: {prompt}\n")

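# Tokenize input and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model_fp16.device)

# Generate
import time

if torch.cuda.is_available():
    torch.cuda.synchronize()  # Finish pending GPU work before starting the timer
start_time = time.perf_counter()

with torch.no_grad():
    outputs = model_fp16.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        top_p=0.9
    )

if torch.cuda.is_available():
    torch.cuda.synchronize()  # Wait for generation to finish before stopping the timer
latency = time.perf_counter() - start_time

# Decode output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Count only newly generated tokens; outputs[0] also contains the prompt tokens
num_new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]

print(f"Output: {generated_text}\n")
print(f"⏱️  Latency: {latency:.2f} seconds")
print(f"📊 Tokens/sec: {num_new_tokens / latency:.2f}")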
print("=" * 60)
print("✅ Inference test passed!")

---
## Step 7: Clean Up Memory (Optional)

Free GPU memory before proceeding to quantization.

In [None]:
# Optional: Clear memory if needed
# Uncomment if you need to free GPU memory

# del model_fp16
# gc.collect()
# torch.cuda.empty_cache()
# print("✅ Memory cleared")

---
## ✅ Setup Complete!

**Summary**:
- ✅ GPU verified (CUDA available)
- ✅ Quantization libraries installed
- ✅ ExLlama support checked
- ✅ Baseline FP16 model loaded
- ✅ Inference test passed

**Next Steps**:
1. Proceed to **02-Quantize.ipynb** to apply GPTQ quantization
2. Compare the quantized model (~3.5 GB) against the FP16 baseline (~13.5 GB)
3. Benchmark inference performance (2-4x speedup expected; actual gains depend on the GPU and kernels used)

**Troubleshooting**:
- **OOM during model loading**: Use TinyLlama-1.1B or enable CPU offloading
- **ExLlama not supported**: Use `use_exllama=False` in quantization config
- **Import errors**: Ensure transformers>=4.35.0 and torch>=2.0.0

---

**⏭️ Continue to**: [02-Quantize.ipynb](./02-Quantize.ipynb)