# Lab 3.1: GPTQ Post-Training Quantization - Inference and Comparison

**Goal:** Compare the quantized model's output quality and performance against the FP16 baseline.

**Metrics:**
- **Output Quality**: Side-by-side text generation comparison
- **Latency**: Single inference time (milliseconds)
- **Throughput**: Tokens generated per second
- **Memory**: GPU memory usage (GB)
- **Perplexity**: Language modeling performance on test set

**Expected results** (approximate; they vary with GPU, kernel backend, and batch size):
- Latency: ~2.8x faster than FP16
- Memory: ~3x reduction
- Perplexity: +0.1-0.2 increase (minimal)
- Output quality: Hard to tell apart from FP16 by eye for most tasks
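
The memory expectation can be sanity-checked with simple arithmetic. The sketch below estimates weight storage only, assuming roughly 6.74 billion parameters for a 7B-class Llama-2 model (an assumption, not a value measured in this lab). The helper name `estimate_weight_gb` is hypothetical.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: approximate parameter count of a 7B-class Llama-2 model.
N_PARAMS = 6.74e9

fp16_gb = estimate_weight_gb(N_PARAMS, 16)
int4_gb = estimate_weight_gb(N_PARAMS, 4)

print(f"FP16 weights: {fp16_gb:.2f} GB")
print(f"INT4 weights: {int4_gb:.2f} GB (lower bound)")
print(f"Ideal ratio:  {fp16_gb / int4_gb:.1f}x")
```

The ideal ratio is an upper bound. A real 4-bit checkpoint also stores per-group scales and zero points and usually keeps the embeddings in FP16, and the GPU holds runtime buffers on top of the weights, which is why the measured reduction is expected to be closer to 3x.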

---

## Step 1: Load Both Models

We'll load both the FP16 baseline and INT4 quantized models for comparison.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import gc

# Model paths
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
QUANTIZED_MODEL_PATH = "./llama-2-7b-gptq-4bit"

print("=" * 70)
print("Model Comparison: FP16 vs GPTQ INT4")
print("=" * 70)
print()

# Load tokenizer (same for both)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
print("✅ Tokenizer loaded")

In [None]:
# Load FP16 baseline model
print("\n📥 Loading FP16 baseline model...")
print("   (This requires ~15GB GPU memory)")

model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

memory_fp16 = torch.cuda.memory_allocated() / 1e9
print(f"✅ FP16 model loaded")
print(f"   Memory: {memory_fp16:.2f} GB")

In [None]:
# Clear memory and load quantized model
del model_fp16
gc.collect()
torch.cuda.empty_cache()

print("\n📥 Loading GPTQ quantized model...")
model_gptq = AutoModelForCausalLM.from_pretrained(
    QUANTIZED_MODEL_PATH,
    device_map="auto",
    trust_remote_code=True
)

memory_gptq = torch.cuda.memory_allocated() / 1e9
print(f"✅ GPTQ model loaded")
print(f"   Memory: {memory_gptq:.2f} GB")
print(f"   Memory reduction: {memory_fp16 / memory_gptq:.2f}x")

---
## Step 2: Define Test Prompts

We'll use diverse prompts to test different capabilities:
- Creative writing
- Factual knowledge
- Reasoning
- Code generation

In [None]:
test_prompts = [
    {
        "name": "Creative Writing",
        "prompt": "Once upon a time in a land of endless possibilities,",
        "max_tokens": 80
    },
    {
        "name": "Factual Knowledge",
        "prompt": "Explain the theory of relativity in simple terms:",
        "max_tokens": 100
    },
    {
        "name": "Reasoning",
        "prompt": "If a train leaves Station A at 3 PM traveling at 60 mph, and another train leaves Station B at 4 PM traveling at 80 mph toward Station A, which is 300 miles away, when will they meet? Solution:",
        "max_tokens": 150
    },
    {
        "name": "Code Generation",
        "prompt": "Write a Python function to calculate the Fibonacci sequence:\n\n```python\n",
        "max_tokens": 120
    },
    {
        "name": "Conversation",
        "prompt": "Human: What are the benefits of artificial intelligence?\nAssistant:",
        "max_tokens": 100
    }
]

print(f"✅ Prepared {len(test_prompts)} test prompts")
for i, test in enumerate(test_prompts, 1):
    print(f"   {i}. {test['name']}")
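
For the reasoning prompt it helps to have a reference answer to check both models against. The sketch below works it out under one reading of the prompt: the stations are 300 miles apart and the trains travel toward each other. The calendar date in the code is arbitrary; only the time of day matters.

```python
from datetime import datetime, timedelta

DISTANCE_MILES = 300
SPEED_A_MPH = 60
SPEED_B_MPH = 80

# Train A travels alone for one hour (3 PM to 4 PM).
head_start_miles = SPEED_A_MPH * 1
remaining_miles = DISTANCE_MILES - head_start_miles    # gap at 4 PM
closing_speed_mph = SPEED_A_MPH + SPEED_B_MPH          # trains approach each other
hours_after_4pm = remaining_miles / closing_speed_mph

# Arbitrary date; 16:00 is the moment Train B departs.
meet_time = datetime(2024, 1, 1, 16, 0) + timedelta(hours=hours_after_4pm)
print(f"Gap at 4 PM: {remaining_miles} miles")
print(f"Time to meet after 4 PM: {hours_after_4pm:.3f} hours")
print(f"Meeting time: {meet_time:%H:%M}")
```

Under this reading the trains meet a little under 1 hour 43 minutes after 4 PM, so an output that lands far from that is a reasoning error rather than a matter of phrasing.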

---
## Step 3: Side-by-Side Output Comparison

Let's generate outputs from both models and compare them side by side.

In [None]:
def generate_with_model(model, prompt, max_new_tokens=100, temperature=0.8):
    """
    Generate text using the given model
    
    Returns: (generated_text, latency_ms, tokens_per_sec)
    """
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Warmup (first inference is slower)
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10)
    
    # Actual generation with timing.
    # Seed the sampler so repeated runs of this cell are reproducible.
    torch.manual_seed(0)
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            repetition_penalty=1.1
        )
    
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    
    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Metrics
    latency_ms = (end_time - start_time) * 1000
    num_tokens = len(outputs[0]) - len(inputs['input_ids'][0])
    tokens_per_sec = num_tokens / (end_time - start_time)
    
    return generated_text, latency_ms, tokens_per_sec

print("✅ Generation function defined")
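
A single timed run is noisy, so one option is to repeat the measurement and report the median. The following is a minimal sketch of that pattern using only the standard library; `benchmark` is a hypothetical helper, and the `sum(range(...))` workload is a stand-in for a `model.generate` call followed by `torch.cuda.synchronize()`.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=5):
    """Time fn() several times and return (median_ms, min_ms, max_ms)."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples_ms), min(samples_ms), max(samples_ms)


# Stand-in workload; replace with a closure around the real generation call.
median_ms, min_ms, max_ms = benchmark(lambda: sum(range(100_000)))
print(f"median {median_ms:.3f} ms (min {min_ms:.3f}, max {max_ms:.3f})")
```

The median is less sensitive than the mean to an occasional slow run, for example one caused by another process briefly using the GPU.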

In [None]:
# Reload FP16 model for comparison.
# Both models are now resident on the GPU at once (roughly 15 GB + 5 GB).
print("📥 Loading FP16 model for comparison...")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
print("✅ FP16 model loaded")

In [None]:
# Run comparison for all test prompts
results = []

for test in test_prompts:
    print("\n" + "=" * 70)
    print(f"Test: {test['name']}")
    print("=" * 70)
    print(f"Prompt: {test['prompt'][:100]}..." if len(test['prompt']) > 100 else f"Prompt: {test['prompt']}")
    print()
    
    # Generate with FP16
    print("📝 FP16 Model:")
    text_fp16, latency_fp16, tps_fp16 = generate_with_model(
        model_fp16, test['prompt'], test['max_tokens']
    )
    print(f"Output: {text_fp16}")
    print(f"⏱️  Latency: {latency_fp16:.2f} ms | Tokens/s: {tps_fp16:.2f}")
    print()
    
    # Generate with GPTQ
    print("📝 GPTQ Model:")
    text_gptq, latency_gptq, tps_gptq = generate_with_model(
        model_gptq, test['prompt'], test['max_tokens']
    )
    print(f"Output: {text_gptq}")
    print(f"⏱️  Latency: {latency_gptq:.2f} ms | Tokens/s: {tps_gptq:.2f}")
    print()
    
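    # Speedup from the latency ratio. With sampling, the two models may stop
    # at different lengths, so also compare the tokens/s figures above.
    speedup = latency_fp16 / latency_gptq
    print(f"🚀 Speedup: {speedup:.2f}x (FP16 latency / GPTQ latency)")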
    
    # Store results
    results.append({
        "test": test['name'],
        "latency_fp16": latency_fp16,
        "latency_gptq": latency_gptq,
        "tps_fp16": tps_fp16,
        "tps_gptq": tps_gptq,
        "speedup": speedup
    })

print("\n" + "=" * 70)
print("✅ All comparisons complete!")
print("=" * 70)
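
Reading two outputs side by side is subjective, so a rough numeric similarity score can complement it. The sketch below uses `difflib.SequenceMatcher` from the standard library over whitespace-separated tokens. The two strings are invented for illustration and are not model outputs, and the helper names are hypothetical. With sampling enabled, much of the difference between outputs comes from randomness, so a score like this is most informative under greedy decoding.

```python
from difflib import SequenceMatcher


def token_similarity(text_a: str, text_b: str) -> float:
    """Return a similarity ratio in [0, 1] over whitespace-separated tokens."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def shared_prefix_tokens(text_a: str, text_b: str) -> int:
    """Count how many leading tokens the two texts have in common."""
    count = 0
    for tok_a, tok_b in zip(text_a.split(), text_b.split()):
        if tok_a != tok_b:
            break
        count += 1
    return count


# Illustrative strings only (not produced by either model).
fp16_out = "The theory of relativity says that time and space are linked."
gptq_out = "The theory of relativity says that space and time are linked."

similarity = token_similarity(fp16_out, gptq_out)
prefix_len = shared_prefix_tokens(fp16_out, gptq_out)
print(f"Token similarity: {similarity:.2f}")
print(f"Shared prefix:    {prefix_len} tokens")
```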

---
## Step 4: Performance Summary

Let's aggregate the results and show overall performance metrics.

In [None]:
import pandas as pd

# Create results dataframe
df_results = pd.DataFrame(results)

# Calculate statistics
avg_speedup = df_results['speedup'].mean()
avg_tps_fp16 = df_results['tps_fp16'].mean()
avg_tps_gptq = df_results['tps_gptq'].mean()
avg_latency_fp16 = df_results['latency_fp16'].mean()
avg_latency_gptq = df_results['latency_gptq'].mean()

print("=" * 70)
print("Performance Summary")
print("=" * 70)
print()
print("📊 Average Metrics Across All Tests:")
print(f"   FP16 Latency:  {avg_latency_fp16:.2f} ms")
print(f"   GPTQ Latency:  {avg_latency_gptq:.2f} ms")
print(f"   Speedup:       {avg_speedup:.2f}x")
print()
print(f"   FP16 Throughput:  {avg_tps_fp16:.2f} tokens/s")
print(f"   GPTQ Throughput:  {avg_tps_gptq:.2f} tokens/s")
print(f"   Throughput Gain:  {avg_tps_gptq / avg_tps_fp16:.2f}x")
print()
print("💾 Memory Usage:")
print(f"   FP16: {memory_fp16:.2f} GB")
print(f"   GPTQ: {memory_gptq:.2f} GB")
print(f"   Memory Reduction: {memory_fp16 / memory_gptq:.2f}x")
print("=" * 70)

In [None]:
# Detailed per-test results
print("\n📋 Detailed Results by Test:")
print()

# Format table
df_display = df_results[['test', 'latency_fp16', 'latency_gptq', 'speedup', 'tps_gptq']].copy()
df_display.columns = ['Test', 'FP16 Latency (ms)', 'GPTQ Latency (ms)', 'Speedup', 'GPTQ Tokens/s']
df_display = df_display.round(2)

print(df_display.to_string(index=False))
print()
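
How the per-prompt numbers are aggregated changes the headline figure. The sketch below contrasts three options on hypothetical measurements (the values in `runs` are invented for illustration): the arithmetic mean of latency ratios, their geometric mean, and a throughput ratio computed from total tokens over total time. In the second run the quantized model is assumed to stop early at an end-of-sequence token, which inflates its latency ratio.

```python
import statistics

# Hypothetical measurements: latency in ms and number of generated tokens.
runs = [
    {"fp16_ms": 4000.0, "gptq_ms": 2000.0, "fp16_tokens": 80, "gptq_tokens": 80},
    {"fp16_ms": 5000.0, "gptq_ms": 1000.0, "fp16_tokens": 100, "gptq_tokens": 40},
]

latency_ratios = [r["fp16_ms"] / r["gptq_ms"] for r in runs]
mean_speedup = statistics.mean(latency_ratios)
geomean_speedup = statistics.geometric_mean(latency_ratios)


def total_tokens_per_sec(prefix: str) -> float:
    """Total generated tokens divided by total wall time for one model."""
    tokens = sum(r[f"{prefix}_tokens"] for r in runs)
    seconds = sum(r[f"{prefix}_ms"] for r in runs) / 1000
    return tokens / seconds


throughput_speedup = total_tokens_per_sec("gptq") / total_tokens_per_sec("fp16")

print(f"Mean of latency ratios:      {mean_speedup:.2f}x")
print(f"Geometric mean of ratios:    {geomean_speedup:.2f}x")
print(f"Throughput ratio (tokens/s): {throughput_speedup:.2f}x")
```

When the two models generate different numbers of tokens, the throughput ratio is the fairer comparison, because a shorter output lowers latency without the model being any faster per token.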

---
## Step 5: Perplexity Evaluation (Optional)

Perplexity is a standard metric for language modeling quality: the exponential of the average per-token negative log-likelihood on held-out text. Lower is better.

**Note**: This requires loading a test dataset and is computationally expensive (~5-10 minutes).
Skip this cell if you want to save time.
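
When perplexity is computed over many sequences, the per-sequence mean losses need to be weighted by the number of predicted tokens before exponentiating. The sketch below shows that aggregation with hypothetical loss values; `perplexity` is an illustrative helper, not part of any library. For a causal language model that receives its own inputs as labels, a sequence of length `n` contributes `n - 1` predicted tokens, since the first token has no preceding context.

```python
import math


def perplexity(sequence_stats):
    """Compute perplexity from (mean_loss, n_predicted_tokens) pairs."""
    stats = list(sequence_stats)
    total_nll = sum(mean_loss * n for mean_loss, n in stats)
    total_tokens = sum(n for _, n in stats)
    return math.exp(total_nll / total_tokens)


# Hypothetical per-sequence results: (mean cross-entropy loss, predicted tokens).
stats = [(2.0, 10), (1.0, 30)]

ppl = perplexity(stats)
naive_ppl = math.exp(sum(loss for loss, _ in stats) / len(stats))

print(f"Token-weighted perplexity: {ppl:.3f}")
print(f"Unweighted (biased):       {naive_ppl:.3f}")
```

The unweighted version gives short sequences the same influence as long ones, which is why the two numbers differ.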

In [None]:
# Optional: Evaluate perplexity on WikiText-2
# Uncomment to run (takes 5-10 minutes)

# from datasets import load_dataset
# from tqdm import tqdm

# def calculate_perplexity(model, tokenizer, dataset_name="wikitext", dataset_config="wikitext-2-raw-v1", max_samples=100):
#     """
#     Calculate perplexity on a test dataset
#     """
#     # Load dataset
#     dataset = load_dataset(dataset_name, dataset_config, split="test")
#     
#     # Filter out empty texts
#     dataset = dataset.filter(lambda x: len(x['text'].strip()) > 0)
#     
#     # Limit samples
#     dataset = dataset.select(range(min(max_samples, len(dataset))))
#     
#     total_loss = 0
#     total_tokens = 0
#     
#     model.eval()
#     with torch.no_grad():
#         for example in tqdm(dataset, desc="Calculating PPL"):
#             inputs = tokenizer(example['text'], return_tensors="pt", truncation=True, max_length=512).to(model.device)
#             n_predicted = inputs['input_ids'].size(1) - 1  # loss is averaged over the shifted targets
#             if n_predicted < 1:
#                 continue
#             outputs = model(**inputs, labels=inputs['input_ids'])
#             total_loss += outputs.loss.item() * n_predicted
#             total_tokens += n_predicted
#     
#     avg_loss = total_loss / total_tokens
#     perplexity = torch.exp(torch.tensor(avg_loss)).item()
#     
#     return perplexity

# print("\n📊 Evaluating Perplexity (this may take 5-10 minutes)...")
# ppl_fp16 = calculate_perplexity(model_fp16, tokenizer, max_samples=100)
# ppl_gptq = calculate_perplexity(model_gptq, tokenizer, max_samples=100)

# print(f"\n✅ Perplexity Results:")
# print(f"   FP16: {ppl_fp16:.2f}")
# print(f"   GPTQ: {ppl_gptq:.2f}")
# print(f"   Difference: {ppl_gptq - ppl_fp16:+.2f} ({(ppl_gptq - ppl_fp16) / ppl_fp16 * 100:+.2f}%)")

print("⏭️  Perplexity evaluation skipped (uncomment to run)")

---
## Step 6: Quality Assessment

Let's perform a qualitative assessment of output quality.

In [None]:
print("=" * 70)
print("Qualitative Assessment")
print("=" * 70)
print()
print("🔍 Output Quality Analysis:")
print()
print("✅ Expected Observations:")
print("   1. Outputs should be coherent and grammatically correct")
print("   2. Minor variations are normal (due to sampling and quantization noise)")
print("   3. No repetitive text or gibberish")
print("   4. Factual knowledge should be preserved")
print("   5. Reasoning capabilities should be largely intact")
print()
print("⚠️  Potential Issues to Watch For:")
print("   - Repetitive phrases (may indicate quantization degradation)")
print("   - Nonsensical outputs (rare with 4-bit GPTQ)")
print("   - Hallucinations (should be similar to FP16)")
print()
print("📋 Assessment:")
print("   Based on the side-by-side comparisons above,")
print("   the GPTQ quantized model should produce outputs")
print("   that are visually indistinguishable from FP16")
print("   for most practical applications.")
print("=" * 70)
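
The repetition check in the list above can be made semi-automatic. One common heuristic is the ratio of distinct n-grams to total n-grams: a value near 1.0 means little repetition, while a low value suggests the output is looping. The sketch below is a minimal version over whitespace tokens; `distinct_ngram_ratio` is a hypothetical helper and both example strings are invented.

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of n-grams in text that are unique (1.0 = no repeats)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)


# Illustrative strings only (not produced by either model).
healthy = "the quick brown fox jumps over the lazy dog near the river"
looping = "the cat sat the cat sat the cat sat the cat sat"

healthy_ratio = distinct_ngram_ratio(healthy)
looping_ratio = distinct_ngram_ratio(looping)

print(f"healthy: {healthy_ratio:.2f}")
print(f"looping: {looping_ratio:.2f}")
```

Comparing this ratio between the FP16 and GPTQ outputs for the same prompt gives a quick signal of whether quantization has made the model more prone to loops.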

---
## Step 7: Final Comparison Table

Comprehensive comparison of all metrics.

In [None]:
# Create comprehensive comparison
comparison_data = {
    "Metric": [
        "Model Size (GB, nominal)",
        "GPU Memory (GB)",
        "Avg Latency (ms)",
        "Avg Throughput (tok/s)",
        "Output Quality",
        "Perplexity (reference, not measured)"
    ],
    "FP16 Baseline": [
        "13.5",
        f"{memory_fp16:.1f}",
        f"{avg_latency_fp16:.2f}",
        f"{avg_tps_fp16:.2f}",
        "Reference",
        "~5.68 (typical)"
    ],
    "GPTQ INT4": [
        "3.5",
        f"{memory_gptq:.1f}",
        f"{avg_latency_gptq:.2f}",
        f"{avg_tps_gptq:.2f}",
        "Comparable",
        "~5.85 (est.)"
    ],
    "Ratio/Diff": [
        "3.86x smaller",
        f"{memory_fp16 / memory_gptq:.2f}x less",
        f"{avg_speedup:.2f}x faster",
        f"{avg_tps_gptq / avg_tps_fp16:.2f}x higher",
        "~Equal",
        "+0.17 (+3%)"
    ]
}

df_comparison = pd.DataFrame(comparison_data)

print("\n" + "=" * 70)
print("Final Comparison: FP16 vs GPTQ INT4")
print("=" * 70)
print()
print(df_comparison.to_string(index=False))
print()
print("=" * 70)
print("✅ GPTQ quantization achieves excellent compression-performance tradeoff!")
print("=" * 70)

---
## 🎓 Key Findings

**Performance Gains** (typical values; confirm against your measurements above):
- ✅ **Inference Speed**: 2-3x faster than FP16
- ✅ **Memory Usage**: 3x reduction (15GB → 5GB)
- ✅ **Model Size**: 3.86x smaller (13.5GB → 3.5GB)
- ✅ **Throughput**: 2-3x more tokens per second

**Quality Preservation**:
- ✅ **Output Quality**: Visually indistinguishable from FP16
- ✅ **Perplexity**: Minimal increase (<0.2 points)
- ✅ **Coherence**: No repetition or gibberish
- ✅ **Factual Accuracy**: Knowledge retained

**Why GPTQ Works So Well**:
1. **Hessian-guided quantization**: Protects sensitive weights
2. **Error compensation**: Propagates errors to minimize accumulation
3. **Group quantization**: Balances precision and compression
4. **Activation ordering**: Improves grouping effectiveness
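
The effect of group quantization (item 3) can be illustrated on a toy weight row. The sketch below uses plain round-to-nearest 4-bit quantization, not GPTQ's Hessian-guided procedure with error compensation, so it only demonstrates why per-group scales help. All function names and the weight values are hypothetical.

```python
def quantize_dequantize(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group, then dequantize."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0:
        return list(values)  # constant group: nothing to quantize
    return [round((v - lo) / scale) * scale + lo for v in values]


def grouped_quantize_dequantize(values, group_size, bits=4):
    """Quantize each consecutive group with its own scale and offset."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy row: 16 small weights followed by 16 weights that are 10x larger.
weights = [0.01 * i for i in range(-8, 8)] + [0.1 * i for i in range(-8, 8)]

mse_whole_row = mse(weights, quantize_dequantize(weights))
mse_grouped = mse(weights, grouped_quantize_dequantize(weights, group_size=16))

print(f"One scale for the row: MSE = {mse_whole_row:.2e}")
print(f"One scale per group:   MSE = {mse_grouped:.2e}")
```

With a single scale, the large weights set the step size and the small weights are rounded coarsely. With per-group scales, each group gets a step size matched to its own range, at the cost of storing one scale and offset per group.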

**Trade-offs**:
- ⚠️ **Quantization time**: 20-40 minutes (one-time cost)
- ⚠️ **Calibration data**: Requires representative dataset
- ⚠️ **Hardware support**: ExLlama requires Ampere+ GPUs

**When to Use GPTQ**:
- ✅ Production deployment (cost reduction)
- ✅ Edge devices (limited memory)
- ✅ High-throughput services (batch inference)
- ⚠️ Use with caution for accuracy-critical tasks (e.g., medical diagnosis); validate on task-specific benchmarks first

---

**⏭️ Continue to**: [04-Benchmark.ipynb](./04-Benchmark.ipynb) for detailed performance analysis