# Lab-2.2 Part 3: Quantization Inference

## Objectives
- Understand quantization fundamentals
- Implement INT8/INT4 quantization
- Compare GPTQ, AWQ, and BitsAndBytes
- Evaluate quality vs performance tradeoffs

## Estimated Time: 60-90 minutes

---
## 1. Quantization Fundamentals

### What is Quantization?

Quantization stores model weights (and, optionally, activations) in a lower-precision number format:

```
FP32 (32-bit):  ±3.4 × 10³⁸ range, high precision
FP16 (16-bit):  ±65,504 range, good precision  
INT8 (8-bit):   -128 to 127, reduced precision
INT4 (4-bit):   -8 to 7, minimal precision
```

**Benefits**:
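- 2-4x smaller model size than FP16 (INT8 to INT4)
- Often 1.5-3x faster inference when decoding is memory-bound, because fewer bytes are read per token; the actual gain depends on kernel and hardware support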
- Enables larger batch sizes
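
To make the precision loss concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization on a plain Python list. It is one common scheme, not necessarily the one a given library uses; the helper names `quantize_absmax_int8` and `dequantize` and the sample weights are made up for illustration.

```python
def quantize_absmax_int8(values):
    """Symmetric (absmax) quantization of a list of floats to INT8 codes."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.02, -0.51, 0.33, 1.27, -0.08]  # illustrative values
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max abs error:", max_error)
```

The rounding error per value is at most half the scale, so a single large outlier inflates the scale and hurts every other value. That is why practical methods use per-channel or per-group scales.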

In [None]:
# Imports
import torch
import time
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

### Quantization Comparison

| Method | Bits | Model Size (vs FP16) | Typical Speed (vs FP16) | Typical Quality (vs FP16) | Use Case |
|--------|------|----------------------|-------------------------|---------------------------|----------|
| FP16 | 16 | 100% | 1.0x | 100% | Standard |
| INT8 | 8 | 50% | 1.5-2x | 98-99% | Balanced |
| INT4 | 4 | 25% | 2-3x | 95-97% | Resource-constrained |
| FP8 (H100) | 8 | 50% | 2-3x | 99-99.5% | Latest GPUs |
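
The speed figures above follow from a simple bandwidth argument: during single-stream decoding, every generated token requires reading all weights once, so halving the bytes per weight roughly doubles the upper bound on tokens per second. The sketch below makes that estimate explicit. The 7B parameter count and the 1000 GB/s bandwidth are hypothetical inputs, and the estimate ignores KV-cache reads, compute time, and dequantization overhead.

```python
def decode_tokens_per_second(num_params_b, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    weight_gb = num_params_b * bytes_per_param  # 1e9 params * bytes/param = GB
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical GPU memory bandwidth
NUM_PARAMS_B = 7.0       # hypothetical 7B-parameter model

for name, nbytes in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    tps = decode_tokens_per_second(NUM_PARAMS_B, nbytes, BANDWIDTH_GB_S)
    print(f"{name}: <= {tps:.0f} tokens/s")
```

Measured speedups are usually smaller than this bound, which is why the table lists ranges rather than exact 2x and 4x factors.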

In [None]:
# Model size comparison
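def calculate_model_size(num_params_b: float, precision_bytes: float) -> float:
    """Return the weight size in GB for a model with `num_params_b` billion parameters."""
    # 1e9 parameters * bytes per parameter = GB (precision_bytes is 0.5 for INT4)
    return num_params_b * precision_bytes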

# Llama-2-7B
num_params = 7.0  # Billion

sizes = {
    'FP32': calculate_model_size(num_params, 4),
    'FP16': calculate_model_size(num_params, 2),
    'INT8': calculate_model_size(num_params, 1),
    'INT4': calculate_model_size(num_params, 0.5),
}

print("Model Size Comparison (Llama-2-7B)")
print("=" * 60)
for precision, size in sizes.items():
    ratio = size / sizes['FP16']
    print(f"{precision:6s}: {size:5.2f} GB ({ratio:.2f}x the FP16 size)")
print("=" * 60)

# Visualize
plt.figure(figsize=(10, 6))
precisions = list(sizes.keys())
size_values = list(sizes.values())
colors = ['#ff6b6b', '#ffd93d', '#51cf66', '#4dabf7']

bars = plt.bar(precisions, size_values, color=colors)
plt.ylabel('Model Size (GB)')
plt.title('Model Size vs Precision (Llama-2-7B)')
plt.grid(axis='y', alpha=0.3)

for bar, size in zip(bars, size_values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
             f'{size:.1f} GB', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()
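
The 0.5 bytes per parameter used for INT4 above comes from storing two 4-bit values in each byte. The sketch below shows one possible packing on plain Python integers. `pack_int4` and `unpack_int4` are hypothetical helpers; real libraries use their own, kernel-specific layouts.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Keep the two's-complement low 4 bits of each value.
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


codes = [-8, 7, 0, -1, 3]
packed = pack_int4(codes)
print(len(codes), "values stored in", len(packed), "bytes")
```

Unpacking `packed` with the original count returns the same codes, so the packing itself is lossless; the precision loss happens earlier, when floats are rounded to 16 levels.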

---
## 2. INT8 Quantization with BitsAndBytes

In [None]:
# Check if bitsandbytes is installed
try:
    import bitsandbytes as bnb
    print(f"✅ bitsandbytes: {bnb.__version__}")
except ImportError:
    print("❌ bitsandbytes not installed")
    print("Install: pip install bitsandbytes")

In [None]:
# Load FP16 model (baseline)
MODEL_NAME = "facebook/opt-1.3b"

print(f"Loading FP16 baseline: {MODEL_NAME}...")
start = time.time()

model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

load_time_fp16 = time.time() - start

# Check model size
fp16_memory = model_fp16.get_memory_footprint() / 1e9

print(f"✅ Loaded in {load_time_fp16:.2f}s")
print(f"   Memory: {fp16_memory:.2f} GB")

In [None]:
# Load INT8 quantized model
print(f"\nLoading INT8 quantized: {MODEL_NAME}...")
start = time.time()

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
)

load_time_int8 = time.time() - start
int8_memory = model_int8.get_memory_footprint() / 1e9

print(f"✅ Loaded in {load_time_int8:.2f}s")
print(f"   Memory: {int8_memory:.2f} GB")
print(f"   Reduction: {fp16_memory/int8_memory:.1f}x smaller")
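
The `llm_int8_threshold=6.0` setting used above controls outlier handling in LLM.int8(): hidden-state values whose magnitude exceeds the threshold are treated as outliers and computed in FP16, while the rest go through the INT8 path. The toy sketch below illustrates only that split, on a plain Python matrix. `split_outlier_columns` is a hypothetical helper, not a bitsandbytes API, and the activation values are made up.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Split feature columns into outlier and regular index lists.

    A column counts as an outlier if any entry exceeds `threshold` in magnitude.
    """
    num_cols = len(hidden_states[0])
    outlier_cols = [
        j for j in range(num_cols)
        if any(abs(row[j]) > threshold for row in hidden_states)
    ]
    regular_cols = [j for j in range(num_cols) if j not in outlier_cols]
    return outlier_cols, regular_cols


# Two tokens, three feature dimensions (illustrative values).
activations = [
    [0.1, 7.2, -0.3],
    [0.5, -0.2, 0.4],
]
outlier_cols, regular_cols = split_outlier_columns(activations)
print("kept in FP16:", outlier_cols)
print("quantized to INT8:", regular_cols)
```

Lowering the threshold sends more columns through the FP16 path. This generally protects quality but adds overhead from the extra higher-precision computation.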

---
## 3. Performance Comparison

In [None]:
def benchmark_model(model, tokenizer, prompts, max_tokens=50):
    """Benchmark model inference."""
    total_time = 0
    total_tokens = 0
    
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
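        # CUDA kernels run asynchronously, so synchronize before reading the clock
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id,
            )
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        
        total_time += elapsed
        # Count only the newly generated tokens, not the prompt
        total_tokens += outputs.shape[1] - inputs["input_ids"].shape[1]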
    
    return {
        'total_time': total_time,
        'total_tokens': total_tokens,
        'throughput': total_tokens / total_time,
        'avg_time': total_time / len(prompts),
    }

# Test prompts
test_prompts = [
    "Explain machine learning:",
    "What is Python programming?",
    "The benefits of AI are",
]

print("Benchmarking FP16 model...")
fp16_results = benchmark_model(model_fp16, tokenizer, test_prompts)

print("\nBenchmarking INT8 model...")
int8_results = benchmark_model(model_int8, tokenizer, test_prompts)

print("\n" + "=" * 80)
print("PERFORMANCE COMPARISON")
print("=" * 80)
print(f"\nFP16:")
print(f"  Total time:    {fp16_results['total_time']:.2f}s")
print(f"  Throughput:    {fp16_results['throughput']:.1f} tokens/s")
print(f"  Memory:        {fp16_memory:.2f} GB")

print(f"\nINT8:")
print(f"  Total time:    {int8_results['total_time']:.2f}s")
print(f"  Throughput:    {int8_results['throughput']:.1f} tokens/s")
print(f"  Memory:        {int8_memory:.2f} GB")

speedup = fp16_results['total_time'] / int8_results['total_time']
throughput_gain = int8_results['throughput'] / fp16_results['throughput']
memory_reduction = fp16_memory / int8_memory

print(f"\nGains:")
print(f"  Speedup:       {speedup:.2f}x ⚡")
print(f"  Throughput:    {throughput_gain:.2f}x 📊")
print(f"  Memory saved:  {memory_reduction:.2f}x 💾")
print("=" * 80)

---
## 4. Quality Evaluation

In [None]:
# Compare output quality
eval_prompt = "Explain the concept of neural networks in detail:"

# Generate with both models using greedy decoding, so that any difference
# comes from quantization rather than from sampling noise
inputs = tokenizer(eval_prompt, return_tensors="pt").to(model_fp16.device)

print("Generating with FP16...")
with torch.no_grad():
    fp16_output = model_fp16.generate(
        **inputs, max_new_tokens=100, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
fp16_text = tokenizer.decode(fp16_output[0], skip_special_tokens=True)

print("Generating with INT8...")
with torch.no_grad():
    int8_output = model_int8.generate(
        **inputs.to(model_int8.device), max_new_tokens=100, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
int8_text = tokenizer.decode(int8_output[0], skip_special_tokens=True)

print("\n" + "=" * 80)
print("OUTPUT QUALITY COMPARISON")
print("=" * 80)
print(f"\nFP16 Output:\n{fp16_text}")
print(f"\n{'-' * 80}")
print(f"\nINT8 Output:\n{int8_text}")
print("\n" + "=" * 80)
print("\n💡 Compare the two outputs: INT8 usually stays close to FP16 despite ~2x weight compression.")

### Perplexity Evaluation

Perplexity measures how well the model predicts a sample: it is the exponential of the average per-token negative log-likelihood, so lower is better.
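
As a worked example, perplexity can be computed directly from the probabilities a model assigned to the tokens that actually occurred. The sketch below uses made-up probabilities and a hypothetical helper, `perplexity_from_probs`.

```python
import math


def perplexity_from_probs(token_probs):
    """Exponential of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads probability evenly over 4 choices
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# A model that is fairly confident about each observed token
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])

print(f"uniform over 4 options: {uniform:.2f}")
print(f"confident model:        {confident:.2f}")
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens.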

In [None]:
def calculate_perplexity(model, tokenizer, text):
    """Calculate perplexity on a text sample."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
    
    perplexity = torch.exp(loss).item()
    return perplexity

# Test texts
test_texts = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Neural networks are inspired by biological neurons.",
]

print("Calculating perplexity...\n")

fp16_ppls = [calculate_perplexity(model_fp16, tokenizer, text) for text in test_texts]
int8_ppls = [calculate_perplexity(model_int8, tokenizer, text) for text in test_texts]

fp16_avg = np.mean(fp16_ppls)
int8_avg = np.mean(int8_ppls)

print("Perplexity Results")
print("=" * 60)
print(f"FP16 avg:  {fp16_avg:.2f}")
print(f"INT8 avg:  {int8_avg:.2f}")
print(f"Degradation: {(int8_avg - fp16_avg) / fp16_avg * 100:.2f}%")
print("=" * 60)
print("\n💡 Rule of thumb: a perplexity increase under ~5% is often considered acceptable (use a larger evaluation set for reliable numbers)")

---
## 5. INT4 with GPTQ (Advanced)

### GPTQ Quantization

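GPTQ is a post-training quantization method for generative pre-trained transformers. It quantizes weights layer by layer using a small calibration dataset and provides high-quality INT4 quantization.

**Note**: Quantizing a model yourself requires a calibration dataset and GPU time, so this lab loads a pre-quantized checkpoint (e.g., from TheBloke).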

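```python
# Example: load a pre-quantized GPTQ checkpoint
# (requires `optimum` plus a GPTQ backend such as `auto-gptq`)
from transformers import AutoModelForCausalLM, AutoTokenizer

GPTQ_MODEL = "TheBloke/Llama-2-7B-GPTQ"

# The checkpoint's config already stores its quantization settings
# (bits, group size, act-order), so no GPTQConfig is needed when loading.
# GPTQConfig is for quantizing a model yourself with a calibration dataset.
model_int4 = AutoModelForCausalLM.from_pretrained(
    GPTQ_MODEL,
    device_map="auto",
)
tokenizer_int4 = AutoTokenizer.from_pretrained(GPTQ_MODEL)

# Rough expectations (hardware- and kernel-dependent):
# Memory: ~3.5-4 GB (about 4x smaller than FP16)
# Speed: up to 2-3x faster decoding
# Quality: ~95% of FP16
```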

### AWQ vs GPTQ

| Method | Calibration | Speed | Quality | Use Case |
|--------|------------|--------|---------|----------|
| **GPTQ** | Required | Fast | High | General purpose |
| **AWQ** | Required | Faster | Very High | When quality is critical |

Both achieve ~4x compression with INT4.
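
The "~4x" is approximate because group-wise methods also store quantization metadata for every group of weights. The sketch below estimates the effective bits per weight, assuming one FP16 scale and one 4-bit zero point per group. That storage layout is an assumption for illustration; real formats differ in detail.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Weight bits plus per-group metadata amortized over the group."""
    return bits + (scale_bits + zero_bits) / group_size


for group_size in (32, 64, 128):
    eff = effective_bits_per_weight(group_size=group_size)
    print(f"group_size={group_size:3d}: {eff:.3f} bits/weight, "
          f"{16 / eff:.2f}x smaller than FP16")
```

Smaller groups track the weight distribution more closely and usually improve quality, at the cost of more metadata and a lower compression ratio.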

---
## 6. Quantization Best Practices

### Choosing Quantization Level

**Decision Tree**:
```
                  Start
                    │
         ┌──────────┴──────────┐
         │                     │
   GPU memory OK?        GPU limited?
         │                     │
         ▼                     ▼
       FP16                  INT8
         │                     │
         │              Quality critical?
         │                   /   \
         │                 Yes   No
         │                  │     │
         │                INT8  INT4
         │
    Need speed?
       /    \
     Yes    No
      │      │
    INT8   FP16
```
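
The same tree can be written as a small function, which is convenient when precision selection is part of a deployment script. `choose_precision` and its argument names are hypothetical; the branches mirror the diagram above and nothing more.

```python
def choose_precision(gpu_memory_ok, quality_critical=False, need_speed=False):
    """Return the precision suggested by the decision tree above."""
    if gpu_memory_ok:
        # Enough memory: stay in FP16 unless speed matters.
        return "INT8" if need_speed else "FP16"
    # Memory-limited: INT8 if quality is critical, otherwise INT4.
    return "INT8" if quality_critical else "INT4"


print(choose_precision(gpu_memory_ok=True))
print(choose_precision(gpu_memory_ok=True, need_speed=True))
print(choose_precision(gpu_memory_ok=False, quality_critical=True))
print(choose_precision(gpu_memory_ok=False))
```

Treat the result as a starting point and confirm it with the memory, throughput, and perplexity measurements from this lab.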

In [None]:
# Summary comparison
comparison_data = {
    'Precision': ['FP16', 'INT8'],
    'Memory (GB)': [fp16_memory, int8_memory],
    'Load Time (s)': [load_time_fp16, load_time_int8],
    'Throughput (tok/s)': [
        fp16_results['throughput'], 
        int8_results['throughput']
    ],
    'Avg PPL': [fp16_avg, int8_avg],
}

print("\nQuantization Summary")
print("=" * 80)
for key in comparison_data:
    print(f"{key:20s}: {str(comparison_data[key])}")
print("=" * 80)

# Recommendations
print("\n📌 Recommendations:")
print("  - Production (quality critical):  FP16 or INT8")
print("  - Production (memory limited):    INT8 or INT4")
print("  - Research/Development:           FP16")
print("  - Edge devices:                   INT4 with GPTQ/AWQ")

---
## Summary

✅ **Completed**:
1. Understood quantization fundamentals
2. Implemented INT8 quantization with BitsAndBytes
3. Compared FP16 vs INT8 performance
4. Evaluated quality with perplexity
5. Learned quantization selection strategies

📊 **Key Findings**:
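- INT8: ~2x memory reduction, typically <5% quality loss; speedup depends on the kernels (up to ~1.5x with optimized INT8 kernels, while BitsAndBytes LLM.int8() can be slower than FP16, so check your measured throughput)
- INT4: ~4x memory reduction, up to 2-3x speedup, roughly 5-10% quality loss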
- Quantization is crucial for resource-constrained deployment

➡️ **Next**: In `04-Comprehensive_Optimization.ipynb`, we'll:
- Combine multiple optimizations
- Test vLLM + quantization
- Run comprehensive benchmarks

In [None]:
# Cleanup
import gc

del model_fp16, model_int8
gc.collect()              # drop Python references first
torch.cuda.empty_cache()  # then return cached GPU memory to the driver

print("✅ Lab 2.2 Part 3 Complete!")