# Lab-2.2 Part 4: Comprehensive Optimization

## Objectives
- Combine multiple optimization techniques
- Test vLLM + Quantization
- Run end-to-end benchmarks
- Perform cost-benefit analysis

## Estimated Time: 60-90 minutes

---
## 1. Optimization Stack

### Optimization Techniques Summary

| Technique | Memory | Latency | Throughput | Difficulty |
|-----------|--------|---------|------------|------------|
| **PagedAttention** | ↓40-60% | → | ↑20-30% | Easy (use vLLM) |
| **Continuous Batching** | → | ↑10-20% | ↑200-300% | Easy (use vLLM) |
| **FlashAttention** | → | ↓30-40% | ↑20-30% | Easy (built-in) |
| **Quantization (INT8)** | ↓50% | ↓30-40% | ↑50-100% | Medium |
| **Speculative Decoding** | ↑20% | ↓50-70% | ↑150-300% | Hard |
| **GQA** | ↓75% | → | ↑50% | Hard (model arch) |

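Legend: ↓ = decreases, ↑ = increases, → = roughly unchanged. Arrows show the direction of the metric, so ↑ is an improvement for throughput but a regression for memory and latency.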

In [None]:
# Imports
import vllm  # needed for vllm.__version__ below
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import time
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch: {torch.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

---
## 2. Baseline: HuggingFace FP16

In [None]:
MODEL_NAME = "facebook/opt-1.3b"

print("Loading HuggingFace FP16 baseline...")
hf_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models need left padding for batched generate()

hf_memory = hf_model.get_memory_footprint() / 1e9
print(f"✅ Loaded, Memory: {hf_memory:.2f} GB")

In [None]:
# Benchmark baseline
def benchmark_batch(model, tokenizer, prompts, max_tokens=50):
    """Benchmark batch generation."""
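    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    prompt_len = inputs.input_ids.shape[1]  # padded prompt length, shared by the batch

    torch.cuda.synchronize()  # do not count GPU work queued before the timer starts
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.8,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()  # wait for generation to finish before stopping the timer
    elapsed = time.perf_counter() - start

    # Positions after the prompt. Sequences that stop early are padded to the
    # batch length, so this is an upper bound on the tokens actually generated.
    total_tokens = outputs.shape[0] * (outputs.shape[1] - prompt_len)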
    
    return {
        'time': elapsed,
        'tokens': total_tokens,
        'throughput': total_tokens / elapsed,
    }

test_prompts = [
    "Explain AI:",
    "What is Python?",
    "Benefits of ML:",
    "Future of tech:",
]

print("Benchmarking HuggingFace FP16...")
baseline = benchmark_batch(hf_model, tokenizer, test_prompts)

print(f"  Time: {baseline['time']:.2f}s")
print(f"  Throughput: {baseline['throughput']:.1f} tokens/s")
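
The token count in `benchmark_batch` treats every position after the prompt as a generated token, which over-counts when a sequence reaches EOS early and is padded out to the batch length. The following is a minimal sketch of a stricter count, assuming the pad token equals the EOS token (as configured above). `count_generated_tokens` is a hypothetical helper written for this lab, not a Transformers API, and the token ids in the example are made up.

```python
def count_generated_tokens(sequences, prompt_width, eos_id):
    """Count new tokens per sequence, stopping at the first EOS after the prompt."""
    total = 0
    for seq in sequences:
        generated = list(seq[prompt_width:])
        if eos_id in generated:
            # Keep the EOS itself, drop the padding that follows it.
            generated = generated[: generated.index(eos_id) + 1]
        total += len(generated)
    return total


# Toy batch: 3 prompt positions, then generated tokens (eos_id=2 is illustrative).
batch = [
    [5, 6, 7, 11, 12, 2, 2, 2],      # stops after 3 new tokens, rest is padding
    [5, 6, 7, 21, 22, 23, 24, 25],   # uses all 5 new positions
]
generated_total = count_generated_tokens(batch, prompt_width=3, eos_id=2)
print(f"Generated tokens: {generated_total}")
```

With real outputs, `sequences` could be `outputs.tolist()` and `prompt_width` the padded prompt length `inputs.input_ids.shape[1]`.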

---
## 3. Optimization 1: vLLM (PagedAttention + Continuous Batching)

In [None]:
# Load with vLLM
print("Loading vLLM...")
vllm_model = LLM(
    model=MODEL_NAME,
    gpu_memory_utilization=0.5,
    max_model_len=512,
)
print("✅ vLLM loaded")

# Benchmark
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=50,
)

print("\nBenchmarking vLLM...")
start = time.time()
outputs = vllm_model.generate(test_prompts, sampling_params)
vllm_time = time.time() - start

vllm_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
vllm_throughput = vllm_tokens / vllm_time

print(f"  Time: {vllm_time:.2f}s")
print(f"  Throughput: {vllm_throughput:.1f} tokens/s")
print(f"  Speedup: {baseline['time']/vllm_time:.2f}x")
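
A single timed call can be dominated by one-off costs such as kernel compilation or cache warm-up. As a rough sketch, the helper below runs a callable a few times, discards warm-up runs, and reports the median. `time_runs` is a hypothetical helper, and the workload in the example is a stand-in for a real generation call.

```python
import statistics
import time


def time_runs(fn, warmup=1, repeats=5):
    """Time fn() several times after warm-up and summarize the samples."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "samples": samples,
    }


# Stand-in workload; replace with the generation call you want to measure.
stats = time_runs(lambda: sum(range(100_000)), warmup=1, repeats=5)
print(f"median: {stats['median'] * 1e3:.2f} ms over {len(stats['samples'])} runs")
```

In this lab, `fn` could be `lambda: vllm_model.generate(test_prompts, sampling_params)`, with the same treatment applied to the HuggingFace baseline so both sides are measured the same way.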

---
## 4. Optimization 2: Quantization

In [None]:
# Load quantized model with HuggingFace
print("Loading INT8 quantized model...")

quant_config = BitsAndBytesConfig(load_in_8bit=True)
hf_int8 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quant_config,
    device_map="auto",
)

int8_memory = hf_int8.get_memory_footprint() / 1e9
print(f"✅ Loaded, Memory: {int8_memory:.2f} GB ({hf_memory/int8_memory:.1f}x reduction)")

# Benchmark
print("\nBenchmarking INT8...")
int8_results = benchmark_batch(hf_int8, tokenizer, test_prompts)

print(f"  Time: {int8_results['time']:.2f}s")
print(f"  Throughput: {int8_results['throughput']:.1f} tokens/s")
print(f"  Speedup: {baseline['time']/int8_results['time']:.2f}x")
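
The reported reduction can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes about 1.3 billion parameters (taken from the model name, not measured) and ignores activations, the KV cache, and any layers a quantization library keeps in higher precision, so measured footprints will usually be somewhat higher.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 1.3 billion parameters, as the model name suggests.
NUM_PARAMS = 1.3e9

fp16_gb = estimate_weight_memory_gb(NUM_PARAMS, "fp16")
for dtype in ("fp16", "int8", "int4"):
    gb = estimate_weight_memory_gb(NUM_PARAMS, dtype)
    print(f"{dtype}: {gb:.2f} GB ({fp16_gb / gb:.1f}x smaller than fp16)")
```

Comparing these estimates with `hf_memory` and `int8_memory` shows how much of the measured footprint is not plain quantized weights.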

---
## 5. Optimization 3: vLLM + Quantization

### Combined Strategy

**Note**: With `quantization="awq"`, vLLM expects a checkpoint that has already been quantized with AWQ; it does not convert FP16 weights at load time.

```python
# Example: vLLM with AWQ quantization
vllm_quantized = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.9,
)

# Rough expectations (workload- and hardware-dependent; not measured in this lab):
# - Memory: weights roughly 3-4x smaller than FP16 (AWQ stores 4-bit weights)
# - Throughput: 15-20x vs HuggingFace (vLLM + quant)
# - Latency: 4-5x improvement
```

---
## 6. Comprehensive Benchmark

In [None]:
# Collect all results
results = {
    'HF FP16': {
        'throughput': baseline['throughput'],
        'time': baseline['time'],
        'memory': hf_memory,
    },
    'HF INT8': {
        'throughput': int8_results['throughput'],
        'time': int8_results['time'],
        'memory': int8_memory,
    },
    'vLLM FP16': {
        'throughput': vllm_throughput,
        'time': vllm_time,
        'memory': hf_memory,  # weights only; vLLM also pre-allocates KV-cache blocks up to gpu_memory_utilization
    },
}

print("\nComprehensive Benchmark Results")
print("=" * 80)
print(f"{'Method':<20} {'Throughput':<20} {'Time':<15} {'Memory'}")
print("=" * 80)

for method, metrics in results.items():
    speedup = baseline['time'] / metrics['time']
    throughput_gain = metrics['throughput'] / baseline['throughput']
    memory_reduction = hf_memory / metrics['memory']  # `baseline` has no 'memory' key
    
    print(f"{method:<20} {metrics['throughput']:>6.1f} tok/s ({throughput_gain:.1f}x) "
          f"{metrics['time']:>6.2f}s ({speedup:.1f}x) "
          f"{metrics['memory']:>5.2f} GB ({memory_reduction:.1f}x)")

print("=" * 80)

In [None]:
# Visualize comprehensive comparison
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 10))

methods = list(results.keys())
throughputs = [results[m]['throughput'] for m in methods]
times = [results[m]['time'] for m in methods]
memories = [results[m]['memory'] for m in methods]
colors = ['#ff6b6b', '#ffd93d', '#51cf66']

# Throughput
ax1.bar(methods, throughputs, color=colors)
ax1.set_ylabel('Throughput (tokens/s)')
ax1.set_title('Throughput Comparison')
ax1.tick_params(axis='x', rotation=15)
ax1.grid(axis='y', alpha=0.3)

# Time
ax2.bar(methods, times, color=colors)
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Generation Time Comparison')
ax2.tick_params(axis='x', rotation=15)
ax2.grid(axis='y', alpha=0.3)

# Memory
ax3.bar(methods, memories, color=colors)
ax3.set_ylabel('Memory (GB)')
ax3.set_title('Memory Usage Comparison')
ax3.tick_params(axis='x', rotation=15)
ax3.grid(axis='y', alpha=0.3)

# Speedup
speedups = [baseline['time']/results[m]['time'] for m in methods]
ax4.bar(methods, speedups, color=colors)
ax4.axhline(y=1.0, color='r', linestyle='--', linewidth=2, label='Baseline')
ax4.set_ylabel('Speedup vs Baseline')
ax4.set_title('Overall Speedup')
ax4.tick_params(axis='x', rotation=15)
ax4.legend()
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 7. Cost-Benefit Analysis

In [None]:
# Calculate cost savings
def calculate_cost_savings(
    baseline_throughput: float,
    optimized_throughput: float,
    gpu_cost_per_hour: float = 3.0,  # A100 cost
    tokens_per_day: int = 1_000_000,
) -> dict:
    """
    Calculate cost savings from optimization.
    """
    # Hours needed to process tokens
    baseline_hours = tokens_per_day / (baseline_throughput * 3600)
    optimized_hours = tokens_per_day / (optimized_throughput * 3600)
    
    # Daily cost
    baseline_cost = baseline_hours * gpu_cost_per_hour
    optimized_cost = optimized_hours * gpu_cost_per_hour
    
    savings = baseline_cost - optimized_cost
    savings_pct = savings / baseline_cost * 100
    
    return {
        'baseline_cost': baseline_cost,
        'optimized_cost': optimized_cost,
        'daily_savings': savings,
        'monthly_savings': savings * 30,
        'savings_pct': savings_pct,
    }

# Calculate for vLLM
print("Cost-Benefit Analysis")
print("=" * 80)
print(f"Assumptions: A100 GPU ($3/hr), 1M tokens/day\n")

vllm_savings = calculate_cost_savings(
    baseline['throughput'],
    vllm_throughput,
)

print(f"Baseline (HF FP16):")
print(f"  Daily cost:     ${vllm_savings['baseline_cost']:.2f}")
print(f"\nOptimized (vLLM):")
print(f"  Daily cost:     ${vllm_savings['optimized_cost']:.2f}")
print(f"  Daily savings:  ${vllm_savings['daily_savings']:.2f} ({vllm_savings['savings_pct']:.1f}%)")
print(f"  Monthly savings: ${vllm_savings['monthly_savings']:.2f}")
print(f"  Yearly savings:  ${vllm_savings['monthly_savings']*12:.2f}")
print("=" * 80)

print(f"\n💰 Estimated cost reduction with vLLM: {vllm_savings['savings_pct']:.1f}%")
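
Cost per million tokens is a convenient way to compare configurations independently of daily volume. The sketch below uses the same $3/hr assumption as above. The throughput values are hypothetical placeholders rather than results from this lab, and the calculation assumes the GPU is billed only while it is generating.

```python
def cost_per_million_tokens(throughput_tok_s: float, gpu_cost_per_hour: float = 3.0) -> float:
    """Return the GPU cost in dollars to generate one million tokens."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = 1_000_000 / throughput_tok_s
    return seconds / 3600 * gpu_cost_per_hour


# Hypothetical throughputs in tokens/s; substitute your measured values.
for label, throughput in [("baseline", 100.0), ("optimized", 500.0)]:
    cost = cost_per_million_tokens(throughput)
    print(f"{label}: ${cost:.2f} per 1M tokens")
# baseline: $8.33 per 1M tokens
# optimized: $1.67 per 1M tokens
```

If the GPU is reserved around the clock, the bill is fixed at 24 hours times the hourly rate, and higher throughput translates into spare capacity rather than a lower invoice.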

---
## 8. Decision Matrix

In [None]:
# Create decision matrix (illustrative estimates relative to the HF FP16 baseline,
# not values measured in this lab)
import pandas as pd

decision_matrix = pd.DataFrame({
    'Optimization': [
        'Baseline (HF FP16)',
        'PagedAttention (vLLM)',
        'Quantization (INT8)',
        'Speculative Decoding',
        'vLLM + INT8',
        'All Combined',
    ],
    'Speedup': [1.0, 5.0, 1.5, 2.0, 7.5, 15.0],
    'Memory': [1.0, 0.6, 0.5, 1.2, 0.3, 0.3],
    'Complexity': ['Low', 'Low', 'Medium', 'High', 'Medium', 'High'],
    'Quality': [100, 100, 98, 100, 98, 97],
    'Use Case': [
        'Development',
        'Production (recommended)',
        'Memory-limited',
        'Latency-critical',
        'High-throughput',
        'Maximum performance',
    ]
})

print("\nOptimization Decision Matrix")
print("=" * 100)
print(decision_matrix.to_string(index=False))
print("=" * 100)
print("\nRecommendations:")
print("  🥇 Production: vLLM (easy + effective)")
print("  🥈 Memory-limited: vLLM + INT8")
print("  🥉 Extreme performance: All techniques combined")

---
## Summary

✅ **Completed Lab-2.2**:
1. Analyzed KV Cache optimization
2. Implemented Speculative Decoding
3. Applied quantization techniques
4. Combined multiple optimizations
5. Performed cost-benefit analysis

📊 **Key Achievements** (typical ranges; compare them with your own measurements):
- vLLM: 5-10x throughput improvement
- Quantization: 2x memory reduction
- Speculative Decoding: 1.5-3x latency reduction
- Combined: 10-20x overall improvement

💡 **Best Practices**:
1. **Start simple**: Use vLLM first (biggest impact)
2. **Add quantization**: If memory-constrained
3. **Advanced techniques**: Only if needed (Speculative Decoding)
4. **Monitor quality**: Always validate output quality
5. **Measure everything**: Benchmark before and after
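
One lightweight way to act on the "monitor quality" practice is to compare token ids from the baseline and the optimized model on the same prompts. The sketch below assumes both runs use greedy decoding, so that differences come from the optimization rather than from sampling. `prefix_agreement` is a hypothetical helper and the token ids are made up.

```python
def prefix_agreement(reference, candidate):
    """Fraction of the reference matched by the candidate before the first mismatch."""
    if not reference:
        return 1.0
    matched = 0
    for ref_tok, cand_tok in zip(reference, candidate):
        if ref_tok != cand_tok:
            break
        matched += 1
    return matched / len(reference)


# Toy token ids standing in for baseline vs. optimized outputs on two prompts.
reference_outputs = [[1, 2, 3, 4], [7, 8, 9]]
candidate_outputs = [[1, 2, 9, 4], [7, 8, 9]]

scores = [
    prefix_agreement(ref, cand)
    for ref, cand in zip(reference_outputs, candidate_outputs)
]
mean_agreement = sum(scores) / len(scores)
print(f"Per-prompt agreement: {scores}")
print(f"Mean agreement: {mean_agreement:.2f}")
```

Prefix agreement is a strict metric, because one early mismatch changes everything that follows. A low score is a signal to inspect outputs or run a task-level evaluation, not proof of degraded quality.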

🎓 **Skills Mastered**:
- KV Cache management
- PagedAttention principles
- Speculative Decoding implementation
- Quantization strategies
- Performance optimization methodology

---

## Next Steps

Continue with:
- **Lab-2.3**: FastAPI Service Construction
- **Lab-2.4**: Production Environment Deployment

---

## Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes)
- [GPTQ Paper](https://arxiv.org/abs/2210.17323)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)

In [None]:
# Final cleanup
import gc

del hf_model, hf_int8, vllm_model
gc.collect()              # drop Python references first
torch.cuda.empty_cache()  # then return cached GPU blocks to the driver

print("\n" + "=" * 80)
print("🎉 Congratulations! Lab-2.2 Complete!")
print("=" * 80)
print("\nYou've mastered advanced inference optimization techniques!")