# Lab-2.1 Part 1: vLLM Setup and Installation

## Objectives
- Verify environment (CUDA, GPU)
- Install vLLM
- Run basic inference test
- Understand PagedAttention basics

## Estimated Time: 30-60 minutes

---
## 1. Environment Verification

In [None]:
# Check GPU availability
import torch
import sys

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)  # query once, reuse below
        print(f"\nGPU {i}: {props.name}")
        print(f"  Memory: {props.total_memory / 1e9:.2f} GB")
        print(f"  Compute Capability: {props.major}.{props.minor}")
else:
    print("\n⚠️ WARNING: No CUDA GPU detected!")
    print("vLLM requires a CUDA-compatible GPU to run efficiently.")

### Check GPU Memory

In [None]:
# Detailed GPU memory check
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    
    for i in range(torch.cuda.device_count()):
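        # Allocated/reserved cover only this process's PyTorch caching allocator
        allocated = torch.cuda.memory_allocated(i) / 1e9
        reserved = torch.cuda.memory_reserved(i) / 1e9
        # mem_get_info returns device-wide (free, total) bytes, so other processes count
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        free, total = free_bytes / 1e9, total_bytes / 1e9

        print(f"GPU {i}:")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved:  {reserved:.2f} GB")
        print(f"  Free:      {free:.2f} GB")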
        print(f"  Total:     {total:.2f} GB")
        print()

---
## 2. Install vLLM

vLLM is installed with pip. This lab assumes CUDA 12.1 or later. First check whether vLLM is already present in the environment:

In [None]:
# Check if vLLM is already installed
try:
    import vllm
    print(f"✅ vLLM is already installed: v{vllm.__version__}")
except ImportError:
    print("❌ vLLM is not installed.")
    print("\nThis cell does not install it. Run this in a terminal:")
    print("  pip install vllm")

### Installation Command

If vLLM is not installed, run one of the following commands in a terminal:

```bash
# Basic installation
pip install vllm

# Or add the PyTorch wheel index for a specific CUDA version (here 12.1)
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
```
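
To confirm the result of the install without importing the package (importing vLLM pulls in PyTorch and can take several seconds), the package metadata can be queried instead. The sketch below uses only the standard library; `installed_version` is a helper name introduced here for illustration and is not part of vLLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this lab relies on.
for name in ("vllm", "torch", "transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If `vllm` shows as not installed after running pip, the notebook kernel is probably using a different Python environment than the terminal.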

---
## 3. Basic Inference Test

Let's test vLLM with a small model first.

In [None]:
# Import vLLM (the top-level module is needed for __version__ below)
import time

import vllm
from vllm import LLM, SamplingParams

print("✅ vLLM imported successfully!")
print(f"Version: {vllm.__version__}")

### Load a Small Model

We'll use GPT-2 for quick testing (124M parameters).

In [None]:
# Initialize vLLM with GPT-2
print("Loading GPT-2 model...")
start_time = time.time()

llm = LLM(
    model="gpt2",
    gpu_memory_utilization=0.3,  # Let vLLM use at most 30% of GPU memory
    max_model_len=512,           # Cap prompt + generated tokens per request
)

load_time = time.time() - start_time
print(f"✅ Model loaded in {load_time:.2f} seconds")

### Generate Text

In [None]:
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=50,
)

# Test prompts
prompts = [
    "Once upon a time in a distant land,",
    "The future of artificial intelligence is",
    "Python is a programming language that",
]

print("Generating text...\n")
start_time = time.time()

outputs = llm.generate(prompts, sampling_params)

generation_time = time.time() - start_time

# Display results
for i, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt {i+1}: {prompt}")
    print(f"Generated: {generated_text}")
    print("-" * 80)

print(f"\n⏱️  Total generation time: {generation_time:.2f} seconds")
print(f"⏱️  Amortized time per prompt (batched): {generation_time/len(prompts):.2f} seconds")
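
Because the prompts are processed together as one batch, tokens per second is usually a more informative figure than time per prompt. The sketch below is a minimal helper under that assumption; `tokens_per_second` is a hypothetical name, and the numbers in the example call are made up for illustration.

```python
def tokens_per_second(token_counts, elapsed_seconds):
    """Aggregate throughput: total generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum(token_counts) / elapsed_seconds


# Hypothetical numbers: three prompts, 50 generated tokens each, 0.75 s for the batch.
print(f"{tokens_per_second([50, 50, 50], 0.75):.1f} tokens/s")
```

In this notebook, the per-request counts could be taken from `len(output.outputs[0].token_ids)` for each item in `outputs`, with `generation_time` as the elapsed time.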

---
## 4. PagedAttention Overview

PagedAttention is vLLM's key innovation for efficient KV cache management.
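
To see why the KV cache is worth managing carefully, it helps to estimate how much memory one token occupies. The sketch below assumes GPT-2 small's shape (12 layers, hidden size 768), standard multi-head attention, and 2 bytes per value (FP16). `kv_bytes_per_token` is an illustrative helper, not a vLLM function.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


per_token = kv_bytes_per_token(num_layers=12, hidden_size=768)
print(f"Per token:       {per_token / 1024:.0f} KiB")
print(f"Per 1024 tokens: {per_token * 1024 / 1024**2:.0f} MiB")
```

Even for a model this small, a single 1024-token sequence needs tens of mebibytes of cache, and the figure grows linearly with layer count, hidden size, and the number of concurrent requests.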

### Traditional KV Cache Problem

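The traditional approach reserves one contiguous region per request, sized for the longest sequence the request might produce. Any part of that region the request never fills is wasted: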

```
Request 1 (len=1024): ████████░░░░░░░░ (allocated 2048, used 1024)
Request 2 (len=512):  ████░░░░░░░░░░░░ (allocated 2048, used 512)

Memory waste: ~60% (2560 of 4096 reserved slots unused)
```
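
The waste in the diagram above can be worked out directly. This is a minimal sketch of the arithmetic, assuming every request reserves the same fixed number of token slots; `contiguous_waste` is a name introduced here for illustration.

```python
def contiguous_waste(lengths, reserved_per_request):
    """Fraction of reserved slots left unused when each request reserves a fixed span."""
    reserved = reserved_per_request * len(lengths)
    return 1 - sum(lengths) / reserved


requests = [1024, 512]  # token counts of the two requests in the diagram
waste = contiguous_waste(requests, reserved_per_request=2048)
print(f"Reserved: {2048 * len(requests)} slots, used: {sum(requests)}, waste: {waste:.1%}")
```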

### PagedAttention Solution

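PagedAttention splits the KV cache into fixed-size blocks that need not be contiguous, much like pages in operating-system virtual memory. Each request holds only the blocks it has filled so far. The diagram uses 256-token blocks for readability; the block size vLLM actually uses is printed in Section 6.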

```
Physical blocks: [P0][P1][P2][P3][P4][P5]...

Request 1: P0 → P1 → P2 → P3 (1024 tokens, 4 blocks)
Request 2: P4 → P5           (512 tokens, 2 blocks)

Memory waste: ~0% (at most one partially filled block per request)
```
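
The block-table idea can be sketched with a toy allocator. This is not vLLM's implementation. `BlockAllocator` and its methods are hypothetical names, and the 256-token block size simply mirrors the diagram. The point is that a request's blocks come from a shared free list and do not have to be adjacent.

```python
import math


class BlockAllocator:
    """Toy allocator: hands out fixed-size blocks from a shared free list."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def allocate(self, request_id: str, num_tokens: int) -> list:
        """Reserve just enough blocks to hold num_tokens and record the mapping."""
        needed = math.ceil(num_tokens / self.block_size)
        if needed > len(self.free_blocks):
            raise MemoryError(f"need {needed} blocks, only {len(self.free_blocks)} free")
        blocks = [self.free_blocks.pop(0) for _ in range(needed)]
        self.block_tables[request_id] = blocks
        return blocks

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free list for reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id))


allocator = BlockAllocator(num_blocks=8, block_size=256)
print("Request 1:", allocator.allocate("req1", 1024))
print("Request 2:", allocator.allocate("req2", 512))
allocator.release("req1")
# The third request reuses freed blocks, so its block list is not contiguous.
print("Request 3:", allocator.allocate("req3", 1500))
```

Only the last block of a request can be partially empty, which is why waste stays small regardless of how long each sequence turns out to be.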

In [None]:
# Visualize memory efficiency (illustrative numbers, not measurements)
import matplotlib.pyplot as plt

# Simulated data matching the diagrams above
approaches = ['Traditional', 'PagedAttention']
memory_used = [40, 95]  # Percentage of reserved KV cache that holds tokens
memory_wasted = [60, 5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Memory utilization
ax1.bar(approaches, memory_used, color=['#ff6b6b', '#51cf66'])
ax1.set_ylabel('Memory Utilization (%)')
ax1.set_title('Memory Utilization Comparison')
ax1.set_ylim(0, 100)
for i, v in enumerate(memory_used):
    ax1.text(i, v + 2, f"{v}%", ha='center', fontweight='bold')

# Memory waste
ax2.bar(approaches, memory_wasted, color=['#ff6b6b', '#51cf66'])
ax2.set_ylabel('Memory Waste (%)')
ax2.set_title('Memory Waste Comparison')
ax2.set_ylim(0, 100)
for i, v in enumerate(memory_wasted):
    ax2.text(i, v + 2, f"{v}%", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Illustrative: PagedAttention raises KV cache utilization from ~40% to ~95%.")

---
## 5. Compare with HuggingFace

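Let's compare vLLM with standard HuggingFace inference on a single prompt. Treat the result as a rough sanity check: one short request does not exercise vLLM's batching, and first-call overhead affects both timings.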

In [None]:
# HuggingFace baseline
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("Loading HuggingFace GPT-2...")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_tokenizer.pad_token = hf_tokenizer.eos_token

print("✅ HuggingFace model loaded")

In [None]:
# HuggingFace generation
test_prompt = "The future of artificial intelligence is"

print("Testing HuggingFace...")
inputs = hf_tokenizer(test_prompt, return_tensors="pt").to("cuda")

start_time = time.time()
with torch.no_grad():
    hf_outputs = hf_model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
    )
hf_time = time.time() - start_time

hf_text = hf_tokenizer.decode(hf_outputs[0], skip_special_tokens=True)
print(f"HuggingFace output: {hf_text}")
print(f"Time: {hf_time:.3f}s")

In [None]:
# vLLM generation (same prompt)
print("\nTesting vLLM...")
start_time = time.time()
vllm_outputs = llm.generate([test_prompt], sampling_params)
vllm_time = time.time() - start_time

vllm_text = vllm_outputs[0].outputs[0].text
print(f"vLLM output: {test_prompt}{vllm_text}")
print(f"Time: {vllm_time:.3f}s")

In [None]:
# Performance comparison
speedup = hf_time / vllm_time

print("\n" + "="*80)
print("PERFORMANCE COMPARISON")
print("="*80)
print(f"HuggingFace:  {hf_time:.3f}s")
print(f"vLLM:         {vllm_time:.3f}s")
print(f"Speedup:      {speedup:.2f}x (HuggingFace time / vLLM time)")
print("="*80)
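
A single timed call is noisy, since the first run often includes one-time setup costs. The sketch below shows one common way to get a steadier number: discard warm-up runs and take the median of several repeats. `time_call` is a hypothetical helper, and the workload in the example is a stand-in so the block runs without a GPU.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=5):
    """Return the median wall-clock time of fn() after discarding warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs anywhere; swap in the real generate calls.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"Median time: {elapsed * 1000:.2f} ms")
```

In this notebook the same helper could wrap `lambda: llm.generate([test_prompt], sampling_params)` and the equivalent `hf_model.generate` call, so both sides are measured the same way.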

---
## 6. Check vLLM Configuration

In [None]:
# Inspect vLLM engine configuration
# Note: these attribute paths are engine internals and may differ between vLLM versions
print("vLLM Engine Configuration:")
print(f"  Model: {llm.llm_engine.model_config.model}")
print(f"  Max model length: {llm.llm_engine.model_config.max_model_len}")
print(f"  GPU memory utilization: {llm.llm_engine.cache_config.gpu_memory_utilization}")
print(f"  Block size: {llm.llm_engine.cache_config.block_size}")

---
## Summary

✅ **Completed**:
1. Verified CUDA and GPU environment
2. Installed vLLM
3. Ran basic inference test
4. Understood PagedAttention benefits
5. Compared vLLM vs HuggingFace

📊 **Key Findings**:
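- The speedup over HuggingFace depends on hardware and batch size; use the value measured in Section 5 rather than a fixed figure
- In the illustrative example, PagedAttention raises KV cache utilization from ~40% to ~95%
- The vLLM API is simple and similar in spirit to HuggingFace's `generate`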

➡️ **Next**: In `02-Basic_Inference.ipynb`, we'll explore:
- Batch inference
- Advanced sampling strategies
- Memory profiling
- Throughput optimization

---
## Exercises

1. **Try different models**: Replace GPT-2 with other models (e.g., `facebook/opt-125m`)
2. **Adjust parameters**: Experiment with `gpu_memory_utilization` and `max_model_len`
3. **Measure memory**: Use `nvidia-smi` to monitor GPU memory usage
4. **Batch size**: Test with different numbers of prompts (1, 4, 8, 16)
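
For the memory exercise, `nvidia-smi` can also be queried from Python. The sketch below assumes the query prints one `used, total` pair of integers (in MiB) per GPU, which is what the `--query-gpu` and `--format=csv,noheader,nounits` options normally produce. `parse_memory_csv` and `query_gpu_memory` are hypothetical helper names.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_csv(result.stdout)


for index, (used, total) in enumerate(query_gpu_memory()):
    print(f"GPU {index}: {used} / {total} MiB ({used / total:.0%})")
```

Calling `query_gpu_memory()` before and after loading the model gives a quick view of how much memory `gpu_memory_utilization` actually claims.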

In [None]:
# Clean up
import gc

del llm
del hf_model
gc.collect()              # drop lingering Python references first
torch.cuda.empty_cache()  # then release PyTorch's cached GPU blocks

print("✅ Cleaned up GPU memory")