# Lab-2.1 Part 4: Production Deployment

## Objectives
- Deploy OpenAI-compatible API server
- Optimize performance for production
- Set up monitoring and logging
- Learn deployment best practices

## Estimated Time: 60-90 minutes

---
## 1. OpenAI-Compatible API Server

vLLM provides an OpenAI-compatible API server out of the box.

### Starting the Server

Run this in a **separate terminal**:

```bash
# Basic usage
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000

# With more options
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 32 \
    --max-model-len 2048
```

The server will be available at: `http://localhost:8000`

### Test API Endpoint

**Note**: Make sure the vLLM server is running before executing the following cells.
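
Model loading can take a minute or more, so the server may not answer immediately after launch. The following is a minimal sketch of a readiness poll against the `/health` endpoint; the helper name `wait_for_server` and its timeout defaults are assumptions for illustration, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 120.0,
                    interval_s: float = 2.0, request_timeout_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health",
                                        timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet (or returned an error status).
            pass
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server("http://localhost:8000")` before the test cells returns `True` once the server responds and `False` if the deadline passes first.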

In [None]:
# Test if server is running
import requests
import json

API_URL = "http://localhost:8000"

try:
    response = requests.get(f"{API_URL}/health", timeout=5)
    if response.status_code == 200:
        print("✅ Server is running!")
    else:
        print("⚠️ Server responded, but the health check did not return 200.")
    print(f"Status: {response.status_code}")
except requests.exceptions.RequestException:
    print("❌ Server is not running.")
    print("Please start the server in a terminal first.")

### Completions API

In [None]:
# Test /v1/completions endpoint
def call_completions_api(prompt: str, **kwargs):
    """Call vLLM completions API."""
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": kwargs.get("max_tokens", 50),
        "temperature": kwargs.get("temperature", 0.8),
        "top_p": kwargs.get("top_p", 0.95),
    }
    
    response = requests.post(
        f"{API_URL}/v1/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
        timeout=60,  # avoid hanging forever if the server stalls
    )
    
    return response.json()

# Test
print("Testing /v1/completions endpoint...\n")

result = call_completions_api("The future of AI is")

print("Response:")
print(json.dumps(result, indent=2))

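### Chat Completions API

**Note**: `meta-llama/Llama-2-7b-hf` is a base model, not a chat-tuned one. Depending on your vLLM and `transformers` versions, `/v1/chat/completions` may reject requests for a model without a chat template unless you start the server with `--chat-template` or serve a chat variant such as `meta-llama/Llama-2-7b-chat-hf`.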

In [None]:
# Test /v1/chat/completions endpoint
def call_chat_api(messages: list, **kwargs):
    """Call vLLM chat completions API."""
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "messages": messages,
        "max_tokens": kwargs.get("max_tokens", 100),
        "temperature": kwargs.get("temperature", 0.8),
    }
    
    response = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
        timeout=60,  # avoid hanging forever if the server stalls
    )
    
    return response.json()

# Test
print("Testing /v1/chat/completions endpoint...\n")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is machine learning?"}
]

result = call_chat_api(messages)

if "choices" in result:
    print("Assistant:", result["choices"][0]["message"]["content"])
    print(f"\nTokens used: {result['usage']['total_tokens']}")
else:
    print("Response:", result)

### Use OpenAI Python Client

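vLLM's server implements the OpenAI Completions and Chat Completions endpoints, so OpenAI's official Python client works against it once `base_url` points at the server.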

In [None]:
# Install openai if needed
try:
    import openai
except ImportError:
    print("Installing openai...")
    !pip install openai -q
    import openai

print(f"OpenAI version: {openai.__version__}")

In [None]:
# Configure OpenAI client to use vLLM
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key",  # any value works unless the server was started with --api-key
)

print("Testing with OpenAI client...\n")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=150,
    temperature=0.7,
)

print("Response:")
print(response.choices[0].message.content)
print(f"\nTokens: {response.usage.total_tokens}")

---
## 2. Performance Tuning

### Key Configuration Parameters

#### GPU Memory Utilization
```bash
--gpu-memory-utilization 0.9  # Use 90% of GPU memory
```
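- Fraction of each GPU's memory that vLLM may claim for weights, activations, and the KV cache
- Higher value → more KV-cache space → more sequences batched together → better throughput
- Leave headroom (about 10%) for the CUDA context and other processes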

#### Max Number of Sequences
```bash
--max-num-seqs 32  # Process up to 32 requests concurrently
```
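- Higher value → more requests share each decoding step → better throughput, usually at some cost in per-request latency
- Limited by GPU memory: every running sequence needs KV-cache space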
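
To see how these two settings interact, it helps to estimate how many tokens the KV cache can hold. The sketch below is back-of-the-envelope arithmetic, not vLLM's actual memory profiler. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is that of Llama-2-7B; the 24 GiB GPU, the roughly 13 GiB for weights, and the 1 GiB of runtime overhead are assumed values for illustration.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_mem_gib: float, utilization: float, weights_gib: float,
                      overhead_gib: float, bytes_per_token: int) -> int:
    """Rough number of tokens that fit in the memory left over for the KV cache."""
    budget_bytes = (gpu_mem_gib * utilization - weights_gib - overhead_gib) * GIB
    return max(0, int(budget_bytes // bytes_per_token))


# Llama-2-7B shape; GPU size, weight footprint, and overhead are assumptions.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
capacity = kv_token_capacity(24, 0.9, 13.0, 1.0, per_token)
full_length_seqs = capacity // 2048  # sequences at --max-model-len 2048

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Approximate token capacity: {capacity}")
print(f"Full-length sequences that fit: {full_length_seqs}")
```

Under these assumptions only a handful of full-length sequences fit at once, so `--max-num-seqs 32` is reached only when most requests are much shorter than `--max-model-len`.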

#### Max Batched Tokens
```bash
--max-num-batched-tokens 8192
```
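- Caps the number of tokens processed in one scheduler step, which bounds the prefill batch size
- Affects TTFT (Time to First Token): a larger budget lets long prompts prefill sooner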
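
TTFT and inter-token latency can be measured on the client by recording when each streamed token arrives. The helper below is a sketch of that calculation only; it assumes you have already collected arrival timestamps (for example from a streaming response), and the function name and return keys are hypothetical.

```python
from statistics import mean


def latency_breakdown(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and mean inter-token latency from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "total_s": token_times[-1] - request_start,
    }
```

Comparing `ttft_s` before and after changing `--max-num-batched-tokens` shows whether the setting helps your workload.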

#### Block Size
```bash
--block-size 16  # PagedAttention block size
```
- Usually 16 is optimal
- Smaller = less waste, but more overhead
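
The "waste" here is the unused tail of a sequence's last block, since PagedAttention allocates KV-cache memory in whole blocks. A small sketch of that arithmetic (helper names are illustrative):

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of KV-cache blocks a sequence of num_tokens occupies (ceiling division)."""
    return -(-num_tokens // block_size)


def wasted_slots(num_tokens: int, block_size: int = 16) -> int:
    """Unused token slots in the sequence's final, partially filled block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


for size in (8, 16, 32):
    print(f"block_size={size}: {blocks_needed(100, size)} blocks, "
          f"{wasted_slots(100, size)} wasted slots for a 100-token sequence")
```

Per sequence, the waste is at most `block_size - 1` slots, while a smaller block size means more blocks to track for the same number of tokens.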

### Load Testing

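For realistic load tests, use a dedicated tool such as `locust` or `wrk`. The cell below is a quick concurrency smoke test built on a thread pool.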

In [None]:
# Simple concurrent test
import concurrent.futures
import time

def send_request(request_id: int):
    """Send a single request."""
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model="meta-llama/Llama-2-7b-hf",
            messages=[
                {"role": "user", "content": f"Tell me a fact about number {request_id}."}
            ],
            max_tokens=50,
        )
        elapsed = time.time() - start
        return {"id": request_id, "time": elapsed, "success": True}
    except Exception as e:
        elapsed = time.time() - start
        return {"id": request_id, "time": elapsed, "success": False, "error": str(e)}

# Run concurrent requests
NUM_REQUESTS = 10
NUM_WORKERS = 5

print(f"Sending {NUM_REQUESTS} concurrent requests with {NUM_WORKERS} workers...\n")

start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    futures = [executor.submit(send_request, i) for i in range(NUM_REQUESTS)]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

total_time = time.time() - start_time

# Analyze results
successful = sum(1 for r in results if r["success"])
latencies = [r["time"] for r in results if r["success"]]

print("LOAD TEST RESULTS")
print("="*80)
print(f"Total requests:     {NUM_REQUESTS}")
print(f"Successful:         {successful}")
print(f"Failed:             {NUM_REQUESTS - successful}")
print(f"Total time:         {total_time:.2f}s")
print(f"Requests/sec:       {NUM_REQUESTS/total_time:.2f}")
if latencies:
    print(f"Avg latency:        {sum(latencies)/len(latencies):.3f}s")
    print(f"Min latency:        {min(latencies):.3f}s")
    print(f"Max latency:        {max(latencies):.3f}s")
else:
    print("Avg latency:        n/a (no successful requests)")
print("="*80)
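
Averages hide tail latency, which is usually what users notice. As a sketch, the helper below adds nearest-rank percentiles to the summary; the function name and the choice of p50/p95/p99 are assumptions, and with only 10 requests the upper percentiles simply equal the maximum.

```python
def latency_summary(latencies: list[float], percentiles=(50, 95, 99)) -> dict:
    """Summarize latencies with the mean and nearest-rank percentiles."""
    if not latencies:
        return {}
    ordered = sorted(latencies)
    n = len(ordered)
    summary = {"mean": sum(ordered) / n}
    for p in percentiles:
        # Nearest-rank method: smallest value with at least p% of samples at or below it.
        rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
        summary[f"p{p}"] = ordered[rank - 1]
    return summary
```

In the notebook, `latency_summary(latencies)` can be printed right after the load test results.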

---
## 3. Monitoring and Logging

### Prometheus Metrics

vLLM exposes Prometheus metrics at `/metrics`:

In [None]:
# Fetch Prometheus metrics
try:
    response = requests.get(f"{API_URL}/metrics")
    metrics = response.text
    
    print("Sample metrics:\n")
    print("="*80)
    
    # Show the first 20 sample lines (skip "# HELP" / "# TYPE" comment lines first)
    samples = [line for line in metrics.split('\n') if line and not line.startswith('#')]
    for line in samples[:20]:
        print(line)
    
    print("...")
    print("="*80)
    
except Exception as e:
    print(f"Error fetching metrics: {e}")

### Key Metrics to Monitor

1. **Request Metrics**
   - `vllm:num_requests_running` - Active requests
   - `vllm:num_requests_waiting` - Queued requests
   - `vllm:request_success_total` - Successful requests

2. **Latency Metrics**
   - `vllm:time_to_first_token_seconds` - TTFT
   - `vllm:time_per_output_token_seconds` - Time per output token (TPOT), also called inter-token latency (ITL)
   - `vllm:e2e_request_latency_seconds` - End-to-end latency

3. **GPU Metrics**
   - `vllm:gpu_cache_usage_perc` - KV cache utilization
   - `vllm:gpu_memory_usage_bytes` - GPU memory (metric names vary between vLLM versions; confirm against your server's `/metrics` output)

4. **Throughput Metrics**
   - `vllm:num_preemptions_total` - Request preemptions
   - `vllm:prompt_tokens_total` - Input tokens
   - `vllm:generation_tokens_total` - Output tokens
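
To track a few of these metrics from a script, the `/metrics` text can be parsed into a dictionary. The parser below is a minimal sketch that handles the common `name{labels} value` sample lines only; the function name and the default `vllm:` prefix filter are assumptions. For anything beyond quick checks, the `prometheus_client` package's text parser is a more robust option.

```python
def parse_prometheus(text: str, name_prefix: str = "vllm:") -> dict[str, float]:
    """Map each sample series (name plus labels) to its value, filtered by name prefix."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and "# HELP" / "# TYPE" comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest or not series.startswith(name_prefix):
            continue
        try:
            samples[series] = float(rest[0])  # ignore an optional trailing timestamp
        except ValueError:
            continue  # malformed value; skip rather than fail
    return samples
```

With the server running, `parse_prometheus(requests.get(f"{API_URL}/metrics", timeout=5).text)` returns only the `vllm:` series, which you can then filter for names such as `vllm:num_requests_waiting`.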

### Custom Logging

In [None]:
# Simple request logger
import logging
import time
from datetime import datetime

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)
logger = logging.getLogger(__name__)

def logged_request(prompt: str, **kwargs):
    """Make request with logging."""
    request_id = datetime.now().strftime("%Y%m%d%H%M%S%f")
    
    logger.info(f"Request {request_id} started")
    logger.info(f"  Prompt: {prompt[:50]}...")
    
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model="meta-llama/Llama-2-7b-hf",
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        
        elapsed = time.time() - start
        # Use generated tokens only; total_tokens also counts the prompt
        tokens = response.usage.completion_tokens
        
        logger.info(f"Request {request_id} completed")
        logger.info(f"  Time: {elapsed:.3f}s")
        logger.info(f"  Output tokens: {tokens}")
        logger.info(f"  Throughput: {tokens/elapsed:.1f} output tokens/s")
        
        return response
        
    except Exception as e:
        elapsed = time.time() - start
        logger.error(f"Request {request_id} failed")
        logger.error(f"  Error: {e}")
        logger.error(f"  Time: {elapsed:.3f}s")
        raise

# Test logged request
print("Testing logged request:\n")
response = logged_request("What is Python?", max_tokens=50)
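
Multi-line text logs are awkward to search once they reach a log aggregator. One common alternative is to emit a single JSON object per event. The formatter below is a sketch using only the standard library; the class name, the logger name `vllm_client`, and the convention of passing extra fields under a `fields` key are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed as logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_json_logger(name: str = "vllm_client") -> logging.Logger:
    """Create (or reuse) a logger that writes JSON lines to stderr."""
    json_logger = logging.getLogger(name)
    json_logger.setLevel(logging.INFO)
    if not json_logger.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        json_logger.addHandler(handler)
    json_logger.propagate = False  # keep the root logger from printing a second copy
    return json_logger
```

For example, `make_json_logger().info("request completed", extra={"fields": {"elapsed_s": 0.42, "output_tokens": 50}})` writes one line that carries the timing and token count as separate keys.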

---
## 4. Deployment Best Practices

### Docker Deployment

#### Dockerfile Example

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM
RUN pip3 install vllm

# Download model (optional, can mount volume instead)
# RUN python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-2-7b-hf')"

# Expose port
EXPOSE 8000

# Start server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-2-7b-hf", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
```

#### Build and Run

```bash
# Build
docker build -t vllm-server .

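# Run (Llama 2 is a gated model, so pass a Hugging Face token that has access to it;
# $HF_TOKEN is assumed to be set in your shell)
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    vllm-server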
```

### Kubernetes Deployment

#### Deployment YAML

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "4"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
```

### Health Checks

In [None]:
# Client-side health check: liveness probe plus a tiny end-to-end generation
def check_health():
    """Check if vLLM server is healthy."""
    try:
        # Check health endpoint
        response = requests.get(f"{API_URL}/health", timeout=5)
        if response.status_code != 200:
            return False, "Health check failed"

        # Test with a simple request; an exception here means generation is broken
        client.chat.completions.create(
            model="meta-llama/Llama-2-7b-hf",
            messages=[{"role": "user", "content": "test"}],
            max_tokens=5,
        )
        
        return True, "Healthy"
        
    except Exception as e:
        return False, str(e)

# Run health check
is_healthy, message = check_health()

if is_healthy:
    print("✅ Server is healthy")
else:
    print(f"❌ Server health check failed: {message}")

### Production Checklist

Before deploying to production:

- [ ] **Performance Testing**
  - [ ] Load testing completed
  - [ ] Latency benchmarks acceptable
  - [ ] Memory usage stable

- [ ] **Monitoring**
  - [ ] Prometheus metrics configured
  - [ ] Grafana dashboards set up
  - [ ] Alerts configured

- [ ] **Reliability**
  - [ ] Health checks implemented
  - [ ] Auto-restart on failure
  - [ ] Load balancing configured

- [ ] **Security**
  - [ ] API authentication enabled
  - [ ] Rate limiting configured
  - [ ] Input validation implemented

- [ ] **Scalability**
  - [ ] Horizontal scaling tested
  - [ ] Auto-scaling configured
  - [ ] Resource limits set

---
## 5. Common Issues and Solutions

### Issue 1: Out of Memory

**Symptoms**: CUDA OOM errors

**Solutions**:
```bash
# Reduce GPU memory utilization
--gpu-memory-utilization 0.8

# Reduce max sequences
--max-num-seqs 16

# Reduce context length
--max-model-len 1024
```

### Issue 2: High Latency

**Symptoms**: Slow response times

**Solutions**:
```bash
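# If requests are queuing (vllm:num_requests_waiting > 0), allow more concurrent sequences.
# Larger batches raise throughput but can increase per-request latency.
--max-num-seqs 64

# Raise the per-step token budget so long prompts prefill sooner (helps TTFT)
--max-num-batched-tokens 16384

# Shard the model across GPUs with tensor parallelism (multi-GPU)
--tensor-parallel-size 2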
```

### Issue 3: Request Timeouts

**Symptoms**: Requests timing out

**Solutions**:
- Reduce `max_tokens` in requests
- Increase server timeout settings
- Scale horizontally with load balancer
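
On the client side, transient timeouts are often handled by retrying with exponential backoff. The OpenAI Python client already accepts `timeout` and `max_retries` arguments in its constructor; the generic helper below is a sketch for callers that use `requests` directly. Its name, defaults, and jitter scheme are assumptions, not a vLLM or OpenAI API.

```python
import random
import time


def retry_with_backoff(fn, max_attempts: int = 4, base_delay_s: float = 0.5,
                       max_delay_s: float = 8.0, retry_on=(Exception,),
                       sleep=time.sleep):
    """Call fn(); on a listed exception, wait with exponential backoff and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

For example, `retry_with_backoff(lambda: call_completions_api("Hello"), retry_on=(requests.exceptions.Timeout,))` retries only on timeouts. Retries add load to an already slow server, so keep `max_attempts` small.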

### Issue 4: Inconsistent Performance

**Symptoms**: Variable latency

**Solutions**:
- Check for competing GPU processes
- Monitor GPU temperature throttling
- Ensure stable power supply
- Use dedicated GPU instances

---
## Summary

✅ **Completed Lab-2.1**:
1. Deployed OpenAI-compatible API server
2. Tested completions and chat APIs
3. Performed load testing
4. Set up monitoring and logging
5. Learned deployment best practices

📊 **Key Achievements**:
- Production-ready vLLM deployment
- OpenAI API compatibility
- Performance monitoring setup
- Understanding of optimization parameters

🎓 **Skills Acquired**:
- vLLM installation and configuration
- PagedAttention understanding
- Batch inference optimization
- Advanced sampling strategies
- Production deployment

---

## Next Steps

Continue with:
- **Lab-2.2**: Inference Optimization Techniques
- **Lab-2.3**: FastAPI Service Construction
- **Lab-2.4**: Production Environment Deployment

---

## Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [vLLM GitHub](https://github.com/vllm-project/vllm)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

In [None]:
# Final cleanup
print("✅ Lab-2.1 Complete!")
print("\nCongratulations on mastering vLLM deployment! 🎉")