# Lab-2.1 Part 4: Production Deployment

## Objectives
- Deploy an OpenAI-compatible API server
- Optimize performance for production
- Set up monitoring and logging
- Learn deployment best practices

## Estimated Time: 60-90 minutes

---
## 1. OpenAI-Compatible API Server

vLLM provides an OpenAI-compatible API server out of the box.

### Starting the Server

Run this in a **separate terminal**:

```bash
# Basic usage
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000

# With more options
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 32 \
    --max-model-len 2048
```

The server will be available at: `http://localhost:8000`
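Loading model weights takes time, so the server may not answer right after the command starts. The following is a minimal sketch of a readiness poll against the `/health` endpoint. The helper names `server_is_up` and `wait_until_ready`, the 300-second timeout, and the 2-second interval are illustrative assumptions, not part of vLLM.

```python
import time
import urllib.request


def server_is_up(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_until_ready(check=server_is_up, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready()` before the first request avoids connection errors while the model is still loading. The `check` argument is injectable so the loop can be exercised without a running server.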

### Test API Endpoint

**Note**: Make sure the vLLM server is running before executing the following cells.

In [1]:
# Test if server is running
import requests
import json

API_URL = "http://localhost:8000"

try:
    response = requests.get(f"{API_URL}/health", timeout=5)
    if response.status_code == 200:
        print("✅ Server is running!")
    else:
        print("⚠️ Server responded, but the health check did not return 200.")
    print(f"Status: {response.status_code}")
except requests.exceptions.ConnectionError:
    print("❌ Server is not running.")
    print("Please start the server in a terminal first.")

✅ Server is running!
Status: 200


### Completions API

In [None]:
# Test /v1/completions endpoint
def call_completions_api(prompt: str, **kwargs):
    """Call vLLM completions API."""
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": kwargs.get("max_tokens", 50),
        "temperature": kwargs.get("temperature", 0.8),
        "top_p": kwargs.get("top_p", 0.95),
    }
    
    response = requests.post(
        f"{API_URL}/v1/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
    )
    
    return response.json()

# Test
print("Testing /v1/completions endpoint...\n")

result = call_completions_api("The future of AI is")

print("Response:")
print(json.dumps(result, indent=2))

Testing /v1/completions endpoint...

Response:
{
  "id": "cmpl-442002760ddb4897bd2416e29b76ae5e",
  "object": "text_completion",
  "created": 1761616011,
  "model": "meta-llama/Llama-2-7b-hf",
  "choices": [
    {
      "index": 0,
      "text": " happening now\nThis article is the second in a series of articles highlighting the real-world impact of AI.\nAI: The Future is Now\nLinda Rendle\nPresident, Fujitsu North America\nWe have",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 57,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}


In [6]:
# Extract the text from the first entry under "choices"
completion_text = None
if "choices" in result and len(result["choices"]) > 0:
    completion_text = result["choices"][0].get("text")
    print("Completion Text:")
    print(completion_text)
else:
    print("No completion text found in response.")



Completion Text:
 happening now
This article is the second in a series of articles highlighting the real-world impact of AI.
AI: The Future is Now
Linda Rendle
President, Fujitsu North America
We have


### Chat Completions API

In [7]:
# Test /v1/chat/completions endpoint
def call_chat_api(messages: list, **kwargs):
    """Call vLLM chat completions API."""
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "messages": messages,
        "max_tokens": kwargs.get("max_tokens", 100),
        "temperature": kwargs.get("temperature", 0.8),
    }
    
    response = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
    )
    
    return response.json()

# Test
print("Testing /v1/chat/completions endpoint...\n")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is machine learning?"}
]

result = call_chat_api(messages)

if "choices" in result:
    print("Assistant:", result["choices"][0]["message"]["content"])
    print(f"\nTokens used: {result['usage']['total_tokens']}")
else:
    print("Response:", result)

Testing /v1/chat/completions endpoint...

Response: {'error': {'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one. None', 'type': 'BadRequestError', 'param': None, 'code': 400}}
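The request fails because the tokenizer of the base model `meta-llama/Llama-2-7b-hf` defines no chat template, so the server cannot turn `messages` into a prompt. A chat template is essentially a function from a message list to a prompt string. The sketch below renders a single system and user turn in the Llama-2 chat layout so the result could be sent to `/v1/completions` instead. The function name `build_llama2_prompt` is hypothetical, multi-turn history is not handled, and the base model was not fine-tuned on this layout, so answers will not match the chat model's quality.

```python
def build_llama2_prompt(messages: list[dict]) -> str:
    """Render one system + user turn in the Llama-2 chat layout (simplified)."""
    system = ""
    user = ""
    for message in messages:
        if message["role"] == "system":
            system = message["content"]
        elif message["role"] == "user":
            user = message["content"]  # keeps only the last user message

    # The BOS token is left out on purpose; the tokenizer normally adds it.
    if system:
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    return f"[INST] {user} [/INST]"
```

For example, `call_completions_api(build_llama2_prompt(messages))` would reuse the completions helper defined earlier. Serving a chat model, or passing a template to the server, remains the more robust fix.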


### Use OpenAI Python Client

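vLLM's server can be called with OpenAI's Python client by pointing `base_url` at it.

The chat request above failed because the base model `meta-llama/Llama-2-7b-hf` ships without a chat template. The cells below assume the server was restarted with `--model meta-llama/Llama-2-7b-chat-hf`, whose tokenizer defines one.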

In [9]:
# Install openai if needed
try:
    import openai
except ImportError:
    print("Installing openai...")
    !pip install openai -q
    import openai

print(f"OpenAI version: {openai.__version__}")


OpenAI version: 2.6.0


In [12]:
# ✅ Corrected version: for vLLM + Llama-2-7b-chat-hf
from openai import OpenAI

# The vLLM endpoint is usually http://localhost:8000/v1 by default
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key",  # vLLM does not need a real API key here
)

print("Testing with OpenAI client...\n")

# ✅ Use the chat model instead (its tokenizer has a built-in chat_template)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=150,
    temperature=0.7,
)

print("Response:")
print(response.choices[0].message.content)
print(f"\nTokens: {response.usage.total_tokens}")


Testing with OpenAI client...

Response:
  Ah, an excellent question! *adjusts glasses* Quantum computing, my dear, is a revolutionary field that harnesses the power of quantum mechanics to perform calculations that are beyond the reach of classical computers. *nods*

You see, classical computers use bits, which are either a 0 or a 1, to store and process information. But, quantum computers use quantum bits, or qubits, which can exist in multiple states simultaneously! *excitedly* This means that qubits can process an enormous amount of information all at once, making quantum computers incredibly fast and efficient. *winks*

But wait, there's more! Quantum computers can also

Tokens: 185


---
## 2. Performance Tuning

### Key Configuration Parameters

#### GPU Memory Utilization
```bash
--gpu-memory-utilization 0.9  # Use 90% of GPU memory
```
- Higher value → more memory for the KV cache → more sequences batched together → better throughput
- Leave some headroom (10% here) for allocations outside vLLM's budget
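To see why this setting matters, it helps to estimate how many tokens of KV cache fit in the budget. The sketch below is back-of-the-envelope arithmetic only. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) follows the published Llama-2-7B configuration, while the 24 GiB GPU and the 13.5 GB weight footprint are illustrative assumptions. It ignores activation memory and other runtime overhead, so real capacity will be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed to hold one token."""
    # Factor 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_bytes: float, utilization: float,
                      weight_bytes: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left after the model weights are loaded."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


GIB = 2**30
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed hardware: one 24 GiB GPU with roughly 13.5 GB of fp16 weights.
for utilization in (0.8, 0.9):
    tokens = kv_token_capacity(24 * GIB, utilization, 13.5e9, per_token)
    print(f"utilization={utilization}: about {tokens} cached tokens")
```

Under these assumptions each token costs 0.5 MiB of cache, so a few extra percent of GPU memory translates into thousands of additional cached tokens.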

#### Max Number of Sequences
```bash
--max-num-seqs 32  # Process up to 32 requests concurrently
```
- Higher value → better throughput
- Limited by GPU memory

#### Max Batched Tokens
```bash
--max-num-batched-tokens 8192
```
- Controls prefill batch size
- Affects TTFT (Time to First Token)

#### Block Size
```bash
--block-size 16  # PagedAttention block size
```
- Usually 16 is optimal
- Smaller = less waste, but more overhead
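The trade-off can be made concrete with a little arithmetic. PagedAttention stores the KV cache in fixed-size blocks, so the last block of each sequence is usually only partly filled. The sketch below counts blocks and unused slots for one sequence. The helper names are hypothetical and the calculation ignores any sharing of blocks between sequences.

```python
import math


def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of KV-cache blocks a sequence of `seq_len` tokens occupies."""
    return math.ceil(seq_len / block_size)


def wasted_slots(seq_len: int, block_size: int = 16) -> int:
    """Token slots allocated but unused in the sequence's last block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len


for size in (8, 16, 32):
    print(f"block_size={size}: blocks={blocks_needed(100, size)}, "
          f"wasted={wasted_slots(100, size)}")
```

Smaller blocks leave fewer empty slots per sequence but require more blocks to track, which is the overhead mentioned above.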

### Load Testing

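Use a dedicated tool such as `locust` or `wrk` for sustained load tests. The cell below is only a quick smoke test that sends a handful of concurrent requests from a thread pool.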

In [13]:
# Simple concurrent test
import concurrent.futures
import time

def send_request(request_id: int):
    """Send a single request."""
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model="meta-llama/Llama-2-7b-chat-hf",
            messages=[
                {"role": "user", "content": f"Tell me a fact about number {request_id}."}
            ],
            max_tokens=50,
        )
        elapsed = time.time() - start
        return {"id": request_id, "time": elapsed, "success": True}
    except Exception as e:
        elapsed = time.time() - start
        return {"id": request_id, "time": elapsed, "success": False, "error": str(e)}

# Run concurrent requests
NUM_REQUESTS = 10
NUM_WORKERS = 5

print(f"Sending {NUM_REQUESTS} concurrent requests with {NUM_WORKERS} workers...\n")

start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    futures = [executor.submit(send_request, i) for i in range(NUM_REQUESTS)]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

total_time = time.time() - start_time

# Analyze results
successful = sum(1 for r in results if r["success"])
latencies = [r["time"] for r in results if r["success"]]

print("LOAD TEST RESULTS")
print("="*80)
print(f"Total requests:     {NUM_REQUESTS}")
print(f"Successful:         {successful}")
print(f"Failed:             {NUM_REQUESTS - successful}")
print(f"Total time:         {total_time:.2f}s")
print(f"Requests/sec:       {NUM_REQUESTS/total_time:.2f}")
if latencies:
    print(f"Avg latency:        {sum(latencies)/len(latencies):.3f}s")
    print(f"Min latency:        {min(latencies):.3f}s")
    print(f"Max latency:        {max(latencies):.3f}s")
else:
    print("No successful requests; latency statistics unavailable.")
print("="*80)

Sending 10 concurrent requests with 5 workers...

LOAD TEST RESULTS
Total requests:     10
Successful:         10
Failed:             0
Total time:         6.69s
Requests/sec:       1.49
Avg latency:        3.054s
Min latency:        2.330s
Max latency:        3.377s
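Averages hide tail latency, which is often what users notice under load. As a small sketch, the helpers below compute nearest-rank percentiles from a list of latencies such as the `latencies` list built in the cell above. The function names are hypothetical, and with only 10 samples the high percentiles are rough indications at best.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(values: list[float]) -> dict:
    """Collect the latency figures most often tracked for serving workloads."""
    return {
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "max": max(values),
    }
```

Running `summarize_latencies(latencies)` after the load test gives a quick view of the spread. A larger `NUM_REQUESTS` makes the percentiles more meaningful.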


---
## 3. Monitoring and Logging

### Prometheus Metrics

vLLM exposes Prometheus metrics at `/metrics`:

In [23]:
# Fetch Prometheus metrics
try:
    response = requests.get(f"{API_URL}/metrics", timeout=5)
    metrics = response.text
    
    print("Sample metrics:\n")
    print("="*80)
    
    # Show the first 20 metric samples, skipping '#' comment lines
    samples = [line for line in metrics.split('\n') if line and not line.startswith('#')]
    for line in samples[:20]:
        print(line)
    
    print("...")
    print("="*80)
    
except Exception as e:
    print(f"Error fetching metrics: {e}")


# Prometheus metrics output explained:
#
# The metrics endpoint exposes various internal statistics in a structured text format.
# Each line either starts with '#' (metadata) or is a metric data record.
#
# - Lines starting with `# HELP` provide a description of the metric.
# - Lines starting with `# TYPE` indicate the metric type (`counter`, `gauge`, `histogram`, etc.).
# - Metrics lines themselves have the form: 
#     <metric_name>{label1="value1",...} <value>
#
# Common structure/examples:
#   # HELP process_resident_memory_bytes Resident memory size in bytes.
#   # TYPE process_resident_memory_bytes gauge
#   process_resident_memory_bytes 9.81323776e+08
#
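# Common Python-service and vLLM metrics:
# - A name such as `python_gc_objects_collected_total{generation="0"}` is the number of objects collected by the GC (a counter).
# - `{generation="0"}` is a label on the metric; labels slice the data along different dimensions. You will often see labels such as GPU/handler/method/endpoint.
# - The value (e.g. `10797.0`) is the reading at scrape time and can be used for Prometheus queries, monitoring, and alerting.
#
# Typical structure of a Prometheus metric entry:
#   - Description: # HELP
#   - Type: # TYPE
#   - Data: metric_name{labels} value
#
# Example vLLM metrics (for monitoring the inference server):
#   vllm:num_requests_running         # requests currently executing (gauge)
#   vllm:num_requests_waiting         # requests waiting to be scheduled (gauge)
#   vllm:e2e_request_latency_seconds  # end-to-end request latency (histogram)
#   vllm:gpu_cache_usage_perc         # GPU KV-cache usage (gauge)
#   vllm:generation_tokens_total      # total generated tokens (counter)
#
# You can build dashboards or alerts on these metrics, for example:
#   - A high num_requests_waiting means the server is congested or short of resources.
#   - If e2e_request_latency_seconds rises noticeably, check the model, hardware, and upstream load.
#   - When GPU cache/memory usage nears its limit, OOM errors or added latency become likely.


# ---
# What the Python runtime metrics above mean:
#
# 1. `python_gc_objects_collected_total{generation="X"}`: cumulative number of objects the Python garbage collector (GC) has reclaimed in generation X (0, 1, or 2). The GC groups objects into three generations by how long they have survived, 0 (young), 1 (middle), 2 (old), and collects them in tiers to reduce overhead.
#
# 2. `python_gc_objects_uncollectable_total{generation="X"}`: cumulative number of objects in generation X that the GC could not reclaim (usually reference cycles that cannot be finalized correctly). A value that stays at 0 suggests no leak from uncollectable objects.
#
# 3. `python_gc_collections_total{generation="X"}`: number of collections the GC has run for generation X. It reflects how often collection is triggered; a larger value means more GC runs.
#
# 4. `python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0`: version information for the Python runtime, here CPython 3.10.12. The value is the constant 1; the labels carry the information so monitoring systems can query the environment.
#
# These metrics help you monitor memory management and garbage collection in a Python application, and spot abnormal object build-up, reference cycles (uncollectable objects), or overly frequent GC runs, so you can tune resource allocation and reduce potential bottlenecks.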


Sample metrics:

python_gc_objects_collected_total{generation="0"} 10797.0
python_gc_objects_collected_total{generation="1"} 1616.0
python_gc_objects_collected_total{generation="2"} 209.0
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
python_gc_collections_total{generation="0"} 1273.0
python_gc_collections_total{generation="1"} 114.0
python_gc_collections_total{generation="2"} 9.0
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
process_virtual_memory_bytes 1.032974336e+010
process_resident_memory_bytes 9.81815296e+08
process_start_time_seconds 1.76161721182e+09
process_cpu_seconds_total 12.479999999999999
process_open_fds 51.0
process_max_fds 1.048576e+06
vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 0.0
vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat

### Key Metrics to Monitor

1. **Request Metrics**
   - `vllm:num_requests_running` - Active requests
   - `vllm:num_requests_waiting` - Queued requests
   - `vllm:request_success_total` - Successful requests

2. **Latency Metrics**
   - `vllm:time_to_first_token_seconds` - TTFT
   - `vllm:time_per_output_token_seconds` - ITL
   - `vllm:e2e_request_latency_seconds` - End-to-end latency

3. **GPU Metrics**
   - `vllm:gpu_cache_usage_perc` - KV cache utilization
   - `vllm:gpu_memory_usage_bytes` - GPU memory (may not be exposed by every vLLM version; check your server's `/metrics` output)

4. **Throughput Metrics**
   - `vllm:num_preemptions_total` - Request preemptions
   - `vllm:prompt_tokens_total` - Input tokens
   - `vllm:generation_tokens_total` - Output tokens
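To act on these metrics from a script, the `/metrics` text has to be turned into values first. The sketch below is a deliberately simplified parser for the text format shown earlier. The function name `parse_metrics` is hypothetical. It skips comment lines and anything it cannot parse (including lines that carry an optional timestamp), and it does not handle escaped quotes inside label values. A maintained parser is a better choice for production use.

```python
import re

# name, optional {labels}, then a single numeric value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Return (metric_name, labels, value) for each parseable sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE metadata
        match = _SAMPLE.match(line)
        if match is None:
            continue  # truncated or unsupported line
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, value))
    return samples
```

With the parsed samples, a check such as "alert when `vllm:num_requests_waiting` exceeds a threshold" becomes a simple filter over the list.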

### Custom Logging

In [25]:
# Simple request logger
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)
logger = logging.getLogger(__name__)

def logged_request(prompt: str, **kwargs):
    """Make request with logging."""
    request_id = datetime.now().strftime("%Y%m%d%H%M%S%f")
    
    logger.info(f"Request {request_id} started")
    logger.info(f"  Prompt: {prompt[:50]}...")
    
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model="meta-llama/Llama-2-7b-chat-hf",
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        
        elapsed = time.time() - start
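        # total_tokens counts prompt + completion tokens, so the throughput
        # logged below overstates pure generation speed.
        tokens = response.usage.total_tokens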
        
        logger.info(f"Request {request_id} completed")
        logger.info(f"  Time: {elapsed:.3f}s")
        logger.info(f"  Tokens: {tokens}")
        logger.info(f"  Throughput: {tokens/elapsed:.1f} tokens/s")
        
        return response
        
    except Exception as e:
        elapsed = time.time() - start
        logger.error(f"Request {request_id} failed")
        logger.error(f"  Error: {e}")
        logger.error(f"  Time: {elapsed:.3f}s")
        raise

# Test logged request
print("Testing logged request:\n")
response = logged_request("What is Python?", max_tokens=50)

2025-10-28 10:34:50,325 - INFO - Request 20251028103450325778 started
2025-10-28 10:34:50,326 - INFO -   Prompt: What is Python?...


Testing logged request:



2025-10-28 10:34:53,546 - INFO - HTTP Request: POST http://localhost:8000/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-28 10:34:53,547 - INFO - Request 20251028103450325778 completed
2025-10-28 10:34:53,547 - INFO -   Time: 3.220s
2025-10-28 10:34:53,547 - INFO -   Tokens: 62
2025-10-28 10:34:53,548 - INFO -   Throughput: 19.3 tokens/s


---
## 4. Deployment Best Practices

### Docker Deployment

#### Dockerfile Example

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM
RUN pip3 install vllm

# Download model (optional, can mount volume instead)
# RUN python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-2-7b-hf')"

# Expose port
EXPOSE 8000

# Start server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-2-7b-hf", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
```

#### Build and Run

```bash
# Build
docker build -t vllm-server .

# Run
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm-server
```

### Kubernetes Deployment

#### Deployment YAML

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "4"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
```

### Health Checks

In [None]:
# Client-side health check
def check_health():
    """Check if vLLM server is healthy."""
    try:
        # Check health endpoint
        response = requests.get(f"{API_URL}/health", timeout=5)
        if response.status_code != 200:
            return False, "Health check failed"
        
        # Test with a small request; the model name must match the one the server is serving
        client.chat.completions.create(
            model="meta-llama/Llama-2-7b-chat-hf",
            messages=[{"role": "user", "content": "test"}],
            max_tokens=5,
        )
        
        return True, "Healthy"
        
    except Exception as e:
        return False, str(e)

# Run health check
is_healthy, message = check_health()

if is_healthy:
    print("✅ Server is healthy")
else:
    print(f"❌ Server health check failed: {message}")

### Production Checklist

Before deploying to production:

- [ ] **Performance Testing**
  - [ ] Load testing completed
  - [ ] Latency benchmarks acceptable
  - [ ] Memory usage stable

- [ ] **Monitoring**
  - [ ] Prometheus metrics configured
  - [ ] Grafana dashboards set up
  - [ ] Alerts configured

- [ ] **Reliability**
  - [ ] Health checks implemented
  - [ ] Auto-restart on failure
  - [ ] Load balancing configured

- [ ] **Security**
  - [ ] API authentication enabled
  - [ ] Rate limiting configured
  - [ ] Input validation implemented

- [ ] **Scalability**
  - [ ] Horizontal scaling tested
  - [ ] Auto-scaling configured
  - [ ] Resource limits set

---
## 5. Common Issues and Solutions

### Issue 1: Out of Memory

**Symptoms**: CUDA OOM errors

**Solutions**:
```bash
# Reduce GPU memory utilization
--gpu-memory-utilization 0.8

# Reduce max sequences
--max-num-seqs 16

# Reduce context length
--max-model-len 1024
```

### Issue 2: High Latency

**Symptoms**: Slow response times

**Solutions**:
```bash
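# If requests queue up (high vllm:num_requests_waiting), allow more concurrent sequences.
# Larger batches raise throughput but can slow each individual request.
--max-num-seqs 64

# Allow more tokens per scheduling step (larger prefill batches)
--max-num-batched-tokens 16384

# Enable tensor parallelism (multi-GPU)
--tensor-parallel-size 2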
```

### Issue 3: Request Timeouts

**Symptoms**: Requests timing out

**Solutions**:
- Reduce `max_tokens` in requests
- Increase server timeout settings
- Scale horizontally with load balancer
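The link between `max_tokens` and timeouts can be estimated roughly: a request takes about the time to first token plus one inter-token interval per generated token. The sketch below encodes that approximation. The function names are hypothetical, the timing figures in the usage note are made-up placeholders rather than measurements, and the model ignores queueing delay, which grows under load.

```python
def estimate_request_seconds(max_tokens: int, ttft_s: float,
                             seconds_per_token: float) -> float:
    """Rough worst-case duration of a request that generates `max_tokens` tokens."""
    return ttft_s + max_tokens * seconds_per_token


def max_tokens_within(timeout_s: float, ttft_s: float,
                      seconds_per_token: float) -> int:
    """Largest max_tokens expected to finish inside `timeout_s`."""
    if timeout_s <= ttft_s:
        return 0
    return int((timeout_s - ttft_s) / seconds_per_token)
```

For instance, with an assumed 1.0 s time to first token and 0.25 s per output token, `max_tokens_within(10, 1.0, 0.25)` suggests keeping `max_tokens` at or below 36 for a 10-second timeout. Measured values from `vllm:time_to_first_token_seconds` and `vllm:time_per_output_token_seconds` should replace the placeholders.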

### Issue 4: Inconsistent Performance

**Symptoms**: Variable latency

**Solutions**:
- Check for competing GPU processes
- Monitor GPU temperature throttling
- Ensure stable power supply
- Use dedicated GPU instances

---
## Summary

✅ **Completed Lab-2.1**:
1. Deployed OpenAI-compatible API server
2. Tested completions and chat APIs
3. Performed load testing
4. Set up monitoring and logging
5. Learned deployment best practices

📊 **Key Achievements**:
- Production-ready vLLM deployment
- OpenAI API compatibility
- Performance monitoring setup
- Understanding of optimization parameters

🎓 **Skills Acquired**:
- vLLM installation and configuration
- PagedAttention understanding
- Batch inference optimization
- Advanced sampling strategies
- Production deployment

---

## Next Steps

Continue with:
- **Lab-2.2**: Inference Optimization Techniques
- **Lab-2.3**: FastAPI Service Construction
- **Lab-2.4**: Production Environment Deployment

---

## Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [vLLM GitHub](https://github.com/vllm-project/vllm)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

In [None]:
# Final cleanup
print("✅ Lab-2.1 Complete!")
print("\nCongratulations on mastering vLLM deployment! 🎉")