# Lab-2.3 Part 4: Monitoring and Deployment

## Objectives
- Add Prometheus metrics
- Implement structured logging
- Create Docker containers
- Set up health checks

## Estimated Time: 60-90 minutes

---
## 1. Prometheus Metrics

In [None]:
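# Check that prometheus_client is installed and report its version.
# The version is read from package metadata because the module may not expose __version__.
from importlib.metadata import PackageNotFoundError, version

try:
    import prometheus_client  # noqa: F401
    print(f"✅ prometheus_client: {version('prometheus-client')}")
except (ImportError, PackageNotFoundError):
    print("❌ Install: pip install prometheus-client")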

In [None]:
%%writefile app_metrics.py
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from vllm import LLM, SamplingParams
import time

app = FastAPI()

# Prometheus metrics
request_counter = Counter(
    'llm_requests_total',
    'Total number of requests',
    ['endpoint', 'status']
)

request_duration = Histogram(
    'llm_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

tokens_generated = Counter(
    'llm_tokens_generated_total',
    'Total tokens generated'
)

active_requests = Gauge(
    'llm_active_requests',
    'Number of active requests'
)

# Global engine
llm_engine = None

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.on_event("startup")
async def startup():
    global llm_engine
    llm_engine = LLM(model="gpt2", gpu_memory_utilization=0.3)
    print("✅ Engine loaded")

@app.post("/generate")
async def generate(request: GenerateRequest):
    """Generate with metrics tracking."""
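    active_requests.inc()

    try:
        # perf_counter() is monotonic, so it is a safer choice than time.time() for durations
        start = time.perf_counter()

        sampling_params = SamplingParams(
            max_tokens=request.max_tokens,
            temperature=0.8
        )

        # Note: LLM.generate() is synchronous and blocks the event loop while it runs,
        # so other requests (including /metrics) wait until generation finishes.
        outputs = llm_engine.generate([request.prompt], sampling_params)

        duration = time.perf_counter() - start
        generated_tokens = len(outputs[0].outputs[0].token_ids)

        # Update metrics
        request_counter.labels(endpoint='generate', status='success').inc()
        request_duration.labels(endpoint='generate').observe(duration)
        tokens_generated.inc(generated_tokens)

        return {
            "text": outputs[0].outputs[0].text,
            "tokens": generated_tokens,
            "duration": duration
        }

    except Exception:
        # Count the failure, then re-raise so FastAPI returns a 500 response
        request_counter.labels(endpoint='generate', status='error').inc()
        raise
    finally:
        # Runs on both success and failure, so the gauge never drifts upward
        active_requests.dec()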

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
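
The `request_duration` histogram above uses cumulative buckets: each bucket counts every observation less than or equal to its upper bound (`le`), and an implicit `+Inf` bucket counts all observations. The following is a minimal standard-library sketch of that counting rule, not the `prometheus_client` implementation; the helper name `cumulative_bucket_counts` and the sample durations are made up for illustration.

```python
import math

# Same upper bounds as the request_duration histogram in app_metrics.py
BUCKETS = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]


def cumulative_bucket_counts(durations, buckets=BUCKETS):
    """Count observations per cumulative bucket (hypothetical helper).

    Each bucket counts all durations <= its upper bound; +Inf counts everything.
    """
    bounds = list(buckets) + [math.inf]
    return {bound: sum(1 for d in durations if d <= bound) for bound in bounds}


# Illustrative durations in seconds; the last one exceeds the largest finite bucket
sample = [0.05, 0.3, 0.7, 1.5, 12.0]
counts = cumulative_bucket_counts(sample)

for bound, count in counts.items():
    label = "+Inf" if math.isinf(bound) else bound
    print(f'le="{label}": {count}')
```

A request slower than 10 seconds only shows up in the `+Inf` bucket, so if long generations are common, adding larger bucket bounds is worth considering.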

---
## 2. Structured Logging

In [None]:
%%writefile app_logging.py
from fastapi import FastAPI, Request
import logging
import json
import time
from datetime import datetime, timezone

app = FastAPI()

# Configure structured logging: each record is emitted as one JSON object per line
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    """Log all requests."""
    # perf_counter() is monotonic, which makes it suitable for measuring elapsed time
    start_time = time.perf_counter()

    # Log request
    logger.info(f"Request: {request.method} {request.url.path}")

    # Process request
    response = await call_next(request)

    # Log response
    duration = time.perf_counter() - start_time
    logger.info(f"Response: {response.status_code} ({duration:.3f}s)")
    
    # Add custom headers
    response.headers["X-Process-Time"] = str(duration)
    
    return response

@app.get("/test")
async def test():
    logger.info("Test endpoint called")
    return {"message": "test"}
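
The middleware above packs the method, path, status code, and duration into the message string, which makes them hard to filter on in a log aggregator. One possible refinement, sketched here with only the standard library, is to pass those values through the `extra=` argument of the logging call and have the formatter copy them into the JSON object. The class name `ExtraJSONFormatter`, the field names, and the logger name are assumptions for illustration, not part of the lab code.

```python
import io
import json
import logging
from datetime import datetime, timezone


class ExtraJSONFormatter(logging.Formatter):
    """JSON formatter that also emits selected `extra=` fields (hypothetical)."""

    # Field names chosen for this sketch; they must not clash with LogRecord attributes
    EXTRA_FIELDS = ("method", "path", "status_code", "duration_s")

    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Values passed via `extra=` become attributes on the record
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                log_data[field] = getattr(record, field)
        return json.dumps(log_data)


# Write to an in-memory stream so the output can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(ExtraJSONFormatter())

demo_logger = logging.getLogger("structured_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.info(
    "request completed",
    extra={"method": "GET", "path": "/test", "status_code": 200, "duration_s": 0.012},
)

parsed = json.loads(stream.getvalue())
print(parsed["path"], parsed["status_code"])
```

In the middleware, the same idea would mean calling `logger.info("request completed", extra={...})` once per request instead of formatting the values into the message.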

---
## 3. Docker Deployment

### Dockerfile

In [None]:
%%writefile Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and curl (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY app_vllm.py .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run application
CMD ["uvicorn", "app_vllm:app", "--host", "0.0.0.0", "--port", "8000"]

In [None]:
%%writefile requirements.txt
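# Lower bounds instead of exact pins, so pip can select versions
# compatible with vLLM's own dependency requirements
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
vllm>=0.6.0
prometheus-client>=0.19.0
pydantic>=2.5.0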

### Docker Commands

```bash
# Build image
docker build -t vllm-api:latest .

# Run container
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-api:latest

# Check logs
docker logs -f <container_id>
```
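
Model loading can take a while after `docker run`, so a script that talks to the API usually needs to wait for readiness first. Below is a small standard-library sketch of such a wait loop. It assumes the service answers `GET /health` with status 200 once ready, as the Dockerfile's `HEALTHCHECK` implies; the function name `wait_until_healthy` and its default timings are illustrative.

```python
import time
import urllib.request


def wait_until_healthy(url, timeout_s=120.0, interval_s=2.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP error statuses
            # (urllib's URLError and HTTPError are OSError subclasses)
            pass
        time.sleep(interval_s)
    return False


# Example usage once the container is running:
#   wait_until_healthy("http://localhost:8000/health")
```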

### Docker Compose

In [None]:
%%writefile docker-compose.yml
version: '3.8'

services:
  vllm-api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

In [None]:
%%writefile prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm-api'
    static_configs:
      - targets: ['vllm-api:8000']
    metrics_path: '/metrics'
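
Before relying on the scrape job above, it can help to confirm that the `/metrics` response actually contains the expected metric names. The sketch below parses Prometheus text exposition output with the standard library only; the helper names `sample_names` and `missing_metrics` and the sample text are assumptions for illustration. Note that a labelled metric produces no sample lines until one of its label combinations has been used at least once.

```python
import re

# A sample line starts with the metric name, optionally followed by {labels}
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def sample_names(exposition_text):
    """Collect metric names from sample lines, skipping # HELP / # TYPE comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected metric names that have no sample line, sorted."""
    return sorted(set(expected) - sample_names(exposition_text))


# Illustrative /metrics excerpt (values are made up)
example = """\
# HELP llm_requests_total Total number of requests
# TYPE llm_requests_total counter
llm_requests_total{endpoint="generate",status="success"} 3.0
# HELP llm_active_requests Number of active requests
# TYPE llm_active_requests gauge
llm_active_requests 0.0
"""

print(missing_metrics(example, ["llm_requests_total", "llm_tokens_generated_total"]))
```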

---
## 4. Production Checklist

### Deployment Checklist

#### Infrastructure ✅
- [ ] GPU drivers and CUDA installed
- [ ] Docker and NVIDIA Container Toolkit installed (required for `--gpus`)
- [ ] Sufficient disk space for models
- [ ] Network firewall configured

#### Application ✅
- [ ] Environment variables configured
- [ ] Model downloaded and cached
- [ ] API keys and authentication configured
- [ ] Rate limiting enabled

#### Monitoring ✅
- [ ] Prometheus metrics exposed
- [ ] Grafana dashboards created
- [ ] Alert rules configured
- [ ] Log aggregation set up

#### Reliability ✅
- [ ] Health checks implemented
- [ ] Graceful shutdown handling
- [ ] Auto-restart on failure
- [ ] Backup and disaster recovery

#### Performance ✅
- [ ] Load testing completed
- [ ] Latency benchmarks acceptable
- [ ] Memory usage optimized
- [ ] Auto-scaling configured

---
## Summary

✅ **Completed Lab-2.3**:
1. Built complete FastAPI service
2. Implemented async processing
3. Integrated vLLM backend
4. Added Prometheus monitoring
5. Created Docker deployment

🎓 **Skills Mastered**:
- FastAPI development
- Async/await patterns
- Streaming responses (SSE, WebSocket)
- vLLM integration
- Prometheus metrics
- Docker containerization

📊 **Production-Ready Features**:
- OpenAI-compatible API
- High-performance vLLM backend
- Concurrent request handling
- Monitoring and observability
- Container deployment

---

## Next Steps

Continue with:
- **Lab-2.4**: Production Environment Deployment
- **Lab-2.5**: Performance Monitoring and Tuning

---

## Resources

- [FastAPI Docs](https://fastapi.tiangolo.com/)
- [vLLM Server Guide](https://docs.vllm.ai/en/latest/serving/)
- [Prometheus Python Client](https://github.com/prometheus/client_python)
- [Docker Best Practices](https://docs.docker.com/develop/dev-best-practices/)

In [None]:
print("\n" + "=" * 80)
print("🎉 Congratulations! Lab-2.3 Complete!")
print("=" * 80)
print("\nYou've built a production-ready LLM API service! 🚀")