# Lab-2.4 Part 2: Deployment Implementation

## Objectives
- Create optimized Docker images
- Design Kubernetes deployment manifests
- Implement auto-scaling configuration
- Set up model registry and version management
- Design CI/CD pipeline

## Estimated Time: 60-120 minutes

---
## 1. Docker Optimization

In [None]:
# Generate optimized Dockerfile
import os

# Multi-stage Dockerfile for production
dockerfile_content = '''
# Multi-stage Dockerfile for vLLM + FastAPI service
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS builder

# Install system dependencies
RUN apt-get update && apt-get install -y \\
    python3.10 \\
    python3.10-dev \\
    python3-pip \\
    git \\
    curl \\
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip and install build tooling (no virtual environment is used)
RUN python3.10 -m pip install --upgrade pip
RUN python3.10 -m pip install wheel setuptools

# Install Python dependencies
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Compile flash-attention (optional; delete this step to skip the lengthy build)
RUN pip install flash-attn --no-build-isolation

# Production stage
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS production

# Install minimal runtime dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \\
    python3.10 \\
    python3.10-distutils \\
    curl \\
    && rm -rf /var/lib/apt/lists/*

# Copy Python environment from builder
COPY --from=builder /usr/local/lib/python3.10 /usr/local/lib/python3.10
COPY --from=builder /usr/local/bin /usr/local/bin

# Create application user
RUN useradd --create-home --shell /bin/bash app
USER app
WORKDIR /home/app

# Copy application code
COPY --chown=app:app src/ ./src/
COPY --chown=app:app config/ ./config/

# Environment variables
ENV PYTHONPATH=/home/app/src
ENV CUDA_VISIBLE_DEVICES=0

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \\
  CMD curl -f http://localhost:8000/health || exit 1

# Expose port
EXPOSE 8000

# Start command
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
'''

# Save Dockerfile
with open('Dockerfile.production', 'w') as f:
    f.write(dockerfile_content)

print("✅ Generated optimized Dockerfile")
print("\nKey optimizations:")
print("• Multi-stage build (reduces image size)")
print("• Non-root user (security)")
print("• Health check (container health)")
print("• Minimal runtime dependencies")
print("• Layer caching optimization")
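
To confirm that the generated file really is a multi-stage build, the stages can be listed from its text. The following is a minimal sketch, not part of the original lab: `list_stages` is a hypothetical helper, and the regular expression deliberately ignores `FROM` flags such as `--platform`.

```python
import re

# Matches "FROM <image> [AS <name>]" on a single line.
# Assumption: no flags (e.g. --platform=...) appear between FROM and the image.
FROM_RE = re.compile(
    r"^[ \t]*FROM[ \t]+(\S+)(?:[ \t]+AS[ \t]+(\S+))?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def list_stages(dockerfile_text):
    """Return (base_image, stage_name_or_None) for every FROM instruction."""
    return [(m.group(1), m.group(2)) for m in FROM_RE.finditer(dockerfile_text)]

# Trimmed-down stand-in for the Dockerfile generated above.
sample = """
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS builder
RUN echo build
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS production
COPY --from=builder /usr/local/bin /usr/local/bin
"""

for image, name in list_stages(sample):
    print(f"{name}: {image}")
```

In the notebook, passing `dockerfile_content` instead of `sample` should report the `builder` and `production` stages.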

In [None]:
# Generate requirements.txt for Docker build
requirements_txt = '''
# Core inference
vllm>=0.6.0
torch>=2.5.1
transformers>=4.57.0

# Web framework
fastapi>=0.104.0
uvicorn[standard]>=0.24.0

# Monitoring
prometheus-client>=0.19.0

# Utilities
pydantic>=2.0.0
httpx>=0.25.0
python-json-logger>=2.0.0

# Optional optimizations
# flash-attn --no-build-isolation
bitsandbytes>=0.48.1
'''

with open('requirements.txt', 'w') as f:
    f.write(requirements_txt.strip())

print("✅ Generated requirements.txt")

In [None]:
# Generate docker-compose.yml for local development
docker_compose = '''
version: '3.8'

services:
  llm-service:
    build:
      context: .
      dockerfile: Dockerfile.production
    ports:
      - "8000:8000"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_GPU_MEMORY_UTILIZATION=0.9
      - VLLM_MAX_MODEL_LEN=2048
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - model-cache:/home/app/.cache
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 2m

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana:/etc/grafana/provisioning

volumes:
  model-cache:
  prometheus-data:
  grafana-data:
'''

with open('docker-compose.yml', 'w') as f:
    f.write(docker_compose.strip())

print("✅ Generated docker-compose.yml for local testing")
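
Once the stack is started, the service can take a while to load the model (the health check above allows a two-minute start period). A minimal sketch of a polling helper is shown below; `http_ok` and `wait_until_healthy` are hypothetical names, and the URL in the usage comment assumes the port mapping from the compose file.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_healthy(check, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `check()` until it returns True; return the attempt number, or None on failure."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None

# Demonstration with a stand-in check that succeeds on the third call.
answers = iter([False, False, True])
print(wait_until_healthy(lambda: next(answers), sleep=lambda _: None))

# Against the running compose stack (assumes the service listens on localhost:8000):
# wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))
```

The `check` and `sleep` parameters are injectable so the loop can be exercised without a running container.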

---
## 2. Kubernetes Deployment
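
The cells in this section read `resources['instances']` and `requirements.max_context_length`, which are not defined anywhere in this notebook and presumably come from Part 1. When running this notebook on its own, a sketch like the following supplies stand-ins. The values `2` and `4096` are placeholder assumptions, not recommendations from the lab.

```python
from types import SimpleNamespace

# Placeholder values (assumptions): replace them with the results from Part 1.
try:
    resources
except NameError:
    resources = {'instances': 2}

try:
    requirements
except NameError:
    requirements = SimpleNamespace(max_context_length=4096)

print(f"instances={resources['instances']}, "
      f"max_context_length={requirements.max_context_length}")
```

Existing definitions are left untouched, so the cell is safe to run after Part 1 as well.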

In [None]:
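# Generate Kubernetes deployment manifest
# Note: `resources` and `requirements` must already be defined (this notebook does not create them)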
k8s_deployment = f'''
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  namespace: production
  labels:
    app: llm-service
    version: v1.0.0
spec:
  replicas: {resources['instances']}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero downtime deployment
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: llm-service
        image: your-registry/llm-service:v1.0.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "48Gi"
            cpu: "12"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.9"
        - name: VLLM_MAX_MODEL_LEN
          value: "{requirements.max_context_length}"
        - name: MODEL_NAME
          valueFrom:
            configMapKeyRef:
              name: llm-config
              key: model_name
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        volumeMounts:
        - name: model-cache
          mountPath: /home/app/.cache
        - name: config
          mountPath: /home/app/config
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      - name: config
        configMap:
          name: llm-config
      nodeSelector:
        node-type: gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
'''

# Save deployment manifest
os.makedirs('k8s', exist_ok=True)
with open('k8s/deployment.yaml', 'w') as f:
    f.write(k8s_deployment.strip())

print("✅ Generated Kubernetes Deployment manifest")
print(f"   Replicas: {resources['instances']}")
print("   Features: Rolling updates, health checks, resource limits")
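
The `maxSurge: 1` / `maxUnavailable: 0` strategy keeps every existing pod serving until its replacement is ready, which means the cluster must be able to schedule one extra GPU pod during a rollout. The sketch below works through that arithmetic; `rollout_bounds` is a hypothetical helper and only handles absolute values, not percentages.

```python
def rollout_bounds(replicas, max_surge=1, max_unavailable=0, gpus_per_pod=1):
    """Pod-count envelope during a RollingUpdate (absolute values only)."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination because the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    max_total = replicas + max_surge
    return {
        "min_available": replicas - max_unavailable,  # pods guaranteed to keep serving
        "max_total": max_total,                       # old + new pods at the peak
        "peak_gpus": max_total * gpus_per_pod,        # GPUs needed at the peak
    }

print(rollout_bounds(replicas=2))
```

With two replicas and one GPU per pod, the rollout briefly needs three GPUs, so a node pool sized for exactly two would stall the update.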

In [None]:
# Generate Service manifest
k8s_service = '''
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  namespace: production
  labels:
    app: llm-service
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
    name: http
  selector:
    app: llm-service
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
  namespace: production
data:
  model_name: "meta-llama/Llama-2-7b-hf"
  max_model_len: "4096"
  tensor_parallel_size: "1"
  gpu_memory_utilization: "0.9"
  max_num_seqs: "32"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce  # Mountable from one node only; replicas on other nodes need ReadWriteMany-capable storage
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3
'''

with open('k8s/service.yaml', 'w') as f:
    f.write(k8s_service.strip())

print("✅ Generated Kubernetes Service, ConfigMap, and PersistentVolumeClaim")
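
As a sanity check on the 50Gi cache volume, the size of the model weights can be estimated from the parameter count. This is a rough sketch: `weights_size_gib` is a hypothetical helper, the 7-billion figure is an approximation for the model named in the ConfigMap, and tokenizer files or multiple cached revisions are not counted.

```python
def weights_size_gib(num_params, bytes_per_param=2):
    """Approximate on-disk size of model weights in GiB (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param / 2**30

# Assumption: roughly 7e9 parameters stored in 16-bit precision.
size = weights_size_gib(7e9)
print(f"Estimated weights: {size:.1f} GiB of a 50 GiB volume")
```

Under these assumptions the weights take about a quarter of the volume, leaving room for a second model revision during upgrades.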

---
## 3. Auto-Scaling Configuration

In [None]:
# Horizontal Pod Autoscaler (HPA)
hpa_manifest = f'''
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: {max(1, resources['instances'] // 2)}
  maxReplicas: {resources['instances'] * 2}
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods  # Requires a custom metrics adapter (e.g. prometheus-adapter) that exposes this metric
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "50"  # Scale up if > 50 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
'''

with open('k8s/hpa.yaml', 'w') as f:
    f.write(hpa_manifest.strip())

print("✅ Generated HPA configuration")
print(f"   Min replicas: {max(1, resources['instances'] // 2)}")
print(f"   Max replicas: {resources['instances'] * 2}")
print("   Metrics: Memory utilization, Requests per second")
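
The HPA derives its replica count from the ratio of the observed metric to the target: `desiredReplicas = ceil(currentReplicas * currentValue / targetValue)`, skipping changes when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule to the requests-per-second target; `desired_replicas` is a hypothetical helper that ignores the `behavior` policies and stabilization windows configured above.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Core HPA scaling rule, clamped to the configured replica range."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 75 RPS each against a 50 RPS target.
print(desired_replicas(4, current_value=75, target_value=50,
                       min_replicas=2, max_replicas=8))
```

When several metrics are configured, as in this manifest, the HPA evaluates each one separately and uses the largest resulting replica count.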

In [None]:
# Vertical Pod Autoscaler (VPA) - optional
vpa_manifest = '''
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  updatePolicy:
    updateMode: "Off"  # Recommendation-only; "Auto" would conflict with the memory-based HPA on this Deployment
  resourcePolicy:
    containerPolicies:
    - containerName: llm-service
      minAllowed:
        cpu: 4
        memory: 16Gi
      maxAllowed:
        cpu: 16
        memory: 64Gi
      controlledResources: ["cpu", "memory"]
'''

with open('k8s/vpa.yaml', 'w') as f:
    f.write(vpa_manifest.strip())

print("✅ Generated VPA configuration")
print("   Recommends CPU/Memory requests based on usage (updateMode: Off)")

---
## 4. Load Balancer and Ingress

In [None]:
# Generate Ingress configuration
ingress_manifest = '''
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-service-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-body-size: "32m"
    nginx.ingress.kubernetes.io/limit-rpm: "100"  # Requests per minute per client IP
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.yourcompany.com
    secretName: llm-service-tls
  rules:
  - host: api.yourcompany.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80
      - path: /health
        pathType: Exact
        backend:
          service:
            name: llm-service
            port:
              number: 80
      - path: /metrics
        pathType: Exact
        backend:
          service:
            name: llm-service
            port:
              number: 80
'''

with open('k8s/ingress.yaml', 'w') as f:
    f.write(ingress_manifest.strip())

print("✅ Generated Ingress configuration")
print("   Features: SSL termination, rate limiting, health checks")
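
The rules above mix `Prefix` and `Exact` path types, which match differently: `Prefix` compares whole path segments, while `Exact` requires an identical string. The following sketch illustrates the difference; `ingress_path_matches` is a hypothetical helper that mirrors the documented semantics and does not cover `ImplementationSpecific`.

```python
def ingress_path_matches(path_type, rule_path, request_path):
    """Return True if `request_path` matches an Ingress rule of the given path type."""
    if path_type == "Exact":
        return request_path == rule_path
    if path_type == "Prefix":
        # Compare element by element, so /v1 matches /v1/chat but not /v1beta.
        rule_parts = [p for p in rule_path.split("/") if p]
        request_parts = [p for p in request_path.split("/") if p]
        return request_parts[:len(rule_parts)] == rule_parts
    raise ValueError(f"unsupported pathType: {path_type}")

for path in ["/v1/completions", "/v1", "/v1beta/completions"]:
    print(path, ingress_path_matches("Prefix", "/v1", path))
print("/health/", ingress_path_matches("Exact", "/health", "/health/"))
```

A request to `/health/` with a trailing slash therefore misses the `Exact` rule, which is worth remembering when configuring external uptime checks.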

---
## 5. Helm Chart Structure

In [None]:
# Generate Helm chart structure
helm_values = f'''
# Default values for llm-service Helm chart

replicaCount: {resources['instances']}

image:
  repository: your-registry/llm-service
  pullPolicy: Always
  tag: "v1.0.0"

model:
  name: "meta-llama/Llama-2-7b-hf"
  maxModelLen: {requirements.max_context_length}
  tensorParallelSize: 1
  gpuMemoryUtilization: 0.9
  maxNumSeqs: 32

service:
  type: ClusterIP
  port: 80
  targetPort: 8000

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rpm: "100"
  hosts:
    - host: api.yourcompany.com
      paths:
        - path: /v1
          pathType: Prefix
  tls:
    - secretName: llm-service-tls
      hosts:
        - api.yourcompany.com

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 48Gi
    cpu: 12
  requests:
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: 8

autoscaling:
  enabled: true
  minReplicas: {max(1, resources['instances'] // 2)}
  maxReplicas: {resources['instances'] * 2}
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

nodeSelector:
  node-type: gpu

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

persistence:
  enabled: true
  storageClass: "gp3"
  size: 50Gi

monitoring:
  prometheus:
    enabled: true
    path: /metrics
    port: 8000

# Security settings
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000

# Pod disruption budget
podDisruptionBudget:
  enabled: true
  minAvailable: 1
'''

os.makedirs('helm/llm-service', exist_ok=True)
with open('helm/llm-service/values.yaml', 'w') as f:
    f.write(helm_values.strip())

print("✅ Generated Helm values.yaml")
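
The entries in `values.yaml` are defaults. Helm's `--set key.subkey=value` flag overrides individual entries at install time by following the dotted path into the nested structure. The sketch below imitates that merge for simple keys; `apply_set` is a hypothetical helper that does not handle Helm's list indices, escaping, or type coercion.

```python
import copy

def apply_set(values, dotted_key, value):
    """Return a copy of `values` with the entry at `dotted_key` replaced."""
    result = copy.deepcopy(values)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Assumption: intermediate entries are dictionaries (or absent).
        node = node.setdefault(key, {})
    node[leaf] = value
    return result

# Small excerpt of the defaults generated above.
defaults = {
    "replicaCount": 2,
    "image": {"repository": "your-registry/llm-service", "tag": "v1.0.0"},
}

staging = apply_set(apply_set(defaults, "image.tag", "abc1234"), "replicaCount", 1)
print(staging)
```

Only the named leaves change; sibling keys such as `image.repository` keep their default values.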

In [None]:
# Generate Chart.yaml
chart_yaml = '''
apiVersion: v2
name: llm-service
description: Production LLM serving with vLLM and FastAPI

version: 1.0.0
appVersion: "v1.0.0"

maintainers:
  - name: LLM Team
    email: llm-team@yourcompany.com

dependencies:
  - name: prometheus
    version: "25.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: prometheus.enabled
  
  - name: grafana
    version: "8.x.x"
    repository: https://grafana.github.io/helm-charts
    condition: grafana.enabled

keywords:
  - llm
  - inference
  - vllm
  - fastapi
  - ai

home: https://github.com/yourorg/llm-service
sources:
  - https://github.com/vllm-project/vllm
  - https://fastapi.tiangolo.com/
'''

with open('helm/llm-service/Chart.yaml', 'w') as f:
    f.write(chart_yaml.strip())

print("✅ Generated Helm Chart.yaml")
print("   Includes dependencies for Prometheus and Grafana")

---
## 6. CI/CD Pipeline Design

In [None]:
# Generate GitHub Actions workflow
github_workflow = '''
name: LLM Service CI/CD

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/llm-service

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-asyncio

    - name: Run tests
      run: |
        pytest tests/ -v

    - name: Lint code
      run: |
        pip install black flake8
        black --check src/
        flake8 src/

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha,prefix={{branch}}-
          type=raw,value=latest,enable={{is_default_branch}}

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        file: Dockerfile.production
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Configure kubectl
      run: |
        mkdir -p $HOME/.kube
        echo "${{ secrets.KUBECONFIG }}" | base64 -d > $HOME/.kube/config

    - name: Install Helm
      uses: azure/setup-helm@v3
      with:
        version: '3.12.0'

    - name: Deploy to staging
      run: |
        # The tag must match the build job's sha tag: <branch>-<7-character short SHA>
        helm upgrade --install llm-service-staging ./helm/llm-service \\
          --namespace staging \\
          --set image.tag=main-${GITHUB_SHA:0:7} \\
          --set replicaCount=1

    - name: Run smoke tests
      run: |
        kubectl wait --for=condition=available --timeout=300s deployment/llm-service -n staging
        python tests/smoke_test.py --host staging.internal

    - name: Deploy to production
      if: success()
      run: |
        helm upgrade --install llm-service-prod ./helm/llm-service \\
          --namespace production \\
          --set image.tag=main-${GITHUB_SHA:0:7}
'''

os.makedirs('.github/workflows', exist_ok=True)
with open('.github/workflows/ci-cd.yml', 'w') as f:
    f.write(github_workflow.strip())

print("✅ Generated GitHub Actions CI/CD pipeline")
print("\nPipeline stages:")
print("1. Test: Run unit tests and linting")
print("2. Build: Create and push Docker image")
print("3. Deploy: Deploy to staging, run smoke tests, deploy to production")

# Generate deployment script
deploy_script = f'''
#!/bin/bash
# Production deployment script

set -euo pipefail

# Configuration
NAMESPACE="production"
RELEASE_NAME="llm-service-prod"
IMAGE_TAG="${{1:-latest}}"

echo "Deploying LLM service to production..."
echo "Namespace: $NAMESPACE"
echo "Release: $RELEASE_NAME"
echo "Image tag: $IMAGE_TAG"

# Create namespace if not exists
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -

# Deploy with Helm
helm upgrade --install $RELEASE_NAME ./helm/llm-service \\
  --namespace $NAMESPACE \\
  --set image.tag=$IMAGE_TAG \\
  --set replicaCount={resources['instances']} \\
  --wait \\
  --timeout 10m

# Wait for rollout
kubectl rollout status deployment/llm-service -n $NAMESPACE

# Check health
kubectl wait --for=condition=available --timeout=300s deployment/llm-service -n $NAMESPACE

echo "✅ Deployment completed successfully!"
echo "\nService endpoints:"
kubectl get ingress -n $NAMESPACE
'''

with open('deploy.sh', 'w') as f:
    f.write(deploy_script.strip())
    
os.chmod('deploy.sh', 0o755)

print("\n✅ Generated deployment script (deploy.sh)")
print("   Usage: ./deploy.sh [image-tag]")
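
The tag passed to `helm --set image.tag=...` or to `./deploy.sh` has to be one that the build job actually pushed. The `type=sha,prefix={{branch}}-` rule in the metadata step produces tags of the form `<branch>-<short sha>`. The sketch below reproduces that shape, assuming the action's default 7-character short SHA and a branch name that needs no sanitizing; `branch_sha_tag` is a hypothetical helper.

```python
def branch_sha_tag(branch, sha, short_len=7):
    """Build the '<branch>-<short sha>' image tag pushed by the build job."""
    if len(sha) < short_len:
        raise ValueError("commit SHA is shorter than the requested short length")
    return f"{branch}-{sha[:short_len]}"

# Example commit SHA (made up for illustration).
print(branch_sha_tag("main", "0123456789abcdef0123456789abcdef01234567"))
```

Deriving the tag in one place and reusing it for both staging and production reduces the chance of deploying a tag that does not exist in the registry.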

---
## Summary

✅ **Completed**:
1. Created optimized multi-stage Dockerfile
2. Generated Kubernetes deployment manifests
3. Configured auto-scaling (HPA and VPA)
4. Set up Ingress with SSL and rate limiting
5. Created Helm chart structure
6. Designed CI/CD pipeline with GitHub Actions

📁 **Generated Files**:
- `Dockerfile.production`: Optimized container image
- `docker-compose.yml`: Local development stack
- `k8s/`: Kubernetes manifests
- `helm/`: Helm chart metadata (`Chart.yaml`) and default values (`values.yaml`); resource templates are not generated here
- `.github/workflows/ci-cd.yml`: CI/CD pipeline
- `deploy.sh`: Production deployment script

🎯 **Production-Ready Features**:
- Zero-downtime deployments
- Auto-scaling based on load
- Health checks and monitoring
- SSL termination and security
- Multi-environment support

➡️ **Next**: In `03-Performance_and_Cost.ipynb`, we'll cover:
- Performance optimization strategies
- Cost optimization techniques
- SLI/SLO definition and monitoring
- Resource allocation optimization

---
## Exercises

1. **Modify Helm values**: Change replica count and resource limits
2. **Add environment**: Create a development namespace with different configs
3. **Security hardening**: Add network policies and pod security standards
4. **Multi-region**: Modify manifests for cross-region deployment