# Week 7: Production Deployment

## Learning Objectives
- Learn best practices for deploying LLMs in production
- Understand model serving, API design, and load balancing
- Implement monitoring, logging, and error handling
- Explore scaling strategies and cost optimization

## Table of Contents
1. [Introduction to Production Deployment](#introduction)
2. [Model Serving](#model-serving)
3. [API Design & Load Balancing](#api-design)
4. [Monitoring & Logging](#monitoring-logging)
5. [Error Handling & Reliability](#error-handling)
6. [Scaling Strategies](#scaling)
7. [Cost Optimization](#cost-optimization)
8. [Hands-on Project](#hands-on-project)

## 1. Introduction to Production Deployment <a id='introduction'></a>

### What is Production Deployment?
Production deployment is the process of making your LLM-based application available to real users. This involves moving from a development environment to a scalable, reliable, and secure infrastructure.

- Why production deployment is different from prototyping
- Key challenges: scalability, reliability, security, cost
- Overview of deployment options: cloud, on-premises, hybrid

## 2. Model Serving <a id='model-serving'></a>

### Model Serving: Approaches and Tools
- What is model serving?
- Batch vs. real-time inference
- Serving options: general-purpose web frameworks (FastAPI, Flask), dedicated model servers (TorchServe, Triton Inference Server), and managed services (Hugging Face Inference Endpoints)
- Containerization: Docker basics for ML models
- Example: Serving a Hugging Face model with FastAPI

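```python
# Example: Serve a Hugging Face model with FastAPI
from fastapi import Body, FastAPI
from transformers import pipeline

app = FastAPI()
# Load the model once at startup, not on every request.
# Without a model argument the pipeline uses a default checkpoint; pin one in production.
classifier = pipeline("sentiment-analysis")

@app.post("/predict")
def predict(text: str = Body(..., embed=True)):
    # Expects a JSON body of the form {"text": "..."} instead of a query parameter.
    return classifier(text)
```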

## 3. API Design & Load Balancing <a id='api-design'></a>

### API Design & Load Balancing
- REST vs. gRPC for ML APIs
- Designing robust endpoints (input validation, error handling)
- Load balancing strategies: round-robin, least connections, cloud-native solutions (AWS ELB, Azure Load Balancer)
- Example: API versioning and health checks

```python
# Example: Add a health check endpoint to FastAPI
@app.get("/health")
def health():
    return {"status": "ok"}
```
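
The round-robin and least-connections strategies listed above can be sketched in plain Python. This is an illustrative toy, not how Nginx or a cloud load balancer is implemented; the class names and backend addresses are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # min() returns the first backend on ties (dicts keep insertion order).
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


# Hypothetical backend addresses, for illustration only.
backends = ["app-1:8000", "app-2:8000", "app-3:8000"]

rr = RoundRobinBalancer(backends)
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(backends)
first = lc.acquire()    # all backends idle, so the first one is chosen
second = lc.acquire()   # app-1 is busy, so app-2 is chosen
lc.release(first)       # app-1 finishes its request
third = lc.acquire()    # app-1 and app-3 are both idle; app-1 comes first
print(rr_order, first, second, third)
```

Round-robin ignores how long each request takes, which matters for LLM workloads where generation time varies widely; least-connections adapts to that at the cost of tracking in-flight requests.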

## 4. Monitoring & Logging <a id='monitoring-logging'></a>

### Monitoring & Logging
- Importance of monitoring in production
- Metrics to track: latency, throughput, error rate, resource usage
- Logging best practices (structured logs, log aggregation)
- Tools: Prometheus, Grafana, ELK stack, cloud monitoring solutions
- Example: Adding logging to FastAPI

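```python
# Example: Log each prediction request (reuses `app` and `classifier` from the serving example)
import logging
from fastapi import Body

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict-logged")
def predict_logged(text: str = Body(..., embed=True)):
    # Log the input size, not the raw text, which may contain user data.
    logger.info("Prediction requested (%d characters)", len(text))
    return classifier(text)
```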
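
To make the metrics listed above concrete, here is a minimal in-memory sketch that records latency and error rate around a model call. It uses only the standard library; `RequestMetrics` and `fake_model` are hypothetical names, and a production service would more likely export these values to Prometheus than keep them in a list.

```python
import math
import time


class RequestMetrics:
    """Collects per-request latency and error counts in memory."""

    def __init__(self):
        self.latencies = []   # seconds, one entry per request
        self.errors = 0

    def observe(self, func, *args, **kwargs):
        """Run func, recording how long it took and whether it raised."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

    def error_rate(self):
        total = len(self.latencies)
        return self.errors / total if total else 0.0

    def percentile(self, p):
        """Nearest-rank percentile of recorded latencies (p between 0 and 100)."""
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


def fake_model(text):
    # Stand-in for classifier(text); raises on empty input.
    if not text:
        raise ValueError("empty input")
    return [{"label": "POSITIVE", "score": 0.99}]


metrics = RequestMetrics()
for text in ["great product", "", "works fine", "love it"]:
    try:
        metrics.observe(fake_model, text)
    except ValueError:
        pass  # already counted as an error by observe()

print(f"requests={len(metrics.latencies)} error_rate={metrics.error_rate():.2f}")
print(f"p95 latency: {metrics.percentile(95) * 1000:.3f} ms")
```

Tail percentiles such as p95 are usually more informative than the mean for LLM endpoints, because a few long generations can dominate user-perceived latency.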

## 5. Error Handling & Reliability <a id='error-handling'></a>

### Error Handling & Reliability
- Common failure modes in ML systems
- Graceful degradation and fallback strategies
- Circuit breakers, retries, and timeouts
- Example: Handling exceptions in FastAPI

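```python
# Example: Return a clean HTTP 500 when inference fails (reuses `app` and `classifier`)
import logging
from fastapi import Body, HTTPException

@app.post("/predict-safe")
def predict_safe(text: str = Body(..., embed=True)):
    try:
        return classifier(text)
    except Exception:
        # logging.exception records the full traceback; the client sees only a generic message.
        logging.exception("Error during prediction")
        raise HTTPException(status_code=500, detail="Model inference failed.")
```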
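
The retry and circuit-breaker patterns mentioned above can be sketched with the standard library alone. This is a simplified, single-threaded illustration under assumed defaults (three failures to open, a fixed cool-down); `CircuitBreaker`, `CircuitOpenError`, and `retry` are hypothetical names, not part of FastAPI or any library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In an endpoint, a call such as `breaker.call(classifier, text)` would fail fast while the model backend is unhealthy, and the handler could catch `CircuitOpenError` to return a fallback response or an HTTP 503.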

## 6. Scaling Strategies <a id='scaling'></a>

### Scaling Strategies
- Vertical vs. horizontal scaling
- Auto-scaling in the cloud (Kubernetes, AWS ECS, Azure AKS)
- Model sharding and multi-model serving
- Caching for performance (Redis, CDN)
- Example: Docker Compose for scaling services

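```yaml
# Example: Docker Compose YAML for scaling FastAPI service
# Under Docker Swarm (docker stack deploy), the published port is load-balanced
# across replicas. With plain `docker compose up`, several replicas cannot share
# one fixed host port, so put a reverse proxy in front instead.
version: '3'
services:
  app:
    build: .
    ports:
      - "8000:8000"
    deploy:
      replicas: 3
```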
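
Caching, listed above, avoids paying for inference twice on identical inputs. The sketch below is a small in-process cache with expiry and least-recently-used eviction; `TTLCache` and `cached_predict` are hypothetical names, and a shared store such as Redis would replace it once several replicas need to see the same entries.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-process LRU cache whose entries expire after ttl seconds."""

    def __init__(self, max_size=1024, ttl=300.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]      # entry has expired
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_size=2, ttl=60.0)
stats = {"model_calls": 0}


def cached_predict(text):
    """Return a cached result when available; otherwise call the (fake) model."""
    result = cache.get(text)
    if result is None:
        stats["model_calls"] += 1
        result = {"label": "POSITIVE", "text": text}  # stand-in for classifier(text)
        cache.set(text, result)
    return result


cached_predict("hello")
cached_predict("hello")  # second call is served from the cache
print(stats)
```

Exact-match caching only helps when requests repeat verbatim, which is common for classification or templated prompts and much rarer for free-form generation.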

## 7. Cost Optimization <a id='cost-optimization'></a>

### Cost Optimization
- Monitoring and controlling cloud costs
- Model quantization and distillation for cheaper inference
- Spot instances and serverless options
- Example: Using ONNX for efficient inference

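```python
# Example: Export a Hugging Face model to ONNX for optimized inference
from transformers import AutoModelForSequenceClassification
import torch

# torchscript=True makes the model return plain tuples, which tracing-based export expects.
# Note: the base checkpoint has an untrained classification head; export a fine-tuned model in practice.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", torchscript=True)
model.eval()  # disable dropout before export
dummy_input = torch.zeros(1, 8, dtype=torch.long)  # (batch, sequence) of token ids
torch.onnx.export(
    model, (dummy_input,), "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    # Let batch size and sequence length vary at inference time.
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}, "logits": {0: "batch"}},
)
```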
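
A rough way to see why quantization lowers cost is to estimate weight memory as parameter count times bytes per parameter. The sketch below does that arithmetic; the figure of about 66 million parameters for DistilBERT is approximate, and the estimate ignores activations and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params, dtype):
    """Approximate weight storage in MiB (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 2


distilbert_params = 66_000_000  # approximate parameter count, for illustration
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_mib(distilbert_params, dtype):.1f} MiB")
```

Going from fp32 to int8 cuts weight memory by a factor of four, which can allow a smaller instance type or more replicas per machine; the accuracy impact has to be measured per model.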

## 8. Hands-on Project <a id='hands-on-project'></a>

### Hands-on Project: Open-Source Production Deployment of a Scalable LLM API for Marketing, Sales, and Finance

This project guides you through a real-world, open-source style deployment of an LLM-powered API for business use cases. You'll use widely adopted open-source tools and best practices for reliability, scalability, and maintainability.

#### Step 1: Design Your API
- Define endpoints for:
  - `/generate-marketing`: Generate product descriptions, ad copy, etc.
  - `/sales-support`: Lead qualification, automated Q&A
  - `/financial-analysis`: Summarize reports, extract trends
- Specify input/output schemas using OpenAPI (FastAPI auto-generates docs)
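
One possible shape for the input/output schemas is a pair of Pydantic models, which FastAPI can use both for validation and for the generated OpenAPI docs. The field names and limits below are assumptions chosen for illustration, not a required contract.

```python
from pydantic import BaseModel, Field, ValidationError


class MarketingRequest(BaseModel):
    # Hypothetical fields and limits; adjust to your use case.
    prompt: str = Field(..., min_length=1, max_length=2000)
    max_new_tokens: int = Field(64, ge=1, le=512)


class MarketingResponse(BaseModel):
    generated_text: str


# Invalid input is rejected before it ever reaches the model.
try:
    MarketingRequest(prompt="", max_new_tokens=64)
except ValidationError as exc:
    print(f"rejected: {len(exc.errors())} validation error(s)")

request = MarketingRequest(prompt="Describe a reusable water bottle")
print(request.max_new_tokens)
```

Declaring a model like `MarketingRequest` as the endpoint's parameter type makes FastAPI parse the JSON body into it and return a 422 response automatically when validation fails.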

#### Step 2: Build the Service
- Use FastAPI for the web server and API documentation
- Use Hugging Face Transformers for LLM inference
- Structure your codebase with clear separation (routers, services, utils)

```python
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
marketing_gen = pipeline("text-generation", model="distilgpt2")

@app.post("/generate-marketing")
def generate_marketing(prompt: str):
    return marketing_gen(prompt)
# Repeat for /sales-support and /financial-analysis
```

#### Step 3: Add Monitoring, Logging, and Error Handling
- Integrate Prometheus for metrics (use [prometheus-fastapi-instrumentator](https://github.com/trallnag/prometheus-fastapi-instrumentator))
- Use Python's `logging` module for structured logs
- Add exception handlers for robust error reporting
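
Structured logs are easiest to aggregate when each record is a single JSON object. The sketch below builds that on top of the standard `logging` module; `JsonFormatter`, the logger name, and the `endpoint` and `latency_ms` fields are hypothetical choices for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("endpoint", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


logger = logging.getLogger("llm_api")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.info(
    "prediction served",
    extra={"endpoint": "/generate-marketing", "latency_ms": 41.7},
)
```

Because every line is valid JSON, a log aggregator can filter on fields such as `endpoint` or `level` without fragile text parsing.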

#### Step 4: Containerize with Docker
- Write a `Dockerfile` for reproducible builds
- Consider multi-stage builds for smaller images (the example below is a single-stage build for simplicity)

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

#### Step 5: Orchestrate with Docker Compose or Kubernetes
- Use `docker-compose.yml` to run multiple replicas and a reverse proxy (e.g., Traefik or Nginx)
- For advanced scaling, use Kubernetes (Helm charts, Horizontal Pod Autoscaler)

#### Step 6: Load Balancing and Auto-Scaling
- Use Nginx/Traefik for HTTP load balancing in Docker Compose
- In Kubernetes, configure a Service and HPA for auto-scaling
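
The core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. The sketch below reproduces only that arithmetic, with a tolerance band and replica bounds; it leaves out stabilization windows and other behavior of the real controller, and the function name and defaults are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas at 90% average CPU with a 60% target: scale out.
print(desired_replicas(3, current_metric=90, target_metric=60))
# 4 replicas at 30% average CPU with a 60% target: scale in.
print(desired_replicas(4, current_metric=30, target_metric=60))
```

For LLM services, CPU utilization is often a poor signal; a custom metric such as queue depth or in-flight requests per pod may track load more closely.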

#### Step 7: Cost Optimization
- Quantize models with Hugging Face Optimum or ONNX Runtime
- Use batch inference endpoints for high-throughput scenarios
- Monitor resource usage with Grafana dashboards
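
Batch inference amortizes per-call overhead by sending several inputs to the model at once. A minimal sketch of the chunking logic follows; `batched`, `run_in_batches`, and `fake_model` are hypothetical helpers, with `fake_model` standing in for a pipeline called with a list of inputs.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(model_fn, texts, batch_size=8):
    """Call model_fn once per batch and flatten the results in input order."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(model_fn(batch))
    return results


def fake_model(batch):
    # Stand-in for a model call that returns one output per input.
    return [text.upper() for text in batch]


texts = [f"report {i}" for i in range(10)]
outputs = run_in_batches(fake_model, texts, batch_size=4)
print(len(outputs), outputs[:2])
```

Larger batches raise throughput but also raise memory use and the latency of each individual request, so the batch size is worth tuning against the hardware.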

#### Step 8: Open-Source Best Practices
- Add a `README.md` with API usage examples
- Use `.env` files for secrets/configuration (never hardcode credentials, and keep `.env` out of version control)
- Write unit/integration tests (e.g., with pytest and httpx)
- Set up CI/CD (e.g., GitHub Actions) for automated builds and tests
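
Configuration from the environment can be read once at startup and validated early. The sketch below assumes the variables have already been loaded into the process environment (for example by the container runtime or a loader such as python-dotenv); the variable names `LLM_API_KEY`, `LLM_MODEL_NAME`, and `LLM_MAX_NEW_TOKENS` are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_name: str
    max_new_tokens: int
    api_key: str


def load_settings(env=os.environ):
    """Build Settings from environment variables, failing fast on missing secrets."""
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set")
    return Settings(
        model_name=env.get("LLM_MODEL_NAME", "distilgpt2"),
        max_new_tokens=int(env.get("LLM_MAX_NEW_TOKENS", "64")),
        api_key=api_key,
    )


# Passing a plain dict keeps the example (and unit tests) independent of the real environment.
settings = load_settings({"LLM_API_KEY": "example-key", "LLM_MAX_NEW_TOKENS": "128"})
print(settings.model_name, settings.max_new_tokens)
```

Accepting the environment as a parameter makes the function easy to cover with pytest without touching real secrets.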

#### Step 9: Document and Share
- Publish your code on GitHub with an open-source license
- Include deployment instructions and sample API requests
- Encourage community contributions and feedback

---

**References & Further Reading:**
- [FastAPI Production Deployment Guide](https://fastapi.tiangolo.com/deployment/)
- [Dockerizing FastAPI](https://testdriven.io/blog/fastapi-docker-traefik/)
- [Prometheus FastAPI Instrumentator](https://github.com/trallnag/prometheus-fastapi-instrumentator)
- [Hugging Face Optimum](https://huggingface.co/docs/optimum/intel/usage_guides/quantization)
- [Kubernetes for ML](https://github.com/kubeflow/kubeflow)
- [Open Source LLM API Example](https://github.com/huggingface/text-generation-inference)