# Tutorial 16: Serving and Prediction Pipelines

## Module 6: Deployment and Serving

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Design scalable serving systems** - Build high-throughput prediction services
2. **Implement model versioning** - Manage multiple model versions safely
3. **Build feature stores** - Create and use feature stores for serving
4. **Create prediction pipelines** - End-to-end serving with pre/post processing
5. **Handle batch and streaming** - Support different prediction patterns

---

## Table of Contents

1. [Introduction to Model Serving](#1-introduction)
2. [Serving Architecture Patterns](#2-architecture)
3. [Model Versioning and Registry](#3-versioning)
4. [Feature Stores](#4-feature-stores)
5. [Prediction Pipelines](#5-pipelines)
6. [Scaling and Load Balancing](#6-scaling)
7. [Hands-on Exercises](#7-exercises)
8. [Summary](#8-summary)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import pickle
import json
import time
from datetime import datetime
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, field
from abc import ABC, abstractmethod
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [12, 6]

print("Libraries imported successfully!")

---

## 1. Introduction to Model Serving <a name="1-introduction"></a>

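Model serving exposes a trained ML model behind a service interface so that applications can send it prediction requests. A serving system is usually judged against operational targets such as the ones below.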

### Requirements

| Requirement | Example Target |
|-------------|----------------|
| Latency | < 100 ms at p99 |
| Availability | 99.9% or higher |
| Throughput | 10,000+ queries per second (QPS) |
| Scalability | Auto-scaling with load |
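
As a rough illustration of how a p99 latency target like the one in the table can be checked, the sketch below summarizes a list of recorded latencies. The `latency_summary` helper and the simulated log-normal latencies are assumptions made for this example; in practice the samples would come from request logs.

```python
import numpy as np

def latency_summary(latencies_ms, p99_target_ms=100.0):
    """Summarize request latencies (ms) and compare p99 to a target."""
    arr = np.asarray(latencies_ms, dtype=float)
    p50, p95, p99 = np.percentile(arr, [50, 95, 99])
    return {
        'p50_ms': float(p50),
        'p95_ms': float(p95),
        'p99_ms': float(p99),
        'meets_target': bool(p99 < p99_target_ms),
    }

# Simulated latencies (hypothetical data): mostly fast, with a slow tail
rng = np.random.default_rng(0)
simulated = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)

summary = latency_summary(simulated)
print(summary)
```

Tail percentiles are reported instead of the mean because a small share of slow requests can miss the target while the average still looks healthy.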

In [None]:
# Create sample data
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                          n_classes=3, n_clusters_per_class=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train models
model_v1 = RandomForestClassifier(n_estimators=50, random_state=42)
model_v1.fit(X_train_scaled, y_train)

model_v2 = RandomForestClassifier(n_estimators=100, random_state=42)
model_v2.fit(X_train_scaled, y_train)

print(f"Data: {X_train.shape[0]} train, {X_test.shape[0]} test")
print(f"Model v1 acc: {accuracy_score(y_test, model_v1.predict(X_test_scaled)):.4f}")
print(f"Model v2 acc: {accuracy_score(y_test, model_v2.predict(X_test_scaled)):.4f}")

---

## 2. Serving Architecture Patterns <a name="2-architecture"></a>

| Pattern | Latency | Scalability | Use Case |
|---------|---------|-------------|----------|
| Monolithic | Low | Limited | Simple apps |
| Microservices | Medium | High | Complex systems |
| Serverless | Variable | Auto | Bursty traffic |

In [None]:
class PredictionService(ABC):
    @abstractmethod
    def predict(self, features: np.ndarray) -> Dict:
        pass
    
    @abstractmethod
    def health_check(self) -> Dict:
        pass


class MonolithicService(PredictionService):
    def __init__(self, model, preprocessor=None):
        self.model = model
        self.preprocessor = preprocessor
        self.requests = 0
        self.latencies = []
    
    def predict(self, features: np.ndarray) -> Dict:
        # perf_counter is a monotonic, high-resolution clock suited to timing short calls
        start = time.perf_counter()
        self.requests += 1
        
        # The service handles one sample per request, so reshape to (1, n_features)
        features = features.reshape(1, -1)
        if self.preprocessor:
            features = self.preprocessor.transform(features)
        
        pred = self.model.predict(features)[0]
        proba = self.model.predict_proba(features)[0] if hasattr(self.model, 'predict_proba') else None
        
        latency = time.perf_counter() - start
        self.latencies.append(latency)
        
        return {
            'prediction': int(pred),
            'probability': proba.tolist() if proba is not None else None,
            'latency_ms': latency * 1000
        }
    
    def health_check(self) -> Dict:
        return {'status': 'healthy', 'requests': self.requests}


class MicroserviceGateway:
    def __init__(self):
        self.services: Dict[str, PredictionService] = {}
        self.routes: Dict[str, str] = {}
    
    def register(self, name: str, service: PredictionService):
        self.services[name] = service
    
    def set_route(self, endpoint: str, service_name: str):
        self.routes[endpoint] = service_name
    
    def route(self, endpoint: str, features: np.ndarray) -> Dict:
        if endpoint not in self.routes:
            return {'error': 'Route not found'}
        service_name = self.routes[endpoint]
        if service_name not in self.services:
            return {'error': 'Service not found'}
        result = self.services[service_name].predict(features)
        result['service'] = service_name
        return result


# Demo
print("="*50)
print("Service Architecture Demo")
print("="*50)

# Monolithic
mono = MonolithicService(model_v1, scaler)
for i in range(100):
    mono.predict(X_test[i])
print(f"Monolithic: {mono.requests} requests, avg latency: {np.mean(mono.latencies)*1000:.3f}ms")

# Microservices
gateway = MicroserviceGateway()
gateway.register('v1', MonolithicService(model_v1, scaler))
gateway.register('v2', MonolithicService(model_v2, scaler))
gateway.set_route('/api/v1/predict', 'v1')
gateway.set_route('/api/v2/predict', 'v2')

r1 = gateway.route('/api/v1/predict', X_test[0])
r2 = gateway.route('/api/v2/predict', X_test[0])
print(f"Gateway v1: pred={r1['prediction']}, Gateway v2: pred={r2['prediction']}")
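
The gateway above routes prediction requests but does not surface the `health_check` results of the services behind it. A minimal sketch of aggregating them follows, assuming each service returns a dict with a `status` key as `MonolithicService` does. `StubService` and `aggregate_health` are hypothetical names introduced only for this example.

```python
from typing import Dict

class StubService:
    """Stand-in for a PredictionService; only health_check is modeled."""
    def __init__(self, healthy: bool = True):
        self.healthy = healthy

    def health_check(self) -> Dict:
        return {'status': 'healthy' if self.healthy else 'unhealthy'}

def aggregate_health(services: Dict[str, StubService]) -> Dict:
    """Collect per-service health and report an overall status."""
    report = {}
    for name, service in services.items():
        try:
            report[name] = service.health_check()
        except Exception as exc:  # a crashing check counts as unhealthy
            report[name] = {'status': 'unhealthy', 'error': str(exc)}
    all_ok = all(r.get('status') == 'healthy' for r in report.values())
    return {'status': 'healthy' if all_ok else 'degraded', 'services': report}

health = aggregate_health({'v1': StubService(True), 'v2': StubService(False)})
print(health['status'])
```

Reporting `degraded` instead of failing outright lets the gateway keep serving routes whose backing service is still healthy.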

---

## 3. Model Versioning and Registry <a name="3-versioning"></a>

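Model versioning gives every trained model an identifier and a lifecycle status (staging, production, archived), so a new version can be promoted to production and the previous one restored if it misbehaves.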

In [None]:
@dataclass
class ModelVersion:
    version: str
    model: Any
    created_at: datetime
    metrics: Dict = field(default_factory=dict)
    status: str = 'staging'


class ModelRegistry:
    def __init__(self, name: str):
        self.name = name
        self.versions: Dict[str, ModelVersion] = {}
        self.production: Optional[str] = None
        self.history: List[Dict] = []
    
    def register(self, version: str, model: Any, metrics: Optional[Dict] = None):
        mv = ModelVersion(version, model, datetime.now(), metrics or {})
        self.versions[version] = mv
        self.history.append({'action': 'register', 'version': version, 'time': datetime.now().isoformat()})
        print(f"Registered: {version}")
    
    def promote(self, version: str):
        if version not in self.versions:
            raise ValueError(f"Version {version} not found")
        if self.production:
            self.versions[self.production].status = 'archived'
        self.versions[version].status = 'production'
        self.production = version
        self.history.append({'action': 'promote', 'version': version, 'time': datetime.now().isoformat()})
        print(f"Promoted: {version}")
    
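    def rollback(self):
        # Re-promote the version that was in production before the current one.
        # The rollback is itself logged as a promotion, so calling rollback()
        # twice in a row returns to the version that was just rolled back.
        promotions = [h for h in self.history if h['action'] == 'promote']
        if len(promotions) < 2:
            return None
        prev = promotions[-2]['version']
        self.promote(prev)
        return prev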
    
    def get_production(self):
        return self.versions[self.production].model if self.production else None
    
    def list_versions(self) -> pd.DataFrame:
        data = [{'version': v, 'status': mv.status, 'accuracy': mv.metrics.get('accuracy', 'N/A')}
                for v, mv in self.versions.items()]
        return pd.DataFrame(data)


# Demo registry
print("="*50)
print("Model Registry Demo")
print("="*50)

registry = ModelRegistry('classifier')
registry.register('1.0.0', model_v1, {'accuracy': accuracy_score(y_test, model_v1.predict(X_test_scaled))})
registry.register('2.0.0', model_v2, {'accuracy': accuracy_score(y_test, model_v2.predict(X_test_scaled))})

registry.promote('1.0.0')
registry.promote('2.0.0')
registry.rollback()

print("\nVersions:")
print(registry.list_versions())
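
The registry above keeps model objects in memory only. One way to persist a version is sketched here with the standard-library `pickle` and `json` modules. The `save_version` and `load_version` helpers, the file names, and the directory layout are assumptions for illustration, and a plain dict stands in for a fitted model.

```python
import json
import pickle
import tempfile
from pathlib import Path

def save_version(directory, version, model, metrics):
    """Write the model as pickle and its metadata as JSON."""
    path = Path(directory) / version
    path.mkdir(parents=True, exist_ok=True)
    with open(path / 'model.pkl', 'wb') as f:
        pickle.dump(model, f)
    with open(path / 'meta.json', 'w') as f:
        json.dump({'version': version, 'metrics': metrics}, f)
    return path

def load_version(directory, version):
    """Load a model and its metadata back from disk."""
    path = Path(directory) / version
    with open(path / 'model.pkl', 'rb') as f:
        model = pickle.load(f)  # only unpickle files you trust
    with open(path / 'meta.json') as f:
        meta = json.load(f)
    return model, meta

# Stand-in for a fitted model: any picklable object works the same way.
# The accuracy value is made up for the example.
toy_model = {'weights': [0.2, -0.1], 'bias': 0.5}

with tempfile.TemporaryDirectory() as tmp:
    save_version(tmp, '1.0.0', toy_model, {'accuracy': 0.9})
    restored_model, restored_meta = load_version(tmp, '1.0.0')

print(restored_meta)
```

Keeping the metadata in a separate JSON file means versions can be listed and compared without unpickling any model.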

---

## 4. Feature Stores <a name="4-feature-stores"></a>

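A feature store holds feature definitions and values in one place. It serves the latest values with low latency for inference (the online store) and keeps a history for training (the offline store), so both paths read the same features.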

In [None]:
class FeatureStore:
    def __init__(self, name: str):
        self.name = name
        self.features: Dict[str, Dict] = {}  # feature definitions
        self.online: Dict[str, Dict] = {}    # entity -> features
        self.offline: List[Dict] = []        # historical data
    
    def register_feature(self, name: str, dtype: str, description: str = ''):
        self.features[name] = {'dtype': dtype, 'description': description}
    
    def ingest(self, entity_id: str, features: Dict):
        if entity_id not in self.online:
            self.online[entity_id] = {}
        self.online[entity_id].update(features)
        self.online[entity_id]['_updated'] = datetime.now().isoformat()
        self.offline.append({'entity_id': entity_id, 'timestamp': datetime.now().isoformat(), **features})
    
    def get_online(self, entity_id: str, feature_names: Optional[List[str]] = None) -> Dict:
        if entity_id not in self.online:
            return {'error': 'Entity not found'}
        # Copy so callers cannot mutate the store's internal state
        features = dict(self.online[entity_id])
        if feature_names:
            # Keep the requested features plus metadata keys such as '_updated'
            features = {k: v for k, v in features.items() if k in feature_names or k.startswith('_')}
        return features
    
    def get_offline(self) -> pd.DataFrame:
        return pd.DataFrame(self.offline)


# Demo feature store
print("="*50)
print("Feature Store Demo")
print("="*50)

fs = FeatureStore('user_features')
fs.register_feature('age', 'float', 'User age')
fs.register_feature('income', 'float', 'Annual income')
fs.register_feature('purchases', 'int', 'Purchase count')

users = [
    ('u001', {'age': 25, 'income': 50000, 'purchases': 10}),
    ('u002', {'age': 35, 'income': 80000, 'purchases': 25}),
    ('u003', {'age': 45, 'income': 120000, 'purchases': 50}),
]

for uid, feats in users:
    fs.ingest(uid, feats)

print("Online features for u001:")
print(fs.get_online('u001'))

print("\nOffline data:")
print(fs.get_offline())
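
The offline store above is an append-only log of timestamped rows. A common use of such a log is a point-in-time lookup, which returns the feature values an entity had at a given moment so that training data does not leak later information. The sketch below assumes rows shaped like those in `FeatureStore.offline`. The `features_as_of` helper and the sample log are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional

def features_as_of(rows: List[Dict], entity_id: str, as_of: datetime) -> Optional[Dict]:
    """Return the latest row for entity_id with timestamp <= as_of."""
    candidates = [
        r for r in rows
        if r['entity_id'] == entity_id
        and datetime.fromisoformat(r['timestamp']) <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: datetime.fromisoformat(r['timestamp']))

# Hypothetical offline log in the same shape as FeatureStore.offline
log = [
    {'entity_id': 'u001', 'timestamp': '2024-01-01T00:00:00', 'purchases': 10},
    {'entity_id': 'u001', 'timestamp': '2024-02-01T00:00:00', 'purchases': 14},
    {'entity_id': 'u002', 'timestamp': '2024-01-15T00:00:00', 'purchases': 25},
]

row = features_as_of(log, 'u001', datetime(2024, 1, 20))
print(row)
```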

---

## 5. Prediction Pipelines <a name="5-pipelines"></a>

A serving pipeline chains preprocessing, prediction, and postprocessing stages, so a raw input goes in and a labeled, client-ready response comes out.

In [None]:
class PipelineStage(ABC):
    @abstractmethod
    def process(self, data: Any) -> Any:
        pass
    
    @property
    @abstractmethod
    def name(self) -> str:
        pass


class PreprocessingStage(PipelineStage):
    def __init__(self, scaler):
        self.scaler = scaler
    
    @property
    def name(self):
        return 'preprocessing'
    
    def process(self, data):
        if data.ndim == 1:
            data = data.reshape(1, -1)
        return self.scaler.transform(data)


class PredictionStage(PipelineStage):
    def __init__(self, model):
        self.model = model
    
    @property
    def name(self):
        return 'prediction'
    
    def process(self, data):
        pred = self.model.predict(data)[0]
        proba = self.model.predict_proba(data)[0] if hasattr(self.model, 'predict_proba') else None
        return {'prediction': int(pred), 'probabilities': proba}


class PostprocessingStage(PipelineStage):
    def __init__(self, class_names: Dict[int, str]):
        self.class_names = class_names
    
    @property
    def name(self):
        return 'postprocessing'
    
    def process(self, data):
        pred = data['prediction']
        proba = data['probabilities']
        return {
            'class_id': pred,
            'class_name': self.class_names.get(pred, 'unknown'),
            'confidence': float(max(proba)) if proba is not None else None
        }


class ServingPipeline:
    def __init__(self):
        self.stages: List[PipelineStage] = []
        self.timings: Dict[str, List] = defaultdict(list)
    
    def add_stage(self, stage: PipelineStage):
        self.stages.append(stage)
    
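    def run(self, input_data):
        # perf_counter is a monotonic, high-resolution clock suited to timing short stages
        total_start = time.perf_counter()
        data = input_data
        times = {}
        
        for stage in self.stages:
            start = time.perf_counter()
            data = stage.process(data)
            elapsed = time.perf_counter() - start
            times[stage.name] = elapsed
            self.timings[stage.name].append(elapsed)
        
        # Attach per-stage timings (seconds) and total latency (ms) to dict results
        if isinstance(data, dict):
            data['_timing'] = times
            data['_total_ms'] = (time.perf_counter() - total_start) * 1000
        return data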
    
    def get_stats(self):
        return pd.DataFrame([{
            'stage': s,
            'avg_ms': np.mean(t) * 1000,
            'p99_ms': np.percentile(t, 99) * 1000
        } for s, t in self.timings.items()])


# Demo pipeline
print("="*50)
print("Serving Pipeline Demo")
print("="*50)

pipeline = ServingPipeline()
pipeline.add_stage(PreprocessingStage(scaler))
pipeline.add_stage(PredictionStage(model_v2))
pipeline.add_stage(PostprocessingStage({0: 'Low', 1: 'Medium', 2: 'High'}))

for i in range(100):
    pipeline.run(X_test[i])

result = pipeline.run(X_test[0])
print("Example output:")
print(json.dumps(result, indent=2, default=str))

print("\nPipeline Stats:")
print(pipeline.get_stats())
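
The learning objectives mention batch prediction, while `PredictionStage` above returns the prediction for a single row. A minimal sketch of scoring many rows in fixed-size chunks follows. `ToyModel` is a hypothetical stand-in with a scikit-learn style `predict()`, and `predict_in_batches` is a helper name introduced for this example.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in exposing a scikit-learn style predict()."""
    def __init__(self, weights: np.ndarray):
        self.weights = weights  # shape: (n_features, n_classes)

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Predicted class is the column with the highest linear score
        return np.argmax(X @ self.weights, axis=1)

def predict_in_batches(model, X: np.ndarray, batch_size: int = 256) -> np.ndarray:
    """Predict all rows of X in fixed-size chunks to bound memory use."""
    outputs = []
    for start in range(0, len(X), batch_size):
        outputs.append(model.predict(X[start:start + batch_size]))
    return np.concatenate(outputs) if outputs else np.empty(0, dtype=int)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
toy = ToyModel(rng.normal(size=(4, 3)))

batched = predict_in_batches(toy, X_demo, batch_size=128)
print(batched.shape)
```

Chunking keeps memory use bounded for large inputs while still amortizing per-call overhead across many rows, which single-row requests cannot do.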

In [None]:
# Visualize pipeline timing
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

stats = pipeline.get_stats()
axes[0].barh(stats['stage'], stats['avg_ms'], color='steelblue')
axes[0].set_xlabel('Avg Time (ms)')
axes[0].set_title('Pipeline Stage Timing')

# Total latency distribution: sum the per-stage times of each run
total_times = np.sum(list(pipeline.timings.values()), axis=0) * 1000
axes[1].hist(total_times, bins=30, color='coral', edgecolor='black', alpha=0.7)
axes[1].axvline(np.mean(total_times), color='red', linestyle='--')
axes[1].set_xlabel('Latency (ms)')
axes[1].set_title('Total Latency Distribution')

plt.tight_layout()
plt.show()

---

## 6. Scaling and Load Balancing <a name="6-scaling"></a>

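Scaling out runs several identical model servers behind a load balancer, which spreads incoming requests across them so that throughput can grow with the number of servers.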

In [None]:
class LoadBalancer:
    def __init__(self, strategy='round_robin'):
        self.servers: List[PredictionService] = []
        self.strategy = strategy
        self.current = 0
        self.counts = []
    
    def add_server(self, server: PredictionService):
        self.servers.append(server)
        self.counts.append(0)
    
    def get_server(self):
        if not self.servers:
            raise ValueError("No servers")
        
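        if self.strategy == 'round_robin':
            idx = self.current
            self.current = (self.current + 1) % len(self.servers)
        elif self.strategy == 'least_connections':
            # Requests here are synchronous, so counts holds total requests routed
            # per server and serves as a stand-in for open connections
            idx = int(np.argmin(self.counts))
        else:
            idx = int(np.random.randint(len(self.servers)))
        
        return idx, self.servers[idx]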
    
    def route(self, features: np.ndarray) -> Dict:
        idx, server = self.get_server()
        self.counts[idx] += 1
        result = server.predict(features)
        result['server_id'] = idx
        return result
    
    def stats(self) -> pd.DataFrame:
        return pd.DataFrame([{'server': i, 'requests': c} for i, c in enumerate(self.counts)])


# Demo load balancing
print("="*50)
print("Load Balancing Demo")
print("="*50)

lb = LoadBalancer('round_robin')
for i in range(3):
    lb.add_server(MonolithicService(model_v2, scaler))

for i in range(300):
    lb.route(X_test[i % len(X_test)])

print("Load Distribution:")
print(lb.stats())

# Compare strategies
print("\nStrategy Comparison:")
for strategy in ['round_robin', 'least_connections', 'random']:
    lb_test = LoadBalancer(strategy)
    for i in range(3):
        lb_test.add_server(MonolithicService(model_v2, scaler))
    for i in range(300):
        lb_test.route(X_test[i % len(X_test)])
    print(f"  {strategy}: std={np.std(lb_test.counts):.1f}")
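
In the demo above, `least_connections` picks the server with the fewest total requests, because every request completes before the next one arrives. A sketch of a variant that tracks in-flight requests is shown below. The `InFlightBalancer` class and its `acquire`/`release` methods are hypothetical names, and the caller is assumed to call `release` when each request finishes.

```python
from typing import List

class InFlightBalancer:
    """Least-connections selection based on currently open requests."""
    def __init__(self, n_servers: int):
        self.in_flight: List[int] = [0] * n_servers

    def acquire(self) -> int:
        # Pick the server with the fewest open requests (lowest index wins ties)
        idx = min(range(len(self.in_flight)), key=self.in_flight.__getitem__)
        self.in_flight[idx] += 1
        return idx

    def release(self, idx: int) -> None:
        # Call when the request finishes, successfully or not
        self.in_flight[idx] -= 1

balancer = InFlightBalancer(3)
a = balancer.acquire()  # server 0
b = balancer.acquire()  # server 1
c = balancer.acquire()  # server 2
balancer.release(b)     # server 1 finishes first
d = balancer.acquire()  # server 1 again: the only one with no open request
print(a, b, c, d, balancer.in_flight)
```

A balancer that only counts total requests would see all three servers tied at this point and could send the fourth request to a server that is still busy.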

In [None]:
# Visualize load balancing
fig, ax = plt.subplots(figsize=(10, 4))

strategies = ['round_robin', 'least_connections', 'random']
distributions = []

for strategy in strategies:
    lb_test = LoadBalancer(strategy)
    for i in range(3):
        lb_test.add_server(MonolithicService(model_v2, scaler))
    for i in range(300):
        lb_test.route(X_test[i % len(X_test)])
    distributions.append(lb_test.counts)

x = np.arange(3)
width = 0.25

for i, (strategy, dist) in enumerate(zip(strategies, distributions)):
    ax.bar(x + i*width, dist, width, label=strategy)

ax.set_xlabel('Server')
ax.set_ylabel('Requests')
ax.set_title('Load Distribution by Strategy')
ax.set_xticks(x + width)
ax.set_xticklabels(['Server 0', 'Server 1', 'Server 2'])
ax.legend()
plt.show()

---

## 7. Hands-on Exercises <a name="7-exercises"></a>

### Exercise 1: Build a Complete Serving System

In [None]:
# Exercise 1: Complete serving system
print("Exercise 1: Complete Serving System")
print("="*50)

# Create registry (the accuracy values below are illustrative placeholders, not measured)
reg = ModelRegistry('production')
reg.register('v1', model_v1, {'accuracy': 0.85})
reg.register('v2', model_v2, {'accuracy': 0.90})
reg.promote('v2')

# Create pipeline with production model
prod_pipeline = ServingPipeline()
prod_pipeline.add_stage(PreprocessingStage(scaler))
prod_pipeline.add_stage(PredictionStage(reg.get_production()))
prod_pipeline.add_stage(PostprocessingStage({0: 'A', 1: 'B', 2: 'C'}))

# Create load balancer with multiple instances
final_lb = LoadBalancer('round_robin')
for i in range(3):
    final_lb.add_server(MonolithicService(reg.get_production(), scaler))

# Run traffic
for i in range(100):
    final_lb.route(X_test[i % len(X_test)])

print(f"Production model: {reg.production}")
print(f"Load distribution: {final_lb.counts}")
print("\nSample prediction:")
print(prod_pipeline.run(X_test[0]))

### Exercise 2: Feature Store Integration

In [None]:
# Exercise 2: Feature store integration
print("\nExercise 2: Feature Store Integration")
print("="*50)

class FeatureServingPipeline:
    def __init__(self, feature_store, model, feature_names):
        self.fs = feature_store
        self.model = model
        self.feature_names = feature_names
    
    def predict(self, entity_id):
        # Get features
        features = self.fs.get_online(entity_id, self.feature_names)
        if 'error' in features:
            return features
        
        # Build vector in feature_names order; missing features fall back to 0
        vector = np.array([features.get(f, 0) for f in self.feature_names]).reshape(1, -1)
        
        # Predict
        pred = self.model.predict(vector)[0]
        return {'entity_id': entity_id, 'prediction': int(pred), 'features': self.feature_names}


# Simple model for demo
simple_model = LogisticRegression(random_state=42)
demo_X = np.array([[25, 50000, 10], [35, 80000, 25], [45, 120000, 50]])
demo_y = np.array([0, 1, 1])
simple_model.fit(demo_X, demo_y)

fs_pipeline = FeatureServingPipeline(fs, simple_model, ['age', 'income', 'purchases'])

for uid, _ in users:
    result = fs_pipeline.predict(uid)
    print(f"{uid}: prediction={result['prediction']}")

---

## 8. Summary <a name="8-summary"></a>

### Key Takeaways

1. **Serving Architectures**: Choose based on latency, scalability, and complexity needs

2. **Model Versioning**: Essential for safe deployments and rollbacks

3. **Feature Stores**: Provide consistent feature serving for training and inference

4. **Prediction Pipelines**: Modular stages for preprocessing, prediction, postprocessing

5. **Scaling**: Load balancing distributes traffic across servers

### Best Practices

- Use model registry for version management
- Implement health checks for all services
- Monitor latency percentiles (p50, p95, p99)
- Design for horizontal scaling
- Cache frequently accessed features
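
The caching practice above can be sketched as a small time-to-live (TTL) cache placed in front of a feature lookup. `TTLFeatureCache`, its `fetch` callback, and the `backing` dict are hypothetical names for this example. A production system would more likely use an existing cache service.

```python
import time
from typing import Callable, Dict, Tuple

class TTLFeatureCache:
    """Cache feature lookups for ttl_seconds before refetching."""
    def __init__(self, fetch: Callable[[str], Dict], ttl_seconds: float = 60.0):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, entity_id: str) -> Dict:
        now = time.monotonic()
        entry = self._entries.get(entity_id)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch again and remember when it was fetched
        self.misses += 1
        value = self.fetch(entity_id)
        self._entries[entity_id] = (now, value)
        return value

# Hypothetical backing store standing in for a slower feature lookup
backing = {'u001': {'age': 25}, 'u002': {'age': 35}}
cache = TTLFeatureCache(lambda eid: backing[eid], ttl_seconds=60.0)

cache.get('u001')  # miss: fetched from the backing store
cache.get('u001')  # hit: served from the cache
cache.get('u002')  # miss
print(cache.hits, cache.misses)
```

The TTL bounds how stale a served feature can be, so it should be chosen per feature according to how quickly the underlying value changes.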

### Production Checklist

- [ ] Model versioning and registry
- [ ] Feature store for online serving
- [ ] Load balancing across servers
- [ ] Health checks and monitoring
- [ ] Rollback capability
- [ ] Request logging

### Module 6 Complete!

You've learned:
- Tutorial 14: Deployment Strategies
- Tutorial 15: Model Compression
- Tutorial 16: Serving and Prediction Pipelines

Next: Module 7 - Monitoring and Infrastructure