# Tutorial 14: Deployment Strategies

## Module 6: Deployment and Serving

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Compare cloud vs on-device deployment** - Understand trade-offs between centralized and edge deployments
2. **Implement batch and online prediction** - Build systems for different prediction requirements
3. **Design prediction pipelines** - Create end-to-end serving architectures
4. **Apply deployment patterns** - Use shadow, canary, and blue-green deployments
5. **Build production-ready APIs** - Create scalable model serving endpoints

---

## Table of Contents

1. [Introduction to ML Deployment](#1-introduction)
2. [Cloud vs On-Device Deployment](#2-cloud-vs-device)
3. [Batch Prediction Systems](#3-batch-prediction)
4. [Online Prediction Systems](#4-online-prediction)
5. [Deployment Patterns](#5-deployment-patterns)
6. [Building Model Serving APIs](#6-model-serving-apis)
7. [Prediction Pipeline Design](#7-pipeline-design)
8. [Hands-on Exercises](#8-exercises)
9. [Summary](#9-summary)

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pickle
import json
import time
import hashlib
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [12, 6]

print("Libraries imported successfully!")

---

## 1. Introduction to ML Deployment <a name="1-introduction"></a>

A model creates value only once it serves predictions in production; a model that exists only in a Jupyter notebook helps no one. This tutorial covers strategies and patterns for deploying ML systems safely.

### Key Deployment Challenges

| Challenge | Description |
|-----------|-------------|
| Environment Differences | Dev vs prod dependencies |
| Scale | From a single request to millions per second |
| Reliability | High availability requirements |
| Latency | Real-time response needs |
| Updates | Safe model updates/rollbacks |
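
Environment differences are often the first of these challenges to surface. One way to reduce surprises is to ship the model together with a checksum and a record of the interpreter it was built with, and to verify both before loading. The sketch below is a minimal illustration using only the standard library; `package_model` and `load_model` are hypothetical helper names introduced here, and a real artifact would usually record library versions as well.

```python
import hashlib
import pickle
import platform


def package_model(model) -> dict:
    """Serialize a model together with a checksum and build environment."""
    payload = pickle.dumps(model)
    return {
        "payload": payload,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "python_version": platform.python_version(),
    }


def load_model(artifact: dict, strict: bool = True):
    """Verify integrity and interpreter version before unpickling."""
    if hashlib.sha256(artifact["payload"]).hexdigest() != artifact["sha256"]:
        raise ValueError("Artifact checksum mismatch")
    # Compare major.minor only; patch releases are assumed compatible
    current = ".".join(platform.python_version().split(".")[:2])
    built_with = ".".join(artifact["python_version"].split(".")[:2])
    if strict and current != built_with:
        raise RuntimeError(f"Built with Python {built_with}, running {current}")
    return pickle.loads(artifact["payload"])


# A plain dict stands in for a trained model in this sketch
artifact = package_model({"weights": [0.1, 0.2, 0.3]})
restored = load_model(artifact)
print(restored)
```

Only unpickle artifacts from a trusted source: the checksum detects corruption, not a malicious payload.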

---

## 2. Cloud vs On-Device Deployment <a name="2-cloud-vs-device"></a>

The first major decision in deployment is where the model runs.

| Aspect | Cloud | On-Device |
|--------|-------|----------|
| Latency | Higher (network) | Lower (local) |
| Model Size | Large (bounded by server memory) | Limited |
| Compute | Powerful GPUs | Limited |
| Privacy | Data sent to server | Data stays local |
| Offline | Requires network | Works offline |
| Updates | Easy | Harder (app release or on-device download) |
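
The latency row can be read as simple arithmetic: a cloud request pays for the network round trip and any server-side queueing on top of inference, while an on-device request pays only for inference, usually on slower hardware. The numbers below are illustrative assumptions rather than measurements, and `end_to_end_latency_ms` is a hypothetical helper.

```python
def end_to_end_latency_ms(inference_ms: float,
                          network_rtt_ms: float = 0.0,
                          queue_ms: float = 0.0) -> float:
    """Total user-visible latency for one prediction."""
    return inference_ms + network_rtt_ms + queue_ms


# Assumed numbers, for illustration only
cloud_ms = end_to_end_latency_ms(inference_ms=15.0, network_rtt_ms=60.0, queue_ms=5.0)
device_ms = end_to_end_latency_ms(inference_ms=45.0)

print(f"Cloud: {cloud_ms:.0f} ms, on-device: {device_ms:.0f} ms")
```

Under these assumptions the slower on-device model still responds sooner, because it avoids the network entirely.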

In [None]:
@dataclass
class DeploymentOption:
    """Represents a deployment location option."""
    name: str
    latency_ms: float
    max_model_size_mb: float
    requires_network: bool
    privacy_level: str
    update_ease: str
    compute_power: str
    cost_per_inference: float


class DeploymentDecisionFramework:
    """Framework for deciding between deployment options."""
    
    def __init__(self):
        self.options = {
            'cloud_gpu': DeploymentOption('Cloud (GPU)', 50, 10000, True, 'low', 'easy', 'very_high', 0.0001),
            'cloud_cpu': DeploymentOption('Cloud (CPU)', 100, 5000, True, 'low', 'easy', 'high', 0.00001),
            'edge_server': DeploymentOption('Edge Server', 20, 2000, True, 'medium', 'medium', 'high', 0.00005),
            'mobile': DeploymentOption('Mobile Device', 10, 100, False, 'high', 'hard', 'low', 0),
            'browser': DeploymentOption('Browser (WebML)', 15, 50, False, 'high', 'medium', 'low', 0)
        }
    
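    def evaluate_requirements(self, max_latency_ms, model_size_mb, requires_offline, privacy_req, budget):
        """Score each deployment option against the requirements.

        Every option starts at 100 and loses points for each unmet requirement.
        `privacy_req` is 'low', 'medium', or 'high'; `budget` is the maximum
        cost in dollars per million inferences. Returns a list of
        (key, option, score, reasons) tuples, best score first.
        """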
        privacy_scores = {'low': 1, 'medium': 2, 'high': 3}
        results = []
        
        for key, opt in self.options.items():
            score = 100.0
            reasons = []
            
            if opt.latency_ms > max_latency_ms:
                score -= 30
                reasons.append('Latency too high')
            if model_size_mb > opt.max_model_size_mb:
                score -= 50
                reasons.append('Model too large')
            if requires_offline and opt.requires_network:
                score -= 40
                reasons.append('Requires network')
            if privacy_scores.get(opt.privacy_level, 1) < privacy_scores.get(privacy_req, 1):
                score -= 25
                reasons.append('Privacy insufficient')
            if opt.cost_per_inference * 1e6 > budget:
                score -= 20
                reasons.append('Over budget')
            
            results.append((key, opt, max(0, score), reasons))
        
        return sorted(results, key=lambda x: x[2], reverse=True)
    
    def visualize_comparison(self):
        """Visualize deployment options."""
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        opts = list(self.options.values())
        names = [o.name for o in opts]
        colors = plt.cm.Set2(np.linspace(0, 1, len(names)))
        
        axes[0,0].barh(names, [o.latency_ms for o in opts], color=colors)
        axes[0,0].set_xlabel('Latency (ms)')
        axes[0,0].set_title('Inference Latency')
        
        axes[0,1].barh(names, [o.max_model_size_mb for o in opts], color=colors)
        axes[0,1].set_xlabel('Max Model Size (MB)')
        axes[0,1].set_title('Model Size Capacity')
        axes[0,1].set_xscale('log')
        
        axes[1,0].barh(names, [o.cost_per_inference * 1e6 for o in opts], color=colors)
        axes[1,0].set_xlabel('Cost per Million ($)')
        axes[1,0].set_title('Infrastructure Cost')
        
        compute_map = {'low': 1, 'medium': 2, 'high': 3, 'very_high': 4}
        privacy_map = {'low': 1, 'medium': 2, 'high': 3}
        x = np.arange(len(names))
        axes[1,1].bar(x - 0.2, [compute_map[o.compute_power] for o in opts], 0.4, label='Compute', color='steelblue')
        axes[1,1].bar(x + 0.2, [privacy_map[o.privacy_level] for o in opts], 0.4, label='Privacy', color='forestgreen')
        axes[1,1].set_xticks(x)
        axes[1,1].set_xticklabels(names, rotation=45, ha='right')
        axes[1,1].legend()
        axes[1,1].set_title('Compute and Privacy')
        
        plt.tight_layout()
        plt.show()


framework = DeploymentDecisionFramework()
framework.visualize_comparison()

In [None]:
# Evaluate deployment options for different use cases
def print_ranking(title, ranked):
    """Print ranked deployment options with a PASS/WARN/FAIL status."""
    print(title)
    print("="*55)
    for i, (key, opt, score, reasons) in enumerate(ranked, 1):
        status = 'PASS' if score >= 70 else 'WARN' if score >= 40 else 'FAIL'
        print(f"{i}. {opt.name:<18} Score: {score:>5.0f}  [{status}]")
        for r in reasons:
            print(f"   - {r}")


results = framework.evaluate_requirements(
    max_latency_ms=100, model_size_mb=50, requires_offline=True, privacy_req='medium', budget=10.0)
print_ranking("USE CASE 1: Real-time Image Classification for Mobile", results)

results2 = framework.evaluate_requirements(
    max_latency_ms=500, model_size_mb=5000, requires_offline=False, privacy_req='low', budget=100.0)
print_ranking("\nUSE CASE 2: Large Language Model for Enterprise", results2)

---

## 3. Batch Prediction Systems <a name="3-batch-prediction"></a>

Batch prediction processes large volumes of data at scheduled intervals.

**Use Cases:** Recommendations, credit scoring, email targeting, fraud detection
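
Batch jobs often read from files or database cursors that do not fit in memory. A generator that yields fixed-size chunks keeps memory bounded regardless of input size. The helper below, `iter_batches`, is a hypothetical name introduced for illustration and uses only the standard library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# range(10) stands in for a stream of feature rows
batches = list(iter_batches(range(10), 4))
print(batches)
```

The last batch may be smaller than `batch_size`, so downstream code should not assume a fixed batch length.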

In [None]:
class BatchPredictionPipeline:
    """Production-ready batch prediction pipeline."""
    
    def __init__(self, model, batch_size=1000):
        self.model = model
        self.batch_size = batch_size
        self.metrics = {'processed': 0, 'time': 0, 'batches': 0}
    
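    def process_batch(self, data):
        """Predict one batch; return predictions, probabilities, and timing."""
        start = time.perf_counter()  # monotonic, high-resolution timer
        preds = self.model.predict(data)
        proba = self.model.predict_proba(data) if hasattr(self.model, 'predict_proba') else None
        # Floor the reading so a coarse clock cannot cause division by zero
        elapsed = max(time.perf_counter() - start, 1e-9)
        return preds, proba, {'size': len(data), 'time': elapsed, 'throughput': len(data)/elapsed}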
    
    def run_pipeline(self, data, verbose=True):
        start = time.time()
        n_samples = len(data)
        n_batches = (n_samples + self.batch_size - 1) // self.batch_size
        
        all_preds, all_proba, batch_metrics = [], [], []
        
        if verbose:
            print(f"Starting: {n_samples:,} samples, {n_batches} batches")
        
        for i in range(n_batches):
            batch = data[i*self.batch_size:(i+1)*self.batch_size]
            preds, proba, metrics = self.process_batch(batch)
            all_preds.extend(preds)
            if proba is not None:
                all_proba.extend(proba)
            batch_metrics.append(metrics)
        
        self.metrics['time'] = time.time() - start
        self.metrics['processed'] = n_samples
        self.metrics['batches'] = n_batches
        
        results = pd.DataFrame({'sample_id': range(n_samples), 'prediction': all_preds})
        if all_proba:
            for j in range(len(all_proba[0])):
                results[f'prob_{j}'] = [p[j] for p in all_proba]
        
        if verbose:
            print(f"Completed in {self.metrics['time']:.2f}s ({n_samples/self.metrics['time']:.0f} samples/sec)")
        
        return results, batch_metrics


# Create sample data and train model
X, y = make_classification(n_samples=10000, n_features=20, n_classes=3, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Run batch prediction
pipeline = BatchPredictionPipeline(model, batch_size=500)
results, batch_metrics = pipeline.run_pipeline(X_test)
print("\nSample predictions:")
print(results.head())

In [None]:
# Visualize batch performance
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

times = [m['time'] for m in batch_metrics]
axes[0].bar(range(len(times)), times, color='steelblue', alpha=0.7)
axes[0].axhline(np.mean(times), color='red', linestyle='--', label=f'Mean: {np.mean(times):.3f}s')
axes[0].set_xlabel('Batch')
axes[0].set_ylabel('Time (s)')
axes[0].set_title('Processing Time per Batch')
axes[0].legend()

throughputs = [m['throughput'] for m in batch_metrics]
axes[1].plot(throughputs, marker='o', color='forestgreen')
axes[1].axhline(np.mean(throughputs), color='red', linestyle='--')
axes[1].set_xlabel('Batch')
axes[1].set_ylabel('Throughput')
axes[1].set_title('Throughput per Batch')

results['prediction'].value_counts().sort_index().plot(kind='bar', ax=axes[2], color='coral')
axes[2].set_xlabel('Class')
axes[2].set_ylabel('Count')
axes[2].set_title('Prediction Distribution')

plt.tight_layout()
plt.show()

---

## 4. Online Prediction Systems <a name="4-online-prediction"></a>

Online prediction serves real-time requests with low latency.

**Typical targets (workload-dependent):** p99 latency under 100 ms, availability of 99.9% or higher, throughput of 10,000+ queries per second (QPS)
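
Targets such as 99.9% availability or 10,000 QPS at 100 ms translate into concrete budgets. The sketch below works through two of them: the downtime an availability target allows, and the number of requests in flight implied by Little's law (concurrency = arrival rate × time in system). Both function names are hypothetical, and the concurrency figure assumes every request takes the full 100 ms, so it is an upper bound rather than an expected value.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window of days."""
    return (1.0 - availability) * days * 24 * 60


def in_flight_requests(qps: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return qps * latency_s


downtime = allowed_downtime_minutes(0.999)       # budget over a 30-day window
concurrency = in_flight_requests(10_000, 0.100)  # requests being served at once
print(f"{downtime:.1f} min downtime budget, {concurrency:.0f} requests in flight")
```

A 99.9% target therefore leaves roughly 43 minutes of downtime per 30 days, which must cover deployments and rollbacks as well as outages.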

In [None]:
class PredictionCache:
    """LRU cache for predictions with time-to-live (TTL) expiry; ttl is in seconds."""
    
    def __init__(self, max_size=10000, ttl=300):
        self.max_size = max_size
        self.ttl = ttl
        self.cache = {}
        self.order = []
        self.stats = {'hits': 0, 'misses': 0}
    
    def _key(self, features):
        return hashlib.md5(features.tobytes()).hexdigest()
    
    def get(self, features):
        key = self._key(features)
        if key in self.cache:
            val, ts = self.cache[key]
            if time.time() - ts < self.ttl:
                self.stats['hits'] += 1
                self.order.remove(key)
                self.order.append(key)
                return val
            del self.cache[key]
            self.order.remove(key)
        self.stats['misses'] += 1
        return None
    
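    def set(self, features, prediction):
        key = self._key(features)
        if key in self.cache:
            # Refresh an existing entry instead of duplicating it in the LRU order
            self.order.remove(key)
        else:
            # Evict least recently used entries until there is room
            while len(self.cache) >= self.max_size:
                oldest = self.order.pop(0)
                del self.cache[oldest]
        self.cache[key] = (prediction, time.time())
        self.order.append(key)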
    
    @property
    def hit_rate(self):
        total = self.stats['hits'] + self.stats['misses']
        return self.stats['hits'] / total if total else 0


class OnlinePredictionService:
    """Online prediction service with caching."""
    
    def __init__(self, model, cache_enabled=True):
        self.model = model
        self.cache = PredictionCache() if cache_enabled else None
        self.latencies = []
        self.requests = 0
        self.errors = 0
    
    def predict(self, features):
        start = time.time()
        self.requests += 1
        
        try:
            if self.cache:
                cached = self.cache.get(features)
                if cached is not None:
                    latency = time.time() - start
                    self.latencies.append(latency)
                    return {'prediction': cached['pred'], 'probability': cached['prob'],
                            'latency_ms': latency * 1000, 'cache_hit': True}
            
            features_2d = features.reshape(1, -1)
            pred = self.model.predict(features_2d)[0]
            prob = self.model.predict_proba(features_2d)[0].tolist() if hasattr(self.model, 'predict_proba') else None
            
            if self.cache:
                self.cache.set(features, {'pred': int(pred), 'prob': prob})
            
            latency = time.time() - start
            self.latencies.append(latency)
            return {'prediction': int(pred), 'probability': prob, 'latency_ms': latency * 1000, 'cache_hit': False}
        except Exception as e:
            self.errors += 1
            return {'error': str(e)}
    
    def get_metrics(self):
        lats = [l * 1000 for l in self.latencies]
        return {
            'requests': self.requests,
            'error_rate': self.errors / self.requests if self.requests else 0,
            'p50_ms': np.percentile(lats, 50) if lats else 0,
            'p95_ms': np.percentile(lats, 95) if lats else 0,
            'p99_ms': np.percentile(lats, 99) if lats else 0,
            'cache_hit_rate': self.cache.hit_rate if self.cache else 0
        }


# Create service and simulate requests
service = OnlinePredictionService(model, cache_enabled=True)

print("Simulating 1000 online requests...")
unique_samples = X_test[:200]
indices = np.random.choice(200, 1000, replace=True)
results_online = [service.predict(unique_samples[i]) for i in indices]

metrics = service.get_metrics()
print(f"\nMetrics:")
print(f"  Requests: {metrics['requests']}")
print(f"  Error rate: {metrics['error_rate']:.2%}")
print(f"  P50: {metrics['p50_ms']:.3f}ms")
print(f"  P95: {metrics['p95_ms']:.3f}ms")
print(f"  P99: {metrics['p99_ms']:.3f}ms")
print(f"  Cache hit rate: {metrics['cache_hit_rate']:.2%}")

In [None]:
# Visualize online performance
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

lats = [r['latency_ms'] for r in results_online if 'latency_ms' in r]
axes[0].hist(lats, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axvline(metrics['p50_ms'], color='orange', linestyle='--', label='P50')
axes[0].axvline(metrics['p99_ms'], color='red', linestyle='--', label='P99')
axes[0].set_xlabel('Latency (ms)')
axes[0].set_title('Latency Distribution')
axes[0].legend()

hits = sum(1 for r in results_online if r.get('cache_hit', False))
axes[1].pie([hits, len(results_online)-hits], labels=['Hit', 'Miss'], autopct='%1.1f%%', colors=['forestgreen', 'coral'])
axes[1].set_title('Cache Hit Rate')

rolling = pd.Series(lats).rolling(50).mean()
axes[2].plot(rolling, color='steelblue')
axes[2].set_xlabel('Request')
axes[2].set_ylabel('Rolling Avg Latency (ms)')
axes[2].set_title('Latency Over Time')

plt.tight_layout()
plt.show()

---

## 5. Deployment Patterns <a name="5-deployment-patterns"></a>

Safe deployment patterns minimize risk when updating models.

| Pattern | Risk | Rollback | Use Case |
|---------|------|----------|----------|
| Shadow | Very Low | N/A | Testing new model |
| Canary | Low | Fast | Gradual rollout |
| Blue-Green | Medium | Instant | Quick switches |
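
The "Rollback" column implies a decision rule: something has to decide when a canary is promoted and when it is rolled back. A minimal sketch of such a gate compares error rates once the canary has seen enough traffic. The function name, the thresholds, and the choice of error rate as the only signal are all assumptions made for illustration; real gates usually check latency and business metrics too.

```python
def canary_decision(stable_errors: int, stable_total: int,
                    canary_errors: int, canary_total: int,
                    max_increase: float = 0.01,
                    min_requests: int = 100) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary rollout step."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    canary_rate = canary_errors / canary_total
    # Roll back when the canary is worse than stable by more than the tolerance
    if canary_rate > stable_rate + max_increase:
        return "rollback"
    return "promote"


print(canary_decision(10, 1000, 1, 50))    # too little canary traffic
print(canary_decision(10, 1000, 2, 200))   # canary matches stable error rate
print(canary_decision(10, 1000, 10, 200))  # canary error rate is 5% vs 1%
```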

In [None]:
class ShadowDeployment:
    """Shadow deployment for comparing models."""
    
    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow
        self.log = []
    
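    def predict(self, features):
        features_2d = features.reshape(1, -1)

        start = time.time()
        p_pred = int(self.primary.predict(features_2d)[0])
        p_time = time.time() - start

        # A shadow failure must never affect the response served to the user.
        # In production the shadow call would also run asynchronously.
        try:
            start = time.time()
            s_pred = int(self.shadow.predict(features_2d)[0])
            s_ms = (time.time() - start) * 1000
        except Exception:
            s_pred, s_ms = None, None

        self.log.append({'primary': p_pred, 'shadow': s_pred, 'match': p_pred == s_pred,
                         'primary_ms': p_time * 1000, 'shadow_ms': s_ms})
        return {'prediction': p_pred, 'latency_ms': p_time * 1000}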
    
    def get_metrics(self):
        if not self.log:
            return {}
        df = pd.DataFrame(self.log)
        return {'total': len(df), 'agreement_rate': df['match'].mean()}


class CanaryDeployment:
    """Canary deployment for gradual rollout."""
    
    def __init__(self, stable, canary, canary_pct=5.0):
        self.stable = stable
        self.canary = canary
        self.canary_pct = canary_pct
        self.stable_count = 0
        self.canary_count = 0
    
    def predict(self, features):
        features_2d = features.reshape(1, -1)
        use_canary = np.random.random() * 100 < self.canary_pct
        
        model = self.canary if use_canary else self.stable
        if use_canary:
            self.canary_count += 1
        else:
            self.stable_count += 1
        
        pred = model.predict(features_2d)[0]
        return {'prediction': int(pred), 'model': 'canary' if use_canary else 'stable'}
    
    def increase_canary(self, increment=5.0):
        self.canary_pct = min(100, self.canary_pct + increment)
        return self.canary_pct


class BlueGreenDeployment:
    """Blue-green deployment for instant switches."""
    
    def __init__(self, blue, green):
        self.blue = blue
        self.green = green
        self.active = 'blue'
    
    def predict(self, features):
        features_2d = features.reshape(1, -1)
        model = self.blue if self.active == 'blue' else self.green
        pred = model.predict(features_2d)[0]
        return {'prediction': int(pred), 'environment': self.active}
    
    def switch(self):
        self.active = 'green' if self.active == 'blue' else 'blue'
        return self.active


# Train alternative models
model_v1 = RandomForestClassifier(n_estimators=30, random_state=42)
model_v1.fit(X_train, y_train)

model_v2 = GradientBoostingClassifier(n_estimators=50, random_state=42)
model_v2.fit(X_train, y_train)

print("Models trained for deployment patterns demonstration")

In [None]:
# Demonstrate Shadow Deployment
print("SHADOW DEPLOYMENT")
print("="*40)

shadow = ShadowDeployment(model_v1, model_v2)
for i in range(500):
    shadow.predict(X_test[i])

# Use a distinct name so the online-service `metrics` dict is not overwritten
shadow_metrics = shadow.get_metrics()
print(f"Predictions: {shadow_metrics['total']}")
print(f"Agreement rate: {shadow_metrics['agreement_rate']:.2%}")

In [None]:
# Demonstrate Canary Deployment
print("CANARY DEPLOYMENT")
print("="*40)

canary = CanaryDeployment(model_v1, model_v2, canary_pct=10.0)
rollout_data = []

for phase in range(5):
    canary.stable_count = 0
    canary.canary_count = 0
    
    for i in range(200):
        canary.predict(X_test[i % len(X_test)])
    
    rollout_data.append({'phase': phase+1, 'pct': canary.canary_pct, 
                        'stable': canary.stable_count, 'canary': canary.canary_count})
    print(f"Phase {phase+1}: Canary {canary.canary_pct:.0f}% - Stable: {canary.stable_count}, Canary: {canary.canary_count}")
    canary.increase_canary(20)

# Visualize rollout
df = pd.DataFrame(rollout_data)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(df['phase'] - 0.2, df['stable'], 0.4, label='Stable', color='steelblue')
ax.bar(df['phase'] + 0.2, df['canary'], 0.4, label='Canary', color='coral')
ax.set_xlabel('Phase')
ax.set_ylabel('Requests')
ax.set_title('Canary Deployment: Traffic Shift')
ax.legend()
plt.show()

In [None]:
# Demonstrate Blue-Green Deployment
print("BLUE-GREEN DEPLOYMENT")
print("="*40)

bg = BlueGreenDeployment(model_v1, model_v2)
print(f"Initial: {bg.active}")

results = [bg.predict(X_test[i]) for i in range(5)]
print(f"Predictions from {results[0]['environment']}")

new_env = bg.switch()
print(f"Switched to: {new_env}")

results2 = [bg.predict(X_test[i]) for i in range(5)]
print(f"Predictions from {results2[0]['environment']}")

rolled_back = bg.switch()
print(f"Rolled back to: {rolled_back}")

---

## 6. Building Model Serving APIs <a name="6-model-serving-apis"></a>

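A production-ready serving API exposes at least a prediction endpoint, a health check, and a metrics endpoint. The class below simulates these endpoints in-process, without a web server, so the request-handling logic can be studied in isolation.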
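
Before a request reaches the model, it helps to reject malformed payloads with a client error instead of letting them surface as server errors. The sketch below shows one possible validation step using only the standard library. `validate_request` is a hypothetical helper, and the payload shape (a `features` key holding one row or a list of rows) is an assumption matching the simulated API in this section.

```python
import math
from typing import Any, Dict, Tuple


def validate_request(request: Dict[str, Any], n_features: int) -> Tuple[int, Any]:
    """Return (200, rows) for a valid payload or (400, message) otherwise."""
    features = request.get("features")
    if not isinstance(features, list) or not features:
        return 400, "'features' must be a non-empty list"
    # Accept either a single row or a list of rows
    rows = features if isinstance(features[0], list) else [features]
    for row in rows:
        if not isinstance(row, list) or len(row) != n_features:
            return 400, f"each row must contain {n_features} values"
        for value in row:
            # bool is excluded explicitly because it is a subclass of int
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return 400, "features must be numbers"
            if isinstance(value, float) and not math.isfinite(value):
                return 400, "features must be finite"
    return 200, rows


print(validate_request({"features": [1.0, 2.0]}, n_features=2))
print(validate_request({"features": [1.0, float("nan")]}, n_features=2))
```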

In [None]:
class ModelServingAPI:
    """Simulated FastAPI-style model serving."""
    
    def __init__(self, model, name='default', version='1.0'):
        self.model = model
        self.name = name
        self.version = version
        self.requests = 0
        self.errors = 0
        self.latencies = []
    
    def health(self):
        """GET /health"""
        return {'status': 'healthy', 'model': self.name, 'version': self.version}
    
    def predict(self, request):
        """POST /predict"""
        start = time.time()
        self.requests += 1
        
        try:
            if 'features' not in request:
                return {'error': 'Missing features', 'status': 400}
            
            features = np.array(request['features'])
            if features.ndim == 1:
                features = features.reshape(1, -1)
            
            preds = self.model.predict(features)
            proba = self.model.predict_proba(features) if hasattr(self.model, 'predict_proba') else None
            
            latency = time.time() - start
            self.latencies.append(latency)
            
            response = {'predictions': preds.tolist(), 'latency_ms': latency * 1000, 'status': 200}
            if proba is not None:
                response['probabilities'] = proba.tolist()
            return response
        except Exception as e:
            self.errors += 1
            return {'error': str(e), 'status': 500}
    
    def metrics(self):
        """GET /metrics"""
        lats = [l * 1000 for l in self.latencies]
        return {
            'requests_total': self.requests,
            'errors_total': self.errors,
            'latency_p50': np.percentile(lats, 50) if lats else 0,
            'latency_p99': np.percentile(lats, 99) if lats else 0
        }


# Create and test API
api = ModelServingAPI(model, 'classifier', '1.0.0')

print("Health Check:", api.health())
print("\nPrediction:", api.predict({'features': X_test[0].tolist()}))

# Load test
print("\nLoad testing (1000 requests)...")
for i in range(1000):
    api.predict({'features': X_test[i % len(X_test)].tolist()})

print("Metrics:", api.metrics())

---

## 7. Prediction Pipeline Design <a name="7-pipeline-design"></a>

End-to-end pipelines include preprocessing, prediction, and postprocessing.
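
A common failure in such pipelines is training/serving skew: the model is trained on features prepared one way and served features prepared another way. Keeping the fitted preprocessor and the model in a single object makes it harder for them to drift apart. The sketch below illustrates the idea with the standard library only; `Standardizer` and `ServingBundle` are hypothetical classes, and the lambda stands in for a trained model.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence


class Standardizer:
    """Column-wise (x - mean) / std, fitted once on training data."""

    def fit(self, rows: Sequence[Sequence[float]]) -> "Standardizer":
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid division by zero
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
                for row in rows]


@dataclass
class ServingBundle:
    """Fitted preprocessor and model that are deployed as one unit."""
    preprocessor: Standardizer
    model: Callable[[List[float]], int]

    def predict(self, raw_rows: Sequence[Sequence[float]]) -> List[int]:
        return [self.model(row) for row in self.preprocessor.transform(raw_rows)]


train_rows = [[1.0, 10.0], [3.0, 30.0]]
bundle = ServingBundle(Standardizer().fit(train_rows),
                       model=lambda row: int(row[0] > 0))  # stand-in model
predictions = bundle.predict([[3.0, 30.0], [1.0, 10.0]])
print(predictions)
```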

In [None]:
class PredictionPipeline:
    """Complete prediction pipeline."""
    
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model
        self.timings = {'preprocess': [], 'predict': [], 'postprocess': []}
    
    def preprocess(self, raw_input):
        start = time.time()
        features = self.preprocessor.transform(raw_input) if self.preprocessor else raw_input
        self.timings['preprocess'].append(time.time() - start)
        return features
    
    def predict(self, features):
        start = time.time()
        if features.ndim == 1:
            features = features.reshape(1, -1)
        preds = self.model.predict(features)
        proba = self.model.predict_proba(features) if hasattr(self.model, 'predict_proba') else None
        self.timings['predict'].append(time.time() - start)
        return preds, proba
    
    def postprocess(self, preds, proba):
        start = time.time()
        class_names = {0: 'Class A', 1: 'Class B', 2: 'Class C'}
        results = []
        for i, pred in enumerate(preds):
            result = {'class_id': int(pred), 'class_name': class_names.get(int(pred), 'Unknown')}
            if proba is not None:
                result['confidence'] = float(max(proba[i]))
            results.append(result)
        self.timings['postprocess'].append(time.time() - start)
        return results
    
    def run(self, raw_input):
        features = self.preprocess(raw_input)
        preds, proba = self.predict(features)
        return self.postprocess(preds, proba)
    
    def get_timing(self):
        return {k: np.mean(v) * 1000 if v else 0 for k, v in self.timings.items()}


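# Create pipeline. The model is trained on scaled features so that
# serving-time preprocessing matches training-time preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pipeline_model = RandomForestClassifier(n_estimators=50, random_state=42)
pipeline_model.fit(X_train_scaled, y_train)

pipeline = PredictionPipeline(scaler, pipeline_model)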

# Run predictions
for i in range(100):
    pipeline.run(X_test[i:i+1])

example = pipeline.run(X_test[0:1])[0]
print("Example Result:")
print(f"  Class: {example['class_name']}")
print(f"  Confidence: {example['confidence']:.2%}")

print("\nTiming Breakdown:")
timing = pipeline.get_timing()
total = sum(timing.values())
for stage, ms in timing.items():
    print(f"  {stage}: {ms:.3f}ms ({ms/total*100:.1f}%)")

In [None]:
# Visualize pipeline timing
fig, ax = plt.subplots(figsize=(8, 5))
# Convert dict views to lists; pie() expects array-like input
ax.pie(list(timing.values()), labels=list(timing.keys()), autopct='%1.1f%%', colors=['steelblue', 'coral', 'forestgreen'])
ax.set_title('Pipeline Stage Timing Distribution')
plt.show()

---

## 8. Hands-on Exercises <a name="8-exercises"></a>

### Exercise 1: Deployment Decision
Evaluate deployment options for a voice assistant that needs:
- Max latency: 50ms
- Model size: 200MB
- Must work offline
- High privacy
- Budget: $50 per million inferences

In [None]:
# Exercise 1 Solution
print("Exercise 1: Voice Assistant Deployment")
print("="*50)

results = framework.evaluate_requirements(
    max_latency_ms=50,
    model_size_mb=200,
    requires_offline=True,
    privacy_req='high',
    budget=50.0
)

for i, (key, opt, score, reasons) in enumerate(results[:3], 1):
    print(f"{i}. {opt.name}: Score {score:.0f}")
    for r in reasons:
        print(f"   - {r}")

print("\nRecommendation: Consider model compression or edge server hybrid approach")

### Exercise 2: Implement A/B Testing
Create an A/B testing deployment that routes 50% traffic to each model.

In [None]:
# Exercise 2 Solution
class ABTestDeployment:
    """A/B testing deployment."""
    
    def __init__(self, model_a, model_b, split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.split = split
        self.results_a = []
        self.results_b = []
    
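    def predict(self, features, user_id=None):
        # Deterministic assignment if user_id provided. A stable digest is used
        # because built-in hash() of strings varies between Python processes.
        if user_id is not None:
            digest = hashlib.md5(str(user_id).encode()).hexdigest()
            use_a = int(digest, 16) % 100 < self.split * 100
        else:
            use_a = np.random.random() < self.split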
        
        features_2d = features.reshape(1, -1)
        model = self.model_a if use_a else self.model_b
        
        pred = model.predict(features_2d)[0]
        variant = 'A' if use_a else 'B'
        
        if use_a:
            self.results_a.append(pred)
        else:
            self.results_b.append(pred)
        
        return {'prediction': int(pred), 'variant': variant}
    
    def get_stats(self):
        return {
            'variant_a_count': len(self.results_a),
            'variant_b_count': len(self.results_b),
            'split_actual': len(self.results_a) / (len(self.results_a) + len(self.results_b)) if self.results_a or self.results_b else 0
        }


# Test A/B deployment
ab = ABTestDeployment(model_v1, model_v2, split=0.5)

for i in range(1000):
    ab.predict(X_test[i % len(X_test)])

stats = ab.get_stats()
print(f"A/B Test Results:")
print(f"  Variant A: {stats['variant_a_count']} requests")
print(f"  Variant B: {stats['variant_b_count']} requests")
print(f"  Actual split: {stats['split_actual']:.2%}")

---

## 9. Summary <a name="9-summary"></a>

### Key Takeaways

1. **Deployment Location**: Choose between cloud, edge, and on-device based on latency, privacy, and compute needs

2. **Batch vs Online**: Use batch for high-throughput, scheduled workloads; online for real-time, low-latency needs

3. **Deployment Patterns**:
   - Shadow: Test new models without user impact
   - Canary: Gradual rollout with quick rollback
   - Blue-Green: Instant switches between versions

4. **Production APIs**: Include health checks, metrics, proper error handling

5. **Pipelines**: Design end-to-end with preprocessing, prediction, postprocessing

### Best Practices

- Always start with shadow deployment for new models
- Use caching to reduce latency and compute costs
- Monitor latency percentiles (p50, p95, p99)
- Design for rollback from day one
- Track model version in all predictions
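
The last practice can be as simple as wrapping every prediction in a small envelope before it is returned or logged. The sketch below is one possible shape; `tag_prediction` and the field names are assumptions chosen for illustration, not a standard schema.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def tag_prediction(prediction: Any, model_name: str, model_version: str) -> Dict[str, Any]:
    """Attach traceability metadata to a single prediction."""
    return {
        "request_id": str(uuid.uuid4()),                      # unique per response
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
        "model": model_name,
        "version": model_version,
        "prediction": prediction,
    }


record = tag_prediction(2, "classifier", "1.0.0")
print(record["model"], record["version"], record["prediction"])
```

With the version recorded on every response, a regression seen in logs can be traced to the model that produced it, which also makes rollback decisions easier to justify.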

### Next Steps

- Tutorial 15: Model Compression Techniques
- Tutorial 16: Serving and Prediction Pipelines