# 03 - Model Serving with KServe

![Workflow](../docs/03-inference-workflow.png)

## What This Notebook Does

| Step | Action | Component |
|------|--------|------------|
| 1 | Create serving script | ConfigMap |
| 2 | Deploy InferenceService | KServe |
| 3 | Make predictions | REST API (V2 protocol) |

## Inference Flow

```
Client                 KServe                Feast Server
  │                      │                        │
  │ {store_id, dept_id}  │                        │
  │─────────────────────▶│                        │
  │                      │  get-online-features   │
  │                      │───────────────────────▶│
  │                      │                        │
  │                      │◀───────────────────────│
  │                      │  [15 features]         │
  │                      │                        │
  │                      │  model.predict()       │
  │◀─────────────────────│                        │
  │  prediction: $96,763 │                        │
```

**Key:** Client sends entity IDs only; KServe fetches features from Feast.

**Prerequisites:** `01-feast-features.ipynb` and `02-training.ipynb` completed.

In [None]:
%pip install -q kserve pandas tqdm kubernetes
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    constants
)
from kubernetes import client as k8s_client
from kubernetes.client import V1Container, V1ResourceRequirements, V1VolumeMount, V1Volume, V1PersistentVolumeClaimVolumeSource, V1ConfigMapVolumeSource, V1EnvVar, V1Probe, V1HTTPGetAction
import pandas as pd
import requests
from tqdm.auto import tqdm
import time
import json

## Configuration

| Parameter | Value | Purpose |
|-----------|-------|----------|
| `MODEL_NAME` | `sales-forecast` | InferenceService name |
| `MODEL_DIR` | `/shared/models` | Path to trained model |
| `FEAST_SERVER_URL` | `http://feast-server.<namespace>.svc.cluster.local:6566` | Online feature store |
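
Both in-cluster URLs used in this notebook (the Feast server and, later, the predictor endpoint) follow the Kubernetes Service DNS pattern `<service>.<namespace>.svc.cluster.local`. A minimal sketch of building them, assuming the default `cluster.local` cluster domain; `service_url` is a hypothetical helper for illustration, not part of any library:

```python
def service_url(service: str, namespace: str, port: int, scheme: str = "http") -> str:
    """Build an in-cluster URL for a Kubernetes Service.

    Assumes the cluster uses the default `cluster.local` domain.
    """
    return f"{scheme}://{service}.{namespace}.svc.cluster.local:{port}"


NAMESPACE = "feast-trainer-demo"

# Feast feature server and the predictor Service created by KServe
feast_url = service_url("feast-server", NAMESPACE, 6566)
predictor_url = service_url("sales-forecast-predictor", NAMESPACE, 8080)

print(feast_url)
print(predictor_url)
```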

In [None]:
NAMESPACE = "feast-trainer-demo"
MODEL_NAME = "sales-forecast"
MODEL_DIR = "/shared/models"
PVC_NAME = "shared"
CONFIGMAP_NAME = "sales-forecast-serve"

# Trainer image (has PyTorch pre-installed)
TRAINER_IMAGE = "quay.io/modh/training:py311-cuda124-torch251"

# Feast Feature Server
FEAST_SERVER_URL = f"http://feast-server.{NAMESPACE}.svc.cluster.local:6566"

print(f"Namespace: {NAMESPACE}")
print(f"Model: {MODEL_NAME}")
print(f"PVC: {PVC_NAME}")

## 1. Create Serving Script

The serving script implements KServe's `Model` interface:

| Method | Purpose |
|--------|----------|
| `load()` | Load model, scalers, feature columns |
| `preprocess()` | Extract entities → Call Feast → Build feature matrix |
| `predict()` | Scale features → Model inference → Inverse transform |
| `postprocess()` | Format as V2 InferResponse |
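
The "inverse transform" step in `predict()` undoes whatever was applied to the target during training. The sketch below shows the round trip in plain Python, assuming the target was passed through `log1p` and then standardized (mean and standard deviation); the actual scaler type stored in `scalers.joblib` is not shown in this notebook, and the sales values are illustrative:

```python
import math


def fit_standard(values):
    """Return (mean, std) of a list; a stand-in for a fitted scaler (assumption)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def to_model_space(y, mean, std):
    # Training side: compress with log1p, then standardize
    return (math.log1p(y) - mean) / std


def from_model_space(z, mean, std):
    # Serving side: undo standardization first, then undo log1p with expm1
    return math.expm1(z * std + mean)


sales = [12000.0, 48000.0, 96763.0]  # illustrative weekly sales
mean, std = fit_standard([math.log1p(s) for s in sales])

# Applying the forward and inverse transforms should recover the inputs
roundtrip = [from_model_space(to_model_space(s, mean, std), mean, std) for s in sales]
print(roundtrip)
```

The order matters: the inverse steps run in the reverse order of the forward steps, which is why the serving script calls `inverse_transform` before `np.expm1`.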

**Feast Integration:**
```python
def _get_features_from_feast(self, entities):
    # POST to Feast server's online feature API
    response = requests.post(
        f"{FEAST_SERVER_URL}/get-online-features",
        json={"feature_service": "inference_features", "entities": {...}}
    )
```
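
The request is column-oriented (one list per entity key), and the serving script expects a column-oriented response as well: `metadata.feature_names` plus one `results` entry per feature. A minimal sketch of both conversions, run against a hand-written sample response rather than a live server; the feature names `lag_1` and `is_holiday` are hypothetical:

```python
def build_feast_payload(entity_rows, feature_service="inference_features"):
    """Turn row-oriented entities into a column-oriented request body."""
    return {
        "feature_service": feature_service,
        "entities": {
            "store_id": [row["store_id"] for row in entity_rows],
            "dept_id": [row["dept_id"] for row in entity_rows],
        },
    }


def rows_from_feast_response(result, feature_cols, default=0):
    """Convert a columnar response into one dict per entity, keeping only feature_cols."""
    names = result.get("metadata", {}).get("feature_names", [])
    columns = result.get("results", [])
    n_rows = len(columns[0]["values"]) if columns else 0
    rows = []
    for i in range(n_rows):
        row = {}
        for name, column in zip(names, columns):
            if name in feature_cols:
                value = column["values"][i]
                # Missing values are replaced with a default, as in the serving script
                row[name] = default if value is None else value
        rows.append(row)
    return rows


entity_rows = [{"store_id": 1, "dept_id": 3}, {"store_id": 2, "dept_id": 5}]
payload = build_feast_payload(entity_rows)

# Hand-written sample in the shape the serving script parses (not real server output)
sample_response = {
    "metadata": {"feature_names": ["store_id", "dept_id", "lag_1", "is_holiday"]},
    "results": [
        {"values": [1, 2]},
        {"values": [3, 5]},
        {"values": [1500.0, None]},
        {"values": [0, 1]},
    ],
}
rows = rows_from_feast_response(sample_response, feature_cols=["lag_1", "is_holiday"])
print(rows)
```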

In [None]:
# Create ConfigMap with serving script (required by InferenceService)
from kubernetes.client import V1ConfigMap, V1ObjectMeta
from kubernetes import config

# Load in-cluster config or local kubeconfig
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

core_v1 = k8s_client.CoreV1Api()

SERVE_SCRIPT = '''#!/usr/bin/env python3
"""
Sales Forecasting Inference Server (KServe V2 Protocol)

Uses KServe Model class for standard inference protocol.
Feast features are fetched in preprocess step.
"""
import os
import json
import requests
import torch
import torch.nn as nn
import joblib
import numpy as np
import logging
from typing import Dict, Union
import kserve
from kserve import Model, ModelServer, InferRequest, InferResponse, InferOutput

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

FEAST_SERVER_URL = os.getenv("FEAST_SERVER_URL", "http://feast-server.feast-trainer-demo.svc.cluster.local:6566")

class SalesMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims=[256, 128, 64], dropout=0.2):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([nn.Linear(prev_dim, h_dim), nn.BatchNorm1d(h_dim), nn.ReLU(), nn.Dropout(dropout)])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


class SalesForecastModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.scalers = None
        self.feature_cols = None
        self.metadata = None
        self.use_log_transform = False
        self.ready = False
    
    def load(self):
        model_dir = os.getenv("MODEL_DIR", "/shared/models")
        logger.info(f"Loading model from {model_dir}...")
        
        with open(f"{model_dir}/model_metadata.json") as f:
            self.metadata = json.load(f)
        
        hidden_dims = self.metadata.get("hidden_dims", [256, 128, 64])
        dropout = self.metadata.get("dropout", 0.2)
        input_dim = self.metadata["input_dim"]
        
        self.model = SalesMLP(input_dim, hidden_dims, dropout)
        self.model.load_state_dict(torch.load(f"{model_dir}/best_model.pt", map_location="cpu", weights_only=True))
        self.model.eval()
        
        self.scalers = joblib.load(f"{model_dir}/scalers.joblib")
        self.feature_cols = self.metadata["feature_columns"]
        self.use_log_transform = self.scalers.get("use_log_transform", False)
        
        logger.info(f"Model loaded: {len(self.feature_cols)} features, arch={hidden_dims}")
        self.ready = True
    
    def _get_features_from_feast(self, entity_rows):
        payload = {
            "feature_service": "inference_features",
            "entities": {
                "store_id": [row["store_id"] for row in entity_rows],
                "dept_id": [row["dept_id"] for row in entity_rows],
            }
        }
        response = requests.post(f"{FEAST_SERVER_URL}/get-online-features", json=payload, timeout=10)
        response.raise_for_status()
        result = response.json()
        
        feature_names = result.get("metadata", {}).get("feature_names", [])
        results = result.get("results", [])
        
        features = []
        for i in range(len(entity_rows)):
            row = {}
            for j, name in enumerate(feature_names):
                if name in self.feature_cols and j < len(results):
                    val = results[j].get("values", [None])[i] if i < len(results[j].get("values", [])) else None
                    row[name] = val if val is not None else 0
            features.append(row)
        return features
    
    def preprocess(self, payload: Union[Dict, InferRequest], headers: Dict = None) -> np.ndarray:
        if isinstance(payload, InferRequest):
            inputs = {inp.name: inp.data for inp in payload.inputs}
        else:
            inputs = {}
            for inp in payload.get("inputs", []):
                inputs[inp["name"]] = inp.get("data")
        
        if "entities" in inputs:
            entities = inputs["entities"]
            if isinstance(entities, list) and len(entities) > 0:
                logger.info(f"Fetching features from Feast for {len(entities)} entities")
                feature_rows = self._get_features_from_feast(entities)
                X = np.array([[row.get(c, 0) for c in self.feature_cols] for row in feature_rows], dtype=np.float32)
                return X
        
        if "features" in inputs:
            return np.array(inputs["features"], dtype=np.float32)
        
        raise ValueError("Input must have 'entities' or 'features'")
    
    def predict(self, X: np.ndarray, headers: Dict = None) -> np.ndarray:
        X_scaled = self.scalers["scaler_X"].transform(X)
        with torch.no_grad():
            preds = self.model(torch.FloatTensor(X_scaled)).numpy()
        predictions = self.scalers["scaler_y"].inverse_transform(preds.reshape(-1, 1)).flatten()
        if self.use_log_transform:
            predictions = np.expm1(predictions)
        return predictions
    
    def postprocess(self, predictions: np.ndarray, headers: Dict = None) -> Union[Dict, InferResponse]:
        return InferResponse(
            model_name=self.name,
            infer_outputs=[InferOutput(name="predictions", shape=list(predictions.shape), datatype="FP32", data=predictions.tolist())]
        )


if __name__ == "__main__":
    model = SalesForecastModel("sales-forecast")
    model.load()
    ModelServer().start([model])
'''

configmap = V1ConfigMap(
    metadata=V1ObjectMeta(name=CONFIGMAP_NAME, namespace=NAMESPACE, labels={"app": "sales-forecasting"}),
    data={"serve.py": SERVE_SCRIPT}
)

from kubernetes.client.rest import ApiException

try:
    core_v1.read_namespaced_config_map(CONFIGMAP_NAME, NAMESPACE)
    print(f"⚠️ ConfigMap {CONFIGMAP_NAME} exists, replacing...")
    core_v1.replace_namespaced_config_map(CONFIGMAP_NAME, NAMESPACE, configmap)
except ApiException as e:
    # Only a 404 means the ConfigMap is missing; re-raise any other API error
    if e.status != 404:
        raise
    print(f"📦 Creating ConfigMap {CONFIGMAP_NAME}...")
    core_v1.create_namespaced_config_map(NAMESPACE, configmap)

print("✅ ConfigMap applied with serve.py script")

## 2. Initialize KServe Client

The `KServeClient` manages InferenceService lifecycle.

In [None]:
# Initialize KServe client for model management
kserve_client = KServeClient()
print("✅ KServeClient initialized")

## 3. Build InferenceService Spec

Define the model server deployment:

| Component | Value | Purpose |
|-----------|-------|----------|
| `image` | `quay.io/modh/training:...` | Container with PyTorch |
| `replicas` | 1-3 | Autoscaling range |
| `resources` | 2 CPU, 4Gi | Per-pod limits |

**Volume Mounts:**
- `/scripts` → ConfigMap with serve.py
- `/shared` → PVC with model artifacts
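
For readers more used to YAML manifests, the same deployment can be written as a plain dictionary. This is a sketch only: field names follow the Kubernetes camelCase convention of the `serving.kserve.io/v1beta1` API, `build_isvc_manifest` is a hypothetical helper, and the container `command`/`args` and labels are omitted for brevity.

```python
import json


def build_isvc_manifest(name, namespace, image, configmap, pvc, model_dir, feast_url):
    """Return an InferenceService manifest as a plain dict (sketch, not SDK output)."""
    container = {
        "name": "kserve-container",
        "image": image,
        "ports": [{"containerPort": 8080, "protocol": "TCP"}],
        "env": [
            {"name": "MODEL_DIR", "value": model_dir},
            {"name": "FEAST_SERVER_URL", "value": feast_url},
        ],
        "resources": {
            "requests": {"cpu": "500m", "memory": "1Gi"},
            "limits": {"cpu": "2", "memory": "4Gi"},
        },
        "readinessProbe": {
            "httpGet": {"path": "/v2/health/ready", "port": 8080},
            "initialDelaySeconds": 60,
            "periodSeconds": 10,
        },
        "volumeMounts": [
            {"name": "scripts", "mountPath": "/scripts"},
            {"name": "model-storage", "mountPath": "/shared"},
        ],
    }
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"serving.kserve.io/deploymentMode": "RawDeployment"},
        },
        "spec": {
            "predictor": {
                "minReplicas": 1,
                "maxReplicas": 3,
                "containers": [container],
                "volumes": [
                    {"name": "scripts", "configMap": {"name": configmap}},
                    {"name": "model-storage", "persistentVolumeClaim": {"claimName": pvc}},
                ],
            }
        },
    }


manifest = build_isvc_manifest(
    name="sales-forecast",
    namespace="feast-trainer-demo",
    image="quay.io/modh/training:py311-cuda124-torch251",
    configmap="sales-forecast-serve",
    pvc="shared",
    model_dir="/shared/models",
    feast_url="http://feast-server.feast-trainer-demo.svc.cluster.local:6566",
)
print(json.dumps(manifest, indent=2))
```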

In [None]:
# Build InferenceService spec using KServe SDK
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=k8s_client.V1ObjectMeta(
        name=MODEL_NAME,
        namespace=NAMESPACE,
        labels={"app": "sales-forecasting"},
        annotations={"serving.kserve.io/deploymentMode": "RawDeployment"}
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            min_replicas=1,
            max_replicas=3,
            containers=[
                V1Container(
                    name="kserve-container",
                    image=TRAINER_IMAGE,
                    command=["/bin/bash", "-c"],
                    args=["pip install -q kserve joblib numpy scikit-learn && python /scripts/serve.py"],
                    ports=[k8s_client.V1ContainerPort(container_port=8080, protocol="TCP")],
                    env=[
                        V1EnvVar(name="MODEL_DIR", value=MODEL_DIR),
                        V1EnvVar(name="FEAST_SERVER_URL", value=FEAST_SERVER_URL),
                    ],
                    resources=V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "1Gi"},
                        limits={"cpu": "2", "memory": "4Gi"}
                    ),
                    readiness_probe=V1Probe(
                        http_get=V1HTTPGetAction(path="/v2/health/ready", port=8080),
                        initial_delay_seconds=60,
                        period_seconds=10
                    ),
                    volume_mounts=[
                        V1VolumeMount(name="scripts", mount_path="/scripts"),
                        V1VolumeMount(name="model-storage", mount_path="/shared"),
                    ]
                )
            ],
            volumes=[
                V1Volume(name="scripts", config_map=V1ConfigMapVolumeSource(name=CONFIGMAP_NAME)),
                V1Volume(name="model-storage", persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name=PVC_NAME)),
            ]
        )
    )
)

print("✅ InferenceService spec created")
print(f"   Image: {TRAINER_IMAGE}")
print(f"   Model dir: {MODEL_DIR}")

## 4. Deploy InferenceService

Create or update the service in Kubernetes:

In [None]:
# Deploy InferenceService (create or replace)
try:
    # Check if already exists
    existing = kserve_client.get(MODEL_NAME, namespace=NAMESPACE)
    print(f"⚠️ InferenceService '{MODEL_NAME}' exists, replacing...")
    kserve_client.replace(MODEL_NAME, isvc, namespace=NAMESPACE)
except Exception:
    print(f"📦 Creating InferenceService '{MODEL_NAME}'...")
    kserve_client.create(isvc)

print(f"✅ InferenceService submitted to namespace '{NAMESPACE}'")


## 5. Wait for Ready

The service is ready when:
1. Pod is running
2. Model is loaded
3. Health check passes (`/v2/health/ready`)
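
Waiting for readiness boils down to polling a check until it succeeds or a deadline passes. The sketch below shows that pattern in isolation; `wait_until` is a hypothetical helper, not the KServe SDK's implementation, and the fake check stands in for a real health probe:

```python
import time


def wait_until(check, timeout_seconds=300.0, interval_seconds=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or the timeout elapses.

    Returns True if the check passed, False if the deadline was reached.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


# Fake readiness check that succeeds on the third poll
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_ready, timeout_seconds=1.0, interval_seconds=0.0, sleep=lambda _: None)
timed_out = wait_until(lambda: False, timeout_seconds=0.0, interval_seconds=0.0, sleep=lambda _: None)
print(ready, calls["n"], timed_out)
```

Using a monotonic clock for the deadline avoids surprises if the wall clock is adjusted while waiting.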

In [None]:
# Wait for InferenceService to be ready
print("⏳ Waiting for InferenceService to be ready...")
kserve_client.wait_isvc_ready(MODEL_NAME, namespace=NAMESPACE, timeout_seconds=300)

# Get service status
isvc_status = kserve_client.get(MODEL_NAME, namespace=NAMESPACE)
url = isvc_status.get("status", {}).get("url", "")
print("✅ InferenceService ready!")
print(f"   URL: {url}")


## 6. Make Predictions (V2 Protocol)

KServe V2 (Open Inference Protocol) format:

**Request:**
```json
POST /v2/models/sales-forecast/infer
{
  "inputs": [{
    "name": "entities",
    "shape": [1],
    "datatype": "BYTES",
    "data": [{"store_id": 1, "dept_id": 3}]
  }]
}
```

**Response:**
```json
{
  "outputs": [{
    "name": "predictions",
    "data": [96763.45]
  }]
}
```
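
The request and response shapes above can be built and parsed with two small helpers. This is a sketch that works on plain dictionaries without contacting a server; `build_v2_request` and `extract_predictions` are hypothetical names, and the sample response is hand-written to match the example:

```python
def build_v2_request(entities):
    """Wrap a list of entity dicts in a V2-style request body."""
    return {
        "inputs": [
            {
                "name": "entities",
                "shape": [len(entities)],
                "datatype": "BYTES",
                "data": entities,
            }
        ]
    }


def extract_predictions(response, output_name="predictions"):
    """Return the data of the named output, or raise if it is absent."""
    for output in response.get("outputs", []):
        if output.get("name") == output_name:
            return list(output.get("data", []))
    raise KeyError(f"output {output_name!r} not found in response")


request_body = build_v2_request([{"store_id": 1, "dept_id": 3}])

# Hand-written response in the shape shown above
sample_response = {"outputs": [{"name": "predictions", "data": [96763.45]}]}
predictions = extract_predictions(sample_response)
print(request_body["inputs"][0]["shape"], predictions)
```

Looking the output up by name rather than by position keeps the client working if the server later returns additional outputs.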

In [None]:
# Create inference endpoint (KServe adds -predictor suffix)
# Headless service requires direct port 8080
ENDPOINT = f"http://{MODEL_NAME}-predictor.{NAMESPACE}.svc.cluster.local:8080"

# Check server health
try:
    resp = requests.get(f"{ENDPOINT}/v2/health/ready", timeout=5)
    ready = resp.status_code == 200
except requests.RequestException:
    ready = False
print(f"{'✅' if ready else '❌'} Inference endpoint ready: {ready}")
print(f"   Endpoint: {ENDPOINT}")

In [None]:
def predict_with_feast(entities: list) -> dict:
    """
    Make prediction using KServe V2 protocol with Feast feature lookup.
    """
    # V2 protocol endpoint
    url = f"{ENDPOINT}/v2/models/{MODEL_NAME}/infer"
    
    # V2 protocol payload format
    payload = {
        "inputs": [
            {
                "name": "entities",
                "shape": [len(entities)],
                "datatype": "BYTES",
                "data": entities
            }
        ]
    }
    
    t0 = time.time()
    response = requests.post(url, json=payload, timeout=30)
    latency = (time.time() - t0) * 1000
    
    if response.status_code != 200:
        raise Exception(f"Prediction failed: {response.text}")
    
    result = response.json()
    # V2 protocol returns outputs array
    predictions = result.get("outputs", [{}])[0].get("data", [])
    
    return {"predictions": predictions, "latency_ms": latency, "entities": entities}

# Single prediction
entity = {"store_id": 1, "dept_id": 3}
try:
    result = predict_with_feast([entity])
    print(f"✅ Store {entity['store_id']}, Dept {entity['dept_id']}: ${result['predictions'][0]:,.0f}")
    print(f"   Latency: {result['latency_ms']:.0f}ms")
except Exception as e:
    print(f"❌ Prediction error: {e}")

## 7. Batch Scoring

Score multiple entities in a single request for efficiency:

| Approach | Latency | Use Case |
|----------|---------|----------|
| Single | ~30ms | Real-time API |
| Batch (16) | ~40ms | Bulk scoring |
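
For entity lists much larger than the 16 used here, one option is to split them into fixed-size requests so that each payload stays bounded. A minimal sketch; `chunked` is a hypothetical helper and the batch size of 6 is arbitrary:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


entities = [{"store_id": s, "dept_id": d} for s in [1, 10, 25, 45] for d in [1, 5, 10, 14]]
batches = list(chunked(entities, 6))

# Each batch would be sent as one inference request
print([len(batch) for batch in batches])
```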

In [None]:
# Score multiple entities in a single request (batched)
entities = [{"store_id": s, "dept_id": d} for s in [1, 10, 25, 45] for d in [1, 5, 10, 14]]
print(f"Scoring {len(entities)} entities...")

try:
    result = predict_with_feast(entities)
    
    # Build results dataframe
    results = pd.DataFrame([
        {**e, "prediction": p}
        for e, p in zip(entities, result["predictions"])
    ])
    
    print(f"\n✅ {len(results)} predictions in {result['latency_ms']:.0f}ms (batched)")
except Exception as e:
    print(f"❌ Batch prediction error: {e}")
    results = pd.DataFrame()

In [None]:
results


In [None]:
if 'prediction' in results.columns and results['prediction'].notna().any():
    print("📊 Summary:")
    print(f"   Min: ${results['prediction'].min():,.0f}")
    print(f"   Max: ${results['prediction'].max():,.0f}")
    print(f"   Mean: ${results['prediction'].mean():,.0f}")

## 8. Save Predictions

Save batch results to PVC for downstream processing:

In [None]:
from datetime import datetime
import os

if not results.empty:
    os.makedirs('/opt/app-root/src/shared/predictions', exist_ok=True)
    path = f"/opt/app-root/src/shared/predictions/batch_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
    results.to_parquet(path, index=False)
    print(f"✅ Saved: {path}")
else:
    print("⚠️ No results to save")

## 9. Cleanup (Optional)

Delete resources when done:

In [None]:
# Uncomment to delete resources
# kserve_client.delete(MODEL_NAME, namespace=NAMESPACE)
# core_v1.delete_namespaced_config_map(CONFIGMAP_NAME, NAMESPACE)
# print("✅ Deleted InferenceService and ConfigMap")

---
## ✅ Pipeline Complete!

### End-to-End Summary

| Notebook | Component | Output |
|----------|-----------|--------|
| 01 | Feast | Features registered, online store populated |
| 02 | Kubeflow | Model trained, logged to MLflow |
| 03 | KServe | Model serving with Feast integration |

### Key Benefits of This Architecture

| Benefit | How |
|---------|-----|
| **No feature skew** | Same FeatureService for train & serve |
| **Scalable training** | Ray + PyTorch DDP |
| **Low latency serving** | Feast online store (<50ms) |
| **Experiment tracking** | MLflow for all runs |
| **Simple client** | Send entity IDs, get predictions |
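
Sharing a FeatureService addresses which features are used; the serving side also has to assemble them in the column order saved at training time (`feature_columns` in the model metadata). The sketch below builds a feature vector in that order and reports missing features instead of silently zero-filling them. `to_feature_vector` and the feature names are hypothetical:

```python
def to_feature_vector(row, feature_cols, default=0.0):
    """Order a feature dict by `feature_cols`; return (vector, missing_names)."""
    missing = [c for c in feature_cols if row.get(c) is None]
    vector = [default if row.get(c) is None else float(row[c]) for c in feature_cols]
    return vector, missing


# Column order as it would be saved at training time (hypothetical names)
feature_cols = ["lag_1", "is_holiday", "store_size"]

# Features arrive in arbitrary order, with one missing and one extra key
row = {"is_holiday": 1, "lag_1": 1500.0, "extra": 9}

vector, missing = to_feature_vector(row, feature_cols)
print(vector, missing)
```

Logging the `missing` list makes it visible when the online store has no value for a feature, which would otherwise show up only as degraded predictions.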