# 03 - Model Serving & Inference

![Workflow](../docs/03-inference-workflow.png)

**KServe → Feast Feature Server → Predictions**

**Prerequisites:**
1. `01-feast-features.ipynb` completed
2. `02-training.ipynb` completed
3. KServe deployed: `kubectl apply -f ../manifests/08-kserve-inference.yaml`

In [None]:
# pyarrow is the engine pandas uses for results.to_parquet() in the Save Predictions step
%pip install -q requests pandas tqdm pyarrow
import time

import pandas as pd
import requests
from tqdm.auto import tqdm

## Configuration

In [None]:
NAMESPACE = "feast-trainer-demo"
SERVICE = "sales-forecast"

# In-cluster URL (from notebook pod)
ENDPOINT = f"http://{SERVICE}.{NAMESPACE}.svc.cluster.local:8080"

# Or use Route URL (from outside cluster)
# ENDPOINT = "https://sales-forecast-feast-trainer-demo.apps.your-cluster.com"

print(f"Endpoint: {ENDPOINT}")
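
The choice between the in-cluster URL and the Route URL can also be made without editing the cell. The sketch below is one way to do that, assuming an environment variable named `INFERENCE_ENDPOINT` (a hypothetical name, not something the manifests define) holds the external URL when the notebook runs outside the cluster. The helper name `resolve_endpoint` is likewise illustrative.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(
    service: str,
    namespace: str,
    port: int = 8080,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the override URL if one is set, else the in-cluster service URL."""
    env = os.environ if env is None else env
    # INFERENCE_ENDPOINT is a hypothetical variable name chosen for this sketch.
    override = env.get("INFERENCE_ENDPOINT", "").strip()
    if override:
        # Drop a trailing slash so f"{ENDPOINT}/health" does not produce "//health".
        return override.rstrip("/")
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


print(resolve_endpoint("sales-forecast", "feast-trainer-demo", env={}))
```

With this in place, `ENDPOINT = resolve_endpoint(SERVICE, NAMESPACE)` would fall back to the in-cluster URL whenever the variable is unset.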

## Health Check

In [None]:
try:
    r = requests.get(f"{ENDPOINT}/health", timeout=10)
    r.raise_for_status()  # a 4xx/5xx reply means the service is up but not healthy
    print(f"✅ Health: {r.json()}")
except Exception as e:
    print(f"❌ Service not reachable or unhealthy: {e}")
    print("\nDeploy KServe first:")
    print("  kubectl apply -f ../manifests/08-kserve-inference.yaml")

In [None]:
try:
    r = requests.get(f"{ENDPOINT}/v1/models/{SERVICE}", timeout=10)
    r.raise_for_status()  # fail early instead of parsing an error body as model info
    info = r.json()
    print(f"Model: {info.get('name')}")
    print(f"MAPE: {info.get('best_mape')}%")
    print(f"Features: {info.get('features', [])[:5]}...")
except Exception as e:
    print(f"❌ Could not fetch model info: {e}")
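
Right after `kubectl apply`, the service may need some time before it answers, so a single health check can fail even when the deployment is fine. The following is a minimal polling sketch. The helper `wait_until_ready` and its defaults are illustrative, and it takes the check as a callable so it makes no assumption about what the `/health` response looks like.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except Exception:
            # Connection errors and timeouts count as "not ready yet".
            pass
        if attempt < attempts:
            sleep(delay_s)
    return False


# Stand-in check for demonstration: fails twice, then succeeds.
state = {"calls": 0}


def fake_check() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3


print(wait_until_ready(fake_check, delay_s=0))
```

In the notebook, the check could be `lambda: requests.get(f"{ENDPOINT}/health", timeout=10).ok`, which is true for any status code below 400.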

## Real-time Inference

In [None]:
# Single prediction with Feast features
entity = {"store_id": 1, "dept_id": 3}

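t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
r = requests.post(f"{ENDPOINT}/v1/models/{SERVICE}:predict-with-feast", json={"entities": [entity]}, timeout=30)
latency = (time.perf_counter() - t0) * 1000  # round-trip time in milliseconds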

if r.status_code == 200:
    result = r.json()
    print(f"✅ Store {entity['store_id']}, Dept {entity['dept_id']}: ${result['predictions'][0]:,.0f}")
    print(f"   Latency: {latency:.0f}ms")
else:
    print(f"❌ Error: {r.text}")
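
A single timing is noisy: the first request often includes connection setup, and later ones can vary with load. A rough way to get a steadier picture is to repeat the call and report percentiles. The helpers below (`summarize_latencies`, `measure_latency_ms`) are illustrative names for this sketch. They use the nearest-rank method for p95 and measure client-side round-trip time only.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def summarize_latencies(samples_ms: Sequence[float]) -> dict:
    """Return median, nearest-rank p95, and max of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "runs": len(ordered),
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def measure_latency_ms(call: Callable[[], object], runs: int = 20) -> dict:
    """Time `call` repeatedly and summarize the round-trip latencies."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000)
    return summarize_latencies(samples)


print(summarize_latencies([10, 20, 30, 40, 100]))
```

In the notebook, `call` would wrap the same `requests.post(...)` used in the cell above. Keep `runs` small, since every run is a real request against the service.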

## Batch Scoring

In [None]:
# Score multiple entities
entities = [{"store_id": s, "dept_id": d} for s in [1, 10, 25, 45] for d in [1, 5, 10, 14]]
print(f"Scoring {len(entities)} entities...")

preds = []
for e in tqdm(entities):
    try:
        r = requests.post(f"{ENDPOINT}/v1/models/{SERVICE}:predict-with-feast", json={"entities": [e]}, timeout=30)
        if r.status_code == 200:
            preds.append({**e, "prediction": r.json()['predictions'][0]})
        else:
            # Keep "error" a string in every row; mixing ints and strings breaks to_parquet
            preds.append({**e, "prediction": None, "error": f"HTTP {r.status_code}"})
    except Exception as ex:
        preds.append({**e, "prediction": None, "error": str(ex)})

results = pd.DataFrame(preds)
print(f"\n✅ {results['prediction'].notna().sum()}/{len(results)} predictions succeeded")
results
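
The loop above sends one request per entity. Because the request body already carries a list under `"entities"`, the same work might be done in fewer round trips by sending several entities at once. The sketch below assumes the endpoint returns one prediction per entity, in request order. The cells above only ever send a single entity, so verify that assumption against the service before relying on it. `chunked` and `score_in_chunks` are illustrative helpers, and the HTTP call is passed in as a callable.

```python
from typing import Callable, Iterator, List


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_in_chunks(
    predict: Callable[[List[dict]], list],
    entities: List[dict],
    size: int = 8,
) -> List[dict]:
    """Score entities chunk by chunk; a failed chunk yields error rows."""
    rows = []
    for chunk in chunked(entities, size):
        try:
            predictions = predict(chunk)
            # Assumption: the response holds exactly one prediction per entity, in order.
            if len(predictions) != len(chunk):
                raise ValueError(
                    f"expected {len(chunk)} predictions, got {len(predictions)}"
                )
            rows.extend(
                {**e, "prediction": p, "error": None}
                for e, p in zip(chunk, predictions)
            )
        except Exception as ex:
            rows.extend({**e, "prediction": None, "error": str(ex)} for e in chunk)
    return rows


# Stand-in for the HTTP call so the sketch runs without a cluster.
def fake_predict(chunk: List[dict]) -> list:
    return [e["store_id"] * 100 for e in chunk]


demo = [{"store_id": s, "dept_id": 1} for s in (1, 10, 25)]
print(score_in_chunks(fake_predict, demo, size=2))
```

In the notebook, `predict` would post `{"entities": chunk}` to the `:predict-with-feast` URL and return `r.json()["predictions"]`. The trade-off is that one failing request marks the whole chunk as failed, so smaller chunks give finer-grained errors.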

In [None]:
if 'prediction' in results.columns and results['prediction'].notna().any():
    print("📊 Summary:")
    print(f"   Min: ${results['prediction'].min():,.0f}")
    print(f"   Max: ${results['prediction'].max():,.0f}")
    print(f"   Mean: ${results['prediction'].mean():,.0f}")

## Save Predictions

In [None]:
from datetime import datetime
import os

os.makedirs('/shared/predictions', exist_ok=True)
path = f"/shared/predictions/batch_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
results.to_parquet(path, index=False)
print(f"✅ Saved: {path}")

## Summary

| Component | Role |
|-----------|------|
| KServe InferenceService | Model serving |
| Feast Feature Server | Online feature retrieval |
| `:predict-with-feast` | Entity → Features → Prediction |

**Key:** Training and serving both read features through the same `inference_features` Feature Service, which keeps them train-serve consistent.