- **If you want the simplest, cloud-native path (CPU/GPU, canary, autoscale):**
  - **KServe (Istio) + ONNX Runtime** for online inference. Easy CRDs, autoscaling on concurrency or requests per second, built-in canary rollouts, models pulled from S3/GCS.
- **If you need ultra-high throughput on NVIDIA GPUs or multi-model serving:**
  - **NVIDIA Triton Inference Server** (serves PyTorch, TorchScript, ONNX, TensorRT) + Horizontal Pod Autoscaler/KEDA.

Both run beautifully on k8s. TorchServe and Seldon Core are good too, but KServe and Triton cover most needs with less glue.

---

# Reference architectures

## A) KServe + ONNX Runtime (recommended default)

**Why:** Minimal boilerplate, great for HTTP/REST, autoscaling, canary, and observability.

**Flow**

1. Export model → `model.onnx` (or TorchScript `.pt`).
2. Push to object storage (S3/GCS/MinIO).
3. Deploy KServe `InferenceService` CRD that pulls the model.
4. Ingress via Istio → autoscale with HPA/KPA → Prometheus metrics → Grafana.

**ONNX example (single model):**

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: InferenceService
metadata:
  name: classifier-v1
spec:
  predictor:
    onnx:
      storageUri: "s3://models/classifier/v1/"
      resources:
        requests: { cpu: "1", memory: "2Gi" }
        limits:   { cpu: "2", memory: "4Gi" }
      runtimeVersion: "1.17.3"  # optional pin; verify this tag exists for the runtime image your KServe install uses, or omit it
```

**GPU?** Switch to TensorRT or Triton runtime (below), or use KServe’s `triton` predictor with `nodeSelector`/`nvidia.com/gpu` limits.

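**Canary traffic split (v1=80%, v2=20%):** update the existing `InferenceService` to point at v2 and set `canaryTrafficPercent`. KServe keeps the previously rolled-out revision (v1) serving the remaining 80%; there is no separate field for the old predictor in `v1beta1`. This relies on KServe's serverless (Knative) deployment mode.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: classifier
spec:
  predictor:
    canaryTrafficPercent: 20   # 20% to the new revision, 80% stays on the last rolled-out one
    onnx:
      storageUri: "s3://models/classifier/v2/"
```

To promote v2, remove `canaryTrafficPercent` (or set it to 100) and re-apply.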

> Use KServe autoscaling by concurrency/QPS; add request/limit resources; wire Prometheus for p50/p95 latency dashboards.

---

## B) NVIDIA Triton on k8s (best for GPUs, ensembles, batching)

**Why:** Highest throughput on NVIDIA GPUs, dynamic batching, multi-model/multi-framework, model ensembles, TensorRT acceleration.

**Flow**

1. Export → `model.onnx` (or `.pt` / TensorRT engine).
2. Create Triton **model repository** (S3/PVC).

   ```
   repo/
     model_x/1/model.onnx
     model_x/config.pbtxt
   ```
3. Run Triton Deployment with GPU limits; expose via Ingress or KServe’s Triton runtime.
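The repository layout in step 2 can be scaffolded with a few lines of Python. This is a minimal sketch: `scaffold_model_repo` and the config template are hypothetical helpers for illustration, not part of Triton, and the generated `config.pbtxt` only carries the name, platform, and batch size.

```python
from pathlib import Path

# Hypothetical template: only the fields shown in this answer, nothing model-specific.
CONFIG_TEMPLATE = (
    'name: "{name}"\n'
    'platform: "onnxruntime_onnx"\n'
    "max_batch_size: {max_batch_size}\n"
)


def scaffold_model_repo(root, name, version=1, max_batch_size=32):
    """Create <root>/<name>/<version>/ and <root>/<name>/config.pbtxt.

    Returns the version directory, where model.onnx should be copied.
    """
    root = Path(root)
    version_dir = root / name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    config_path = root / name / "config.pbtxt"
    config_path.write_text(
        CONFIG_TEMPLATE.format(name=name, max_batch_size=max_batch_size)
    )
    return version_dir
```

Calling `scaffold_model_repo("repo", "model_x")` creates `repo/model_x/1/` and `repo/model_x/config.pbtxt`; copy the exported `model.onnx` into the returned directory before syncing the tree to S3 or the PVC.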

**Minimal `config.pbtxt` (ONNX):**

```
name: "model_x"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching { preferred_batch_size: [4,8,16] max_queue_delay_microseconds: 2000 }
input { name: "input" data_type: TYPE_FP32 dims: [3,224,224] }
output { name: "logits" data_type: TYPE_FP32 dims: [10] }
```
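To make the config concrete, here is a hedged sketch of a client request that matches it, using Triton's HTTP/REST inference endpoint (`/v2/models/<name>/infer`). The helper names and the base URL you pass in are assumptions for illustration. Because `max_batch_size` is greater than zero, the request shape carries a leading batch dimension in front of `[3,224,224]`.

```python
import json
import urllib.request
from math import prod

SAMPLE_SHAPE = [3, 224, 224]  # per-sample dims from config.pbtxt (no batch dim)


def build_infer_request(samples, input_name="input", output_name="logits"):
    """Build a v2 inference payload from a list of flattened FP32 samples."""
    expected = prod(SAMPLE_SHAPE)
    for sample in samples:
        if len(sample) != expected:
            raise ValueError(f"each sample needs {expected} values, got {len(sample)}")
    flat = [float(v) for sample in samples for v in sample]  # row-major, batch first
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(samples), *SAMPLE_SHAPE],
                "datatype": "FP32",
                "data": flat,
            }
        ],
        "outputs": [{"name": output_name}],
    }


def infer(base_url, model_name, payload, timeout=5.0):
    """POST the payload to <base_url>/v2/models/<model_name>/infer and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `infer("http://localhost:8000", "model_x", build_infer_request([[0.0] * 150528]))` would send one all-zero image, assuming Triton's HTTP port is reachable at that address.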

**k8s Deployment (bare Triton):**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: triton-gpu }
spec:
  replicas: 1
  selector: { matchLabels: { app: triton } }
  template:
    metadata: { labels: { app: triton } }
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.05-py3
        args: ["tritonserver","--model-repository=/models","--strict-model-config=false"]  # --strict-model-config is deprecated in recent Triton releases (auto-complete is the default)
        ports: [{ containerPort: 8000 }, { containerPort: 8001 }, { containerPort: 8002 }]  # HTTP, gRPC, metrics
        resources:
          limits: { nvidia.com/gpu: 1, cpu: "4", memory: "16Gi" }
          requests: { nvidia.com/gpu: 1, cpu: "2", memory: "8Gi" }
        volumeMounts: [{ name: model-repo, mountPath: /models }]
      volumes:
      - name: model-repo
        persistentVolumeClaim: { claimName: triton-models-pvc }
      nodeSelector: { "nvidia.com/gpu.present": "true" }
```

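Pair with **KEDA** (scale on GPU utilization or queue depth, e.g. from Prometheus) or an HPA on custom metrics (QPS or p95 latency). The HPA only sees CPU and memory out of the box, so QPS/latency targets need a metrics adapter such as prometheus-adapter.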
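When picking a scaling target, it helps to see the arithmetic the HPA applies: `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified model of that rule only. It ignores stabilization windows and not-ready pods, and the function name and replica bounds are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Simplified HPA rule: scale proportionally to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current replica count.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 2 replicas averaging 300 QPS each against a target of 100 QPS per pod, `desired_replicas(2, 300, 100)` returns 6.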

---

# Export formats vs runtimes (quick guide)

* **CPU only:** ONNX Runtime (fast), TorchScript (fine).
* **NVIDIA GPU:** ONNX → TensorRT via Triton (fastest), or TorchScript on Triton/PyTorch backend (good).
* **Heterogeneous cluster:** ONNX + KServe: one spec, many backends.
* **Custom Python preprocessing:** KServe’s Python server (wrap pre/post), or Triton Python backend.

---

# CI/CD & model registry

* Store `model.onnx`/`.pt` in S3/MinIO under versioned prefixes (`/classifier/v1/`, `/v2/`).
* GitHub Actions:

  1. export model,
  2. push to S3 (`s3://models/app/model/vX/`),
  3. `kubectl apply` updated `InferenceService` or bump Helm values,
  4. run smoke test job hitting `/v2/models/.../infer` (or `/v1/models/...:predict` if the runtime speaks the V1 protocol).
* Keep a **shadow (champion–challenger)** job comparing outputs (tolerances) before promoting traffic.
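The shadow comparison can be as simple as replaying the same inputs through both versions and checking outputs within tolerances. A minimal sketch, assuming each output is a flat list of floats (for example logits); the function names and default tolerances are illustrative and should be tuned per model.

```python
import math


def outputs_match(champion, challenger, rel_tol=1e-3, abs_tol=1e-5):
    """True if both output vectors have equal length and are element-wise close."""
    if len(champion) != len(challenger):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(champion, challenger)
    )


def agreement_rate(pairs, rel_tol=1e-3, abs_tol=1e-5):
    """Fraction of (champion, challenger) output pairs that match within tolerance."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair to compare")
    matches = sum(
        outputs_match(a, b, rel_tol=rel_tol, abs_tol=abs_tol) for a, b in pairs
    )
    return matches / len(pairs)
```

A CI job could then block promotion when `agreement_rate(...)` falls below a threshold you choose, such as 0.99.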

---

# Observability & reliability

* **Metrics:** Prometheus scrape (QPS, p50/p95/p99, GPU util, batch size). Grafana dashboards.
* **Tracing:** OpenTelemetry on FastAPI edge (if you add a thin gateway) + Istio tracing.
* **Logging:** JSON logs; centralize via Fluent Bit → Loki/ELK.
* **Scaling:** Knative KPA by concurrency/RPS, or HPA by CPU or custom metrics such as latency; KEDA by queue length (if using a message broker for async).
* **Resilience:** liveness/readiness probes, resource requests/limits, PodDisruptionBudget, topology spread.
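For a quick offline check of p50/p95/p99 (for example over latencies collected by a smoke test), a nearest-rank percentile is enough. This is a sketch over raw samples; Prometheus estimates quantiles from histogram buckets instead, so dashboard numbers will differ slightly. Function names are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return p50/p95/p99 for a list of latencies in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

`latency_summary(measured_ms)` returns a dict with `p50`, `p95`, and `p99` keys that a test job can compare against the latency SLO.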

---

# Security & cost notes

* Private models in S3: use IAM roles/IRSA or k8s Secrets; avoid embedding keys.
* NetworkPolicies to restrict egress/ingress; mTLS with Istio.
* Right-size: start CPU (ORT) if latency SLO allows; move hot models to GPU/TensorRT; enable dynamic batching.

---

# What I’d do for your cluster (opinionated plan)

1. **Export** to **ONNX** (opset 17), keep TorchScript as a fallback.
2. Put models in **MinIO** (dev) → S3 (prod).
3. Use **KServe**:

   * CPU services: `onnx` predictor (ORT).
   * GPU services: `triton` predictor with dynamic batching; convert to TensorRT for hottest paths.
4. Canary with KServe traffic split; autoscale via concurrency.
5. Prometheus/Grafana + alerting on p95 latency & error rate.
6. CI/CD: GitHub Actions builds, exports, and tests the model → updates `InferenceService` with versioned `storageUri`.

If you share your target latency/QPS and whether you have GPUs, I’ll drop in ready-to-apply YAML (KServe CRDs) for your exact model shape and a one-click Helm setup (dev/prod values).
