# 1) Training stack & monitoring (PyTorch-first)

**Project structure**

* `src/` (datasets, models, train.py, eval.py), `configs/` (Hydra/YAML), `scripts/`, `tests/`, `data/` (read-only!), `artifacts/` (checkpoints, logs), `docker/`.
* Config via **Hydra** (or plain YAML + argparse). Keep all hyperparams in config, not code.

**Data**

* Version datasets with **DVC** or **lakeFS**; keep metadata (schema, label map, splits).
* Data validation with **Great Expectations** (nulls, ranges, class balance, leakage checks).
* Reproducible splits: fixed seed, `StratifiedKFold` for classification problems.
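To make the "fixed seed + stratified folds" idea concrete, here is a minimal stdlib-only sketch. In practice you would likely reach for scikit-learn's `StratifiedKFold(n_splits=..., shuffle=True, random_state=...)`; the function name `stratified_kfold` and the toy labels below are illustrative assumptions, not part of any library.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Return n_splits folds (sorted index lists) with per-class balance."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):  # sorted so class order is deterministic
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Toy, imbalanced label list (hypothetical data).
labels = ["cat"] * 80 + ["dog"] * 20
folds = stratified_kfold(labels, n_splits=5, seed=42)
```

Each fold serves once as the validation set while the other folds form the training set. Re-running with the same seed gives the same folds, so the split can be recorded as part of the dataset metadata.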

**Training**

* Use **PyTorch Lightning**/**Lightning Fabric** or a light custom loop with:

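  * AMP: `torch.autocast(device_type="cuda", dtype=torch.float16)` + `torch.amp.GradScaler("cuda")` (the older `torch.cuda.amp.*` spellings are deprecated in recent PyTorch releases), or bfloat16 on GPUs that support it (Ampere and later), which needs no `GradScaler`.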
  * **DDP** or **FSDP** (and/or **DeepSpeed ZeRO**) for multi-GPU.
  * Gradient clipping + gradient accumulation.
  * LR schedulers (Cosine, OneCycle) and **AdamW** (default), weight decay set.
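  * Determinism toggles for reproducibility when needed (`torch.manual_seed`, `torch.use_deterministic_algorithms(True)`, `torch.backends.cudnn.benchmark = False`).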
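For the scheduler bullet, it can help to see the arithmetic behind a cosine schedule. PyTorch ships `torch.optim.lr_scheduler.CosineAnnealingLR` and `OneCycleLR`; the sketch below is only a plain-Python illustration of the curve. The linear warmup and the function name `cosine_lr` are assumptions added for the example.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly and reach base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half a cosine period: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, 10 of them warmup, peak LR 1e-3.
schedule = [cosine_lr(s, 100, 1e-3, warmup_steps=10) for s in range(101)]
```

Logging this value per step (see the tracking list below) makes it easy to confirm the optimizer is actually following the schedule you configured.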

**Experiment tracking**

* Pick one: **Weights & Biases**, **MLflow**, **Neptune**, or **ClearML**.

  * Track: run config, code commit, dataset hash, metrics, loss curves, LR, grad norms, GPU util, confusion matrices/PR curves, examples.
  * Log artifacts: checkpoints, TensorBoard logs, plots, ONNX/TorchScript exports.
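A tracker-agnostic way to capture "run config, code commit, dataset hash" is to compute them once at startup and log the resulting dict to whichever tool you picked. This is a sketch under assumptions: the helper names (`dataset_hash`, `git_commit`, `run_metadata`) are hypothetical, and hashing raw file bytes is only one possible definition of a dataset hash (DVC or lakeFS can supply their own).

```python
import hashlib
import json
import subprocess


def dataset_hash(paths):
    """SHA-256 over file names and contents, independent of argument order."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.encode("utf-8"))
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def git_commit():
    """Current commit SHA, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_metadata(config, data_paths):
    """Bundle everything needed to identify a run later."""
    config_json = json.dumps(config, sort_keys=True)  # key order must not matter
    return {
        "config": config,
        "config_hash": hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12],
        "dataset_hash": dataset_hash(data_paths),
        "code_commit": git_commit(),
    }
```

Typical usage would be `meta = run_metadata(cfg_as_dict, list_of_data_files)` followed by logging `meta` as run parameters or tags.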

**Profiling & performance**

* **torch.profiler** (schedule + trace to TensorBoard); `torch.backends.cudnn.benchmark = True` for speed when input shapes are fixed (turn off for determinism); watch dataloader bottlenecks (`num_workers`, `pin_memory`).
* Memory: `torch.cuda.memory_summary()`; optimize with channels-last, fused ops where possible.

**Example: training loop essentials**

```python
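import torch

# model, loader, optimizer, criterion, scheduler and device are defined elsewhere.
model.train()
scaler = torch.amp.GradScaler("cuda")  # older releases: torch.cuda.amp.GradScaler()
for step, (x, y) in enumerate(loader):
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = criterion(logits, y)
    scaler.scale(loss).backward()
    # Unscale first so clipping sees true gradient magnitudes, not loss-scaled ones.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)  # skipped internally if gradients contain inf/NaN
    scaler.update()
    scheduler.step()        # per-step schedule (e.g. OneCycle); move out of the loop for per-epoch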
```

# 2) Quality gates (before you even think “release”)

**Offline evaluation**

* Hold-out test set untouched by tuning.
* Cross-validation when data is small/imbalanced.
* Threshold-free metrics (AUC, log loss) + business metrics (e.g., precision at a target recall).
* **Statistical significance** on metric deltas (bootstrap CIs, ideally across several seeds), so decisions do not rest on single-run noise.
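One way to put a confidence interval on a metric delta is a paired percentile bootstrap over the test set. The sketch below assumes per-example scores (here 1/0 correctness) for a baseline and a candidate on the same examples. The function name and the toy numbers are illustrative. For metrics that are not per-example means (AUC, for instance), you would resample indices and recompute the metric on each resample instead.

```python
import random


def bootstrap_delta_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for mean(scores_b) - mean(scores_a).

    scores_a[i] and scores_b[i] are the two models' scores on test example i.
    Returns (observed_delta, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired bootstrap needs equal-length score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    deltas = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()
    low = deltas[int((alpha / 2) * n_resamples)]
    high = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


# Hypothetical results on 200 test examples: baseline 70% correct, candidate 85%.
baseline = [1] * 140 + [0] * 60
candidate = [1] * 170 + [0] * 30
delta, ci_low, ci_high = bootstrap_delta_ci(baseline, candidate)
```

If the interval excludes zero, the improvement is unlikely to be test-set sampling noise. Variation across training seeds is a separate source of noise and needs repeated runs.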

**Robustness & safety**

* Slice metrics (per class/segment/lighting/device).
* Adversarial/noise tests, distribution shift tests, out-of-distribution (OOD) detectors if relevant.
* Bias/fairness checks where applicable.
* Model size/latency checks against SLOs (p95/p99) on target hardware.
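A latency gate can be as simple as timing repeated calls on the target hardware and comparing tail percentiles to the budget. This is a rough sketch: `predict_fn`, the helper names, and the budgets are placeholders. For GPU models the timed callable should synchronize (e.g. `torch.cuda.synchronize()`) before returning, otherwise you only measure kernel launch time.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(predict_fn, example, warmup=10, runs=200):
    """Time `runs` calls of predict_fn(example) after a short warmup."""
    for _ in range(warmup):  # warm caches and lazy initialization before timing
        predict_fn(example)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(example)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def meets_latency_slo(samples_ms, p95_budget_ms, p99_budget_ms):
    """True when both tail percentiles fit their budgets."""
    return (percentile(samples_ms, 95) <= p95_budget_ms
            and percentile(samples_ms, 99) <= p99_budget_ms)
```

Usage would look like `meets_latency_slo(measure_latency_ms(predict, example), p95_budget_ms=..., p99_budget_ms=...)`, run inside the same container image and on the same instance type you plan to serve from.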

**Promotion criteria (write them down)**

* “Promote to Staging if: AUC ≥ X on hold-out; p95 latency ≤ Y ms on T4; passes data validation; reproducible environment.”
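Writing the criteria down as code keeps them reviewable and lets CI enforce them. A minimal sketch follows. The field names, the report keys, and the example thresholds (standing in for X and Y) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionCriteria:
    min_auc: float
    max_p95_latency_ms: float


def evaluate_gates(report, criteria):
    """Return (passed, failures) for a candidate's evaluation report."""
    checks = {
        "auc": report["holdout_auc"] >= criteria.min_auc,
        "latency": report["p95_latency_ms"] <= criteria.max_p95_latency_ms,
        "data_validation": report["data_validation_passed"],
        "reproducible_env": report["env_reproducible"],
    }
    failures = [name for name, ok in checks.items() if not ok]
    return not failures, failures


# Placeholder thresholds and a placeholder report for illustration only.
criteria = PromotionCriteria(min_auc=0.90, max_p95_latency_ms=50.0)
report = {
    "holdout_auc": 0.93,
    "p95_latency_ms": 62.0,
    "data_validation_passed": True,
    "env_reproducible": True,
}
passed, failures = evaluate_gates(report, criteria)
```

Returning the list of failed gates, rather than a bare boolean, makes the CI log say why a candidate was blocked.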

# 3) Packaging & inference optimization

**Artifacts**

* Save both: (a) training **state\_dict** and (b) an **inference graph**:

  * TorchScript: `torch.jit.script`/`trace`
  * **ONNX** for cross-runtime; optionally convert to **TensorRT** on NVIDIA.
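  * For PyTorch 2.x, also try `torch.compile(model)` for CPU/GPU speedups. It optimizes at runtime and does not by itself produce a saved artifact, so it complements the exports above rather than replacing them.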

**Inference optimizations**

* `torch.inference_mode()` (slightly lower overhead than `no_grad()`, but tensors created under it cannot be used in autograd later), batch where possible, static shapes if you can.
* Half/bfloat16 on GPU; channels-last for conv nets.
* Operator fusion (PyTorch 2 compiler), quantization:

  * **PTQ** (post-training quant) to int8 for CPU/edge.
  * **QAT** (quant-aware training) if accuracy loss is too high.
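To see what int8 quantization does numerically, here is the affine (scale + zero-point) mapping in plain Python. It is a sketch of the arithmetic only, not PyTorch's implementation. The function names are made up for the example, and the min/max range stands in for what calibration observers would record.

```python
def quant_params(min_val, max_val, qmin=-128, qmax=127):
    """Scale and zero-point mapping [min_val, max_val] onto [qmin, qmax]."""
    # The range must include 0 so that 0.0 is represented exactly.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float -> int8 code, clamped to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """int8 code -> approximate float."""
    return (q - zero_point) * scale


# Hypothetical activation range observed during calibration.
scale, zero_point = quant_params(-1.0, 3.0)
round_trip = dequantize(quantize(0.7, scale, zero_point), scale, zero_point)
```

Values inside the calibrated range come back with an error of at most about half a quantization step. Values outside it are clamped, which is why calibration data should resemble production inputs.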

**Model registry**

* Use **MLflow Model Registry**, **W\&B Artifacts**, or **S3+manifest** with semantic versioning `model: 1.4.2`, stage = `Staging/Production/Archived`.
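If you go the "S3 + manifest" route, the manifest can be a small list of records with a version, a stage, and a content hash. The sketch below keeps everything in memory. The record fields and function names are assumptions, and a real setup would persist the manifest (e.g. as JSON next to the artifacts).

```python
import hashlib

STAGES = ("Staging", "Production", "Archived")


def manifest_entry(name, version, artifact_bytes, stage="Staging"):
    """Build one manifest record; rejects malformed versions and stages."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}")
    return {
        "name": name,
        "version": version,
        "stage": stage,
        # The content hash lets the server verify it loaded the intended file.
        "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
    }


def promote(manifest, name, version):
    """Move one version to Production and archive the previous Production one."""
    if not any(e["name"] == name and e["version"] == version for e in manifest):
        raise KeyError(f"{name} {version} is not in the manifest")
    for entry in manifest:
        if entry["name"] != name:
            continue
        if entry["version"] == version:
            entry["stage"] = "Production"
        elif entry["stage"] == "Production":
            entry["stage"] = "Archived"
    return manifest


# Placeholder artifacts; real entries would hash the exported model file.
manifest = [
    manifest_entry("model", "1.4.1", b"old-weights", stage="Production"),
    manifest_entry("model", "1.4.2", b"new-weights"),
]
promote(manifest, "model", "1.4.2")
```

Because the previous version is archived rather than deleted, rolling back is just another `promote` call with the older version string.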

# 4) Release & rollout to production

**Serving options (common picks)**

* **FastAPI** + PyTorch (flexible, easy A/B and business logic).
* **TorchServe** (model store + handlers).
* **BentoML** (nice bundling), or **NVIDIA Triton** (multi-framework, dynamic batching).
* Batch vs. online: Use batch for offline scoring; online with autoscaling.

**Infra**

* Containerize (Docker): pin CUDA/cuDNN + PyTorch versions; non-root user; healthcheck.
* K8s deploy: readiness/liveness probes, resource requests/limits, HPA for autoscaling, node affinity if you need GPUs. Use **Knative** for scale-to-zero.
* Secrets & config: **K8s Secrets**, **Vault**; never bake creds into images.

**Release strategies**

* **Shadow** (mirror traffic; no user impact) → **Canary** (1–5–25–50–100%) → **Blue/Green** (instant switch with rollback).
* A/B testing for business metrics (conversion, attach rate, etc.) when user-visible.
* Feature flags (e.g., Unleash/LaunchDarkly) for controlled exposure.
* Rollback plan: one command to revert to previous image/model version.
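A canary needs a stable way to decide which requests hit the new model. Hashing a request or user key into buckets is a common approach, sketched below. The function names are hypothetical, and in a real deployment the split often lives in the ingress or service mesh rather than in application code.

```python
import hashlib

CANARY_STEPS = (1, 5, 25, 50, 100)  # percent of traffic, matching the rollout above


def bucket(request_key, buckets=10_000):
    """Map a key to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(request_key, canary_percent):
    """Return 'canary' or 'stable'; the same key always gets the same answer."""
    # 10,000 buckets means each percent covers 100 buckets.
    return "canary" if bucket(request_key) < canary_percent * 100 else "stable"


def next_step(current_percent, healthy):
    """Advance along CANARY_STEPS, or roll back to 0% on a failed health check."""
    if not healthy:
        return 0
    later = [step for step in CANARY_STEPS if step > current_percent]
    return later[0] if later else 100
```

Because the bucket threshold only grows, a user who saw the canary at 5% keeps seeing it at 25%, which avoids flip-flopping between model versions mid-rollout.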

**CI/CD for ML (“MLOps”)**

* **CI** (GitHub Actions/GitLab CI):

  * Lint/format (ruff/black, isort), type check (mypy), unit tests (pytest), small data smoke tests, export test (TorchScript/ONNX), reproducibility check, security scan (bandit/trivy).
* **CD**:

  * Build image with run ID + git SHA + model version label.
  * Push to registry, update Helm chart/kustomize; deploy to **Staging**; run synthetic and A/B tests; promote to **Prod** on pass.
* **Orchestration** (pipelines): **Airflow**, **Prefect**, or **Dagster** for retraining, evaluation, and automated promotions on schedule or data drift.

# 5) Production monitoring & operations

**System (golden signals)**

* Latency (p50/p95/p99), throughput (RPS), error rate, saturation (GPU/CPU/mem).
* Export Prometheus metrics; dashboards in Grafana. Trace with **OpenTelemetry**.

**Model telemetry**

* Request mix, input schema checks, outlier rate, drift:

  * Data drift (e.g., PSI/JS divergence per feature), prediction drift vs. training.
  * Quality proxy when labels are delayed (leading indicators).
  * If you get labels later: continuous evaluation (lagged accuracy/AUC, calibration).
* Log samples (privacy-safe, sampled & hashed); enable replay for debugging.
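For the drift bullet, PSI compares the binned distribution of a feature in a reference sample (typically training data) against a live sample. Here is a minimal sketch using fixed-width bins derived from the reference. Quantile bins are also common, and the function name, bin count, and epsilon are assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0  # guard against a constant feature

    def histogram(values):
        counts = [0] * n_bins
        for value in values:
            idx = int((value - low) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the log below stays finite.
        return [max(count / len(values), eps) for count in counts]

    p, q = histogram(expected), histogram(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Toy data: a uniform reference, a slightly shifted and a heavily shifted live sample.
reference = [i / 1000 for i in range(1000)]
slightly_shifted = [value + 0.01 for value in reference]
heavily_shifted = [0.5 + i / 2000 for i in range(1000)]
```

Computed per feature on a schedule, the result can be compared with an alert threshold such as the PSI < 0.2 SLO mentioned below.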

**Alerting & SLOs**

* SLOs (e.g., p95 < 60 ms; 99.9% availability; drift PSI < 0.2).
* Alerts: numerical divergence (NaN rate in outputs; loss spikes during retraining), drift exceedance, latency/error budget burn, GPU OOM.
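"Error budget burn" can be made concrete as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is illustrative. The function names are made up, and the 14.4 paging threshold is a commonly cited value for a fast-burn alert on a 30-day window, used here as an assumption rather than a recommendation for your service.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (bad_events / total_events) / error_budget


def should_page(short_window, long_window, slo_target, threshold=14.4):
    """Page only if both a short and a long window burn faster than threshold.

    Each window is a (bad_events, total_events) tuple.
    """
    return (burn_rate(*short_window, slo_target) >= threshold
            and burn_rate(*long_window, slo_target) >= threshold)


# Hypothetical counts: (failed requests, total requests) per window.
page_now = should_page(short_window=(20, 1000), long_window=(150, 10000), slo_target=0.999)
```

Requiring both windows to exceed the threshold filters out brief blips (only the short window fires) and already-recovered incidents (only the long window fires).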

**Model lifecycle**

* Scheduled retraining or event-based (data volume/quality change).
* Auto-promotion only with strict gates + human approval.
* Decommission old models (archive artifacts + model card + changelog).

# 6) Reference checklists + mini-snippets

## Training-time checklist

* [ ] Data versioned; validation suite green
* [ ] Repro seed set; env pinned (PyTorch/CUDA, Python, OS)
* [ ] AMP on; DDP/FSDP if multi-GPU
* [ ] Tracking & artifacts logged (W\&B/MLflow)
* [ ] Profiler run shows no dataloader bottleneck
* [ ] Best-epoch checkpoint + exported TorchScript/ONNX

## Pre-release gates

* [ ] Test set metrics meet thresholds (with CIs)
* [ ] Slice metrics OK; latency budget OK on target HW
* [ ] Bias/safety checks OK
* [ ] Model card updated; changelog written

## Deploy checklist

* [ ] Image built (git SHA + model ver tags)
* [ ] Health endpoints added; readiness passes
* [ ] Canary plan & rollback ready
* [ ] Dashboards + alerts wired

**FastAPI serving skeleton**

```python
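from fastapi import FastAPI
import torch

app = FastAPI()
# Load once at startup; "model.ts" is a TorchScript export.
model = torch.jit.load("model.ts").eval()

# preprocess/postprocess are app-specific helpers defined elsewhere.
# Plain `def` (not `async def`): FastAPI runs it in a threadpool, so the
# blocking model call does not stall the event loop.
@app.post("/predict")
def predict(payload: dict):
    x = preprocess(payload)                # tensor on device
    with torch.inference_mode():           # applied inside the handler so it always takes effect
        y = model(x)                       # batched inference
    return postprocess(y)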
```

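**Prometheus metrics (example using `prometheus_client`)**

```python
from prometheus_client import Counter, Histogram, make_asgi_app

REQS = Counter("inference_requests_total", "Total requests")
LAT = Histogram("inference_latency_seconds", "Latency")

# Inside the request handler: count the request and time the model call.
#   REQS.inc()
#   with LAT.time():
#       y = model(x)
# Expose /metrics for Prometheus to scrape (app is the FastAPI instance):
#   app.mount("/metrics", make_asgi_app())
```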

**MLflow quick logging**

```python
import mlflow
import mlflow.pytorch

# cfg_as_dict (flat dict of hyperparameters), auc and model come from your training run.
mlflow.set_experiment("my_model")
with mlflow.start_run():
    mlflow.log_params(cfg_as_dict)
    mlflow.log_metric("val_auc", auc)
    mlflow.pytorch.log_model(model, "model")

**Export to ONNX**

```python
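import torch

model.eval()  # export with dropout/batch-norm in inference behavior
# example_input: a representative tensor with the shape/dtype the model expects.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
                  opset_version=17)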
```

**Quantization (PTQ) sketch**

```python
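# Eager-mode PTQ: the model should wrap its forward pass in QuantStub/DeQuantStub,
# and Conv/BN/ReLU groups are usually fused first.
model.eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")  # x86 CPU backend
prepared = torch.ao.quantization.prepare(model)
# Calibrate with a few hundred real samples so observers record activation ranges.
with torch.no_grad():
    for x, _ in calibration_loader:  # hypothetical loader of representative inputs
        prepared(x)
quantized = torch.ao.quantization.convert(prepared)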
```

---

## Recommended tool picks (opinionated)

* **Config & structure:** Hydra + Lightning (or clean custom loops)
* **Tracking:** W\&B *or* MLflow (pick one)
* **Data versioning:** DVC
* **Validation:** Great Expectations
* **Serving:** FastAPI (flexibility) or Triton (throughput) or TorchServe (PyTorch-native)
* **Pipelines:** Prefect (DX) or Airflow (enterprise)
* **Deploy:** Docker + K8s + Helm; Prometheus/Grafana + OpenTelemetry

If you want, tell me your target hardware (GPU/CPU/edge), latency/throughput goals, and the problem type (classification/detection/etc.). I can turn this into a concrete template repo (Dockerfile, GH Actions CI, FastAPI server, Helm chart, and minimal training loop) tailored to your setup.
