Production-grade MLOps framework for LLM systems: experiment tracking, model versioning, drift detection, A/B testing, guardrails, and CI/CD. Battle-tested in enterprise financial services deployments.
Most teams can get an LLM working in a notebook. Almost none have solved:
- Reproducibility — can you recreate the exact prompt, model version, and retrieval config that produced a given output 6 months ago?
- Regression detection — does the new model version actually perform better on your domain, or just on OpenAI's evals?
- Production drift — is the model's output quality degrading silently as your data distribution shifts?
- Responsible deployment — how do you enforce output constraints (PII, toxicity, hallucination) consistently across 100K+ daily inferences?
This reference architecture answers all four.
┌──────────────────────────────────────────────────────────────┐
│ DEVELOPMENT LOOP │
│ │
│ Prompt Engineering → Experiment Tracking (MLflow) │
│ Model Evaluation → RAGAS / Custom Metrics → Registry │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ CI/CD PIPELINE (GitHub Actions) │
│ │
│ PR → Automated Eval Suite → Quality Gate → Stage → Prod │
│ (faithfulness, latency, cost per 1K, regression test) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ PRODUCTION RUNTIME │
│ │
│ Inference API → Guardrails → Response → Observability │
│ │ │ │
│ A/B Router Drift Monitor │
│ (shadow/canary) (statistical tests) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ FEEDBACK LOOP │
│ User signals → Airflow → Retraining / Prompt refinement │
└──────────────────────────────────────────────────────────────┘
| Component | Technology | Purpose |
|---|---|---|
| Experiment Tracking | MLflow | Prompt versions, model configs, eval metrics |
| Model Registry | MLflow Model Registry | Version control, staging, production promotion |
| Orchestration | Apache Airflow | Scheduled eval runs, retraining pipelines |
| Serving | FastAPI + Azure AKS | Low-latency inference with autoscaling |
| A/B Testing | Custom router | Shadow mode, canary, champion/challenger |
| Drift Detection | Evidently AI | Data drift, concept drift, output quality drift |
| Guardrails | Guardrails AI + custom | PII, hallucination, toxicity, domain constraints |
| CI/CD | GitHub Actions | Automated eval gates, progressive deployment |
| Observability | OpenTelemetry + Grafana | Latency, cost, quality SLOs |
llmops-reference-architecture/
├── src/
│ ├── tracking/
│ │ ├── experiment_tracker.py # MLflow experiment management
│ │ ├── model_registry.py # Model lifecycle: dev→staging→prod
│ │ └── prompt_versioning.py # Prompt registry with semantic diff
│ ├── evaluation/
│ │ ├── eval_suite.py # Automated quality gate checks
│ │ ├── regression_tests.py # Golden set regression detection
│ │ └── cost_profiler.py # Token cost tracking per pipeline
│ ├── serving/
│ │ ├── inference_api.py # FastAPI with streaming + health checks
│ │ ├── ab_router.py # Shadow/canary/champion routing
│ │ └── model_loader.py # Lazy loading + warm-up strategies
│ ├── monitoring/
│ │ ├── drift_detector.py # Statistical drift detection
│ │ ├── quality_monitor.py # Continuous quality scoring
│ │ └── alerting.py # SLO violation alerts
│ └── guardrails/
│ ├── output_validator.py # PII, hallucination, toxicity
│ └── input_sanitizer.py # Prompt injection detection
├── dags/
│ ├── eval_pipeline.py # Airflow: scheduled evaluation
│ └── retraining_trigger.py # Airflow: drift-triggered retraining
├── .github/workflows/
│ ├── eval_gate.yml # PR quality gate
│ └── deploy.yml # Progressive production deployment
├── configs/
│ ├── model_configs.yaml
│ └── eval_thresholds.yaml
└── requirements.txt
git clone https://github.com/codebygarrysingh/llmops-reference-architecture
cd llmops-reference-architecture
pip install -r requirements.txt
# Start MLflow and Airflow
docker-compose up -d
# Run an experiment
python -m src.tracking.experiment_tracker \
--model gpt-4o \
--prompt-version v2.3 \
--eval-dataset ./data/golden_set.json
# Check quality gate
python -m src.evaluation.eval_suite --threshold-config configs/eval_thresholds.yaml
# Start serving
uvicorn src.serving.inference_api:app --workers 4Every pull request triggers automated evaluation against the golden test set:
# .github/workflows/eval_gate.yml
- name: Run eval suite
run: python -m src.evaluation.eval_suite --strict
- name: Enforce quality thresholds
run: |
if faithfulness < 0.90: fail
if latency_p95 > 2000ms: fail
if cost_per_1k > $0.50: warn
if regression_delta > -0.02: failDeployment is blocked automatically if any threshold is breached — no manual sign-off required for regressions to reach production.
detector = DriftDetector(
reference_dataset=baseline_outputs,
drift_tests=["ks_test", "psi", "embedding_drift"],
alert_threshold=0.05
)
# Run on sliding 24h window
result = detector.check(current_window=last_24h_outputs)
if result.drift_detected:
alerting.trigger_oncall(result.summary)router = ABRouter(
champion=ProductionModel(version="v2.1"),
challenger=CandidateModel(version="v2.2"),
mode="shadow", # shadow | canary | champion_challenger
canary_pct=0.05, # 5% traffic to challenger
min_sample_size=1000 # statistical significance threshold
)Garry Singh — Principal AI & Data Engineer · MSc Oxford