LLMOps Reference Architecture

Production-grade MLOps framework for LLM systems: experiment tracking, model versioning, drift detection, A/B testing, guardrails, and CI/CD. Battle-tested in enterprise financial services deployments.

The Problem This Solves

Most teams can get an LLM working in a notebook. Almost none have solved:

Reproducibility — can you recreate the exact prompt, model version, and retrieval config that produced a given output 6 months ago?
Regression detection — does the new model version actually perform better on your domain, or just on OpenAI's evals?
Production drift — is the model's output quality degrading silently as your data distribution shifts?
Responsible deployment — how do you enforce output constraints (PII, toxicity, hallucination) consistently across 100K+ daily inferences?

This reference architecture answers all four.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    DEVELOPMENT LOOP                           │
│                                                              │
│  Prompt Engineering → Experiment Tracking (MLflow)          │
│  Model Evaluation → RAGAS / Custom Metrics → Registry       │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    CI/CD PIPELINE (GitHub Actions)            │
│                                                              │
│  PR → Automated Eval Suite → Quality Gate → Stage → Prod    │
│       (faithfulness, latency, cost per 1K, regression test)  │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    PRODUCTION RUNTIME                         │
│                                                              │
│  Inference API → Guardrails → Response → Observability       │
│       │                                       │             │
│  A/B Router                           Drift Monitor         │
│  (shadow/canary)                      (statistical tests)   │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    FEEDBACK LOOP                              │
│  User signals → Airflow → Retraining / Prompt refinement    │
└──────────────────────────────────────────────────────────────┘

Core Components

Component	Technology	Purpose
Experiment Tracking	MLflow	Prompt versions, model configs, eval metrics
Model Registry	MLflow Model Registry	Version control, staging, production promotion
Orchestration	Apache Airflow	Scheduled eval runs, retraining pipelines
Serving	FastAPI + Azure AKS	Low-latency inference with autoscaling
A/B Testing	Custom router	Shadow mode, canary, champion/challenger
Drift Detection	Evidently AI	Data drift, concept drift, output quality drift
Guardrails	Guardrails AI + custom	PII, hallucination, toxicity, domain constraints
CI/CD	GitHub Actions	Automated eval gates, progressive deployment
Observability	OpenTelemetry + Grafana	Latency, cost, quality SLOs

Project Structure

llmops-reference-architecture/
├── src/
│   ├── tracking/
│   │   ├── experiment_tracker.py   # MLflow experiment management
│   │   ├── model_registry.py       # Model lifecycle: dev→staging→prod
│   │   └── prompt_versioning.py    # Prompt registry with semantic diff
│   ├── evaluation/
│   │   ├── eval_suite.py           # Automated quality gate checks
│   │   ├── regression_tests.py     # Golden set regression detection
│   │   └── cost_profiler.py        # Token cost tracking per pipeline
│   ├── serving/
│   │   ├── inference_api.py        # FastAPI with streaming + health checks
│   │   ├── ab_router.py            # Shadow/canary/champion routing
│   │   └── model_loader.py         # Lazy loading + warm-up strategies
│   ├── monitoring/
│   │   ├── drift_detector.py       # Statistical drift detection
│   │   ├── quality_monitor.py      # Continuous quality scoring
│   │   └── alerting.py             # SLO violation alerts
│   └── guardrails/
│       ├── output_validator.py     # PII, hallucination, toxicity
│       └── input_sanitizer.py      # Prompt injection detection
├── dags/
│   ├── eval_pipeline.py            # Airflow: scheduled evaluation
│   └── retraining_trigger.py       # Airflow: drift-triggered retraining
├── .github/workflows/
│   ├── eval_gate.yml               # PR quality gate
│   └── deploy.yml                  # Progressive production deployment
├── configs/
│   ├── model_configs.yaml
│   └── eval_thresholds.yaml
└── requirements.txt

Quickstart

git clone https://github.com/codebygarrysingh/llmops-reference-architecture
cd llmops-reference-architecture
pip install -r requirements.txt

# Start MLflow and Airflow
docker-compose up -d

# Run an experiment
python -m src.tracking.experiment_tracker \
  --model gpt-4o \
  --prompt-version v2.3 \
  --eval-dataset ./data/golden_set.json

# Check quality gate
python -m src.evaluation.eval_suite --threshold-config configs/eval_thresholds.yaml

# Start serving
uvicorn src.serving.inference_api:app --workers 4

Quality Gate: CI/CD Integration

Every pull request triggers automated evaluation against the golden test set:

# .github/workflows/eval_gate.yml
- name: Run eval suite
  run: python -m src.evaluation.eval_suite --strict

- name: Enforce quality thresholds
  run: |
    if faithfulness < 0.90: fail
    if latency_p95 > 2000ms: fail
    if cost_per_1k > $0.50: warn
    if regression_delta > -0.02: fail

Deployment is blocked automatically if any threshold is breached — no manual sign-off required for regressions to reach production.

Drift Detection

detector = DriftDetector(
    reference_dataset=baseline_outputs,
    drift_tests=["ks_test", "psi", "embedding_drift"],
    alert_threshold=0.05
)

# Run on sliding 24h window
result = detector.check(current_window=last_24h_outputs)
if result.drift_detected:
    alerting.trigger_oncall(result.summary)

A/B Testing: Shadow Mode

router = ABRouter(
    champion=ProductionModel(version="v2.1"),
    challenger=CandidateModel(version="v2.2"),
    mode="shadow",          # shadow | canary | champion_challenger
    canary_pct=0.05,        # 5% traffic to challenger
    min_sample_size=1000    # statistical significance threshold
)

Author

Garry Singh — Principal AI & Data Engineer · MSc Oxford

Portfolio · LinkedIn · Book a Consultation

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src/tracking		src/tracking
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLMOps Reference Architecture

The Problem This Solves

Architecture

Core Components

Project Structure

Quickstart

Quality Gate: CI/CD Integration

Drift Detection

A/B Testing: Shadow Mode

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLMOps Reference Architecture

The Problem This Solves

Architecture

Core Components

Project Structure

Quickstart

Quality Gate: CI/CD Integration

Drift Detection

A/B Testing: Shadow Mode

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages