LLMOps Reference Architecture

Production-grade MLOps framework for LLM systems: experiment tracking, model versioning, drift detection, A/B testing, guardrails, and CI/CD. Battle-tested in enterprise financial services deployments.

Python · MLflow · Airflow · Azure · GitHub Actions


The Problem This Solves

Most teams can get an LLM working in a notebook. Almost none have solved:

  • Reproducibility — can you recreate the exact prompt, model version, and retrieval config that produced a given output 6 months ago?
  • Regression detection — does the new model version actually perform better on your domain, or just on OpenAI's evals?
  • Production drift — is the model's output quality degrading silently as your data distribution shifts?
  • Responsible deployment — how do you enforce output constraints (PII, toxicity, hallucination) consistently across 100K+ daily inferences?

This reference architecture answers all four.


Architecture

┌──────────────────────────────────────────────────────────┐
│                     DEVELOPMENT LOOP                     │
│                                                          │
│  Prompt Engineering → Experiment Tracking (MLflow)       │
│  Model Evaluation → RAGAS / Custom Metrics → Registry    │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│             CI/CD PIPELINE (GitHub Actions)              │
│                                                          │
│  PR → Automated Eval Suite → Quality Gate → Stage → Prod │
│     (faithfulness, latency, cost per 1K, regression test)│
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                    PRODUCTION RUNTIME                    │
│                                                          │
│  Inference API → Guardrails → Response → Observability   │
│       │                                       │          │
│  A/B Router                            Drift Monitor     │
│  (shadow/canary)                      (statistical tests)│
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                      FEEDBACK LOOP                       │
│  User signals → Airflow → Retraining / Prompt refinement │
└──────────────────────────────────────────────────────────┘

Core Components

| Component | Technology | Purpose |
|---|---|---|
| Experiment Tracking | MLflow | Prompt versions, model configs, eval metrics |
| Model Registry | MLflow Model Registry | Version control, staging, production promotion |
| Orchestration | Apache Airflow | Scheduled eval runs, retraining pipelines |
| Serving | FastAPI + Azure AKS | Low-latency inference with autoscaling |
| A/B Testing | Custom router | Shadow mode, canary, champion/challenger |
| Drift Detection | Evidently AI | Data drift, concept drift, output quality drift |
| Guardrails | Guardrails AI + custom | PII, hallucination, toxicity, domain constraints |
| CI/CD | GitHub Actions | Automated eval gates, progressive deployment |
| Observability | OpenTelemetry + Grafana | Latency, cost, quality SLOs |

Project Structure

llmops-reference-architecture/
├── src/
│   ├── tracking/
│   │   ├── experiment_tracker.py   # MLflow experiment management
│   │   ├── model_registry.py       # Model lifecycle: dev→staging→prod
│   │   └── prompt_versioning.py    # Prompt registry with semantic diff
│   ├── evaluation/
│   │   ├── eval_suite.py           # Automated quality gate checks
│   │   ├── regression_tests.py     # Golden set regression detection
│   │   └── cost_profiler.py        # Token cost tracking per pipeline
│   ├── serving/
│   │   ├── inference_api.py        # FastAPI with streaming + health checks
│   │   ├── ab_router.py            # Shadow/canary/champion routing
│   │   └── model_loader.py         # Lazy loading + warm-up strategies
│   ├── monitoring/
│   │   ├── drift_detector.py       # Statistical drift detection
│   │   ├── quality_monitor.py      # Continuous quality scoring
│   │   └── alerting.py             # SLO violation alerts
│   └── guardrails/
│       ├── output_validator.py     # PII, hallucination, toxicity
│       └── input_sanitizer.py      # Prompt injection detection
├── dags/
│   ├── eval_pipeline.py            # Airflow: scheduled evaluation
│   └── retraining_trigger.py       # Airflow: drift-triggered retraining
├── .github/workflows/
│   ├── eval_gate.yml               # PR quality gate
│   └── deploy.yml                  # Progressive production deployment
├── configs/
│   ├── model_configs.yaml
│   └── eval_thresholds.yaml
└── requirements.txt

Quickstart

git clone https://github.com/codebygarrysingh/llmops-reference-architecture
cd llmops-reference-architecture
pip install -r requirements.txt

# Start MLflow and Airflow
docker-compose up -d

# Run an experiment
python -m src.tracking.experiment_tracker \
  --model gpt-4o \
  --prompt-version v2.3 \
  --eval-dataset ./data/golden_set.json

# Check quality gate
python -m src.evaluation.eval_suite --threshold-config configs/eval_thresholds.yaml

# Start serving
uvicorn src.serving.inference_api:app --workers 4

Quality Gate: CI/CD Integration

Every pull request triggers automated evaluation against the golden test set:

# .github/workflows/eval_gate.yml
- name: Run eval suite
  run: python -m src.evaluation.eval_suite --strict

- name: Enforce quality thresholds
  run: |
    # Gate logic (pseudocode):
    #   if faithfulness < 0.90:        fail
    #   if latency_p95 > 2000 ms:      fail
    #   if cost_per_1k > $0.50:        warn
    #   if regression_delta < -0.02:   fail

If any failing threshold is breached, deployment is blocked automatically; no manual sign-off is involved, so a regression cannot be waved through to production.
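The gate logic above can be sketched as a small Python function. This is a hypothetical illustration, not the repo's actual `eval_suite.py`; the metric names and thresholds simply mirror the pseudocode:

```python
def check_quality_gate(metrics: dict) -> dict:
    """Compare a metrics snapshot against the gate thresholds.

    Returns which checks failed (block deployment) and which only
    warned (deployment proceeds, but the warning is surfaced).
    """
    failures, warnings = [], []

    if metrics["faithfulness"] < 0.90:
        failures.append("faithfulness below 0.90")
    if metrics["latency_p95_ms"] > 2000:
        failures.append("p95 latency above 2000 ms")
    if metrics["cost_per_1k_usd"] > 0.50:
        warnings.append("cost per 1K tokens above $0.50")
    # A negative delta means the candidate scores worse than the champion;
    # anything beyond a 2-point drop blocks the deploy.
    if metrics["regression_delta"] < -0.02:
        failures.append("regression beyond 0.02 tolerance")

    return {"passed": not failures, "failures": failures, "warnings": warnings}


result = check_quality_gate({
    "faithfulness": 0.93,
    "latency_p95_ms": 1450,
    "cost_per_1k_usd": 0.62,
    "regression_delta": -0.01,
})
print(result)  # passed=True, with one cost warning
```

In CI, a non-empty `failures` list would translate into a non-zero exit code so the workflow step fails and blocks the merge.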


Drift Detection

detector = DriftDetector(
    reference_dataset=baseline_outputs,
    drift_tests=["ks_test", "psi", "embedding_drift"],
    alert_threshold=0.05
)

# Run on sliding 24h window
result = detector.check(current_window=last_24h_outputs)
if result.drift_detected:
    alerting.trigger_oncall(result.summary)
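The `DriftDetector` above lists `psi` among its statistical tests. As a dependency-free illustration of that test (the repo's own implementation may differ), here is a minimal Population Stability Index sketch:

```python
import math
import random


def psi(reference: list, current: list, bins: int = 10) -> float:
    """Population Stability Index between two score samples.

    Bins are quantile cut points from the reference distribution;
    PSI > 0.2 is a common rule-of-thumb signal of meaningful drift.
    """
    ref_sorted = sorted(reference)
    edges = [ref_sorted[int(len(ref_sorted) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index for x
        # Small epsilon avoids log(0) when a bin is empty.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(0.8, 1.0) for _ in range(5000)]

print(round(psi(baseline, baseline), 4))   # ~0: no drift against itself
print(psi(baseline, shifted) > 0.2)        # a 0.8-sigma mean shift is flagged
```

In production this comparison would run on the sliding 24h window shown above, with the alert threshold applied to the PSI value.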

A/B Testing: Shadow Mode

router = ABRouter(
    champion=ProductionModel(version="v2.1"),
    challenger=CandidateModel(version="v2.2"),
    mode="shadow",          # shadow | canary | champion_challenger
    canary_pct=0.05,        # 5% traffic to challenger
    min_sample_size=1000    # statistical significance threshold
)
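As an illustration of the shadow mode configured above, here is a minimal standalone router sketch. The class name and the callables are hypothetical, not the repo's `ABRouter`:

```python
import concurrent.futures


class ShadowRouter:
    """Shadow-mode routing: every request is answered by the champion,
    while the challenger receives a copy whose output is logged for
    offline comparison and never returned to the caller."""

    def __init__(self, champion, challenger):
        self.champion = champion
        self.challenger = challenger
        self.shadow_log = []  # (prompt, challenger_output) pairs for eval
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def route(self, prompt: str) -> str:
        # Fire the challenger in the background: its latency and errors
        # must never affect the user-facing path.
        self._pool.submit(self._shadow_call, prompt)
        return self.champion(prompt)

    def _shadow_call(self, prompt: str) -> None:
        try:
            self.shadow_log.append((prompt, self.challenger(prompt)))
        except Exception:
            pass  # shadow failures are swallowed, never surfaced to users


router = ShadowRouter(champion=lambda p: f"v2.1: {p}",
                      challenger=lambda p: f"v2.2: {p}")
print(router.route("summarize this filing"))  # always the champion's answer
```

Once `shadow_log` holds the configured minimum sample size, the challenger's logged outputs can be scored against the champion's before any live traffic is shifted to canary mode.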

Author

Garry Singh — Principal AI & Data Engineer · MSc Oxford

Portfolio · LinkedIn · Book a Consultation
