Agent Reliability Lab (ARL) is a production-inspired evaluation and observability platform for agentic AI workflows.
It is designed to simulate how a Quality Engineer would validate, trace, and regression-test LLM-based systems before and after deployment.
The system demonstrates:
- Agent orchestration with tool usage
- Retrieval-Augmented Generation (RAG)
- Structured outputs with strict schemas
- Evaluation pipelines (regression testing)
- Observability and trace capture
- Latency and cost tracking
- Production feedback loop → new eval case promotion
- Install `uv` locally (the single-binary installer is recommended).
- From the repo root run `uv run poe dev`; this installs FastAPI + uvicorn (from `pyproject.toml`) and starts the dev server with reload enabled on `http://127.0.0.1:8000`.
- Use `uv run poe api` for a non-reload server (matches production flags).
- Hit `GET /healthz` to verify the service.
- Run regression suites with `uv run poe test` (unit guardrails) or `uv run poe test_integration` (spins up uvicorn and makes real HTTP calls).
`uv` caches dependencies automatically, so the command acts as a Makefile-style `make dev` target without needing a separate virtual environment. The helper scripts set `PYTHONPATH=src`, so you can run the stack without installing the package in editable mode.
The Phase 1 MVP adds a deterministic triage endpoint:
- Route: `POST /incident`
- Request body: `IncidentRequest` (incident + dag/run metadata)
- Response body: `AgentResponse` (structured summary, root cause, actions, evidence)
Example:
```bash
uv run curl -X POST http://127.0.0.1:8000/incident \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "INC-1234",
    "run_id": "ingest_run_20250225",
    "dag_id": "customer_activity_dag",
    "severity": "high",
    "summary": "Load warehouse task continues to timeout",
    "reporter": "pagerduty",
    "keywords": ["timeout"]
  }'
```

The response contains `status="triaged"`, a `root_cause` string, recommended actions, and evidence pulled from the local fixtures in `data/`.
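The request and response shapes can be sketched with stdlib dataclasses standing in for the project's Pydantic models. Field names follow the example payload and the described response; the field `recommended_actions` and the exact types are assumptions, and the real `IncidentRequest`/`AgentResponse` definitions live in the codebase.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRequest:
    # Fields mirror the example curl payload above.
    incident_id: str
    run_id: str
    dag_id: str
    severity: str
    summary: str
    reporter: str
    keywords: list[str] = field(default_factory=list)

@dataclass
class AgentResponse:
    status: str                      # e.g. "triaged", or the "needs_info" fallback
    root_cause: str
    recommended_actions: list[str]   # concrete next steps
    evidence: list[str]              # chunks pulled from fixtures in data/

req = IncidentRequest(
    incident_id="INC-1234",
    run_id="ingest_run_20250225",
    dag_id="customer_activity_dag",
    severity="high",
    summary="Load warehouse task continues to timeout",
    reporter="pagerduty",
    keywords=["timeout"],
)
```

Keeping these shapes explicit is what lets the evaluation layer do hard schema pass/fail checks later.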
Run the regression tests with:
```bash
uv run poe test              # unit + mocked guardrail tests
uv run poe test_integration  # boots FastAPI via uvicorn and calls the live server
```

Phase 2 hardens the MVP:
- Guardrail-configurable limits on tool invocations and reasoning steps.
- The tool runner enforces timeouts + bounded retries, converting failures into a `needs_info` fallback instead of 500s.
- All responses pass through the Pydantic `AgentResponse` schema, now enriched with a `metrics` object (`latency_ms`, `token_usage`, `estimated_cost_usd`, etc.).
- Request metadata captures DAG ownership plus aggregated tool-call stats so operators can audit why a decision happened.
Every `/incident` call now returns these guardrailed metrics, making it easy to layer basic observability panels on top.
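A minimal sketch of how such a `metrics` object might be assembled. The per-token prices here are placeholder assumptions for illustration, not the project's actual rates:

```python
import time

# Placeholder prices per 1K tokens; real values depend on the model in use.
PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}

def build_metrics(started_at: float, token_usage: dict[str, int]) -> dict:
    """Assemble the metrics block attached to an agent response."""
    cost = sum(
        token_usage.get(kind, 0) / 1000 * price
        for kind, price in PRICE_PER_1K.items()
    )
    return {
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "token_usage": token_usage,
        "estimated_cost_usd": round(cost, 6),
    }

start = time.monotonic()
metrics = build_metrics(start, {"prompt": 1200, "completion": 400})
```

Computing cost at response time (rather than in a batch job) is what makes per-request budget enforcement possible later.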
The overall goal is to build and evaluate a Data Pipeline Incident Triage Agent that:
- Accepts an incident (logs, metadata, context)
- Retrieves relevant runbooks and prior incidents
- Calls deterministic tools
- Produces structured RCA + recommended next actions
- Logs traces and metrics
- Runs against a standardized evaluation suite
- Tracks regression across model or prompt versions
The system is organized into layered components:
**API Layer**

Stack:
- Python 3.12
- FastAPI
- Uvicorn
- Pydantic (schema validation)
Responsibilities:
- Accept incident payloads
- Trigger agent workflows
- Trigger evaluation suite runs
- Expose run results and traces
- Provide endpoints for promoting failures to eval cases
Example endpoints:
- POST /incident
- POST /eval/run
- GET /eval/{run_id}
- GET /trace/{run_id}
- POST /promote/{run_id}
**Agent Orchestration Layer**

Stack:
- LangGraph
- OpenAI or Anthropic API
- JSON schema enforcement
Responsibilities:
- Define deterministic workflow graph
- Manage step transitions
- Call tools
- Perform reasoning steps
- Emit structured JSON output
Workflow Graph:
Intake → Retrieval → Tool Calls → Reasoning → Structured Output
Each step logs:
- model used
- tokens consumed
- latency
- tool call metadata
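The workflow graph above can be sketched as a plain sequence of step functions, each emitting a trace record. This is a stand-in for what LangGraph would manage; the step bodies and the `example-model` name are illustrative stubs:

```python
import time

def run_workflow(payload: dict, model: str = "example-model") -> tuple[dict, list[dict]]:
    """Run Intake → Retrieval → Tool Calls → Reasoning → Structured Output,
    logging model, token count, and latency for each step."""
    steps = {
        "intake": lambda s: {**s, "validated": True},
        "retrieval": lambda s: {**s, "evidence": ["runbook-chunk"]},
        "tool_calls": lambda s: {**s, "tool_results": []},
        "reasoning": lambda s: {**s, "root_cause": "stub"},
        "structured_output": lambda s: {**s, "status": "triaged"},
    }
    trace, state = [], payload
    for name, fn in steps.items():
        t0 = time.monotonic()
        state = fn(state)  # each step transforms the shared state dict
        trace.append({
            "step": name,
            "model": model,
            "tokens": 0,  # a real run records actual token usage here
            "latency_ms": (time.monotonic() - t0) * 1000,
        })
    return state, trace

state, trace = run_workflow({"incident_id": "INC-1234"})
```

Modelling each step as `state -> state` keeps transitions deterministic and makes the per-step trace trivial to capture.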
**Tool Layer**

Stack:
- Pure Python
- Typed interfaces
- JSON schemas
Tools:
- `get_logs(run_id)`
- `get_dag_metadata(dag_id)`
- `search_runbooks(query)`
- `query_incident_db(sql)`
Properties:
- Deterministic outputs
- Explicit schemas
- Timeout + retry protection
- Safe execution boundaries
- Fully mockable for testing
All tool calls are recorded in trace storage.
**Retrieval / RAG Layer**

Stack:
- Qdrant (Docker) or pgvector
- SentenceTransformers or OpenAI embeddings
- Markdown runbook corpus
Responsibilities:
- Embed runbooks and prior incidents
- Perform semantic similarity search
- Return top-k evidence chunks
- Provide citation metadata
This layer enables:
- Grounded responses
- Factual scoring
- Evidence-based validation
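The retrieval step reduces to embedding plus cosine top-k. A toy sketch with hand-made vectors; a real deployment would use SentenceTransformers/OpenAI embeddings and Qdrant/pgvector, and the corpus paths here are hypothetical:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], corpus: list[dict], k: int = 2) -> list[dict]:
    """Return the k most similar chunks with citation metadata."""
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return [{"source": d["source"], "score": cosine(query_vec, d["vec"])} for d in ranked[:k]]

corpus = [
    {"source": "runbooks/timeouts.md", "vec": [0.9, 0.1, 0.0]},
    {"source": "runbooks/schema_drift.md", "vec": [0.0, 1.0, 0.2]},
    {"source": "incidents/INC-0042.md", "vec": [0.8, 0.3, 0.1]},
]
hits = top_k([1.0, 0.1, 0.0], corpus, k=2)
```

Returning the `source` path alongside each score is what makes downstream grounding checks ("did it cite retrieved docs?") mechanical.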
**Storage Layer**

Stack:
- Postgres (Docker)
- SQLAlchemy
- Alembic (migrations)
Tables: `eval_cases`, `eval_runs`, `tool_calls`, `step_traces`, `scores`, `model_versions`
Responsibilities:
- Persist evaluation cases
- Store run metadata
- Store trace events
- Store scoring metrics
- Track regression over time
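An in-memory sqlite3 stand-in for a slice of the Postgres schema. The column choices are assumptions based on the responsibilities listed; the real schema lives in the Alembic migrations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eval_cases (id INTEGER PRIMARY KEY, payload TEXT, severity TEXT);
CREATE TABLE eval_runs  (id INTEGER PRIMARY KEY, case_id INTEGER,
                         model_version TEXT, passed INTEGER, latency_ms REAL,
                         FOREIGN KEY (case_id) REFERENCES eval_cases(id));
CREATE TABLE scores     (run_id INTEGER, dimension TEXT, value REAL);
""")
conn.execute("INSERT INTO eval_cases (payload, severity) VALUES (?, ?)",
             ('{"incident_id": "INC-1234"}', "high"))
conn.execute("INSERT INTO eval_runs (case_id, model_version, passed, latency_ms) "
             "VALUES (1, 'v1', 1, 820.0)")
conn.execute("INSERT INTO scores VALUES (1, 'schema_validity', 1.0)")

# Pass rate per model version falls out of a simple aggregate.
pass_rate = conn.execute(
    "SELECT AVG(passed) FROM eval_runs WHERE model_version = 'v1'"
).fetchone()[0]
```

Keying runs by `model_version` is what enables the regression diffs against a previous baseline described later.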
**Evaluation Layer**

Stack:
- pytest
- Custom scoring framework
- JSON schema validation
- Optional LLM-as-judge for semantic scoring
Each eval case contains:
- input payload
- expected schema
- rubric rules
- severity classification
Scoring dimensions:
- Schema validity (hard pass/fail)
- Evidence grounding (did it cite retrieved docs?)
- Actionability (contains concrete next steps?)
- Consistency (variance across N runs)
- Latency (p50 / p95)
- Token cost
Eval results generate:
- pass rate
- failure category breakdown
- regression diff vs previous model version
- cost summary
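A sketch of how the hard and soft dimensions might be scored and rolled into a pass rate. The rubric details (which keys are required, what counts as actionable) are assumptions for illustration:

```python
def score_case(response: dict, evidence_sources: set[str]) -> dict:
    """Score one agent response on a few of the dimensions above."""
    scores = {
        # Hard pass/fail: required keys must be present.
        "schema_validity": all(
            k in response
            for k in ("status", "root_cause", "recommended_actions", "evidence")
        ),
        # Grounding: every cited source must come from retrieval.
        "evidence_grounding": set(response.get("evidence", [])) <= evidence_sources,
        # Actionability: at least one concrete next step.
        "actionability": len(response.get("recommended_actions", [])) > 0,
    }
    scores["passed"] = all(scores.values())
    return scores

def pass_rate(all_scores: list[dict]) -> float:
    return sum(s["passed"] for s in all_scores) / len(all_scores)

results = [
    score_case({"status": "triaged", "root_cause": "oom",
                "recommended_actions": ["raise memory limit"],
                "evidence": ["runbooks/timeouts.md"]}, {"runbooks/timeouts.md"}),
    score_case({"status": "triaged"}, set()),  # fails schema validity
]
```

Keeping schema validity as a hard gate (rather than a weighted score) mirrors the "hard pass/fail" dimension above.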
**Observability Layer**

Stack:
- OpenTelemetry
- Structured JSON logging
- Trace ID correlation
- Postgres trace storage
Responsibilities:
- Record each agent step
- Record each tool call
- Capture model version
- Capture token usage
- Capture execution time
Enables:
- Failure debugging
- Root cause analysis
- Model comparison
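A sketch of trace-ID-correlated structured JSON logging. The field names are illustrative; a real deployment would wire this through OpenTelemetry rather than bare `print`:

```python
import json
import time
import uuid

def log_event(trace_id: str, step: str, **fields) -> str:
    """Emit one structured JSON log line, correlated by trace_id."""
    record = {"ts": time.time(), "trace_id": trace_id, "step": step, **fields}
    line = json.dumps(record)
    print(line)  # in production this would go to the trace store, not stdout
    return line

trace_id = str(uuid.uuid4())
line = log_event(trace_id, "tool_call",
                 tool="get_logs", latency_ms=41.2, model="example-model")
```

Because every record carries the same `trace_id`, all steps and tool calls of one `/incident` run can be reassembled from the log stream for debugging.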
**CI Layer**

Stack:
- GitHub Actions
- Docker Compose
- pytest
Responsibilities:
- Run evaluation suite on PR
- Fail build on regression threshold
- Store historical metrics
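The regression gate reduces to a threshold comparison that a CI step turns into an exit code. The 5-point allowed drop is a placeholder assumption; the real threshold would be project policy:

```python
def regression_gate(current_pass_rate: float, baseline_pass_rate: float,
                    max_drop: float = 0.05) -> bool:
    """Return True (build passes) unless the pass rate dropped more than
    max_drop relative to the stored baseline."""
    return (baseline_pass_rate - current_pass_rate) <= max_drop

ok = regression_gate(current_pass_rate=0.88, baseline_pass_rate=0.92)       # within threshold
blocked = regression_gate(current_pass_rate=0.80, baseline_pass_rate=0.92)  # regression
```

Comparing against a stored baseline, rather than an absolute bar, is what lets the gate catch regressions even while the absolute pass rate is still being improved.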
Each evaluation run will:
- Load 25 eval cases
- Run agent N times per case
- Score each dimension
- Store metrics
- Produce JSON + console report
- Compare against baseline model version
The system can run in two modes:
**Local mode**
- Local embedding model
- Small hosted LLM usage
- All services via Docker
- Approximate cost: $0–$20/month
**Hosted mode**
- Hosted LLM
- Managed Postgres
- Managed vector DB
- CI runs using LLM calls
- Approximate cost: $20–$150/month depending on volume
Cost control mechanisms:
- Token logging
- Budget caps
- Smaller models for retrieval/classification
- Larger models only for reasoning
- Caching embeddings
- Fixed evaluation batch sizes
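Token logging and budget caps can be sketched together. The cap value and per-token price here are placeholder assumptions:

```python
class BudgetTracker:
    """Log token usage per call and refuse further LLM calls once a
    budget cap would be exceeded."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.calls: list[dict] = []

    def record(self, tokens: int, price_per_1k: float) -> None:
        cost = tokens / 1000 * price_per_1k
        if self.spent_usd + cost > self.cap_usd:
            # Blocking before the call is made is the cap, not just a report.
            raise RuntimeError("budget cap exceeded; call blocked")
        self.spent_usd += cost
        self.calls.append({"tokens": tokens, "cost_usd": cost})

budget = BudgetTracker(cap_usd=20.0)
budget.record(tokens=150_000, price_per_1k=0.003)  # $0.45, allowed
```

Checking the cap before spending (rather than reporting overruns afterwards) is what keeps CI eval runs inside the monthly budget ranges quoted above.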
This repo demonstrates:
- Production-style agent engineering
- Evaluation infrastructure
- Observability of non-deterministic systems
- Regression safety for AI workflows
- Cost-aware AI deployment
It represents the core skillset of an AI Quality Engineer building reliability into agentic systems.