Agent Reliability Lab (ARL)

Agent Reliability Lab (ARL) is a production-inspired evaluation and observability platform for agentic AI workflows.

It is designed to simulate how a Quality Engineer would validate, trace, and regression-test LLM-based systems before and after deployment.

The system demonstrates:

Agent orchestration with tool usage
Retrieval-Augmented Generation (RAG)
Structured outputs with strict schemas
Evaluation pipelines (regression testing)
Observability and trace capture
Latency and cost tracking
Production feedback loop → new eval case promotion

🚀 Local Development (uv)

Install uv locally (single binary installer recommended).
From the repo root run uv run poe dev — this installs FastAPI + uvicorn (from pyproject.toml) and starts the dev server with reload enabled on http://127.0.0.1:8000.
Use uv run poe api for a non-reload server (matches production flags).
Hit GET /healthz to verify the service.
Run regression suites with uv run poe test (unit guardrails) or uv run poe test_integration (spins up uvicorn and makes real HTTP calls).

uv caches dependencies automatically, so the command acts as the Makefile-style make dev target without needing a separate virtual environment. The helper scripts set PYTHONPATH=src, so you can run the stack without installing the package in editable mode.

🔁 Incident Endpoint (Phase 1)

The Phase 1 MVP adds a deterministic triage endpoint:

Route: POST /incident
Request Body: IncidentRequest (incident + dag/run metadata)
Response Body: AgentResponse (structured summary, root cause, actions, evidence)

Example:

uv run curl -X POST http://127.0.0.1:8000/incident \
  -H "Content-Type: application/json" \
  -d '{
        "incident_id": "INC-1234",
        "run_id": "ingest_run_20250225",
        "dag_id": "customer_activity_dag",
        "severity": "high",
        "summary": "Load warehouse task continues to timeout",
        "reporter": "pagerduty",
        "keywords": ["timeout"]
      }'

Response contains status="triaged", a root_cause string, recommended actions, and evidence pulled from the local fixtures in data/.

Run the regression tests with:

uv run poe test              # unit + mocked guardrail tests
uv run poe test_integration  # boots FastAPI via uvicorn and calls the live server

🛡 Phase 2 Guardrails

Phase 2 hardens the MVP:

Guardrail-configurable limits on tool invocations and reasoning steps.
Tool runner enforces timeouts + bounded retries, converting failures into a needs_info fallback instead of 500s.
All responses pass through the Pydantic AgentResponse schema, now enriched with a metrics object (latency_ms, token_usage, estimated_cost_usd, etc.).
Request metadata captures DAG ownership plus aggregated tool-call stats so operators can audit why a decision happened.

Every /incident call now returns these guardrailed metrics, making it easy to plug basic observability panels on top.

🧠 Project Goal

Build and evaluate a Data Pipeline Incident Triage Agent that:

Accepts an incident (logs, metadata, context)
Retrieves relevant runbooks and prior incidents
Calls deterministic tools
Produces structured RCA + recommended next actions
Logs traces and metrics
Runs against a standardized evaluation suite
Tracks regression across model or prompt versions

🏗 Architecture Overview

The system is organized into layered components:

1️⃣ API Layer (Application Interface)

Tech

Python 3.12
FastAPI
Uvicorn
Pydantic (schema validation)

Responsibilities

Accept incident payloads
Trigger agent workflows
Trigger evaluation suite runs
Expose run results and traces
Provide endpoints for promoting failures to eval cases

Example endpoints:

POST /incident
POST /eval/run
GET /eval/{run_id}
GET /trace/{run_id}
POST /promote/{run_id}

2️⃣ Agent Workflow Layer

Tech

LangGraph
OpenAI or Anthropic API
JSON schema enforcement

Responsibilities

Define deterministic workflow graph
Manage step transitions
Call tools
Perform reasoning steps
Emit structured JSON output

Workflow Graph:

Intake → Retrieval → Tool Calls → Reasoning → Structured Output

Each step logs:

model used
tokens consumed
latency
tool call metadata

3️⃣ Tooling Layer (Deterministic Functions)

Tech

Pure Python
Typed interfaces
JSON schemas

Example Tools

get_logs(run_id)
get_dag_metadata(dag_id)
search_runbooks(query)
query_incident_db(sql)

Responsibilities

Deterministic outputs
Explicit schemas
Timeout + retry protection
Safe execution boundaries
Fully mockable for testing

All tool calls are recorded in trace storage.

4️⃣ Retrieval Layer (RAG)

Tech

Qdrant (Docker) or pgvector
SentenceTransformers or OpenAI embeddings
Markdown runbook corpus

Responsibilities

Embed runbooks and prior incidents
Perform semantic similarity search
Return top-k evidence chunks
Provide citation metadata

This layer enables:

Grounded responses
Factual scoring
Evidence-based validation

5️⃣ Data Layer

Tech

Postgres (Docker)
SQLAlchemy
Alembic (migrations)

Core Tables

eval_cases eval_runs tool_calls step_traces scores model_versions

Responsibilities

Persist evaluation cases
Store run metadata
Store trace events
Store scoring metrics
Track regression over time

6️⃣ Evaluation Layer (AI Quality Engineering Core)

Tech

pytest
Custom scoring framework
JSON schema validation
Optional LLM-as-judge for semantic scoring

Responsibilities

Each eval case contains:

input payload
expected schema
rubric rules
severity classification

Scoring dimensions:

Schema validity (hard pass/fail)
Evidence grounding (did it cite retrieved docs?)
Actionability (contains concrete next steps?)
Consistency (variance across N runs)
Latency (p50 / p95)
Token cost

Eval results generate:

pass rate
failure category breakdown
regression diff vs previous model version
cost summary

7️⃣ Observability Layer

Tech

OpenTelemetry
Structured JSON logging
Trace ID correlation
Postgres trace storage

Responsibilities

Record each agent step
Record each tool call
Capture model version
Capture token usage
Capture execution time

Enables:

Failure debugging
Root cause analysis
Model comparison

8️⃣ CI / Automation Layer

Tech

GitHub Actions
Docker Compose
pytest

Responsibilities

Run evaluation suite on PR
Fail build on regression threshold
Store historical metrics

🧪 Example Evaluation Workflow

Load 25 eval cases
Run agent N times per case
Score each dimension
Store metrics
Produce JSON + console report
Compare against baseline model version

💰 Cost Considerations

The system can run in two modes:

Local / Free Mode

Local embedding model
Small hosted LLM usage
All services via Docker
Approximate cost: $0–$20/month

Cloud Mode

Hosted LLM
Managed Postgres
Managed vector DB
CI runs using LLM calls
Approximate cost: $20–$150/month depending on volume

Cost control mechanisms:

Token logging
Budget caps
Smaller models for retrieval/classification
Larger models only for reasoning
Caching embeddings
Fixed evaluation batch sizes

🚀 Why This Project Matters

This repo demonstrates:

Production-style agent engineering
Evaluation infrastructure
Observability of non-deterministic systems
Regression safety for AI workflows
Cost-aware AI deployment

It represents the core skillset of an AI Quality Engineer building reliability into agentic systems.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
src		src
.gitignore		.gitignore
.python-version		.python-version
BUILDSTEPS.md		BUILDSTEPS.md
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
eval_report.json		eval_report.json
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Agent Reliability Lab (ARL)

🚀 Local Development (uv)

🔁 Incident Endpoint (Phase 1)

🛡 Phase 2 Guardrails

🧠 Project Goal

🏗 Architecture Overview

1️⃣ API Layer (Application Interface)

Tech

Responsibilities

2️⃣ Agent Workflow Layer

Tech

Responsibilities

3️⃣ Tooling Layer (Deterministic Functions)

Tech

Example Tools

Responsibilities

4️⃣ Retrieval Layer (RAG)

Tech

Responsibilities

5️⃣ Data Layer

Tech

Core Tables

Responsibilities

6️⃣ Evaluation Layer (AI Quality Engineering Core)

Tech

Responsibilities

7️⃣ Observability Layer

Tech

Responsibilities

8️⃣ CI / Automation Layer

Tech

Responsibilities

🧪 Example Evaluation Workflow

💰 Cost Considerations

Local / Free Mode

Cloud Mode

🚀 Why This Project Matters

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages