Merged
2 changes: 1 addition & 1 deletion DEVELOPMENT.md
@@ -50,7 +50,7 @@ Once running, submit a run with:
```bash
curl -X POST http://localhost:8001/api/runs \
-H 'content-type: application/json' \
- -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}'
+ -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'
```

Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return 503 with a hint pointing at the env var.
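
For callers that still build the old `metrics` list, the `evalConfig` change in this hunk can be sketched in Python. This is a minimal sketch, not code from the repository: `migrate_eval_config` is a hypothetical helper, and it assumes every legacy metric name maps to a `builtin` evaluator entry, as in the curl example above.

```python
import json


def migrate_eval_config(eval_config: dict) -> dict:
    """Return a copy of eval_config with legacy `metrics` rewritten as `evaluators`.

    Hypothetical helper: assumes each legacy metric name is a built-in evaluator.
    """
    # Copy everything except the legacy key.
    migrated = {key: value for key, value in eval_config.items() if key != "metrics"}
    evaluators = list(eval_config.get("evaluators", []))
    known_names = {entry["name"] for entry in evaluators}
    for name in eval_config.get("metrics", []):
        if name not in known_names:
            evaluators.append({"name": name, "type": "builtin"})
            known_names.add(name)
    if evaluators:
        migrated["evaluators"] = evaluators
    return migrated


old = {"metrics": ["tool_trajectory_avg_score"]}
new = migrate_eval_config(old)
print(json.dumps(new))
```

The helper leaves any existing `evaluators` entries untouched and only appends names that are not already present, so running it twice should not duplicate entries.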
2 changes: 1 addition & 1 deletion README.md
@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
agentevals serve # bundled UI on http://localhost:8001
```

- Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+ Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).

Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.

6 changes: 3 additions & 3 deletions docs/eval-set-format.md
@@ -1,6 +1,6 @@
# Eval Set Format

- An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
+ An eval set is a JSON file containing golden reference data that evaluators compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.

Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.

@@ -203,9 +203,9 @@ The `parts` array can contain text, function calls, or function responses. Most

Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
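
The `id` matching described above can be sketched in Python. This is an illustrative sketch only: `pair_calls_with_responses` is a hypothetical helper, the tool names and payloads are made up, and it assumes the calls and responses have already been pulled out of `parts` as plain dicts with the fields listed above.

```python
from __future__ import annotations


def pair_calls_with_responses(
    calls: list[dict], responses: list[dict]
) -> list[tuple[dict, dict | None]]:
    """Pair each function call with the response that shares its `id`.

    Calls without a matching response are paired with None.
    """
    responses_by_id = {resp["id"]: resp for resp in responses}
    return [(call, responses_by_id.get(call["id"])) for call in calls]


# Illustrative data; the names and ids are not taken from a real eval set.
calls = [
    {"name": "roll_die", "args": {"sides": 6}, "id": "call-1"},
    {"name": "check_prime", "args": {"nums": [4]}, "id": "call-2"},
    {"name": "roll_die", "args": {"sides": 20}, "id": "call-3"},
]
responses = [
    {"name": "check_prime", "response": {"result": "4 is not prime"}, "id": "call-2"},
    {"name": "roll_die", "response": {"result": 4}, "id": "call-1"},
]

pairs = pair_calls_with_responses(calls, responses)
for call, response in pairs:
    print(call["id"], call["name"], "->", response["response"] if response else None)
```

Pairing by `id` rather than by position keeps the result stable when responses arrive in a different order than the calls, or when the same tool is called more than once.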

- ## Which Metrics Use Eval Sets
+ ## Which Evaluators Use Eval Sets

- Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
+ Not all evaluators require an eval set. Use `agentevals evaluator list --source builtin` to see which built-in evaluators do:

| Metric | Needs Eval Set | What It Reads |
|---|---|---|
2 changes: 1 addition & 1 deletion examples/dice_agent/README.md
@@ -149,7 +149,7 @@ Update `main.py` to test the new functionality.
**After agent completes:**
- Status changes to "EVALUATED"
- Evaluation results appear as colored badges
- - Each metric shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
+ - Each evaluator result shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")

**Multiple runs:**
- Each run creates a new session with model name in ID
4 changes: 2 additions & 2 deletions examples/kubernetes/README.md
@@ -221,7 +221,7 @@ This captures the GPT-5 session's tool trajectory and final responses as the gol
2. Select both sessions (the `gpt-4.1-mini` session and the `gpt-5` session)
3. Click **Evaluate**
4. Select the `helm-agent-comparison` eval set
- 5. Choose the metrics:
+ 5. Choose the evaluators:
- **tool_trajectory_avg_score**: Did the agent call the correct tools in the correct order?
- **response_match_score**: Did the agent produce responses consistent with the golden reference?
6. Run the evaluation
@@ -241,7 +241,7 @@ Compare the two sessions in the results table:

<img width="1914" height="1154" alt="image" src="https://github.com/user-attachments/assets/5939a8d4-3775-4cf1-9cf2-d3b6b4afd582" />

- You can also click an individual conversation and see a breakdown of each evaluators.
+ You can also click an individual conversation and see a breakdown of each evaluator.

<img width="1916" height="1348" alt="image" src="https://github.com/user-attachments/assets/984b3d29-8018-4fcb-9036-bb7c6e97d9ff" />

115 changes: 28 additions & 87 deletions src/agentevals/api/routes.py
@@ -18,14 +18,7 @@
from agentevals import __version__

from ..builtin_metrics import METRICS_NEEDING_EXPECTED, METRICS_NEEDING_GCP, METRICS_NEEDING_LLM
- from ..config import (
- BuiltinMetricDef,
- CodeEvaluatorDef,
- CustomEvaluatorDef,
- EvalParams,
- EvalRunConfig,
- OpenAIEvalDef,
- )
+ from ..config import EvalParams, EvalRunConfig
from ..converter import convert_traces
from ..extraction import get_extractor
from ..loader import load_traces
@@ -121,24 +114,6 @@ async def _maybe_persist_evaluate_run(

_MAX_JSON_BODY_BYTES = 50 * 1024 * 1024 # 50 MB (multipart endpoints allow 10 MB per file)

- _TYPE_TO_MODEL = {
- "builtin": BuiltinMetricDef,
- "code": CodeEvaluatorDef,
- "openai_eval": OpenAIEvalDef,
- }


- def _parse_custom_evaluators(raw: list[dict]) -> list[CustomEvaluatorDef]:
- """Parse a list of custom evaluator dicts from the API config JSON."""
- defs: list[CustomEvaluatorDef] = []
- for entry in raw:
- evaluator_type = entry.get("type", "builtin")
- model_cls = _TYPE_TO_MODEL.get(evaluator_type)
- if not model_cls:
- raise ValueError(f"Unknown custom evaluator type: {evaluator_type}")
- defs.append(model_cls.model_validate(entry))
- return defs


@router.get("/health", response_model=StandardResponse[HealthData])
async def health_check():
@@ -489,10 +464,10 @@ async def evaluate_traces(
eval_set_file: UploadFile | None = File(None),
):
"""
- Evaluate agent traces using specified metrics.
+ Evaluate agent traces using the provided evaluator configuration.

Args:
- trace_files: List of Jaeger JSON trace files
+ trace_files: List of Jaeger or OTLP JSON trace files
config: JSON string with evaluation configuration
eval_set_file: Optional golden eval set file

@@ -556,40 +531,23 @@ async def evaluate_traces(
)
f.write(content)

- metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
- if not metrics or not isinstance(metrics, list):
- raise HTTPException(
- status_code=400,
- detail="Config must include 'metrics' as a non-empty array",
- )

- threshold = config_dict.get("threshold")
- if threshold is not None and (threshold < 0 or threshold > 1):
- raise HTTPException(
- status_code=400,
- detail="Threshold must be between 0 and 1",
+ try:
+ eval_config = EvalRunConfig.model_validate(
+ {
+ **config_dict,
+ "traceFiles": trace_paths,
+ "evalSetFile": eval_set_path,
+ "traceFormat": trace_format,
+ }
)
+ except Exception as exc:
+ raise HTTPException(status_code=400, detail=f"Invalid config: {exc}") from exc

- custom_evaluators: list[CustomEvaluatorDef] = []
- raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
- if raw_custom:
- try:
- custom_evaluators = _parse_custom_evaluators(raw_custom)
- except Exception as exc:
- raise HTTPException(status_code=400, detail=f"Invalid customEvaluators: {exc}") from exc

- eval_config = EvalRunConfig(
- trace_files=trace_paths,
- eval_set_file=eval_set_path,
- metrics=metrics,
- custom_evaluators=custom_evaluators,
- trace_format=trace_format,
- judge_model=config_dict.get("judgeModel"),
- threshold=threshold,
- trajectory_match_type=config_dict.get("trajectoryMatchType"),
+ logger.info(
+ "Evaluating %d trace file(s) with evaluators: %s",
+ len(trace_paths),
+ [e.name for e in eval_config.evaluators],
)

- logger.info(f"Evaluating {len(trace_paths)} trace file(s) with metrics: {metrics}")
result = await run_evaluation(eval_config)

run_id = await _maybe_persist_evaluate_run(
@@ -675,36 +633,19 @@ async def event_generator():
return
f.write(content)

- metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
- if not metrics or not isinstance(metrics, list):
- yield f"data: {SSEErrorEvent(error='Config must include metrics as a non-empty array').model_dump_json(by_alias=True)}\n\n"
- return

- threshold = config_dict.get("threshold")
- if threshold is not None and (threshold < 0 or threshold > 1):
- yield f"data: {SSEErrorEvent(error='Threshold must be between 0 and 1').model_dump_json(by_alias=True)}\n\n"
+ try:
+ eval_config = EvalRunConfig.model_validate(
+ {
+ **config_dict,
+ "traceFiles": trace_paths,
+ "evalSetFile": eval_set_path,
+ "traceFormat": trace_format,
+ }
+ )
+ except Exception as exc:
+ yield f"data: {SSEErrorEvent(error=f'Invalid config: {exc}').model_dump_json(by_alias=True)}\n\n"
return

- custom_evaluators: list[CustomEvaluatorDef] = []
- raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
- if raw_custom:
- try:
- custom_evaluators = _parse_custom_evaluators(raw_custom)
- except Exception as exc:
- yield f"data: {SSEErrorEvent(error=f'Invalid customEvaluators: {exc}').model_dump_json(by_alias=True)}\n\n"
- return

- eval_config = EvalRunConfig(
- trace_files=trace_paths,
- eval_set_file=eval_set_path,
- metrics=metrics,
- custom_evaluators=custom_evaluators,
- trace_format=trace_format,
- judge_model=config_dict.get("judgeModel"),
- threshold=threshold,
- trajectory_match_type=config_dict.get("trajectoryMatchType"),
- )

for trace_file_path in trace_paths:
try:
traces = load_traces(trace_file_path, format=eval_config.trace_format)
2 changes: 2 additions & 0 deletions src/agentevals/api/runs_routes.py
@@ -76,6 +76,8 @@ async def submit_run(payload: RunRequest, request: Request):
service = _service(request)
try:
run = await service.submit(run_id=payload.run_id, spec=payload.spec)
+ except ValueError as exc:
+ raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_CONTENT, detail=str(exc)) from exc
except RunSubmitConflict as exc:
raise HTTPException(
status_code=status.HTTP_409_CONFLICT,
16 changes: 7 additions & 9 deletions src/agentevals/api/streaming_routes.py
@@ -5,13 +5,13 @@
import asyncio
import json
import logging
- from typing import TYPE_CHECKING, Literal
+ from typing import TYPE_CHECKING

from fastapi import APIRouter, Depends, HTTPException
from fastapi.responses import FileResponse
- from pydantic import BaseModel
+ from pydantic import BaseModel, ConfigDict, Field

- from ..config import EvalRunConfig
+ from ..config import BuiltinMetricDef, EvalRunConfig, EvaluatorDef
from ..converter import convert_traces
from ..loader.otlp import OtlpJsonLoader
from ..runner import run_evaluation
@@ -42,11 +42,11 @@ class CreateEvalSetRequest(BaseModel):


class EvaluateSessionsRequest(BaseModel):
+ model_config = ConfigDict(extra="forbid")

golden_session_id: str
eval_set_id: str
- metrics: list[str] = ["tool_trajectory_avg_score"]
- judge_model: str = "gemini-2.5-flash"
- trajectory_match_type: Literal["EXACT", "IN_ORDER", "ANY_ORDER"] | None = None
+ evaluators: list[EvaluatorDef] = Field(default_factory=lambda: [BuiltinMetricDef(name="tool_trajectory_avg_score")])


class PrepareEvaluationRequest(BaseModel):
@@ -209,9 +209,7 @@ async def eval_one_session(session_id: str, session) -> SessionEvalResult:
trace_files=[str(trace_file)],
trace_format="otlp-json",
eval_set_file=eval_set_file.name,
- metrics=request.metrics,
- judge_model=request.judge_model,
- trajectory_match_type=request.trajectory_match_type,
+ evaluators=request.evaluators,
)

eval_result = await run_evaluation(config)
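
Because `EvaluateSessionsRequest` now sets `extra="forbid"`, a client that still sends the removed top-level fields should expect the request to be rejected. The sketch below shows one way a client could check its body before sending. It is stdlib-only and hypothetical: `build_evaluate_sessions_body` and `unexpected_fields` are not part of agentevals, the JSON keys assume the model's snake_case field names with no aliases, and the `{"name": ..., "type": "builtin"}` shape is assumed from the curl example in DEVELOPMENT.md.

```python
# Field names taken from EvaluateSessionsRequest after this change.
ALLOWED_FIELDS = {"golden_session_id", "eval_set_id", "evaluators"}


def build_evaluate_sessions_body(
    golden_session_id: str,
    eval_set_id: str,
    evaluator_names: tuple[str, ...] = ("tool_trajectory_avg_score",),
) -> dict:
    """Build a request body in the new shape (assumed built-in evaluators only)."""
    return {
        "golden_session_id": golden_session_id,
        "eval_set_id": eval_set_id,
        "evaluators": [{"name": name, "type": "builtin"} for name in evaluator_names],
    }


def unexpected_fields(body: dict) -> list[str]:
    """Return top-level keys that a model with extra='forbid' would not accept."""
    return sorted(set(body) - ALLOWED_FIELDS)


body = build_evaluate_sessions_body("session-golden", "helm-agent-comparison")
legacy_body = {
    **body,
    "metrics": ["tool_trajectory_avg_score"],
    "judge_model": "gemini-2.5-flash",
}

print(unexpected_fields(body))
print(unexpected_fields(legacy_body))
```

The check only covers top-level keys; validation of each evaluator entry is left to the server.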