agent-bench

Benchmark harness that sends the same natural-language task to autonoma, nexus, and agentforge, then reports multi-signal quality scores, cost, wall time, and files produced side by side. Runs execute in parallel and are recorded in a persistent run history.

Status of each framework

| framework | one-shot build CLI | integration |
| --- | --- | --- |
| autonoma | `autonoma build "<prompt>" --output <dir> --no-animate` | supported |
| nexus | `nexus build "<prompt>" --output <dir> --no-tui` | supported |
| agentforge | no project-scaffolding command (only chat-style `run`) | unsupported: the runner returns a `status="unsupported"` result; replace once a build command exists |

Token usage is best-effort:

autonoma: input and output tokens summed from `traces/*/llm-calls.jsonl`; the model name is taken from the trace rows.
nexus: total parsed from the Total Tokens row printed by `_print_summary`, split 70/30 (input/output) heuristically, and priced against `NEXUS_MODEL` (default `claude-sonnet-4-6`).
agentforge: n/a until a one-shot build command lands.
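
As a rough illustration of the nexus path, the sketch below pulls a total out of a summary line and applies the 70/30 split. The exact text nexus prints is not documented here, so the shape of the `Total Tokens` row, the regex, and both helper names are assumptions rather than the harness's actual code.

```python
from __future__ import annotations

import re

# Assumed shape of the summary row; the real nexus output may differ.
TOTAL_TOKENS_RE = re.compile(r"Total Tokens\D*([\d,]+)")


def parse_total_tokens(stdout: str) -> int | None:
    """Return the total token count, or None if the row is not found."""
    match = TOTAL_TOKENS_RE.search(stdout)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def split_tokens(total: int, input_share: float = 0.70) -> tuple[int, int]:
    """Split a total into (input, output) using the 70/30 heuristic."""
    input_tokens = round(total * input_share)
    # Derive output from the remainder so the two parts always sum to total.
    return input_tokens, total - input_tokens
```

Returning `None` when the row is missing keeps a format change from being mistaken for a zero-token run.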

Install

cd agent-bench
uv sync                          # or: pip install -e .
uv pip install -e '.[score]'     # optional: ruff / pytest / anthropic for real scoring
export ANTHROPIC_API_KEY=sk-ant-...

Scoring signals

Every run produces a `Score` object with five signal fields plus an aggregate:

| signal | how it's measured | type |
| --- | --- | --- |
| `files_ok` | task's `files_produced_min` and optional `must_contain_file` are satisfied | bool |
| `imports_ok` | every `*.py` in the workdir parses cleanly with `ast.parse` | bool |
| `ruff_ok` | `python -m ruff check <workdir>` exits 0 (`None` if ruff isn't installed) | bool \| None |
| `pytest_score` | `passed / total` from `pytest --maxfail=5 --timeout=30` with a 60 s wall-clock cap; `None` if no tests are found | 0.0–1.0 \| None |
| `judge_score` | LLM-as-judge (claude-sonnet-4-6, prompt-cached rubric) grading task fit / quality / runnability | 0.0–1.0 \| None |
| `overall` | `0.30·files + 0.15·imports + 0.10·ruff + 0.20·pytest + 0.25·judge`, re-normalized over whichever signals are available | 0.0–1.0 |
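
The re-normalization in the `overall` row can be sketched as a weighted mean over the signals that are not `None`. The weights come from the table above; the function name, the dict-based input, and the zero fallback when nothing is measurable are assumptions made for illustration.

```python
from __future__ import annotations

# Weights taken from the `overall` formula in the scoring table.
WEIGHTS = {"files": 0.30, "imports": 0.15, "ruff": 0.10, "pytest": 0.20, "judge": 0.25}


def overall(signals: dict[str, bool | float | None]) -> float:
    """Weighted mean over the signals that are not None."""
    # Booleans become 1.0 / 0.0; None means the signal was unavailable.
    available = {k: float(v) for k, v in signals.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return 0.0  # assumption: a run with no measurable signal scores zero
    weighted = sum(WEIGHTS[k] * v for k, v in available.items())
    return weighted / total_weight


example = overall(
    {"files": True, "imports": True, "ruff": False, "pytest": None, "judge": 0.6}
)
```

In this example pytest is unavailable, so the remaining weights sum to 0.80 and the result is 0.60 / 0.80, about 0.75.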

All scoring subprocesses run in the task workdir with a 60 s timeout and `PYTHONDONTWRITEBYTECODE=1`. No autonoma/nexus sandbox is imported; the harness is deliberately light-touch.
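
A minimal sketch of what such a scorer call could look like, alongside the `ast.parse` check behind `imports_ok`. The helper names are hypothetical, and recursing into subdirectories and mapping a timeout to `None` are assumptions, not confirmed behavior of the harness.

```python
from __future__ import annotations

import ast
import os
import subprocess
import sys
from pathlib import Path


def imports_ok(workdir: Path) -> bool:
    """True if every *.py file under workdir parses with ast.parse."""
    for path in workdir.rglob("*.py"):  # assumption: subdirectories are included
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, ValueError):  # ValueError covers decode errors
            return False
    return True


def run_scorer(args: list[str], workdir: Path, timeout: float = 60.0) -> int | None:
    """Run a scorer command in workdir; return its exit code, or None on timeout."""
    env = {**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
    try:
        result = subprocess.run(
            args, cwd=workdir, env=env, timeout=timeout,
            capture_output=True, text=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

With these helpers, a ruff check would be something like `run_scorer([sys.executable, "-m", "ruff", "check", "."], workdir) == 0`.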

Graceful degradation:

No `ANTHROPIC_API_KEY` or no anthropic SDK → `judge_score = None`, and `overall` is re-normalized over the remaining signals.
ruff or pytest-timeout not installed → the corresponding signal becomes `None` instead of failing.
`--dry-run` short-circuits: no subprocess and no LLM judge, just a synthetic `Score(files_ok=True, imports_ok=True, overall=1.0)`.

Cost accounting

`pricing.py` holds a hardcoded model → ($/MTok input, $/MTok output) table covering claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001, gpt-4o, and gpt-4o-mini. These are list prices as of 2026-04-17, so re-check the vendor pages periodically. Unknown models yield `cost_usd = None` rather than a bogus zero.
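
The lookup can be pictured roughly as follows. The model names and prices in this sketch are placeholders chosen for easy arithmetic, not the contents of `pricing.py`, and the function signature is an assumption.

```python
from __future__ import annotations

# Placeholder entries for illustration only; values are
# (input $/MTok, output $/MTok) and do not reflect any vendor's prices.
PRICES: dict[str, tuple[float, float]] = {
    "example-large": (10.00, 30.00),
    "example-small": (1.00, 2.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float | None:
    """Price a run, or return None when the model is not in the table."""
    price = PRICES.get(model)
    if price is None:
        return None  # unknown model: report "no cost" rather than zero
    input_price, output_price = price
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Returning `None` for an unknown model keeps missing prices distinguishable from a run that really cost nothing.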

Persistence

Every `run` / `run-all` invocation appends rows to `./bench.db` (SQLite):

runs(id, ts, framework, task_id, model, wall_sec,
     input_tokens, output_tokens, cost_usd,
     files_produced, overall, score_json, workdir)
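
For readers who want to inspect or reproduce the table, here is a sketch using the standard-library `sqlite3` module. Only the column names come from the schema above; the column types, the `append_run` helper, and the in-memory database are assumptions for illustration.

```python
import json
import sqlite3
import time

# Column types are assumptions; only the column names come from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL, framework TEXT, task_id TEXT, model TEXT, wall_sec REAL,
    input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL,
    files_produced INTEGER, overall REAL, score_json TEXT, workdir TEXT
)
"""

COLUMNS = ["ts", "framework", "task_id", "model", "wall_sec", "input_tokens",
           "output_tokens", "cost_usd", "files_produced", "overall",
           "score_json", "workdir"]


def append_run(conn: sqlite3.Connection, row: dict) -> None:
    """Insert one result row; keys missing from row are stored as NULL."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO runs ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        [row.get(column) for column in COLUMNS],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # the harness itself writes to ./bench.db
conn.execute(SCHEMA)
append_run(conn, {
    "ts": time.time(), "framework": "autonoma", "task_id": "hello_cli",
    "overall": 0.9, "score_json": json.dumps({"files_ok": True}),
})
```

Storing unknown values such as `cost_usd` as NULL lets later aggregate queries skip them instead of averaging in zeros.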

Use

# List tasks
agent-bench tasks

# Run one task against every framework (parallel, cap = 3)
agent-bench run hello_cli

# Cap concurrency lower / higher
agent-bench run todo_api --max-concurrent 1

# Smoke-test the harness without calling any LLM or scorer subprocess
agent-bench run-all --dry-run

# Restrict to a single framework
agent-bench run todo_api -f autonoma

# Re-render a saved report
agent-bench report --results ./results.json

# Recent stored runs
agent-bench history --last 20

# Per-framework leaderboard from the stored DB
agent-bench leaderboard
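
The concurrency cap behind `--max-concurrent` can be pictured as a semaphore around each framework run. Whether the harness uses asyncio, threads, or processes is not stated here, so this asyncio version, along with `run_bounded` and `fake_run`, is purely an illustrative assumption.

```python
import asyncio


async def run_bounded(jobs, max_concurrent: int = 3):
    """Run coroutine factories in parallel, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_run(framework: str) -> str:
    """Stand-in for launching one framework's build command."""
    await asyncio.sleep(0.01)
    return f"{framework}: done"


frameworks = ["autonoma", "nexus", "agentforge"]
results = asyncio.run(
    run_bounded([lambda f=f: fake_run(f) for f in frameworks], max_concurrent=2)
)
```

Setting `max_concurrent=1` degrades this to a sequential run, which mirrors the `--max-concurrent 1` invocation shown above.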

Layout of a run

runs/<framework>/<task_id>/<timestamp>/
  output/           # generated project files (counted + scored)
  traces/           # autonoma's llm-calls.jsonl lands here

Example output

                                  agent-bench results
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ framework  ┃ task      ┃ status ┃ overall ┃ files ┃ imp ┃ ruff ┃ pytest ┃ judge ┃ tokens ┃ $cost  ┃ time(s) ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ autonoma   │ hello_cli │  OK    │   0.95  │   ✓   │  ✓  │  ✓   │  1.00  │  0.80 │ 12,345 │ $0.084 │    14.2 │
│ nexus      │ hello_cli │  OK    │   0.80  │   ✓   │  ✓  │  ✗   │    −   │  0.75 │  9,001 │ $0.045 │    11.7 │
│ agentforge │ hello_cli │  n/a   │     −   │   −   │  −  │  −   │    −   │    −  │      − │      − │     0.0 │
└────────────┴───────────┴────────┴─────────┴───────┴─────┴──────┴────────┴───────┴────────┴────────┴─────────┘

Icons: ✓ pass · ✗ fail · − not applicable / unknown.

Defining new tasks

Drop a YAML file in ./tasks/:

id: my_task
prompt: "Describe the thing the agent should build"
timeout_sec: 600
success_criteria:
  files_produced_min: 2
  must_contain_file: README.md
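
To make the `success_criteria` keys concrete, the sketch below shows one way a `files_ok` check could evaluate them against a run's output directory. Counting files recursively and resolving `must_contain_file` relative to the output directory are assumptions, as is the function itself; the criteria are passed as a plain dict so the example needs no YAML parser.

```python
from __future__ import annotations

from pathlib import Path


def files_ok(output_dir: Path, criteria: dict) -> bool:
    """Check files_produced_min and the optional must_contain_file."""
    # Assumption: every regular file under output_dir counts, at any depth.
    files = [p for p in output_dir.rglob("*") if p.is_file()]
    if len(files) < criteria.get("files_produced_min", 0):
        return False
    required = criteria.get("must_contain_file")
    # Assumption: the required path is relative to output_dir.
    if required is not None and not (output_dir / required).is_file():
        return False
    return True
```

For the task above, the call would be `files_ok(output_dir, {"files_produced_min": 2, "must_contain_file": "README.md"})`.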

Known limitations

agentforge currently returns a chat-style response and does not write project files. `AgentForgeRunner` marks these runs unsupported and leaves a TODO to wire up a future build command.
Token extraction for nexus is regex-based and uses a 70/30 input/output heuristic for cost estimation; if nexus changes its output format, the harness will silently report `None` for token counts.
Pricing in `pricing.py` is a static snapshot; update it when vendor list prices change.
