Benchmark harness that sends the same natural-language task to
autonoma, nexus, and
agentforge and reports multi-signal quality scores /
cost / wall time / files produced side-by-side, with parallel execution
and a persistent run history.
| framework | one-shot build CLI | integration |
|---|---|---|
| autonoma | `autonoma build "<prompt>" --output <dir> --no-animate` | supported |
| nexus | `nexus build "<prompt>" --output <dir> --no-tui` | supported |
| agentforge | no project-scaffolding command (only chat-style run) | unsupported: runner returns a `status="unsupported"` result; replace once a build command exists |
Token usage is best-effort:
- autonoma: summed (input + output) from `traces/*/llm-calls.jsonl`; model captured from trace rows.
- nexus: parsed from the `Total Tokens` row printed by `_print_summary`; split 70/30 heuristically and priced against `NEXUS_MODEL` (default `claude-sonnet-4-6`).
- agentforge: n/a until a one-shot build command lands.

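The nexus path can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's actual code: the function names, the regex, and the shape of the summary row are hypothetical; only the `Total Tokens` label and the 70/30 input/output split come from this README.

```python
from __future__ import annotations

import re

# Hypothetical pattern: assumes the summary row looks like "Total Tokens | 9,001".
TOTAL_TOKENS_RE = re.compile(r"Total Tokens\D*([\d,]+)")


def parse_total_tokens(stdout: str) -> int | None:
    """Return the total token count found in captured output, or None if absent."""
    match = TOTAL_TOKENS_RE.search(stdout)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def split_tokens(total: int, input_share: float = 0.70) -> tuple[int, int]:
    """Split a total into (input, output) tokens using the 70/30 heuristic."""
    input_tokens = round(total * input_share)
    return input_tokens, total - input_tokens
```

Returning `None` when the row is missing keeps a format change in nexus from being reported as a zero-token run.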
cd agent-bench
uv sync # or: pip install -e .
uv pip install -e '.[score]' # optional: ruff / pytest / anthropic for real scoring
export ANTHROPIC_API_KEY=sk-ant-...

Every run produces a `Score` object with five signal fields plus an aggregate:
| signal | how it's measured | type |
|---|---|---|
| `files_ok` | task's `files_produced_min` + optional `must_contain_file` satisfied | bool |
| `imports_ok` | every `*.py` in the workdir parses cleanly with `ast.parse` | bool |
| `ruff_ok` | `python -m ruff check <workdir>` exits 0 (or `None` if ruff isn't installed) | bool \| None |
| `pytest_score` | passed / total from `pytest --maxfail=5 --timeout=30` with a 60s wall-cap; `None` if none | 0.0–1.0 \| None |
| `judge_score` | LLM-as-judge (`claude-sonnet-4-6`, prompt-cached rubric) grading task fit / quality / runnability | 0.0–1.0 \| None |
| `overall` | 0.30·files + 0.15·imports + 0.10·ruff + 0.20·pytest + 0.25·judge, re-normalized over whichever signals are available | 0.0–1.0 |

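The re-normalization in the `overall` row amounts to a weighted mean over the signals that are not `None`. A minimal sketch, assuming booleans count as 0.0 / 1.0; the `overall` function name, the dict-based interface, and the zero fallback are illustrative choices, not the harness's actual API:

```python
from __future__ import annotations

# Weights taken from the `overall` row of the table.
WEIGHTS = {"files": 0.30, "imports": 0.15, "ruff": 0.10, "pytest": 0.20, "judge": 0.25}


def overall(signals: dict[str, bool | float | None]) -> float:
    """Weighted mean over available signals; None entries are dropped."""
    available = {name: float(value) for name, value in signals.items() if value is not None}
    total_weight = sum(WEIGHTS[name] for name in available)
    if total_weight == 0:
        return 0.0  # assumption: with nothing measurable, score zero
    weighted = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted / total_weight
```

With this scheme a run that passes every available check still reaches 1.0 when, say, the judge is unavailable, because the judge's 0.25 weight is removed from the denominator as well.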
All subprocess calls for scoring run in the task workdir with a 60 s
timeout and PYTHONDONTWRITEBYTECODE=1. No autonoma/nexus sandbox is
imported — the harness is deliberately light-touch.
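A scoring subprocess with these constraints might look like the sketch below. The helper names are hypothetical, and treating a timeout or a missing executable as `None` is an assumption about how the harness maps failures; only the 60 s timeout, the workdir, and `PYTHONDONTWRITEBYTECODE=1` come from the description above.

```python
import os
import subprocess
import sys


def run_in_workdir(cmd, workdir, timeout=60.0):
    """Run cmd inside workdir; return its exit code, or None on timeout or missing executable."""
    # Inherit the current environment and disable .pyc writes in the workdir.
    env = {**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
    try:
        proc = subprocess.run(
            cmd, cwd=workdir, env=env, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return proc.returncode


def ruff_exit_code(workdir):
    """Exit code of `python -m ruff check .` run from inside workdir."""
    return run_in_workdir([sys.executable, "-m", "ruff", "check", "."], workdir)
```

Note that `python -m ruff` exits non-zero when ruff is not installed, so distinguishing "lint failed" from "ruff missing" would need a separate availability check.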
Graceful degradation:
- No `ANTHROPIC_API_KEY` or no `anthropic` SDK → `judge_score = None`, `overall` re-normalized over the remaining signals.
- No `ruff` / `pytest_timeout` installed → those signals become `None` instead of failing.
- `--dry-run` short-circuits: no subprocess, no LLM judge, a synthetic `Score(files_ok=True, imports_ok=True, overall=1.0)`.

pricing.py holds a hardcoded model → ($/MTok input, $/MTok output)
table covering claude-opus-4-7, claude-sonnet-4-6,
claude-haiku-4-5-20251001, gpt-4o, gpt-4o-mini (list prices as of
2026-04-17 — re-check the vendor pages periodically). Unknown models
yield cost_usd = None rather than a bogus zero.
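The lookup behaviour can be sketched as follows. The `PRICES` entry is a placeholder with made-up rates, not a real vendor price, and the `cost_usd` function signature is an assumption rather than the actual contents of `pricing.py`:

```python
from __future__ import annotations

# model -> ($/MTok input, $/MTok output). Placeholder values, NOT real list prices.
PRICES: dict[str, tuple[float, float]] = {
    "example-model": (1.00, 2.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             prices: dict[str, tuple[float, float]] = PRICES) -> float | None:
    """Return the estimated cost in USD, or None when the model is not in the table."""
    rates = prices.get(model)
    if rates is None:
        return None  # unknown model: no price rather than a misleading 0.0
    input_rate, output_rate = rates
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```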
Every run / run-all invocation appends rows to ./bench.db (sqlite):
runs(id, ts, framework, task_id, model, wall_sec,
input_tokens, output_tokens, cost_usd,
files_produced, overall, score_json, workdir)
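Appending a row to this table could look like the sketch below. The column types, the auto-incrementing `id`, and the `append_run` helper are assumptions for illustration; the README only specifies the column names.

```python
import sqlite3

# Column types are guesses; only the names come from the schema above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL, framework TEXT, task_id TEXT, model TEXT, wall_sec REAL,
    input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL,
    files_produced INTEGER, overall REAL, score_json TEXT, workdir TEXT
)
"""

COLUMNS = [
    "ts", "framework", "task_id", "model", "wall_sec", "input_tokens",
    "output_tokens", "cost_usd", "files_produced", "overall", "score_json", "workdir",
]


def append_run(db_path, row):
    """Insert one run (dict keyed by column name) and return its row id."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO runs ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    values = [row.get(col) for col in COLUMNS]  # missing keys are stored as NULL
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(SCHEMA)
            return conn.execute(sql, values).lastrowid
    finally:
        conn.close()
```

Storing unknown values as NULL matches the convention of reporting `None` for unavailable cost or token counts.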
# List tasks
agent-bench tasks
# Run one task against every framework (parallel, cap = 3)
agent-bench run hello_cli
# Cap concurrency lower / higher
agent-bench run todo_api --max-concurrent 1
# Smoke-test the harness without calling any LLM or scorer subprocess
agent-bench run-all --dry-run
# Restrict to a single framework
agent-bench run todo_api -f autonoma
# Re-render a saved report
agent-bench report --results ./results.json
# Recent stored runs
agent-bench history --last 20
# Per-framework leaderboard from the stored DB
agent-bench leaderboard

runs/<framework>/<task_id>/<timestamp>/
  output/   # generated project files (counted + scored)
  traces/   # autonoma's llm-calls.jsonl lands here

agent-bench results
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
┃ framework ┃ task ┃ status ┃ overall ┃ files ┃ imp ┃ ruff ┃ pytest ┃ judge ┃ tokens ┃ $cost ┃ time(s) ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
│ autonoma │ hello_cli │ OK │ 0.85 │ ✓ │ ✓ │ ✓ │ 1.00 │ 0.80 │ 12,345 │ $0.084 │ 14.2 │
│ nexus │ hello_cli │ OK │ 0.72 │ ✓ │ ✓ │ ✗ │ − │ 0.75 │ 9,001 │ $0.045 │ 11.7 │
│ agentforge │ hello_cli │ n/a │ − │ − │ − │ − │ − │ − │ − │ − │ 0.0 │
└────────────┴───────────┴────────┴─────────┴───────┴─────┴──────┴────────┴───────┴────────┴───────┴─────────┘
Icons: ✓ pass · ✗ fail · − not applicable / unknown.
Drop a YAML file in ./tasks/:
id: my_task
prompt: "Describe the thing the agent should build"
timeout_sec: 600
success_criteria:
  files_produced_min: 2
  must_contain_file: README.md

Known limitations:

- `agentforge` currently returns a chat-style response and does not write project files. The `AgentForgeRunner` marks these runs `unsupported` and leaves a `TODO` to wire up a future build command.
- Token extraction for nexus is regex-based and uses a 70/30 input/output heuristic for cost estimation; if the framework changes its output format the harness will silently report `None`.
- Pricing in `pricing.py` is a static snapshot; update it when vendor list prices change.