dlekdns08/agent-bench

agent-bench

Benchmark harness that sends the same natural-language task to autonoma, nexus, and agentforge, then reports multi-signal quality scores, cost, wall time, and files produced side by side. Runs execute in parallel and are recorded in a persistent run history.

Status of each framework

| framework | one-shot build CLI | integration |
|---|---|---|
| autonoma | autonoma build "<prompt>" --output <dir> --no-animate | supported |
| nexus | nexus build "<prompt>" --output <dir> --no-tui | supported |
| agentforge | no project-scaffolding command (only chat-style run) | unsupported; the runner returns a status="unsupported" result (replace once a build command exists) |

Token usage is best-effort:

  • autonoma: summed (input + output) from traces/*/llm-calls.jsonl; model captured from trace rows.
  • nexus: parsed from the Total Tokens row printed by _print_summary; split 70/30 heuristically and priced against NEXUS_MODEL (default claude-sonnet-4-6).
  • agentforge: n/a until a one-shot build command lands.
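
The nexus heuristic above can be sketched roughly as follows. This is an illustration, not the harness's actual parser: the regex, the function name `split_total_tokens`, and the assumed shape of the `Total Tokens` row are all hypothetical, and the real nexus summary may be formatted differently.

```python
import re

# Assumption: the summary contains the literal label "Total Tokens" followed,
# after some non-digit separators, by a possibly comma-grouped integer.
TOTAL_TOKENS_RE = re.compile(r"Total Tokens\D*([\d,]+)")


def split_total_tokens(summary_text, input_share=0.70):
    """Return (input_tokens, output_tokens), or None if no row is found."""
    match = TOTAL_TOKENS_RE.search(summary_text)
    if match is None:
        # Mirrors the documented behavior: an unrecognized format yields None.
        return None
    total = int(match.group(1).replace(",", ""))
    input_tokens = round(total * input_share)
    # Give the remainder to output so the two parts always sum to the total.
    return input_tokens, total - input_tokens
```

Returning `None` on a failed match, rather than zero, keeps a format change in the framework distinguishable from a run that genuinely used no tokens.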

Install

cd agent-bench
uv sync                          # or: pip install -e .
uv pip install -e '.[score]'     # optional: ruff / pytest / anthropic for real scoring
export ANTHROPIC_API_KEY=sk-ant-...

Scoring signals

Every run produces a Score object with the five signal fields below plus an aggregate:

| signal | how it's measured | type |
|---|---|---|
| files_ok | task's files_produced_min + optional must_contain_file satisfied | bool |
| imports_ok | every *.py in the workdir parses cleanly with ast.parse | bool |
| ruff_ok | python -m ruff check <workdir> exits 0 (or None if ruff isn't installed) | bool or None |
| pytest_score | passed / total from pytest --maxfail=5 --timeout=30 with a 60s wall-cap; None if no tests are found | 0.0–1.0 or None |
| judge_score | LLM-as-judge (claude-sonnet-4-6, prompt-cached rubric) grading task fit / quality / runnability | 0.0–1.0 or None |
| overall | 0.30·files + 0.15·imports + 0.10·ruff + 0.20·pytest + 0.25·judge, re-normalized over whichever signals are available | 0.0–1.0 |
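
As a minimal sketch of the re-normalization, assuming boolean signals count as 1.0 or 0.0 and that `None` means "unavailable": the weights come from the table, while the `WEIGHTS` dict and the function name `overall_score` are illustrative and may not match the harness's own code.

```python
from typing import Optional

# Weights as listed in the scoring table.
WEIGHTS = {
    "files_ok": 0.30,
    "imports_ok": 0.15,
    "ruff_ok": 0.10,
    "pytest_score": 0.20,
    "judge_score": 0.25,
}


def overall_score(signals: dict) -> Optional[float]:
    """Weighted mean over the signals that are not None."""
    weighted = 0.0
    weight_sum = 0.0
    for name, weight in WEIGHTS.items():
        value = signals.get(name)
        if value is None:
            continue  # unavailable signal: drop its weight from the denominator
        weighted += weight * float(value)  # True -> 1.0, False -> 0.0
        weight_sum += weight
    return weighted / weight_sum if weight_sum else None
```

Dividing by the sum of the weights actually used means a run without ruff or a judge is not penalized for the missing signals; it is simply scored on what could be measured.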

All subprocess calls for scoring run in the task workdir with a 60 s timeout and PYTHONDONTWRITEBYTECODE=1. No autonoma/nexus sandbox is imported — the harness is deliberately light-touch.
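
A rough sketch of the two mechanisms described above, using only the standard library. The helper names `imports_ok` and `run_scorer` are hypothetical, and the recursive file search and the choice to return `None` on timeout are assumptions rather than confirmed harness behavior.

```python
import ast
import os
import subprocess
from pathlib import Path


def imports_ok(workdir: Path) -> bool:
    """True if every *.py file under workdir parses with ast.parse."""
    for path in workdir.rglob("*.py"):  # assumption: search is recursive
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, ValueError):  # ValueError covers bad encodings
            return False
    return True


def run_scorer(cmd, workdir: Path, timeout: float = 60.0):
    """Run one scoring command in workdir; None on timeout or missing binary."""
    env = {**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
    try:
        return subprocess.run(
            cmd, cwd=workdir, env=env,
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
```

A caller could then treat `run_scorer([sys.executable, "-m", "ruff", "check", "."], workdir)` returning a result with `returncode == 0` as a pass.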

Graceful degradation:

  • No ANTHROPIC_API_KEY or no anthropic SDK → judge_score = None, overall re-normalized over the remaining signals.
  • No ruff / pytest_timeout installed → those signals become None instead of failing.
  • --dry-run short-circuits: no subprocess, no LLM judge, a synthetic Score(files_ok=True, imports_ok=True, overall=1.0).

Cost accounting

pricing.py holds a hardcoded model → ($/MTok input, $/MTok output) table covering claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001, gpt-4o, gpt-4o-mini (list prices as of 2026-04-17 — re-check the vendor pages periodically). Unknown models yield cost_usd = None rather than a bogus zero.
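
The lookup might look something like the sketch below. The `PRICES` entry is a placeholder with made-up rates, not a real list price, and the names `PRICES` and `cost_usd` as a function are assumptions about how pricing.py is organized.

```python
from typing import Optional

# Placeholder table: model -> ($ per MTok input, $ per MTok output).
# These numbers are illustrative only and are NOT vendor list prices.
PRICES = {"example-model": (1.00, 2.00)}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             prices: dict = PRICES) -> Optional[float]:
    """Return the estimated cost in USD, or None for an unknown model."""
    rates = prices.get(model)
    if rates is None:
        return None  # unknown model: report "unknown", never a bogus zero
    in_rate, out_rate = rates
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```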

Persistence

Every run / run-all invocation appends rows to ./bench.db (sqlite):

runs(id, ts, framework, task_id, model, wall_sec,
     input_tokens, output_tokens, cost_usd,
     files_produced, overall, score_json, workdir)
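
For illustration, the table could be created and appended to with the standard sqlite3 module roughly as follows. The column names come from the schema above; the column types, the `append_run` helper, and the use of a Unix timestamp for `ts` are assumptions.

```python
import sqlite3
import time

# Column types are guesses; only the column names come from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL, framework TEXT, task_id TEXT, model TEXT, wall_sec REAL,
    input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL,
    files_produced INTEGER, overall REAL, score_json TEXT, workdir TEXT
)
"""

COLUMNS = (
    "ts", "framework", "task_id", "model", "wall_sec",
    "input_tokens", "output_tokens", "cost_usd",
    "files_produced", "overall", "score_json", "workdir",
)


def append_run(conn: sqlite3.Connection, **fields) -> int:
    """Insert one run row; unspecified columns are stored as NULL."""
    conn.execute(SCHEMA)
    row = {**dict.fromkeys(COLUMNS), "ts": time.time(), **fields}
    placeholders = ", ".join(f":{name}" for name in COLUMNS)
    cur = conn.execute(
        f"INSERT INTO runs ({', '.join(COLUMNS)}) VALUES ({placeholders})", row
    )
    conn.commit()
    return cur.lastrowid
```

Storing unknown values such as `cost_usd` as NULL keeps later aggregates (for example `AVG(cost_usd)` per framework) from being skewed by placeholder zeros.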

Use

# List tasks
agent-bench tasks

# Run one task against every framework (parallel, cap = 3)
agent-bench run hello_cli

# Cap concurrency lower / higher
agent-bench run todo_api --max-concurrent 1

# Smoke-test the harness without calling any LLM or scorer subprocess
agent-bench run-all --dry-run

# Restrict to a single framework
agent-bench run todo_api -f autonoma

# Re-render a saved report
agent-bench report --results ./results.json

# Recent stored runs
agent-bench history --last 20

# Per-framework leaderboard from the stored DB
agent-bench leaderboard

Layout of a run

runs/<framework>/<task_id>/<timestamp>/
  output/           # generated project files (counted + scored)
  traces/           # autonoma's llm-calls.jsonl lands here
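
A small sketch of how such a run directory could be created and its output counted. The helper names are hypothetical, and the timestamp format in particular is an assumption; the harness may format `<timestamp>` differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_run_dirs(root: Path, framework: str, task_id: str, now=None):
    """Create runs/<framework>/<task_id>/<timestamp>/{output,traces}."""
    # Assumption: a compact UTC stamp; the real format is not documented.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / "runs" / framework / task_id / stamp
    output_dir = run_dir / "output"
    traces_dir = run_dir / "traces"
    output_dir.mkdir(parents=True, exist_ok=True)
    traces_dir.mkdir(parents=True, exist_ok=True)
    return output_dir, traces_dir


def count_files(output_dir: Path) -> int:
    """Count regular files under output/, recursing into subdirectories."""
    return sum(1 for path in output_dir.rglob("*") if path.is_file())
```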

Example output

                                  agent-bench results
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ framework  ┃ task      ┃ status ┃ overall ┃ files ┃ imp ┃ ruff ┃ pytest ┃ judge ┃ tokens ┃ $cost  ┃ time(s) ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ autonoma   │ hello_cli │  OK    │   0.85  │   ✓   │  ✓  │  ✓   │  1.00  │  0.80 │ 12,345 │ $0.084 │    14.2 │
│ nexus      │ hello_cli │  OK    │   0.72  │   ✓   │  ✓  │  ✗   │    −   │  0.75 │  9,001 │ $0.045 │    11.7 │
│ agentforge │ hello_cli │  n/a   │     −   │   −   │  −  │  −   │    −   │    −  │      − │      − │     0.0 │
└────────────┴───────────┴────────┴─────────┴───────┴─────┴──────┴────────┴───────┴────────┴────────┴─────────┘

Icons: ✓ pass · ✗ fail · − not applicable / unknown.

Defining new tasks

Drop a YAML file in ./tasks/:

id: my_task
prompt: "Describe the thing the agent should build"
timeout_sec: 600
success_criteria:
  files_produced_min: 2
  must_contain_file: README.md
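
To show how the success_criteria block might feed the files_ok signal, here is a hedged sketch that takes the criteria as an already-parsed dict. The function name `files_ok` and the matching of `must_contain_file` by file name anywhere under the output directory are assumptions, not confirmed harness behavior.

```python
from pathlib import Path


def files_ok(output_dir: Path, criteria: dict) -> bool:
    """Check files_produced_min and the optional must_contain_file."""
    files = [path for path in output_dir.rglob("*") if path.is_file()]
    if len(files) < criteria.get("files_produced_min", 0):
        return False
    required = criteria.get("must_contain_file")
    if required is None:
        return True  # the file requirement is optional
    # Assumption: a match on the bare file name at any depth is enough.
    return any(path.name == required for path in files)
```

With the YAML above, the criteria dict would be `{"files_produced_min": 2, "must_contain_file": "README.md"}`.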

Known limitations

  • agentforge currently returns a chat-style response and does not write project files. The AgentForgeRunner marks these runs unsupported and leaves a TODO to wire up a future build command.
  • Token extraction for nexus is regex-based and uses a 70/30 input/output heuristic for cost estimation; if the framework changes its output format the harness will silently report None.
  • Pricing in pricing.py is a static snapshot — update it when vendor list prices change.
