Benchmark LLM agents across providers on speed, cost, and quality — and get an honest report card.
Most LLM "evals" focus on one axis: cost, or latency, or a single accuracy number.
agent-bench runs the same standardized tasks against multiple providers and reports all three, side by side, with charts.
If you're looking for cost-only tracking, `agent-cost-cli` is the right tool. `agent-bench` is complementary — it measures actual answer quality and end-to-end speed on top of cost.
Built by Rapid Claw — deploy and benchmark AI agents in production.
- Runs standardized benchmark suites against any combination of LLM providers.
- Collects latency (TTFT, total), throughput (tokens/sec), cost per task, accuracy, and format compliance.
- Produces a report card as a terminal table, Markdown, CSV, or a self-contained HTML page with bar and radar charts.
- Lets you define your own benchmark tasks in a short YAML file.
| Suite | What it measures |
|---|---|
| `reasoning` | Math + logic problems with short final answers. Exact-match grading. |
| `coding` | Generates a Python function, extracts it, and verifies it against tests. |
| `writing` | Coherence heuristics + strict format compliance (bullet counts, length). |
| `tool-use` | Function-calling accuracy — did the model pick the right tool & args? |
| Provider | Install extra | Auth |
|---|---|---|
| OpenAI | `agent-bench[openai]` | `OPENAI_API_KEY` |
| Anthropic | `agent-bench[anthropic]` | `ANTHROPIC_API_KEY` |
| Google Gemini | `agent-bench[gemini]` | `GOOGLE_API_KEY` |
| Ollama (local) | `agent-bench[ollama]` | none (local daemon) |
Each SDK is imported lazily, so you only install what you use.
```bash
# core + all providers
pip install 'agent-bench[all]'

# or pick and choose
pip install 'agent-bench[openai,anthropic]'
```

From source:

```bash
git clone https://github.com/arcane-bear/agent-bench
cd agent-bench
pip install -e '.[dev,all]'
```

```bash
# run reasoning + coding against two providers, write HTML + MD reports
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
agent-bench run \
  --providers openai,anthropic \
  --suite reasoning --suite coding \
  --output results.json \
  --html report.html \
  --md report.md
```

Pin specific models with `provider:model` syntax:

```bash
agent-bench run \
  --providers 'openai:gpt-4o-mini,anthropic:claude-haiku-4-5,gemini:gemini-1.5-flash,ollama:llama3.2' \
  --suite reasoning
```

Regenerate reports later without re-running:

```bash
agent-bench report results.json --html report.html --md report.md --csv rows.csv
```

List built-in suites:

```bash
agent-bench list-suites
```

Terminal:
```text
agent-bench — report card
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━┓
┃ Provider   ┃ Model               ┃ Pass      ┃ Accuracy ┃ Format ┃ TTFT (s) ┃ Total (s) ┃ tok/s ┃ Cost ($) ┃ Err ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━┩
│ anthropic  │ claude-haiku-4-5    │ 8/9 (89%) │ 0.91     │ 1.00   │ 0.34     │ 2.10      │ 85.2  │ 0.0041   │ 0   │
│ openai     │ gpt-4o-mini         │ 7/9 (78%) │ 0.83     │ 0.94   │ 0.21     │ 1.75      │ 110.4 │ 0.0008   │ 0   │
│ gemini     │ gemini-1.5-flash    │ 6/9 (67%) │ 0.74     │ 0.89   │ 0.29     │ 1.62      │ 120.9 │ 0.0004   │ 0   │
│ ollama     │ llama3.2            │ 4/9 (44%) │ 0.52     │ 0.78   │ 0.05     │ 4.90      │ 28.7  │ 0.0000   │ 0   │
└────────────┴─────────────────────┴───────────┴──────────┴────────┴──────────┴───────────┴───────┴──────────┴─────┘
```
HTML report (written by `--html report.html`) shows:
- Overall table (same data as terminal).
- Bar charts: accuracy, latency, throughput, and total cost per provider.
- A radar chart plotting each provider's accuracy profile across all suites — great for spotting "good at reasoning, weak at tool-use" patterns at a glance.
Markdown report (`--md report.md`) is the same table, plus a per-suite accuracy breakdown, which pastes cleanly into GitHub issues or PRs.
Drop a YAML file anywhere and point `--custom` at it:

```yaml
# examples/custom_suite.yaml
name: my-team-evals
tasks:
  - id: capital-france
    system: "Answer with just the city name."
    prompt: "What is the capital of France?"
    expected:
      contains: "paris"
    max_tokens: 32

  - id: json-output
    system: "Respond with a JSON object only."
    prompt: 'Return {"status": "ok", "version": 1} exactly.'
    expected:
      regex: '"status"\s*:\s*"ok"'
      format_regex: '^\{'
    max_tokens: 64
```

Then:

```bash
agent-bench run \
  --providers openai,anthropic \
  --custom examples/custom_suite.yaml \
  --html report.html
```

Supported `expected` checks:
| Key | Meaning |
|---|---|
| `contains` | Substring must appear in the output (case-insensitive). |
| `equals` | Whole output must equal the value (whitespace-normalized). |
| `regex` | Regex must match somewhere in the output. |
| `format_regex` | Regex presence → contributes to the format compliance score. |
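These four checks could be combined into a grade roughly like this (a hypothetical `grade` helper; the real scorer's internals may differ):

```python
import re

def grade(output: str, expected: dict) -> tuple[bool, bool]:
    """Return (accuracy_pass, format_pass) for one task output."""
    ok = True
    if "contains" in expected:
        # Substring check, case-insensitive.
        ok &= expected["contains"].lower() in output.lower()
    if "equals" in expected:
        # Whole-output equality after whitespace normalization.
        ok &= " ".join(output.split()) == " ".join(expected["equals"].split())
    if "regex" in expected:
        # Regex must match somewhere in the output.
        ok &= re.search(expected["regex"], output) is not None
    # format_regex feeds the format compliance score, not accuracy.
    fmt = True
    if "format_regex" in expected:
        fmt = re.search(expected["format_regex"], output) is not None
    return ok, fmt
```

Keeping format compliance separate from accuracy lets a model get the right answer in the wrong shape and have the report show exactly that.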
```python
from agent_bench import BenchmarkRunner
from agent_bench.providers import load_provider
from agent_bench.suites import load_suite
from agent_bench.reports import write_html

runner = BenchmarkRunner(
    providers=[load_provider("openai:gpt-4o-mini"), load_provider("anthropic")],
    suites=[load_suite("reasoning"), load_suite("coding")],
)
result = runner.run()
result.save("results.json")

# regenerate an HTML report from saved results
import json
write_html(json.load(open("results.json")), "report.html")
```

| Metric | How it's measured |
|---|---|
| TTFT (s) | Wall clock from request start to first streamed token. |
| Total (s) | Wall clock for the full completion, including stream close. |
| tokens/sec | `completion_tokens / total_s`. |
| cost (USD) | Provider-reported token counts × a small built-in pricing table. |
| accuracy | Per-suite check — exact-match, unit-test pass rate, or regex match. |
| format compliance | Suite-specific: did output match the required shape (e.g. `Answer:`)? |
Pricing lives in `src/agent_bench/metrics.py` — update it as vendors change prices.
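The cost column is then a straight multiply of reported token counts against that table. A sketch, assuming prices in USD per million tokens (the values below are illustrative, not the package's real table):

```python
# Illustrative per-million-token prices; the real table lives in metrics.py.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def task_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """cost = provider-reported token counts x per-million-token prices."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000
```

Because costs come from a static table rather than live billing APIs, stale entries silently skew the Cost column — hence the note above about keeping it updated.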
```bash
pip install -e '.[dev]'
pytest
ruff check src tests
```

Tests use a `FakeProvider`, so the full pipeline (runner → scoring → reports) is exercised without hitting any real APIs.
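A FakeProvider along these lines returns canned, deterministic responses so the latency, token, and cost fields can flow through the whole pipeline offline (a hypothetical sketch; see the test suite for the real double):

```python
import time

class FakeProvider:
    """Deterministic stand-in for a real LLM client, for offline tests."""
    name = "fake"
    model = "fake-1"

    def __init__(self, canned: dict[str, str]):
        self.canned = canned  # prompt -> fixed response text

    def complete(self, prompt: str) -> dict:
        start = time.perf_counter()
        text = self.canned.get(prompt, "UNKNOWN")
        return {
            "text": text,
            "ttft_s": 0.0,                       # no network, effectively instant
            "total_s": time.perf_counter() - start,
            "completion_tokens": len(text.split()),
            "cost_usd": 0.0,                     # fake calls are free
        }
```

Since responses are fixed per prompt, scoring assertions stay stable across runs while still exercising the same metric fields a real provider would emit.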
Learn more about LLM performance benchmarking on the Rapid Claw blog.
Visit rapidclaw.dev for the full suite of AI agent deployment tools.
MIT — see LICENSE.