Benchmarking framework for AI agents. Measure performance, accuracy, cost, and latency across different agent implementations.
When building AI agents, developers face several measurement challenges:
- No standard benchmarks: Each team creates ad-hoc tests, making comparisons meaningless
- Hidden costs: Token usage and API costs often aren't tracked systematically
- Performance blind spots: Latency and throughput go unmeasured until production
- Regression detection: Hard to know when an agent update makes things worse
- Model comparison: Switching between GPT-4, Claude, or local models requires rebuilding tests
agent-bench provides a standardized framework to measure what matters: success rate, cost, latency, and capability across different task categories.
```bash
npm install -g agent-bench
```

Or use directly with npx:

```bash
npx agent-bench run -s coding -m gpt-4o
```

Run the coding benchmark suite against GPT-4o:

```bash
export OPENAI_API_KEY=sk-...
agent-bench run --suite coding --model gpt-4o
```

Run against Claude:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
agent-bench run --suite reasoning --model claude-sonnet-4-20250514
```

Compare multiple agents:

```bash
agent-bench run -s all -m gpt-4o -o results/gpt4o.json
agent-bench run -s all -m claude-sonnet-4-20250514 -o results/claude.json
agent-bench compare results/*.json
```

| Suite | Tasks | Description |
|---|---|---|
| reasoning | 5 | Math, logic puzzles, sequences, word problems |
| coding | 5 | Code generation, debugging, complexity analysis |
| data | 3 | JSON/CSV parsing, regex extraction, transformation |
| tool-use | 2 | Function calling, parameter identification, sequencing |
| research | 2 | Comparison, explanation, synthesis |
| all | 17 | All built-in tasks combined |
List available suites:

```bash
agent-bench list
```

All commands at a glance:

```bash
# Run a benchmark
agent-bench run -s <suite> -m <model>

# Compare results
agent-bench compare result1.json result2.json

# List available suites
agent-bench list

# Create a custom suite template
agent-bench init my-suite.yaml

# Create an agent config template
agent-bench agent-config my-agent.yaml
```

A full `run` invocation:

```bash
agent-bench run \
  --suite coding \
  --model gpt-4o \
  --type openai \
  --name "GPT-4o Agent" \
  --timeout 300 \
  --retries 1 \
  --output results.json \
  --format markdown \
  --verbose
```

| Flag | Description |
|---|---|
| `--suite` | Suite name or file path |
| `--model` | Model name |
| `--type` | Adapter type (`openai`, `claude`, `cli`) |
| `--name` | Display name for reports |
| `--timeout` | Task timeout in seconds |
| `--retries` | Number of retries for failed tasks |
| `--output` | Save results to a file |
| `--format` | Output format (`text`, `json`, `markdown`) |
| `--verbose` | Show detailed task results |

Create an agent config file:
```yaml
# agent.yaml
type: openai
name: my-agent
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1
systemPrompt: |
  You are a helpful assistant.
  Be concise in your responses.
```

Run with the config:

```bash
agent-bench run -s coding -c agent.yaml
```

Create custom suites for domain-specific testing:
```yaml
# my-suite.yaml
name: sql-benchmark
version: "1.0.0"
description: SQL query generation tests
config:
  timeout: 60
  retries: 1
tasks:
  - id: sql-select
    name: Basic SELECT
    description: Generate a simple SELECT query
    category: coding
    difficulty: easy
    prompt: |
      Write a SQL query to select all users older than 21
      from a table called 'users' with columns: id, name, age.
      Return only the SQL query.
    validation:
      type: regex
      pattern: "SELECT.*FROM.*users.*WHERE.*age.*>.*21"
  - id: sql-join
    name: JOIN Query
    description: Generate a query with JOIN
    category: coding
    difficulty: medium
    prompt: |
      Write a SQL query to get all orders with customer names.
      Tables: orders (id, customer_id, amount), customers (id, name).
    validation:
      type: contains
      expected: "JOIN|join"
```

Validation types:

- exact: Output must match the expected string exactly
- contains: Output must contain the expected text (use `|` for alternatives)
- regex: Output must match the regex pattern
- function: Custom validator module (advanced)
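The exact contract for `function` validators isn't documented here; a minimal sketch, assuming agent-bench loads the module's default export and treats a truthy return value on the raw agent output as a pass (the file name and signature are illustrative):

```typescript
// sql-validator.ts -- hypothetical custom validator module.
// Assumption: agent-bench calls the default export with the agent's raw
// output string and interprets a truthy result as a passing task.
export default function validate(output: string): boolean {
  const sql = output.trim().toUpperCase();
  // Pass only if the query selects from `users` and filters on age > 21.
  return (
    sql.startsWith("SELECT") &&
    sql.includes("FROM USERS") &&
    /AGE\s*>\s*21/.test(sql)
  );
}
```

A function validator like this is useful when a single regex gets unreadable, e.g. when several clauses must all be present regardless of ordering.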
Use agent-bench as a library:
```typescript
import {
  BenchmarkRunner,
  loadSuite,
  OpenAIAdapter,
  ClaudeAdapter,
  formatReport
} from 'agent-bench';

// Create an adapter
const adapter = new OpenAIAdapter({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY
});

// Load a suite
const suite = await loadSuite('coding');

// Run the benchmark
const runner = new BenchmarkRunner({
  timeout: 300,
  onTaskComplete: (result) => {
    console.log(`${result.taskName}: ${result.passed ? 'PASS' : 'FAIL'}`);
  }
});
const result = await runner.run(suite, adapter);

// Format and print the report
console.log(formatReport(result, { verbose: true }));
```

Implement the `AgentAdapter` interface for any agent:
```typescript
import { AgentAdapter, BenchmarkTask, AgentResponse } from 'agent-bench';

class MyCustomAdapter implements AgentAdapter {
  name = 'my-custom-agent';

  async init(): Promise<void> {
    // Setup code
  }

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();

    // Call your agent
    const output = await myAgent.complete(task.prompt);

    return {
      output,
      latencyMs: Date.now() - start,
      tokensIn: 100,  // If available
      tokensOut: 50,
      cost: 0.001     // USD
    };
  }

  async cleanup(): Promise<void> {
    // Teardown code
  }
}
```
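The same pattern can wrap a local model server. A sketch for an Ollama-style endpoint; the URL, request body, and response shape are assumptions about the local server, not part of agent-bench, and the interfaces are redeclared minimally here so the snippet is self-contained (in a real project, import them from 'agent-bench' as above):

```typescript
// Minimal copies of the agent-bench types, so this sketch stands alone.
interface BenchmarkTask { prompt: string; }
interface AgentResponse {
  output: string;
  latencyMs: number;
  cost?: number;
}
interface AgentAdapter {
  name: string;
  init(): Promise<void>;
  run(task: BenchmarkTask): Promise<AgentResponse>;
  cleanup(): Promise<void>;
}

// Hypothetical adapter for a local Ollama-style server.
class LocalModelAdapter implements AgentAdapter {
  name = 'local-llama';
  private model: string;

  constructor(model: string = 'llama3') {
    this.model = model;
  }

  async init(): Promise<void> {}

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: this.model,
        prompt: task.prompt,
        stream: false
      })
    });
    const data = await res.json() as { response: string };
    return {
      output: data.response,
      latencyMs: Date.now() - start,
      cost: 0  // Local inference has no per-token API cost.
    };
  }

  async cleanup(): Promise<void> {}
}
```

Because cost is reported as zero, comparing such an adapter against a hosted model makes the cost/quality trade-off explicit in the benchmark report.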
Example text report:

```
============================================================
BENCHMARK RESULTS: coding
============================================================
Agent: gpt-4o
Timestamp: 2024-01-15T10:30:00Z
Duration: 45.23s

--- Summary ---
Tasks: 5
Passed: 4 (80.0%)
Failed: 1
Errors: 0

--- Metrics ---
Tokens In: 2,340
Tokens Out: 1,892
Total Cost: $0.0234
Avg Latency: 1245ms
```
Use `--format json` for machine-readable output, ideal for CI/CD integration.
Use `--format markdown` for documentation or GitHub issues.
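JSON results can also be diffed programmatically, e.g. to flag regressions between a baseline and a candidate run. A sketch, assuming each file contains a `summary` object with a percentage `successRate` field (the field the CI check below reads with jq; the file paths and threshold are illustrative):

```typescript
import { readFileSync } from 'node:fs';

// Assumed shape of a saved result file: summary.successRate is a percent,
// matching the `.summary.successRate` jq path used in CI.
interface ResultFile {
  summary: { successRate: number };
}

// Positive delta: the candidate improved; negative: it regressed.
function successRateDelta(baselinePath: string, candidatePath: string): number {
  const load = (p: string): ResultFile =>
    JSON.parse(readFileSync(p, 'utf8'));
  return (
    load(candidatePath).summary.successRate -
    load(baselinePath).summary.successRate
  );
}

// Example usage (hypothetical paths):
// const delta = successRateDelta('results/baseline.json', 'results/pr.json');
// if (delta < -5) process.exit(1);  // fail CI on a >5-point drop
```

This is the same check as the shell/jq step below, just easier to extend with per-suite or per-category thresholds.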
Run benchmarks in GitHub Actions:
```yaml
# .github/workflows/benchmark.yml
name: Agent Benchmark
on:
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # Daily
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install agent-bench
        run: npm install -g agent-bench
      - name: Run benchmarks
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agent-bench run -s all -m gpt-4o-mini -o results.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json
      - name: Check success rate
        run: |
          RATE=$(jq '.summary.successRate' results.json)
          if (( $(echo "$RATE < 80" | bc -l) )); then
            echo "Success rate $RATE% is below threshold"
            exit 1
          fi
```

- Portable: Works with any LLM provider or local model
- Extensible: Add custom suites and adapters easily
- Measurable: Track success rate, cost, and latency together
- Reproducible: Same suite gives comparable results across runs
- Minimal: No heavy dependencies, fast to install and run
MIT