Benchmarking framework for AI agents. Measure performance, accuracy, cost, and latency across different agent implementations.
When building AI agents, developers face several measurement challenges:
- No standard benchmarks: Each team creates ad-hoc tests, making comparisons meaningless
- Hidden costs: Token usage and API costs often aren't tracked systematically
- Performance blind spots: Latency and throughput go unmeasured until production
- Regression detection: Hard to know when an agent update makes things worse
- Model comparison: Switching between GPT-4, Claude, or local models requires rebuilding tests
agent-bench provides a standardized framework to measure what matters: success rate, cost, latency, and capability across different task categories.
```bash
npm install -g agent-bench
```

Or use directly with npx:

```bash
npx agent-bench run -s coding -m gpt-4o
```

Run the coding benchmark suite against GPT-4o:

```bash
export OPENAI_API_KEY=sk-...
agent-bench run --suite coding --model gpt-4o
```

Run against Claude:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
agent-bench run --suite reasoning --model claude-sonnet-4-20250514
```

Compare multiple agents:

```bash
agent-bench run -s all -m gpt-4o -o results/gpt4o.json
agent-bench run -s all -m claude-sonnet-4-20250514 -o results/claude.json
agent-bench compare results/*.json
```

| Suite | Tasks | Description |
|---|---|---|
| reasoning | 5 | Math, logic puzzles, sequences, word problems |
| coding | 5 | Code generation, debugging, complexity analysis |
| data | 3 | JSON/CSV parsing, regex extraction, transformation |
| tool-use | 2 | Function calling, parameter identification, sequencing |
| research | 2 | Comparison, explanation, synthesis |
| all | 17 | All built-in tasks combined |
List available suites:

```bash
agent-bench list
```

All commands at a glance:

```bash
# Run a benchmark
agent-bench run -s <suite> -m <model>

# Compare results
agent-bench compare result1.json result2.json

# List available suites
agent-bench list

# Create a custom suite template
agent-bench init my-suite.yaml

# Create an agent config template
agent-bench agent-config my-agent.yaml
```

A full `run` invocation:

```bash
agent-bench run \
  --suite coding \
  --model gpt-4o \
  --type openai \
  --name "GPT-4o Agent" \
  --timeout 300 \
  --retries 1 \
  --output results.json \
  --format markdown \
  --verbose
```

| Flag | Description |
|---|---|
| `--suite` | Suite name or file path |
| `--model` | Model name |
| `--type` | Adapter type (`openai`, `claude`, `cli`) |
| `--name` | Display name for reports |
| `--timeout` | Task timeout in seconds |
| `--retries` | Number of retries for failed tasks |
| `--output` | Save results to a file |
| `--format` | Output format (`text`, `json`, `markdown`) |
| `--verbose` | Show detailed task results |

Create an agent config file:
```yaml
# agent.yaml
type: openai
name: my-agent
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1
systemPrompt: |
  You are a helpful assistant.
  Be concise in your responses.
```

Run with the config:

```bash
agent-bench run -s coding -c agent.yaml
```

Create custom suites for domain-specific testing:
```yaml
# my-suite.yaml
name: sql-benchmark
version: "1.0.0"
description: SQL query generation tests
config:
  timeout: 60
  retries: 1
tasks:
  - id: sql-select
    name: Basic SELECT
    description: Generate a simple SELECT query
    category: coding
    difficulty: easy
    prompt: |
      Write a SQL query to select all users older than 21
      from a table called 'users' with columns: id, name, age.
      Return only the SQL query.
    validation:
      type: regex
      pattern: "SELECT.*FROM.*users.*WHERE.*age.*>.*21"
  - id: sql-join
    name: JOIN Query
    description: Generate a query with JOIN
    category: coding
    difficulty: medium
    prompt: |
      Write a SQL query to get all orders with customer names.
      Tables: orders (id, customer_id, amount), customers (id, name).
    validation:
      type: contains
      expected: "JOIN|join"
```

Validation types:

- exact: Output must match the expected string exactly
- contains: Output must contain the expected text (use `|` for alternatives)
- regex: Output must match the regex pattern
- function: Custom validator module (advanced)
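The exact contract for `function` validators isn't documented here; a minimal sketch, assuming agent-bench loads the module's default export and treats a truthy return value on the raw agent output as a pass (the file name and signature are illustrative):

```typescript
// sql-validator.ts -- hypothetical custom validator module.
// Assumption: agent-bench calls the default export with the agent's raw
// output string and interprets a truthy result as a passing task.
export default function validate(output: string): boolean {
  const sql = output.trim().toUpperCase();
  // Pass only if the query selects from `users` and filters on age > 21.
  return (
    sql.startsWith("SELECT") &&
    sql.includes("FROM USERS") &&
    /AGE\s*>\s*21/.test(sql)
  );
}
```

A function validator like this is useful when a single regex gets unreadable, e.g. when several clauses must all be present regardless of ordering.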
Use agent-bench as a library:
```typescript
import {
  BenchmarkRunner,
  loadSuite,
  OpenAIAdapter,
  ClaudeAdapter,
  formatReport
} from 'agent-bench';

// Create an adapter
const adapter = new OpenAIAdapter({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY
});

// Load a suite
const suite = await loadSuite('coding');

// Run the benchmark
const runner = new BenchmarkRunner({
  timeout: 300,
  onTaskComplete: (result) => {
    console.log(`${result.taskName}: ${result.passed ? 'PASS' : 'FAIL'}`);
  }
});
const result = await runner.run(suite, adapter);

// Format and print the report
console.log(formatReport(result, { verbose: true }));
```

Implement the `AgentAdapter` interface for any agent:
```typescript
import { AgentAdapter, BenchmarkTask, AgentResponse } from 'agent-bench';

class MyCustomAdapter implements AgentAdapter {
  name = 'my-custom-agent';

  async init(): Promise<void> {
    // Setup code
  }

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();

    // Call your agent
    const output = await myAgent.complete(task.prompt);

    return {
      output,
      latencyMs: Date.now() - start,
      tokensIn: 100,  // If available
      tokensOut: 50,
      cost: 0.001     // USD
    };
  }

  async cleanup(): Promise<void> {
    // Teardown code
  }
}
```
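The same pattern can wrap a local model server. A sketch for an Ollama-style endpoint; the URL, request body, and response shape are assumptions about the local server, not part of agent-bench, and the interfaces are redeclared minimally here so the snippet is self-contained (in a real project, import them from 'agent-bench' as above):

```typescript
// Minimal copies of the agent-bench types, so this sketch stands alone.
interface BenchmarkTask { prompt: string; }
interface AgentResponse {
  output: string;
  latencyMs: number;
  cost?: number;
}
interface AgentAdapter {
  name: string;
  init(): Promise<void>;
  run(task: BenchmarkTask): Promise<AgentResponse>;
  cleanup(): Promise<void>;
}

// Hypothetical adapter for a local Ollama-style server.
class LocalModelAdapter implements AgentAdapter {
  name = 'local-llama';
  private model: string;

  constructor(model: string = 'llama3') {
    this.model = model;
  }

  async init(): Promise<void> {}

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: this.model,
        prompt: task.prompt,
        stream: false
      })
    });
    const data = await res.json() as { response: string };
    return {
      output: data.response,
      latencyMs: Date.now() - start,
      cost: 0  // Local inference has no per-token API cost.
    };
  }

  async cleanup(): Promise<void> {}
}
```

Because cost is reported as zero, comparing such an adapter against a hosted model makes the cost/quality trade-off explicit in the benchmark report.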
Example text report:

```
============================================================
BENCHMARK RESULTS: coding
============================================================
Agent: gpt-4o
Timestamp: 2024-01-15T10:30:00Z
Duration: 45.23s

--- Summary ---
Tasks: 5
Passed: 4 (80.0%)
Failed: 1
Errors: 0

--- Metrics ---
Tokens In: 2,340
Tokens Out: 1,892
Total Cost: $0.0234
Avg Latency: 1245ms
```
Use `--format json` for machine-readable output, ideal for CI/CD integration.
Use `--format markdown` for documentation or GitHub issues.
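JSON results can also be diffed programmatically, e.g. to flag regressions between a baseline and a candidate run. A sketch, assuming each file contains a `summary` object with a percentage `successRate` field (the field the CI check below reads with jq; the file paths and threshold are illustrative):

```typescript
import { readFileSync } from 'node:fs';

// Assumed shape of a saved result file: summary.successRate is a percent,
// matching the `.summary.successRate` jq path used in CI.
interface ResultFile {
  summary: { successRate: number };
}

// Positive delta: the candidate improved; negative: it regressed.
function successRateDelta(baselinePath: string, candidatePath: string): number {
  const load = (p: string): ResultFile =>
    JSON.parse(readFileSync(p, 'utf8'));
  return (
    load(candidatePath).summary.successRate -
    load(baselinePath).summary.successRate
  );
}

// Example usage (hypothetical paths):
// const delta = successRateDelta('results/baseline.json', 'results/pr.json');
// if (delta < -5) process.exit(1);  // fail CI on a >5-point drop
```

This is the same check as the shell/jq step below, just easier to extend with per-suite or per-category thresholds.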
Run benchmarks in GitHub Actions:
```yaml
# .github/workflows/benchmark.yml
name: Agent Benchmark
on:
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # Daily
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install agent-bench
        run: npm install -g agent-bench
      - name: Run benchmarks
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agent-bench run -s all -m gpt-4o-mini -o results.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json
      - name: Check success rate
        run: |
          RATE=$(jq '.summary.successRate' results.json)
          if (( $(echo "$RATE < 80" | bc -l) )); then
            echo "Success rate $RATE% is below threshold"
            exit 1
          fi
```

- Portable: Works with any LLM provider or local model
- Extensible: Add custom suites and adapters easily
- Measurable: Track success rate, cost, and latency together
- Reproducible: Same suite gives comparable results across runs
- Minimal: No heavy dependencies, fast to install and run
MIT