# agent-bench

Benchmarking framework for AI agents. Measure performance, accuracy, cost, and latency across different agent implementations.

## The Problem

When building AI agents, developers face several measurement challenges:

1. **No standard benchmarks**: Each team creates ad-hoc tests, making comparisons meaningless
2. **Hidden costs**: Token usage and API costs often aren't tracked systematically
3. **Performance blind spots**: Latency and throughput go unmeasured until production
4. **Regression detection**: It's hard to know when an agent update makes things worse
5. **Model comparison**: Switching between GPT-4, Claude, or local models requires rebuilding tests

agent-bench provides a standardized framework to measure what matters: success rate, cost, latency, and capability across different task categories.

## Installation

```bash
npm install -g agent-bench
```

Or use directly with npx:

```bash
npx agent-bench run -s coding -m gpt-4o
```

## Quick Start

Run the coding benchmark suite against GPT-4o:

```bash
export OPENAI_API_KEY=sk-...
agent-bench run --suite coding --model gpt-4o
```

Run against Claude:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
agent-bench run --suite reasoning --model claude-sonnet-4-20250514
```

Compare multiple agents:

```bash
agent-bench run -s all -m gpt-4o -o results/gpt4o.json
agent-bench run -s all -m claude-sonnet-4-20250514 -o results/claude.json
agent-bench compare results/*.json
```

## Built-in Suites

| Suite | Tasks | Description |
| --- | --- | --- |
| reasoning | 5 | Math, logic puzzles, sequences, word problems |
| coding | 5 | Code generation, debugging, complexity analysis |
| data | 3 | JSON/CSV parsing, regex extraction, transformation |
| tool-use | 2 | Function calling, parameter identification, sequencing |
| research | 2 | Comparison, explanation, synthesis |
| all | 17 | All built-in tasks combined |

List available suites:

```bash
agent-bench list
```

## Usage

### Basic Commands

```bash
# Run a benchmark
agent-bench run -s <suite> -m <model>

# Compare results
agent-bench compare result1.json result2.json

# List available suites
agent-bench list

# Create a custom suite template
agent-bench init my-suite.yaml

# Create an agent config template
agent-bench agent-config my-agent.yaml
```

### Run Options

```bash
agent-bench run \
  --suite coding \              # Suite name or file path
  --model gpt-4o \              # Model name
  --type openai \               # Adapter type (openai, claude, cli)
  --name "GPT-4o Agent" \       # Display name for reports
  --timeout 300 \               # Task timeout in seconds
  --retries 1 \                 # Retry failed tasks
  --output results.json \       # Save results
  --format markdown \           # Output format (text, json, markdown)
  --verbose                     # Show detailed task results
```

### Using Config Files

Create an agent config file:

```yaml
# agent.yaml
type: openai
name: my-agent
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1
systemPrompt: |
  You are a helpful assistant.
  Be concise in your responses.
```

Run with config:

```bash
agent-bench run -s coding -c agent.yaml
```

## Custom Benchmark Suites

Create custom suites for domain-specific testing:

```yaml
# my-suite.yaml
name: sql-benchmark
version: "1.0.0"
description: SQL query generation tests

config:
  timeout: 60
  retries: 1

tasks:
  - id: sql-select
    name: Basic SELECT
    description: Generate a simple SELECT query
    category: coding
    difficulty: easy
    prompt: |
      Write a SQL query to select all users older than 21
      from a table called 'users' with columns: id, name, age.
      Return only the SQL query.
    validation:
      type: regex
      pattern: "SELECT.*FROM.*users.*WHERE.*age.*>.*21"

  - id: sql-join
    name: JOIN Query
    description: Generate a query with JOIN
    category: coding
    difficulty: medium
    prompt: |
      Write a SQL query to get all orders with customer names.
      Tables: orders (id, customer_id, amount), customers (id, name).
    validation:
      type: contains
      expected: "JOIN|join"
```

### Validation Types

- **exact**: Output must match the expected string exactly
- **contains**: Output must contain the expected text (use `|` for alternatives)
- **regex**: Output must match the regex pattern
- **function**: Custom validator module (advanced)
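The string-based validation modes can be sketched as follows. This is an illustrative re-implementation of the documented semantics (including `|` alternatives for `contains`), not the library's actual code:

```typescript
// Illustrative sketch of the three string-based validation modes.
// The library's real implementation may differ in details.
type Validation =
  | { type: 'exact'; expected: string }
  | { type: 'contains'; expected: string }
  | { type: 'regex'; pattern: string };

function validate(output: string, v: Validation): boolean {
  switch (v.type) {
    case 'exact':
      return output.trim() === v.expected;
    case 'contains':
      // `|` separates alternatives; any single match counts as a pass.
      return v.expected.split('|').some((alt) => output.includes(alt));
    case 'regex':
      return new RegExp(v.pattern).test(output);
  }
}

// Example: the JOIN task above accepts either casing.
validate('SELECT * FROM orders JOIN customers ON ...', {
  type: 'contains',
  expected: 'JOIN|join',
}); // true
```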

## Programmatic API

Use agent-bench as a library:

```typescript
import {
  BenchmarkRunner,
  loadSuite,
  OpenAIAdapter,
  ClaudeAdapter,
  formatReport
} from 'agent-bench';

// Create an adapter
const adapter = new OpenAIAdapter({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY
});

// Load a suite
const suite = await loadSuite('coding');

// Run benchmark
const runner = new BenchmarkRunner({
  timeout: 300,
  onTaskComplete: (result) => {
    console.log(`${result.taskName}: ${result.passed ? 'PASS' : 'FAIL'}`);
  }
});

const result = await runner.run(suite, adapter);

// Format and print report
console.log(formatReport(result, { verbose: true }));
```

## Custom Adapters

Implement the `AgentAdapter` interface for any agent:

```typescript
import { AgentAdapter, BenchmarkTask, AgentResponse } from 'agent-bench';

class MyCustomAdapter implements AgentAdapter {
  name = 'my-custom-agent';

  async init(): Promise<void> {
    // Setup code
  }

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();

    // Call your agent
    const output = await myAgent.complete(task.prompt);

    return {
      output,
      latencyMs: Date.now() - start,
      tokensIn: 100,  // If available
      tokensOut: 50,
      cost: 0.001     // USD
    };
  }

  async cleanup(): Promise<void> {
    // Teardown code
  }
}
```
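When developing a custom suite, it can help to dry-run it against a stub adapter before spending tokens on a real model. The sketch below is self-contained, so the `BenchmarkTask` and `AgentResponse` shapes are declared locally for illustration; in real use, import them from `agent-bench` and pass the adapter to `runner.run(suite, adapter)` as in the Programmatic API section:

```typescript
// Local stand-ins for the agent-bench types, to keep the sketch
// self-contained. Import the real ones from 'agent-bench' in practice.
interface BenchmarkTask { id: string; prompt: string; }
interface AgentResponse {
  output: string;
  latencyMs: number;
  tokensIn?: number;
  tokensOut?: number;
  cost?: number;
}

// A stub adapter that returns canned responses keyed by task id,
// useful for checking a custom suite's validation rules at zero cost.
class MockAdapter {
  name = 'mock-agent';

  constructor(private responses: Record<string, string>) {}

  async init(): Promise<void> {}

  async run(task: BenchmarkTask): Promise<AgentResponse> {
    const start = Date.now();
    return {
      output: this.responses[task.id] ?? '',
      latencyMs: Date.now() - start,
      cost: 0,
    };
  }

  async cleanup(): Promise<void> {}
}
```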

## Output Formats

### Text (default)

```
============================================================
BENCHMARK RESULTS: coding
============================================================

Agent:      gpt-4o
Timestamp:  2024-01-15T10:30:00Z
Duration:   45.23s

--- Summary ---
Tasks:      5
Passed:     4 (80.0%)
Failed:     1
Errors:     0

--- Metrics ---
Tokens In:  2,340
Tokens Out: 1,892
Total Cost: $0.0234
Avg Latency: 1245ms
```

### JSON

Use `--format json` for machine-readable output, ideal for CI/CD integration.
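For a JavaScript-based pipeline, the JSON report can be gated in Node instead of bash/jq. A minimal sketch, assuming the report exposes `summary.successRate` as a percentage (the same field the jq example in the CI/CD section reads):

```typescript
import { readFileSync } from 'node:fs';

// Extract the success rate from a serialized JSON report.
// Assumption: the report schema includes summary.successRate.
function successRate(reportJson: string): number {
  return JSON.parse(reportJson).summary.successRate;
}

// Fail the build when success rate drops below a threshold.
function gate(path: string, minRate: number): void {
  const rate = successRate(readFileSync(path, 'utf8'));
  if (rate < minRate) {
    console.error(`Success rate ${rate}% is below threshold ${minRate}%`);
    process.exit(1);
  }
}
```

Call `gate('results.json', 80)` after a benchmark run; a non-zero exit fails the CI job.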

### Markdown

Use `--format markdown` for documentation or GitHub issues.

## CI/CD Integration

Run benchmarks in GitHub Actions:

```yaml
# .github/workflows/benchmark.yml
name: Agent Benchmark

on:
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # Daily

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install agent-bench
        run: npm install -g agent-bench

      - name: Run benchmarks
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agent-bench run -s all -m gpt-4o-mini -o results.json

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json

      - name: Check success rate
        run: |
          RATE=$(jq '.summary.successRate' results.json)
          if (( $(echo "$RATE < 80" | bc -l) )); then
            echo "Success rate $RATE% is below threshold"
            exit 1
          fi
```

## Design Principles

1. **Portable**: Works with any LLM provider or local model
2. **Extensible**: Add custom suites and adapters easily
3. **Measurable**: Track success rate, cost, and latency together
4. **Reproducible**: The same suite gives comparable results across runs
5. **Minimal**: No heavy dependencies; fast to install and run

## License

MIT
