# agent-bench-lite

A lightweight AI agent evaluation benchmark toolkit.

## Overview

`agent-bench-lite` provides a modular, extensible framework for evaluating AI agents on 30 tasks across six core dimensions:

| Dimension | What it measures |
|---|---|
| Tool Calling Accuracy | Can the agent call the right tools with correct parameters? |
| Planning & Decomposition | Can the agent break complex tasks into logical steps? |
| Context Retention | Can the agent remember and use earlier context? |
| Error Recovery | Can the agent handle and recover from errors gracefully? |
| Instruction Following | Does the agent follow exact specifications? |
| Multi-step Reasoning | Can the agent chain logical steps to reach a conclusion? |

## Installation

```bash
# Base install (no LLM adapters)
pip install -e .

# With Anthropic adapter
pip install -e ".[anthropic]"

# With OpenAI adapter
pip install -e ".[openai]"

# Everything
pip install -e ".[all]"

# Development
pip install -e ".[dev]"
```

## Quick Start

```python
import asyncio
from agent_bench_lite import BenchmarkRunner, EchoAdapter

async def main():
    adapter = EchoAdapter()
    runner = BenchmarkRunner(adapter=adapter)
    report = await runner.run()
    report.print_summary()
    report.save_json("results.json")

asyncio.run(main())
```

Or use the example script:

```bash
python examples/run_benchmark.py
```
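The bundled `EchoAdapter` simply reflects prompts back, which is handy for smoke-testing the harness. To benchmark a real model, swap in a provider adapter; as a hedged sketch only, the class name `AnthropicAdapter` and its import path below are assumptions, so check `agent_bench_lite.adapters` for the actual exports once the `anthropic` extra is installed.

```python
import asyncio
from agent_bench_lite import BenchmarkRunner
# Hypothetical import: assumes the anthropic extra exposes an AnthropicAdapter
from agent_bench_lite.adapters import AnthropicAdapter

async def main():
    # Constructor arguments are assumptions; the adapter may instead read
    # ANTHROPIC_API_KEY from the environment.
    adapter = AnthropicAdapter()
    runner = BenchmarkRunner(adapter=adapter)
    report = await runner.run()
    report.print_summary()

asyncio.run(main())
```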

## Architecture

- **Adapters** wrap LLM APIs into a common interface (`BaseAdapter`)
- **Dimensions** define evaluation tasks and scoring logic (`BaseDimension`)
- **Runner** orchestrates task execution across dimensions
- **Evaluator** computes scores from raw results
- **Reporter** formats and exports results

### Adding a new dimension

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class MyDimension(BaseDimension):
    name = "my_dimension"
    description = "Evaluates something new"

    def get_tasks(self):
        return [...]

    async def evaluate_task(self, task, agent_response):
        return TaskResult(...)
```
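As a concrete sketch, here is a dimension with a single task. The task shape (plain dicts) and the `TaskResult` field names (`task_id`, `score`, `passed`) are assumptions made for illustration; check the real dataclasses in `agent_bench_lite/dimensions/base.py` and adjust.

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class ArithmeticDimension(BaseDimension):
    """Hypothetical dimension: does the agent get basic arithmetic right?"""
    name = "arithmetic"
    description = "Evaluates basic arithmetic accuracy"

    def get_tasks(self):
        # Task shape is an assumption; plain dicts keep the sketch self-contained.
        return [{"id": "add-1", "prompt": "What is 17 + 25?", "answer": "42"}]

    async def evaluate_task(self, task, agent_response):
        # Simple substring check: full credit if the expected answer appears.
        passed = task["answer"] in agent_response
        return TaskResult(task_id=task["id"],
                          score=1.0 if passed else 0.0,
                          passed=passed)
```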

### Adding a new adapter

```python
from agent_bench_lite.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):
    async def send_message(self, messages, tools=None):
        ...

    async def send_message_with_tools(self, messages, tools, tool_handler):
        ...
```
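As a minimal working sketch, the adapter below answers without calling any LLM. It assumes `messages` is a list of `{"role", "content"}` dicts and that `send_message` returns a plain string; mirror `EchoAdapter` for the actual contract.

```python
from agent_bench_lite.adapters.base import BaseAdapter

class UppercaseAdapter(BaseAdapter):
    """Hypothetical adapter that shouts the last user message back."""

    async def send_message(self, messages, tools=None):
        # Assumes OpenAI-style message dicts with "role" and "content" keys.
        last_user = next(m for m in reversed(messages) if m["role"] == "user")
        return last_user["content"].upper()

    async def send_message_with_tools(self, messages, tools, tool_handler):
        # No real tool use here: a production adapter would loop, letting the
        # model request tools and feeding tool_handler results back into the
        # conversation until the model produces a final answer.
        return await self.send_message(messages, tools)
```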

## License

MIT
