# agent-bench-lite

A lightweight AI agent evaluation benchmark toolkit.
agent-bench-lite provides a modular, extensible framework for evaluating AI agents across six core dimensions:
| Dimension | What it measures |
|---|---|
| Tool Calling Accuracy | Can the agent call the right tools with correct parameters? |
| Planning & Decomposition | Can the agent break complex tasks into logical steps? |
| Context Retention | Can the agent remember and use earlier context? |
| Error Recovery | Can the agent handle and recover from errors gracefully? |
| Instruction Following | Does the agent follow exact specifications? |
| Multi-step Reasoning | Can the agent chain logical steps to reach a conclusion? |
## Installation

```bash
# Base install (no LLM adapters)
pip install -e .
# With Anthropic adapter
pip install -e ".[anthropic]"
# With OpenAI adapter
pip install -e ".[openai]"
# Everything
pip install -e ".[all]"
# Development
pip install -e ".[dev]"
```

## Quick start

```python
import asyncio
from agent_bench_lite import BenchmarkRunner, EchoAdapter
async def main():
    adapter = EchoAdapter()
    runner = BenchmarkRunner(adapter=adapter)
    report = await runner.run()
    report.print_summary()
    report.save_json("results.json")

asyncio.run(main())
```
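The quick start uses `EchoAdapter`, which needs no API access. To benchmark a real model, swap in one of the optional adapters. The sketch below is only an assumption about the `[anthropic]` extra: the `AnthropicAdapter` name, its import path, and its constructor arguments are illustrative, so check the installed adapters package for the real interface.

```python
import asyncio

from agent_bench_lite import BenchmarkRunner
# Import path and class name are assumptions about the [anthropic] extra;
# check agent_bench_lite's adapters package for the real ones.
from agent_bench_lite.adapters import AnthropicAdapter

async def main():
    # Constructor arguments are assumed; the adapter may instead read
    # ANTHROPIC_API_KEY from the environment on its own.
    adapter = AnthropicAdapter(model="claude-sonnet-4-5")
    report = await BenchmarkRunner(adapter=adapter).run()
    report.print_summary()
    report.save_json("anthropic_results.json")

asyncio.run(main())
```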
Or use the example script:

```bash
python examples/run_benchmark.py
```

## Architecture

- Adapters wrap LLM APIs into a common interface (`BaseAdapter`)
- Dimensions define evaluation tasks and scoring logic (`BaseDimension`)
- Runner orchestrates task execution across dimensions (see the sketch below)
- Evaluator computes scores from raw results
- Reporter formats and exports results
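Because dimensions are pluggable, a run does not necessarily have to exercise all six. The sketch below shows one plausible way to restrict a run to a subset; the `dimensions` keyword and the dimension names passed to it are assumptions about `BenchmarkRunner`'s API rather than documented behaviour.

```python
import asyncio

from agent_bench_lite import BenchmarkRunner, EchoAdapter

async def main():
    # Hypothetical keyword argument: verify BenchmarkRunner's real signature
    # before relying on it. The dimension names are also illustrative.
    runner = BenchmarkRunner(
        adapter=EchoAdapter(),
        dimensions=["tool_calling", "error_recovery"],
    )
    report = await runner.run()
    report.print_summary()

asyncio.run(main())
```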
## Adding a custom dimension

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult
class MyDimension(BaseDimension):
    name = "my_dimension"
    description = "Evaluates something new"

    def get_tasks(self):
        return [...]

    async def evaluate_task(self, task, agent_response):
        return TaskResult(...)
```
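For a concrete (if contrived) illustration, here is what a filled-in dimension might look like. The dict-based task shape and the `TaskResult` field names below are assumptions; mirror whatever the built-in dimensions in `agent_bench_lite.dimensions` actually use.

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class WordLimitDimension(BaseDimension):
    name = "word_limit"
    description = "Checks that the agent respects an explicit word limit"

    def get_tasks(self):
        # Tasks are modelled as plain dicts here; the real Task type may
        # differ, so copy the structure used by the built-in dimensions.
        return [
            {"id": "wl-1", "prompt": "Describe HTTP in at most 15 words.", "limit": 15},
        ]

    async def evaluate_task(self, task, agent_response):
        # Assumes agent_response is plain text.
        passed = len(agent_response.split()) <= task["limit"]
        # TaskResult field names are assumed; check its definition in
        # agent_bench_lite/dimensions/base.py.
        return TaskResult(
            task_id=task["id"],
            score=1.0 if passed else 0.0,
            passed=passed,
        )
```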
## Adding a custom adapter

```python
from agent_bench_lite.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):
    async def send_message(self, messages, tools=None):
        ...

    async def send_message_with_tools(self, messages, tools, tool_handler):
        ...
```

## License

MIT