# agent-bench-lite

A lightweight AI agent evaluation benchmark toolkit.
agent-bench-lite provides a modular, extensible framework for evaluating AI agents across six core dimensions:
| Dimension | What it measures |
|---|---|
| Tool Calling Accuracy | Can the agent call the right tools with correct parameters? |
| Planning & Decomposition | Can the agent break complex tasks into logical steps? |
| Context Retention | Can the agent remember and use earlier context? |
| Error Recovery | Can the agent handle and recover from errors gracefully? |
| Instruction Following | Does the agent follow exact specifications? |
| Multi-step Reasoning | Can the agent chain logical steps to reach a conclusion? |
## Installation

```bash
# Base install (no LLM adapters)
pip install -e .
# With Anthropic adapter
pip install -e ".[anthropic]"
# With OpenAI adapter
pip install -e ".[openai]"
# Everything
pip install -e ".[all]"
# Development
pip install -e ".[dev]"
```

## Quick start

```python
import asyncio
from agent_bench_lite import BenchmarkRunner, EchoAdapter
async def main():
    adapter = EchoAdapter()
    runner = BenchmarkRunner(adapter=adapter)
    report = await runner.run()
    report.print_summary()
    report.save_json("results.json")

asyncio.run(main())
```
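The quick start uses `EchoAdapter`, which needs no API access. To benchmark a real model, swap in one of the optional adapters. The sketch below is only an assumption about the `[anthropic]` extra: the `AnthropicAdapter` name, its import path, and its constructor arguments are illustrative, so check the installed adapters package for the real interface.

```python
import asyncio

from agent_bench_lite import BenchmarkRunner
# Import path and class name are assumptions about the [anthropic] extra;
# check agent_bench_lite's adapters package for the real ones.
from agent_bench_lite.adapters import AnthropicAdapter

async def main():
    # Constructor arguments are assumed; the adapter may instead read
    # ANTHROPIC_API_KEY from the environment on its own.
    adapter = AnthropicAdapter(model="claude-sonnet-4-5")
    report = await BenchmarkRunner(adapter=adapter).run()
    report.print_summary()
    report.save_json("anthropic_results.json")

asyncio.run(main())
```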
Or use the example script:

```bash
python examples/run_benchmark.py
```

## Architecture

- Adapters wrap LLM APIs into a common interface (`BaseAdapter`)
- Dimensions define evaluation tasks and scoring logic (`BaseDimension`)
- Runner orchestrates task execution across dimensions (see the sketch below)
- Evaluator computes scores from raw results
- Reporter formats and exports results
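Because dimensions are pluggable, a run does not necessarily have to exercise all six. The sketch below shows one plausible way to restrict a run to a subset; the `dimensions` keyword and the dimension names passed to it are assumptions about `BenchmarkRunner`'s API rather than documented behaviour.

```python
import asyncio

from agent_bench_lite import BenchmarkRunner, EchoAdapter

async def main():
    # Hypothetical keyword argument: verify BenchmarkRunner's real signature
    # before relying on it. The dimension names are also illustrative.
    runner = BenchmarkRunner(
        adapter=EchoAdapter(),
        dimensions=["tool_calling", "error_recovery"],
    )
    report = await runner.run()
    report.print_summary()

asyncio.run(main())
```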
## Adding a custom dimension

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult
class MyDimension(BaseDimension):
    name = "my_dimension"
    description = "Evaluates something new"

    def get_tasks(self):
        return [...]

    async def evaluate_task(self, task, agent_response):
        return TaskResult(...)
```
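For a concrete (if contrived) illustration, here is what a filled-in dimension might look like. The dict-based task shape and the `TaskResult` field names below are assumptions; mirror whatever the built-in dimensions in `agent_bench_lite.dimensions` actually use.

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class WordLimitDimension(BaseDimension):
    name = "word_limit"
    description = "Checks that the agent respects an explicit word limit"

    def get_tasks(self):
        # Tasks are modelled as plain dicts here; the real Task type may
        # differ, so copy the structure used by the built-in dimensions.
        return [
            {"id": "wl-1", "prompt": "Describe HTTP in at most 15 words.", "limit": 15},
        ]

    async def evaluate_task(self, task, agent_response):
        # Assumes agent_response is plain text.
        passed = len(agent_response.split()) <= task["limit"]
        # TaskResult field names are assumed; check its definition in
        # agent_bench_lite/dimensions/base.py.
        return TaskResult(
            task_id=task["id"],
            score=1.0 if passed else 0.0,
            passed=passed,
        )
```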
## Adding a custom adapter

```python
from agent_bench_lite.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):
    async def send_message(self, messages, tools=None):
        ...

    async def send_message_with_tools(self, messages, tools, tool_handler):
        ...
```

## License

MIT