agent_eval

A reproducible benchmark for Claude Code skill bundles (.claude/skills/ directories) and CLAUDE.md configurations. Runs a fixed task set against Anthropic, OpenAI, and Google models in a sandboxed environment and reports pass@k, cost, latency, and a handful of adversarial flags.

Apache-2.0.

Install

pip install -e .
docker build -f sandbox/Dockerfile.base -t agenteval-sandbox:base sandbox/

The sandbox is Docker-based by default. Without Docker the harness falls back to a local-subprocess mode (no isolation, dev only); set AGENTEVAL_SANDBOX=local to silence the fallback warning.

Use

# Dry run.
agenteval dry-run --skills none --tasks skill-specific-v1 --model claude-opus-4-7

# Full run (set ANTHROPIC_API_KEY first).
agenteval eval --skills ./.claude/skills/ --tasks skill-specific-v1 \
    --model claude-opus-4-7 --out result.json

# Canonicalise + verify a submission.
agenteval submit ./result.json
agenteval verify ./result.entry.json --skills ./.claude/skills/ --tasks skill-specific-v1

Other runners: --runner openai --model gpt-5.2, --runner google --model gemini-3-pro. Exploratory mode for non-leaderboard seed sweeps: --exploratory --seeds N.

What's measured

pass@1, pass@5 (Chen et al. 2021 unbiased estimator), pass^5 (TAU-Bench-style reliability), cost, latency, tool-call count, timeout rate. All point estimates carry bootstrapped 95% CIs. Eight descriptive flags (high-variance, talkative, tool-storm, pricing-stale, model-drift, borderline-stability, holdout-divergence, passive). No scalar rank.

See docs/methodology.md for the protocol, including the two-panel leaderboard split (uncontaminated primary; SWE-bench-Lite as secondary, marked contaminated).

What's not measured (yet)

Code style/aesthetics, long-horizon (>5 min) tasks, multi-modal, LLM-as-judge subjective grades. v2 may revisit some of these.

Submitting a result

Submission is PR-based. Run agenteval submit ./result.json, commit the resulting .entry.json under frontend/data/submissions/, open a PR. CI re-verifies; merge after the verifier agrees.

Layout

src/                   harness, runners, metrics, sandbox, grading
tasks/skill-specific-v1/   20 hand-curated task YAMLs
sandbox/               Docker base image
frontend/              Next.js leaderboard (static export)
docs/                  methodology, task/metric/sandbox/reproducibility specs
pricing.yaml           per-(provider, model) token prices

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github		.github
docs		docs
frontend		frontend
sandbox		sandbox
scripts		scripts
skills-baseline/no-skills		skills-baseline/no-skills
src		src
tasks		tasks
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
constraints.txt		constraints.txt
pricing.yaml		pricing.yaml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

agent_eval

Install

Use

What's measured

What's not measured (yet)

Submitting a result

Layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

agent_eval

Install

Use

What's measured

What's not measured (yet)

Submitting a result

Layout

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages