agent-spec

Test harness for .claude/ directories. Sandboxes projects, runs agents, scores results, and iteratively improves instructions until agents succeed without human intervention.

Requirements

Claude Code CLI
Python 3.12+
Node.js 18+ (for JavaScript challenges)
ANTHROPIC_API_KEY environment variable set

Quick Start

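# Clone the repository and fetch the third-party submodules
git clone <repo> && cd agent-spec
git submodule update --init

# Enter the harness directory (agent-spec/agent-spec)
cd agent-spec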

# List available evals and configs
python3 scripts/cli.py list

# Run the token-efficiency eval (compares 3 CLAUDE.md strategies)
python3 scripts/cli.py run token-efficiency A-baseline
python3 scripts/cli.py run token-efficiency B-drona23
python3 scripts/cli.py run token-efficiency C-caveman

# Generate a results summary
python3 scripts/summarize.py token-efficiency --filter-eval

See evals/token-efficiency/ for a complete worked example with walkthrough.

Structure

agent-spec/
├── agent-spec/        # The harness (orchestrator, evals, scripts)
├── products/          # .claude/ configs being developed
├── targets/           # Test subject apps for evals
└── submodules/        # Third-party repos (git submodule)

How It Works

An eval defines challenges (coding tasks with deterministic tests) and configs (.claude/ directory variants). The harness runs every config against every challenge in isolated sandboxes, measures tokens and cost, and reports the results.

challenges x configs = runs
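
As a rough illustration of that product, the run matrix can be enumerated as below. This is a minimal sketch, not the harness's own code: the challenge names are hypothetical placeholders, and only the config names come from the Quick Start.

```python
from itertools import product

# Hypothetical challenge names, used only for illustration.
challenges = ["challenge-1", "challenge-2"]
# Config names taken from the Quick Start commands.
configs = ["A-baseline", "B-drona23", "C-caveman"]

# Every config is paired with every challenge: one pair per run.
runs = list(product(challenges, configs))

for challenge, config in runs:
    print(f"{challenge} x {config}")

print(len(runs), "runs")
```

With 2 challenges and 3 configs this yields 6 runs, so adding a config adds one run per challenge.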

Each run sandboxes the project, injects the config's .claude/ directory, gives the agent the prompt, lets it work, and then runs verify.sh. The agent does not decide when it is done; the test does. The primary metric is tokens-to-correctness: not just whether a run passes, but how many tokens it took to get there.
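
One way to read tokens-to-correctness is sketched below. This is an assumption about how such a metric could be computed, not the harness's actual scoring: RunResult and tokens_to_correctness are hypothetical names, the pass flag is assumed to come from verify.sh succeeding, and the token counts are made-up values for illustration only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """Hypothetical record of one run (one config on one challenge)."""
    config: str
    challenge: str
    passed: bool  # assumed: True when verify.sh succeeded
    tokens: int   # tokens the agent consumed during the run


def tokens_to_correctness(results):
    """Per config: (passing runs, total runs, mean tokens over passing runs)."""
    summary = {}
    for config in sorted({r.config for r in results}):
        runs = [r for r in results if r.config == config]
        passing = [r.tokens for r in runs if r.passed]
        # Failed runs never reached correctness, so they are excluded from the mean.
        summary[config] = (len(passing), len(runs), mean(passing) if passing else None)
    return summary


# Made-up numbers, for illustration only.
results = [
    RunResult("A-baseline", "challenge-1", True, 12000),
    RunResult("A-baseline", "challenge-2", False, 30000),
    RunResult("C-caveman", "challenge-1", True, 8000),
    RunResult("C-caveman", "challenge-2", True, 10000),
]

for config, (passes, total, avg) in tokens_to_correctness(results).items():
    print(f"{config}: {passes}/{total} passed, mean tokens on passing runs: {avg}")
```

Reporting the pass count alongside the token mean keeps a config from looking cheap simply because it failed early on the hard challenges.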
