Evaluate AI agent skills with repeatable tasks, automated judging, and pass@k scoring.
pipx install caliper-evalCaliper is a local-first evaluation harness for Claude Code skills, Codex skills, and API-backed agents. It runs a skill against one or more task specs, records every attempt, judges the result with an LLM and/or deterministic Python assertions, and saves reproducible result files you can inspect later.
Use Caliper when you want to answer practical questions like:
- Did this skill actually get better after my prompt edit?
- Does it still pass the workflows it passed last week?
- Does Codex or Claude Code run this skill more reliably for my use case?
- Is the skill doing the work, or would the baseline agent pass without it?
- Can contributors change a skill without relying on subjective manual testing?
Caliper is especially useful for agent skills because skills are hard to review with ordinary unit tests. A good skill is part prompt, part workflow, part tool contract. Caliper turns that behavior into versioned eval specs, repeatable runs, pass/fail judgments, and saved transcripts.
- Skill-first evaluation for Claude Code, Codex, Anthropic API, and OpenAI API backends.
- Independent agent and judge backends, so you can test a Codex skill with a Claude judge, a Claude Code skill with a Codex judge, or keep everything on one provider.
- Natural-language and deterministic checks through
expect:andassert:. - pass@k scoring for measuring reliability across repeated attempts.
- Baseline runs to show whether the skill improves over an unassisted agent.
- Attempt isolation with fresh temporary homes and no session history.
- Reproducible result files that snapshot the skill content, referenced local files, and git SHA when available.
- Agent-installable evaluator skill so Claude Code or Codex can help create, validate, run, and interpret evals.
Caliper works well for:
- evaluating Claude Code slash-command skills
- evaluating Codex skills
- comparing agent backends on the same task suite
- regression-testing prompt and workflow changes
- checking coding, review, refactor, summarization, screenshot, and file-writing behaviors
- mixing LLM judgment with exact checks for files, JSON, command output, images, and repository state
It is not a replacement for normal unit tests. Use unit tests for deterministic library behavior. Use Caliper for agent behavior where the output depends on a model following instructions, using tools, and completing a workflow.
Install Caliper from PyPI:
pipx install caliper-evalOr install the latest version from GitHub:
pipx install git+https://github.com/edonadei/caliper.gitBoth install methods expose the caliper command:
caliper --helpFor local development from the repository root:
pip install -e .For local development and optional OpenAI API support:
pip install -e ".[dev,openai]"Caliper requires Python 3.10 or newer.
Caliper can run the agent under test and the judge through different backends.
| Role | Claude Code CLI | Codex CLI | API backends |
|---|---|---|---|
| Agent under test | skill.backend: claude-code |
skill.backend: codex |
skill.backend: claude-api or openai-api |
| LLM judge | judge.backend: claude-code |
judge.backend: codex |
judge.backend: claude-api or openai-api |
| Auth/billing | Claude Code subscription/auth | Codex CLI subscription/auth | Provider API key/billing |
| Captured attempt record | Claude stream-json tool-call transcript |
Codex JSONL events, including tool calls when emitted | Final API response text |
Install and authenticate the claude CLI. backend: claude-code uses your
normal Claude Code CLI auth.
If you explicitly use backend: claude-api, set:
export ANTHROPIC_API_KEY=...Install and authenticate the Codex CLI:
npm install -g @openai/codex
codex login
codex --versionbackend: codex calls Codex with codex exec. It does not fall back to the
OpenAI API. If the CLI is unavailable or cannot authenticate, Caliper reports a
backend configuration error.
When the Codex desktop app is installed, Caliper prefers the app-bundled Codex
CLI over an older codex found on PATH. Set CODEX_CLI_PATH to force a
specific CLI binary.
Check installed agent CLI versions:
caliper update-cli --checkUpdate an npm-installed agent CLI explicitly:
caliper update-cli codex
caliper update-cli claude-codeCaliper does not auto-update agent CLIs during eval runs. That keeps evals
reproducible and avoids adding network failures to every run. If Caliper is
using the Codex CLI bundled inside the desktop app, caliper update-cli codex
will ask you to update the app or set CODEX_CLI_PATH; npm cannot update the
app-bundled binary.
If you explicitly use backend: openai-api, set:
export OPENAI_API_KEY=...Create an eval spec:
# my-skill.eval.yaml
skill:
path: ./SKILL.md
backend: codex
judge:
backend: codex
tasks:
- name: Produces the expected answer
prompt: "Use this skill to answer: what is 2 + 2?"
expect: "The assistant answers 4."Run it:
caliper run my-skill.eval.yaml --k 3 --baselineExample output:
CALIPER - my-skill - k=3 - codex
ID Task k (3) pass@k
task-1 Produces the expected answer 2/3 96.3% PARTIAL
With skill 96.3% ###################-
No skill 70.4% ##############------
Delta +25.9% up
Results saved to .caliper/results/my-skill/2026-05-22T14-23-01Z.json
Browse results:
caliper list
caliper report my-skillValidate a spec before running:
caliper validate my-skill.eval.yaml- Create a small eval spec for one behavior you care about.
- Run it with
--k 1while iterating on the spec. - Add deterministic
assert:checks for facts an LLM judge should not guess. - Run with
--k 3or higher once the task is stable. - Use
--baselineto measure whether the skill helps over the raw agent. - Commit the spec beside the skill so future contributors can run the same evaluation before changing behavior.
caliper run path/to/skill.eval.yaml --k 3 --baseline --verboseThe repository includes an evaluate-skill agent skill. Installing it lets
Claude Code or Codex help you create eval specs, validate them, run Caliper, and
summarize results from inside your normal agent workflow.
If you installed the CLI, use the bundled installer:
caliper install-skill codex
caliper install-skill claude-codePreview the destination without writing files:
caliper install-skill codex --dry-runUse --force to overwrite an existing installed copy.
Without the CLI installer, copy the skill into Claude Code commands:
mkdir -p ~/.claude/commands
curl -fsSL https://raw.githubusercontent.com/edonadei/caliper/main/skills/evaluate-skill/SKILL.md \
-o ~/.claude/commands/evaluate-skill.mdThen use it in Claude Code:
/evaluate-skill validate my-skill.eval.yaml
/evaluate-skill run my-skill.eval.yaml --k 3
Without the CLI installer, install the skill in Codex:
mkdir -p ~/.codex/skills/evaluate-skill
curl -fsSL https://raw.githubusercontent.com/edonadei/caliper/main/skills/evaluate-skill/SKILL.md \
-o ~/.codex/skills/evaluate-skill/SKILL.mdMake sure caliper is on PATH for Codex sessions. If you installed Caliper in
editable mode, the generated console script is usually enough.
Then ask Codex:
Use the evaluate-skill skill to validate my-skill.eval.yaml.
Use the evaluate-skill skill to run my-skill.eval.yaml with k=3 and summarize the result.
skill:
path: ./SKILL.md
backend: codex
judge:
backend: codex
tasks:
- name: Validates a spec
prompt: "Use caliper to validate ./example.eval.yaml and summarize the result."
expect: "The assistant runs caliper validate and reports whether the spec is valid."caliper run my-codex-skill.eval.yaml --k 1 --verboseskill:
path: ~/.claude/commands/review.md
backend: claude-code
model: claude-sonnet-4-6
judge:
backend: claude-code
model: claude-haiku-4-5-20251001
tasks:
- name: Finds a null dereference
prompt: "/review the staged changes in /tmp/eval-repo"
expect: "The review identifies a possible null pointer dereference."The agent backend and judge backend are independent:
skill:
path: ./SKILL.md
backend: codex
judge:
backend: claude-codeOr opt into API billing explicitly:
skill:
path: ./SKILL.md
backend: openai-api
model: gpt-4o-mini
judge:
backend: openai-api
model: gpt-4o-miniUse assert: when success can be verified with Python. This is usually better
than asking an LLM to judge files, JSON, command output, or screenshots.
tasks:
- name: Writes an output file
cleanup: rm -f /tmp/out.txt
prompt: "Write hello world to /tmp/out.txt"
expect: "A file is written at /tmp/out.txt."
assert: |
from pathlib import Path
path = Path("/tmp/out.txt")
assert path.exists(), "Output file was not created"
assert path.read_text().strip() == "hello world"When both expect and assert are present, both must pass.
The repo includes a Codex-backed screenshot eval:
caliper validate skills/evaluate-skill/references/evals/screenshot/screenshot.eval.yaml
caliper run skills/evaluate-skill/references/evals/screenshot/screenshot.eval.yaml --k 1 --judge script --verboseOn macOS, the process running the eval must have Screen Recording permission. If
direct screencapture -x /tmp/test.png fails, this eval will fail until that
permission is granted.
skill:
path: ./SKILL.md # optional path to the skill file
backend: codex # claude-code | codex | claude-api | openai-api
model: <model-name> # optional backend-specific model override
judge:
backend: codex # claude-code | codex | claude-api | openai-api
model: <model-name> # optional backend-specific model override
sandbox:
extra_path:
- ./bin # optional paths prepended to PATH
forbidden_files:
- ".*\\.eval\\.yaml$" # agent cannot read the spec file
- "./.caliper/.*" # agent cannot read saved results
tasks:
- name: Short task name
setup: <shell command> # optional, runs before each attempt
cleanup: <shell command> # optional, always runs after each attempt
prompt: <prompt sent to the agent>
expect: <natural-language success condition>
assert: |
# optional inline Python assertion
assert True
- name: Task with external assertion script
prompt: "Generate a report"
assert: ./assertions/check_report.pyEach task must define at least one of expect or assert. Task ids are assigned
automatically as task-001, task-002, and so on.
| Command | Description |
|---|---|
caliper run <spec> |
Run an evaluation spec |
caliper new [name] |
Create a new evaluation spec with the wizard |
caliper validate <spec> |
Validate a spec file |
caliper list [spec] |
List specs and saved runs |
caliper report <spec-or-result> |
Re-render saved results |
| Flag | Default | Description |
|---|---|---|
--k INT |
3 |
Attempts per task |
--baseline |
off | Also run each task without the skill |
--judge autorater |
autorater |
LLM judge gives a direct pass/fail |
--judge script |
Run static assertions and, if expect exists, an LLM judge |
|
--judge autorater-sdk |
Legacy alias for Anthropic SDK judging; prefer judge.backend: claude-api |
|
--workers INT |
4 |
Parallel task workers |
--timeout INT |
120 |
Seconds per attempt |
--model MODEL |
Override skill.model for the agent under test |
|
--verbose |
off | Show per-attempt judge reasoning |
--output PATH |
Also save results JSON to a specific path |
| Command | Description |
|---|---|
caliper update-cli --check |
Show installed and latest npm versions for known agent CLIs |
caliper update-cli codex |
Update an npm-installed Codex CLI |
caliper update-cli claude-code |
Update an npm-installed Claude Code CLI |
caliper update-cli codex --yes |
Update without an interactive confirmation prompt |
--judge autorater asks the configured judge backend to decide whether the
transcript satisfies expect.
judge:
backend: codexThe autorater's default input is the full attempt record Caliper captured for
that backend. When the backend exposes tool calls, Caliper includes those tool
names, inputs, and tool results in the transcript sent to the autorater; tool
usage checks such as "used pygount" should be judged from that trace instead
of inferred from the final assistant message alone.
This is implicit. Eval specs do not need to opt into trace visibility. The only
limitation is backend support: some runners expose structured tool-call traces,
while others expose only final text output. In those cases, use deterministic
assert: checks for files, command output, or other artifacts the transcript
cannot prove.
--judge script always runs static assert: checks when present.
If the task also has expect, it also asks the configured judge backend for an
LLM verdict. With judge.backend: codex, that LLM check is performed by Codex
CLI. With judge.backend: claude-code, it is performed by Claude Code CLI. Use
claude-api or openai-api only when API billing is intended.
Static assertions run locally with Python. They are ideal for verifying:
- files exist
- exact file contents
- JSON/schema validity
- command output
- images or screenshots
- repository state
Each attempt runs with a fresh temporary HOME directory. For Claude Code,
Caliper installs a temporary slash-command skill in that isolated home. For
Codex, Caliper injects the skill body directly into the prompt passed to
codex exec.
Results are saved next to the spec:
.caliper/results/<spec-name>/<timestamp>.json
Each result includes a skill snapshot: the skill file content, referenced local files, and git SHA when available.
For each task:
pass@k = 1 - (1 - successes / k)^k
The aggregate score is the average task pass@k. With --baseline, Caliper also
runs the same tasks without the skill and reports the delta.
caliper/
commands/ Typer command implementations
harness/ Claude, Codex, and API execution backends
judge/ LLM and script judging implementations
schema/ Eval spec and result models
runner.py Evaluation orchestration
skills/
evaluate-skill/ Agent skill for running Caliper from Claude Code or Codex
tests/ Pytest coverage for harnesses, judges, and runner behavior
Contributions are welcome when they keep Caliper focused on repeatable, maintainable skill evaluation.
Good first contribution areas:
- add example evals for real skills
- improve backend error messages
- add deterministic assertion helpers
- expand tests for harness and judge behavior
- improve result reporting and summaries
- document common setup problems for Claude Code and Codex
Before opening a pull request:
pip install -e ".[dev,openai]"
pytest
ruff check .
caliper validate skills/evaluate-skill/evaluate-skill.eval.yamlWhen changing behavior, include either a test or an eval fixture that demonstrates
the expected outcome. Keep backend-specific behavior isolated to the relevant
module under caliper/harness/ or caliper/judge/ when possible.
The model name in skill.model or judge.model is not available to your Codex
account. Use a model that codex exec --model <name> supports.
Install the Codex CLI and ensure it is on PATH:
npm install -g @openai/codexInstall and authenticate Claude Code, or switch the relevant backend to codex,
claude-api, or openai-api.
When a task has only assert:, no LLM judge is required. Add expect: if you
also want an LLM to judge the transcript.

