A small, opinionated harness for running prompt and model evals against your own datasets. It's the bench gear we use at Canopy Road Workshop to keep specialized model work honest.
Status: early, opinionated, and incomplete. The dataset format is stable. A public 1.0 is pending while we watch how a few API edges settle before tagging it.
Three small pieces, nothing exotic:
| Piece | What it does |
|---|---|
| Runner | Fans prompts across providers (Claude, GPT, local Ollama / vLLM) with retry and rate-limit handling. |
| Dataset | JSONL with input, expected, and an open meta bag. Your dataset is the thing you actually own; the rig just iterates it. |
| Scorecard | Mixes deterministic checks (regex, JSON-schema, exact match) with LLM-judge rubrics, and emits a diff against the previous run so regressions show up in code review instead of in production. |
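To make the deterministic half of the scorecard concrete, here is a minimal stdlib-only sketch of what such checks could look like. The function names are hypothetical and are not eval-rig's actual API, and the key-presence check is only a stand-in for real JSON-schema validation.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass when output equals expected, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass when the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None


def is_json_object_with(output: str, required_keys: list[str]) -> bool:
    """Pass when output parses as a JSON object holding every required key.

    Assumption: a simplified stand-in for a JSON-schema check.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def run_checks(output: str, expected: str) -> dict[str, bool]:
    """Run the string-level checks and report each result by name."""
    return {
        "exact": exact_match(output, expected),
        "contains_expected": regex_match(output, re.escape(expected.strip())),
    }
```

Reporting each check by name, instead of collapsing to a single boolean, leaves the pass/fail policy to whoever reads the scorecard.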
Most "AI eval" frameworks want to own your data, your providers, and your scoring rubric. We didn't want any of that. eval-rig is the smallest thing that lets you:
- Point at a JSONL file you already have.
- Run it against one or more models / prompts.
- Get a pass/fail scorecard and a diff vs. the last run.
Every prompt change, every fine-tune, and every model swap gets a number against a real dataset before it ships. That's the whole pitch.
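The run-over-run diff can be sketched in a few lines. This assumes each run has already been reduced to a mapping from example id to pass/fail. That shape is an assumption made for illustration, not the format eval-rig writes to disk, and `diff_runs` is a hypothetical name.

```python
def diff_runs(
    previous: dict[str, bool], current: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two runs keyed by example id (assumed shape: id -> passed)."""
    shared = previous.keys() & current.keys()
    return {
        # Passed last time, fails now: the cases worth blocking a merge on.
        "regressions": sorted(i for i in shared if previous[i] and not current[i]),
        # Failed last time, passes now.
        "fixes": sorted(i for i in shared if not previous[i] and current[i]),
        # Examples present in only one of the two runs.
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
    }


last_week = {"math-001": True, "geo-001": True}
this_week = {"math-001": True, "geo-001": False, "geo-002": True}
report = diff_runs(last_week, this_week)
print(report["regressions"])
```

Keeping added and removed ids separate from regressions means a dataset edit does not get mistaken for a model change.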
pip install -e .
# Run the included sanity-check dataset against Claude Haiku
eval-rig run \
--dataset datasets/sanity.jsonl \
--model claude-haiku-4-5 \
--prompt "Answer the user's question in one short sentence." \
--out runs/2026-05-05-haiku.json
# Score it
eval-rig score runs/2026-05-05-haiku.json --rubric examples/rubric.yaml
# Diff against last week's run
eval-rig diff runs/2026-04-28-haiku.json runs/2026-05-05-haiku.json

JSONL. One example per line:
{"input": "What's 2+2?", "expected": "4", "meta": {"id": "math-001", "tags": ["arithmetic"]}}
{"input": "Capital of Florida?", "expected": "Tallahassee", "meta": {"id": "geo-001"}}

input and expected are strings. meta is an open bag: id, tags, source URL, anything you want to filter or group on later. The runner never inspects meta; the scorecard can.
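A loader for this format fits in a few lines. The sketch below is not the code in dataset.py; `parse_examples` is a hypothetical name, and treating a missing meta as an empty bag is an assumption, since the format description does not say whether meta is required.

```python
import json
from typing import Any, Iterable, Iterator


def parse_examples(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield validated examples from an iterable of JSONL lines."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        for field in ("input", "expected"):
            if not isinstance(record.get(field), str):
                raise ValueError(f"line {lineno}: {field!r} must be a string")
        # Assumption: meta is optional and defaults to an empty bag.
        meta = record.setdefault("meta", {})
        if not isinstance(meta, dict):
            raise ValueError(f"line {lineno}: 'meta' must be an object")
        yield record


sample = [
    '{"input": "What\'s 2+2?", "expected": "4", "meta": {"id": "math-001"}}',
    "",
    '{"input": "Capital of Florida?", "expected": "Tallahassee"}',
]
examples = list(parse_examples(sample))
print(len(examples))
```

Because the function takes any iterable of strings, an open file handle works as well as a list: `with open("datasets/sanity.jsonl") as f: examples = list(parse_examples(f))`.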
eval_rig/           # package code
  runner.py         # provider fan-out + retry/backoff
  dataset.py        # JSONL loader, schema validation
  scorecard.py      # deterministic + LLM-judge scoring
  diff.py           # run-over-run diff
  cli.py            # eval-rig entry point
datasets/           # tiny sanity datasets (not benchmarks; your data is your data)
examples/           # rubric and prompt examples
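For readers wondering what the retry/backoff noted for runner.py might involve, here is a generic sketch of exponential backoff with jitter. It is not the rig's implementation: `call_with_backoff`, its parameters, and the choice of retryable exceptions are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to base_delay * 2^(attempt-1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The `sleep` parameter is injectable so the function can be tested without real delays.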
The dataset format is stable. The runner config is not — provider names, retry policy, and the rubric DSL are all still moving. Issues and design feedback welcome. Don't pin to a specific commit until 1.0.
MIT — see LICENSE.