eval-rig

A small, opinionated harness for running prompt and model evals against your own datasets. It's the bench gear we use at Canopy Road Workshop to keep specialized model work honest.

Status: early, opinionated, and incomplete. The dataset format is stable. A public 1.0 is pending; we're watching how a few API edges settle before tagging it.

Three small pieces, nothing exotic:

| Piece | What it does |
| --- | --- |
| Runner | Fans prompts across providers (Claude, GPT, local Ollama / vLLM) with retry and rate-limit handling. |
| Dataset | JSONL with `input`, `expected`, and an open `meta` bag. Your dataset is the thing you actually own; the rig just iterates it. |
| Scorecard | Mixes deterministic checks (regex, JSON-schema, exact match) with LLM-judge rubrics, and emits a diff against the previous run so regressions show up in code review instead of in production. |
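
To make the "deterministic checks" half of the scorecard concrete, here is a minimal sketch of what such checks could look like. The function names are hypothetical and are not eval-rig's actual API, and the JSON check is a simplified stand-in for real JSON-schema validation.

```python
import json
import re
from typing import Sequence


def exact_match(output: str, expected: str) -> bool:
    """Pass when output equals expected, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def json_object_with_keys(output: str, required_keys: Sequence[str]) -> bool:
    """Simplified stand-in for a schema check (assumption, not real JSON Schema):
    the output must parse as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)
```

Checks like these are cheap and repeatable, which is why they pair well with LLM-judge rubrics: the judge handles fuzzy quality, the deterministic checks catch hard failures.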

Why this exists

Most "AI eval" frameworks want to own your data, your providers, and your scoring rubric. We didn't want any of that. eval-rig is the smallest thing that lets you:

1. Point at a JSONL file you already have.
2. Run it against one or more models / prompts.
3. Get a pass/fail scorecard and a diff vs. the last run.

Every prompt change, every fine-tune, and every model swap gets a number against a real dataset before it ships. That's the whole pitch.

Quickstart

pip install -e .

# Run the included sanity-check dataset against Claude Haiku
eval-rig run \
  --dataset datasets/sanity.jsonl \
  --model claude-haiku-4-5 \
  --prompt "Answer the user's question in one short sentence." \
  --out runs/2026-05-05-haiku.json

# Score it
eval-rig score runs/2026-05-05-haiku.json --rubric examples/rubric.yaml

# Diff against last week's run
eval-rig diff runs/2026-04-28-haiku.json runs/2026-05-05-haiku.json
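
The run-over-run diff is conceptually simple. The sketch below assumes each scored run can be reduced to a mapping from example id to pass/fail; that shape, and the `diff_runs` name, are illustrative assumptions, not the format eval-rig writes to `runs/*.json`.

```python
def diff_runs(previous: dict, current: dict) -> dict:
    """Compare two {example_id: passed} mappings and bucket the changes."""
    shared = previous.keys() & current.keys()
    return {
        # passed last time, fails now
        "regressions": sorted(i for i in shared if previous[i] and not current[i]),
        # failed last time, passes now
        "fixes": sorted(i for i in shared if not previous[i] and current[i]),
        # examples present in only one of the two runs
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
    }


# Hypothetical pass/fail results for two runs.
last_week = {"math-001": True, "geo-001": True}
this_week = {"math-001": True, "geo-001": False, "geo-002": True}
report = diff_runs(last_week, this_week)
print(report["regressions"])  # ['geo-001']
```

Keying on a stable example id is what makes the diff meaningful when the dataset grows between runs.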

Dataset format

JSONL. One example per line:

{"input": "What's 2+2?", "expected": "4", "meta": {"id": "math-001", "tags": ["arithmetic"]}}
{"input": "Capital of Florida?", "expected": "Tallahassee", "meta": {"id": "geo-001"}}

`input` and `expected` are strings. `meta` is an open bag: `id`, `tags`, source URL, anything you want to filter or group on later. The runner never inspects `meta`; the scorecard can.
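
As a rough illustration of the format, the following sketch loads and validates JSONL lines and then filters on a `meta` tag. `load_dataset` is a hypothetical helper, not the loader in `eval_rig/dataset.py`, and treating a missing `meta` as an empty bag is an assumption.

```python
import json


def load_dataset(lines):
    """Parse JSONL lines into example dicts, checking the documented fields."""
    examples = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        for field in ("input", "expected"):
            if not isinstance(record.get(field), str):
                raise ValueError(f"line {number}: '{field}' must be a string")
        # Assumption: a missing meta is treated as an empty bag.
        if not isinstance(record.setdefault("meta", {}), dict):
            raise ValueError(f"line {number}: 'meta' must be an object")
        examples.append(record)
    return examples


SAMPLE = """\
{"input": "What's 2+2?", "expected": "4", "meta": {"id": "math-001", "tags": ["arithmetic"]}}
{"input": "Capital of Florida?", "expected": "Tallahassee", "meta": {"id": "geo-001"}}
"""

examples = load_dataset(SAMPLE.splitlines())
# Group or filter on anything in meta, for example a tag.
arithmetic_ids = [e["meta"]["id"] for e in examples if "arithmetic" in e["meta"].get("tags", [])]
print(arithmetic_ids)  # ['math-001']
```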

Layout

eval_rig/        # package code
  runner.py      # provider fan-out + retry/backoff
  dataset.py     # JSONL loader, schema validation
  scorecard.py   # deterministic + LLM-judge scoring
  diff.py        # run-over-run diff
  cli.py         # eval-rig entry point
datasets/        # tiny sanity datasets (not benchmarks — your data is your data)
examples/        # rubric and prompt examples
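
For readers unfamiliar with the retry/backoff pattern mentioned for `runner.py`, here is a generic sketch of exponential backoff with jitter. It is not the rig's implementation: `call_with_retry`, its parameters, and the broad `except Exception` are assumptions for illustration, and a real runner would more likely retry only on provider rate-limit or transient errors.

```python
import random
import time


def call_with_retry(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `call()` until it succeeds, backing off exponentially between failures."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            # Jitter spreads out retries from parallel workers.
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a fake provider call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []
result = call_with_retry(flaky_call, sleep=delays.append)  # record delays instead of sleeping
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.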

Status & contributions

The dataset format is stable. The runner config is not — provider names, retry policy, and the rubric DSL are all still moving. Issues and design feedback welcome. Don't pin to a specific commit until 1.0.

License

MIT — see LICENSE.
