canopy-workshop/eval-rig

eval-rig

A small, opinionated harness for running prompt and model evals against your own datasets. It's the bench gear we use at Canopy Road Workshop to keep specialized model work honest.

Status: early, opinionated, and incomplete. The dataset format is stable. A public 1.0 is pending while we watch how a few API edges settle before tagging it.

Three small pieces, nothing exotic:

| Piece | What it does |
| --- | --- |
| Runner | Fans prompts across providers (Claude, GPT, local Ollama / vLLM) with retry and rate-limit handling. |
| Dataset | JSONL with input, expected, and an open meta bag. Your dataset is the thing you actually own; the rig just iterates it. |
| Scorecard | Mixes deterministic checks (regex, JSON-schema, exact match) with LLM-judge rubrics, and emits a diff against the previous run so regressions show up in code review instead of in production. |
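
To make "deterministic checks" concrete, here is a minimal sketch of what exact-match, regex, and JSON-shape checks could look like. The function names are hypothetical and are not eval-rig's API; the JSON check is a deliberately simple stand-in for real JSON-schema validation and only looks for required top-level keys.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass when output equals expected, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def has_required_keys(output: str, required_keys: set) -> bool:
    """Stand-in for a JSON-schema check (assumption: top-level keys only)."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

Checks like these return the same verdict on every run, which is what makes a run-over-run diff meaningful; LLM-judge rubrics would sit alongside them for the cases a pattern cannot capture.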

Why this exists

Most "AI eval" frameworks want to own your data, your providers, and your scoring rubric. We didn't want any of that. eval-rig is the smallest thing that lets you:

  1. Point at a JSONL file you already have.
  2. Run it against one or more models / prompts.
  3. Get a pass/fail scorecard and a diff vs. the last run.

Every prompt change, every fine-tune, and every model swap gets a number against a real dataset before it ships. That's the whole pitch.

Quickstart

pip install -e .

# Run the included sanity-check dataset against Claude Haiku
eval-rig run \
  --dataset datasets/sanity.jsonl \
  --model claude-haiku-4-5 \
  --prompt "Answer the user's question in one short sentence." \
  --out runs/2026-05-05-haiku.json

# Score it
eval-rig score runs/2026-05-05-haiku.json --rubric examples/rubric.yaml

# Diff against last week's run
eval-rig diff runs/2026-04-28-haiku.json runs/2026-05-05-haiku.json
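
The idea behind the diff step can be sketched in a few lines. This is an illustration only: the on-disk format of a run file is not documented here, so the sketch assumes each scored run can be reduced to a mapping from example id to pass/fail, and `diff_runs` is a hypothetical name.

```python
def diff_runs(previous: dict, current: dict) -> dict:
    """Compare two {example_id: passed} mappings (assumed shape, not the real run format)."""
    shared = previous.keys() & current.keys()
    return {
        # Passed last time, fails now: the cases a reviewer most needs to see.
        "regressions": sorted(k for k in shared if previous[k] and not current[k]),
        # Failed last time, passes now.
        "fixes": sorted(k for k in shared if not previous[k] and current[k]),
        # Examples that only exist in one of the two runs.
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
    }


last_week = {"math-001": True, "geo-001": False}
this_week = {"math-001": False, "geo-001": True, "geo-002": True}
report = diff_runs(last_week, this_week)
print(report["regressions"])  # ['math-001']
```

Separating regressions from newly added examples keeps a growing dataset from hiding a real drop in quality.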

Dataset format

JSONL. One example per line:

{"input": "What's 2+2?", "expected": "4", "meta": {"id": "math-001", "tags": ["arithmetic"]}}
{"input": "Capital of Florida?", "expected": "Tallahassee", "meta": {"id": "geo-001"}}

input and expected are strings. meta is an open bag — id, tags, source URL, anything you want to filter or group on later. The runner never inspects meta; the scorecard can.
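
A loader for this format can be very small. The sketch below is not the code in dataset.py; `load_examples` is a hypothetical helper, and treating a missing meta as an empty bag is an assumption rather than documented behavior.

```python
import json
from typing import Iterable, Iterator


def load_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield validated examples from JSONL lines, skipping blank lines."""
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        for field in ("input", "expected"):
            if not isinstance(record.get(field), str):
                raise ValueError(f"line {line_no}: {field!r} must be a string")
        # Assumption: meta is optional and defaults to an empty bag.
        if not isinstance(record.setdefault("meta", {}), dict):
            raise ValueError(f"line {line_no}: 'meta' must be an object")
        yield record


sample = [
    '{"input": "Capital of Florida?", "expected": "Tallahassee", "meta": {"id": "geo-001"}}',
    '{"input": "What is 2+2?", "expected": "4", "meta": {"id": "math-001", "tags": ["arithmetic"]}}',
]
arithmetic = [ex for ex in load_examples(sample) if "arithmetic" in ex["meta"].get("tags", [])]
print([ex["meta"]["id"] for ex in arithmetic])  # ['math-001']
```

Because meta is free-form, filtering and grouping stay in plain Python, as in the tag filter above; with a real file, pass the open file handle as `lines`.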

Layout

eval_rig/        # package code
  runner.py      # provider fan-out + retry/backoff
  dataset.py     # JSONL loader, schema validation
  scorecard.py   # deterministic + LLM-judge scoring
  diff.py        # run-over-run diff
  cli.py         # eval-rig entry point
datasets/        # tiny sanity datasets (not benchmarks — your data is your data)
examples/        # rubric and prompt examples
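
The layout says runner.py handles retry/backoff. As a rough illustration of that technique, and not the rig's actual implementation, here is exponential backoff with full jitter around an arbitrary provider call. `call_with_retry` and all of its parameters are hypothetical; a real runner would narrow `retry_on` to the provider's rate-limit and transient-error exceptions.

```python
import random
import time


def call_with_retry(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call `call()` until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount in [0, cap] to avoid retry bursts.
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.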

Status & contributions

The dataset format is stable. The runner config is not — provider names, retry policy, and the rubric DSL are all still moving. Issues and design feedback welcome. Don't pin to a specific commit until 1.0.

License

MIT — see LICENSE.
