Open eval suite for the Factlet Protocol. MIT-licensed.
Status: Tier 1 — methodology + raw runs. This release publishes the eval infrastructure, task set, raw run data, and worked examples. It deliberately does not publish a headline aggregate number. A defensible aggregate number ships in Tier 2 (~2-4 weeks) after expanding to N=100+ tasks with multi-judge agreement, externally-authored tasks, and bootstrap CIs. Why this sequencing: see
docs/RESULTS-N6-MAY-2026.md(Limitations + Tier 2 publish-gate).
A reproducible benchmark harness that compares LLM behavior under three conditions:
- Baseline — no factbook
- Naive grounding — factbook content as flat markdown in system
- Factlet-grounded — factbook rendered via per-vendor renderer (Factlet Protocol)
…across three frontier models (Claude Sonnet 4.6, GPT-4.1, Gemini 2.0 Flash) on hand-crafted tasks across three domains (payments, frontend, ML pipeline).
The primary comparison is with-factbook (any rendering) vs no-factbook — does giving the model team-specific truth in context measurably reduce harmful or off-policy output? The secondary comparison #3 vs #2 is reported as a diagnostic on whether structured per-vendor rendering provides additional lift over naive markdown grounding at the rendering layer.
tier1/
tasks/ 6 scaffold tasks (target before seal: 20)
payments/ *.yaml — 2 tasks
frontend/ *.yaml — 2 tasks
ml-pipeline/ *.yaml — 2 tasks (incl. 1 outside-coverage calibration task)
factbooks/ *.yaml — copies from factlet-ai/registry
judge-prompts/ *.md — per-metric LLM-as-judge prompts
task-schema.md — task YAML schema reference
methodology.md — design rationale, sample-size analysis, limits
PREREG.md — pre-registration template + SHA-256 of locked artifacts
runner/
run.py — runs tasks × conditions × models
score.py — multi-judge scorer (GPT-4.1 / Claude / Gemini)
aggregate.py — scored.jsonl → markdown reports
validate.py — task YAML schema validator
pyproject.toml
docs/
RESULTS-N6-MAY-2026.md — N=6 scaffold-run results, robustness checks, Tier 2 publish-gate
results/ — timestamped raw runs + reports (gitignored once they exist)
cd runner/
pip install -e .
# Set API keys (BYOK — no provider lock-in)
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GOOGLE_API_KEY=AIza... # GEMINI_API_KEY also accepted
# Validate task YAML
python validate.py --tasks ../tier1/tasks --factbooks ../tier1/factbooks
# Run all tasks × all conditions × all models
python run.py \
--tasks ../tier1/tasks --factbooks ../tier1/factbooks \
--output ../results/$(date +%Y-%m-%d)
# Score with 3 judges (per-metric)
RUN_DIR=../results/$(date +%Y-%m-%d)
python score.py \
--raw $RUN_DIR/raw.jsonl \
--tasks ../tier1/tasks --factbooks ../tier1/factbooks \
--prompts ../tier1/judge-prompts \
--output $RUN_DIR/scored.jsonl
# Aggregate to markdown reports (summary + per-task detail + agreement)
python aggregate.py \
--scored $RUN_DIR/scored.jsonl \
--output $RUN_DIRCost at the current 6-task scaffold: ~$2.50 (54 generation calls + 810 judge calls). Scales linearly with task count and judge count.
Tier 1's task set, judge prompts, scoring rubric, and analysis plan are pre-registered before any run. The seal recipe (using git ls-tree for filesystem-/locale-independent hashing) lives in tier1/PREREG.md. Any change requires a new pre-registration with reason.
A first scaffold run (N=6 tasks, single-author) is at docs/RESULTS-N6-MAY-2026.md. The data does not support a single-percentage headline — see the limitations section.
Raw + scored data: results/v2/.
See tier1/methodology.md for the full design rationale, sample-size analysis, judge architecture, and known limitations.
Tasks authored by someone other than the protocol author are especially welcome. Every task in the current scaffold is mihirchoudhary-authored; the next eval run is gated on ≥5 externally-authored tasks landing.
Domains we especially want right now: security (auth flows, IAM policy, secret handling), devops (Terraform / Kubernetes / CI policies), data engineering (schema decisions, pipeline conventions). Other domains welcome.
The shape of a good task (full criteria in CONTRIBUTING.md): tests a case where the model's training prior is likely to conflict with a documented team-specific decision. Worked example:
A team retired Redux for state management in 2025-Q4 in favor of TanStack Query. The factbook records this. The task asks the model to "set up state management for an orders table." Without the factbook, the model defaults to Redux (most common public-internet answer). With the factbook, the model uses TanStack and cites the retirement decision. The task tests the model's ability to defer to a documented team decision over its training prior.
That's the frontend-001-state-management task in this repo. It moved quality from 2.0 (without factbook) to 4.67 (with) in the N=6 run — one of the strongest signals in the dataset.
To contribute:
- Open a Task proposal issue first describing the conflict your task tests. Maintainer replies within a week with go / discuss / no-fit.
- Once green-lit, write the YAML following
tier1/task-schema.md, setexternal_author: true, validate locally, open a PR. - Read CONTRIBUTING.md for the full process and quality bar.
- Tier 2 (gated on ≥5 externally-authored tasks landing): N grows to ≥100, vanilla-RAG and vendor-Memory comparators added, bootstrap CIs.
- Tier 3 (later): conformance test suite, retrieval-quality benchmarks, calibration evals on the protocol's confidence signal, continuous-eval infrastructure.
MIT — see LICENSE.