Factlet Protocol — Evals

Open eval suite for the Factlet Protocol. MIT-licensed.

Status: Tier 1 — methodology + raw runs. This release publishes the eval infrastructure, task set, raw run data, and worked examples. It deliberately does not publish a headline aggregate number. A defensible aggregate number ships in Tier 2 (~2-4 weeks) after expanding to N=100+ tasks with multi-judge agreement, externally-authored tasks, and bootstrap CIs. Why this sequencing: see docs/RESULTS-N6-MAY-2026.md (Limitations + Tier 2 publish-gate).

What this repo is

A reproducible benchmark harness that compares LLM behavior under three conditions:

1. Baseline: no factbook
2. Naive grounding: factbook content as flat markdown in the system prompt
3. Factlet-grounded: factbook rendered via a per-vendor renderer (Factlet Protocol)

…across three frontier models (Claude Sonnet 4.6, GPT-4.1, Gemini 2.0 Flash) on hand-crafted tasks across three domains (payments, frontend, ML pipeline).

The primary comparison is with-factbook (any rendering) vs no-factbook: does giving the model team-specific truth in context measurably reduce harmful or off-policy output? The secondary comparison, Factlet-grounded (condition 3) vs naive grounding (condition 2), is reported as a diagnostic of whether structured per-vendor rendering provides additional lift over naive markdown grounding at the rendering layer.
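
As a rough illustration of how the three conditions differ, the sketch below builds a system prompt for each one. It is not the code in runner/run.py. The `Condition` enum, `build_system_prompt`, the base prompt text, and both rendering formats (a bullet list for naive grounding, XML-style tags for one vendor) are assumptions made for illustration only; the actual per-vendor renderers are defined by the Factlet Protocol.

```python
from enum import Enum


class Condition(Enum):
    BASELINE = "baseline"  # no factbook
    NAIVE = "naive"        # factbook as flat markdown in the system prompt
    FACTLET = "factlet"    # factbook rendered per vendor


# Hypothetical base prompt; the real harness may use something else.
BASE_SYSTEM = "You are a coding assistant for this team."


def render_flat(facts: list[str]) -> str:
    """Naive grounding: one markdown bullet per fact."""
    return "\n".join(f"- {fact}" for fact in facts)


def render_for_vendor(facts: list[str], vendor: str) -> str:
    """Hypothetical per-vendor rendering; the real formats are not shown here."""
    if vendor == "anthropic":
        body = "\n".join(f"<fact>{fact}</fact>" for fact in facts)
        return f"<factbook>\n{body}\n</factbook>"
    return "## Team facts\n" + render_flat(facts)


def build_system_prompt(condition: Condition, facts: list[str], vendor: str) -> str:
    """Return the system prompt for one (condition, vendor) cell of the grid."""
    if condition is Condition.BASELINE:
        return BASE_SYSTEM
    if condition is Condition.NAIVE:
        return f"{BASE_SYSTEM}\n\n{render_flat(facts)}"
    return f"{BASE_SYSTEM}\n\n{render_for_vendor(facts, vendor)}"


facts = ["Redux was retired in 2025-Q4 in favor of TanStack Query."]
for condition in Condition:
    print(condition.value, "->", repr(build_system_prompt(condition, facts, "anthropic")))
```

The point of the sketch is that conditions 2 and 3 carry the same facts and differ only in how they are rendered, which is why the 3-vs-2 comparison isolates the rendering layer.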

Repo layout

tier1/
  tasks/                  6 scaffold tasks (target before seal: 20)
    payments/      *.yaml — 2 tasks
    frontend/      *.yaml — 2 tasks
    ml-pipeline/   *.yaml — 2 tasks (incl. 1 outside-coverage calibration task)
  factbooks/       *.yaml — copies from factlet-ai/registry
  judge-prompts/   *.md   — per-metric LLM-as-judge prompts
  task-schema.md         — task YAML schema reference
  methodology.md         — design rationale, sample-size analysis, limits
  PREREG.md              — pre-registration template + SHA-256 of locked artifacts
runner/
  run.py                 — runs tasks × conditions × models
  score.py               — multi-judge scorer (GPT-4.1 / Claude / Gemini)
  aggregate.py           — scored.jsonl → markdown reports
  validate.py            — task YAML schema validator
  pyproject.toml
docs/
  RESULTS-N6-MAY-2026.md — N=6 scaffold-run results, robustness checks, Tier 2 publish-gate
results/                 — timestamped raw runs + reports (gitignored once they exist)

Quick start

cd runner/
pip install -e .

# Set API keys (BYOK — no provider lock-in)
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GOOGLE_API_KEY=AIza...   # GEMINI_API_KEY also accepted

# Validate task YAML
python validate.py --tasks ../tier1/tasks --factbooks ../tier1/factbooks

# Run all tasks × all conditions × all models
# (set RUN_DIR once so every step uses the same directory, even across midnight)
RUN_DIR=../results/$(date +%Y-%m-%d)
python run.py \
  --tasks ../tier1/tasks --factbooks ../tier1/factbooks \
  --output $RUN_DIR

# Score with 3 judges (per-metric)
python score.py \
  --raw $RUN_DIR/raw.jsonl \
  --tasks ../tier1/tasks --factbooks ../tier1/factbooks \
  --prompts ../tier1/judge-prompts \
  --output $RUN_DIR/scored.jsonl

# Aggregate to markdown reports (summary + per-task detail + agreement)
python aggregate.py \
  --scored $RUN_DIR/scored.jsonl \
  --output $RUN_DIR

Cost at the current 6-task scaffold: ~$2.50 (54 generation calls + 810 judge calls). Scales linearly with task count and judge count.
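
The call counts above follow from the size of the evaluation grid. The short sketch below reproduces them and projects the counts for a larger task set. The figure of 5 metrics is an assumption: it is inferred from 810 / (54 × 3) and is not stated in this README. `calls_for` is a hypothetical helper, not part of the runner.

```python
tasks = 6
conditions = ["baseline", "naive", "factlet"]
models = ["Claude Sonnet 4.6", "GPT-4.1", "Gemini 2.0 Flash"]
judges = 3
metrics = 5  # assumption: inferred from 810 / (54 * 3), not stated in the README


def calls_for(n_tasks: int) -> tuple[int, int]:
    """Return (generation calls, judge calls) for a given number of tasks."""
    generation = n_tasks * len(conditions) * len(models)
    judging = generation * judges * metrics
    return generation, judging


print(calls_for(tasks))  # (54, 810)
print(calls_for(100))    # (900, 13500)
```

Because both counts are linear in the task count, a 100-task set would need roughly 17 times the calls of the current scaffold under the same judge and metric configuration.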

Pre-registration

Tier 1's task set, judge prompts, scoring rubric, and analysis plan are pre-registered before any run. The seal recipe (using git ls-tree for filesystem-/locale-independent hashing) lives in tier1/PREREG.md. Any change requires a new pre-registration with reason.
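
To show why a seal of this kind can be made independent of filesystem order and locale, here is a minimal pure-Python analog. It is not the recipe in tier1/PREREG.md, which uses git ls-tree; `seal_digest` and the manifest layout are hypothetical and chosen only to illustrate the idea of hashing a byte-sorted list of (path, content hash) pairs.

```python
import hashlib
import tempfile
from pathlib import Path


def seal_digest(root: Path) -> str:
    """SHA-256 over a sorted manifest of (relative path, file SHA-256)."""
    entries = []
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root).as_posix()  # forward slashes on every OS
            file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
            entries.append((rel.encode("utf-8"), file_hash))
    entries.sort()  # byte-wise sort, so locale collation rules do not matter
    manifest = hashlib.sha256()
    for rel_bytes, file_hash in entries:
        manifest.update(rel_bytes + b"\0" + file_hash.encode("ascii") + b"\n")
    return manifest.hexdigest()


# Demo on throwaway files: editing a sealed file changes the digest.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tasks").mkdir()
    (root / "tasks" / "a.yaml").write_text("id: a\n", encoding="utf-8")
    (root / "rubric.md").write_text("score 1-5\n", encoding="utf-8")
    before = seal_digest(root)
    (root / "rubric.md").write_text("score 1-7\n", encoding="utf-8")
    after = seal_digest(root)

print(before != after)  # True
```

Anyone holding the published digest can recompute it from a checkout and detect whether the locked artifacts changed after the seal.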

Results so far

A first scaffold run (N=6 tasks, single-author) is at docs/RESULTS-N6-MAY-2026.md. The data does not support a single-percentage headline — see the limitations section.

Raw + scored data: results/v2/.

Conditions and methodology

See tier1/methodology.md for the full design rationale, sample-size analysis, judge architecture, and known limitations.

Contributing tasks

Tasks authored by someone other than the protocol author are especially welcome. Every task in the current scaffold is mihirchoudhary-authored; the next eval run is gated on ≥5 externally-authored tasks landing.

Domains we especially want right now: security (auth flows, IAM policy, secret handling), devops (Terraform / Kubernetes / CI policies), data engineering (schema decisions, pipeline conventions). Other domains welcome.

The shape of a good task (full criteria in CONTRIBUTING.md): tests a case where the model's training prior is likely to conflict with a documented team-specific decision. Worked example:

A team retired Redux for state management in 2025-Q4 in favor of TanStack Query. The factbook records this. The task asks the model to "set up state management for an orders table." Without the factbook, the model defaults to Redux (the most common public-internet answer). With the factbook, the model uses TanStack Query and cites the retirement decision. The task tests the model's ability to defer to a documented team decision over its training prior.

That's the frontend-001-state-management task in this repo. It moved quality from 2.0 (without factbook) to 4.67 (with) in the N=6 run — one of the strongest signals in the dataset.

To contribute:

1. Open a Task proposal issue first describing the conflict your task tests. Maintainer replies within a week with go / discuss / no-fit.
2. Once green-lit, write the YAML following tier1/task-schema.md, set external_author: true, validate locally, open a PR.
3. Read CONTRIBUTING.md for the full process and quality bar.

Status of Tier 2 / Tier 3

Tier 2 (gated on ≥5 externally-authored tasks landing): N grows to ≥100, vanilla-RAG and vendor-Memory comparators added, bootstrap CIs.
Tier 3 (later): conformance test suite, retrieval-quality benchmarks, calibration evals on the protocol's confidence signal, continuous-eval infrastructure.
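
For readers unfamiliar with the bootstrap CIs planned for Tier 2, the sketch below shows one common variant, a percentile bootstrap over per-task score differences. It is an illustration under stated assumptions, not the pre-registered analysis: `bootstrap_ci` is a hypothetical helper, and the `diffs` values are synthetic placeholders, not results from any run.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-task differences (with-factbook minus baseline); NOT real results.
diffs = [2.5, 0.0, 1.0, -0.5, 1.5, 0.5, 2.0, 0.0]
lo, hi = bootstrap_ci(diffs)
print(f"mean={statistics.fmean(diffs):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Resampling at the task level treats the task as the unit of analysis. With only a handful of tasks the interval is wide, which is consistent with deferring any aggregate number until N is larger.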

License

MIT — see LICENSE.
