Turn an AI eval result into one portable, offline-verifiable receipt. It proves who signed these exact bytes and that nothing changed since — not that the number is true. Ed25519 + RFC 6962 Merkle, one file, no server, no network.
Every AI eval number you read — a safety benchmark, a capability score, a leaderboard entry — is an unverifiable claim. You trust the lab. There's no portable way to check, offline, that a result was signed by a stated party, hasn't been altered, and covers the samples it claims.
proofbundle is that check. It's a small MIT-licensed Python tool (a compact, auditable trusted core,
depends only on cryptography) that turns a result into a signed
receipt anyone can verify from a single file — and it's honest about the line it does not cross.
pip install "proofbundle[eval]"
proofbundle demoYou'll see an honest receipt verify => OK, then six independent tampers each verify FAILED, then
a swapped sample get caught — all in memory. The command exits non-zero if any tamper slips through,
so it's also a self-test. Full walkthrough: docs/DEMO.md.
# your own receipt, from a signed payload:
proofbundle emit --payload-file result.json --new-key signer.key --out receipt.json
proofbundle verify receipt.json # exit 0 = OK, 1 = failed, 2 = malformed| ✅ It proves | ❌ It does not prove |
|---|---|
| These exact bytes were signed by this key (authorship) | That the number is true |
| Nothing changed since signing (integrity, Ed25519 + RFC 6962) | That the issuer is honest |
| The result is attributable to a stated issuer | That the eval was well-designed |
| A threshold was met while hiding the model/dataset (salted commitments) | That there was no cherry-picking — unless pre-registered |
| Optionally: individual samples, offline-auditable (per-sample Merkle) | That the computation was correct — that needs a TEE or independent reproduction |
This boundary is the point, not a weakness. A receipt makes a claim attributable, tamper-evident, and — with pre-registration and per-sample auditing — bounded and spot-checkable. Full detail: THREAT_MODEL.md.
flowchart LR
H["eval harness<br/>inspect_ai · lm-eval · promptfoo · pytest"] --> A["adapter → signed claim<br/>salted commitments · provenance · samples root"]
A --> R["receipt<br/>one portable file"]
R --> V{{"proofbundle verify — offline"}}
V --> C["signature · Merkle inclusion · SD-JWT/KB ·<br/>witness quorum · status list · sample openings"]
C --> OK(["=> OK / FAILED"])
style V fill:#D6248A,stroke:#D6248A,color:#fff
style OK fill:#D6248A,stroke:#D6248A,color:#fff
- Core — Ed25519 signature + RFC 6962 / 9162 Merkle inclusion, verified fully offline. Checks a real Sigstore Rekor proof, so correctness isn't self-referential.
- Eval receipts — a signed claim (
metric ⋈ threshold,n, salted model/dataset commitments, assurance level, provenance) from your run. See EVAL_CLAIM.md. - Selective disclosure — SD-JWT (RFC 9901) with Key Binding: prove a threshold while withholding the exact score.
- Transparency-log interop — C2SP
tlog-checkpoint/ cosignature /.tlog-proof, with post-quantum ML-DSA-44 witness cosignatures. Optional Token-Status-List revocation snapshots. - Per-sample audit — commit to every sample; an auditor challenges random indices (with a fresh nonce or a public randomness beacon, v1.9) and openings must bind to the signed root. Catches 1% sample-doctoring with 95% confidence at 300 samples, regardless of run size.
- Pre-registration —
proofbundle prereg <plan>commits to the protocol before the run, so best-of-many publishing becomes visible. - Integrations — opt-in inspect_ai end-of-task hook and pytest plugin (emit only when
PROOFBUNDLE_EMIT=1/--proofbundle), plus a Hugging Face Community Evals bridge. See INTEGRATIONS.md.
| For… | Read |
|---|---|
| Skeptics (why not SHA-256 / Sigstore / trust the issuer) | docs/FAQ.md |
| New to this? plain-terms glossary | docs/GLOSSARY.md |
| Reviewers (30-minute adversarial audit path) | docs/REVIEWERS.md |
| Where every trust anchor comes from | docs/TRUST_ANCHORS.md |
| The demos, tier by tier | docs/DEMO.md |
| The normative format + verification order | SPEC.md |
| Honest comparison to Rekor / in-toto / OMS / ValiChord | INTEROP.md |
| Regulatory mapping (and what to never claim) | COMPLIANCE.md |
| Funders / role fit | docs/PROJECT_BRIEF.md |
| Preview: TEE-attestation bridge (v2.0 beta) | docs/EXPERIMENTAL_ENCLAVE.md |
pip install proofbundle # core: offline verify + plain emit (dependency-free)
pip install "proofbundle[eval]" # + eval receipts, prereg, and the demo (adds an RFC 8785 JCS canonicalizer)
pip install "proofbundle[inspect]" # inspect_ai adapter + hook
pip install "proofbundle[pq]" # verify ML-DSA-44 (post-quantum) witness cosignaturesRequires Python 3.10+. The verify path never rolls its own crypto — Ed25519 comes from
cryptography; Merkle hashing is RFC 6962.
Beta, SemVer-committed, 303 tests + a CI mutation gate + property-based parser fuzzing. Correctness is anchored to external RFC 6962 vectors and a real Rekor proof, not just its own bundles. It is not a log service, a full in-toto client, a TEE, a consensus network, or a compliance product by itself — it is the small, offline, standards-native receipt layer between them. Security policy: SECURITY.md.
See CONTRIBUTING.md and the Code of Conduct. Good first
issues are labeled good-first-issue;
security findings go through SECURITY.md. The verifier core aims to stay small,
dependency-light, and correct.
MIT — see LICENSE.
proofbundle is part of b7n0de, Verified AI Work · b7n0de.com