proofbundle

Turn an AI eval result into one portable, offline-verifiable receipt. It proves who signed these exact bytes and that nothing changed since — not that the number is true. Ed25519 + RFC 6962 Merkle, one file, no server, no network.

The problem

Every AI eval number you read — a safety benchmark, a capability score, a leaderboard entry — is an unverifiable claim. You trust the lab. There's no portable way to check, offline, that a result was signed by a stated party, hasn't been altered, and covers the samples it claims.

proofbundle is that check. It's a small MIT-licensed Python tool that turns a result into a signed receipt anyone can verify from a single file, and it's honest about the line it does not cross. Its trusted core is compact and auditable, and depends only on the cryptography package.

60-second try (offline, no setup)

pip install "proofbundle[eval]"
proofbundle demo

You'll see an honest receipt verify => OK, then six independent tampers each verify FAILED, then a swapped sample get caught — all in memory. The command exits non-zero if any tamper slips through, so it's also a self-test. Full walkthrough: docs/DEMO.md.

# your own receipt, from a signed payload:
proofbundle emit --payload-file result.json --new-key signer.key --out receipt.json
proofbundle verify receipt.json        # exit 0 = OK, 1 = failed, 2 = malformed
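For scripting around the verify step, a minimal Python sketch could map the three exit codes stated above to labels. The helper names `interpret_exit_code` and `verify_receipt` are hypothetical and not part of proofbundle; only the `proofbundle verify <file>` invocation and the 0/1/2 codes come from the text above.

```python
import subprocess

# Exit-code meanings as documented above: 0 = OK, 1 = failed, 2 = malformed.
EXIT_MEANINGS = {0: "OK", 1: "FAILED", 2: "MALFORMED"}


def interpret_exit_code(code: int) -> str:
    """Map a `proofbundle verify` exit code to a label (hypothetical helper)."""
    # Any code outside the documented set is reported as UNKNOWN, never as OK.
    return EXIT_MEANINGS.get(code, "UNKNOWN")


def verify_receipt(path: str) -> str:
    """Run `proofbundle verify <path>` and return the label for its exit code."""
    # check=False: a non-zero exit is an expected outcome here, not an exception.
    completed = subprocess.run(["proofbundle", "verify", path], check=False)
    return interpret_exit_code(completed.returncode)
```

In a CI job, treating anything other than "OK" as a failure keeps an unexpected exit code from being read as success.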

What a receipt proves — and what it doesn't

✅ It proves	❌ It does not prove
These exact bytes were signed by this key (authorship)	That the number is true
Nothing changed since signing (integrity, Ed25519 + RFC 6962)	That the issuer is honest
The result is attributable to a stated issuer	That the eval was well-designed
A threshold was met while hiding the model/dataset (salted commitments)	That there was no cherry-picking — unless pre-registered
Optionally: individual samples, offline-auditable (per-sample Merkle)	That the computation was correct — that needs a TEE or independent reproduction

This boundary is the point, not a weakness. A receipt makes a claim attributable, tamper-evident, and — with pre-registration and per-sample auditing — bounded and spot-checkable. Full detail: THREAT_MODEL.md.
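To make the salted-commitment row of the table concrete, here is a generic sketch using only the standard library. It is an illustration under assumptions, not proofbundle's actual commitment format (EVAL_CLAIM.md and SPEC.md define that); the function names and the SHA-256 over salt-then-value construction are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit(value: str, salt: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Return (commitment_hex, salt) for a value such as a model or dataset name."""
    # A fresh 32-byte random salt stops anyone from guessing the value by
    # hashing a list of likely candidates and comparing digests.
    salt = secrets.token_bytes(32) if salt is None else salt
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return digest, salt


def opens_to(commitment: str, value: str, salt: bytes) -> bool:
    """Check that a revealed (value, salt) pair matches a published commitment."""
    expected, _ = commit(value, salt)
    return hmac.compare_digest(expected, commitment)
```

The issuer publishes only the commitment inside the signed claim and keeps the salt private. Revealing the value and salt later lets an auditor confirm which model or dataset the claim was about, without the receipt ever exposing it up front.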

How it fits together

flowchart LR
    H["eval harness<br/>inspect_ai · lm-eval · promptfoo · pytest"] --> A["adapter → signed claim<br/>salted commitments · provenance · samples root"]
    A --> R["receipt<br/>one portable file"]
    R --> V{{"proofbundle verify — offline"}}
    V --> C["signature · Merkle inclusion · SD-JWT/KB ·<br/>witness quorum · status list · sample openings"]
    C --> OK(["=> OK / FAILED"])
    style V fill:#D6248A,stroke:#D6248A,color:#fff
    style OK fill:#D6248A,stroke:#D6248A,color:#fff

What's in the box

Core — Ed25519 signature + RFC 6962 / 9162 Merkle inclusion, verified fully offline. Checks a real Sigstore Rekor proof, so correctness isn't self-referential.
Eval receipts — a signed claim (metric ⋈ threshold, n, salted model/dataset commitments, assurance level, provenance) from your run. See EVAL_CLAIM.md.
Selective disclosure — SD-JWT (RFC 9901) with Key Binding: prove a threshold while withholding the exact score.
Transparency-log interop — C2SP tlog-checkpoint / cosignature / .tlog-proof, with post-quantum ML-DSA-44 witness cosignatures. Optional Token-Status-List revocation snapshots.
Per-sample audit — commit to every sample; an auditor challenges random indices (with a fresh nonce or a public randomness beacon, v1.9) and openings must bind to the signed root. Catches 1% sample-doctoring with 95% confidence at 300 samples, regardless of run size.
Pre-registration — proofbundle prereg <plan> commits to the protocol before the run, so best-of-many publishing becomes visible.
Integrations — opt-in inspect_ai end-of-task hook and pytest plugin (emit only when PROOFBUNDLE_EMIT=1 / --proofbundle), plus a Hugging Face Community Evals bridge. See INTEGRATIONS.md.
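The per-sample audit figure above (1% doctoring, 300 samples, 95% confidence) can be checked with a short calculation. This sketch assumes each challenged index is drawn uniformly and independently; drawing without replacement from a finite run can only raise the detection probability, which is why the bound does not depend on run size. The function names are hypothetical.

```python
import math


def detection_probability(doctored_fraction: float, challenges: int) -> float:
    """Probability that at least one challenged sample is a doctored one."""
    # Each independent challenge misses a doctored sample with probability
    # (1 - doctored_fraction), so all challenges miss with that to the power n.
    return 1.0 - (1.0 - doctored_fraction) ** challenges


def challenges_needed(doctored_fraction: float, confidence: float) -> int:
    """Smallest number of challenges reaching the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - doctored_fraction))


if __name__ == "__main__":
    print(round(detection_probability(0.01, 300), 3))  # about 0.951
    print(challenges_needed(0.01, 0.95))               # 299
```

Under these assumptions 299 challenges already reach 95%, so 300 is a round number just above the threshold.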

Docs

For…	Read
Skeptics (why not SHA-256 / Sigstore / trust the issuer)	docs/FAQ.md
New to this? plain-terms glossary	docs/GLOSSARY.md
Reviewers (30-minute adversarial audit path)	docs/REVIEWERS.md
Where every trust anchor comes from	docs/TRUST_ANCHORS.md
The demos, tier by tier	docs/DEMO.md
The normative format + verification order	SPEC.md
Honest comparison to Rekor / in-toto / OMS / ValiChord	INTEROP.md
Regulatory mapping (and what to never claim)	COMPLIANCE.md
Funders / role fit	docs/PROJECT_BRIEF.md
Preview: TEE-attestation bridge (v2.0 beta)	docs/EXPERIMENTAL_ENCLAVE.md

Install

pip install proofbundle                 # core: offline verify + plain emit (no dependencies beyond cryptography)
pip install "proofbundle[eval]"          # + eval receipts, prereg, and the demo (adds an RFC 8785 JCS canonicalizer)
pip install "proofbundle[inspect]"      # inspect_ai adapter + hook
pip install "proofbundle[pq]"           # verify ML-DSA-44 (post-quantum) witness cosignatures

Requires Python 3.10+. The verify path never rolls its own crypto — Ed25519 comes from cryptography; Merkle hashing is RFC 6962.
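For readers unfamiliar with RFC 6962 Merkle hashing, the following standard-library sketch shows the leaf and node hashing rules and an inclusion-proof check in the style of RFC 9162. It is not proofbundle's code, and all function names here are hypothetical; it only illustrates what "Merkle inclusion" means.

```python
import hashlib
import hmac
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962: leaves are hashed with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962: interior nodes are hashed with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    """Largest power of two strictly less than n (for n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        return hashlib.sha256(b"").digest()  # root of the empty tree
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_path(index: int, leaves: List[bytes]) -> List[bytes]:
    """Audit path for leaves[index], ordered from the leaf up to the root."""
    if len(leaves) <= 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_path(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return inclusion_path(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def verify_inclusion(index: int, tree_size: int, leaf: bytes,
                     path: List[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf and its audit path, then compare."""
    if index < 0 or index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf_hash(leaf)
    for p in path:
        if sn == 0:
            return False  # path is longer than the tree allows
        if fn & 1 or fn == sn:
            r = node_hash(p, r)  # sibling is on the left
            if not fn & 1:
                # Skip levels where this node has no right sibling.
                while fn and not fn & 1:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)  # sibling is on the right
        fn >>= 1
        sn >>= 1
    return sn == 0 and hmac.compare_digest(r, root)
```

The distinct 0x00 and 0x01 prefixes keep a leaf from being passed off as an interior node, which is why tools built on RFC 6962 use them rather than plain SHA-256 over concatenated data.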

Status & scope

Beta, SemVer-committed, 303 tests + a CI mutation gate + property-based parser fuzzing. Correctness is anchored to external RFC 6962 vectors and a real Rekor proof, not just its own bundles. It is not a log service, a full in-toto client, a TEE, a consensus network, or a compliance product by itself — it is the small, offline, standards-native receipt layer between them. Security policy: SECURITY.md.

Contributing

See CONTRIBUTING.md and the Code of Conduct. Good first issues are labeled good-first-issue; security findings go through SECURITY.md. The verifier core aims to stay small, dependency-light, and correct.

License

MIT — see LICENSE.

proofbundle is part of b7n0de, Verified AI Work · b7n0de.com
