Releases: agent-axiom/agent-anvil
v0.2.65
09 Jun 22:10
Summary
add a public validate_maintainer_rerun_attestation API
add anvil leaderboard validate-rerun for explicit maintainer-side attestation checks
document validate-rerun in the leaderboard and CLI guides
Verification
CI passed on PR #209 and PR #210
local gate passed: ruff format --check, ruff check, ty check, pytest -q
v0.2.64
09 Jun 21:52
Summary
expose maintainer rerun evidence fields in generated leaderboard rows, JSON, and CSV
align the Hugging Face leaderboard renderer with maintainer rerun evidence aliases
update leaderboard documentation and checked-in index schema
Verification
CI passed on PR #206 and PR #207
local gate passed: ruff format --check, ruff check, ty check, pytest -q
Agent Anvil v0.2.63
09 Jun 12:39
Highlights
Verify maintainer rerun GitHub Actions runs when leaderboard build uses --github-run.
Reject rerun runs that failed, that belong to a different repository, or whose commit SHA does not match.
Document that --github-run verifies maintainer rerun attestations too.
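The three rejection rules above can be sketched as a small check over a run payload. This is an illustration only, not Agent Anvil's implementation: RerunClaim and check_rerun_run are hypothetical names, and the run dictionary is assumed to carry the conclusion, head_sha, and repository.full_name fields of a GitHub REST API workflow-run object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerunClaim:
    """Hypothetical view of what an attestation claims about its rerun."""

    repository: str  # "owner/name"
    head_sha: str


def check_rerun_run(run: dict, claim: RerunClaim) -> list[str]:
    """Return the problems found; an empty list means the run is acceptable.

    `run` is assumed to look like a GitHub workflow-run object.
    """
    problems = []
    conclusion = run.get("conclusion")
    if conclusion != "success":
        problems.append(f"run did not succeed (conclusion={conclusion!r})")
    repository = run.get("repository", {}).get("full_name")
    if repository != claim.repository:
        problems.append(f"repository mismatch: {repository!r}")
    if run.get("head_sha") != claim.head_sha:
        problems.append(f"SHA mismatch: {run.get('head_sha')!r}")
    return problems
```

Collecting every problem instead of stopping at the first one lets a maintainer see all the reasons a rerun was rejected in a single pass.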
Verification
Agent Anvil v0.2.62
09 Jun 06:56
Highlights
Add anvil leaderboard attest-rerun for generating maintainer rerun attestations.
Validate original and rerun leaderboard submissions before writing maintainer_rerun overlays.
Document the generated maintainer rerun attestation workflow.
Verification
Agent Anvil v0.2.61
09 Jun 04:05
Highlights
Add typed maintainer rerun attestations for leaderboard rows.
Add leaderboard build overlay support via --maintainer-reruns.
Validate attestation evidence hash, headline metrics, agent/benchmark identity, and GitHub Actions rerun URL.
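As a rough sketch of the checks listed above, the following compares an attestation against a leaderboard row. The field names (evidence_sha256, agent, benchmark, pass_rate, rerun_url) and the function names are assumptions made for illustration; the real attestation schema may differ.

```python
import hashlib
import hmac
from urllib.parse import urlparse


def evidence_sha256(evidence: bytes) -> str:
    """Hex SHA-256 digest of the raw evidence bytes."""
    return hashlib.sha256(evidence).hexdigest()


def is_actions_run_url(url: str) -> bool:
    """True for URLs shaped like https://github.com/<owner>/<repo>/actions/runs/<id>."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    return (
        parts.scheme == "https"
        and parts.netloc == "github.com"
        and len(segments) == 5
        and segments[2:4] == ["actions", "runs"]
        and segments[4].isdigit()
    )


def attestation_problems(attestation: dict, row: dict, evidence: bytes) -> list[str]:
    """Return mismatches between a (hypothetical) attestation and a row."""
    problems = []
    if not hmac.compare_digest(attestation["evidence_sha256"], evidence_sha256(evidence)):
        problems.append("evidence hash mismatch")
    for key in ("agent", "benchmark"):
        if attestation[key] != row[key]:
            problems.append(f"{key} identity mismatch")
    if attestation["pass_rate"] != row["pass_rate"]:
        problems.append("headline metric mismatch")
    if not is_actions_run_url(attestation["rerun_url"]):
        problems.append("rerun URL is not a GitHub Actions run")
    return problems
```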
Verification
Agent Anvil v0.2.60
09 Jun 03:41
Highlights
Reject direct leaderboard submissions that self-claim maintainer_rerun trust.
Reserve maintainer_rerun labels for maintainer-side rerun attestations.
Update leaderboard workflow examples to v0.2.60.
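The idea of a reserved trust label can be illustrated with a minimal guard. The trust field name and the function below are hypothetical; only the maintainer_rerun label comes from the notes above.

```python
# Labels that only a maintainer-side attestation may assign.
RESERVED_TRUST_LABELS = frozenset({"maintainer_rerun"})


def reject_self_claimed_trust(submission: dict) -> None:
    """Raise if a direct submission claims a reserved trust label.

    Assumes the submission stores its label under a "trust" key.
    """
    label = submission.get("trust")
    if label in RESERVED_TRUST_LABELS:
        raise ValueError(
            f"trust label {label!r} is reserved for maintainer rerun attestations"
        )
```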
Verification
Agent Anvil v0.2.59
09 Jun 03:21
Summary
Reject versioned results.json files when summary trial counts or pass rate do not match grades.
Document summary-vs-grades validation for persisted results artifacts.
Add parametrized storage tests for tampered aggregate summaries.
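A summary-vs-grades check of this kind can be sketched as recomputing the aggregates from the per-trial grades and comparing them with the stored summary. The key names used here (grades, passed, summary, trials, pass_rate) are assumptions for illustration and may not match the versioned results.json schema.

```python
import math


def summary_matches_grades(results: dict) -> bool:
    """Recompute aggregates from grades and compare with the stored summary."""
    grades = results["grades"]
    summary = results["summary"]
    trials = len(grades)
    passed = sum(1 for grade in grades if grade["passed"])
    # Define the pass rate of an empty run as 0.0 to avoid dividing by zero.
    expected_rate = passed / trials if trials else 0.0
    return (
        summary["trials"] == trials
        and summary["passed"] == passed
        and math.isclose(summary["pass_rate"], expected_rate, abs_tol=1e-9)
    )
```

Comparing the pass rate with a tolerance avoids rejecting honest files over floating-point rounding while still catching an edited aggregate.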
Verification
PR #191 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Release PR #192 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Agent Anvil v0.2.58
08 Jun 22:08
Summary
Validate that results.grades trace_path values point back to matching run traces.
Reject missing, mismatched, or path-traversal grade trace paths during run validation.
Surface grade trace path verification in artifact trust summaries.
Verification
PR #188 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Release PR #189 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Agent Anvil v0.2.57
08 Jun 21:49
Summary
Add opt-in strict trace step validation for trace and run artifacts.
Keep default trace loading permissive for legacy/custom observer events.
Document strict validation for CI artifact consumers.
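The split between permissive default loading and opt-in strict validation can be pictured with a keyword flag. Everything in this sketch is hypothetical, including the step type names and the load_steps function; it only illustrates the pattern described above.

```python
# Hypothetical set of step types that strict mode recognizes.
KNOWN_STEP_TYPES = frozenset({"message", "tool_call", "tool_result"})


def load_steps(steps: list[dict], *, strict: bool = False) -> list[dict]:
    """Return a copy of the steps, validating step types only when strict."""
    if strict:
        for index, step in enumerate(steps):
            step_type = step.get("type")
            if step_type not in KNOWN_STEP_TYPES:
                raise ValueError(f"step {index}: unknown type {step_type!r}")
    # Permissive mode keeps legacy or custom observer events untouched.
    return list(steps)
```

Keeping strictness off by default means existing traces continue to load, while a CI consumer can opt in and fail fast on unexpected events.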
Verification
PR #185 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Release PR #186 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Agent Anvil v0.2.56
08 Jun 21:30
Summary
Share the assertion JSONPath grammar and resolver in anvil.jsonpath.
Make scenario validation and deterministic grading use one parser contract.
Add contract tests for supported, unsupported, missing, and array-indexed paths.
Verification
PR #182 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
Release PR #183 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.