Releases: agent-axiom/agent-anvil

v0.2.65

09 Jun 22:10
3b6c13a

Summary

  • Add a public validate_maintainer_rerun_attestation API.
  • Add anvil leaderboard validate-rerun for explicit maintainer-side attestation checks.
  • Document validate-rerun in the leaderboard and CLI guides.

Verification

  • CI passed on PR #209 and PR #210.
  • Local gate passed: ruff format --check, ruff check, ty check, pytest -q.

v0.2.64

09 Jun 21:52
0333091

Summary

  • Expose maintainer rerun evidence fields in generated leaderboard rows, JSON, and CSV.
  • Align the Hugging Face leaderboard renderer with maintainer rerun evidence aliases.
  • Update leaderboard documentation and checked-in index schema.

Verification

  • CI passed on PR #206 and PR #207.
  • Local gate passed: ruff format --check, ruff check, ty check, pytest -q.

Agent Anvil v0.2.63

09 Jun 12:39
311c9c0

Highlights

  • Verify maintainer rerun GitHub Actions runs when leaderboard build uses --github-run.
  • Reject failed rerun runs, repository mismatches, and SHA mismatches.
  • Document that --github-run verifies maintainer rerun attestations too.
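
The three rejection cases above can be illustrated with a minimal sketch. It is not Agent Anvil's implementation: ExpectedRun and check_rerun_run are hypothetical names, and the run dictionary is assumed to carry the conclusion, head_sha, and repository.full_name fields that the GitHub REST API returns for a workflow run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRun:
    """What an attestation claims about the rerun (hypothetical shape)."""

    repository: str  # "owner/name"
    head_sha: str


def check_rerun_run(run: dict, expected: ExpectedRun) -> list[str]:
    """Return a list of problems; an empty list means the run is acceptable.

    `run` is assumed to look like a GitHub REST API workflow-run object.
    """
    problems = []
    conclusion = run.get("conclusion")
    if conclusion != "success":
        problems.append(f"run did not succeed: conclusion={conclusion!r}")
    repo = (run.get("repository") or {}).get("full_name")
    if repo != expected.repository:
        problems.append(f"repository mismatch: {repo!r}")
    sha = run.get("head_sha")
    if sha != expected.head_sha:
        problems.append(f"SHA mismatch: {sha!r}")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a caller report all mismatches for a rejected attestation in one pass.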

Verification

Agent Anvil v0.2.62

09 Jun 06:56
972de1c

Highlights

  • Add anvil leaderboard attest-rerun for generating maintainer rerun attestations.
  • Validate original and rerun leaderboard submissions before writing maintainer_rerun overlays.
  • Document the generated maintainer rerun attestation workflow.

Verification

Agent Anvil v0.2.61

09 Jun 04:05
165dd9d

Highlights

  • Add typed maintainer rerun attestations for leaderboard rows.
  • Add leaderboard build overlay support via --maintainer-reruns.
  • Validate attestation evidence hash, headline metrics, agent/benchmark identity, and GitHub Actions rerun URL.
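
As a rough illustration of the checks listed above, the sketch below compares an attestation against a leaderboard row. The field names (evidence_sha256, agent, benchmark, pass_rate, rerun_url) and the function name are assumptions made for this example, not the project's schema; only the github.com/<owner>/<repo>/actions/runs/<id> URL shape is taken from GitHub itself.

```python
import hashlib
import re

# Shape of a GitHub Actions run URL: https://github.com/<owner>/<repo>/actions/runs/<id>
RUN_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/actions/runs/\d+")


def attestation_problems(attestation: dict, row: dict, evidence: bytes) -> list[str]:
    """Return mismatches between an attestation and the row it covers."""
    problems = []
    # The attestation must commit to exactly the evidence bytes that were rerun.
    if attestation["evidence_sha256"] != hashlib.sha256(evidence).hexdigest():
        problems.append("evidence hash mismatch")
    # Identity and headline metric must agree with the leaderboard row.
    for key in ("agent", "benchmark", "pass_rate"):
        if attestation[key] != row[key]:
            problems.append(f"{key} mismatch")
    if not RUN_URL.fullmatch(attestation["rerun_url"]):
        problems.append("rerun URL is not a GitHub Actions run URL")
    return problems
```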

Verification

Agent Anvil v0.2.60

09 Jun 03:41
405bcbb

Highlights

  • Reject direct leaderboard submissions that self-claim maintainer_rerun trust.
  • Reserve maintainer_rerun labels for maintainer-side rerun attestations.
  • Update leaderboard workflow examples to v0.2.60.
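
The reserved-label rule might be sketched as follows. The trust field name and the function name are hypothetical; the notes above only establish that maintainer_rerun cannot be self-claimed by a direct submission.

```python
# Labels that only a maintainer-side attestation may assign.
RESERVED_TRUST_LABELS = frozenset({"maintainer_rerun"})


def reject_self_claimed_trust(submission: dict) -> None:
    """Raise ValueError if a direct submission claims a reserved trust label."""
    label = submission.get("trust")  # hypothetical field name
    if label in RESERVED_TRUST_LABELS:
        raise ValueError(f"direct submissions may not claim trust label {label!r}")
```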

Verification

Agent Anvil v0.2.59

09 Jun 03:21
c0f1796

Summary

  • Reject versioned results.json files when summary trial counts or pass rate do not match grades.
  • Document summary-vs-grades validation for persisted results artifacts.
  • Add parametrized storage tests for tampered aggregate summaries.
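
A summary-vs-grades check of this kind can be sketched as recomputing the aggregates from the grades and comparing them with the stored summary. The keys used here (trials, passed, pass_rate on the summary, passed on each grade) are assumptions for illustration and may not match the actual results.json schema.

```python
import math


def summary_matches_grades(summary: dict, grades: list[dict]) -> bool:
    """Recompute aggregates from grades and compare them with the stored summary."""
    trials = len(grades)
    passed = sum(1 for grade in grades if grade["passed"])
    expected_rate = passed / trials if trials else 0.0
    return (
        summary["trials"] == trials
        and summary["passed"] == passed
        # Compare floats with a tolerance rather than exact equality.
        and math.isclose(summary["pass_rate"], expected_rate, abs_tol=1e-9)
    )
```

A tampered aggregate, such as a pass_rate edited upward while the grades are left unchanged, fails the comparison because the recomputed value no longer agrees.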

Verification

  • PR #191 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
  • Release PR #192 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.

Agent Anvil v0.2.58

08 Jun 22:08
42d5f49

Summary

  • Validate that results.grades trace_path values point back to matching run traces.
  • Reject missing, mismatched, or path-traversal grade trace paths during run validation.
  • Surface grade trace path verification in artifact trust summaries.
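
One common way to reject missing and path-traversal trace paths is to resolve each path against the run directory and confirm the result stays inside it. The sketch below assumes that approach; resolve_trace_path is a hypothetical helper, and the check that a trace matches its grade is left out.

```python
from pathlib import Path


def resolve_trace_path(run_dir: Path, trace_path: str) -> Path:
    """Resolve trace_path inside run_dir, or raise ValueError."""
    root = run_dir.resolve()
    # Resolving collapses ".." segments and symlinks before the containment check.
    candidate = (root / trace_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"trace path escapes run directory: {trace_path!r}")
    if not candidate.is_file():
        raise ValueError(f"trace file is missing: {trace_path!r}")
    return candidate
```

Absolute paths are rejected by the same containment check, because joining an absolute path onto the run directory discards the run directory.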

Verification

  • PR #188 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
  • Release PR #189 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.

Agent Anvil v0.2.57

08 Jun 21:49
951716f

Summary

  • Add opt-in strict trace step validation for trace and run artifacts.
  • Keep default trace loading permissive for legacy/custom observer events.
  • Document strict validation for CI artifact consumers.
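
The difference between permissive loading and opt-in strict validation can be shown with a small sketch. The step types, the type key, and the validate_steps signature are all hypothetical and stand in for whatever the trace schema defines.

```python
# Hypothetical set of step types a strict consumer would accept.
KNOWN_STEP_TYPES = frozenset({"message", "tool_call", "tool_result"})


def validate_steps(steps: list[dict], *, strict: bool = False) -> list[dict]:
    """Return steps unchanged; in strict mode, reject unknown step types."""
    unknown = [step.get("type") for step in steps if step.get("type") not in KNOWN_STEP_TYPES]
    if strict and unknown:
        raise ValueError(f"unknown trace step types: {unknown!r}")
    # Permissive mode keeps legacy or custom observer events as they are.
    return steps
```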

Verification

  • PR #185 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
  • Release PR #186 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.

Agent Anvil v0.2.56

08 Jun 21:30
0d7628f

Summary

  • Share the assertion JSONPath grammar and resolver in anvil.jsonpath.
  • Make scenario validation and deterministic grading use one parser contract.
  • Add contract tests for supported, unsupported, missing, and array-indexed paths.

Verification

  • PR #182 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.
  • Release PR #183 CI passed: Demo Eval, Test Python 3.12, Test Python 3.14.