Skip to content

v1.20.0 — back open for business (resilient recovery + sharded Gate A)

Latest

Choose a tag to compare

@cgbarlow cgbarlow released this 17 Jun 10:13
· 223 commits to main since this release
86a357e

Headline: back open for business — resilient recovery after the "ruv sledgehammer". A contributor's ~337 direct prove(...) PRs ("the ruv sledgehammer", #1904) flooded Gate A and stranded ~200 valid proofs behind a saturated verifier queue. The backlog is now cleared and the swarm is hardened so it can't re-form: the stranded/failed direct PRs were closed, ~318 valid proofs re-routed onto the metered queue, the orphaned CI runs cancelled, and attribution fixed (ruvruvnet). Around that we shipped a full resilience-and-recovery stack across v1.17.1–v1.20.0 — admission control now offers the re-route on close so a flood self-heals, a reroute_stranded.py tool recovers a stranded proof without re-proving, a scheduled queue-dispatcher drains the queue governor-metered, a close-superseded sweep retires originals as their re-routes land, and a recovery playbook documents it all. This release lands the capacity piece: sharded Gate A kernel replay (ADR-063, #1909) cuts the ~1-hour verification long pole ~N× across parallel runners so verification keeps pace with submissions — every olean still kernel-replayed exactly once. The doors are open again, and the queue drains itself.

Added

  • Added sharded Gate A kernel replay (ADR-063 / SPEC-063-A) to attack the verification-capacity bottleneck (#1909). The full kernel replay is a ~1-hour serial pass over the whole library because leanchecker holds mathlib resident per process (two won't fit one runner) — but the work is already chunked and embarrassingly parallel across runners. New plan and replay-shard subcommands in tools/gate_a/parallel_modules.py partition the replay target set into N disjoint, covering shards (reusing the existing scope logic verbatim, so a shard never checks a different set than a full replay would) and replay one shard per parallel matrix runner, cutting the long pole to ~1/N wall-clock. Soundness is preserved — every olean is still kernel-replayed exactly once (the partition is unit-tested disjoint+covering, the matrix cover job fails closed on any non-green shard, and the daily full-replay backstop is unchanged), and shards share only the git SHA so leanchecker's inputs stay locally-derived (ADR-049). Ships with the tooling + tests and a non-required manual gate-a-shard-pilot workflow that validates the matrix on real runners; promoting it into the required gate-a.yml is a pilot-gated follow-up (ADR-058).
  • Added swarm/run.sh: a one-command launcher for the governed swarm (ADR-058). It starts the two halves of the coordinated --prove flow together — one dispatcher loop (agent.sh --dispatch-queue) plus one resilient prover loop (supervise.sh --prove "$@") — sharing the shell's UNSORRY_* env and torn down together on exit, so an operator runs ./swarm/run.sh instead of managing two commands. Args pass through to the prover. Run exactly one dispatcher; for more provers, start additional supervise.sh --prove. Added to the agent-lint shellcheck/bash-n gate and documented in CONTRIBUTING.

Changed

  • Made the aisp-advisory Gate B check incremental — it now validates only the .aisp coordination records changed in the PR/push (via git diff against the base, with fetch-depth: 0), instead of re-running npx aisp-validator@0.3.0 over every record in goals/translations/decompositions/library/index/proof-runs on each run. The per-file npx loop across thousands of immutable proof-runs/*.aisp was making this advisory check the slowest in the pipeline (~8 min); a typical PR touches a handful of records, so it now finishes in seconds. Behaviour is otherwise unchanged: same job name (aisp-advisory), still continue-on-error (never reddens, ADR-003 / issue #318), and it falls back to validating all records when there's no usable base diff (first push / shallow history). The swarm-contract validation is unchanged.
  • Gate A now routes prepare/archive, audit, and replay work to dedicated Namespace profiles (namespace-profile-unsorry-prepare, namespace-profile-unsorry-audit, and namespace-profile-unsorry-replay) so operators can scale each verifier lane independently.
  • The Gate A .lake/build GitHub Actions cache (ADR-045) now only runs on non-Namespace profiles (if: needs.detect.outputs.volume != 'true'). On a Namespace runner the .lake cache volume (ADR-046) already persists .lake/build across jobs and runs, so the GitHub cache was redundant churn — a ~320 MB per-commit entry (keyed on github.sha) that thrashed the 10 GB repo cache and created one near-duplicate entry per PR. Gating it keeps the cache list small and entries reused efficiently. Trade-off: on Namespace the #801 exact-key build-skip no longer fires, but the warm volume makes audit/replay's library build a cheap incremental, so throughput is unchanged.