Headline: back open for business — resilient recovery after the "ruv sledgehammer". A contributor's ~337 direct prove(...) PRs ("the ruv sledgehammer", #1904) flooded Gate A and stranded ~200 valid proofs behind a saturated verifier queue. The backlog is now cleared and the swarm is hardened so it can't re-form: the stranded/failed direct PRs were closed, ~318 valid proofs re-routed onto the metered queue, the orphaned CI runs cancelled, and attribution fixed (ruv → ruvnet). Around that we shipped a full resilience-and-recovery stack across v1.17.1–v1.20.0 — admission control now offers the re-route on close so a flood self-heals, a reroute_stranded.py tool recovers a stranded proof without re-proving, a scheduled queue-dispatcher drains the queue governor-metered, a close-superseded sweep retires originals as their re-routes land, and a recovery playbook documents it all. This release lands the capacity piece: sharded Gate A kernel replay (ADR-063, #1909) cuts the ~1-hour verification long pole ~N× across parallel runners so verification keeps pace with submissions — every olean still kernel-replayed exactly once. The doors are open again, and the queue drains itself.
Added
- Added sharded Gate A kernel replay (ADR-063 / SPEC-063-A) to attack the verification-capacity bottleneck (#1909). The full kernel replay is a ~1-hour serial pass over the whole library because
leancheckerholds mathlib resident per process (two won't fit one runner) — but the work is already chunked and embarrassingly parallel across runners. Newplanandreplay-shardsubcommands intools/gate_a/parallel_modules.pypartition the replay target set into N disjoint, covering shards (reusing the existing scope logic verbatim, so a shard never checks a different set than a full replay would) and replay one shard per parallel matrix runner, cutting the long pole to ~1/N wall-clock. Soundness is preserved — every olean is still kernel-replayed exactly once (the partition is unit-tested disjoint+covering, the matrix cover job fails closed on any non-green shard, and the daily full-replay backstop is unchanged), and shards share only the git SHA soleanchecker's inputs stay locally-derived (ADR-049). Ships with the tooling + tests and a non-required manualgate-a-shard-pilotworkflow that validates the matrix on real runners; promoting it into the requiredgate-a.ymlis a pilot-gated follow-up (ADR-058). - Added
swarm/run.sh: a one-command launcher for the governed swarm (ADR-058). It starts the two halves of the coordinated--proveflow together — one dispatcher loop (agent.sh --dispatch-queue) plus one resilient prover loop (supervise.sh --prove "$@") — sharing the shell'sUNSORRY_*env and torn down together on exit, so an operator runs./swarm/run.shinstead of managing two commands. Args pass through to the prover. Run exactly one dispatcher; for more provers, start additionalsupervise.sh --prove. Added to the agent-lint shellcheck/bash-n gate and documented in CONTRIBUTING.
Changed
- Made the
aisp-advisoryGate B check incremental — it now validates only the.aispcoordination records changed in the PR/push (viagit diffagainst the base, withfetch-depth: 0), instead of re-runningnpx aisp-validator@0.3.0over every record ingoals/translations/decompositions/library/index/proof-runson each run. The per-filenpxloop across thousands of immutableproof-runs/*.aispwas making this advisory check the slowest in the pipeline (~8 min); a typical PR touches a handful of records, so it now finishes in seconds. Behaviour is otherwise unchanged: same job name (aisp-advisory), stillcontinue-on-error(never reddens, ADR-003 / issue #318), and it falls back to validating all records when there's no usable base diff (first push / shallow history). The swarm-contract validation is unchanged. - Gate A now routes prepare/archive, audit, and replay work to dedicated Namespace profiles (
namespace-profile-unsorry-prepare,namespace-profile-unsorry-audit, andnamespace-profile-unsorry-replay) so operators can scale each verifier lane independently. - The Gate A
.lake/buildGitHub Actions cache (ADR-045) now only runs on non-Namespace profiles (if: needs.detect.outputs.volume != 'true'). On a Namespace runner the.lakecache volume (ADR-046) already persists.lake/buildacross jobs and runs, so the GitHub cache was redundant churn — a ~320 MB per-commit entry (keyed ongithub.sha) that thrashed the 10 GB repo cache and created one near-duplicate entry per PR. Gating it keeps the cache list small and entries reused efficiently. Trade-off: on Namespace the #801 exact-key build-skip no longer fires, but the warm volume makes audit/replay's library build a cheap incremental, so throughput is unchanged.