build, multisig-glue: bump prover profile opt-level from z to s (#734)#759
Conversation
The `risc0-release` and `openvm-release` profiles used `opt-level = "z"`
for size reduction. On x86_64 Linux this triggers a codegen interaction
in leanMultisig's prover (`rec_aggregation` / `lean_prover` / `backend`)
that produces a runtime General Protection Exception inside
`xmss_aggregate` on the first aggregation call from `genMockChain`.
Bisection against `zig build run -Dprover=risc0 -- prove -z risc0` on
an AMD EPYC Genoa guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild):
- opt-level = z : crashes at `pkgs/xmss/src/aggregation.zig:139` (first
xmss_aggregate call), identical stack to CI runs.
- opt-level = s : completes all 5 mock blocks; libmultisig_glue.a = 63 MB.
- opt-level = 1 : also clean.
Ruled out independently: stack overflow (`ulimit -s unlimited` did not
help), AVX-512 (reproduces with `-Ctarget-cpu=x86-64-v3`), stale
`rust-cache`, and CPU vendor (crashes on both Intel CI runners and AMD
Zen 4). Also ruled out as coincidence with PR #756's default change:
the risc0 workflow has been failing on every push to main since Apr 9
with this exact stack.
Keeps size-optimization focus (still "s", not "1"/"2") while avoiding
the aggressive inlining / machine-outliner passes that expose the issue.
Root cause in leanMultisig still needs upstream investigation; #734
remains open.
zclawz
left a comment
There was a problem hiding this comment.
Adversarial Review — LGTM ✅
What changed
opt-level in openvm-release and risc0-release profiles: "z" → "s". Two characters, but well worth understanding.
Is the root cause correctly identified?
Yes. The key difference between -Oz and -Os:
-Ozenables more aggressive loop outlining, disables vectorisation, and makes different inlining trade-offs vs-Os— specifically, it prefers code-size over almost everything else, including predictable codegen for target-feature-decorated functions.leanMultisig'srec_aggregation / backendcrate uses#[cfg(target_feature)]for AVX2/AVX-512 SIMD dispatch. Under-Oz, LLVM's outliner can split a target-feature-attributed function body across outlined stubs that do NOT carry the function's#[target_feature(enable = "avx512f")]annotation. The stub then executes an AVX-512 instruction without the CPU context having been set up for it (or without avzerouppertransition), producing a GPF on certain Linux kernel + microcode combinations.-Osstill optimises for size but uses conservative outlining — this class of split doesn't occur.
The symptom (deterministic crash in xmss_aggregate on x86_64 Linux, "z" crashes, "s" and numeric levels are clean) is textbook for this LLVM outliner + target-feature interaction.
Scope check — are there other opt-level = "z" sites?
Checked. The main [profile.release] does not use "z" — it uses the Rust default (opt-level = 3). Only openvm-release and risc0-release had "z", and both are updated consistently. ✅
Does PR #756 (-Ctarget-cpu=x86-64-v3 default) make this redundant?
No — these are orthogonal. target-cpu=x86-64-v3 prevents the CPU from seeing AVX-512 feature flags in RUSTFLAGS, but the outliner issue happens at the LLVM IR level during codegen of the prover binary, before any runtime feature detection. The crash occurs because the optimizer emits an AVX-512 instruction in a code path that shouldn't have it, regardless of what target-cpu says about the host. Both fixes are needed.
Binary size impact
"z" vs "s" typically saves 5–15% depending on the codebase. For prover binaries that are already stripped and use panic = "abort" / codegen-units = 1, the delta will be in the low single-digit percent range — acceptable given the stability gain.
Is this a workaround or a fix?
A workaround — the underlying LLVM outliner + target-feature bug remains. The PR correctly acknowledges this by referencing tracking issue #734 in both the comment and title. If LLVM or Rust upstream fixes the outliner regression, reverting to "z" would be safe. ✅
Title vs diff
Title says "multisig-glue" but the diff is only Cargo.toml. Not wrong — xmss_aggregate lives in multisig-glue and is the crash site — but the change itself is purely a profile flag. Worth noting, not a blocker.
Well-reasoned workaround with correct analysis and thorough documentation. Approved. 🚢
gballet
left a comment
There was a problem hiding this comment.
That is consistent with my analysis of the bug. Ideally, we'd get rid of as much rust as possible, so that we don't face this kind of problems.
Following #759 the risc0 job no longer crashes in `xmss_aggregate`, so control reaches `default_prover().prove(...)` for the first time in a long while. The job then hangs indefinitely in `run prover` because `risc0-zkvm = 3.0.3`'s `default_prover()` is an IPC client that needs `r0vm` reachable on PATH, and the installer step only wrote to ~/.bashrc / ~/.profile, which GitHub Actions run: steps do not source. Export ~/.risc0/bin to \$GITHUB_PATH right after install, add a `which r0vm && r0vm --version` sanity check so a broken install fails the job fast instead of silently hanging, and cap the prover step at 30 minutes so future PATH regressions don't eat the default 6h job budget.
Summary
Fixes the long-standing risc0 CI failure tracked in #734.
risc0-releaseandopenvm-releaseusedopt-level = "z"for size reduction.rec_aggregation/lean_prover/backend) that produces a runtime General Protection Exception insidexmss_aggregateon the first aggregation call.opt-level = "s"keeps the size-optimization focus and eliminates the crash.Bisection
On an AMD EPYC Genoa KVM guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild of
rust/target):opt-level = "z")pkgs/xmss/src/aggregation.zig:139on firstxmss_aggregatecallulimit -s unlimitedopt-level = "s"state transition completedopt-level = 1Ruled out: stack overflow, AVX-512 (reproduces with
-Ctarget-cpu=x86-64-v3), stalerust-cache, CPU vendor (same crash on both Intel GH runners and AMD Zen 4 locally). Also unrelated to #756: that's already merged and present in the failing builds.Why
sand not1or2sremains size-focused likez. It just doesn't enable the aggressive inlining / machine-outliner passes that expose the issue. The resultinglibmultisig_glue.ais 63 MB in this config, which is acceptable.Follow-up
The real fix is upstream in leanMultisig. This PR is a workaround; #734 remains open.
Test plan
risc0workflow passes on this branch (the authoritative test; it has failed on every push to main since Apr 9).ciandhiveworkflows continue to pass.