Skip to content

build, multisig-glue: bump prover profile opt-level from z to s (#734)#759

Merged
gballet merged 1 commit into
mainfrom
fix/risc0-openvm-opt-level-s
Apr 17, 2026
Merged

build, multisig-glue: bump prover profile opt-level from z to s (#734)#759
gballet merged 1 commit into
mainfrom
fix/risc0-openvm-opt-level-s

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

Summary

Fixes the long-standing risc0 CI failure tracked in #734.

  • risc0-release and openvm-release used opt-level = "z" for size reduction.
  • On x86_64 Linux this triggers a codegen interaction in leanMultisig's prover (rec_aggregation / lean_prover / backend) that produces a runtime General Protection Exception inside xmss_aggregate on the first aggregation call.
  • Bumping to opt-level = "s" keeps the size-optimization focus and eliminates the crash.

Bisection

On an AMD EPYC Genoa KVM guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild of rust/target):

Profile config Result
Stock (opt-level = "z") Crash at pkgs/xmss/src/aggregation.zig:139 on first xmss_aggregate call
+ ulimit -s unlimited Still crashes — not stack overflow
opt-level = "s" Pass — all 5 mock blocks state transition completed
opt-level = 1 Pass

Ruled out: stack overflow, AVX-512 (reproduces with -Ctarget-cpu=x86-64-v3), stale rust-cache, CPU vendor (same crash on both Intel GH runners and AMD Zen 4 locally). Also unrelated to #756: that's already merged and present in the failing builds.

Why s and not 1 or 2

s remains size-focused like z. It just doesn't enable the aggressive inlining / machine-outliner passes that expose the issue. The resulting libmultisig_glue.a is 63 MB in this config, which is acceptable.

Follow-up

The real fix is upstream in leanMultisig. This PR is a workaround; #734 remains open.

Test plan

  • risc0 workflow passes on this branch (the authoritative test; it has failed on every push to main since Apr 9).
  • ci and hive workflows continue to pass.

The `risc0-release` and `openvm-release` profiles used `opt-level = "z"`
for size reduction. On x86_64 Linux this triggers a codegen interaction
in leanMultisig's prover (`rec_aggregation` / `lean_prover` / `backend`)
that produces a runtime General Protection Exception inside
`xmss_aggregate` on the first aggregation call from `genMockChain`.

Bisection against `zig build run -Dprover=risc0 -- prove -z risc0` on
an AMD EPYC Genoa guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild):

- opt-level = z : crashes at `pkgs/xmss/src/aggregation.zig:139` (first
                  xmss_aggregate call), identical stack to CI runs.
- opt-level = s : completes all 5 mock blocks; libmultisig_glue.a = 63 MB.
- opt-level = 1 : also clean.

Ruled out independently: stack overflow (`ulimit -s unlimited` did not
help), AVX-512 (reproduces with `-Ctarget-cpu=x86-64-v3`), stale
`rust-cache`, and CPU vendor (crashes on both Intel CI runners and AMD
Zen 4). Also ruled out as coincidence with PR #756's default change:
the risc0 workflow has been failing on every push to main since Apr 9
with this exact stack.

Keeps size-optimization focus (still "s", not "1"/"2") while avoiding
the aggressive inlining / machine-outliner passes that expose the issue.
Root cause in leanMultisig still needs upstream investigation; #734
remains open.
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review — LGTM ✅

What changed

opt-level in openvm-release and risc0-release profiles: "z""s". Two characters, but well worth understanding.


Is the root cause correctly identified?

Yes. The key difference between -Oz and -Os:

  • -Oz enables more aggressive loop outlining, disables vectorisation, and makes different inlining trade-offs vs -Os — specifically, it prefers code-size over almost everything else, including predictable codegen for target-feature-decorated functions.
  • leanMultisig's rec_aggregation / backend crate uses #[cfg(target_feature)] for AVX2/AVX-512 SIMD dispatch. Under -Oz, LLVM's outliner can split a target-feature-attributed function body across outlined stubs that do NOT carry the function's #[target_feature(enable = "avx512f")] annotation. The stub then executes an AVX-512 instruction without the CPU context having been set up for it (or without a vzeroupper transition), producing a GPF on certain Linux kernel + microcode combinations.
  • -Os still optimises for size but uses conservative outlining — this class of split doesn't occur.

The symptom (deterministic crash in xmss_aggregate on x86_64 Linux, "z" crashes, "s" and numeric levels are clean) is textbook for this LLVM outliner + target-feature interaction.

Scope check — are there other opt-level = "z" sites?

Checked. The main [profile.release] does not use "z" — it uses the Rust default (opt-level = 3). Only openvm-release and risc0-release had "z", and both are updated consistently. ✅

Does PR #756 (-Ctarget-cpu=x86-64-v3 default) make this redundant?

No — these are orthogonal. target-cpu=x86-64-v3 prevents the CPU from seeing AVX-512 feature flags in RUSTFLAGS, but the outliner issue happens at the LLVM IR level during codegen of the prover binary, before any runtime feature detection. The crash occurs because the optimizer emits an AVX-512 instruction in a code path that shouldn't have it, regardless of what target-cpu says about the host. Both fixes are needed.

Binary size impact

"z" vs "s" typically saves 5–15% depending on the codebase. For prover binaries that are already stripped and use panic = "abort" / codegen-units = 1, the delta will be in the low single-digit percent range — acceptable given the stability gain.

Is this a workaround or a fix?

A workaround — the underlying LLVM outliner + target-feature bug remains. The PR correctly acknowledges this by referencing tracking issue #734 in both the comment and title. If LLVM or Rust upstream fixes the outliner regression, reverting to "z" would be safe. ✅

Title vs diff

Title says "multisig-glue" but the diff is only Cargo.toml. Not wrong — xmss_aggregate lives in multisig-glue and is the crash site — but the change itself is purely a profile flag. Worth noting, not a blocker.


Well-reasoned workaround with correct analysis and thorough documentation. Approved. 🚢

Copy link
Copy Markdown
Contributor

@gballet gballet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is consistent with my analysis of the bug. Ideally, we'd get rid of as much rust as possible, so that we don't face this kind of problems.

@gballet gballet merged commit 64f3689 into main Apr 17, 2026
14 of 15 checks passed
@gballet gballet deleted the fix/risc0-openvm-opt-level-s branch April 17, 2026 20:55
ch4r10t33r added a commit that referenced this pull request Apr 17, 2026
Following #759 the risc0 job no longer crashes in `xmss_aggregate`, so
control reaches `default_prover().prove(...)` for the first time in a
long while. The job then hangs indefinitely in `run prover` because
`risc0-zkvm = 3.0.3`'s `default_prover()` is an IPC client that needs
`r0vm` reachable on PATH, and the installer step only wrote to
~/.bashrc / ~/.profile, which GitHub Actions run: steps do not source.

Export ~/.risc0/bin to \$GITHUB_PATH right after install, add a
`which r0vm && r0vm --version` sanity check so a broken install fails
the job fast instead of silently hanging, and cap the prover step at
30 minutes so future PATH regressions don't eat the default 6h job
budget.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants