Skip to content

parallelise: review follow-ups (metrics, FFI hoist, determinism, prewarm)#796

Merged
ch4r10t33r merged 2 commits into
mainfrom
fix/pr-780-review-followups
Apr 28, 2026
Merged

parallelise: review follow-ups (metrics, FFI hoist, determinism, prewarm)#796
ch4r10t33r merged 2 commits into
mainfrom
fix/pr-780-review-followups

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

Follow-up to #780 (now merged into main). Targets main. Each fix is independent — no consensus impact, all corrections are local to the parallel paths introduced by #780.

Bot review classification

Adversarial review on #780 raised six concerns. Verdict + action:

# Concern Verdict This PR
1 pool.scope synchronicity — Critical, memory safety Invalid. lib.zig:466-484 does defer self.scopeWait(&s) and the wait loop drains pending == 0 before return. Doc string explicitly says "block until every task spawned on scope has completed" No fix needed
2 Unvetted thread-pool dep — High, supply chain Process concern, not addressable in code Out of scope. Requires governance decision (vendor / fork / continue pinning by hash)
3 XMSS FFI thread-safety — High Mostly invalid, but real adjacent risk Pre-warm xmss.setupVerifier() on main thread; hoist xmss.PublicKey.fromBytes out of workers in compactAttestations
4 Thread-pool ownership inconsistency — Medium Valid observation, not a bug: cli/main.zig (devnet) and cli/node.zig (Node struct) are mutually exclusive code paths. No deployment runs both No structural change. Documented invariants on chain.thread_pool so future consumers understand the contract
5 compactAttestations not fully reviewed — Medium, liveness risk Mostly invalid (aggregation_bits derived from same proof.participants, cannot diverge). But real determinism gap missed by the bot: hashmap iteration order Sort group_entries deterministically by AttestationData
6 Metrics timer semantics — Low Valid Per-task histogram observation in verifySignaturesParallel, matching serial granularity

Bot also missed: shared allocator across parallel workers (production = GPA = thread-safe; documented), and setupVerifier first-call race (fixed by pre-warm).

Changes

pkgs/state-transition/src/transition.zig

  • VerifyTask carries elapsed_ns: u64. Worker times the FFI verify call with std.time.Timer (monotonic) and stores ns into the task slot.
  • Post-pool emit replaces the single batch observe() with one Histogram.record(elapsed_s) per verified task. Histogram percentiles now match the serial path's granularity.

pkgs/types/src/block.zig

  • New CompactGroupPrep carries pre-built per-child []*const HashSigPublicKey slices.
  • New helpers compactSingleProof / compactMultiProofWithPrep / runCompactGroupPrep replace the monolithic compactAttestationGroup.
  • All xmss.PublicKey.fromBytes calls happen serially in a pre-phase before pool.scope. Workers receive prebuilt handles and only invoke aggregate() (which takes const handles).
  • pubkey_wrappers ArrayList owns wrapper lifetime for the duration of the scope call; freed on unwind.
  • group_entries sorted by AttestationData (slot → head/target/source root bytes → checkpoint slots) before any processing. Both serial and parallel branches consume the sorted list.

pkgs/cli/src/main.zig and pkgs/cli/src/node.zig

  • Both pool-creation sites call xmss.setupVerifier() immediately after ThreadPool.init, before any consumer runs a parallel verify. Removes first-time-init race regardless of Rust implementation.

pkgs/node/src/chain.zig

  • Expanded doc comment on BeamChain.thread_pool to spell out the three invariants new consumers must preserve (allocator thread-safety, setupVerifier pre-warm, PublicKeyCache non-thread-safety).

Test plan

  • zig build all — clean rebuild, EXIT=0.
  • zig build test — all existing unit tests pass.
  • zeam --help — built CLI starts and prints usage.
  • Devnet: confirm lean_pq_sig_aggregated_signatures_verification_time_seconds percentiles look reasonable under the parallel path (should match serial baseline).
  • Devnet: confirm two validators producing the same attestation set emit byte-identical blocks (covered by deterministic sort).

…minism)

Addresses adversarial review on PR #780. Each fix is independent of the
others; all of them target correctness or determinism risks introduced
by moving signature verification and attestation compaction onto a
shared work-stealing thread pool.

state-transition: per-task histogram observation
  verifySignaturesParallel previously wrapped one batch_timer around the
  entire pool.scope call, so the
  lean_pq_sig_aggregated_signatures_verification_time_seconds histogram
  received exactly one sample per block — while the serial path observes
  once per attestation. Mixing the two granularities into the same
  histogram silently distorts P50/P99 between deployments. Each
  VerifyTask now records its own elapsed_ns inside the worker (using
  std.time.Timer for monotonic timing), and the post-pool emit calls
  Histogram.record per verified task.

types/block: deterministic group ordering
  std.AutoHashMap iterator order is not stable across runs (insertion
  order is preserved only until the next rehash), so two validators
  producing identical attestation sets could emit byte-different blocks.
  Sort group_entries by AttestationData (slot, head/target/source root
  bytes, then checkpoint slots) before processing in either the serial
  or parallel branch.

types/block: hoist xmss.PublicKey.fromBytes out of parallel workers
  Previously compactAttestationGroup called xmss.PublicKey.fromBytes
  inside each worker thread. Rust-side pubkey deserialization is not
  documented as Send, and setupVerifier (called transitively) carries
  first-time-init races. New CompactGroupPrep is built serially before
  pool.scope: every fromBytes call happens on the main thread, workers
  only invoke aggregate() on already-deserialized handles. The shared
  pubkey_wrappers ArrayList owns the wrappers' lifetime across the
  scope call.

cli: pre-warm xmss.setupVerifier after pool init
  Both the multi-node devnet path (pkgs/cli/src/main.zig) and the
  single-node Node struct path (pkgs/cli/src/node.zig) now call
  xmss.setupVerifier() once on the main thread immediately after
  ThreadPool.init. The Rust verifier setup is documented as idempotent
  but is not hardened against first-time-init races between concurrent
  callers; pre-warming on the main thread removes the race regardless
  of the Rust implementation.

node/chain: document thread_pool invariants
  Expanded the doc comment on BeamChain.thread_pool to spell out the
  three thread-safety invariants new consumers must preserve:
    1. chain.allocator must be safe for concurrent use (today: GPA).
    2. xmss.setupVerifier must be called on the main thread before the
       first parallel verify.
    3. xmss.PublicKeyCache is NOT thread-safe; cache access stays in
       the serial pre-phase only.
@ch4r10t33r ch4r10t33r marked this pull request as ready for review April 28, 2026 15:39
@ch4r10t33r ch4r10t33r requested a review from zclawz April 28, 2026 15:42
@ch4r10t33r ch4r10t33r merged commit dacc1c2 into main Apr 28, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/pr-780-review-followups branch April 28, 2026 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants