Skip to content

node, types: fix aggregator worker segfault on aggregate commit#933

Merged
ch4r10t33r merged 12 commits into
mainfrom
fix/aggregator-commit-segfault
May 26, 2026
Merged

node, types: fix aggregator worker segfault on aggregate commit#933
ch4r10t33r merged 12 commits into
mainfrom
fix/aggregator-commit-segfault

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

Summary

  • Fix double-free in computeSingleAggregatedSignature when the single-child passthrough (.done) path returns: set prep.outcome = .skip before returning, matching the existing .ffi path guard.
  • Fix use-after-corruption in commitOneAggregateResult: ssz.serialize corrupts the source proof in memory (documented in sszCloneAndGetBytes). Use one serialize pass to store the proof and deserialize a separate copy for the publish path, instead of returning the corrupted original.

These match the SIGSEGV stack seen on devnet-4 zeam_8 (exit 139) in the aggregate worker at sszClone / commitOneAggregateResult.

Test plan

  • Added regression test: computeSingleAggregatedSignature: single-child passthrough survives prep deinit
  • zig build test --summary all
  • Deploy to devnet-4 aggregator and confirm zeam_8 no longer crash-loops after publishing aggregations

Prevent prep.deinit from freeing a moved single-child aggregate result,
and avoid reusing proof memory after ssz.serialize corrupts it during
commitOneAggregateResult clone-on-insert.
Guard ethlibp2p rpcCallbacks with a mutex so libxev status refresh and
rust-bridge response delivery cannot corrupt callback entries (GPE in
onReqRespResponse during status bursts). Drop transport callbacks when
the node layer finalizes pending RPCs.
Use var for single-child passthrough test result so deinit receives
mutable access.
Snapshot RPC callback fields under one lock with an owned peer_id copy
instead of returning map pointers that could be freed before use. Take
aggregate signatures by value in commitOneAggregateResult and deinit on
duplicate suppress or pre-serialize failure paths.
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 26, 2026

Review — zclawz

Three fixes bundled together. Fix 1 is clean. Fix 2 has a correctness bug introduced by its own implementation. Fix 3 is mostly sound with one lifetime question.


✅ Fix 1 — computeSingleAggregatedSignature .done double-free (block.zig)

Correct and minimal. Setting prep.outcome = .skip before returning mirrors the existing .ffi path guard. The regression test covers the scenario well.


🔴 Fix 2 — commitOneAggregateResult SSZ corruption — new errdefer double-free introduced

The approach (serialize once → retain bytes → deserialize a separate copy for publish) is logically correct for the corruption problem. But the implementation has a bug:

errdefer stored_proof.deinit();                   // (A) armed
// ... source_payload_bits_owned / source_gossip_bits_owned errdefs also armed

try gop.value_ptr.append(self.allocator, .{
    .proof = stored_proof,                         // bits copied into map entry
    .source_payload_participants = source_payload_bits,
    .source_gossip_participants = source_gossip_bits,
});

// (A) is still armed here
var publish_proof: types.AggregatedSignatureProof = undefined;
try ssz.deserialize(…, proof_bytes, &publish_proof, self.allocator);  // can OOM

If the deserialize fails after append succeeds, stored_proof.deinit() fires — but the map entry's .proof is a bitwise copy of the same pointers. Double-free on the map entry; it holds dangling pointers until cleanup.

The old code had no try after append, so this pattern was structurally fragile but safe in practice. The new try ssz.deserialize makes the fragility exploitable.

Fix: gate stored_proof.deinit() behind a stored_proof_owned boolean (matching the source_payload_bits_owned / source_gossip_bits_owned pattern already in the function), and clear all three flags immediately before the append call.


⚠️ Fix 3 — RPC callback mutex (ethlibp2p.zig) — snapshot handler lifetime

The mutex design is sound overall. snapshotRpcCallbackForDelivery copies what it needs under the lock, releases, then calls the handler — correct. takeRpcCallback gives exclusive ownership — correct. The cancelInflightRequest / finalizePendingRequest integration is a good defensive layer.

One concern: snapshotRpcCallbackForDelivery captures handler = callback_ptr.handler (a struct copy: fn-ptr + ptr field), releases the lock, then calls handler.onReqRespResponse. Concurrently, cancelInflightRpcCallback can run takeRpcCallbackowned_callback.deinit(). If deinit() frees the object that handler.ptr points into, the in-flight onReqRespResponse call has a dangling ptr — UAF. Whether this is safe depends on the ownership of whatever handler.ptr references. If it outlives the callback struct (owned by the node or a pool), it's fine; if it's owned by the callback struct itself, it's a real UAF. This should be verified or documented explicitly.


Minor

  • notifyRpcErrorFmt log regression: removing the racy getPtr for peer info is correct, but the OOM branch now logs less. The getRpcCallbackMethod helper could retrieve peer info safely if desired.
  • No test for the double-serialize path: commitOneAggregateResult is the highest-risk new code and has no unit test. A test driving the function with a real proof would catch both the corruption fix and the errdefer bug above.
  • PR scope: three logically independent fixes; worth splitting post-devnet if possible for easier bisect.

Verdict: Fix 1 ✅, Fix 2 needs the stored_proof_owned guard before merging (real correctness bug under OOM), Fix 3 needs the handler lifetime question resolved or documented.

Gate stored_proof deinit behind stored_proof_owned so a publish-path
deserialize failure cannot double-free the map entry. Document that RPC
handler.ptr outlives callback deinit, restore peer info in error logs via
a locked peer_id dup, and add a commitOneAggregateResult regression test.
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 26, 2026

Follow-up after latest push

The two concrete fixes from my prior comment are mostly addressed:

  • ✅ The stored_proof_owned guard was added and cleared after append, so the specific post-append deserialize failure double-free I called out is fixed.
  • ✅ The handler lifetime question is now documented, and the code matches the comment: ReqRespRequestCallback.deinit() only frees peer_id, while handler.ptr comes from BeamNode.getReqRespResponseHandler() and points at the node. That resolves my UAF concern for the current handler source.
  • ✅ Added a regression test for stored vs publish proof independence. Good.

One remaining subtle failure-path issue in commitOneAggregateResult:

var owned_signature = signature;
var signature_live = true;
errdefer if (signature_live) owned_signature.deinit();

const proof_bytes = try types.sszCloneAndGetBytes(
    self.allocator,
    types.AggregatedSignatureProof,
    owned_signature,
    &stored_proof,
);
signature_live = false;

This still assumes sszCloneAndGetBytes either succeeds completely or leaves owned_signature safe to deinit. But sszCloneAndGetBytes does:

try ssz.serialize(T, data, &bytes, allocator);      // known to corrupt this proof value
try ssz.deserialize(T, bytes.items[0..], cloned, allocator); // can fail/OOM

If serialize succeeds/corrupts the source and the later deserialize or toOwnedSlice path fails, the caller never reaches signature_live = false; the errdefer then deinits the corrupted owned_signature. That's the same class of crash you're fixing, just on the error path before the append.

Suggested fix: don't use sszCloneAndGetBytes as an opaque helper for this consumed proof. Inline/split the steps here so signature_live = false is set immediately after the successful ssz.serialize call — i.e. at the precise point where the source must be considered consumed/corrupted — before any later fallible clone/byte ownership steps. Alternatively introduce a helper whose contract explicitly consumes the source once serialization succeeds.

This is an OOM/error-path issue, but because the whole PR is about memory safety around a serializer that mutates its input, I would still fix this before merging.

Split commitOneAggregateResult onto sszSerializeAndGetBytes so
signature_live clears immediately after serialize corrupts the worker
proof, before fallible deserialize or byte ownership. Document the
consumption contract on sszCloneAndGetBytes callers.
zclawz
zclawz previously approved these changes May 26, 2026
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three concerns from my initial review are resolved:

  • ✅ Fix 1 (.done double-free): correct from the start.
  • ✅ Fix 2 (SSZ corruption): sszSerializeAndGetBytes is now the right primitive — signature_live = false is set at the exact point after serialize succeeds/corrupts the source, before any fallible deserialize step. The helper is well-documented, has a roundtrip test, and sszCloneAndGetBytes is updated to delegate through it.
  • ✅ Fix 3 (RPC mutex): handler lifetime documented and verified — deinit() only frees peer_id; handler.ptr references the node.

Approved.

Use var for signed result so deinit receives mutable access.
commitOneAggregateResult set signature_live=false after serialize but never
freed the worker proof, so the #933 regression test leaked under
DebugAllocator and failed CI. Deinit the corrupted source after a successful
serialize; document and test that this is safe for AggregatedSignatureProof.
Comment thread pkgs/node/src/forkchoice.zig
stored_proof_owned was true while stored_proof was still undefined, so
errdefer could deinit garbage if serialize or deserialize failed.
@ch4r10t33r ch4r10t33r merged commit e399e01 into main May 26, 2026
2 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/aggregator-commit-segfault branch May 26, 2026 10:32
ch4r10t33r added a commit that referenced this pull request May 26, 2026
PR #936 renamed `initTestThreadPool` to `setupTestPrimitives` across
`pkgs/node/src/forkchoice.zig` but missed the call inside the
`commitOneAggregateResult: stored and publish proofs are independent SSZ
copies` test added by PR #933, which breaks `zig build all` on main.

Picked up here so CI on this branch can complete.
ch4r10t33r added a commit that referenced this pull request May 26, 2026
* node, network: recover from gossip ingress stall (#926)

When pre-finalization devnets report synced but wall-clock head lag grows,
treat the node as behind peers, batch status refresh RPCs, proactively
start blocks_by_range from cached peer heads during gossip silence, heal
gossipsub mesh subscriptions after disconnects, and avoid inline RPC block
imports on libxev when the chain-worker is enabled.

* node, network: address PR 938 review feedback

Fix wall-lag snapshot ordering before chain.onInterval, deduplicate RPC
missing-parent cache paths and gossip mesh subscribe logic, and consolidate
sync recovery tick orchestration into focused helpers.

* node: fix use-after-free in proactive-catch-up peer snapshot

`findBestCatchUpPeerStatus` stored `entry.key_ptr.*` from the connected-peers
hash map directly into `CatchUpPeerStatus.peer_id`, then released the shared
read lock before passing the slice down to `shouldCatchUpFromPeerStatus` /
`initiateCatchUpFromPeerStatus`. A concurrent `onPeerDisconnected` (rust-bridge
thread) can free the hash-map key in that window, leaving the slice dangling
on the libxev side. Match the existing `refreshSyncFromPeers` pattern by
duping the peer_id under the lock and freeing it in the caller.

Also lift `maybeInitiateProactiveCatchUp` out of the status-refresh-gated
block. Gating it behind `refresh_decision.refresh` only let it fire every
SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8 slots / 32s), defeating the point of
acting on cached peer status the moment gossip ingress goes silent. The
proactive path now runs at interval_in_slot == 0 on every slot, still
internally gated by wall-lag threshold and gossip silence; overlap is
filtered downstream in `initiateBlocksByRangeCatchUp`.

* forkchoice: rename leftover initTestThreadPool call site

PR #936 renamed `initTestThreadPool` to `setupTestPrimitives` across
`pkgs/node/src/forkchoice.zig` but missed the call inside the
`commitOneAggregateResult: stored and publish proofs are independent SSZ
copies` test added by PR #933, which breaks `zig build all` on main.

Picked up here so CI on this branch can complete.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants