node, types: fix aggregator worker segfault on aggregate commit #933
Prevent prep.deinit from freeing a moved single-child aggregate result, and avoid reusing proof memory after ssz.serialize corrupts it during commitOneAggregateResult clone-on-insert.
Guard ethlibp2p rpcCallbacks with a mutex so libxev status refresh and rust-bridge response delivery cannot corrupt callback entries (GPE in onReqRespResponse during status bursts). Drop transport callbacks when the node layer finalizes pending RPCs.
Use var for single-child passthrough test result so deinit receives mutable access.
Snapshot RPC callback fields under one lock with an owned peer_id copy instead of returning map pointers that could be freed before use. Take aggregate signatures by value in commitOneAggregateResult and deinit on duplicate suppress or pre-serialize failure paths.
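The snapshot-under-lock idea in the commit above can be sketched in Zig roughly as follows. This is a minimal illustration, not the ethlibp2p code: `CallbackTable`, `Entry`, `Snapshot`, and `request_kind` are hypothetical names, request ids are assumed to be unique, and the sketch assumes a Zig standard library that provides `std.Thread.Mutex` and the managed `std.AutoHashMap`.

```zig
const std = @import("std");

// Hypothetical map entry; the map owns `peer_id`.
const Entry = struct { peer_id: []u8, request_kind: u8 };

// Copy handed to the caller; the caller owns `peer_id`.
const Snapshot = struct {
    peer_id: []u8,
    request_kind: u8,

    fn deinit(self: *Snapshot, allocator: std.mem.Allocator) void {
        allocator.free(self.peer_id);
    }
};

const CallbackTable = struct {
    allocator: std.mem.Allocator,
    mutex: std.Thread.Mutex = .{},
    entries: std.AutoHashMap(u64, Entry),

    fn init(allocator: std.mem.Allocator) CallbackTable {
        return .{
            .allocator = allocator,
            .entries = std.AutoHashMap(u64, Entry).init(allocator),
        };
    }

    fn deinit(self: *CallbackTable) void {
        var it = self.entries.valueIterator();
        while (it.next()) |entry| self.allocator.free(entry.peer_id);
        self.entries.deinit();
    }

    // Assumes `id` is not already present (a duplicate would leak the old peer_id).
    fn insert(self: *CallbackTable, id: u64, peer_id: []const u8, kind: u8) !void {
        const owned = try self.allocator.dupe(u8, peer_id);
        errdefer self.allocator.free(owned);
        self.mutex.lock();
        defer self.mutex.unlock();
        try self.entries.put(id, .{ .peer_id = owned, .request_kind = kind });
    }

    // Copies every field the caller needs while the lock is held, so no
    // pointer into the map escapes the critical section.
    fn snapshot(self: *CallbackTable, id: u64) !?Snapshot {
        self.mutex.lock();
        defer self.mutex.unlock();
        const entry = self.entries.getPtr(id) orelse return null;
        return Snapshot{
            .peer_id = try self.allocator.dupe(u8, entry.peer_id),
            .request_kind = entry.request_kind,
        };
    }

    fn remove(self: *CallbackTable, id: u64) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        if (self.entries.fetchRemove(id)) |kv| self.allocator.free(kv.value.peer_id);
    }
};
```

Because the snapshot owns its `peer_id`, another thread can remove the entry right after `snapshot` returns without invalidating what the caller holds; the cost is one small allocation per lookup.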
Review: zclawz
Three fixes bundled together. Fix 1 is clean. Fix 2 has a correctness bug introduced by its own implementation. Fix 3 is mostly sound with one lifetime question.
✅ Fix 1
Gate stored_proof deinit behind stored_proof_owned so a publish-path deserialize failure cannot double-free the map entry. Document that RPC handler.ptr outlives callback deinit, restore peer info in error logs via a locked peer_id dup, and add a commitOneAggregateResult regression test.
Follow-up after latest push
The two concrete fixes from my prior comment are mostly addressed:
One remaining subtle failure-path issue:

    var owned_signature = signature;
    var signature_live = true;
    errdefer if (signature_live) owned_signature.deinit();
    const proof_bytes = try types.sszCloneAndGetBytes(
        self.allocator,
        types.AggregatedSignatureProof,
        owned_signature,
        &stored_proof,
    );
    signature_live = false;

This still assumes

    try ssz.serialize(T, data, &bytes, allocator); // known to corrupt this proof value
    try ssz.deserialize(T, bytes.items[0..], cloned, allocator); // can fail/OOM

If
Suggested fix: don't use
This is an OOM/error-path issue, but because the whole PR is about memory safety around a serializer that mutates its input, I would still fix this before merging.
Split commitOneAggregateResult onto sszSerializeAndGetBytes so signature_live clears immediately after serialize corrupts the worker proof, before fallible deserialize or byte ownership. Document the consumption contract on sszCloneAndGetBytes callers.
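The ordering this commit describes (clear the liveness flag right after the consuming step, before anything else can fail) can be sketched as below. It is a simplified model under stated assumptions, not the project's code: `Proof`, `consumeAndSerialize`, `store`, and the `fail_late` switch are hypothetical, and "consuming" is modeled by freeing the source, whereas the real `ssz.serialize` is only described here as corrupting it.

```zig
const std = @import("std");

// Hypothetical stand-in for a proof that owns heap memory.
const Proof = struct {
    bytes: []u8,

    fn deinit(self: *Proof, allocator: std.mem.Allocator) void {
        allocator.free(self.bytes);
        self.* = undefined;
    }
};

// Hypothetical serializer that leaves its input unusable on success.
// On failure (the allocation), the input is untouched.
fn consumeAndSerialize(allocator: std.mem.Allocator, proof: *Proof) ![]u8 {
    const out = try allocator.dupe(u8, proof.bytes);
    proof.deinit(allocator); // models "source is consumed by serialize"
    return out;
}

// Takes the proof by value and owns it until the consuming step succeeds.
fn store(allocator: std.mem.Allocator, proof: Proof, fail_late: bool) ![]u8 {
    var owned = proof;
    var live = true;
    errdefer if (live) owned.deinit(allocator);

    const bytes = try consumeAndSerialize(allocator, &owned);
    live = false; // from here on, error paths must not touch `owned`
    errdefer allocator.free(bytes);

    // Stands in for a later fallible step such as deserializing a copy.
    if (fail_late) return error.DeserializeFailed;
    return bytes;
}
```

If `live` were cleared only after the later fallible step, the `errdefer` would run `deinit` on a value the serializer had already consumed, which is the failure path the review comment points at.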
zclawz left a comment
All three concerns from my initial review are resolved:
- ✅ Fix 1 (`.done` double-free): correct from the start.
- ✅ Fix 2 (SSZ corruption): `sszSerializeAndGetBytes` is now the right primitive. `signature_live = false` is set at the exact point after serialize succeeds/corrupts the source, before any fallible deserialize step. The helper is well-documented, has a roundtrip test, and `sszCloneAndGetBytes` is updated to delegate through it.
- ✅ Fix 3 (RPC mutex): handler lifetime documented and verified. `deinit()` only frees `peer_id`; `handler.ptr` references the node.
Approved.
Use var for signed result so deinit receives mutable access.
commitOneAggregateResult set signature_live=false after serialize but never freed the worker proof, so the #933 regression test leaked under DebugAllocator and failed CI. Deinit the corrupted source after a successful serialize; document and test that this is safe for AggregatedSignatureProof.
stored_proof_owned was true while stored_proof was still undefined, so errdefer could deinit garbage if serialize or deserialize failed.
PR #936 renamed `initTestThreadPool` to `setupTestPrimitives` across `pkgs/node/src/forkchoice.zig` but missed the call inside the `commitOneAggregateResult: stored and publish proofs are independent SSZ copies` test added by PR #933, which breaks `zig build all` on main. Picked up here so CI on this branch can complete.
* node, network: recover from gossip ingress stall (#926)

  When pre-finalization devnets report synced but wall-clock head lag grows, treat the node as behind peers, batch status refresh RPCs, proactively start blocks_by_range from cached peer heads during gossip silence, heal gossipsub mesh subscriptions after disconnects, and avoid inline RPC block imports on libxev when the chain-worker is enabled.

* node, network: address PR 938 review feedback

  Fix wall-lag snapshot ordering before chain.onInterval, deduplicate RPC missing-parent cache paths and gossip mesh subscribe logic, and consolidate sync recovery tick orchestration into focused helpers.

* node: fix use-after-free in proactive-catch-up peer snapshot

  `findBestCatchUpPeerStatus` stored `entry.key_ptr.*` from the connected-peers hash map directly into `CatchUpPeerStatus.peer_id`, then released the shared read lock before passing the slice down to `shouldCatchUpFromPeerStatus` / `initiateCatchUpFromPeerStatus`. A concurrent `onPeerDisconnected` (rust-bridge thread) can free the hash-map key in that window, leaving the slice dangling on the libxev side. Match the existing `refreshSyncFromPeers` pattern by duping the peer_id under the lock and freeing it in the caller.

  Also lift `maybeInitiateProactiveCatchUp` out of the status-refresh-gated block. Gating it behind `refresh_decision.refresh` only let it fire every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8 slots / 32s), defeating the point of acting on cached peer status the moment gossip ingress goes silent. The proactive path now runs at interval_in_slot == 0 on every slot, still internally gated by wall-lag threshold and gossip silence; overlap is filtered downstream in `initiateBlocksByRangeCatchUp`.

* forkchoice: rename leftover initTestThreadPool call site

  PR #936 renamed `initTestThreadPool` to `setupTestPrimitives` across `pkgs/node/src/forkchoice.zig` but missed the call inside the `commitOneAggregateResult: stored and publish proofs are independent SSZ copies` test added by PR #933, which breaks `zig build all` on main. Picked up here so CI on this branch can complete.
Summary
- `computeSingleAggregatedSignature` when the single-child passthrough (`.done`) path returns: set `prep.outcome = .skip` before returning, matching the existing `.ffi` path guard.
- `commitOneAggregateResult`: `ssz.serialize` corrupts the source proof in memory (documented in `sszCloneAndGetBytes`). Use one serialize pass to store the proof and deserialize a separate copy for the publish path, instead of returning the corrupted original.

These match the SIGSEGV stack seen on devnet-4 `zeam_8` (exit 139) in the aggregate worker at `sszClone`/`commitOneAggregateResult`.

Test plan
- `computeSingleAggregatedSignature: single-child passthrough survives prep deinit`
- `zig build test --summary all`
- `zeam_8` no longer crash-loops after publishing aggregations
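
The `.done` passthrough guard from the summary (mark the prep as `.skip` before handing the result out, so a later deinit does not free it) can be illustrated with a small sketch. The types here are assumptions made for the example: `Outcome`, `Prep`, and `takeResult` are hypothetical and much simpler than the real aggregation prep state.

```zig
const std = @import("std");

// Hypothetical outcome: `.done` owns a buffer, `.skip` owns nothing.
const Outcome = union(enum) {
    skip,
    done: []u8,
};

const Prep = struct {
    outcome: Outcome,

    // Frees whatever the prep still owns.
    fn deinit(self: *Prep, allocator: std.mem.Allocator) void {
        switch (self.outcome) {
            .done => |buf| allocator.free(buf),
            .skip => {},
        }
    }
};

// Moves the result out of `prep`. Resetting the outcome to `.skip` before
// returning is what keeps `Prep.deinit` from freeing the moved buffer.
fn takeResult(prep: *Prep) ?[]u8 {
    switch (prep.outcome) {
        .done => |buf| {
            prep.outcome = .skip;
            return buf;
        },
        .skip => return null,
    }
}
```

A caller would typically pair `defer prep.deinit(allocator)` with `takeResult`; after the move, the returned buffer is the caller's to free and the deferred deinit becomes a no-op for it.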