Skip to content

Limit aggregate signature builds to active slot#900

Merged
ch4r10t33r merged 5 commits into
mainfrom
zclawz/issue-899-aggregate-build-timing
May 19, 2026
Merged

Limit aggregate signature builds to active slot#900
ch4r10t33r merged 5 commits into
mainfrom
zclawz/issue-899-aggregate-build-timing

Conversation

@zclawz
Copy link
Copy Markdown
Contributor

@zclawz zclawz commented May 19, 2026

Summary

Fixes #899 by constraining aggregate-signature production to the slot currently being aggregated.

The aggregate worker is scheduled for one slot at a time, but computeAggregatedSignatures previously unioned every AttestationData key present in the snapshot maps. On devnet those maps can retain stale/future attestation data from gossip/new payload caches, so a single slot aggregation could recursively build proofs for unrelated slots before returning the current slot's aggregates. That inflated lean_pq_sig_aggregated_signatures_building_time_seconds into the multi-second range reported in #899.

Changes

  • Add an optional slot_filter to AggregatedAttestationsResult.computeAggregatedSignatures.
  • Add ForkChoice.aggregateForSlot(...) and use it from submitAggregateOnInterval.
  • Keep existing ForkChoice.aggregate(...) behavior unchanged for generic/test callers.
  • Add phase attribution metric:
    • zeam_pq_sig_aggregated_signatures_building_phase_seconds{phase="snapshot|compute_ffi|commit"}

Validation

  • zig fmt pkgs/metrics/src/lib.zig pkgs/types/src/block.zig pkgs/node/src/forkchoice.zig pkgs/node/src/chain.zig
  • git diff --check
  • zig build test passed locally (EXIT:0 in /tmp/zeam-899-zig-build-test.log).

@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 19, 2026

Addressed the blocking review items in 9062b44:

  • widened production aggregation from strict current_slot to a bounded backfill window {current_slot - 1, current_slot} so a skipped ConcurrencyUnavailable interval can recover late/skipped-slot attestations on the next successful worker pass;
  • changed the internal filter shape to ?[]const Slot and added aggregateForSlots(...), leaving strict aggregateForSlot(...) only as a convenience/test wrapper;
  • documented the production-vs-backfill tradeoff and clarified that unfiltered aggregate(...) is for tests/explicit defensive backfills, not the slot worker path;
  • added filter-path regression tests covering: mixed-slot skip behavior, same-slot filtered vs unfiltered equivalence, and clean empty-filter output;
  • added a metric precision/phase-vs-total comment for the phase histogram helper.

Validation:

  • zig fmt pkgs/types/src/block.zig pkgs/node/src/forkchoice.zig pkgs/node/src/chain.zig pkgs/metrics/src/lib.zig
  • git diff --check
  • zig build test completed with the new block.test.computeAggregatedSignatures ... cases passing, plus the full logged test sequence through node tests passing.

ch4r10t33r
ch4r10t33r previously approved these changes May 19, 2026
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 19, 2026

Investigated the long-running macOS sim test job.

Root cause: the SSE events integration test has an outer 480s deadline, but SSEClient.readEvent() performed a blocking stream read. If the simulator stopped emitting SSE bytes before node3 finalization was observed (the macOS CI case), the test never returned to the loop to check the deadline, so the Run all sim tests step could hang indefinitely.

Pushed 5fbe09f to bound SSE reads with poll(50ms) before reading. That makes idle SSE periods return null so the existing deadline can actually fire instead of blocking inside the read.

Validation:

  • zig fmt pkgs/cli/test/integration.zig
  • git diff --check
  • timeout 600 zig build simtest --summary all passed locally (EXIT:0, simtest success in ~1m)

@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 19, 2026

Follow-up on the still-running macOS sim job: I did not have macOS locally; my prior validation was Linux. The macOS CI run on 5fbe09f is still stuck, so the first poll-based patch was insufficient there.

I pushed d749471 to use std.Io's timed socket receive path directly for the SSE client instead of poll + Stream.Reader. Reason: the macOS job still appears to block inside the sim SSE path; using receiveTimeout(50ms) avoids the stream reader blocking after readiness and lets the existing 480s loop deadline advance.

Validation on Linux:

  • zig fmt pkgs/cli/test/integration.zig
  • git diff --check
  • timeout 600 zig build simtest --summary all passed (EXIT:0, simtest success in ~1m)

New CI run for d749471: https://github.com/blockblaz/zeam/actions/runs/26124187925

Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

node, aggregation: zeam aggregate-build is 5–7x slower than ethlambda on devnet (2–3s vs 0.4s)

2 participants