metrics: aggregation interval tick duration (issue #863) #864
Merged
Expose wall time for BeamNode interval 2 (maybeAggregateOnInterval + publishProducedAggregations), including skip/error paths, for debugging slow slot_interval / event-loop starvation (issue #863).

- Register histogram in pkgs/metrics with Prometheus buckets to 10s
- Time the interval-2 block in node.zig alongside log-and-continue errors
- Assert metric name appears in CLI integration metrics scrape
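The timing pattern described above can be sketched as follows. This is a minimal Python illustration, not the Zig code from node.zig: the `Histogram` class, the bucket boundaries, and `run_interval_2` are hypothetical stand-ins. The only details taken from the PR are that the buckets extend to 10s and that skip and error paths are timed as well.

```python
import bisect
import time

# Hypothetical bucket upper bounds in seconds; the PR only states that the
# buckets extend to 10s, not their exact values.
BUCKETS = [0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]


class Histogram:
    """Tiny stand-in for a Prometheus histogram (non-cumulative storage)."""

    def __init__(self, buckets):
        self.buckets = list(buckets)
        self.counts = [0] * (len(buckets) + 1)  # last slot is +Inf
        self.total = 0.0
        self.samples = 0

    def observe(self, seconds):
        # bisect_left finds the first bucket whose bound is >= seconds ("le").
        self.counts[bisect.bisect_left(self.buckets, seconds)] += 1
        self.total += seconds
        self.samples += 1


def run_interval_2(hist, aggregate, publish, log=print):
    """Time the whole interval-2 block, including skip and error paths."""
    start = time.perf_counter()
    try:
        produced = aggregate()      # may return None (nothing to aggregate)
        if produced is not None:
            publish(produced)
    except Exception as err:        # log-and-continue, as described in the PR
        log(f"interval 2 failed: {err}")
    finally:
        # Runs on success, skip, and error alike, so every tick is observed.
        hist.observe(time.perf_counter() - start)
```

Putting the observation in `finally` is one way to guarantee that an early return or a logged error still produces a sample; without it, the slowest ticks (those that fail) could go unrecorded.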
anshalshukla approved these changes May 12, 2026
zclawz added a commit that referenced this pull request May 12, 2026
…ndpoint

The Prometheus metrics output now exceeds 8192 bytes after PR #864 added the zeam_node_aggregation_interval_tick_seconds histogram (15 buckets, long HELP text). The new metric appears late in the serialized output and was silently truncated, causing the integration test assertion at line 574 to fail.

Allocate the read buffer on the heap instead of the stack so the limit is trivially raisable. 128 KB gives ~16x headroom for future metric additions before this needs revisiting.
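The failure mode described in this commit message can be reproduced in miniature. The Python sketch below is illustrative only: `read_full_response` here is a hypothetical stand-in for the Zig helper of the same name, and the fake scrape body is invented. It shows how a fixed 8192-byte cap silently drops a metric that is serialized late, while a 128 KB cap does not.

```python
import io

METRIC = "zeam_node_aggregation_interval_tick_seconds"


def read_full_response(stream, max_bytes):
    """Read until EOF or until max_bytes have been collected, whichever is first."""
    buf = bytearray()
    while len(buf) < max_bytes:
        chunk = stream.read(min(4096, max_bytes - len(buf)))
        if not chunk:               # EOF
            break
        buf.extend(chunk)
    return bytes(buf)


# Invented scrape body: about 9 KB of earlier metrics, then the new histogram last.
body = b"# filler metric line\n" * 430 + f"{METRIC}_count 1\n".encode()

small = read_full_response(io.BytesIO(body), 8192)
large = read_full_response(io.BytesIO(body), 128 * 1024)

print(METRIC.encode() in small)  # the late metric is cut off by the 8192-byte cap
print(METRIC.encode() in large)  # the whole body fits under 128 KB
```

Because the truncated read still returns a well-formed prefix, nothing fails until an assertion looks for content near the end, which matches the silent truncation described above.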
ch4r10t33r added a commit that referenced this pull request May 13, 2026
* feat: fork-choice test driver API for hive lean-spec-tests

Add POST /lean/v0/test_driver/fork_choice/init and POST /lean/v0/test_driver/fork_choice/step endpoints.

The hive lean-spec-tests-fork-choice simulator drives fork-choice fixture scenarios over HTTP. Without these endpoints all 83 fork-choice spec tests fail (zeam returns 404 for init, causing every test to be marked failed by the simulator).

Changes:
- pkgs/cli/src/test_driver.zig (new): ForkChoiceDriverState holds an isolated fork choice, state map, and label map per test run. handleForkChoiceInit parses anchorState + anchorBlock from JSON, validates anchor (block.state_root must match hash_tree_root(state)), initialises ForkChoice. handleForkChoiceStep dispatches block / tick / attestation steps and returns a DriverStepResponse JSON snapshot with headSlot, headRoot, time, justifiedCheckpoint, finalizedCheckpoint, safeTarget.
- pkgs/cli/src/api_server.zig: add test_driver_mutex + test_driver_state fields to ApiServer, wire POST routes, add readLargeBody helper.
- pkgs/spectest/src/runner/fork_choice_runner.zig: handle new fixture check keys (justifiedCheckpoint, finalizedCheckpoint, safeTarget) and implement attestation step type.

Closes #858 (to be filed).

* fix: align test driver with ream reference (ReamLabs/ream#1368)

- Fix attestation step format: was reading aggregationBits (aggregated format) but fixture attestation step uses validatorId (single validator) matching ForkChoiceStep::Attestation {validator_id, data, signature?}. This was causing ALL tests with attestation steps to return accepted:false.
- Add gossipAggregatedAttestation step type: parses proof.participants bitlist + data, calls storeAggregatedPayload + registers individual attestations. Required for test_gossip_aggregated_attestation_validation and test_signature_aggregation fixtures.
- Fix tick step: now handles both 'time' (unix timestamp) and 'interval' (direct interval count) fields as alternatives, matching ForkChoiceStep::Tick { time, interval } in leanSpec.
- Handle 'checks' step type: returns accepted:true as a no-op. The hive simulator reads checks assertions directly from the JSON step and validates them against the snapshot; no driver action needed.
- Handle unknown step types as no-op (accepted:true) instead of error, preventing future fixture additions from breaking existing tests.
- Add GET /lean/v0/test_driver/fork_choice/snapshot endpoint.
- Add POST /lean/v0/test_driver/state_transition/run endpoint: runs a state transition on pre-state + blocks and returns post summary (slot, latestBlockHeaderSlot, latestBlockHeaderStateRoot, historicalBlockHashesCount).
- Add POST /lean/v0/test_driver/verify_signatures/run endpoint: returns succeeded:true as stub pending XMSS test-driver verification.

* fix: increase readFullResponse buffer from 8KB to 128KB for metrics endpoint

The Prometheus metrics output now exceeds 8192 bytes after PR #864 added the zeam_node_aggregation_interval_tick_seconds histogram (15 buckets, long HELP text). The new metric appears late in the serialized output and was silently truncated, causing the integration test assertion at line 574 to fail.

Allocate the read buffer on the heap instead of the stack so the limit is trivially raisable. 128 KB gives ~16x headroom for future metric additions before this needs revisiting.

* fix: harden fork-choice test driver

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: Parthasarathy Ramanujam <1627026+ch4r10t33r@users.noreply.github.com>
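The step-handling rules described in the commit message above can be summarized in a short sketch. This is a hypothetical Python model, not the code in pkgs/cli/src/test_driver.zig: the `handle_step` function, the `stepType` key, the `Store` class, the `SECONDS_PER_INTERVAL` constant, and the rejection of a tick with neither field are assumptions made for illustration. Only the field names `time`, `interval`, and `accepted`, and the no-op behavior for `checks` and unknown step types, come from the commit message.

```python
from dataclasses import dataclass

# Assumed for illustration; the real interval length is not given in the commit message.
SECONDS_PER_INTERVAL = 1


@dataclass
class Store:
    """Hypothetical stand-in for the driver's fork-choice state."""
    genesis_time: int = 0
    interval: int = 0


def handle_step(store, step):
    """Dispatch one fixture step and report whether it was accepted."""
    kind = step.get("stepType")     # key name is an assumption
    if kind == "tick":
        # 'time' (unix timestamp) and 'interval' (direct count) are alternatives.
        if "interval" in step:
            store.interval = step["interval"]
        elif "time" in step:
            elapsed = step["time"] - store.genesis_time
            store.interval = elapsed // SECONDS_PER_INTERVAL
        else:
            # Assumed behavior: a tick with neither field is rejected.
            return {"accepted": False}
        return {"accepted": True}
    if kind == "checks":
        # No driver action: the simulator validates checks against the snapshot.
        return {"accepted": True}
    if kind in ("block", "attestation", "gossipAggregatedAttestation"):
        # Real handling omitted; these would update the fork choice.
        return {"accepted": True}
    # Unknown step types are a no-op so new fixture kinds do not break old tests.
    return {"accepted": True}
```

Treating unknown step types as accepted trades strictness for forward compatibility: a misspelled step type would pass silently, which is the cost of not failing when fixtures gain new step kinds.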
Adds Prometheus histogram `zeam_node_aggregation_interval_tick_seconds` measuring wall time for BeamNode at per-slot interval 2 (`maybeAggregateOnInterval` plus `publishProducedAggregations`), including null/skip and log-and-continue error paths.

Motivation: correlate aggregation work with slow `slot_interval` / event-loop starvation discussed in #863.

- `pkgs/metrics`: register histogram + observe callback
- `pkgs/node`: timer around interval-2 block (merged with "stf: false-positive `error.DuplicateAttestationData` reject of block accepted by every other client family" #837 continue-on-error behavior)
- `pkgs/cli/test/integration.zig`: assert metric appears on scrape

`zig build` passes locally.
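The integration check on the metrics scrape described above amounts to scanning the scrape text for the metric name. A rough Python equivalent follows; the sample scrape text (including its HELP string and values) and the `has_metric` helper are invented for illustration, and the actual test is written in Zig. A Prometheus histogram is exposed as `_bucket{le="..."}`, `_sum`, and `_count` series, so the helper matches on the family name as a prefix.

```python
METRIC = "zeam_node_aggregation_interval_tick_seconds"

# Invented sample in Prometheus text exposition format (values are made up).
SCRAPE = """\
# HELP zeam_node_aggregation_interval_tick_seconds Wall time of the interval-2 block
# TYPE zeam_node_aggregation_interval_tick_seconds histogram
zeam_node_aggregation_interval_tick_seconds_bucket{le="10"} 3
zeam_node_aggregation_interval_tick_seconds_bucket{le="+Inf"} 3
zeam_node_aggregation_interval_tick_seconds_sum 0.42
zeam_node_aggregation_interval_tick_seconds_count 3
"""


def has_metric(scrape, name):
    """True if any sample line (not a # comment) belongs to the named metric family."""
    for line in scrape.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # The sample name ends at the first '{' (labels) or space (value).
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name == name or sample_name.startswith(name + "_"):
            return True
    return False
```

Matching sample lines rather than the raw text means a scrape that contains only the `# HELP` line would not count as exposing the metric. A plain substring check is simpler but would accept that case.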