metrics: aggregation interval tick duration (issue #863) #864

Merged: anshalshukla merged 1 commit into main from feat/prometheus-aggregation-interval-tick on May 12, 2026

@ch4r10t33r
Contributor

Adds a Prometheus histogram, zeam_node_aggregation_interval_tick_seconds, that measures the wall time BeamNode spends at per-slot interval 2 (maybeAggregateOnInterval plus publishProducedAggregations), including the null/skip and log-and-continue error paths.

Motivation: correlate aggregation work with slow slot_interval / event-loop starvation discussed in #863.

zig build passes locally.

Expose wall time for BeamNode interval 2 (maybeAggregateOnInterval +
publishProducedAggregations), including skip/error paths, for debugging
slow slot_interval / event-loop starvation (issue #863).

- Register histogram in pkgs/metrics with Prometheus buckets to 10s
- Time the interval-2 block in node.zig alongside log-and-continue errors
- Assert metric name appears in CLI integration metrics scrape
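
The timing pattern described above can be sketched in a few lines of Python. This is an illustration only, not the Zig code in node.zig: the `Histogram` class, the `timed` helper, the `interval_tick` function, and the bucket boundaries are all hypothetical (the PR text says only that buckets extend to 10 s). The point it shows is that the observation is recorded in a `finally` block, so the skip path and the log-and-continue error path are timed as well.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager

# Hypothetical bucket bounds in seconds; the real list is not given in the PR text.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


class Histogram:
    """Minimal Prometheus-style histogram (illustrative, not a client library)."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the +Inf bucket
        self.total = 0.0
        self.n = 0

    def observe(self, value):
        # bisect_left returns the first bound >= value, i.e. "le" semantics.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        # Prometheus exposes buckets as cumulative counts.
        out, running = [], 0
        for count in self.counts:
            running += count
            out.append(running)
        return out


@contextmanager
def timed(hist, clock=time.monotonic):
    start = clock()
    try:
        yield
    finally:
        # Runs on normal exit, early return, and exceptions alike.
        hist.observe(clock() - start)


def interval_tick(hist, aggregate, publish, clock=time.monotonic):
    """Hypothetical stand-in for the interval-2 block."""
    with timed(hist, clock):
        produced = aggregate()
        if produced is None:
            return  # skip path: still timed
        try:
            publish(produced)
        except Exception as err:  # log-and-continue path: still timed
            print(f"publish failed, continuing: {err}")


hist = Histogram("zeam_node_aggregation_interval_tick_seconds", BUCKETS)
interval_tick(hist, lambda: None, lambda produced: None)
```

Passing the clock as a parameter is a design choice made here so the sketch can be exercised with a fake clock; it is not something the PR describes.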
ch4r10t33r marked this pull request as ready for review May 12, 2026 14:40
anshalshukla merged commit 9312397 into main May 12, 2026
7 of 9 checks passed
anshalshukla deleted the feat/prometheus-aggregation-interval-tick branch May 12, 2026 14:45
zclawz added a commit that referenced this pull request May 12, 2026
…ndpoint

The Prometheus metrics output now exceeds 8192 bytes after PR #864 added
the zeam_node_aggregation_interval_tick_seconds histogram (15 buckets,
long HELP text). The new metric appears late in the serialized output and
was silently truncated, causing the integration test assertion at line 574
to fail.

Allocate the read buffer on the heap instead of the stack so the limit
is trivially raisable. 128 KB gives ~16x headroom for future metric additions
before this needs revisiting.
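
The failure mode and the fix can be illustrated with a short Python sketch. It is not the project's readFullResponse helper: `read_fixed` and `read_full_response` are hypothetical names, and raising an error when the limit is exceeded is an assumption (the commit message does not say what the real helper does on overflow). The sketch contrasts a single 8192-byte read, which silently drops a metric that appears late in the output, with a growable buffer capped at 128 KB.

```python
import io

METRIC = b"zeam_node_aggregation_interval_tick_seconds"


def read_fixed(stream, size=8192):
    # One bounded read: anything past `size` bytes is silently dropped.
    return stream.read(size)


def read_full_response(stream, limit=128 * 1024, chunk=4096):
    # Grow a heap buffer until EOF; fail loudly instead of truncating (assumption).
    buf = bytearray()
    while True:
        # Request at most one byte past the limit so overflow is detectable.
        part = stream.read(min(chunk, limit + 1 - len(buf)))
        if not part:
            return bytes(buf)
        buf += part
        if len(buf) > limit:
            raise ValueError(f"response exceeds {limit} bytes")


# A synthetic scrape body larger than 8 KB with the new metric at the end.
body = b"# HELP a\n" * 1000 + METRIC + b"_count 1\n"

print(METRIC in read_fixed(io.BytesIO(body)))          # metric lost to truncation
print(METRIC in read_full_response(io.BytesIO(body)))  # metric present
```

The ~16x headroom quoted in the commit message follows from 128 KB / 8 KB = 16.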
ch4r10t33r added a commit that referenced this pull request May 13, 2026
* feat: fork-choice test driver API for hive lean-spec-tests

Add POST /lean/v0/test_driver/fork_choice/init and
POST /lean/v0/test_driver/fork_choice/step endpoints.

The hive lean-spec-tests-fork-choice simulator drives fork-choice
fixture scenarios over HTTP. Without these endpoints all 83
fork-choice spec tests fail (zeam returns 404 for init, causing
every test to be marked failed by the simulator).

Changes:
- pkgs/cli/src/test_driver.zig (new): ForkChoiceDriverState holds
  an isolated fork choice, state map, and label map per test run.
  handleForkChoiceInit parses anchorState + anchorBlock from JSON,
  validates anchor (block.state_root must match hash_tree_root(state)),
  initialises ForkChoice. handleForkChoiceStep dispatches block /
  tick / attestation steps and returns a DriverStepResponse JSON
  snapshot with headSlot, headRoot, time, justifiedCheckpoint,
  finalizedCheckpoint, safeTarget.
- pkgs/cli/src/api_server.zig: add test_driver_mutex + test_driver_state
  fields to ApiServer, wire POST routes, add readLargeBody helper.
- pkgs/spectest/src/runner/fork_choice_runner.zig: handle new fixture
  check keys (justifiedCheckpoint, finalizedCheckpoint, safeTarget)
  and implement attestation step type.
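
The anchor check mentioned for handleForkChoiceInit can be sketched as follows. This is a hedged illustration, not the code in test_driver.zig: `AnchorBlock` and `validate_anchor` are hypothetical names, and the hash function is passed in by the caller because a real SSZ `hash_tree_root` is out of scope here. The SHA-256 used in the usage lines is a stand-in only and is not SSZ merkleization.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class AnchorBlock:
    slot: int
    state_root: bytes


def validate_anchor(
    block: AnchorBlock,
    state: bytes,
    hash_tree_root: Callable[[bytes], bytes],
) -> bytes:
    """Reject an anchor whose block does not commit to the supplied state."""
    expected = hash_tree_root(state)
    if block.state_root != expected:
        raise ValueError("anchor block state_root does not match hash_tree_root(state)")
    return expected


# Stand-in hash for demonstration only; NOT real SSZ hash_tree_root.
def fake_root(state: bytes) -> bytes:
    return hashlib.sha256(state).digest()


state = b"serialized-anchor-state"
good = AnchorBlock(slot=0, state_root=fake_root(state))
validate_anchor(good, state, fake_root)  # passes without raising
```

Validating before any fork-choice state is built means a mismatched fixture fails at init rather than producing confusing results in later steps.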

Closes #858 (to be filed).

* fix: align test driver with ream reference (ReamLabs/ream#1368)

- Fix attestation step format: was reading aggregationBits (aggregated
  format) but fixture attestation step uses validatorId (single validator)
  matching ForkChoiceStep::Attestation {validator_id, data, signature?}.
  This was causing ALL tests with attestation steps to return accepted:false.

- Add gossipAggregatedAttestation step type: parses proof.participants
  bitlist + data, calls storeAggregatedPayload + registers individual
  attestations. Required for test_gossip_aggregated_attestation_validation
  and test_signature_aggregation fixtures.

- Fix tick step: now handles both 'time' (unix timestamp) and 'interval'
  (direct interval count) fields as alternatives, matching
  ForkChoiceStep::Tick { time, interval } in leanSpec.

- Handle 'checks' step type: returns accepted:true as a no-op. The hive
  simulator reads checks assertions directly from the JSON step and
  validates them against the snapshot — no driver action needed.

- Handle unknown step types as no-op (accepted:true) instead of error,
  preventing future fixture additions from breaking existing tests.

- Add GET /lean/v0/test_driver/fork_choice/snapshot endpoint.

- Add POST /lean/v0/test_driver/state_transition/run endpoint:
  runs a state transition on pre-state + blocks and returns post summary
  (slot, latestBlockHeaderSlot, latestBlockHeaderStateRoot,
  historicalBlockHashesCount).

- Add POST /lean/v0/test_driver/verify_signatures/run endpoint:
  returns succeeded:true as stub pending XMSS test-driver verification.
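
The step-handling rules listed above can be summarized in a small dispatch sketch. It is an illustration under stated assumptions, not the driver's code: the `stepType` key, the `action` field, and the action labels are hypothetical, while `validatorId`, `time`, `interval`, `checks`, and `accepted` come from the commit message. It shows the three behaviors described: a tick accepts either field, an attestation step must use the single-validator format, and `checks` or unknown step types are accepted as no-ops.

```python
def handle_step(step: dict) -> dict:
    kind = step.get("stepType")  # hypothetical key name for the step type

    if kind == "tick":
        # 'time' (unix timestamp) and 'interval' (direct count) are alternatives.
        if "time" in step:
            return {"accepted": True, "action": ("tick_to_time", step["time"])}
        if "interval" in step:
            return {"accepted": True, "action": ("tick_intervals", step["interval"])}
        return {"accepted": False, "action": None}

    if kind == "attestation":
        # Single-validator format: validatorId, not aggregationBits.
        if "validatorId" not in step:
            return {"accepted": False, "action": None}
        return {"accepted": True, "action": ("attest", step["validatorId"])}

    # 'checks' and any unknown step type are no-ops: the simulator validates
    # checks against the snapshot itself, and new fixture step types should
    # not break existing tests.
    return {"accepted": True, "action": None}


print(handle_step({"stepType": "tick", "interval": 3}))
print(handle_step({"stepType": "somethingNew"}))
```

Accepting unknown step types trades strictness for forward compatibility: a typo in a fixture would pass silently, which is the cost of not failing on future additions.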

* fix: increase readFullResponse buffer from 8KB to 128KB for metrics endpoint

The Prometheus metrics output now exceeds 8192 bytes after PR #864 added
the zeam_node_aggregation_interval_tick_seconds histogram (15 buckets,
long HELP text). The new metric appears late in the serialized output and
was silently truncated, causing the integration test assertion at line 574
to fail.

Allocate the read buffer on the heap instead of the stack so the limit
is trivially raisable. 128 KB gives ~16x headroom for future metric additions
before this needs revisiting.

* fix: harden fork-choice test driver

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: Parthasarathy Ramanujam <1627026+ch4r10t33r@users.noreply.github.com>