Skip to content

feat(metrics): add Devnet-4 metrics (leanMetrics#29)#753

Merged
anshalshukla merged 13 commits into
mainfrom
feat/devnet4-metrics
Apr 21, 2026
Merged

feat(metrics): add Devnet-4 metrics (leanMetrics#29)#753
anshalshukla merged 13 commits into
mainfrom
feat/devnet4-metrics

Conversation

@zclawz
Copy link
Copy Markdown
Contributor

@zclawz zclawz commented Apr 17, 2026

Implements the metrics defined in leanEthereum/leanMetrics#29 for Devnet-4 monitoring.

Block production metrics (pkgs/metrics/src/lib.zig → instrumented in chain.zig)

Metric Type Buckets
lean_block_building_time_seconds Histogram 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1
lean_block_building_payload_aggregation_time_seconds Histogram 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4
lean_block_aggregated_payloads Histogram 1, 2, 4, 8, 16, 32, 64, 128
lean_block_building_success_total Counter
lean_block_building_failures_total Counter
  • produceBlock() is wrapped with a total-time timer; errdefer increments the failure counter and records the timer on any error path.
  • lean_block_building_payload_aggregation_time_seconds wraps the getProposalAttestations call specifically.
  • lean_block_aggregated_payloads records attestation_signatures.len() on success.

Sync status (chain.zig)

Metric Type Values
lean_node_sync_status Gauge 0=idle, 1=syncing, 2=synced
  • Updated every onInterval tick via getSyncStatus().
  • no_peers and fc_initing map to idle (0) since the node has not yet established sync state.

Gossip message size metrics (ethlibp2p.zig)

Metric Type Buckets
lean_gossip_block_size_bytes Histogram 10K, 50K, 100K, 250K, 500K, 1M, 2M, 5M
lean_gossip_attestation_size_bytes Histogram 512, 1K, 2K, 4K, 8K, 16K
lean_gossip_aggregation_size_bytes Histogram 1K, 4K, 16K, 64K, 128K, 256K, 512K, 1M
  • Observed on the uncompressed message after snappy decode, dispatched by topic kind.

Modified existing metric

  • lean_committee_signatures_aggregation_time_seconds: buckets updated from [0.005..1] to [0.05..4] to capture longer aggregation times in Devnet-4 (matches leanMetrics#29).

Infrastructure changes

  • Histogram.record(value f32) method added for direct observation without starting a timer.
  • @zeam/metrics wired into @zeam/network module in build.zig.

Testing

  • zig build passes ✅
  • zig fmt clean ✅

zclawz and others added 2 commits April 17, 2026 10:14
Implements the metrics defined in leanEthereum/leanMetrics#29:

## Block production metrics (chain.zig)
- lean_block_building_time_seconds (Histogram): total produceBlock() wall time
- lean_block_building_payload_aggregation_time_seconds (Histogram): time
  to aggregate attestation payloads during block building
- lean_block_aggregated_payloads (Histogram): number of aggregated
  attestation signatures included in produced block
- lean_block_building_success_total (Counter): incremented on each
  successful block production
- lean_block_building_failures_total (Counter): incremented via errdefer
  on any block production error

## Sync status metric (chain.zig)
- lean_node_sync_status (Gauge): updated every onInterval tick;
  0=idle (no peers / fc_initing), 1=syncing (behind peers), 2=synced

## Gossip message size metrics (ethlibp2p.zig)
- lean_gossip_block_size_bytes (Histogram): uncompressed block gossip size
- lean_gossip_attestation_size_bytes (Histogram): uncompressed attestation size
- lean_gossip_aggregation_size_bytes (Histogram): uncompressed aggregation size
  Observed after snappy decode, per topic kind, matching spec bucket sizes.

## Updated existing metric
- lean_committee_signatures_aggregation_time_seconds: buckets widened
  from [0.005..1] to [0.05..4] to capture longer Devnet-4 aggregation times

## Infrastructure
- Added Histogram.record(value) method for direct observation without a timer
- Wired @zeam/metrics into @zeam/network module in build.zig

Ref: leanEthereum/leanMetrics#29
@KatyaRyazantseva KatyaRyazantseva self-assigned this Apr 17, 2026
@KatyaRyazantseva KatyaRyazantseva marked this pull request as ready for review April 19, 2026 14:34
Comment thread pkgs/metrics/README.md
Comment on lines -60 to -77
## Metrics Definitions

All metrics are defined in the `Metrics` struct in `pkgs/metrics/src/lib.zig`. The following metrics are available:

### Chain Metrics

#### `chain_onblock_duration_seconds` (Histogram)
- **Description**: Measures the time taken to process a block within the `chain.onBlock` function (end-to-end block processing).
- **Type**: Histogram
- **Unit**: Seconds
- **Buckets**: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
- **Labels**: None
- **Sample Collection Event**: On every block processed by the chain

#### `block_processing_duration_seconds` (Histogram)
- **Description**: Measures the time taken to process a block in the state transition function.
- **Type**: Histogram
- **Unit**: Seconds
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this section?

Copy link
Copy Markdown
Contributor

@KatyaRyazantseva KatyaRyazantseva Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed it on one of the calls. This is unnecessary duplication of the metrics specs repo. The number of metrics can grow up to hundreds. There's no need to add all the metrics into the documentation. I added the reference to the metrics repo at the beginning of this doc.

Comment thread pkgs/metrics/README.md
Comment on lines -283 to -287
if (isZKVM()) {
std.log.info("Using no-op metrics for ZKVM target", .{});
g_initialized = true;
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also relevant

Comment thread pkgs/node/src/chain.zig
defer participant_indices.deinit(self.allocator);

if (validator_indices.items.len != participant_indices.items.len) {
zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we remove it? I think the source is important to make sense of the metrics

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, a single on_attestation(is_from_block=True/False) method handled both gossip and block attestations, and this metric counted validation results from both sources. In Lean specs PR Committee aggregation #282, attestation processing was split into separate methods: block attestations are now accepted as part of block validation without individual checks, so per-attestation validation only happens in on_gossip_attestation. The source label could only be "gossip", so it was removed from the metrics specs as redundant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KatyaRyazantseva — that context from leanSpec PR #282 is really helpful.

Note that @anshalshukla had explicitly requested restoring the source label in a previous review (CHANGES_REQUESTED), which is why commit f421d05 brought it back.

Given the conflict: should we remove the source label from lean_attestations_valid_total and lean_attestations_invalid_total and simplify them back to plain Counter (no CounterVec)? Waiting for team confirmation before making changes.

Comment thread pkgs/node/src/chain.zig

// Validate aggregated attestation data once before processing individual validators
self.validateAttestationData(aggregated_attestation.data, true) catch |e| {
zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source should be retained imo

Comment thread pkgs/node/src/chain.zig
};

self.forkChoice.onAttestation(attestation, true) catch |e| {
zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well

Comment thread pkgs/node/src/forkchoice.zig Outdated
Comment on lines +1533 to +1534
const timer = zeam_metrics.lean_committee_signatures_aggregation_time_seconds.start();
defer _ = timer.observe();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have it after lock has been acquired or in aggregateUnlocked itself

Comment thread pkgs/metrics/src/lib.zig

const Metrics = struct {
chain_onblock_duration_seconds: ChainHistogram,
block_processing_duration_seconds: BlockProcessingHistogram,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we remove this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • chain_onblock_duration_seconds was prefixed by zeam_, renamed to zeam_ chain_onblock_duration_seconds as it's not from the lean metrics specs, it's an inner zeam metric.
  • block_processing_duration_seconds duplicates the metric from the spec lean_state_transition_block_processing_time_seconds. So, it was removed.

Comment thread pkgs/metrics/src/lib.zig
Comment on lines -100 to -101
const ForkChoiceAttestationsValidLabeledCounter = metrics_lib.CounterVec(u64, struct { source: []const u8 });
const ForkChoiceAttestationsInvalidLabeledCounter = metrics_lib.CounterVec(u64, struct { source: []const u8 });
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep them labeled

Comment thread pkgs/metrics/src/lib.zig Outdated
Comment on lines +112 to +113
const ForkChoiceAttestationsValidLabeledCounter = metrics_lib.Counter(u64);
const ForkChoiceAttestationsInvalidLabeledCounter = metrics_lib.Counter(u64);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's no longer labelled

Comment thread pkgs/node/src/chain.zig
const building_timer = zeam_metrics.lean_pq_sig_aggregated_signatures_building_time_seconds.start();
const payload_agg_timer = zeam_metrics.lean_block_building_payload_aggregation_time_seconds.start();
const pq_building_timer = zeam_metrics.lean_pq_sig_aggregated_signatures_building_time_seconds.start();
const proposal_atts = try self.forkChoice.getProposalAttestations(pre_state, opts.slot, opts.proposer_index, parent_root);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aggregation of multiple payloads for a particular attestation_data happens here so we should add another metrics around compactAttestations

- Restore source labels ('gossip'/'aggregation'/'block') for
  lean_attestations_valid_total and lean_attestations_invalid_total
  counters in chain.zig and lib.zig (CounterVec restored)
- Move lean_committee_signatures_aggregation_time_seconds timer
  from aggregate() into aggregateUnlocked() so it measures only
  the time after the mutex is acquired
- Add metrics around compactAttestations in forkchoice.zig:
  lean_compact_attestations_time_seconds, _input_total, _output_total
- Restore the Metrics Definitions section in README.md (including
  updated metric names and new block/compact metrics)
- Restore full init() boilerplate in README Step 2
Copy link
Copy Markdown
Contributor Author

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @anshalshukla!

Addressed all feedback in commit f421d05:

  • Source labels restored: lean_attestations_valid_total and lean_attestations_invalid_total are back to CounterVec with source label (gossip / aggregation / block)
  • Timer placement: moved lean_committee_signatures_aggregation_time_seconds from aggregate() into aggregateUnlocked() so it only measures time after the mutex is acquired
  • compactAttestations metrics: added lean_compact_attestations_time_seconds, lean_compact_attestations_input_total, and lean_compact_attestations_output_total around the compactAttestations call in getProposalAttestations
  • README restored: Metrics Definitions section is back with updated names and new metrics documented; init() boilerplate in Step 2 also restored

@zclawz zclawz requested a review from anshalshukla April 20, 2026 14:29
@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented Apr 20, 2026

All @anshalshukla review comments addressed — marking ready for review:

  • Source labels restored: lean_attestations_valid_total / lean_attestations_invalid_total back to CounterVec with source label (gossip / aggregation / block) in both lib.zig and chain.zig
  • Timer placement: lean_committee_signatures_aggregation_time_seconds moved into aggregateUnlocked() (measured after mutex acquired)
  • compactAttestations metrics: added lean_compact_attestations_time_seconds, _input_total, _output_total in forkchoice.zig
  • README restored: Metrics Definitions section back with updated names; Step 2 init() boilerplate restored
  • Build: zig build passes ✅

@anshalshukla anshalshukla merged commit 98c73f3 into main Apr 21, 2026
13 checks passed
@anshalshukla anshalshukla deleted the feat/devnet4-metrics branch April 21, 2026 18:35
ch4r10t33r added a commit that referenced this pull request May 21, 2026
Align phase label with leanSpec #753; compact is spec-level recursive
merge, not an FFI-specific step name.
anshalshukla pushed a commit that referenced this pull request May 22, 2026
* metrics, node: add lean_block_proposal attestation build metrics

Instrument getProposalAttestations with cross-client lean_* phase timing,
build/child-payload counters, and attestation-data/aggregate histograms.
Distinct from zeam_compact_attestations_* (compactAttestations FFI only).

* metrics, node: rename proposal build phase compact_ffi to compact

Align phase label with leanSpec #753; compact is spec-level recursive
merge, not an FFI-specific step name.

* simtest: restore resilient node3 sync check in SSE integration test

CI stalled finalization at slot 12 so node3 never emitted its own
new_finalization within 480s. Accept head-event progress after the
delayed node3 start (original #484 approach) while still honoring
node3 finalization or a later global finalization when they occur.
Remove the per-event SUCCESS log that fired before assertions.

* simtest: accept one post-finalization head for node3 sync on CI

CI records ~25 new_head events by first finalization and only one more
before the chain stalls; requiring +5 never tripped got_node3_sync.
Use strictly-more-than baseline and re-check at timeout exit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants