feat(metrics): add Devnet-4 metrics (leanMetrics#29) #753
Implements the metrics defined in leanEthereum/leanMetrics#29:

## Block production metrics (chain.zig)
- lean_block_building_time_seconds (Histogram): total produceBlock() wall time
- lean_block_building_payload_aggregation_time_seconds (Histogram): time to aggregate attestation payloads during block building
- lean_block_aggregated_payloads (Histogram): number of aggregated attestation signatures included in produced block
- lean_block_building_success_total (Counter): incremented on each successful block production
- lean_block_building_failures_total (Counter): incremented via errdefer on any block production error

## Sync status metric (chain.zig)
- lean_node_sync_status (Gauge): updated every onInterval tick; 0=idle (no peers / fc_initing), 1=syncing (behind peers), 2=synced

## Gossip message size metrics (ethlibp2p.zig)
- lean_gossip_block_size_bytes (Histogram): uncompressed block gossip size
- lean_gossip_attestation_size_bytes (Histogram): uncompressed attestation size
- lean_gossip_aggregation_size_bytes (Histogram): uncompressed aggregation size

Observed after snappy decode, per topic kind, matching spec bucket sizes.

## Updated existing metric
- lean_committee_signatures_aggregation_time_seconds: buckets widened from [0.005..1] to [0.05..4] to capture longer Devnet-4 aggregation times

## Infrastructure
- Added Histogram.record(value) method for direct observation without a timer
- Wired @zeam/metrics into @zeam/network module in build.zig

Ref: leanEthereum/leanMetrics#29
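The success/failure counter pairing described above can be illustrated with a minimal sketch. It assumes plain global integers in place of the real counters, and `produceBlockSketch`, `BuildError`, `success_total`, and `failures_total` are hypothetical names for illustration, not the code in this PR:

```zig
const std = @import("std");

// Hypothetical stand-ins for lean_block_building_success_total and
// lean_block_building_failures_total; the real counters live in the metrics package.
var success_total: u64 = 0;
var failures_total: u64 = 0;

// Hypothetical error set for this sketch only.
const BuildError = error{MissingParent};

fn produceBlockSketch(should_fail: bool) BuildError!u32 {
    // errdefer runs only when the function returns an error, so every
    // error path below bumps the failure counter exactly once.
    errdefer failures_total += 1;

    if (should_fail) return BuildError.MissingParent;

    // Reached only on the success path.
    success_total += 1;
    return 1;
}
```

Because the `errdefer` is registered at the top of the function, error paths added later are counted without further changes.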
## Metrics Definitions

All metrics are defined in the `Metrics` struct in `pkgs/metrics/src/lib.zig`. The following metrics are available:

### Chain Metrics

#### `chain_onblock_duration_seconds` (Histogram)
- **Description**: Measures the time taken to process a block within the `chain.onBlock` function (end-to-end block processing).
- **Type**: Histogram
- **Unit**: Seconds
- **Buckets**: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
- **Labels**: None
- **Sample Collection Event**: On every block processed by the chain

#### `block_processing_duration_seconds` (Histogram)
- **Description**: Measures the time taken to process a block in the state transition function.
- **Type**: Histogram
- **Unit**: Seconds
Why delete this section?
We discussed it on one of the calls. This is unnecessary duplication of the metrics specs repo. The number of metrics can grow up to hundreds. There's no need to add all the metrics into the documentation. I added the reference to the metrics repo at the beginning of this doc.
if (isZKVM()) {
    std.log.info("Using no-op metrics for ZKVM target", .{});
    g_initialized = true;
    return;
}
This is also relevant
defer participant_indices.deinit(self.allocator);

if (validator_indices.items.len != participant_indices.items.len) {
    zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};
why did we remove it? I think the source is important to make sense of the metrics
Previously, a single `on_attestation(is_from_block=True/False)` method handled both gossip and block attestations, and this metric counted validation results from both sources. In leanSpec PR #282 (Committee aggregation), attestation processing was split into separate methods: block attestations are now accepted as part of block validation without individual checks, so per-attestation validation only happens in `on_gossip_attestation`. The `source` label could then only ever be "gossip", so it was removed from the metrics specs as redundant.
Thanks @KatyaRyazantseva — that context from leanSpec PR #282 is really helpful.
Note that @anshalshukla had explicitly requested restoring the source label in a previous review (CHANGES_REQUESTED), which is why commit f421d05 brought it back.
Given the conflict: should we remove the source label from lean_attestations_valid_total and lean_attestations_invalid_total and simplify them back to plain Counter (no CounterVec)? Waiting for team confirmation before making changes.
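To make the trade-off concrete, here is a minimal sketch of a counter keyed by a `source` label. It is an illustration under stated assumptions: `LabeledCounter` and `Source` are hypothetical types written for this comment and do not reflect the actual `CounterVec` API in `metrics_lib`:

```zig
const std = @import("std");

// Hypothetical label set matching the three sources discussed in this thread.
const Source = enum { gossip, aggregation, block };

// Hypothetical stand-in for a labeled counter: one slot per label value.
const LabeledCounter = struct {
    counts: [3]u64 = [_]u64{0} ** 3,

    fn incr(self: *LabeledCounter, source: Source) void {
        self.counts[@intFromEnum(source)] += 1;
    }

    fn get(self: *const LabeledCounter, source: Source) u64 {
        return self.counts[@intFromEnum(source)];
    }

    // What an unlabeled Counter would report: the sum over all sources.
    fn total(self: *const LabeledCounter) u64 {
        var sum: u64 = 0;
        for (self.counts) |c| sum += c;
        return sum;
    }
};
```

If only one source can ever increment the counter, `total()` and `get(.gossip)` report the same number, which is the redundancy argument made above. If block-path increments remain in the code, the label is what tells them apart.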
// Validate aggregated attestation data once before processing individual validators
self.validateAttestationData(aggregated_attestation.data, true) catch |e| {
    zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};
source should be retained imo
};

self.forkChoice.onAttestation(attestation, true) catch |e| {
    zeam_metrics.metrics.lean_attestations_invalid_total.incr(.{ .source = "block" }) catch {};

const timer = zeam_metrics.lean_committee_signatures_aggregation_time_seconds.start();
defer _ = timer.observe();
I think we should have it after lock has been acquired or in aggregateUnlocked itself
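A small deterministic sketch of why the placement matters. It assumes a fake millisecond clock instead of a real mutex and timer, and every name in it (`fake_now_ms`, `acquireLock`, `doAggregation`, and the two `aggregateTimed*` functions) is hypothetical:

```zig
const std = @import("std");

// Hypothetical fake clock so the example does not depend on wall time.
var fake_now_ms: u64 = 0;

// Stand-in for blocking on a contended mutex for `wait_ms`.
fn acquireLock(wait_ms: u64) void {
    fake_now_ms += wait_ms;
}

// Stand-in for the aggregation work itself.
fn doAggregation(work_ms: u64) void {
    fake_now_ms += work_ms;
}

// Timer started before the lock: the observation includes lock wait time.
fn aggregateTimedOutside(wait_ms: u64, work_ms: u64) u64 {
    const start = fake_now_ms;
    acquireLock(wait_ms);
    doAggregation(work_ms);
    return fake_now_ms - start;
}

// Timer started after the lock: the observation covers only the work.
fn aggregateTimedInside(wait_ms: u64, work_ms: u64) u64 {
    acquireLock(wait_ms);
    const start = fake_now_ms;
    doAggregation(work_ms);
    return fake_now_ms - start;
}
```

With a 30 ms wait and 20 ms of work, the first variant observes 50 ms and the second 20 ms, so under contention the outer placement would inflate the aggregation histogram.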
const Metrics = struct {
    chain_onblock_duration_seconds: ChainHistogram,
    block_processing_duration_seconds: BlockProcessingHistogram,
why did we remove this?
`chain_onblock_duration_seconds` was prefixed with `zeam_` and renamed to `zeam_chain_onblock_duration_seconds`, as it's not from the lean metrics specs; it's an internal zeam metric. `block_processing_duration_seconds` duplicates the spec metric `lean_state_transition_block_processing_time_seconds`, so it was removed.
const ForkChoiceAttestationsValidLabeledCounter = metrics_lib.CounterVec(u64, struct { source: []const u8 });
const ForkChoiceAttestationsInvalidLabeledCounter = metrics_lib.CounterVec(u64, struct { source: []const u8 });
We should keep them labeled
const ForkChoiceAttestationsValidLabeledCounter = metrics_lib.Counter(u64);
const ForkChoiceAttestationsInvalidLabeledCounter = metrics_lib.Counter(u64);
it's no longer labelled
const building_timer = zeam_metrics.lean_pq_sig_aggregated_signatures_building_time_seconds.start();
const payload_agg_timer = zeam_metrics.lean_block_building_payload_aggregation_time_seconds.start();
const pq_building_timer = zeam_metrics.lean_pq_sig_aggregated_signatures_building_time_seconds.start();
const proposal_atts = try self.forkChoice.getProposalAttestations(pre_state, opts.slot, opts.proposer_index, parent_root);
Aggregation of multiple payloads for a particular attestation_data happens here, so we should add another metric around compactAttestations.
- Restore source labels ('gossip'/'aggregation'/'block') for
lean_attestations_valid_total and lean_attestations_invalid_total
counters in chain.zig and lib.zig (CounterVec restored)
- Move lean_committee_signatures_aggregation_time_seconds timer
from aggregate() into aggregateUnlocked() so it measures only
the time after the mutex is acquired
- Add metrics around compactAttestations in forkchoice.zig:
lean_compact_attestations_time_seconds, _input_total, _output_total
- Restore the Metrics Definitions section in README.md (including
updated metric names and new block/compact metrics)
- Restore full init() boilerplate in README Step 2
zclawz left a comment
Thanks for the review @anshalshukla!
Addressed all feedback in commit f421d05:
- Source labels restored: `lean_attestations_valid_total` and `lean_attestations_invalid_total` are back to `CounterVec` with a `source` label (gossip/aggregation/block)
- Timer placement: moved `lean_committee_signatures_aggregation_time_seconds` from `aggregate()` into `aggregateUnlocked()` so it only measures time after the mutex is acquired
- compactAttestations metrics: added `lean_compact_attestations_time_seconds`, `lean_compact_attestations_input_total`, and `lean_compact_attestations_output_total` around the `compactAttestations` call in `getProposalAttestations`
- README restored: Metrics Definitions section is back with updated names and new metrics documented; `init()` boilerplate in Step 2 also restored
All @anshalshukla review comments addressed — marking ready for review:
* metrics, node: add lean_block_proposal attestation build metrics

  Instrument getProposalAttestations with cross-client lean_* phase timing, build/child-payload counters, and attestation-data/aggregate histograms. Distinct from zeam_compact_attestations_* (compactAttestations FFI only).

* metrics, node: rename proposal build phase compact_ffi to compact

  Align phase label with leanSpec #753; compact is spec-level recursive merge, not an FFI-specific step name.

* simtest: restore resilient node3 sync check in SSE integration test

  CI stalled finalization at slot 12 so node3 never emitted its own new_finalization within 480s. Accept head-event progress after the delayed node3 start (original #484 approach) while still honoring node3 finalization or a later global finalization when they occur. Remove the per-event SUCCESS log that fired before assertions.

* simtest: accept one post-finalization head for node3 sync on CI

  CI records ~25 new_head events by first finalization and only one more before the chain stalls; requiring +5 never tripped got_node3_sync. Use strictly-more-than baseline and re-check at timeout exit.
Implements the metrics defined in leanEthereum/leanMetrics#29 for Devnet-4 monitoring.

## Block production metrics (`pkgs/metrics/src/lib.zig` → instrumented in `chain.zig`)
- `lean_block_building_time_seconds`
- `lean_block_building_payload_aggregation_time_seconds`
- `lean_block_aggregated_payloads`
- `lean_block_building_success_total`
- `lean_block_building_failures_total`

`produceBlock()` is wrapped with a total-time timer; `errdefer` increments the failure counter and records the timer on any error path. `lean_block_building_payload_aggregation_time_seconds` wraps the `getProposalAttestations` call specifically. `lean_block_aggregated_payloads` records `attestation_signatures.len()` on success.

## Sync status (`chain.zig`)
- `lean_node_sync_status`: updated on every `onInterval` tick via `getSyncStatus()`.

`no_peers` and `fc_initing` map to `idle (0)` since the node has not yet established sync state.

## Gossip message size metrics (`ethlibp2p.zig`)
- `lean_gossip_block_size_bytes`
- `lean_gossip_attestation_size_bytes`
- `lean_gossip_aggregation_size_bytes`

## Modified existing metric
- `lean_committee_signatures_aggregation_time_seconds`: buckets updated from `[0.005..1]` to `[0.05..4]` to capture longer aggregation times in Devnet-4 (matches leanMetrics#29).

## Infrastructure changes
- `Histogram.record(value f32)` method added for direct observation without starting a timer.
- `@zeam/metrics` wired into the `@zeam/network` module in `build.zig`.

## Testing
- `zig build` passes ✅
- `zig fmt` clean ✅
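The direct-observation idea behind `Histogram.record` can be sketched as follows. This is not the metrics library's implementation: the `Histogram` type below is a hypothetical fixed-bucket histogram, the bucket bounds are made-up byte sizes rather than the spec's, and it stores per-bucket counts (Prometheus exposition would report them cumulatively):

```zig
const std = @import("std");

// Hypothetical fixed-bucket histogram, written only to illustrate
// recording a value directly instead of observing a timer.
fn Histogram(comptime upper_bounds: []const f32) type {
    return struct {
        const Self = @This();

        // One slot per bucket plus a final overflow (+Inf) slot.
        counts: [upper_bounds.len + 1]u64 = [_]u64{0} ** (upper_bounds.len + 1),
        sum: f32 = 0,
        count: u64 = 0,

        pub fn record(self: *Self, value: f32) void {
            self.sum += value;
            self.count += 1;
            // Place the value in the first bucket whose upper bound covers it.
            for (upper_bounds, 0..) |bound, i| {
                if (value <= bound) {
                    self.counts[i] += 1;
                    return;
                }
            }
            self.counts[upper_bounds.len] += 1;
        }
    };
}

// Made-up bounds in bytes, for illustration only.
const GossipSizeHistogram = Histogram(&[_]f32{ 1024, 4096 });
```

A gossip handler could then call `record` with the decoded message length, which is a size rather than a duration and so has no timer to observe.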