feat(metrics): add `lean_gossip_mesh_peers` gauge by zclawz · Pull Request #818 · blockblaz/zeam

zclawz · 2026-05-01T14:03:50Z

Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35: adds a new lean_gossip_mesh_peers Prometheus gauge that reports the number of remote peers in this node's gossipsub mesh across all subscribed topics.

Implementation

Same shape as the zeam_libp2p_swarm_command_dropped_total plumbing from #808:

Rust (rust/libp2p-glue/src/lib.rs): per-network MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>, updated from inside the swarm task whenever a gossipsub event fires, a ConnectionClosed event arrives, or a 1-second tick elapses. Exposed via a new FFI getter get_mesh_peers_total(network_id). The slot is pre-created when the network starts and removed on stop_network so repeated start/stop cycles in tests don't leak entries.
Zig FFI mirror (pkgs/network/src/ethlibp2p.zig): refreshMeshPeersMetric reads the FFI getter and sets the gauge. Because registerScrapeRefresher only stores a single callback, both this and the existing #808 swarm-drop refresher are chained via a fan-out refreshNetworkMetrics.
Metric registration (pkgs/metrics/src/lib.zig): adds the gauge to the registry as an unlabeled Gauge(u64).
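The per-network slot table described in the Rust item can be sketched roughly as follows. This is a minimal illustration, not the code in rust/libp2p-glue: the `Arc` wrapper follows the type named in a later commit message, the `table` and `remove_mesh_peers_slot` helpers are hypothetical, and the real getter is an FFI export rather than a plain Rust function.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Assumed shape of the per-network table: network_id -> mesh peer count.
static MESH_PEERS_TOTAL: OnceLock<Mutex<HashMap<u32, Arc<AtomicU64>>>> = OnceLock::new();

// Hypothetical helper: lazily create the table on first use.
fn table() -> &'static Mutex<HashMap<u32, Arc<AtomicU64>>> {
    MESH_PEERS_TOTAL.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Fetch the slot for a network, creating it if it does not exist yet.
fn mesh_peers_slot(network_id: u32) -> Arc<AtomicU64> {
    let mut map = table().lock().unwrap();
    Arc::clone(
        map.entry(network_id)
            .or_insert_with(|| Arc::new(AtomicU64::new(0))),
    )
}

/// Called from the swarm task whenever the mesh size is recomputed.
fn record_mesh_peers(network_id: u32, count: u64) {
    mesh_peers_slot(network_id).store(count, Ordering::Relaxed);
}

/// Read side used by the scrape path; unknown networks read as 0.
fn get_mesh_peers_total(network_id: u32) -> u64 {
    let map = table().lock().unwrap();
    map.get(&network_id)
        .map_or(0, |slot| slot.load(Ordering::Relaxed))
}

/// Hypothetical cleanup hook: drop the entry so start/stop cycles do not leak.
fn remove_mesh_peers_slot(network_id: u32) {
    table().lock().unwrap().remove(&network_id);
}
```

The lock is held only long enough to find the slot; the counter itself is updated and read through the atomic.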

Notes / open questions for review

Label scheme. The leanMetrics PR description shows a client=<name>_<N> label format. I emitted this as an unlabeled Gauge(u64) because in zeam's deployment that label conventionally comes from the Prometheus scrape job's instance rewriting. zeam's existing lean_connected_peers does use a per-remote-peer 0/1 label scheme, so if reviewers prefer the new metric to match that style exactly, a follow-up PR can wire up gossipsub Subscribed/Unsubscribed event handling (which currently isn't consumed) and resolve peer-id → node-name on those events.
1-second tick. Defensive — gossipsub events and ConnectionClosed should already cover every transition, but a tick guarantees liveness on idle topics. all_mesh_peers().count() is O(peers) and the atomic store is lock-free, so the swarm task isn't blocked.
registerScrapeRefresher is single-slot. Discovered while wiring this up: registering individually would silently overwrite the #808 refresher. Fan-out is the smallest fix; a follow-up could turn it into a list. Left a comment at the registration site spelling this out.

Build / test

zig build (ReleaseFast) — clean.
zig build test — clean (no failures introduced; existing negative-path validation tests untouched).

Closes the zeam side of leanEthereum/leanMetrics#35.

Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35. Reports the number of remote peers in this node's gossipsub mesh across all subscribed topics, kept fresh from the rust-libp2p swarm task and exposed via FFI to the Zig metrics layer for "on scrape" reads. Same pattern as `zeam_libp2p_swarm_command_dropped_total` (#808).

Implementation:
- Rust: per-network `MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>` updated from inside the swarm task on every gossipsub event, every `ConnectionClosed`, and on a 1s liveness tick. New FFI getter `get_mesh_peers_total(network_id) -> u64`.
- Zig: `refreshMeshPeersMetric` reads the FFI getter on each Prometheus scrape and calls `lean_gossip_mesh_peers.set(count)`. Chained with the existing #808 swarm-drop refresher via a fan-out `refreshNetworkMetrics`, since `registerScrapeRefresher` only stores one callback.
- Metrics: registers `lean_gossip_mesh_peers` as an unlabeled `Gauge(u64)`.

Notes:
- Emitted as an unlabeled gauge. The leanMetrics PR shows a `client=<name>_<N>` label; in zeam's setup that label conventionally comes from the Prometheus scrape job's `instance` rewriting. Per-remote-peer mesh tracking (one 0/1 series per remote, matching how `lean_connected_peers` is implemented) would require subscribing to gossipsub Subscribed/Unsubscribed events and is left as a follow-up.
- The 1s tick is defensive: gossipsub events should already cover all transitions, but a tick guarantees liveness on idle topics.

anshalshukla · 2026-05-02T09:56:24Z

@zclawz merge the main branch into it and resolve conflicts

Resolve conflict in pkgs/metrics/src/lib.zig: main added io parameter to labeled metrics init calls (zig 0.16.0 upgrade #784); our branch adds lean_gossip_mesh_peers gauge. Keep both changes.

zclawz · 2026-05-02T10:02:49Z

Merged main into the branch and resolved the conflict in pkgs/metrics/src/lib.zig.

The conflict was between main adding an io parameter to labeled-metrics init calls (zig 0.16.0 upgrade, #784) and our branch adding the lean_gossip_mesh_peers gauge line. Kept both changes — using the updated io-parameter API and retaining the new gauge. Pushed as commit 4fa0140.

Address PR #818 review:

1) Add a Zig scrape test for `lean_gossip_mesh_peers` so the gauge-to-`writeMetrics` contract is locked in CI, not just in docs/PR comments. Also adds a test for the refresher fan-out change below. Wires `metrics_tests` into `zig build test`.

2) Convert `registerScrapeRefresher` and `registerScrapeRefresherCtx` from single-slot to bounded-list (`MAX_SCRAPE_REFRESHERS = 16`) so a future third caller cannot silently overwrite a previous registration. Both lists run on every scrape, void-context first, ctx-bearing second, preserving the legacy ordering. Drops the "do not register individually" guardrail comment in `ethlibp2p.zig`; `refreshNetworkMetrics` is kept as a single per-module fan-out for readability.

3) Make `record_mesh_peers` / `mesh_peers_slot` and the stop-network cleanup in `rust/libp2p-glue` recover from a poisoned `MESH_PEERS_TOTAL` mutex via `match … Err(p) => p.into_inner()`, matching the existing read-path handling in `get_mesh_peers_total`. Avoids escalating a metric-side panic into a swarm-task crash on the 1s mesh-peers tick or on shutdown. (Cannot use `MutexExt::lock_recover` from #819 yet because that PR is unmerged; the explicit `match` is the same recovery, inlined.)

zclawz · 2026-05-07T08:21:15Z

Pushed 2b5c266 addressing all three review items:

1. Automated scrape test for `lean_gossip_mesh_peers` ✅

Added pkgs/metrics/src/lib.zig tests + a metrics_tests step wired into zig build test:

lean_gossip_mesh_peers gauge appears in scrape output: calls metrics.lean_gossip_mesh_peers.set(4242), runs writeMetrics(&writer), and asserts the body contains both the metric name and the literal value line lean_gossip_mesh_peers 4242. Locks the gauge-to-serializer contract, the same shape as pkgs/node/src/locking.zig's LockTimer → /metrics test from slice (b).
registerScrapeRefresher fans out to all registered callbacks — guards item (2) below in code (see next section).

The FFI side (get_mesh_peers_total → refreshMeshPeersMetric → gauge) still needs a real swarm to exercise end-to-end, but the gauge↔scrape path — the place where a future struct rename or serializer regression would silently break the contract — is now covered.

2. `registerScrapeRefresher` → append-to-list ✅

Replaced both g_scrape_refresher (void→void) and g_scrape_refresher_ctx (*anyopaque→void) single slots with bounded fixed-size lists:

const MAX_SCRAPE_REFRESHERS: usize = 16;
var g_scrape_refreshers: [MAX_SCRAPE_REFRESHERS]*const fn () void = undefined;
var g_scrape_refreshers_len: usize = 0;
// …and the same for ctx-bearing.

registerScrapeRefresher and registerScrapeRefresherCtx now append; writeMetrics iterates both lists on every scrape (void-list first, then ctx-list, preserving the legacy ordering between FFI-backed atomic refreshes and context-bearing observers that may read from them). Allocator-free, ZKVM-safe, panics on overflow (which would indicate a registration bug, not legitimate growth — current usage is 2 callsites).

The refreshNetworkMetrics fan-out in ethlibp2p.zig is kept (one entry per module is still cleaner for auditing, and the comment is updated to reflect that it's no longer load-bearing for correctness). Removed the "do NOT register them individually or you will overwrite" guardrail since the registry is now safe.
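For readers more at home on the Rust side of the bridge, the append-to-list registry can be approximated as below. This is a Rust analogue written for illustration only: the actual registry is the Zig code in pkgs/metrics, and `RefresherList`, `refresh_swarm_drops`, and `refresh_mesh_peers` are hypothetical names, not identifiers from the repository.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_SCRAPE_REFRESHERS: usize = 16;

/// Allocation-free, fixed-capacity list of scrape callbacks (hypothetical type).
struct RefresherList {
    slots: [Option<fn()>; MAX_SCRAPE_REFRESHERS],
    len: usize,
}

impl RefresherList {
    const fn new() -> Self {
        Self { slots: [None; MAX_SCRAPE_REFRESHERS], len: 0 }
    }

    /// Append a callback; overflow panics because it signals a registration bug.
    fn register(&mut self, refresher: fn()) {
        assert!(self.len < MAX_SCRAPE_REFRESHERS, "scrape refresher list is full");
        self.slots[self.len] = Some(refresher);
        self.len += 1;
    }

    /// Run every registered callback in registration order (one scrape).
    fn run_all(&self) {
        for refresher in self.slots[..self.len].iter().flatten() {
            refresher();
        }
    }
}

// Example callbacks that only count how often they were invoked.
static SWARM_DROP_REFRESHES: AtomicUsize = AtomicUsize::new(0);
static MESH_PEER_REFRESHES: AtomicUsize = AtomicUsize::new(0);

fn refresh_swarm_drops() {
    SWARM_DROP_REFRESHES.fetch_add(1, Ordering::Relaxed);
}

fn refresh_mesh_peers() {
    MESH_PEER_REFRESHES.fetch_add(1, Ordering::Relaxed);
}
```

Unlike a single slot, a second `register` call here adds to the list instead of replacing the first callback, which is the property the new fan-out test guards.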

3. `record_mesh_peers` write-path mutex poisoning ✅

Applied the same match … Err(poisoned) => poisoned.into_inner() recovery to both:

mesh_peers_slot() (the write path used by record_mesh_peers from the swarm task on every gossipsub event + the 1s tick), and
MESH_PEERS_TOTAL.lock() inside stop_network (cleanup path).

Now symmetric with the existing read-path handling in get_mesh_peers_total. Note: I went with the explicit match rather than MutexExt::lock_recover because PR #819 is not yet merged on main; once #819 lands, a follow-up rebase can switch to the trait. Behaviour is identical — MutexExt::lock_recover is just sugar for the same match.
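The recovery pattern referred to above looks roughly like this in isolation. The helper name `lock_or_recover` is made up for this sketch and is not the `MutexExt::lock_recover` trait method from #819, whose exact signature is not shown in this thread.

```rust
use std::sync::{Mutex, MutexGuard};

/// Lock a mutex, taking the guard even if an earlier holder panicked.
/// A poisoned std Mutex still protects valid memory; poisoning only
/// flags that the data may be logically inconsistent.
fn lock_or_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    match mutex.lock() {
        Ok(guard) => guard,
        Err(poisoned) => poisoned.into_inner(),
    }
}
```

For a plain counter table the worst case after recovery is a stale value, which is a far milder failure than propagating the panic into the swarm task.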

Verification

zig build — green.

zig build test — all 132 tests pass; the new metrics_tests step runs both new tests cleanly:

1/2 lib.test.lean_gossip_mesh_peers gauge appears in scrape output...OK
2/2 lib.test.registerScrapeRefresher fans out to all registered callbacks...OK
All 2 tests passed.

cargo build -p libp2p-glue and cargo check -p libp2p-glue --tests — green.

Ready for another look 🙏

…orks

Address PR #818 follow-up review (findings 4, 5, 6, plus the doc note from finding 7):

(4) Gate the mesh-peer recompute on the gossipsub event variants that actually change mesh membership: `Subscribed`, `Unsubscribed`, `GossipsubNotSupported`, `SlowPeer`. `Message` events do NOT affect mesh membership, and a busy node delivering hundreds of `Message`/sec was paying O(peers) per event for nothing, making the cumulative refresh cost O(peers^2). The `ConnectionClosed` recompute and the 1s `mesh_peers_tick` continue to cover transitions outside this branch, so a missed gauge update inside `Message` cannot drift longer than ~1s.

(5) Replace `MESH_PEERS_TOTAL: Mutex<HashMap<u32, Arc<AtomicU64>>>` with `static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS]`, mirroring the `SWARM_COMMAND_DROPPED_TOTAL` pattern (#808). The network slot table at the top of the file already caps live networks at 3, so a fixed-size array fits the actual constraint. No `Mutex`, no `HashMap`, no `Arc`, no poisoning concerns; drops findings (3) and (5) in one move.

(6) Drop the Zig-side `mesh_peers_network_id` global. With the fixed-size atomic array on the Rust side, `refreshMeshPeersMetric` now sums across every `network_id` slot rather than tracking a single "last-init'd" id in a Zig-side mutable global. Inactive slots default to 0 and are reset to 0 by `stop_network`, so the sum is the correct single-gauge answer for every current usage. Per-network labelled gauges (the `client=<name>_<N>` scheme the PR description punted on) become a localised follow-up rather than a re-architecture.

(7) Document scrape-vs-lifecycle semantics on `get_mesh_peers_total`: between `stop_network` and a subsequent `create_and_run_network` on the same id, a scrape returns 0. Operators must distinguish "network up with 0 mesh peers" from "network restarting" via orthogonal signals; a separate `lean_gossip_mesh_running` gauge is the obvious follow-up if the distinction becomes load-bearing.

Also notes the FFI `u64` return is stable across 32/64-bit architectures (cosmetic, finding 9).

zclawz · 2026-05-07T08:47:03Z

Pushed 69cc2fd addressing the follow-up review (findings 4, 5, 6, and the doc note from 7):

4. Gate per-event mesh recompute ✅

The BehaviourEvent::Gossipsub branch in Network::run_eventloop no longer recomputes all_mesh_peers().count() on every event. It now uses matches! to gate on the variants that actually change mesh membership:

Subscribed / Unsubscribed — peer joined/left a topic this node is subscribed to.
GossipsubNotSupported — connected peer turned out not to speak gossipsub.
SlowPeer — gossipsub may evict the peer under backpressure.

Message events (the dominant traffic on a busy node) no longer trigger the O(peers) walk. Cumulative cost drops from O(peers²) per scrape window to O(peers × mesh-changing-events). The 1s mesh_peers_tick and the ConnectionClosed recompute cover "events outside this branch", so a missed update inside Message cannot drift longer than ~1s.
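The gating idea can be shown with a small self-contained sketch. `GossipEvent` below is a local stand-in that only borrows the variant names listed above; the real libp2p gossipsub event type carries payloads that are omitted here, and `changes_mesh_membership` and `recomputes_for` are hypothetical helper names.

```rust
/// Stand-in for the gossipsub event variants discussed above (payloads omitted).
enum GossipEvent {
    Message,
    Subscribed,
    Unsubscribed,
    GossipsubNotSupported,
    SlowPeer,
}

/// True only for the variants treated as mesh-membership changes.
fn changes_mesh_membership(event: &GossipEvent) -> bool {
    matches!(
        event,
        GossipEvent::Subscribed
            | GossipEvent::Unsubscribed
            | GossipEvent::GossipsubNotSupported
            | GossipEvent::SlowPeer
    )
}

/// Number of O(peers) recomputes a batch of events would trigger once gated.
fn recomputes_for(events: &[GossipEvent]) -> usize {
    events
        .iter()
        .filter(|event| changes_mesh_membership(event))
        .count()
}
```

With this gate, a batch dominated by `Message` events triggers a recompute only for the few membership-changing events it contains.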

5. Lock-free fixed-size atomic array ✅

const MAX_NETWORKS: usize = 3;
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

Mirrors the SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3] pattern from #808. The hardcoded slot table at the top of lib.rs (get_swarm_mut / set_swarm / get_zig_handler / …) already caps live networks at 3, so a fixed-size array fits the actual constraint.

Results:

No Mutex<HashMap>, no Arc, no lock().unwrap() (= no poisoning concern, no Err(p).into_inner() recovery needed) on this metric path.
record_mesh_peers is a single relaxed atomic store; get_mesh_peers_total is a single relaxed atomic load.
mesh_peers_slot() deleted; the start_network warm-up call (let _ = mesh_peers_slot(self.network_id)) deleted (slots are always present).
stop_network resets the slot to 0 with slot.store(0, Relaxed) instead of removing a HashMap entry.

This drops findings (3) and (5) in one move.
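Put together, the lock-free write, read, and reset paths might look like the sketch below. The bounds handling for an out-of-range `network_id` (ignore on write, 0 on read) and the `reset_mesh_peers` helper are assumptions made for this example; the thread does not say how the real code treats that case, and the real getter is an FFI export.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_NETWORKS: usize = 3;
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

/// Write path: a single relaxed store. Out-of-range ids are ignored (assumption).
fn record_mesh_peers(network_id: u32, count: u64) {
    if let Some(slot) = MESH_PEERS_TOTAL.get(network_id as usize) {
        slot.store(count, Ordering::Relaxed);
    }
}

/// Read path: a single relaxed load. Out-of-range ids read as 0 (assumption).
fn get_mesh_peers_total(network_id: u32) -> u64 {
    MESH_PEERS_TOTAL
        .get(network_id as usize)
        .map_or(0, |slot| slot.load(Ordering::Relaxed))
}

/// Hypothetical stand-in for the reset that stop_network performs.
fn reset_mesh_peers(network_id: u32) {
    record_mesh_peers(network_id, 0);
}
```

Relaxed ordering is sufficient here because the gauge is an independent value: no other memory is published through it.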

6. Drop Zig-side `mesh_peers_network_id` global ✅

refreshMeshPeersMetric in pkgs/network/src/ethlibp2p.zig now sums across every network_id slot:

fn refreshMeshPeersMetric() void {
    var total: u64 = 0;
    var network_id: u32 = 0;
    while (network_id < MESH_PEERS_MAX_NETWORKS) : (network_id += 1) {
        total += get_mesh_peers_total(network_id);
    }
    zeam_metrics.metrics.lean_gossip_mesh_peers.set(total);
}

The var mesh_peers_network_id: u32 = 0 global is gone, along with its init-time assignment. Inactive slots default to 0 and are reset to 0 by stop_network, so summing is the correct single-gauge answer for every current usage and for any future multi-network test/deployment.

A per-network labelled gauge (client=<name>_<N> from the leanSpec) becomes a localised follow-up: replace the total accumulator with one gauge.set(.{ .network_id = N }, count) call per non-zero slot. The fixed-size atomic shape on the Rust side is what makes that change small.

7. Doc note on scrape-vs-lifecycle semantics ✅

Added a paragraph on get_mesh_peers_total explaining that between stop_network(network_id) and a subsequent create_and_run_network(network_id) a scrape returns 0, with a pointer to a separate lean_gossip_mesh_running boolean gauge as the obvious follow-up if the "up with 0 peers" vs "restarting" distinction becomes load-bearing for dashboards. Also noted the FFI u64 is stable across 32/64-bit architectures (cosmetic, finding 9).
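The ambiguity in that doc note, and how a separate running flag would resolve it, can be illustrated with a single-network sketch. Everything here is hypothetical: `MESH_RUNNING` stands in for the proposed `lean_gossip_mesh_running` gauge, which this PR does not implement, and the function names only echo the lifecycle calls mentioned above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static MESH_PEERS: AtomicU64 = AtomicU64::new(0);
// Hypothetical companion flag backing a lean_gossip_mesh_running gauge.
static MESH_RUNNING: AtomicBool = AtomicBool::new(false);

fn start_network() {
    MESH_RUNNING.store(true, Ordering::Relaxed);
}

fn stop_network() {
    // Reset the count so a stopped network reads as 0, then mark it down.
    MESH_PEERS.store(0, Ordering::Relaxed);
    MESH_RUNNING.store(false, Ordering::Relaxed);
}

fn record_mesh_peers(count: u64) {
    MESH_PEERS.store(count, Ordering::Relaxed);
}

/// What a scrape would observe: (mesh peer count, running as 0 or 1).
fn scrape() -> (u64, u64) {
    (
        MESH_PEERS.load(Ordering::Relaxed),
        MESH_RUNNING.load(Ordering::Relaxed) as u64,
    )
}
```

A running network with an empty mesh scrapes as `(0, 1)` while a stopped one scrapes as `(0, 0)`; with the peer count alone, both cases read as 0.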

8 / 9 — punted as suggested

(8) 1s tick recomputing on idle nodes — reviewer said "leave for future cleanup if the per-event recompute is also kept". The per-event recompute is now gated, so the tick doesn't need a dirty flag for correctness; left as-is.
(9) as u64 documentation — covered in the doc note above.

Verification

cargo check -p libp2p-glue --tests — green.
zig build — green.

zig build test — all tests pass; the metrics_tests step from the previous round still runs and locks the gauge↔scrape contract:

1/2 lib.test.lean_gossip_mesh_peers gauge appears in scrape output...OK
2/2 lib.test.registerScrapeRefresher fans out to all registered callbacks...OK
All 2 tests passed.

Ready for review again 🙏

ch4r10t33r

Looks good.

zclawz added 2 commits May 1, 2026 14:03

fix: resolve clippy single_match lint - use if let for gossipsub event

97e1684

chore: merge main into feat/lean-gossip-mesh-peers-metric

4fa0140

ch4r10t33r and others added 4 commits May 6, 2026 19:32

Merge branch 'main' into feat/lean-gossip-mesh-peers-metric

fdc8683

Merge branch 'main' into feat/lean-gossip-mesh-peers-metric

bf50425

Merge branch 'main' into feat/lean-gossip-mesh-peers-metric

df3b889

ch4r10t33r approved these changes May 7, 2026

ch4r10t33r merged commit 0d04ea2 into main May 7, 2026
13 checks passed

ch4r10t33r deleted the feat/lean-gossip-mesh-peers-metric branch May 7, 2026 11:04

This was referenced May 7, 2026

Release v0.4.16 (Devnet4) #839

Merged

node: buffer future-slot gossip blocks during clock lag (fixes #788) #841

Merged

ch4r10t33r merged 8 commits into main from feat/lean-gossip-mesh-peers-metric

3 participants
