
feat(metrics): add lean_gossip_mesh_peers gauge #818

Merged: ch4r10t33r merged 8 commits into main from feat/lean-gossip-mesh-peers-metric on May 7, 2026


@zclawz commented May 1, 2026

Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35: adds a new lean_gossip_mesh_peers Prometheus gauge that reports the number of remote peers in this node's gossipsub mesh across all subscribed topics.

Implementation

Same shape as the zeam_libp2p_swarm_command_dropped_total plumbing from #808:

  1. Rust (rust/libp2p-glue/src/lib.rs): per-network MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>, updated from inside the swarm task whenever a gossipsub event fires, a ConnectionClosed event arrives, or a 1-second tick elapses. Exposed via a new FFI getter get_mesh_peers_total(network_id). The slot is pre-created when the network starts and removed on stop_network so repeated start/stop cycles in tests don't leak entries.
  2. Zig FFI mirror (pkgs/network/src/ethlibp2p.zig): refreshMeshPeersMetric reads the FFI getter and sets the gauge. Because registerScrapeRefresher only stores a single callback, both this and the existing #808 swarm-drop refresher are chained via a fan-out refreshNetworkMetrics.
  3. Metric registration (pkgs/metrics/src/lib.zig): adds the gauge to the registry as an unlabeled Gauge(u64).
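The Rust side of step 1 can be sketched roughly as follows. This is a minimal illustration, not the code in rust/libp2p-glue: the `Mutex<HashMap<u32, Arc<AtomicU64>>>` shape follows the type named in a later commit message on this PR, while `registry` and `remove_mesh_peers_slot` are hypothetical helper names, and the real getter is exposed over FFI rather than as a plain function.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// One counter per network id, created lazily. Assumed layout for illustration.
static MESH_PEERS_TOTAL: OnceLock<Mutex<HashMap<u32, Arc<AtomicU64>>>> = OnceLock::new();

// Hypothetical helper: returns the process-wide registry.
fn registry() -> &'static Mutex<HashMap<u32, Arc<AtomicU64>>> {
    MESH_PEERS_TOTAL.get_or_init(|| Mutex::new(HashMap::new()))
}

// Returns the slot for `network_id`, creating it on first use.
// `unwrap` panics if the mutex is poisoned; this sketch does not recover.
fn mesh_peers_slot(network_id: u32) -> Arc<AtomicU64> {
    let mut map = registry().lock().unwrap();
    map.entry(network_id)
        .or_insert_with(|| Arc::new(AtomicU64::new(0)))
        .clone()
}

// Write path: called from the swarm task with the freshly counted mesh size.
fn record_mesh_peers(network_id: u32, count: u64) {
    mesh_peers_slot(network_id).store(count, Ordering::Relaxed);
}

// Read path: returns 0 for a network that has no slot.
fn get_mesh_peers_total(network_id: u32) -> u64 {
    let map = registry().lock().unwrap();
    map.get(&network_id)
        .map_or(0, |slot| slot.load(Ordering::Relaxed))
}

// Hypothetical cleanup helper: drops the slot so start/stop cycles do not leak entries.
fn remove_mesh_peers_slot(network_id: u32) {
    registry().lock().unwrap().remove(&network_id);
}
```

Calling `record_mesh_peers(7, 12)` and then `get_mesh_peers_total(7)` yields 12; after `remove_mesh_peers_slot(7)` the getter falls back to 0.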

Notes / open questions for review

  • Label scheme. The leanMetrics PR description shows a client=<name>_<N> label format. I emitted this as an unlabeled Gauge(u64) because in zeam's deployment that label conventionally comes from the Prometheus scrape job's instance rewriting. zeam's existing lean_connected_peers does use a per-remote-peer 0/1 label scheme, so if reviewers prefer the new metric to match that style exactly, a follow-up PR can wire up gossipsub Subscribed/Unsubscribed event handling (which currently isn't consumed) and resolve peer-id → node-name on those events.
  • 1-second tick. Defensive — gossipsub events and ConnectionClosed should already cover every transition, but a tick guarantees liveness on idle topics. all_mesh_peers().count() is O(peers) and the atomic store is lock-free, so the swarm task isn't blocked.
  • registerScrapeRefresher is single-slot. Discovered while wiring this up: registering each refresher individually would silently overwrite the #808 refresher. Fan-out is the smallest fix; a follow-up could turn the slot into a list. Left a comment at the registration site spelling this out.

Build / test

  • zig build (ReleaseFast) — clean.
  • zig build test — clean (no failures introduced; existing negative-path validation tests untouched).

Closes the zeam side of leanEthereum/leanMetrics#35.

zclawz added 2 commits May 1, 2026 14:03
Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35.

Reports the number of remote peers in this node's gossipsub mesh across
all subscribed topics, kept fresh from the rust-libp2p swarm task and
exposed via FFI to the Zig metrics layer for "on scrape" reads — same
pattern as `zeam_libp2p_swarm_command_dropped_total` (#808).

Implementation:

- Rust: per-network `MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>` updated
  from inside the swarm task on every gossipsub event, every
  `ConnectionClosed`, and on a 1s liveness tick. New FFI getter
  `get_mesh_peers_total(network_id) -> u64`.
- Zig: `refreshMeshPeersMetric` reads the FFI getter on each Prometheus
  scrape and calls `lean_gossip_mesh_peers.set(count)`. Chained with the
  existing #808 swarm-drop refresher via a fan-out
  `refreshNetworkMetrics`, since `registerScrapeRefresher` only stores
  one callback.
- Metrics: registers `lean_gossip_mesh_peers` as an unlabeled
  `Gauge(u64)`.

Notes:

- Emitted as an unlabeled gauge. The leanMetrics PR shows a
  `client=<name>_<N>` label; in zeam's setup that label conventionally
  comes from the Prometheus scrape job's `instance` rewriting. Per-
  remote-peer mesh tracking (one 0/1 series per remote, matching how
  `lean_connected_peers` is implemented) would require subscribing to
  gossipsub Subscribed/Unsubscribed events and is left as a follow-up.
- The 1s tick is defensive — gossipsub events should already cover all
  transitions, but a tick guarantees liveness on idle topics.
@anshalshukla

@zclawz merge the main branch into it and resolve conflicts

Resolve conflict in pkgs/metrics/src/lib.zig: main added io parameter
to labeled metrics init calls (zig 0.16.0 upgrade #784); our branch
adds lean_gossip_mesh_peers gauge. Keep both changes.

zclawz commented May 2, 2026

Merged main into the branch and resolved the conflict in pkgs/metrics/src/lib.zig.

The conflict was between main adding an io parameter to labeled-metrics init calls (zig 0.16.0 upgrade, #784) and our branch adding the lean_gossip_mesh_peers gauge line. Kept both changes — using the updated io-parameter API and retaining the new gauge. Pushed as commit 4fa0140.

ch4r10t33r and others added 4 commits May 6, 2026 19:32
Address PR #818 review:

1) Add a Zig scrape test for `lean_gossip_mesh_peers` so the
   gauge↔`writeMetrics` contract is locked in CI, not just in
   docs/PR comments. Also adds a test for the refresher fan-out
   change below. Wires `metrics_tests` into `zig build test`.

2) Convert `registerScrapeRefresher` and
   `registerScrapeRefresherCtx` from single-slot to
   bounded-list (`MAX_SCRAPE_REFRESHERS = 16`) so a future
   third caller cannot silently overwrite a previous
   registration. Both lists run on every scrape, void-context
   first, ctx-bearing second, preserving the legacy ordering.
   Drops the "do not register individually" guardrail comment
   in `ethlibp2p.zig`; `refreshNetworkMetrics` is kept as a
   single per-module fan-out for readability.

3) Make `record_mesh_peers` / `mesh_peers_slot` and the
   stop-network cleanup in `rust/libp2p-glue` recover from a
   poisoned `MESH_PEERS_TOTAL` mutex via `match … Err(p) =>
   p.into_inner()`, matching the existing read-path handling
   in `get_mesh_peers_total`. Avoids escalating a metric-side
   panic into a swarm-task crash on the 1s mesh-peers tick or
   on shutdown. (Cannot use `MutexExt::lock_recover` from
   #819 yet — that PR is unmerged; the explicit `match` is the
   same recovery, inlined.)

zclawz commented May 7, 2026

Pushed 2b5c266 addressing all three review items:

1. Automated scrape test for lean_gossip_mesh_peers

Added pkgs/metrics/src/lib.zig tests + a metrics_tests step wired into zig build test:

  • lean_gossip_mesh_peers gauge appears in scrape output: calls metrics.lean_gossip_mesh_peers.set(4242), runs writeMetrics(&writer), and asserts the body contains both the metric name and the literal lean_gossip_mesh_peers 4242 value line. Locks the gauge↔serializer contract, with the same shape as pkgs/node/src/locking.zig's LockTimer → /metrics test from slice (b).
  • registerScrapeRefresher fans out to all registered callbacks — guards item (2) below in code (see next section).

The FFI side (get_mesh_peers_total → refreshMeshPeersMetric → gauge) still needs a real swarm to exercise end-to-end, but the gauge↔scrape path — the place where a future struct rename or serializer regression would silently break the contract — is now covered.
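The shape of that scrape assertion can be illustrated in Rust. This is a sketch under assumptions: `write_gauge` and `scrape_contains_sample` are hypothetical stand-ins for zeam's Zig `writeMetrics` and test helper, and the HELP text is made up. It only shows what "the body contains the literal value line" means in the Prometheus text format.

```rust
use std::fmt::Write;

// Renders one unlabeled gauge in the Prometheus text exposition format.
fn write_gauge(out: &mut String, name: &str, help: &str, value: u64) {
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} gauge").unwrap();
    writeln!(out, "{name} {value}").unwrap();
}

// True when the scrape body has an exact `<name> <value>` sample line.
fn scrape_contains_sample(body: &str, name: &str, value: u64) -> bool {
    let expected = format!("{name} {value}");
    body.lines().any(|line| line == expected)
}
```

Matching the whole line rather than a substring keeps a value such as 42 from being satisfied by a sample line that reads 4242.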

2. registerScrapeRefresher → append-to-list ✅

Replaced both g_scrape_refresher (void→void) and g_scrape_refresher_ctx (*anyopaque→void) single slots with bounded fixed-size lists:

const MAX_SCRAPE_REFRESHERS: usize = 16;
var g_scrape_refreshers: [MAX_SCRAPE_REFRESHERS]*const fn () void = undefined;
var g_scrape_refreshers_len: usize = 0;
// …and the same for ctx-bearing.

registerScrapeRefresher and registerScrapeRefresherCtx now append; writeMetrics iterates both lists on every scrape (void-list first, then ctx-list, preserving the legacy ordering between FFI-backed atomic refreshes and context-bearing observers that may read from them). Allocator-free, ZKVM-safe, panics on overflow (which would indicate a registration bug, not legitimate growth — current usage is 2 callsites).

The refreshNetworkMetrics fan-out in ethlibp2p.zig is kept (one entry per module is still cleaner for auditing, and the comment is updated to reflect that it's no longer load-bearing for correctness). Removed the "do NOT register them individually or you will overwrite" guardrail since the registry is now safe.
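For readers more at home in Rust, the append-to-list registry can be approximated as below. This is an analogue of the Zig change, not a translation of it: `RefresherRegistry` and its methods are hypothetical names, and only the void-callback list is modelled (the ctx-bearing list would be a second array run afterwards).

```rust
const MAX_SCRAPE_REFRESHERS: usize = 16;

// Fixed-capacity, allocator-free list of scrape callbacks.
struct RefresherRegistry {
    callbacks: [Option<fn()>; MAX_SCRAPE_REFRESHERS],
    len: usize,
}

impl RefresherRegistry {
    const fn new() -> Self {
        Self {
            callbacks: [None; MAX_SCRAPE_REFRESHERS],
            len: 0,
        }
    }

    // Appends instead of overwriting; panics when the list is full,
    // which signals a registration bug rather than legitimate growth.
    fn register(&mut self, callback: fn()) {
        assert!(self.len < MAX_SCRAPE_REFRESHERS, "too many scrape refreshers");
        self.callbacks[self.len] = Some(callback);
        self.len += 1;
    }

    // Runs every registered callback in registration order.
    fn run_all(&self) {
        for callback in self.callbacks[..self.len].iter().flatten() {
            callback();
        }
    }
}
```

With a single slot, the second `register` call would have replaced the first callback; here both run on every `run_all`.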

3. record_mesh_peers write-path mutex poisoning ✅

Applied the same match … Err(poisoned) => poisoned.into_inner() recovery to both:

  • mesh_peers_slot() (the write path used by record_mesh_peers from the swarm task on every gossipsub event + the 1s tick), and
  • MESH_PEERS_TOTAL.lock() inside stop_network (cleanup path).

Now symmetric with the existing read-path handling in get_mesh_peers_total. Note: I went with the explicit match rather than MutexExt::lock_recover because PR #819 is not yet merged on main; once #819 lands, a follow-up rebase can switch to the trait. Behaviour is identical — MutexExt::lock_recover is just sugar for the same match.
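The recovery pattern referred to above looks roughly like this in isolation. The free function `lock_or_recover` is a hypothetical name chosen for this sketch; it is not the `MutexExt::lock_recover` trait method from #819, whose exact signature is not shown in this thread.

```rust
use std::sync::{Mutex, MutexGuard};

// Locks the mutex and, if a previous holder panicked, takes the guard anyway
// instead of propagating the poison error as a second panic.
fn lock_or_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    match mutex.lock() {
        Ok(guard) => guard,
        Err(poisoned) => poisoned.into_inner(),
    }
}
```

For a plain counter this is safe because any value left behind by the panicking thread is still a valid `u64`; data with multi-step invariants would need more care before reuse.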

Verification

  • zig build — green.
  • zig build test — all 132 tests pass; the new metrics_tests step runs both new tests cleanly:
    1/2 lib.test.lean_gossip_mesh_peers gauge appears in scrape output...OK
    2/2 lib.test.registerScrapeRefresher fans out to all registered callbacks...OK
    All 2 tests passed.
    
  • cargo build -p libp2p-glue and cargo check -p libp2p-glue --tests — green.

Ready for another look 🙏

…orks

Address PR #818 follow-up review (findings 4, 5, 6, plus the doc note
from finding 7):

(4) Gate the mesh-peer recompute on the gossipsub event variants that
    actually change mesh membership: `Subscribed`, `Unsubscribed`,
    `GossipsubNotSupported`, `SlowPeer`. `Message` events do NOT
    affect mesh membership, and a busy node delivering hundreds of
    `Message`/sec was paying O(peers) per event for nothing — making
    the cumulative refresh cost O(peers^2). The `ConnectionClosed`
    recompute and the 1s `mesh_peers_tick` continue to cover
    transitions outside this branch, so a missed gauge update inside
    `Message` cannot drift longer than ~1s.

(5) Replace `MESH_PEERS_TOTAL: Mutex<HashMap<u32, Arc<AtomicU64>>>`
    with `static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS]`,
    mirroring the `SWARM_COMMAND_DROPPED_TOTAL` pattern (#808). The
    network slot table at the top of the file already caps live
    networks at 3, so a fixed-size array fits the actual constraint.
    No `Mutex`, no `HashMap`, no `Arc`, no poisoning concerns —
    drops findings (3) and (5) in one move.

(6) Drop the Zig-side `mesh_peers_network_id` global. With the
    fixed-size atomic array on the Rust side, `refreshMeshPeersMetric`
    now sums across every `network_id` slot rather than tracking a
    single "last-init'd" id in a Zig-side mutable global. Inactive
    slots default to 0 and are reset to 0 by `stop_network`, so the
    sum is the correct single-gauge answer for every current usage.
    Per-network labelled gauges (the `client=<name>_<N>` scheme the
    PR description punted on) become a localised follow-up rather
    than a re-architecture.

(7) Document scrape-vs-lifecycle semantics on `get_mesh_peers_total`:
    between `stop_network` and a subsequent `create_and_run_network`
    on the same id, a scrape returns 0. Operators must distinguish
    "network up with 0 mesh peers" from "network restarting" via
    orthogonal signals; a separate `lean_gossip_mesh_running` gauge
    is the obvious follow-up if the distinction becomes load-bearing.
    Also notes the FFI `u64` return is stable across 32/64-bit
    architectures (cosmetic, finding 9).

zclawz commented May 7, 2026

Pushed 69cc2fd addressing the follow-up review (findings 4, 5, 6, and the doc note from 7):

4. Gate per-event mesh recompute ✅

The BehaviourEvent::Gossipsub branch in Network::run_eventloop no longer recomputes all_mesh_peers().count() on every event. It now uses matches! to gate on the variants that actually change mesh membership:

  • Subscribed / Unsubscribed — peer joined/left a topic this node is subscribed to.
  • GossipsubNotSupported — connected peer turned out not to speak gossipsub.
  • SlowPeer — gossipsub may evict the peer under backpressure.

Message events (the dominant traffic on a busy node) no longer trigger the O(peers) walk. Cumulative cost per scrape window drops from O(peers × all gossipsub events), which approaches O(peers²) when message volume grows with peer count, to O(peers × mesh-changing-events). The 1s mesh_peers_tick and the ConnectionClosed recompute cover "events outside this branch", so a missed update inside Message cannot drift longer than ~1s.
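The gating can be sketched with a simplified stand-in enum. `GossipEvent` below is hypothetical and drops the payload fields that the real gossipsub event variants carry; `affects_mesh_membership` and `count_recomputes` are illustrative names, not functions from lib.rs.

```rust
// Simplified stand-in for the gossipsub event type (assumption for this sketch).
#[allow(dead_code)]
enum GossipEvent {
    Message { bytes: usize },
    Subscribed,
    Unsubscribed,
    GossipsubNotSupported,
    SlowPeer,
}

// True only for the variants that can change mesh membership.
fn affects_mesh_membership(event: &GossipEvent) -> bool {
    matches!(
        event,
        GossipEvent::Subscribed
            | GossipEvent::Unsubscribed
            | GossipEvent::GossipsubNotSupported
            | GossipEvent::SlowPeer
    )
}

// Counts how many O(peers) recomputes a batch of events would trigger.
fn count_recomputes(events: &[GossipEvent]) -> usize {
    events
        .iter()
        .filter(|event| affects_mesh_membership(event))
        .count()
}
```

A batch of two `Message` events, one `Subscribed` and one `SlowPeer` triggers two recomputes instead of four.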

5. Lock-free fixed-size atomic array ✅

const MAX_NETWORKS: usize = 3;
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

Mirrors the SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3] pattern from #808. The hardcoded slot table at the top of lib.rs (get_swarm_mut / set_swarm / get_zig_handler / …) already caps live networks at 3, so a fixed-size array fits the actual constraint.

Results:

  • No Mutex<HashMap>, no Arc, no lock().unwrap() (= no poisoning concern, no Err(p).into_inner() recovery needed) on this metric path.
  • record_mesh_peers is a single relaxed atomic store; get_mesh_peers_total is a single relaxed atomic load.
  • mesh_peers_slot() deleted; the start_network warm-up call (let _ = mesh_peers_slot(self.network_id)) deleted (slots are always present).
  • stop_network resets the slot to 0 with slot.store(0, Relaxed) instead of removing a HashMap entry.

This drops findings (3) and (5) in one move.
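Put together, the lock-free path might look like the sketch below. The array declaration matches the snippet above; the function bodies are assumptions written for illustration, in particular the choice to ignore out-of-range ids on write and return 0 on read, and `reset_mesh_peers` is a hypothetical name for the reset that stop_network performs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_NETWORKS: usize = 3;
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

// Write path: a single relaxed store. Out-of-range ids are ignored (assumption).
fn record_mesh_peers(network_id: u32, count: u64) {
    if let Some(slot) = MESH_PEERS_TOTAL.get(network_id as usize) {
        slot.store(count, Ordering::Relaxed);
    }
}

// Read path: a single relaxed load. Out-of-range ids read as 0 (assumption).
fn get_mesh_peers_total(network_id: u32) -> u64 {
    MESH_PEERS_TOTAL
        .get(network_id as usize)
        .map_or(0, |slot| slot.load(Ordering::Relaxed))
}

// Hypothetical name for the cleanup step: the slot stays, its value goes to 0.
fn reset_mesh_peers(network_id: u32) {
    record_mesh_peers(network_id, 0);
}
```

Because the slots always exist, there is nothing to create on start and nothing to remove on stop, which is why the warm-up call and the map cleanup could both be deleted.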

6. Drop Zig-side mesh_peers_network_id global ✅

refreshMeshPeersMetric in pkgs/network/src/ethlibp2p.zig now sums across every network_id slot:

fn refreshMeshPeersMetric() void {
    var total: u64 = 0;
    var network_id: u32 = 0;
    while (network_id < MESH_PEERS_MAX_NETWORKS) : (network_id += 1) {
        total += get_mesh_peers_total(network_id);
    }
    zeam_metrics.metrics.lean_gossip_mesh_peers.set(total);
}

The var mesh_peers_network_id: u32 = 0 global is gone, along with its init-time assignment. Inactive slots default to 0 and are reset to 0 by stop_network, so summing is the correct single-gauge answer for every current usage and for any future multi-network test/deployment.

A per-network labelled gauge (client=<name>_<N> from the leanMetrics spec) becomes a localised follow-up: replace the total accumulator with one gauge.set(.{ .network_id = N }, count) call per non-zero slot. The fixed-size atomic shape on the Rust side is what makes that change small.
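As a rough picture of what that follow-up would emit, the sketch below turns per-slot counts into one labelled sample line per non-zero slot. `render_per_network` is a hypothetical helper, and the `network_id` label name is taken from the gauge.set call mentioned above rather than from any merged code.

```rust
// Builds one Prometheus sample line per non-zero slot, labelled by slot index.
fn render_per_network(counts: &[u64]) -> Vec<String> {
    counts
        .iter()
        .enumerate()
        .filter(|(_, count)| **count != 0)
        .map(|(network_id, count)| {
            format!("lean_gossip_mesh_peers{{network_id=\"{network_id}\"}} {count}")
        })
        .collect()
}
```

For slot counts `[5, 0, 7]` this produces two lines, one for slot 0 and one for slot 2, and skips the inactive slot 1.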

7. Doc note on scrape-vs-lifecycle semantics ✅

Added a paragraph on get_mesh_peers_total explaining that between stop_network(network_id) and a subsequent create_and_run_network(network_id) a scrape returns 0, with a pointer to a separate lean_gossip_mesh_running boolean gauge as the obvious follow-up if the "up with 0 peers" vs "restarting" distinction becomes load-bearing for dashboards. Also noted the FFI u64 is stable across 32/64-bit architectures (cosmetic, finding 9).
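The ambiguity described in that doc note, and how the suggested running gauge would resolve it, can be sketched as follows. Everything here is hypothetical: the single-network statics, the simplified `start_network` / `stop_network` stand-ins, and the `scrape` helper exist only to show why a value of 0 alone cannot distinguish the two states.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Single-network simplification for illustration only.
static MESH_PEERS: AtomicU64 = AtomicU64::new(0);
static MESH_RUNNING: AtomicBool = AtomicBool::new(false);

// Stand-in for network start: marks the network as running.
fn start_network() {
    MESH_RUNNING.store(true, Ordering::Relaxed);
}

// Stand-in for the swarm task publishing the current mesh size.
fn record_mesh_peers(count: u64) {
    MESH_PEERS.store(count, Ordering::Relaxed);
}

// Stand-in for network stop: the peer count is reset to 0 and the flag cleared.
fn stop_network() {
    MESH_PEERS.store(0, Ordering::Relaxed);
    MESH_RUNNING.store(false, Ordering::Relaxed);
}

// What a scrape would see: (mesh peer count, running flag).
fn scrape() -> (u64, bool) {
    (
        MESH_PEERS.load(Ordering::Relaxed),
        MESH_RUNNING.load(Ordering::Relaxed),
    )
}
```

A running network with an empty mesh scrapes as `(0, true)`, while a stopped one scrapes as `(0, false)`; without the second value both look identical on a dashboard.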

8 / 9 — punted as suggested

  • (8) 1s tick recomputing on idle nodes — reviewer said "leave for future cleanup if the per-event recompute is also kept". The per-event recompute is now gated, so the tick doesn't need a dirty flag for correctness; left as-is.
  • (9) as u64 documentation — covered in the doc note above.

Verification

  • cargo check -p libp2p-glue --tests — green.
  • zig build — green.
  • zig build test — all tests pass; the metrics_tests step from the previous round still runs and locks the gauge↔scrape contract:
    1/2 lib.test.lean_gossip_mesh_peers gauge appears in scrape output...OK
    2/2 lib.test.registerScrapeRefresher fans out to all registered callbacks...OK
    All 2 tests passed.
    

Ready for review again 🙏


@ch4r10t33r ch4r10t33r left a comment

Looks good.

@ch4r10t33r ch4r10t33r merged commit 0d04ea2 into main May 7, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the feat/lean-gossip-mesh-peers-metric branch May 7, 2026 11:04