feat(metrics): add lean_gossip_mesh_peers gauge #818
Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35. Reports the number of remote peers in this node's gossipsub mesh across all subscribed topics, kept fresh from the rust-libp2p swarm task and exposed via FFI to the Zig metrics layer for "on scrape" reads, the same pattern as `zeam_libp2p_swarm_command_dropped_total` (#808).

Implementation:
- Rust: per-network `MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>` updated from inside the swarm task on every gossipsub event, every `ConnectionClosed`, and on a 1s liveness tick. New FFI getter `get_mesh_peers_total(network_id) -> u64`.
- Zig: `refreshMeshPeersMetric` reads the FFI getter on each Prometheus scrape and calls `lean_gossip_mesh_peers.set(count)`. Chained with the existing #808 swarm-drop refresher via a fan-out `refreshNetworkMetrics`, since `registerScrapeRefresher` only stores one callback.
- Metrics: registers `lean_gossip_mesh_peers` as an unlabeled `Gauge(u64)`.

Notes:
- Emitted as an unlabeled gauge. The leanMetrics PR shows a `client=<name>_<N>` label; in zeam's setup that label conventionally comes from the Prometheus scrape job's `instance` rewriting. Per-remote-peer mesh tracking (one 0/1 series per remote, matching how `lean_connected_peers` is implemented) would require subscribing to gossipsub Subscribed/Unsubscribed events and is left as a follow-up.
- The 1s tick is defensive: gossipsub events should already cover all transitions, but a tick guarantees liveness on idle topics.
@zclawz merge the main branch into it and resolve conflicts
Resolve conflict in pkgs/metrics/src/lib.zig: main added io parameter to labeled metrics init calls (zig 0.16.0 upgrade #784); our branch adds lean_gossip_mesh_peers gauge. Keep both changes.
Merged main into the branch and resolved the conflict in The conflict was between main adding an
Address PR #818 review:

1) Add a Zig scrape test for `lean_gossip_mesh_peers` so the contract between the gauge and `writeMetrics` is locked in CI, not just in docs/PR comments. Also adds a test for the refresher fan-out change below. Wires `metrics_tests` into `zig build test`.

2) Convert `registerScrapeRefresher` and `registerScrapeRefresherCtx` from single-slot to bounded-list (`MAX_SCRAPE_REFRESHERS = 16`) so a future third caller cannot silently overwrite a previous registration. Both lists run on every scrape, void-context first, ctx-bearing second, preserving the legacy ordering. Drops the "do not register individually" guardrail comment in `ethlibp2p.zig`; `refreshNetworkMetrics` is kept as a single per-module fan-out for readability.

3) Make `record_mesh_peers` / `mesh_peers_slot` and the stop-network cleanup in `rust/libp2p-glue` recover from a poisoned `MESH_PEERS_TOTAL` mutex via `match … Err(p) => p.into_inner()`, matching the existing read-path handling in `get_mesh_peers_total`. Avoids escalating a metric-side panic into a swarm-task crash on the 1s mesh-peers tick or on shutdown. (Cannot use `MutexExt::lock_recover` from #819 yet because that PR is unmerged; the explicit `match` is the same recovery, inlined.)
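For readers unfamiliar with mutex poisoning, the recovery described in (3) can be sketched roughly as below. This is a minimal illustration, not the code from `rust/libp2p-glue`: the `MeshTable` alias, the `lock_recovering` helper, and the explicit `table` parameter are assumptions made so the snippet stands alone (the real code uses a global `MESH_PEERS_TOTAL`), and only `std` APIs are used.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, MutexGuard};

// Hypothetical alias for the per-network table; the PR keeps it in a global.
type MeshTable = HashMap<u32, Arc<AtomicU64>>;

/// Lock the table. If a previous holder panicked while holding the lock,
/// take the guard out of the poison error instead of panicking again.
fn lock_recovering(table: &Mutex<MeshTable>) -> MutexGuard<'_, MeshTable> {
    match table.lock() {
        Ok(guard) => guard,
        Err(poisoned) => poisoned.into_inner(),
    }
}

/// Write path: store the latest mesh-peer count for one network.
/// (Signature is illustrative; the table parameter is an assumption.)
fn record_mesh_peers(table: &Mutex<MeshTable>, network_id: u32, count: u64) {
    let mut guard = lock_recovering(table);
    guard
        .entry(network_id)
        .or_insert_with(|| Arc::new(AtomicU64::new(0)))
        .store(count, Ordering::Relaxed);
}

/// Read path: return the stored count, or 0 for an unknown network id.
fn get_mesh_peers_total(table: &Mutex<MeshTable>, network_id: u32) -> u64 {
    lock_recovering(table)
        .get(&network_id)
        .map(|slot| slot.load(Ordering::Relaxed))
        .unwrap_or(0)
}
```

The point of the pattern is that a panic elsewhere while the lock was held leaves the mutex poisoned, and a plain `lock().unwrap()` on the 1s tick would then bring down the swarm task. Taking the inner guard keeps the metric path working, at the cost of possibly observing a stale count.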
Pushed
1. Automated scrape test for
…orks

Address PR #818 follow-up review (findings 4, 5, 6, plus the doc note from finding 7):

(4) Gate the mesh-peer recompute on the gossipsub event variants that actually change mesh membership: `Subscribed`, `Unsubscribed`, `GossipsubNotSupported`, `SlowPeer`. `Message` events do NOT affect mesh membership, and a busy node delivering hundreds of `Message`/sec was paying O(peers) per event for nothing, making the cumulative refresh cost O(peers^2). The `ConnectionClosed` recompute and the 1s `mesh_peers_tick` continue to cover transitions outside this branch, so a missed gauge update inside `Message` cannot drift longer than ~1s.

(5) Replace `MESH_PEERS_TOTAL: Mutex<HashMap<u32, Arc<AtomicU64>>>` with `static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS]`, mirroring the `SWARM_COMMAND_DROPPED_TOTAL` pattern (#808). The network slot table at the top of the file already caps live networks at 3, so a fixed-size array fits the actual constraint. No `Mutex`, no `HashMap`, no `Arc`, no poisoning concerns: drops findings (3) and (5) in one move.

(6) Drop the Zig-side `mesh_peers_network_id` global. With the fixed-size atomic array on the Rust side, `refreshMeshPeersMetric` now sums across every `network_id` slot rather than tracking a single "last-init'd" id in a Zig-side mutable global. Inactive slots default to 0 and are reset to 0 by `stop_network`, so the sum is the correct single-gauge answer for every current usage. Per-network labelled gauges (the `client=<name>_<N>` scheme the PR description punted on) become a localised follow-up rather than a re-architecture.

(7) Document scrape-vs-lifecycle semantics on `get_mesh_peers_total`: between `stop_network` and a subsequent `create_and_run_network` on the same id, a scrape returns 0. Operators must distinguish "network up with 0 mesh peers" from "network restarting" via orthogonal signals; a separate `lean_gossip_mesh_running` gauge is the obvious follow-up if the distinction becomes load-bearing.

Also notes the FFI `u64` return is stable across 32/64-bit architectures (cosmetic, finding 9).
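As a rough illustration of findings (4) through (6), the gating check, the fixed-size atomic array, and the cross-slot sum might fit together as sketched below. The `GossipEvent` enum is a hypothetical stand-in for the real gossipsub event type, `changes_mesh_membership` and `sum_mesh_peers` are illustrative names, and the sketch assumes `network_id` is a zero-based slot index, which the commit message does not state.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_NETWORKS: usize = 3;

// One lock-free slot per network; inactive slots stay at 0.
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

/// Hypothetical stand-in for the gossipsub event variants named above.
#[allow(dead_code)]
enum GossipEvent {
    Message,
    Subscribed,
    Unsubscribed,
    GossipsubNotSupported,
    SlowPeer,
}

/// Finding (4): only these variants trigger a mesh-peer recompute.
fn changes_mesh_membership(event: &GossipEvent) -> bool {
    matches!(
        event,
        GossipEvent::Subscribed
            | GossipEvent::Unsubscribed
            | GossipEvent::GossipsubNotSupported
            | GossipEvent::SlowPeer
    )
}

/// Finding (5): store the latest count; out-of-range ids are ignored.
/// Assumes network_id is a zero-based slot index.
fn record_mesh_peers(network_id: u32, count: u64) {
    if let Some(slot) = MESH_PEERS_TOTAL.get(network_id as usize) {
        slot.store(count, Ordering::Relaxed);
    }
}

/// Per-network read; returns 0 for unknown or inactive slots.
fn get_mesh_peers_total(network_id: u32) -> u64 {
    MESH_PEERS_TOTAL
        .get(network_id as usize)
        .map_or(0, |slot| slot.load(Ordering::Relaxed))
}

/// Finding (6): the single gauge value is the sum over every slot.
fn sum_mesh_peers() -> u64 {
    MESH_PEERS_TOTAL
        .iter()
        .map(|slot| slot.load(Ordering::Relaxed))
        .sum()
}
```

In this sketch a stopped network is modelled by `record_mesh_peers(id, 0)`, which matches the note in (7): a scrape between stop and restart reads 0 for that slot and cannot tell it apart from a running network with an empty mesh.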
Pushed
4. Gate per-event mesh recompute ✅
The
5. Lock-free fixed-size atomic array ✅
```rust
const MAX_NETWORKS: usize = 3;
static MESH_PEERS_TOTAL: [AtomicU64; MAX_NETWORKS] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];
```
Mirrors the
Results:
This drops findings (3) and (5) in one move.
6. Drop Zig-side
Mirrors the leanMetrics spec change in leanEthereum/leanMetrics#35: adds a new `lean_gossip_mesh_peers` Prometheus gauge that reports the number of remote peers in this node's gossipsub mesh across all subscribed topics.

Implementation

Same shape as the `zeam_libp2p_swarm_command_dropped_total` plumbing from #808:

- Rust (`rust/libp2p-glue/src/lib.rs`): per-network `MESH_PEERS_TOTAL: HashMap<u32, AtomicU64>`, updated from inside the swarm task whenever a gossipsub event fires, a `ConnectionClosed` event arrives, or a 1-second tick elapses. Exposed via a new FFI getter `get_mesh_peers_total(network_id)`. The slot is pre-created when the network starts and removed on `stop_network` so repeated start/stop cycles in tests don't leak entries.
- Zig (`pkgs/network/src/ethlibp2p.zig`): `refreshMeshPeersMetric` reads the FFI getter and sets the gauge. Because `registerScrapeRefresher` only stores a single callback, both this and the existing #808 swarm-drop refresher are chained via a fan-out `refreshNetworkMetrics`.
- Metrics (`pkgs/metrics/src/lib.zig`): adds the gauge to the registry as an unlabeled `Gauge(u64)`.

Notes / open questions for review

- `client=<name>_<N>` label format. I emitted this as an unlabeled `Gauge(u64)` because in zeam's deployment that label conventionally comes from the Prometheus scrape job's `instance` rewriting. zeam's existing `lean_connected_peers` does use a per-remote-peer 0/1 label scheme, so if reviewers prefer the new metric to match that style exactly, a follow-up PR can wire up gossipsub `Subscribed`/`Unsubscribed` event handling (which currently isn't consumed) and resolve peer-id → node-name on those events.
- 1s tick. This is defensive: gossipsub events and `ConnectionClosed` should already cover every transition, but a tick guarantees liveness on idle topics. `all_mesh_peers().count()` is `O(peers)` and the atomic store is lock-free, so the swarm task isn't blocked.
- `registerScrapeRefresher` is single-slot. Discovered while wiring this up: registering individually would silently overwrite the #808 refresher. Fan-out is the smallest fix; a follow-up could turn it into a list. Left a comment at the registration site spelling this out.

Build / test

- `zig build` (ReleaseFast): clean.
- `zig build test`: clean (no failures introduced; existing negative-path validation tests untouched).

Closes the zeam side of leanEthereum/leanMetrics#35.