fix(network): relieve rust-bridge swarm command channel backpressure (#808)#809
Conversation
…808) The actor-model rust-libp2p bridge introduced in #789 bounds the per-network swarm command channel to 1024 slots and drains 32 commands per event-loop tick. On devnet-4 this saturates under steady-state validator load: ~5 commands/min/node are silently dropped via try_send, which causes our own attestations and outbound req-resp parent fetches to never reach the wire — observable as multiple competing fork-choice branches and persistent head-lag. To make matters worse the node-level publish path logged 'published attestation/block to network' unconditionally even when the underlying try_send returned Err(Full), so operators couldn't tell from the logs that anything had been dropped. This change addresses both problems with the smallest reasonable diff: Rust (rust/libp2p-glue/src/lib.rs): - SWARM_COMMAND_CHANNEL_CAPACITY: 1024 -> 8192 (~8x headroom for the same workload, still bounded). - MAX_SWARM_COMMANDS_PER_TICK: 32 -> 256 to actually drain the new headroom without monopolizing the executor. - publish_msg_to_rust_bridge now returns bool (true on enqueue, false when the command was dropped). Returns false for null topic and for uninitialized / full / closed channels. - Added two regression tests covering the new return contract. Zig: - Threaded the bool return up through GossipSub.publishFn, NetworkBackend.publish and Node.publishBlock / publishAttestation / publishAggregation. - The 'published … to network' info log only fires when the backend accepted the publish; otherwise we emit a 'failed to publish … (backend dropped publish)' warn line so the situation is visible in Loki and to operators. - Mock backend returns true unconditionally (synchronous, no command channel) and the existing mock test asserts it. This does not introduce a new metric or change the FFI ABI for any other call (only publish_msg_to_rust_bridge changes its return type from void to bool); follow-up work tracked in #808 for the metric and priority queue between own-publishes vs. forwarded gossip. Refs: #808 Refs: #789
|
Quick review — flagging edge cases before merge:
Stop-gap is fine; suggest keeping #808 open until at least 1, 2, 3, 4 land. |
…l-channel test, send_rpc_request log) Addresses review points 1-4 from #808 on PR #809. Lint: - cargo fmt the two test bodies the lint job complained about; lint now runs locally with both `cargo fmt --check` and `cargo clippy -D warnings` clean. Review #1 — send_rpc_request not threaded: - Bumped per-reason drop counter on every `send_rpc_request` failure path (uninitialized network, channel full, channel closed) so the same metric covers both publish and req-resp drops. - Added a Zig-side warn at the `request_id == 0` callsite in `EthLibp2p.sendRPCRequest` so operators correlating req-resp timeouts see the dispatch-failure event in the Zig log stream alongside the preceding rust-bridge error (the only previous signal was a generic `error.RequestDispatchFailed` returned upward). Review #2 — drop branch untested on Zig side: - Added `Mock.setForcePublishDrop(bool)` knob on the mock backend. When set, every `publish` returns `false` without invoking subscribers, letting the new `failed to publish … (backend dropped publish)` warn arms in `Node.publishBlock` / `publishAttestation` / `publishAggregation` be exercised in tests without spinning up a real Rust bridge. - Extended the existing "Mock gossip ..." test with the drop case. Review #3 — no saturation test on Rust side: - Added `test_swarm_command_full_channel_drops_and_counts`: installs a bounded sender of capacity 4 into `COMMAND_SENDERS` without a drainer, fills it to capacity, then asserts the next 3 sends each return false AND each bump the Full counter. Review #4 — no metric: - Added `SwarmCommandDropReason` enum (`Full=0`, `Closed=1`, `Uninitialized=2`) and a global `SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3]` on the Rust side. Every drop path bumps the matching atomic. - Added FFI getter `get_swarm_command_dropped_total(reason_tag: u32) -> u64` exposing the cumulative counts. Unknown tags return 0 so a Zig build compiled against an older Rust glue cannot panic. - Added `lean_libp2p_swarm_command_dropped_total` (`CounterVec` with `reason` label) to `pkgs/metrics`. - Added `registerScrapeRefresher` hook in `pkgs/metrics`; `writeMetrics` now calls it before serializing so externally-owned counters can sync. - `pkgs/network/src/ethlibp2p.zig` registers a refresher in `EthLibp2p.init` that polls all 3 reason tags via FFI and `incrBy`s the delta into the labeled counter. - New Rust tests cover the counter increment on uninitialized + full paths and assert unknown reason tags return 0. Test plan: - `zig fmt --check .` ✓ - `cargo fmt --check` ✓ - `cargo clippy --all-targets -D warnings` ✓ - `zig build` ✓ - `zig build test` ✓ - `cargo test -p libp2p-glue` ✓ — 10 passed (5 existing + 5 new) Refs: #808 Refs: #809
|
Pushed Lint —
#1 — #2 — drop branch untested on Zig side. Added #3 — no saturation test on Rust side. New test #4 — no metric. Added
New Rust tests assert the counter increments on uninitialized + full paths and that unknown reason tags return 0. 10 Rust tests pass total (5 existing + 5 new). Intentionally still deferred to follow-up PRs (review points 5-7):
Keeping #808 open per your suggestion until those land. |
as this is a zeam specific metric, we should prefix it as zeam_ instead of lean_ |
…libp2p_swarm_command_dropped_total (#809 review) The rust-libp2p command channel is a zeam-implementation detail (no other lean client has it), so the metric name should follow the existing zeam_* convention used for client-specific instrumentation (zeam_chain_onblock_duration_seconds, zeam_compact_attestations_*, zeam_node_mutex_*, zeam_fork_choice_tick_interval_*) rather than the lean_* convention reserved for protocol-level metrics shared across clients. Refs: #808 Refs: #809
|
Good catch — pushed Fits the existing convention: Updated 3 files (rename + 5 doc-comment touch-ups). Build + fmt clean. |
Summary
Fixes #808.
The actor-model rust-libp2p bridge introduced in #789 bounds the per-network swarm command channel to 1024 slots and drains 32 commands per event-loop tick. On devnet-4 this saturates under steady-state validator load — Loki shows ~5
command channel fullerrors / minute / node — and the dropped messages line up with multiple competing fork-choice branches in the local tree because outbound attestations and req-resp parent fetches never make it onto the wire.To make matters worse the node-level publish path logged
published attestation/block to networkunconditionally, even when the underlyingtry_sendreturnedErr(Full), so operators couldn't tell from the logs that anything had been dropped.This PR addresses both problems with the smallest reasonable diff (6 files, +136 / -28).
Changes
Rust —
rust/libp2p-glue/src/lib.rsSWARM_COMMAND_CHANNEL_CAPACITY: 1024 → 8192 (~8x headroom for the same workload, still bounded).MAX_SWARM_COMMANDS_PER_TICK: 32 → 256 to actually drain the new headroom without monopolizing the executor.publish_msg_to_rust_bridgenow returnsbool(trueon enqueue,falsewhen the command was dropped). Returnsfalsefor null topic and for uninitialized / full / closed channels.test_publish_msg_to_rust_bridge_returns_false_when_uninitializedtest_publish_msg_to_rust_bridge_returns_false_on_null_topicZig —
pkgs/network/,pkgs/node/GossipSub.publishFn,NetworkBackend.publish, andNode.publishBlock/publishAttestation/publishAggregation.[node] published … to networkinfo log only fires when the backend accepted the publish; otherwise we emit a new[node] failed to publish … (backend dropped publish)warn line so the situation is visible in Loki and to operators.trueunconditionally (synchronous, no command channel) and the existingmock.zigpublish test now asserts it.What's intentionally NOT in this PR
Issue #808 lists three additional follow-ups that are out of scope here to keep the diff reviewable:
zeam_libp2p_swarm_command_dropped_total— needs new Rust→Zig FFI plumbing into the metrics registry. Worth its own PR.send().awaitfallback with a short timeout before dropping our own attestations — possibly desirable, but changes the FFI contract from non-blocking to potentially-blocking which warrants a separate review.The capacity bump alone should be sufficient to clear the symptom on devnet-4 (the channel was at most ~1.5x oversubscribed at peak).
Test plan
ABI / FFI impact
publish_msg_to_rust_bridgechanges its return type fromvoidtobool. No other FFI symbols are affected. Both sides are updated atomically in this PR — there is no mixed-build window.