fix(network): relieve rust-bridge swarm command channel backpressure (#808) #809

Merged
ch4r10t33r merged 3 commits into main from fix/swarm-command-channel-backpressure-808
Apr 30, 2026

Conversation

@zclawz
Contributor

@zclawz zclawz commented Apr 30, 2026

Summary

Fixes #808.

The actor-model rust-libp2p bridge introduced in #789 bounds the per-network swarm command channel to 1024 slots and drains 32 commands per event-loop tick. On devnet-4 this saturates under steady-state validator load — Loki shows ~5 command channel full errors / minute / node — and the dropped messages line up with multiple competing fork-choice branches in the local tree because outbound attestations and req-resp parent fetches never make it onto the wire.

To make matters worse, the node-level publish path logged 'published attestation/block to network' unconditionally, even when the underlying try_send returned Err(Full), so operators couldn't tell from the logs that anything had been dropped.

This PR addresses both problems with the smallest reasonable diff (6 files, +136 / -28).

Changes

Rust — rust/libp2p-glue/src/lib.rs

  • SWARM_COMMAND_CHANNEL_CAPACITY: 1024 → 8192 (~8x headroom for the same workload, still bounded).
  • MAX_SWARM_COMMANDS_PER_TICK: 32 → 256 to actually drain the new headroom without monopolizing the executor.
  • publish_msg_to_rust_bridge now returns bool (true on enqueue, false when the command was dropped). Returns false for null topic and for uninitialized / full / closed channels.
  • Added two regression tests covering the new return contract:
    • test_publish_msg_to_rust_bridge_returns_false_when_uninitialized
    • test_publish_msg_to_rust_bridge_returns_false_on_null_topic
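
The return contract described above can be sketched with a bounded standard-library channel. This is a minimal illustration, not the code in rust/libp2p-glue/src/lib.rs: `SwarmCommand`, `try_publish`, and the use of `std::sync::mpsc` are assumptions made so the example is self-contained, and the real bridge may use a different channel type.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical stand-in for the bridge's swarm command type.
#[derive(Debug, PartialEq)]
pub enum SwarmCommand {
    Publish { topic: String, data: Vec<u8> },
}

/// Non-blocking enqueue. Returns true when the command was queued and
/// false when it was dropped (missing topic, no sender, full, or closed).
pub fn try_publish(
    tx: Option<&SyncSender<SwarmCommand>>,
    topic: Option<&str>,
    data: &[u8],
) -> bool {
    // A missing topic models the null-topic case at the FFI boundary.
    let topic = match topic {
        Some(t) => t,
        None => return false,
    };
    // No sender installed models an uninitialized network.
    let tx = match tx {
        Some(tx) => tx,
        None => return false,
    };
    let cmd = SwarmCommand::Publish {
        topic: topic.to_owned(),
        data: data.to_vec(),
    };
    match tx.try_send(cmd) {
        Ok(()) => true,
        // Channel at capacity: the command is dropped, never blocked on.
        Err(TrySendError::Full(_)) => false,
        // Receiver gone: the event loop has shut down.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

Keeping the two error arms separate in the sketch makes it straightforward to report them differently later, even though both currently collapse to false.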

Zig — pkgs/network/, pkgs/node/

  • Threaded the bool return up through GossipSub.publishFn, NetworkBackend.publish, and Node.publishBlock / publishAttestation / publishAggregation.
  • The [node] published … to network info log only fires when the backend accepted the publish; otherwise we emit a new
    [node] failed to publish … (backend dropped publish) warn line so the situation is visible in Loki and to operators.
  • Mock backend returns true unconditionally (synchronous, no command channel) and the existing mock.zig publish test now asserts it.

What's intentionally NOT in this PR

Issue #808 lists three additional follow-ups that are out of scope here to keep the diff reviewable:

  1. Metric zeam_libp2p_swarm_command_dropped_total — needs new Rust→Zig FFI plumbing into the metrics registry. Worth its own PR.
  2. Priority split between own-publishes and forwarded gossip — needs a second channel + select biasing. Design discussion would help before coding.
  3. send().await fallback with a short timeout before dropping our own attestations — possibly desirable, but changes the FFI contract from non-blocking to potentially-blocking which warrants a separate review.

The capacity bump alone should be sufficient to clear the symptom on devnet-4 (the channel was at most ~1.5x oversubscribed at peak).

Test plan

zig build              # ✓ EXIT:0
zig build test         # ✓ EXIT:0
cd rust && cargo test -p libp2p-glue
# 7 passed; 0 failed (5 existing + 2 new)

ABI / FFI impact

publish_msg_to_rust_bridge changes its return type from void to bool. No other FFI symbols are affected. Both sides are updated atomically in this PR — there is no mixed-build window.

…808)

The actor-model rust-libp2p bridge introduced in #789 bounds the
per-network swarm command channel to 1024 slots and drains 32 commands
per event-loop tick. On devnet-4 this saturates under steady-state
validator load: ~5 commands/min/node are silently dropped via try_send,
which causes our own attestations and outbound req-resp parent fetches
to never reach the wire — observable as multiple competing fork-choice
branches and persistent head-lag.

To make matters worse the node-level publish path logged
'published attestation/block to network' unconditionally even when the
underlying try_send returned Err(Full), so operators couldn't tell from
the logs that anything had been dropped.

This change addresses both problems with the smallest reasonable diff:

Rust (rust/libp2p-glue/src/lib.rs):
- SWARM_COMMAND_CHANNEL_CAPACITY: 1024 -> 8192 (~8x headroom for the
  same workload, still bounded).
- MAX_SWARM_COMMANDS_PER_TICK: 32 -> 256 to actually drain the new
  headroom without monopolizing the executor.
- publish_msg_to_rust_bridge now returns bool (true on enqueue, false
  when the command was dropped). Returns false for null topic and for
  uninitialized / full / closed channels.
- Added two regression tests covering the new return contract.
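
As a rough illustration of the per-tick drain budget, the sketch below pulls at most a fixed number of commands per call and then returns so the caller can get back to other event-loop work. Only the constant name and its value come from this change; `drain_tick` and the use of `std::sync::mpsc` are assumptions for the sake of a self-contained example.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Per-tick budget; 256 is the value this change sets.
pub const MAX_SWARM_COMMANDS_PER_TICK: usize = 256;

/// Drains at most `budget` queued commands, passing each to `handle`.
/// Returns how many commands were processed during this tick.
pub fn drain_tick<T>(rx: &Receiver<T>, budget: usize, mut handle: impl FnMut(T)) -> usize {
    let mut processed = 0;
    while processed < budget {
        match rx.try_recv() {
            Ok(cmd) => {
                handle(cmd);
                processed += 1;
            }
            // Nothing left to do this tick, or all senders are gone.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    processed
}
```

A larger budget empties the queue faster but delays everything else the loop does in that tick, which is the trade-off behind raising the value without removing the bound.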

Zig:
- Threaded the bool return up through GossipSub.publishFn, NetworkBackend.publish
  and Node.publishBlock / publishAttestation / publishAggregation.
- The 'published … to network' info log only fires when the backend
  accepted the publish; otherwise we emit a 'failed to publish …
  (backend dropped publish)' warn line so the situation is visible in
  Loki and to operators.
- Mock backend returns true unconditionally (synchronous, no command
  channel) and the existing mock test asserts it.

This does not introduce a new metric or change the FFI ABI for any
other call (only publish_msg_to_rust_bridge changes its return type
from void to bool); follow-up work tracked in #808 for the metric and
priority queue between own-publishes vs. forwarded gossip.

Refs: #808
Refs: #789
@ch4r10t33r
Contributor

Quick review — flagging edge cases before merge:

  1. send_rpc_request not threaded. #808 names req-resp drops as half the fork-choice divergence; this PR only fixes publish_msg_to_rust_bridge. send_rpc_request still drops silently with request_id == 0 and no caller audit.
  2. Drop branch untested on Zig side. mock.zig::publish returns true unconditionally → new else arms in publishBlock / publishAttestation / publishAggregation have no test coverage. Suggest a force_drop knob on the mock.
  3. No saturation test on Rust side. Both new tests cover null-topic / uninitialized; neither exercises the actual full-channel path. Please add: pause drainer, push 8193 commands, assert last returns false.
  4. No metric. Without zeam_libp2p_swarm_command_dropped_total we can't verify the capacity bump is actually sufficient on devnet-4 post-merge — Loki grep is the only signal. Strong preference to land the counter here or immediately after.
  5. 8192 sized by anecdote. "1.5× peak oversubscription → 8× headroom" is steady-state math; burst peak unknown. Tied to (4).
  6. Full vs Closed collapsed to same false. Future retry policy will need them split; doing it now avoids changing the warn-line contract later.
  7. Local-chain vs network divergence persists on drop. chain.onBlockFollowup / onGossipAggregatedAttestation run regardless of publish bool — message is in local fork-choice but never gossiped, no retry path. Visibility improved, divergence not resolved.

Stop-gap is fine; suggest keeping #808 open until at least 1, 2, 3, 4 land.

…l-channel test, send_rpc_request log)

Addresses review points 1-4 from #808 on PR #809.

Lint:
- cargo fmt the two test bodies the lint job complained about; lint now
  runs locally with both `cargo fmt --check` and `cargo clippy -D warnings` clean.

Review #1 — send_rpc_request not threaded:
- Bumped per-reason drop counter on every `send_rpc_request` failure path
  (uninitialized network, channel full, channel closed) so the same
  metric covers both publish and req-resp drops.
- Added a Zig-side warn at the `request_id == 0` callsite in
  `EthLibp2p.sendRPCRequest` so operators correlating req-resp timeouts
  see the dispatch-failure event in the Zig log stream alongside the
  preceding rust-bridge error (the only previous signal was a generic
  `error.RequestDispatchFailed` returned upward).

Review #2 — drop branch untested on Zig side:
- Added `Mock.setForcePublishDrop(bool)` knob on the mock backend.
  When set, every `publish` returns `false` without invoking subscribers,
  letting the new `failed to publish … (backend dropped publish)` warn
  arms in `Node.publishBlock` / `publishAttestation` / `publishAggregation`
  be exercised in tests without spinning up a real Rust bridge.
- Extended the existing "Mock gossip ..." test with the drop case.
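
The force-drop knob is implemented in Zig; the following is only a Rust rendering of the same idea, written to show how a test can reach the drop branch without a real bridge. `MockBackend`, `publish_with_log`, and the exact log wording are hypothetical and approximate the behavior described above.

```rust
/// Hypothetical mock backend with a force-drop knob.
#[derive(Default)]
pub struct MockBackend {
    force_publish_drop: bool,
    delivered: Vec<Vec<u8>>,
}

impl MockBackend {
    /// When set, every publish is reported as dropped.
    pub fn set_force_publish_drop(&mut self, drop: bool) {
        self.force_publish_drop = drop;
    }

    /// Returns false without delivering when the knob is set, true otherwise.
    pub fn publish(&mut self, data: &[u8]) -> bool {
        if self.force_publish_drop {
            return false;
        }
        self.delivered.push(data.to_vec());
        true
    }

    /// Number of messages that actually reached subscribers.
    pub fn delivered_count(&self) -> usize {
        self.delivered.len()
    }
}

/// Node-level publish: picks the log line from the backend's answer
/// instead of logging success unconditionally.
pub fn publish_with_log(backend: &mut MockBackend, kind: &str, data: &[u8]) -> (bool, String) {
    if backend.publish(data) {
        (true, format!("[node] published {} to network", kind))
    } else {
        (
            false,
            format!("[node] failed to publish {} (backend dropped publish)", kind),
        )
    }
}
```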

Review #3 — no saturation test on Rust side:
- Added `test_swarm_command_full_channel_drops_and_counts`: installs a
  bounded sender of capacity 4 into `COMMAND_SENDERS` without a drainer,
  fills it to capacity, then asserts the next 3 sends each return false
  AND each bump the Full counter.
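
The shape of that saturation test can be sketched as follows. This is not the test in the repository: `saturate` is a hypothetical helper, the counter is local rather than the global one, and `std::sync::mpsc` stands in for whatever channel the bridge uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, TrySendError};

/// Fills a channel of `capacity` slots that nobody drains, then attempts
/// `extra` more sends. Returns (accepted sends, sends dropped as Full).
pub fn saturate(capacity: usize, extra: usize) -> (usize, u64) {
    let full_drops = AtomicU64::new(0);
    // `_rx` stays alive but is never read, so overflow surfaces as Full
    // rather than Disconnected.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    for i in 0..(capacity + extra) {
        match tx.try_send(i as u32) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => {
                full_drops.fetch_add(1, Ordering::Relaxed);
            }
            Err(TrySendError::Disconnected(_)) => {
                unreachable!("receiver is held for the whole function")
            }
        }
    }
    (accepted, full_drops.load(Ordering::Relaxed))
}
```

A small capacity exercises the same full-channel branch as the production value while keeping the test cheap.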

Review #4 — no metric:
- Added `SwarmCommandDropReason` enum (`Full=0`, `Closed=1`, `Uninitialized=2`)
  and a global `SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3]` on the Rust
  side. Every drop path bumps the matching atomic.
- Added FFI getter `get_swarm_command_dropped_total(reason_tag: u32) -> u64`
  exposing the cumulative counts. Unknown tags return 0 so a Zig build
  compiled against an older Rust glue cannot panic.
- Added `lean_libp2p_swarm_command_dropped_total` (`CounterVec` with
  `reason` label) to `pkgs/metrics`.
- Added `registerScrapeRefresher` hook in `pkgs/metrics`; `writeMetrics`
  now calls it before serializing so externally-owned counters can sync.
- `pkgs/network/src/ethlibp2p.zig` registers a refresher in
  `EthLibp2p.init` that polls all 3 reason tags via FFI and `incrBy`s
  the delta into the labeled counter.
- New Rust tests cover the counter increment on uninitialized + full
  paths and assert unknown reason tags return 0.
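
A minimal sketch of the per-reason counters and the tag-based getter, assuming the layout listed above. The enum, static, and getter names follow this commit message; `record_drop` is a hypothetical helper, and the export attribute a real FFI symbol would need is left out so the snippet compiles on its own.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reason tags as listed in the commit message.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SwarmCommandDropReason {
    Full = 0,
    Closed = 1,
    Uninitialized = 2,
}

/// One cumulative counter per reason, indexed by the enum discriminant.
static SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

/// Hypothetical helper: bump the counter that matches `reason`.
pub fn record_drop(reason: SwarmCommandDropReason) {
    SWARM_COMMAND_DROPPED_TOTAL[reason as usize].fetch_add(1, Ordering::Relaxed);
}

/// Getter with a C ABI. Unknown tags yield 0 instead of indexing out of
/// bounds, so a caller with a different tag set cannot trigger a panic.
pub extern "C" fn get_swarm_command_dropped_total(reason_tag: u32) -> u64 {
    match SWARM_COMMAND_DROPPED_TOTAL.get(reason_tag as usize) {
        Some(counter) => counter.load(Ordering::Relaxed),
        None => 0,
    }
}
```

Relaxed ordering is enough here because each counter is an independent tally and no other memory is synchronized through it.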

Test plan:
- `zig fmt --check .` ✓
- `cargo fmt --check` ✓
- `cargo clippy --all-targets -D warnings` ✓
- `zig build` ✓
- `zig build test` ✓
- `cargo test -p libp2p-glue` ✓ — 10 passed (5 existing + 5 new)

Refs: #808
Refs: #809
@zclawz
Contributor Author

zclawz commented Apr 30, 2026

Pushed cd7a6a7 addressing review points 1-4 plus the lint failure. Diff for this round: 4 files, +266 / -10.

Lint: cargo fmt complaint on the multi-line FFI call. Now passes locally:

  • zig fmt --check .
  • cargo fmt --check
  • cargo clippy --all-targets -D warnings

#1: send_rpc_request not threaded. Bumped the new per-reason drop counter on every send_rpc_request failure path (uninitialized / full / closed) so the same metric covers both publish and req-resp drops. Also added a Zig-side warn in EthLibp2p.sendRPCRequest at the request_id == 0 callsite, so operators correlating req-resp timeouts now see the dispatch-failure event in the Zig log stream alongside the preceding rust-bridge error (the only previous signal was a generic error.RequestDispatchFailed that the upper layers swallowed).

#2 — drop branch untested on Zig side. Added Mock.setForcePublishDrop(bool) knob; when set, every publish returns false without invoking subscribers. The existing mock gossip test now exercises both the normal and dropped paths back-to-back.

#3 — no saturation test on Rust side. New test test_swarm_command_full_channel_drops_and_counts: installs a bounded sender of capacity 4 into COMMAND_SENDERS without a drainer, fills it to capacity, then asserts the next 3 sends each return false and each bump the Full counter. Runs in microseconds vs. allocating 8192 commands.

#4 — no metric. Added lean_libp2p_swarm_command_dropped_total{reason="full|closed|uninitialized"}:

  • Rust: SwarmCommandDropReason enum + static SWARM_COMMAND_DROPPED_TOTAL: [AtomicU64; 3] bumped on every drop path. New FFI getter get_swarm_command_dropped_total(reason_tag: u32) -> u64 exposes the cumulative count. Unknown tags return 0 so a Zig build compiled against an older Rust glue can't panic.
  • Zig: pkgs/metrics declares the labeled CounterVec and a new registerScrapeRefresher hook (writeMetrics calls it before serializing). pkgs/network/src/ethlibp2p.zig registers a refresher in EthLibp2p.init that polls all 3 reason tags and incrBys the delta. Counter is monotonic + per-reason, so no double-counting if multiple EthLibp2p instances co-exist.
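
The delta-based refresh on the scrape side is written in Zig; the Rust sketch below only illustrates the bookkeeping, assuming a poll function that returns the cumulative count for a reason tag. `DeltaSync` and its methods are hypothetical names.

```rust
/// Hypothetical scrape-side mirror of cumulative counters owned elsewhere.
#[derive(Default)]
pub struct DeltaSync {
    /// Cumulative value observed at the previous refresh, per reason tag.
    last_seen: [u64; 3],
    /// Value the metrics registry would expose, per reason tag.
    exported: [u64; 3],
}

impl DeltaSync {
    /// Polls each reason tag and adds only the increase since the last poll.
    pub fn refresh(&mut self, poll: impl Fn(u32) -> u64) {
        for tag in 0..3u32 {
            let i = tag as usize;
            let now = poll(tag);
            // saturating_sub keeps the exported value from going backwards
            // if the source ever reports a smaller number.
            let delta = now.saturating_sub(self.last_seen[i]);
            self.exported[i] += delta;
            self.last_seen[i] = now;
        }
    }

    /// Current exported value for `tag`, or 0 for an unknown tag.
    pub fn exported(&self, tag: u32) -> u64 {
        self.exported.get(tag as usize).copied().unwrap_or(0)
    }
}
```

Refreshing twice with no new drops leaves the exported value unchanged, which is the property that makes it safe to run the refresher on every scrape. The sketch keeps a single baseline; where that baseline lives in the real code is not shown here.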

New Rust tests assert the counter increments on uninitialized + full paths and that unknown reason tags return 0. 10 Rust tests pass total (5 existing + 5 new).

Intentionally still deferred to follow-up PRs (review points 5-7):

  • (5) Capacity tuning numbers based on actual devnet-4 metric data once this lands and we can read lean_libp2p_swarm_command_dropped_total directly.
  • (6) Splitting Full vs Closed in the warn-line contract — counter already splits them; widening the warn-line API is easier once retry policy is decided.
  • (7) Local-chain vs network divergence on drop (no retry path) — needs design discussion (re-publish queue vs. validator-level retry vs. attestation aggregation timing). I'll spin up a separate issue to scope it.

Keeping #808 open per your suggestion until those land.

@ch4r10t33r
Contributor

  • lean_libp2p_swarm_command_dropped_total

As this is a zeam-specific metric, we should prefix it with zeam_ instead of lean_.

…libp2p_swarm_command_dropped_total (#809 review)

The rust-libp2p command channel is a zeam-implementation detail (no other
lean client has it), so the metric name should follow the existing
zeam_* convention used for client-specific instrumentation
(zeam_chain_onblock_duration_seconds, zeam_compact_attestations_*,
zeam_node_mutex_*, zeam_fork_choice_tick_interval_*) rather than the
lean_* convention reserved for protocol-level metrics shared across
clients.

Refs: #808
Refs: #809
@zclawz
Contributor Author

zclawz commented Apr 30, 2026

Good catch — pushed 65bbf8c renaming the metric to zeam_libp2p_swarm_command_dropped_total.

Fits the existing convention: lean_* for protocol-level metrics shared across clients, zeam_* for client-implementation specifics (alongside zeam_chain_onblock_duration_seconds, zeam_compact_attestations_*, zeam_node_mutex_*, zeam_fork_choice_tick_interval_*). The rust-libp2p command channel is purely a zeam concern, so zeam_ is the right prefix.

Updated 3 files (rename + 5 doc-comment touch-ups). Build + fmt clean.

Contributor

@ch4r10t33r ch4r10t33r left a comment

Looks good.

@ch4r10t33r ch4r10t33r merged commit dd06711 into main Apr 30, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/swarm-command-channel-backpressure-808 branch April 30, 2026 18:31
Development

Successfully merging this pull request may close these issues.

rust-bridge: swarm command channel saturating under devnet-4 load — silent attestation/reqresp drops causing fork-choice divergence

2 participants