Make peer_connection_listener persistent #2078

@sanity

Description

Summary

While debugging the flaky PUT/connectivity tests, we instrumented the transport layer and found that peer_connection_listener (the task that owns each UDP socket) returns after handing off a single inbound packet. The main event loop immediately re-spawns a new listener for that connection, but if anything in the select! chain fails to push the returned future back into the stream, that connection stops draining inbound packets entirely. That’s exactly what we’re seeing: the intermediate node logs that it successfully queued and sent PutForward, yet the requester never logs any inbound message for that transaction; the listener simply isn’t polled again.

This “spawn-per-packet” architecture couples inbound progress to the main event loop’s scheduling and is brittle under load. A more robust design is the standard 1:1 model: one long-lived async task per peer connection that continuously drives both the outbound mpsc channel and the inbound datagram stream, forwarding inbound packets to the event loop via a channel. Only connection lifecycle events should be sent back through the priority select. This decouples transport I/O from the higher-level routing logic and eliminates the possibility that a bookkeeping hiccup starves a connection.
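As a rough sketch of that 1:1 model (not the actual implementation): PeerConnection and PeerId below are stand-ins for whatever the transport module really exposes, and the error handling and keep-alive logic are deliberately minimal.

```rust
use tokio::sync::mpsc;

// Stand-ins for the real transport types; the actual connection handle in the
// transport module has a richer API and error type.
type PeerId = std::net::SocketAddr;
struct PeerConnection;
impl PeerConnection {
    async fn send(&mut self, _data: Vec<u8>) -> std::io::Result<()> { Ok(()) }
    async fn recv(&mut self) -> std::io::Result<Vec<u8>> { Ok(Vec::new()) }
}

/// Events the listener reports back to the event loop.
enum ConnEvent {
    InboundMessage { peer: PeerId, payload: Vec<u8> },
    ClosedChannel { peer: PeerId },
}

/// One long-lived task per peer connection: it owns the connection and loops
/// forever, draining outbound requests and inbound datagrams concurrently.
/// It only exits when the connection errors out or the event loop drops the
/// outbound sender, and in both cases it reports ClosedChannel so the event
/// loop can clean up its state.
async fn peer_connection_listener(
    mut conn: PeerConnection,
    mut outbound_rx: mpsc::Receiver<Vec<u8>>, // messages queued by the event loop
    conn_events: mpsc::Sender<ConnEvent>,     // inbound events back to the event loop
    peer: PeerId,
) {
    loop {
        tokio::select! {
            // Outbound: messages the event loop wants written to this socket.
            msg = outbound_rx.recv() => {
                match msg {
                    Some(msg) => {
                        if let Err(e) = conn.send(msg).await {
                            tracing::warn!(%peer, "send failed: {}", e);
                            break;
                        }
                    }
                    None => break, // event loop dropped the sender; shut down
                }
            }
            // Inbound: datagrams read off the socket for this peer.
            inbound = conn.recv() => {
                match inbound {
                    Ok(payload) => {
                        let event = ConnEvent::InboundMessage { peer, payload };
                        if conn_events.send(event).await.is_err() {
                            break; // event loop is gone
                        }
                    }
                    Err(e) => {
                        tracing::warn!(%peer, "recv failed: {}", e);
                        break;
                    }
                }
            }
        }
    }
    // Lifecycle handling stays inside the task: tell the event loop to clean up.
    let _ = conn_events.send(ConnEvent::ClosedChannel { peer }).await;
}
```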

Evidence

  • In /tmp/connectivity_trace_run10.log we see the intermediate node (v6MWKgqHeCHyzLfJ) log Sending outbound message… PutForward and the peer_connection_listener log that it wrote the packet to socket 127.0.0.1:38147. However, the requester (v6MWKgqK3rzUh1F6) never logs a matching Received message… PutForward, the PUT result never reaches the operation manager, and the test times out.
  • There are no [CONN_LIFECYCLE] … closed logs for that socket, so the connection itself is still open; it simply isn’t polled after the first packet.
  • The direct-ack “shortcut” we added earlier doesn’t fire either, because the final hop never processes the PutForward message, so no SuccessfulPut is generated at all.

Proposal

  1. Refactor peer_connection_listener into a persistent task (one per connection) that loops forever, driving both the outbound mpsc channel and inbound UDP reads, roughly as sketched above. When it receives data it sends a ConnEvent::InboundMessage into a new conn_events channel that the event loop polls alongside the existing sources.
  2. Update priority_select::SelectResult and process_select_result to consume these ConnEvents instead of whole PeerConnectionInbound structs, eliminating the need to re-spawn listener futures on every packet (see the sketch after this list).
  3. Keep connection lifecycle handling (drop, errors, etc.) inside that task; it can emit ConnEvent::ClosedChannel when necessary so the event loop cleans up state.
  4. Once that’s in place, remove the direct-ack workaround (tracked in issue Clean up PUT direct-ack shortcut #2077) so we rely solely on the proper upstream response path.
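For point 2, a minimal sketch of the consuming side, reusing the stand-in ConnEvent type from above. It is shown as a bare receiver loop for brevity; in the real code conn_events would simply be one more source fed into the existing priority_select / process_select_result machinery, and the placeholder tracing calls stand in for handing the message to the routing and operation-manager paths.

```rust
/// Sketch of the event-loop side: inbound transport events arrive over a
/// plain mpsc channel, so the loop never has to re-spawn a listener future
/// per packet.
async fn drain_conn_events(mut conn_events: mpsc::Receiver<ConnEvent>) {
    while let Some(event) = conn_events.recv().await {
        match event {
            ConnEvent::InboundMessage { peer, payload } => {
                // hand off to the existing routing / operation-manager path
                tracing::debug!(%peer, len = payload.len(), "inbound message");
            }
            ConnEvent::ClosedChannel { peer } => {
                // drop per-connection state (outbound sender, ring entry, ...)
                tracing::info!(%peer, "connection closed");
            }
        }
    }
}
```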

This aligns the transport code with a more conventional, maintainable architecture and prevents individual connections from being starved because the select! loop lost track of their listener future.

Metadata

Assignees: No one assigned

Labels: A-networking (Area: Networking, ring protocol, peer discovery), P-high (High priority), T-bug (Type: Something is broken)

Project status: Triage

Milestone: No milestone
