Summary
While debugging the flaky PUT/connectivity tests we instrumented the transport layer and found that `peer_connection_listener` (the task that owns each UDP socket) returns after handing off a single inbound packet. The main event loop immediately re-spawns a new listener for that connection, but if anything in the `select!` chain fails to push the returned future back into the stream, that connection stops draining inbound packets entirely. That's exactly what we're seeing: the intermediate node logs that it successfully queued and sent `PutForward`, yet the requester never logs any inbound message for that transaction—the listener simply isn't polled again.

This "spawn-per-packet" architecture couples inbound progress to the main event loop's scheduling and is brittle under load. A more robust design is the standard 1:1 model: one long-lived async task per peer connection that continuously drives both the outbound `mpsc` and inbound datagrams, forwarding inbound packets via a channel to the event loop. Only connection lifecycle events should be sent back through the priority select. This decouples transport I/O from the higher-level routing logic and eliminates the possibility that a bookkeeping hiccup starves a connection.
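To make the 1:1 model concrete, here is a minimal sketch of the intended shape. It uses std threads and `std::sync::mpsc` instead of the real async/tokio code, and the `ConnEvent` variants and channel names are assumptions borrowed from the proposal below, not the current freenet-core types; the point is only that the per-connection task loops for the lifetime of the connection instead of returning after one packet.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in (name assumed, not a real freenet-core type):
// the per-connection task reports inbound packets and lifecycle events
// to the event loop over a single `conn_events` channel.
enum ConnEvent {
    InboundMessage(Vec<u8>),
    ClosedChannel,
}

// Drives one simulated peer connection end to end and returns how many
// inbound packets the event loop saw before the connection closed.
fn run_connection_demo() -> usize {
    // conn_events: transport task -> event loop.
    let (event_tx, event_rx) = mpsc::channel::<ConnEvent>();
    // outbound: event loop -> transport task.
    let (out_tx, out_rx) = mpsc::channel::<Vec<u8>>();

    // One long-lived task per peer connection: it loops until the
    // connection ends instead of returning after a single packet.
    let listener = thread::spawn(move || {
        // Stand-in for the UDP socket: echo outbound messages back as
        // "inbound" packets so the loop has traffic to drive.
        for pkt in out_rx {
            if event_tx.send(ConnEvent::InboundMessage(pkt)).is_err() {
                break; // event loop went away
            }
        }
        // Lifecycle events go through the same channel.
        let _ = event_tx.send(ConnEvent::ClosedChannel);
    });

    // The event loop drains multiple packets without re-spawning anything.
    out_tx.send(b"PutForward".to_vec()).unwrap();
    out_tx.send(b"SuccessfulPut".to_vec()).unwrap();
    drop(out_tx); // closing the outbound side ends the connection

    let mut inbound = 0;
    while let Ok(ev) = event_rx.recv() {
        match ev {
            ConnEvent::InboundMessage(_) => inbound += 1,
            ConnEvent::ClosedChannel => break,
        }
    }
    listener.join().unwrap();
    inbound
}

fn main() {
    let inbound = run_connection_demo();
    assert_eq!(inbound, 2);
    println!("event loop saw {inbound} inbound packets, then ClosedChannel");
}
```

In the real code the thread would be a spawned tokio task `select!`-ing over the outbound `mpsc` receiver and the UDP socket, but the ownership shape is the same: the task owns the socket for the connection's whole lifetime, and the event loop only ever consumes events.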
Evidence
- In `/tmp/connectivity_trace_run10.log` we see the intermediate node (`v6MWKgqHeCHyzLfJ`) log `Sending outbound message… PutForward` and the `peer_connection_listener` log that it wrote the packet to socket `127.0.0.1:38147`. However, the requester (`v6MWKgqK3rzUh1F6`) never logs a matching `Received message… PutForward`, the PUT result never reaches the operation manager, and the test times out.
- There are no `[CONN_LIFECYCLE] … closed` logs for that socket, so the connection itself is still open; it simply isn't polled after the first packet.
- The direct-ack "shortcut" we added earlier doesn't fire either, because the final hop never processes the `PutForward` message, so no `SuccessfulPut` is generated at all.
Proposal
- Refactor `peer_connection_listener` into a persistent task (one per connection) that loops forever, driving both the outbound `mpsc` and inbound UDP reads. When it receives data it sends a `ConnEvent::InboundMessage` into a new `conn_events` channel that the event loop polls alongside the existing sources.
- Update `priority_select::SelectResult` and `process_select_result` to consume these `ConnEvent`s instead of whole `PeerConnectionInbound` structs, eliminating the need to re-spawn futures on every packet.
- Keep connection lifecycle handling (drop, errors, etc.) inside that task; it can emit `ConnEvent::ClosedChannel` when necessary so the event loop cleans up state.
- Once that's in place, remove the direct-ack workaround (tracked in Clean up PUT direct-ack shortcut #2077) so we rely solely on the proper upstream response path.
This aligns the transport code with a more conventional, maintainable architecture and prevents individual connections from being starved because the `select!` loop lost track of their listener future.
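The event-loop side of the proposal can be sketched too: instead of owning `PeerConnectionInbound` futures, `process_select_result` would just match on the event. The enum and the `LoopAction` result type below are hypothetical names for illustration, not existing freenet-core types; the lifecycle cleanup path is the only thing that still flows through the priority select.

```rust
// Hypothetical shapes (names assumed) for the event-loop side: the loop
// consumes ConnEvents and never re-spawns a listener future per packet.
#[derive(Debug, PartialEq)]
enum ConnEvent {
    InboundMessage(Vec<u8>),
    ClosedChannel,
}

#[derive(Debug, PartialEq)]
enum LoopAction {
    // Hand the packet to the routing/operation-manager layer.
    Route(Vec<u8>),
    // Drop per-peer state; there is no future to push back into the stream.
    CleanupPeer,
}

// Stand-in for the ConnEvent arm of process_select_result: pure
// bookkeeping, with no chance of "forgetting" to re-arm the listener.
fn process_conn_event(ev: ConnEvent) -> LoopAction {
    match ev {
        ConnEvent::InboundMessage(bytes) => LoopAction::Route(bytes),
        ConnEvent::ClosedChannel => LoopAction::CleanupPeer,
    }
}

fn main() {
    assert_eq!(
        process_conn_event(ConnEvent::InboundMessage(b"PutForward".to_vec())),
        LoopAction::Route(b"PutForward".to_vec())
    );
    assert_eq!(
        process_conn_event(ConnEvent::ClosedChannel),
        LoopAction::CleanupPeer
    );
    println!("ok");
}
```

Because every inbound packet takes the `Route` path unconditionally, a bug in unrelated bookkeeping can no longer starve a single connection.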