Symptoms
A production initiator hit a state where:
- Session was terminated with reconnect=false (e.g. via handle_sequence_number_too_low)
- The process did not exit — we did not respawn the task
- The session emitted no logs for hours
- At the daily schedule rollover (midnight), only two log lines appeared:
  - trying to write without an established connection
  - disconnecting an already disconnected session
- No reconnect was attempted at the new session period
Prior to that stuck state, the same sequence-number mismatch was producing a healthy crash-loop: process exited, ECS respawned, repeat. The stuck state began when the peer's behaviour changed (stopped closing the TCP socket promptly after receiving our Logout).
Suspected root cause
When we terminate on a fatal session error, the flow is:
- handle_sequence_number_too_low (inbound.rs:131-155) queues a Logout to the writer, calls writer.disconnect(), and transitions state to Disconnected(reconnect=false).
- The writer actor processes Disconnect and exits, dropping its WriteHalf.
- State is now Disconnected; the outer establish_connection loop is blocked in run_until_disconnect().await (initiator.rs:125), which waits on the reader's dc_sender signal.
The problem: tokio::io::split shares the underlying stream between ReadHalf and WriteHalf behind a shared lock (BiLock-style). Dropping only the WriteHalf neither closes the TCP socket nor calls shutdown(Shutdown::Write), because the ReadHalf still keeps the stream alive. No FIN is sent to the peer, and there is no shutdown/Shutdown call anywhere in crates/hotfix/src/transport/.
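The drop-versus-shutdown difference can be reproduced without tokio. The sketch below is an analogy, not the hotfix code: it uses blocking std sockets, and TcpStream::try_clone stands in for tokio::io::split, on the assumption that the relevant property is the same (two handles keep one socket alive). peer_sees_eof is a hypothetical helper name.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if the peer observes EOF shortly after
/// our write handle goes away.
fn peer_sees_eof(shutdown_write: bool) -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let client = TcpStream::connect(listener.local_addr().unwrap()).unwrap();
    let (mut peer, _) = listener.accept().unwrap();
    // Bound the peer's read so the "no FIN" case returns instead of hanging.
    peer.set_read_timeout(Some(Duration::from_millis(200))).unwrap();

    // Two handles to one socket, analogous to ReadHalf / WriteHalf.
    let read_half = client.try_clone().unwrap();
    let write_half = client;

    if shutdown_write {
        // Explicit half-close: this is what actually sends the FIN.
        write_half.shutdown(Shutdown::Write).unwrap();
    }
    // Dropping one handle alone does not close the socket, because the
    // other handle still refers to it.
    drop(write_half);

    let mut buf = [0u8; 16];
    // Ok(0) means EOF; a timeout surfaces as an Err and counts as "no EOF".
    let eof = matches!(peer.read(&mut buf), Ok(0));
    drop(read_half);
    eof
}
```

Calling peer_sees_eof(false) models the current behaviour (the peer times out without seeing EOF), while peer_sees_eof(true) models an explicit half-close. In the session, that missing FIN is what leaves a peer that waits for us to close first, and therefore our own reader, waiting indefinitely.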
Consequently:
- The reader (socket_reader.rs) stays blocked in read_buf, because it only exits on peer-initiated EOF (Ok(0)) or a read error.
- The reader's dc_sender.send(()) signal never fires.
- run_until_disconnect never returns.
- The outer loop never reaches should_reconnect(), so the reconnect=false signal never takes effect.
In the Disconnected state there is also no watchdog that can break us out: heartbeat_deadline and peer_deadline both return None (session.rs:744-760), so the session select! loop effectively only services the 1s schedule-check timer, which has no path to force-close a stuck reader.
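For illustration, a watchdog of the kind that is missing here only needs a second handle to the socket and a grace period. The sketch below again uses blocking std sockets rather than the tokio actors, and every name in it (run_reader, watchdog_unblocks_reader, grace) is hypothetical. It simulates a peer that never closes its side and shows that a forced shutdown still gets the reader out of its blocking read.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Blocking stand-in for the reader: exits on EOF or on a read error,
// then fires the disconnect signal.
fn run_reader(mut stream: TcpStream, dc_sender: mpsc::Sender<()>) {
    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) | Err(_) => break,
            Ok(_) => {}
        }
    }
    let _ = dc_sender.send(());
}

/// Returns how long it took for the disconnect signal to fire, or None if
/// it did not fire within five seconds.
fn watchdog_unblocks_reader(grace: Duration) -> Option<Duration> {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let stream = TcpStream::connect(listener.local_addr().unwrap()).unwrap();
    // Misbehaving peer: accepts, then holds the socket open and never closes.
    let (_peer, _) = listener.accept().unwrap();

    let watchdog_handle = stream.try_clone().unwrap();
    let (dc_tx, dc_rx) = mpsc::channel();
    let reader = thread::spawn(move || run_reader(stream, dc_tx));

    let started = Instant::now();
    // Watchdog: after the grace period, force both directions shut so the
    // blocked read returns.
    let watchdog = thread::spawn(move || {
        thread::sleep(grace);
        let _ = watchdog_handle.shutdown(Shutdown::Both);
    });

    let fired = dc_rx
        .recv_timeout(Duration::from_secs(5))
        .ok()
        .map(|_| started.elapsed());
    watchdog.join().unwrap();
    reader.join().unwrap();
    fired
}
```

In the async session the equivalent would presumably be a deadline that stays armed while the state is Disconnected and aborts or drops the reader task when it expires, rather than a dedicated thread.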
At the midnight rollover, handle_schedule_check sees SessionPeriodComparison::DifferentPeriod and calls logout_and_terminate on the zombie Disconnected state, producing the two observed logs from state.rs:177 (send_message fallthrough arm) and state.rs:191 (disconnect_writer fallthrough arm). It does not transition state or wake the outer loop.
Proposed fix
Make writer termination actually close the TCP stream so the reader observes EOF.
Options, roughly in order of preference:
1. Call AsyncWriteExt::shutdown().await on the WriteHalf before the writer actor exits, in socket_writer.rs::run_writer / on WriterMessage::Disconnect. This sends a FIN; the peer observes EOF on its read side, and most peers then close their end, which delivers EOF to our reader.
2. Stop splitting the stream. Hold the TcpStream (or TLS stream) in a single owner and call .shutdown(Shutdown::Both) on terminate. More invasive, but it removes the bi-lock/split pitfall entirely.
3. Defense-in-depth regardless of the above: add a session-level liveness timer that fires while in Disconnected (or any non-Active state with a live reader) and forcibly drops the reader after N seconds. This covers peers that accept our FIN but still refuse to close their write side.
Option 1 is the smallest change and directly addresses the root cause. Option 3 is cheap to add alongside as a safety net.
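The shape of Option 1 can be sketched with blocking std sockets and threads standing in for the tokio actors. This is an assumption-laden model, not the actual socket_writer.rs code: WriterMessage::SendRaw, run_writer's signature, run_reader, and disconnect_wakes_reader are hypothetical, and only the Disconnect variant and the dc_sender signal are taken from the description above. The peer here is a well-behaved one that closes its socket once it sees our FIN.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical message type; only Disconnect is named in this issue.
enum WriterMessage {
    SendRaw(Vec<u8>),
    Disconnect,
}

// Blocking stand-in for the writer actor.
fn run_writer(mut stream: TcpStream, rx: mpsc::Receiver<WriterMessage>) {
    for msg in rx {
        match msg {
            WriterMessage::SendRaw(bytes) => {
                if stream.write_all(&bytes).is_err() {
                    break;
                }
            }
            WriterMessage::Disconnect => break,
        }
    }
    // The fix: half-close explicitly before exiting, instead of relying on
    // drop, so the peer sees a FIN even though the reader holds a handle.
    let _ = stream.shutdown(Shutdown::Write);
}

// Blocking stand-in for the reader: signals on EOF or on a read error.
fn run_reader(mut stream: TcpStream, dc_sender: mpsc::Sender<()>) {
    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) | Err(_) => break,
            Ok(_) => {}
        }
    }
    let _ = dc_sender.send(());
}

/// Returns true if the reader's disconnect signal fired after Disconnect.
fn disconnect_wakes_reader() -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    // Well-behaved peer: reads until EOF, then closes by dropping its socket.
    thread::spawn(move || {
        let (mut s, _) = listener.accept().unwrap();
        let mut sink = Vec::new();
        let _ = s.read_to_end(&mut sink);
    });

    let stream = TcpStream::connect(addr).unwrap();
    let reader_stream = stream.try_clone().unwrap();
    let (tx, rx) = mpsc::channel();
    let (dc_tx, dc_rx) = mpsc::channel();
    thread::spawn(move || run_writer(stream, rx));
    thread::spawn(move || run_reader(reader_stream, dc_tx));

    // Same order as the terminate path: queue the Logout, then disconnect.
    tx.send(WriterMessage::SendRaw(b"Logout".to_vec())).unwrap();
    tx.send(WriterMessage::Disconnect).unwrap();
    dc_rx.recv_timeout(Duration::from_secs(2)).is_ok()
}
```

With the explicit shutdown in place, disconnect_wakes_reader() returns true against this peer. A peer that ignores the FIN would still leave the reader blocked, which is the case Option 3 is meant to cover.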
Additional consideration
The await_in_schedule issue (session_ref.rs:85-95 ignores ScheduleResponse::Shutdown) is related but orthogonal — tracked separately.