Session can hang indefinitely in Disconnected(reconnect=false) when peer doesn't close TCP socket #344

@davidsteiner

Symptoms

A production initiator hit a state where:

  • Session was terminated with reconnect=false (e.g. via handle_sequence_number_too_low)
  • The process did not exit, so the task was never respawned
  • The session emitted no logs for hours
  • At the daily schedule rollover (midnight), only two log lines appeared:
    trying to write without an established connection
    disconnecting an already disconnected session
    
  • No reconnect was attempted at the new session period

Prior to that stuck state, the same sequence-number mismatch was producing a healthy crash-loop: process exited, ECS respawned, repeat. The stuck state began when the peer's behaviour changed (stopped closing the TCP socket promptly after receiving our Logout).

Suspected root cause

When we terminate on a fatal session error, the flow is:

  1. handle_sequence_number_too_low (inbound.rs:131-155) queues a Logout to the writer, calls writer.disconnect(), and transitions state to Disconnected(reconnect=false).
  2. The writer actor processes Disconnect and exits, dropping its WriteHalf.
  3. State is now Disconnected; the outer establish_connection loop is blocked in run_until_disconnect().await (initiator.rs:125), which waits on the reader's dc_sender signal.

The problem: tokio::io::split gives ReadHalf and WriteHalf shared ownership of the same underlying stream through a lock-guarded handle. The stream is dropped, and the socket closed, only once both halves are gone. Dropping only the WriteHalf therefore does not close the TCP socket and does not call shutdown(Shutdown::Write). No FIN is sent to the peer, and there is no shutdown/Shutdown call anywhere in crates/hotfix/src/transport/.

Consequently:

  • The reader (socket_reader.rs) stays blocked in read_buf, because it only exits on peer-initiated EOF (Ok(0)) or a read error.
  • The reader's dc_sender.send(()) signal never fires.
  • run_until_disconnect never returns.
  • The outer loop never reaches should_reconnect(), so the reconnect=false signal never takes effect.
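
The chain above can be reproduced outside the crate. The following is a minimal std-only sketch, not hotfix code: it assumes that two handles to one socket (made here with TcpStream::try_clone) behave like the ReadHalf/WriteHalf pair for the purpose of this bug, in that the socket stays open while either handle is alive. The function name peer_sees_eof is made up for illustration.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns (peer saw EOF after we dropped one handle,
///          peer saw EOF after we called shutdown(Write)).
fn peer_sees_eof() -> std::io::Result<(bool, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Peer: blocks in read and reports when it observes EOF (Ok(0)).
    let (eof_tx, eof_rx) = mpsc::channel();
    let peer = thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut buf = [0u8; 64];
        loop {
            match conn.read(&mut buf) {
                Ok(0) => {
                    let _ = eof_tx.send(());
                    break;
                }
                Ok(_) => continue,
                Err(_) => break,
            }
        }
    });

    // Two handles to the same socket, standing in for the split halves.
    let read_side = TcpStream::connect(addr)?;
    let write_side = read_side.try_clone()?;

    // Dropping one handle does not close the socket: no FIN, no EOF at the peer.
    drop(write_side);
    let after_drop = eof_rx.recv_timeout(Duration::from_millis(300)).is_ok();

    // An explicit write-side shutdown sends FIN, so the peer's read returns Ok(0).
    read_side.shutdown(Shutdown::Write)?;
    let after_shutdown = eof_rx.recv_timeout(Duration::from_secs(2)).is_ok();

    peer.join().unwrap();
    Ok((after_drop, after_shutdown))
}
```

On a loopback connection peer_sees_eof() should return (false, true): the drop alone is invisible to the peer, while the explicit shutdown is not.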

In the Disconnected state there is also no watchdog that can break us out: heartbeat_deadline and peer_deadline both return None (session.rs:744-760), so the session select! loop effectively only services the 1s schedule-check timer, which has no path to force-close a stuck reader.

At the midnight rollover, handle_schedule_check sees SessionPeriodComparison::DifferentPeriod and calls logout_and_terminate on the zombie Disconnected state, producing the two observed logs from state.rs:177 (send_message fallthrough arm) and state.rs:191 (disconnect_writer fallthrough arm). It does not transition state or wake the outer loop.

Proposed fix

Make writer termination actually close the TCP stream so the reader observes EOF.

Options, roughly in order of preference:

  1. Call AsyncWriteExt::shutdown().await on the WriteHalf before the writer actor exits, in socket_writer.rs::run_writer / on WriterMessage::Disconnect. This sends a FIN; the peer's read side then observes EOF, and most peers respond by closing their end, which in turn delivers EOF to our reader.

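  2. Stop splitting the stream. Hold the TcpStream (or TLS stream) in a single owner and call .shutdown(Shutdown::Both) on terminate. Tokio's TcpStream appears to expose only the write-side AsyncWriteExt::shutdown, so a full Shutdown::Both would likely have to go through the raw socket (for example via socket2). More invasive, but it removes the split pitfall entirely.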

  3. Defense-in-depth regardless of the above: add a session-level liveness timer that fires while in Disconnected (or any non-Active state with a live reader) and forcibly drops the reader after N seconds. This covers peers that accept our FIN but still refuse to close their write side.
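
As a rough illustration of option 1, here is a std-only sketch with blocking threads standing in for the tokio actors. WriterMsg, this run_writer, and reader_sees_eof_after_disconnect are hypothetical stand-ins, not the crate's actual types; in the real writer the equivalent call would be AsyncWriteExt::shutdown().await on the WriteHalf.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the writer actor's mailbox messages.
enum WriterMsg {
    Send(Vec<u8>),
    Disconnect,
}

// Writer actor: whatever ends the loop, half-close the socket before exiting.
fn run_writer(mut stream: TcpStream, inbox: Receiver<WriterMsg>) {
    for msg in inbox {
        match msg {
            WriterMsg::Send(bytes) => {
                if stream.write_all(&bytes).is_err() {
                    break;
                }
            }
            WriterMsg::Disconnect => break,
        }
    }
    let _ = stream.flush();
    // The key line: send FIN instead of just dropping our handle.
    let _ = stream.shutdown(Shutdown::Write);
}

fn reader_sees_eof_after_disconnect() -> std::io::Result<bool> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Well-behaved peer: drains until our FIN arrives, then closes its end.
    let peer = thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut received = Vec::new();
        let _ = conn.read_to_end(&mut received);
        received
    });

    let mut read_half = TcpStream::connect(addr)?;
    let write_half = read_half.try_clone()?;

    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || run_writer(write_half, rx));

    // Reader: signals disconnect once read returns Ok(0) or an error.
    let (dc_tx, dc_rx) = mpsc::channel();
    let reader = thread::spawn(move || {
        let mut buf = [0u8; 64];
        while matches!(read_half.read(&mut buf), Ok(n) if n > 0) {}
        let _ = dc_tx.send(());
    });

    tx.send(WriterMsg::Send(b"logout".to_vec())).unwrap();
    tx.send(WriterMsg::Disconnect).unwrap();

    let signalled = dc_rx.recv_timeout(Duration::from_secs(2)).is_ok();
    writer.join().unwrap();
    reader.join().unwrap();
    let received = peer.join().unwrap();
    Ok(signalled && received == b"logout")
}
```

The sketch only covers a peer that closes its side after seeing our FIN. A peer that keeps its write side open would still leave the reader blocked, which is the case option 3 is meant to cover.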

Option 1 is the smallest change and directly addresses the root cause. Option 3 is cheap to add alongside as a safety net.
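
The safety net in option 3 could look roughly like the sketch below, again std-only and with made-up names (DisconnectCause, wait_for_disconnect, demo_stubborn_peer). It swaps in a different mechanism from the one proposed above: rather than dropping the reader task, it calls shutdown(Shutdown::Both) on the socket, which is assumed to wake a blocking read on common platforms. In the async code the equivalent would more likely be a timeout around the disconnect signal followed by aborting or dropping the reader task.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum DisconnectCause {
    PeerClosed,
    WatchdogForced,
}

// Wait for the reader's disconnect signal, but only for `grace`.
// After that, force-close the socket so the blocked reader wakes up.
fn wait_for_disconnect(
    dc_rx: &mpsc::Receiver<()>,
    socket: &TcpStream,
    grace: Duration,
) -> DisconnectCause {
    if dc_rx.recv_timeout(grace).is_ok() {
        return DisconnectCause::PeerClosed;
    }
    let _ = socket.shutdown(Shutdown::Both);
    // Give the reader a moment to notice and signal.
    let _ = dc_rx.recv_timeout(grace);
    DisconnectCause::WatchdogForced
}

fn demo_stubborn_peer() -> std::io::Result<DisconnectCause> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Stubborn peer: accepts, then holds the socket open without closing it.
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let peer = thread::spawn(move || {
        let (_conn, _) = listener.accept().unwrap();
        let _ = release_rx.recv(); // returns once release_tx is dropped
    });

    let control = TcpStream::connect(addr)?;
    let mut read_half = control.try_clone()?;

    // Reader: blocked in read until EOF or error, then signals disconnect.
    let (dc_tx, dc_rx) = mpsc::channel();
    let reader = thread::spawn(move || {
        let mut buf = [0u8; 64];
        while matches!(read_half.read(&mut buf), Ok(n) if n > 0) {}
        let _ = dc_tx.send(());
    });

    let cause = wait_for_disconnect(&dc_rx, &control, Duration::from_millis(300));
    reader.join().unwrap();
    drop(release_tx);
    peer.join().unwrap();
    Ok(cause)
}
```

The grace period here is arbitrary; a real value would presumably be tied to the logout timeout the session already uses.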

Additional consideration

The await_in_schedule issue (session_ref.rs:85-95 ignores ScheduleResponse::Shutdown) is related but orthogonal, and is tracked separately.
