Session can hang indefinitely in Disconnected(reconnect=false) when peer doesn't close TCP socket #344

## Symptoms

A production initiator hit a state where:

- Session was terminated with `reconnect=false` (e.g. via `handle_sequence_number_too_low`)
- The process did **not** exit, so the task was never respawned
- The session emitted no logs for hours
- At the daily schedule rollover (midnight), only two log lines appeared:
  ```
  trying to write without an established connection
  disconnecting an already disconnected session
  ```
- No reconnect was attempted at the new session period

Before the stuck state, the same sequence-number mismatch produced a healthy crash-loop: the process exited, ECS respawned it, and the cycle repeated. The stuck state began when the peer's behaviour changed: it stopped closing the TCP socket promptly after receiving our Logout.

## Suspected root cause

When we terminate on a fatal session error, the flow is:

1. `handle_sequence_number_too_low` (inbound.rs:131-155) queues a Logout to the writer, calls `writer.disconnect()`, and transitions state to `Disconnected(reconnect=false)`.
2. The writer actor processes `Disconnect` and exits, dropping its `WriteHalf`.
3. State is now `Disconnected`; the outer `establish_connection` loop is blocked in `run_until_disconnect().await` (initiator.rs:125), which waits on the reader's `dc_sender` signal.

The problem: `tokio::io::split` does not give each half its own socket. `ReadHalf` and `WriteHalf` share one underlying stream behind a lock (a `BiLock`-style arrangement), so the stream is only dropped once both halves are gone. Dropping only the `WriteHalf` therefore does **not** close the TCP socket and does **not** call `shutdown(Shutdown::Write)`. No FIN is sent to the peer, and there is no `shutdown`/`Shutdown` call anywhere in `crates/hotfix/src/transport/`.

Consequently:
- The **reader** (socket_reader.rs) stays blocked in `read_buf`, because it only exits on peer-initiated EOF (`Ok(0)`) or a read error.
- The reader's `dc_sender.send(())` signal never fires.
- `run_until_disconnect` never returns.
- The outer loop never reaches `should_reconnect()`, so the `reconnect=false` signal never takes effect.
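
The chain above can be reproduced with plain `std::net`, without tokio. This is a minimal sketch under stated assumptions, not the hotfix transport code: a `try_clone`d handle stands in for the `WriteHalf`, a thread stands in for a peer that only closes after it sees EOF, and `half_close_experiment`, `PeerView` and the 300 ms wait are all names and values invented for illustration.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// How long each side waits for an EOF before concluding none is coming.
const WAIT: Duration = Duration::from_millis(300);

/// What the peer observed on its read side.
#[derive(Debug, PartialEq)]
pub enum PeerView {
    /// No FIN arrived within `WAIT`.
    StillOpen,
    /// The peer read EOF, i.e. it received our FIN.
    Eof,
}

/// Returns true if `sock` reaches EOF within `WAIT`, false if the read times out.
fn saw_eof(sock: &mut TcpStream) -> io::Result<bool> {
    sock.set_read_timeout(Some(WAIT))?;
    let mut buf = [0u8; 64];
    match sock.read(&mut buf) {
        Ok(0) => Ok(true),
        Ok(_) => Ok(false),
        // Timeouts surface as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Ok(false),
        Err(e) => Err(e),
    }
}

/// If `shutdown_write` is false we only drop the duplicate handle (the
/// stand-in for dropping `WriteHalf`). If true we half-close first.
/// Returns what the peer saw and whether our own reader reached EOF.
pub fn half_close_experiment(shutdown_write: bool) -> io::Result<(PeerView, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Peer: closes its socket only after it has seen EOF from us.
    let peer = thread::spawn(move || -> io::Result<(PeerView, Option<TcpStream>)> {
        let (mut sock, _) = listener.accept()?;
        if saw_eof(&mut sock)? {
            drop(sock); // peer closes its end, which sends its own FIN
            Ok((PeerView::Eof, None))
        } else {
            Ok((PeerView::StillOpen, Some(sock))) // keep the socket open
        }
    });

    let mut reader = TcpStream::connect(addr)?;
    let writer = reader.try_clone()?; // second handle to the same socket
    if shutdown_write {
        writer.shutdown(Shutdown::Write)?; // this is what sends the FIN
    }
    drop(writer); // on its own this sends nothing: `reader` keeps the socket alive

    // `_peer_sock` keeps the peer's socket open while we probe our reader.
    let (peer_view, _peer_sock) = peer.join().expect("peer thread panicked")?;
    let reader_eof = saw_eof(&mut reader)?;
    Ok((peer_view, reader_eof))
}
```

Calling `half_close_experiment(false)` models the stuck session: the peer never sees a FIN and our reader never sees EOF. With `true`, the explicit half-close lets the peer observe EOF and close, which is what finally unblocks the reader.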

In the `Disconnected` state there is also no watchdog that can break us out: `heartbeat_deadline` and `peer_deadline` both return `None` (session.rs:744-760), so the session `select!` loop effectively only services the 1s schedule-check timer, which has no path to force-close a stuck reader.

At the midnight rollover, `handle_schedule_check` sees `SessionPeriodComparison::DifferentPeriod` and calls `logout_and_terminate` on the zombie `Disconnected` state, producing the two observed logs from state.rs:177 (`send_message` fallthrough arm) and state.rs:191 (`disconnect_writer` fallthrough arm). It does not transition state or wake the outer loop.
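
As a toy model of that rollover behaviour (not the actual `state.rs` code; `ToySession`, the two-variant `SessionState` and the method bodies are simplifications assumed for illustration, with only the two log strings taken from the observed output):

```rust
/// Heavily simplified stand-in for the session state.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SessionState {
    Active,
    Disconnected { reconnect: bool },
}

/// Hypothetical session that records log lines instead of emitting them.
pub struct ToySession {
    pub state: SessionState,
    pub log: Vec<&'static str>,
}

impl ToySession {
    fn send_message(&mut self) {
        match self.state {
            // Real code would hand the Logout to the writer here.
            SessionState::Active => {}
            // Fall-through arm: nothing to write to, so only a log line.
            _ => self.log.push("trying to write without an established connection"),
        }
    }

    fn disconnect_writer(&mut self) {
        match self.state {
            // Real code would ask the writer actor to disconnect here.
            SessionState::Active => {}
            // Fall-through arm: again only a log line.
            _ => self.log.push("disconnecting an already disconnected session"),
        }
    }

    /// Called by the schedule check when the session period changes.
    pub fn logout_and_terminate(&mut self) {
        self.send_message();
        self.disconnect_writer();
        // Only a live session transitions; an already-disconnected one is
        // left exactly as it was, so nothing wakes the outer loop.
        if self.state == SessionState::Active {
            self.state = SessionState::Disconnected { reconnect: false };
        }
    }
}
```

Starting the toy session in `Disconnected { reconnect: false }` and calling `logout_and_terminate` yields exactly the two observed log lines and no state change, which matches the silence that followed in production.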

## Proposed fix

Make writer termination actually close the TCP stream so the reader observes EOF.

Options, roughly in order of preference:

1. **Call `AsyncWriteExt::shutdown().await` on the `WriteHalf` before the writer actor exits**, in `socket_writer.rs::run_writer` / on `WriterMessage::Disconnect`. This sends a FIN, so the peer's read side observes EOF; most peers then close their end, which in turn delivers EOF to our reader.

2. **Stop splitting the stream.** Hold the `TcpStream` (or TLS stream) in a single owner and close both directions on terminate, e.g. via `.shutdown(Shutdown::Both)` where the stream type exposes it, or by dropping the stream. More invasive, but removes the split pitfall entirely.

3. **Defense-in-depth regardless of the above:** add a session-level liveness timer that fires while in `Disconnected` (or any non-`Active` state with a live reader) and forcibly drops the reader after N seconds. This covers peers that accept our FIN but still refuse to close their write side.

Option 1 is the smallest change and directly addresses the root cause. Option 3 is cheap to add alongside as a safety net.
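
One possible shape for the Option 3 safety net, sketched with blocking `std::net` rather than the async reader in `socket_reader.rs`. Everything here is an assumption made for illustration: `read_until_closed`, `ReaderExit`, the shared `terminated` flag and the 50 ms poll interval do not exist in the codebase.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::TcpStream;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Why the reader loop ended.
#[derive(Debug, PartialEq)]
pub enum ReaderExit {
    /// The peer closed its side and we read EOF.
    PeerEof,
    /// The session was terminated and the peer did not close within the grace period.
    ForcedAfterGrace,
}

/// Reads until EOF, but once `terminated` is set, gives the peer at most
/// `grace` to close before giving up. Returning drops `sock`, which closes it.
pub fn read_until_closed(
    mut sock: TcpStream,
    terminated: Arc<AtomicBool>,
    grace: Duration,
) -> io::Result<ReaderExit> {
    // A short read timeout turns the blocking read into a poll loop.
    sock.set_read_timeout(Some(Duration::from_millis(50)))?;
    let mut buf = [0u8; 1024];
    let mut deadline: Option<Instant> = None;

    loop {
        // Arm the watchdog the first time we notice the session was terminated.
        if deadline.is_none() && terminated.load(Ordering::Acquire) {
            deadline = Some(Instant::now() + grace);
        }
        if let Some(d) = deadline {
            if Instant::now() >= d {
                return Ok(ReaderExit::ForcedAfterGrace);
            }
        }
        match sock.read(&mut buf) {
            Ok(0) => return Ok(ReaderExit::PeerEof),
            Ok(_) => {} // a real reader would parse the bytes here
            // Read timeout (WouldBlock on Unix, TimedOut on Windows) or a
            // spurious interrupt: loop around and re-check the deadline.
            Err(e)
                if matches!(
                    e.kind(),
                    ErrorKind::WouldBlock | ErrorKind::TimedOut | ErrorKind::Interrupted
                ) => {}
            Err(e) => return Err(e),
        }
    }
}
```

In the async reader the same idea would presumably be a `select!` between the read and a timer armed on termination; either way, the point is that the reader gets an exit path that does not depend on the peer's cooperation.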

## Additional consideration

The `await_in_schedule` issue (session_ref.rs:85-95 ignores `ScheduleResponse::Shutdown`) is related but orthogonal, and is tracked separately.
