Fix worker silent connection drops with idle timeout #259

Merged
ericflo merged 1 commit into main from ce/fix-worker-silent-drops
Apr 11, 2026

@ericflo ericflo commented Apr 10, 2026

What

Adds a 90-second idle timeout to the worker's WebSocket session loop. If no message is received from the server within that window, the worker assumes the connection is dead and triggers a reconnect.
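The idle-timeout idea can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual code: it assumes a `std::sync::mpsc` channel as a stand-in for the WebSocket read half, and `run_session` and `SessionEnd` are hypothetical names. The real worker presumably wraps its socket read in a timeout from its own runtime.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why a session loop ended (hypothetical type, not taken from the PR).
#[derive(Debug, PartialEq)]
pub enum SessionEnd {
    /// Nothing arrived within the idle timeout; treat the link as dead.
    IdleTimeout,
    /// The peer closed the connection explicitly.
    Closed,
}

/// Reads messages until the sender disconnects or until no message
/// arrives within `idle_timeout`. Returns the number of messages seen
/// and the reason the loop ended. The channel stands in for the socket.
pub fn run_session(rx: &Receiver<String>, idle_timeout: Duration) -> (usize, SessionEnd) {
    let mut received = 0;
    loop {
        match rx.recv_timeout(idle_timeout) {
            // Any message counts as proof that the link is alive.
            Ok(_msg) => received += 1,
            // Silence for the whole window: warn and leave the loop so
            // the caller's reconnect logic can take over.
            Err(RecvTimeoutError::Timeout) => {
                eprintln!("no message within {:?}; assuming connection is dead", idle_timeout);
                return (received, SessionEnd::IdleTimeout);
            }
            Err(RecvTimeoutError::Disconnected) => return (received, SessionEnd::Closed),
        }
    }
}
```

With the 90-second value from this PR, the call would be `run_session(&rx, Duration::from_secs(90))`. The point of the sketch is that a blocking read with no deadline never returns on a silently dead link, while a bounded read turns silence into an explicit outcome.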

Why

The worker had no way to detect silently dead connections. When the TCP connection to the proxy dropped without a FIN/RST (common with NAT timeouts, load balancer idle timeouts, or TLS middleboxes), the worker's read loop would block forever — no errors logged, no reconnect attempted. The server's heartbeat mechanism only detects staleness from the server's side; the worker never learned its connection was dead.

This was observed on 2026-04-10: the worker appeared healthy (no errors on CLI) but the proxy returned "Request timed out waiting for worker" to callers. Restarting the worker immediately fixed it.

Changes

  • Idle timeout (90s): The server pings every 30s, so 90s (3×) gives margin for transient delays while catching dead connections within ~2 minutes. On timeout, the worker logs a warning and breaks out of the session loop, and the existing run_with_reconnect handles the retry.
  • Backoff reset on long sessions: If a session lasted longer than the idle timeout before dropping, it was a genuine connection that eventually died — reset backoff to 1s for fast recovery instead of escalating to 30s.
  • Better observability: Added session_duration_secs to all reconnect log messages, plus an info-level "connected to proxy, registered worker" log on successful connection.
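The backoff-reset rule can be sketched as a small pure function. The 1s floor, 30s ceiling, and 90s threshold come from the description above; the doubling factor and the name `next_backoff` are assumptions made for illustration, since the PR does not show the actual growth schedule.

```rust
use std::time::Duration;

const IDLE_TIMEOUT: Duration = Duration::from_secs(90);
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Picks the next reconnect delay (hypothetical helper).
///
/// A session that outlived the idle timeout was a genuine connection
/// that later died, so recovery starts again from the shortest delay.
/// A shorter session is treated as a failed attempt and the delay
/// grows, here by doubling (an assumption), up to the cap.
pub fn next_backoff(current: Duration, session_duration: Duration) -> Duration {
    if session_duration > IDLE_TIMEOUT {
        INITIAL_BACKOFF
    } else {
        (current * 2).min(MAX_BACKOFF)
    }
}
```

For example, a worker that had escalated to an 8s delay and then held a ten-minute session would retry after 1s, while one whose session lasted only 5s would wait 16s.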

Testing

  • All 30 existing tests pass (cargo test -p modelrelay-worker)
  • cargo clippy -p modelrelay-worker -- -D warnings clean
  • The idle timeout integrates with the existing run_with_reconnect loop — no new reconnect machinery needed

The worker had no mechanism to detect silently dead connections. When the
TCP connection to the proxy dropped without a FIN/RST (NAT timeout, load
balancer idle timeout, TLS middlebox), the worker's read loop would block
forever waiting for a message that would never arrive. The server-side
heartbeat detection only works from the server's perspective — the worker
never learned its connection was dead.

Changes:
- Add 90-second idle timeout to the worker session loop. Since the server
  sends application-level pings every 30s, no message within 90s means
  the connection is dead. The worker now logs a warning and exits the
  session, triggering the existing reconnect logic.
- Track last_server_activity timestamp, updated on every received message
  (Text, Ping, Pong, Binary).
- Reset exponential backoff when a session lasted longer than the idle
  timeout, since a long-lived session that eventually drops is a transient
  network issue, not a persistent connection failure.
- Add session_duration_secs to reconnect log messages for observability.
- Log successful connection/registration for visibility into reconnects.
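The last_server_activity bookkeeping described in the list above might look roughly like this. It is a sketch under assumptions: `ActivityTracker` and its methods are hypothetical names, the current time is passed in explicitly so the logic is easy to test, and treating exactly 90s of silence as idle is a choice made here rather than something the commit message specifies.

```rust
use std::time::{Duration, Instant};

/// Tracks when the server was last heard from (hypothetical type).
pub struct ActivityTracker {
    last_server_activity: Instant,
    idle_timeout: Duration,
}

impl ActivityTracker {
    /// Starts the clock at `now`, typically right after connecting.
    pub fn new(now: Instant, idle_timeout: Duration) -> Self {
        Self { last_server_activity: now, idle_timeout }
    }

    /// Call for every received frame, whatever its kind
    /// (Text, Ping, Pong, Binary).
    pub fn record_message(&mut self, now: Instant) {
        self.last_server_activity = now;
    }

    /// True once the silence has lasted at least the idle timeout.
    pub fn is_idle(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_server_activity) >= self.idle_timeout
    }
}
```

Updating the timestamp on every frame type matters because the server's 30s pings are what keep an otherwise quiet connection from looking dead.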
@ericflo ericflo merged commit 3eeccac into main Apr 11, 2026
12 checks passed
@ericflo ericflo deleted the ce/fix-worker-silent-drops branch April 11, 2026 00:05