Fix worker silent connection drops with idle timeout by ericflo · Pull Request #259 · ericflo/modelrelay

ericflo · 2026-04-10T22:31:10Z

What

Adds a 90-second idle timeout to the worker's WebSocket session loop. If no message is received from the server within that window, the worker assumes the connection is dead and triggers a reconnect.
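The behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual code: a `std::sync::mpsc` channel stands in for the WebSocket read half, and `SessionEnd` and `run_session` are hypothetical names chosen for the example.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why a session loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum SessionEnd {
    /// No message arrived within the idle window; assume the link is dead.
    IdleTimeout,
    /// The peer closed the connection cleanly.
    Closed,
}

/// Reads messages until the channel closes or stays silent for `idle_timeout`.
/// `rx` is a stand-in for the WebSocket read half.
pub fn run_session(rx: &Receiver<String>, idle_timeout: Duration) -> SessionEnd {
    loop {
        match rx.recv_timeout(idle_timeout) {
            // Any received message counts as activity; the next wait starts fresh.
            Ok(_msg) => continue,
            // Silence for the whole window: give up so the caller can reconnect.
            Err(RecvTimeoutError::Timeout) => return SessionEnd::IdleTimeout,
            // The sending side is gone: a normal close.
            Err(RecvTimeoutError::Disconnected) => return SessionEnd::Closed,
        }
    }
}
```

A caller would pass `Duration::from_secs(90)` and treat `SessionEnd::IdleTimeout` the same as any other dropped session, handing control back to the reconnect loop.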

Why

The worker had no way to detect silently dead connections. When the TCP connection to the proxy dropped without a FIN/RST (common with NAT timeouts, load balancer idle timeouts, or TLS middleboxes), the worker's read loop would block forever — no errors logged, no reconnect attempted. The server's heartbeat mechanism only detects staleness from the server's side; the worker never learned its connection was dead.

This was observed on 2026-04-10: the worker appeared healthy (no errors on CLI) but the proxy returned "Request timed out waiting for worker" to callers. Restarting the worker immediately fixed it.

Changes

Idle timeout (90s): The server pings every 30s, so 90s (3×) gives margin for transient delays while catching dead connections within ~2 minutes. On timeout, logs a warning and breaks out of the session loop → existing run_with_reconnect handles retry.
Backoff reset on long sessions: If a session lasted longer than the idle timeout before dropping, it was a genuine connection that eventually died — reset backoff to 1s for fast recovery instead of escalating to 30s.
Better observability: Added session_duration_secs to all reconnect log messages, and an info-level "connected to proxy, registered worker" log message on successful connection.
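The backoff reset rule can be expressed as a small pure function. The sketch below is an assumption-laden illustration: the 1s floor, 30s ceiling, and 90s threshold come from the description above, while the doubling factor and the name `next_backoff` are hypothetical and may not match the worker's code.

```rust
use std::time::Duration;

// Values taken from the PR description.
const IDLE_TIMEOUT: Duration = Duration::from_secs(90);
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Picks the delay before the next reconnect attempt (hypothetical helper).
///
/// A session that outlived the idle timeout was a genuine connection, so the
/// delay resets to the floor. Otherwise the delay grows, capped at the ceiling.
pub fn next_backoff(current: Duration, session_duration: Duration) -> Duration {
    if session_duration > IDLE_TIMEOUT {
        INITIAL_BACKOFF
    } else {
        // Doubling is an assumption; the source only says "exponential".
        (current * 2).min(MAX_BACKOFF)
    }
}
```

Keeping the decision in a pure function makes the reset condition easy to unit test without opening a real connection.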

Testing

All 30 existing tests pass (cargo test -p modelrelay-worker)
cargo clippy -p modelrelay-worker -- -D warnings clean
The idle timeout integrates with the existing run_with_reconnect loop — no new reconnect machinery needed

The worker had no mechanism to detect silently dead connections. When the TCP connection to the proxy dropped without a FIN/RST (NAT timeout, load balancer idle timeout, TLS middlebox), the worker's read loop would block forever waiting for a message that would never arrive. The server-side heartbeat detection only works from the server's perspective; the worker never learned its connection was dead.

Changes:

- Add 90-second idle timeout to the worker session loop. Since the server sends application-level pings every 30s, no message within 90s means the connection is dead. The worker now logs a warning and exits the session, triggering the existing reconnect logic.
- Track last_server_activity timestamp, updated on every received message (Text, Ping, Pong, Binary).
- Reset exponential backoff when a session lasted longer than the idle timeout, since a long-lived session that eventually drops is a transient network issue, not a persistent connection failure.
- Add session_duration_secs to reconnect log messages for observability.
- Log successful connection/registration for visibility into reconnects.
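The last_server_activity bookkeeping from the commit message might look roughly like the following. Only the field name and the four frame kinds come from the source; `Frame`, `ActivityTracker`, `on_frame`, and `is_idle` are hypothetical names, and passing `now` explicitly is a choice made here so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Frame kinds the commit message lists as counting toward server activity.
#[derive(Debug, Clone, Copy)]
pub enum Frame {
    Text,
    Ping,
    Pong,
    Binary,
}

/// Tracks when the server was last heard from (hypothetical helper type).
pub struct ActivityTracker {
    last_server_activity: Instant,
    idle_timeout: Duration,
}

impl ActivityTracker {
    pub fn new(now: Instant, idle_timeout: Duration) -> Self {
        Self { last_server_activity: now, idle_timeout }
    }

    /// Every received frame, regardless of kind, refreshes the timestamp.
    pub fn on_frame(&mut self, _frame: Frame, now: Instant) {
        self.last_server_activity = now;
    }

    /// True once the server has been silent for at least the idle timeout.
    pub fn is_idle(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_server_activity) >= self.idle_timeout
    }
}
```

With 30s server pings and a 90s timeout, three consecutive pings have to go missing before `is_idle` reports true.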

ericflo merged commit 3eeccac into main Apr 11, 2026
12 checks passed

ericflo deleted the ce/fix-worker-silent-drops branch April 11, 2026 00:05

ericflo merged 1 commit into main from ce/fix-worker-silent-drops