Fix worker silent connection drops with idle timeout #259
Merged
The worker had no mechanism to detect silently dead connections. When the TCP connection to the proxy dropped without a FIN/RST (NAT timeout, load balancer idle timeout, TLS middlebox), the worker's read loop would block forever waiting for a message that would never arrive. The server-side heartbeat detection only works from the server's perspective; the worker never learned its connection was dead.

Changes:
- Add 90-second idle timeout to the worker session loop. Since the server sends application-level pings every 30s, no message within 90s means the connection is dead. The worker now logs a warning and exits the session, triggering the existing reconnect logic.
- Track last_server_activity timestamp, updated on every received message (Text, Ping, Pong, Binary).
- Reset exponential backoff when a session lasted longer than the idle timeout, since a long-lived session that eventually drops is a transient network issue, not a persistent connection failure.
- Add session_duration_secs to reconnect log messages for observability.
- Log successful connection/registration for visibility into reconnects.
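The backoff-reset rule described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `next_backoff` and the `INITIAL_BACKOFF` / `MAX_BACKOFF` values are hypothetical, and the only number taken from the description is the 90-second idle timeout.

```rust
use std::time::Duration;

/// Matches the 90-second idle timeout named in the PR description.
const IDLE_TIMEOUT: Duration = Duration::from_secs(90);
/// Hypothetical backoff bounds; the PR does not state the real values.
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60);

/// Returns the delay to use before the next reconnect attempt.
///
/// `current` is the delay used before the session that just ended, and
/// `session_duration` is how long that session stayed up.
fn next_backoff(current: Duration, session_duration: Duration) -> Duration {
    if session_duration > IDLE_TIMEOUT {
        // Long-lived session: treat the drop as a transient network issue
        // and start again from the smallest delay.
        INITIAL_BACKOFF
    } else {
        // Short-lived session: keep doubling, capped at MAX_BACKOFF.
        (current * 2).min(MAX_BACKOFF)
    }
}
```

Under these assumed bounds, a session that dies after 5 seconds with an 8-second delay moves to 16 seconds, while a session that lasted 10 minutes drops back to 1 second regardless of the previous delay.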
What
Adds a 90-second idle timeout to the worker's WebSocket session loop. If no message is received from the server within that window, the worker assumes the connection is dead and triggers a reconnect.
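A minimal sketch of the idle-timeout idea, assuming a standard-library channel as a stand-in for the WebSocket stream (the real worker presumably reads from an async socket, which is not shown here). The `SessionEnd` type and `run_session` function are hypothetical names; only `last_server_activity` comes from the PR description.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Why a session ended. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum SessionEnd {
    /// Nothing arrived within the idle timeout; the caller should reconnect.
    IdleTimeout,
    /// The peer closed the connection cleanly.
    Closed,
}

/// Reads messages until the sender goes away or nothing arrives
/// within `idle_timeout`.
fn run_session(rx: &Receiver<String>, idle_timeout: Duration) -> SessionEnd {
    let mut last_server_activity = Instant::now();
    loop {
        // Time left before the connection is considered dead.
        let remaining = idle_timeout.saturating_sub(last_server_activity.elapsed());
        match rx.recv_timeout(remaining) {
            // Any message counts as activity and resets the idle window.
            Ok(_msg) => last_server_activity = Instant::now(),
            // Silence for the whole window: give up on this session.
            Err(RecvTimeoutError::Timeout) => return SessionEnd::IdleTimeout,
            // The sending side was dropped: an ordinary close.
            Err(RecvTimeoutError::Disconnected) => return SessionEnd::Closed,
        }
    }
}
```

The key design point is that the read is bounded rather than blocking forever: with a 30-second server ping and a 90-second window, three consecutive missed pings end the session and hand control back to the reconnect loop.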
Why
The worker had no way to detect silently dead connections. When the TCP connection to the proxy dropped without a FIN/RST (common with NAT timeouts, load balancer idle timeouts, or TLS middleboxes), the worker's read loop would block forever — no errors logged, no reconnect attempted. The server's heartbeat mechanism only detects staleness from the server's side; the worker never learned its connection was dead.
This was observed on 2026-04-10: the worker appeared healthy (no errors on CLI) but the proxy returned "Request timed out waiting for worker" to callers. Restarting the worker immediately fixed it.
Changes
- … run_with_reconnect handles retry.
- … session_duration_secs to all reconnect log messages, and a "connected to proxy, registered worker" info log on successful connection.

Testing
- … cargo test -p modelrelay-worker)
- cargo clippy -p modelrelay-worker -- -D warnings clean
- … run_with_reconnect loop; no new reconnect machinery needed