fix: replace relay self-reconnect with external sweep and improve backoff#279
Merged
fix: replace relay self-reconnect with external sweep and improve backoff#279
Conversation
…koff Root cause of relay thrashing bug: dual ownership of reconnect logic. Relay self-reconnected internally via onFailure/onClosed while RelayPool/RelayLifecycleManager also drove reconnection externally. The reconnectEnabled flag could not reliably coordinate these because onFailure callbacks fire asynchronously on OkHttp's thread pool, creating a race with forceReconnectAll()'s disconnect/connect/enable sequence. Changes: - Relay: remove all self-reconnect machinery (autoReconnect, reconnectEnabled, pendingReconnect, reconnectScheduler, scope param, attempt-window rate limiter). Add exponential backoff state (reconnectDelayMs, lastAttemptMs) and needsReconnect()/connectIfNeeded() for external polling. Backoff only resets when a connection was stable for >10s, preventing fast open-then-fail cycles (e.g. uWebSockets relays that drop idle connections) from resetting the delay. - Relay: add onConnected callback invoked synchronously in onOpen before drainPendingMessages, so subscriptions hit the wire immediately on connect. - Relay: add User-Agent header to WebSocket upgrade requests. - RelayPool: replace setReconnectEnabled() with a 3s sweep coroutine (startReconnectSweep/stopReconnectSweep) tied to appIsActive. Sweep calls connectIfNeeded() on persistent and DM relays only; ephemerals recreated on demand. - RelayPool: wire relay.onConnected to resyncSubscriptions() so subscriptions are resent synchronously on reconnect rather than via async StateFlow dispatch. - RelayPool: remove reconnectEnabled assignments from reconnectAll/forceReconnectAll. - RelayHealthTracker: remove session-based bad-relay evaluation (zero-event, disconnect, rate-limit thresholds). Mark relays bad only on 5xx HTTP errors via new onServerError() method. Rate limits still tracked as a stat for display. - RelayProber, NwcRepository, FeedViewModel: remove autoReconnect/scope refs now that Relay no longer has those fields. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up two improvements from main that were added in parallel: - disconnect() uses ws.close(1001) instead of ws.cancel() for graceful WebSocket closure, reducing RST-then-SYN storms on high-traffic relays - send() skips queueing REQ messages since resyncSubscriptions already handles replay on reconnect, preventing duplicate subscriptions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Relay.onFailureself-reconnected asynchronously whileRelayPool.forceReconnectAll()had already established a new connection, producing a thrashing loop of rapid disconnect/reconnect cycles and EOSE floods on app resume.Relayno longer self-reconnects. All reconnection is driven externally by a 3-second sweep coroutine inRelayPoolthat callsconnectIfNeeded()on each relay.RelayLifecycleManager's existingappIsActive = false → reconnect → appIsActive = truepattern cleanly stops/restarts the sweep around explicit reconnects.reconnectDelayMsonly resets to 1s when a connection was stable for >10 seconds. Relays that consistently open then immediately close (e.g. uWebSockets relays with idle-connection timeouts) now accumulate exponential backoff (2s → 4s → 8s → … → 5 min cap) rather than retrying every 3 seconds forever.onConnectedcallback invoked on OkHttp's thread inonOpenbefore returning, so active subscriptions are resent immediately on reconnect without waiting for asyncStateFlowdispatch.RelayHealthTrackerno longer uses session-based heuristics (zero-event sessions, disconnect counts, rate-limit counts) to mark relays bad. Relays are now only marked bad on 5xx HTTP errors.autoReconnect,reconnectEnabled,scopeconstructor param,pendingReconnect,reconnectScheduler, and attempt-window rate limiter fromRelay. AddedUser-Agentheader to WebSocket upgrade requests.Test plan
RLClogcatCONN_FAILUREentries on resume