fix: replace relay self-reconnect with external sweep and improve backoff #279

Merged
barrydeen merged 3 commits into main from fix/relay-reconnect-thrashing
Mar 19, 2026

@barrydeen
Owner

Summary

  • Root cause fixed: Reconnect logic had two owners. Relay.onFailure self-reconnected asynchronously even after RelayPool.forceReconnectAll() had already established a new connection, and the resulting race produced a thrashing loop of rapid disconnect/reconnect cycles and EOSE floods on app resume.
  • Amethyst-style sweep: Relay no longer self-reconnects. All reconnection is driven externally by a 3-second sweep coroutine in RelayPool that calls connectIfNeeded() on each relay. RelayLifecycleManager's existing appIsActive = false → reconnect → appIsActive = true pattern cleanly stops/restarts the sweep around explicit reconnects.
  • Stable backoff: reconnectDelayMs only resets to 1s when a connection was stable for >10 seconds. Relays that consistently open then immediately close (e.g. uWebSockets relays with idle-connection timeouts) now accumulate exponential backoff (2s → 4s → 8s → … → 5 min cap) rather than retrying every 3 seconds forever.
  • Synchronous subscription resync: Added onConnected callback invoked on OkHttp's thread in onOpen before returning, so active subscriptions are resent immediately on reconnect without waiting for async StateFlow dispatch.
  • Bad relay logic simplified: RelayHealthTracker no longer uses session-based heuristics (zero-event sessions, disconnect counts, rate-limit counts) to mark relays bad. Relays are now only marked bad on 5xx HTTP errors.
  • Cleanup: Removed autoReconnect, reconnectEnabled, scope constructor param, pendingReconnect, reconnectScheduler, and attempt-window rate limiter from Relay. Added User-Agent header to WebSocket upgrade requests.
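
The stable-backoff rule above can be sketched as a small state holder. This is a minimal illustration and not the code in this PR: the class name BackoffState, the onOpen/onClosedOrFailed hooks, and the explicit nowMs parameter (used here so the logic can be exercised without a real clock) are assumptions. Only reconnectDelayMs, lastAttemptMs, needsReconnect() and connectIfNeeded() are names taken from the description.

```kotlin
// Hypothetical sketch of the backoff policy; not the actual Relay class.
class BackoffState(
    private val initialDelayMs: Long = 1_000L,          // delay after a stable session
    private val maxDelayMs: Long = 5 * 60 * 1_000L,     // 5 min cap
    private val stableThresholdMs: Long = 10_000L,      // "stable for >10 seconds"
) {
    var reconnectDelayMs: Long = initialDelayMs
        private set
    var lastAttemptMs: Long? = null                     // null until the first attempt
        private set
    var connected: Boolean = false
        private set
    private var openedAtMs: Long? = null

    // True when the relay is down and the current backoff delay has elapsed.
    fun needsReconnect(nowMs: Long): Boolean {
        if (connected) return false
        val last = lastAttemptMs ?: return true
        return nowMs - last >= reconnectDelayMs
    }

    // Called by an external sweep. Returns true when the caller should dial now.
    // Each attempt doubles the delay for the next one, up to the cap.
    fun connectIfNeeded(nowMs: Long): Boolean {
        if (!needsReconnect(nowMs)) return false
        lastAttemptMs = nowMs
        reconnectDelayMs = minOf(reconnectDelayMs * 2, maxDelayMs)
        return true
    }

    fun onOpen(nowMs: Long) {
        connected = true
        openedAtMs = nowMs
    }

    // The delay resets only if the connection that just ended was stable.
    fun onClosedOrFailed(nowMs: Long) {
        val openedAt = openedAtMs
        if (openedAt != null && nowMs - openedAt > stableThresholdMs) {
            reconnectDelayMs = initialDelayMs
        }
        connected = false
        openedAtMs = null
    }
}
```

In this sketch a relay that opens and closes within a few hundred milliseconds never reaches the reset branch, so its delay keeps doubling even though every individual connection attempt "succeeds".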

Test plan

  • Open app, use feed for ~1 min, minimize, wait 45s, reopen — relays should reconnect once cleanly with no thrashing in RLC logcat
  • Minimize for <10s and reopen — lightweight reconnect, existing subscriptions reused
  • Toggle WiFi off/on — relay reconnects once after network stabilises
  • Relay debug console — no flood of CONN_FAILURE entries on resume
  • Relay that drops mid-session reconnects within ~3–6s via sweep without disrupting others

barrydeen and others added 3 commits March 19, 2026 17:15
…koff

Root cause of relay thrashing bug: dual ownership of reconnect logic. Relay
self-reconnected internally via onFailure/onClosed while RelayPool/RelayLifecycleManager
also drove reconnection externally. The reconnectEnabled flag could not reliably
coordinate these because onFailure callbacks fire asynchronously on OkHttp's thread
pool, creating a race with forceReconnectAll()'s disconnect/connect/enable sequence.

Changes:
- Relay: remove all self-reconnect machinery (autoReconnect, reconnectEnabled,
  pendingReconnect, reconnectScheduler, scope param, attempt-window rate limiter).
  Add exponential backoff state (reconnectDelayMs, lastAttemptMs) and
  needsReconnect()/connectIfNeeded() for external polling. Backoff only resets
  when a connection was stable for >10s, preventing fast open-then-fail cycles
  (e.g. uWebSockets relays that drop idle connections) from resetting the delay.
- Relay: add onConnected callback invoked synchronously in onOpen before
  drainPendingMessages, so subscriptions hit the wire immediately on connect.
- Relay: add User-Agent header to WebSocket upgrade requests.
- RelayPool: replace setReconnectEnabled() with a 3s sweep coroutine
  (startReconnectSweep/stopReconnectSweep) tied to appIsActive. Sweep calls
  connectIfNeeded() on persistent and DM relays only; ephemerals recreated on demand.
- RelayPool: wire relay.onConnected to resyncSubscriptions() so subscriptions
  are resent synchronously on reconnect rather than via async StateFlow dispatch.
- RelayPool: remove reconnectEnabled assignments from reconnectAll/forceReconnectAll.
- RelayHealthTracker: remove session-based bad-relay evaluation (zero-event,
  disconnect, rate-limit thresholds). Mark relays bad only on 5xx HTTP errors
  via new onServerError() method. Rate limits still tracked as a stat for display.
- RelayProber, NwcRepository, FeedViewModel: remove autoReconnect/scope refs
  now that Relay no longer has those fields.
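
As a rough illustration of the sweep described in the RelayPool changes above, one pass could look like the following. This is a sketch under stated assumptions, not the PR's implementation: SweepTarget, SweepingPool and sweepOnce are hypothetical names, connectIfNeeded takes an explicit timestamp here only for testability, and the 3s coroutine loop that would call sweepOnce repeatedly is left out so the example has no dependencies.

```kotlin
// Hypothetical interface; only the method name connectIfNeeded comes from the PR.
interface SweepTarget {
    // Returns true if a connection attempt was started.
    fun connectIfNeeded(nowMs: Long): Boolean
}

// Hypothetical pool holding the three relay groups mentioned in the commit message.
class SweepingPool(
    private val persistent: List<SweepTarget>,
    private val dm: List<SweepTarget>,
    val ephemeral: List<SweepTarget>,   // never swept; recreated on demand
) {
    // Gate for the sweep. Setting it to false around an explicit reconnect
    // keeps the sweep from racing with that reconnect.
    @Volatile
    var appIsActive: Boolean = false

    // One sweep pass. Returns how many relays started a connection attempt.
    fun sweepOnce(nowMs: Long): Int {
        if (!appIsActive) return 0
        return (persistent + dm).count { it.connectIfNeeded(nowMs) }
    }
}
```

Because every relay decides for itself whether its backoff has elapsed, the sweep can stay a plain fixed-interval loop with no per-relay timers.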

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Picks up two improvements from main that were added in parallel:
- disconnect() uses ws.close(1001) instead of ws.cancel() for graceful
  WebSocket closure, reducing RST-then-SYN storms on high-traffic relays
- send() skips queueing REQ messages since resyncSubscriptions already
  handles replay on reconnect, preventing duplicate subscriptions
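
The send() rule above can be illustrated with a small queue. This is a hedged sketch rather than the actual Relay.send(): OutboundQueue, pendingCount, drain and the write callback are hypothetical, and detecting a REQ by its `["REQ"` prefix is an assumption based on the Nostr client message format.

```kotlin
// Hypothetical sketch of "do not queue REQ while disconnected".
class OutboundQueue {
    private val pending = ArrayDeque<String>()

    val pendingCount: Int
        get() = pending.size

    // Returns true if the message was written or queued, false if it was dropped.
    fun send(message: String, connected: Boolean, write: (String) -> Unit): Boolean {
        if (connected) {
            write(message)
            return true
        }
        // Subscriptions are replayed by the resync step on reconnect,
        // so queueing them here would send each REQ twice.
        if (isReq(message)) return false
        pending.addLast(message)
        return true
    }

    // Flushes queued messages in order once the socket is open.
    fun drain(write: (String) -> Unit) {
        while (pending.isNotEmpty()) write(pending.removeFirst())
    }

    private fun isReq(message: String): Boolean =
        message.trimStart().startsWith("[\"REQ\"")
}
```

Other message types, such as EVENT, are still queued in this sketch because nothing else would replay them after a reconnect.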

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>