fix(relay-reconnect): resilient reconnect with fast-path, escalation, and polling#1456
Conversation
b6de4e8 to
9215148
Compare
… and polling The reconnect button had three compounding failure modes: 1. Instant false-fail during sign-in: the transport-recovery hook ran and exited in ~0.1s, but the tunnel liveness check passed while the Access layer was still blocked, so preconnect() hit the unreachable network and failed immediately — the user saw a 'failed' toast mid-login. 2. False-fail after actual reconnect: a single promise race with a 15s cap determined the verdict. preconnect() set keepAliveRequested=true so the background retry loop kept going and often succeeded right after the toast fired, leaving the user confused. 3. Spurious browser tab on transient blips: the hook ran unconditionally even when the transport was healthy and the disconnect was a network flap. New strategy — try-fast-first → escalate → poll-until-connected: - Phase 1 (fast path): attempt preconnect() with a 4s cap. Succeeds immediately for transient blips — no hook fires, no browser tab opens. - Phase 2 (escalation, hook-configured builds only): only if the fast path fails and relay_reconnect_hook_configured() returns true, run the transport-recovery hook. The hook returns quickly; any browser-based network sign-in it triggers runs asynchronously. - Phase 3 (poll-until-connected): retry preconnect() every 3s and watch the connection-state emitter. The moment the relay is reachable the next poll or emitter event declares success — no waiting for a timer. A 120s backstop fires a soft toast only; keepAliveRequested keeps the background loop alive. Extracts a RelayReconnectController singleton (plain TS, injected deps) so all three consumers — ConnectionBanner, useSidebarRelayConnectionCard, and ProfileStep — share one in-flight state via useSyncExternalStore-style subscribe. No double browser tabs; consistent 'Waiting for sign-in' state across surfaces. Cancellation token (generation counter) is checked after every await and inside every async continuation, closing the gap where a cancelled poll's .then() could still mutate state. Adds relay_reconnect_hook_configured Rust command (OSS-clean, bool) so the frontend skips escalation and 'Waiting for sign-in...' copy in OSS builds. e2eBridge returns false for this command in test environments. Tests rewritten to exercise the real RelayReconnectController class with injected fakes — no simulation helpers. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
9215148 to
f42a301
Compare
…t user click connected state is now authoritative for recovery UI — a stale query error no longer pins the reconnect card while the relay is healthy. - useSidebarRelayConnectionCard: gate hasActiveRelayUnreachableError on !isRelayConnectionConnected so connected state clears the card even when channelsQuery.error has not yet been cleared - ConnectionBanner: add state !== connected guard to hasCollapsedRelayError for the same reason - RelayAutoHealScheduler: expand constructor parameter properties to explicit field declarations so node --experimental-strip-types can load the class in tests; initialize lastHealAt to -Infinity so the first recovery is never rate-limited - Add useRelayAutoHeal.test.mjs covering all scheduler state transitions Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Replace stale sign-in wording in relay_reconnect_hook_configured() doc comment with the same hook-agnostic copy used in the UI. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…-state gate test 01 previously booted with channelsReadError and expected the card to show while the relay was still connected — relying on the old behavior where a stale query error alone could pin the card open. After the connected-state authority change, the card is suppressed when state === 'connected', even if channelsReadError is present. Drive the relay to 'reconnecting' (which IS in isRelayConnectionStateDegraded) so the card appears via the state path, independent of query-error timing. Drop the connection-banner count 0 assertion: the banner correctly shows for any degraded state, including reconnecting. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
sidebar-relay-card.spec.ts and relay-connectivity.spec.ts both tested the old contract where a stale channelsReadError alone could pin the relay card visible while connection state was "connected". The new gate (hasActiveRelayUnreachableError && !isRelayConnectionConnected) correctly suppresses stale errors when connected is authoritative. Add __BUZZ_E2E_GET_RELAY_CONNECTION_STATE__ seam to e2eBridge.ts so specs can observe the relay state before overriding it. Gate the seam wait on the target state: when driving to a degraded state, wait for the relay to reach "connected" first (prevents the async auth handshake from racing back over the override); when driving to "connected", just wait for the seam to be installed — no baseline required. All 15 targeted E2E tests pass locally (6 smoke + 9 integration). Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Collapse waitForFunction arrow style to match biome's expected format (remove trailing comma wrapping around the callback). Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
wesbillman
left a comment
There was a problem hiding this comment.
Two blocking correctness issues:
-
relayReconnectController.ts:197-220— synchronous connection-state delivery leaks the phase-3 machinery after success.RelayClientSession.subscribeToConnectionState()immediately invokes its listener with the current state. If recovery lands after the fast-path failure/hook but before this subscription, the callback callsfinish()synchronously, beforeunsubscribeConnectionStatereceives the returned cleanup and before the interval/backstop are installed. Execution then continues and installs both timers even though the controller already finished. The interval remains alive for the app lifetime (its callback only no-ops becauseresolvedis true), and the connection-state subscription is never removed. I reproduced this against the production controller with a synchronously emittedconnected:onSuccess=1, state idle, but one interval and one subscription remain. Subscribe/install timers in an order that cannot schedule work after synchronous success, and add this immediate-current-state case to the controller tests. -
ProfileStep.tsx:80-103+relayReconnectController.ts:222-223— onboarding never records an eventual phase-3 reconnect as success.start()intentionally returnsfalseas soon as polling begins, andrunConnectivityActioninterprets exactlyfalseas “do not callmarkSuccess().” The controller can later reconnect and clear its shared pending state, but this component has no connection-state subscription or completion callback, so its original save error card remains in the failure state (and becomes clickable again) while the relay is healthy. The sidebar has separate connection-state success handling; onboarding needs equivalent eventual-success handling or a completion contract from the controller.
Validation: local desktop unit suite passes 1496/1496; CI is green. The first issue was additionally reproduced with a focused production-controller script, so existing tests are missing the synchronous subscriber path rather than disproving it.
Finding 1 (relayReconnectController.ts): subscribeToConnectionState can invoke its listener synchronously with the current state before returning. When that sync emission signals connected, onConnected() → finish() → cancelTimers() runs while unsubscribeConnectionState is still null — then the caller assigns all three handles after finish, leaking them for the app lifetime. Fix: subscribe first, then check resolved immediately after the return value is assigned. If resolved, unsubscribe the handle (which cancelTimers couldn't reach because the assignment hadn't happened yet) and return before installing the poll interval and backstop timer. Finding 2 (ProfileStep.tsx): reconnect() always returns false in phase 3, so runConnectivityAction never calls markSuccess(), and the onboarding component had no connection-state subscription. The error card stayed showing failure even after the relay healed. Fix: observe the shared relayConnectivitySuccess store (already used by the sidebar) via useSyncExternalStore + useEffect in OnboardingRelayConnectionErrorCard. Export subscribeRelayConnectivitySuccess and getRelayConnectivitySuccessSnapshot from useSidebarRelayConnectionCard.ts so both surfaces share exactly one success-signalling mechanism. Test: add a sync-emission unit test with a subscribeToConnectionState fake that calls the listener before returning its cleanup handle — asserts onSuccess fires once, no interval/backstop installed, cleanup called. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
ProfileStep was observing the sidebar's relayConnectivitySuccess store to detect phase-3 reconnect success. That store is only ever produced by useSidebarRelayConnectionCard, which never mounts during onboarding — AppReady returns OnboardingFlow before the sidebar path renders. Replace the dead cross-surface coupling with a local approach: - Add useRelayConnection() to OnboardingRelayConnectionErrorCard - Add hadActiveReconnectRef to gate markSuccess() on a genuine attempt - Set the ref true when runConnectivityAction starts; clear it and call markSuccess() synchronously on phase-1 success, or leave it armed for the connection-state effect to fire on phase-3 connected transition - Remove the two exports added to useSidebarRelayConnectionCard (no dead API surface) Also fix the controller comment: the first comment incorrectly implied cancelTimers() reached the subscription handle during sync emission. In the assignment-order race, cancelTimers() ran before the handle was assigned; cleanup happens in the if (resolved) block. Say that directly. Add OnboardingRelayConnectionErrorCard.test.mjs: 5 tests covering the guard contract — phase-3 async success, disconnected no-op, no-attempt guard, phase-1 sync success idempotency, and double-connected no-op. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The OnboardingRelayConnectionErrorCard.test.mjs unit test was a mirror state machine — it duplicated the component's ref/effect logic in the test file itself, meaning it would pass even if ProfileStep.tsx regressed. Thufir flagged this as not exercising the production path. Replace with two E2E tests in onboarding.spec.ts that drive the real component against the real relay state seam: - 'click shows Connected then auto-dismisses': produces the error card via a relay-unreachable profile save error, drives relay to disconnected, clicks onboarding-reconnect-relay, drives relay to connected, and asserts the card transitions to 'Connected' then auto-dismisses. Exercises the full hadActiveReconnectRef guard and markSuccess() path through real code. - 'connected without a prior click does not show Connected': same card setup, drives disconnected → connected with no click, asserts the card stays in error state. Guards against spurious success on background recovery without user action. Also adds the setRelayConnectionState helper to onboarding.spec.ts, copied from sidebar-relay-card.spec.ts (same seam, same bidirectional guard). Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Problem
The relay reconnect button had three compounding failure modes:
Symptom 1 — instant false-fail during Okta sign-in for Block-internal users:
warp-cli debug access-reauthexits in ~0.1s (it just opens a browser tab). The ready probe matchedConnectedon the tunnel (which stays up during Access reauth), so the hook returned immediately — thenpreconnect()hit the still-blocked network and failed. The user saw a 'Reconnect failed' toast while still mid-login.Symptom 2 — false-fail after actual reconnect: A single 15s
Promise.raceonpreconnect()determined the verdict. Sincepreconnect()setskeepAliveRequested=true, the session's background retry loop kept running and often succeeded right after the toast fired — leaving messages flowing in with a 'failed' toast still on screen.Symptom 3 — spurious browser tab on transient blips: The hook ran unconditionally. A brief wifi dropout that didn't require reauth would still run the internal release's hook and pop a Cloudflare Access tab.
All three mechanisms verified hands-on (see investigation thread).
Solution
Restructure
useReconnectRelayinto a three-phase strategy:Phase 1 — fast path: Attempt
relayClient.preconnect()with a short 4s timeout. For transient network blips where WARP is already healthy, this succeeds immediately — no hook fires, no browser tab, reconnect is nearly instant.Phase 2 — escalation (hook-configured builds only): Only if the fast path fails and
relay_reconnect_hook_configured()returns true, run the transport-layer hook steps. The hook returns fast (~0.1s); the browser tab opening is the async part of the work.Phase 3 — poll-until-connected: Retry
preconnect()every 3s and watch the connection-state emitter directly. The moment the relay is reachable (Okta completed, or network recovered), the next poll or emitter event declares success — success preempts the backstop regardless of how long login takes. A 120s backstop fires a softtoast()(nottoast.error) if polling never succeeds;keepAliveRequestedkeeps the background retry loop alive after that.Verdict is now state-driven (connection-state emitter at
connected) rather than promise-driven.Changes
Rust (OSS-clean, no Block-specific code):
relay_reconnect_hook_configured() -> boolTauri command — returnsoption_env!(BUZZ_DESKTOP_BUILD_RELAY_RECONNECT_CMD).is_some(). Registered inlib.rs. The existingrelay_reconnect_hookcommand is unchanged.TypeScript:
useReconnectRelay.ts— full redesign to three-phase strategy. ExposesisWaitingForSignIn: booleanfor UX copy.ConnectionBanner.tsx— shows "Waiting for sign-in…" button label while in phase 3 post-escalation.useSidebarRelayConnectionCard.ts— threadsisWaitingForSignInthrough to card.SidebarRelayConnectionCard.tsx— accepts optionalisWaitingForSignInprop; shows "Waiting for sign-in" in title/description when set.AppSidebar.tsx— passesisWaitingForSignIntoSidebarRelayConnectionCard.e2eBridge.ts— addsrelay_reconnect_hook_configuredcase returningfalse.useReconnectRelay.test.mjs— 10 new unit tests covering all phases.OSS boundary: No WARP, Okta, or Cloudflare strings in any user-facing copy or runtime code path. The phase 2 escalation only fires when
relay_reconnect_hook_configuredreturns true (internal builds only). "Waiting for sign-in…" copy is generic and only shown in hook-configured builds.Test results
pnpm testpassing (includes 10 new tests)just desktop-check(biome + file-size gates) cleanjust desktop-typecheck(tsc) cleanrg -i "warp|okta|cloudflare" desktop/srcclean for all new/modified linesRelated: squareup/buzz-releases#48 — drops the vestigial ready probe from the baked config (depends on this PR)