Skip to content

fix(relay-reconnect): resilient reconnect with fast-path, escalation, and polling#1456

Merged
wpfleger96 merged 9 commits into
mainfrom
duncan/relay-reconnect-resilience
Jul 2, 2026
Merged

fix(relay-reconnect): resilient reconnect with fast-path, escalation, and polling#1456
wpfleger96 merged 9 commits into
mainfrom
duncan/relay-reconnect-resilience

Conversation

@wpfleger96

@wpfleger96 wpfleger96 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

The relay reconnect button had three compounding failure modes:

Symptom 1 — instant false-fail during Okta sign-in for Block-internal users: warp-cli debug access-reauth exits in ~0.1s (it just opens a browser tab). The ready probe matched Connected on the tunnel (which stays up during Access reauth), so the hook returned immediately — then preconnect() hit the still-blocked network and failed. The user saw a 'Reconnect failed' toast while still mid-login.

Symptom 2 — false-fail after actual reconnect: A single 15s Promise.race on preconnect() determined the verdict. Since preconnect() sets keepAliveRequested=true, the session's background retry loop kept running and often succeeded right after the toast fired — leaving messages flowing in with a 'failed' toast still on screen.

Symptom 3 — spurious browser tab on transient blips: The hook ran unconditionally. A brief wifi dropout that didn't require reauth would still run the internal release's hook and pop a Cloudflare Access tab.

All three mechanisms verified hands-on (see investigation thread).

Solution

Restructure useReconnectRelay into a three-phase strategy:

Phase 1 — fast path: Attempt relayClient.preconnect() with a short 4s timeout. For transient network blips where WARP is already healthy, this succeeds immediately — no hook fires, no browser tab, reconnect is nearly instant.

Phase 2 — escalation (hook-configured builds only): Only if the fast path fails and relay_reconnect_hook_configured() returns true, run the transport-layer hook steps. The hook returns fast (~0.1s); the browser tab opening is the async part of the work.

Phase 3 — poll-until-connected: Retry preconnect() every 3s and watch the connection-state emitter directly. The moment the relay is reachable (Okta completed, or network recovered), the next poll or emitter event declares success — success preempts the backstop regardless of how long login takes. A 120s backstop fires a soft toast() (not toast.error) if polling never succeeds; keepAliveRequested keeps the background retry loop alive after that.

Verdict is now state-driven (connection-state emitter at connected) rather than promise-driven.

Changes

Rust (OSS-clean, no Block-specific code):

  • Adds relay_reconnect_hook_configured() -> bool Tauri command — returns option_env!(BUZZ_DESKTOP_BUILD_RELAY_RECONNECT_CMD).is_some(). Registered in lib.rs. The existing relay_reconnect_hook command is unchanged.

TypeScript:

  • useReconnectRelay.ts — full redesign to three-phase strategy. Exposes isWaitingForSignIn: boolean for UX copy.
  • ConnectionBanner.tsx — shows "Waiting for sign-in…" button label while in phase 3 post-escalation.
  • useSidebarRelayConnectionCard.ts — threads isWaitingForSignIn through to card.
  • SidebarRelayConnectionCard.tsx — accepts optional isWaitingForSignIn prop; shows "Waiting for sign-in" in title/description when set.
  • AppSidebar.tsx — passes isWaitingForSignIn to SidebarRelayConnectionCard.
  • e2eBridge.ts — adds relay_reconnect_hook_configured case returning false.
  • useReconnectRelay.test.mjs — 10 new unit tests covering all phases.

OSS boundary: No WARP, Okta, or Cloudflare strings in any user-facing copy or runtime code path. The phase 2 escalation only fires when relay_reconnect_hook_configured returns true (internal builds only). "Waiting for sign-in…" copy is generic and only shown in hook-configured builds.

Test results

  • 1475/1475 pnpm test passing (includes 10 new tests)
  • just desktop-check (biome + file-size gates) clean
  • just desktop-typecheck (tsc) clean
  • No Block-specific strings in diff: rg -i "warp|okta|cloudflare" desktop/src clean for all new/modified lines

Related: squareup/buzz-releases#48 — drops the vestigial ready probe from the baked config (depends on this PR)

@wpfleger96 wpfleger96 force-pushed the duncan/relay-reconnect-resilience branch 2 times, most recently from b6de4e8 to 9215148 Compare July 2, 2026 15:10
… and polling

The reconnect button had three compounding failure modes:

1. Instant false-fail during sign-in: the transport-recovery hook ran and
   exited in ~0.1s, but the tunnel liveness check passed while the Access
   layer was still blocked, so preconnect() hit the unreachable network and
   failed immediately — the user saw a 'failed' toast mid-login.

2. False-fail after actual reconnect: a single promise race with a 15s cap
   determined the verdict. preconnect() set keepAliveRequested=true so the
   background retry loop kept going and often succeeded right after the
   toast fired, leaving the user confused.

3. Spurious browser tab on transient blips: the hook ran unconditionally
   even when the transport was healthy and the disconnect was a network flap.

New strategy — try-fast-first → escalate → poll-until-connected:

- Phase 1 (fast path): attempt preconnect() with a 4s cap. Succeeds
  immediately for transient blips — no hook fires, no browser tab opens.

- Phase 2 (escalation, hook-configured builds only): only if the fast path
  fails and relay_reconnect_hook_configured() returns true, run the
  transport-recovery hook. The hook returns quickly; any browser-based
  network sign-in it triggers runs asynchronously.

- Phase 3 (poll-until-connected): retry preconnect() every 3s and watch the
  connection-state emitter. The moment the relay is reachable the next poll
  or emitter event declares success — no waiting for a timer. A 120s backstop
  fires a soft toast only; keepAliveRequested keeps the background loop alive.

Extracts a RelayReconnectController singleton (plain TS, injected deps) so
all three consumers — ConnectionBanner, useSidebarRelayConnectionCard, and
ProfileStep — share one in-flight state via useSyncExternalStore-style
subscribe. No double browser tabs; consistent 'Waiting for sign-in' state
across surfaces. Cancellation token (generation counter) is checked after
every await and inside every async continuation, closing the gap where a
cancelled poll's .then() could still mutate state.

Adds relay_reconnect_hook_configured Rust command (OSS-clean, bool) so the
frontend skips escalation and 'Waiting for sign-in...' copy in OSS builds.
e2eBridge returns false for this command in test environments.

Tests rewritten to exercise the real RelayReconnectController class with
injected fakes — no simulation helpers.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 force-pushed the duncan/relay-reconnect-resilience branch from 9215148 to f42a301 Compare July 2, 2026 15:26
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 5 commits July 2, 2026 12:23
…t user click

connected state is now authoritative for recovery UI — a stale query
error no longer pins the reconnect card while the relay is healthy.

- useSidebarRelayConnectionCard: gate hasActiveRelayUnreachableError on
  !isRelayConnectionConnected so connected state clears the card even
  when channelsQuery.error has not yet been cleared
- ConnectionBanner: add state !== connected guard to
  hasCollapsedRelayError for the same reason
- RelayAutoHealScheduler: expand constructor parameter properties to
  explicit field declarations so node --experimental-strip-types can
  load the class in tests; initialize lastHealAt to -Infinity so the
  first recovery is never rate-limited
- Add useRelayAutoHeal.test.mjs covering all scheduler state transitions

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Replace stale sign-in wording in relay_reconnect_hook_configured() doc
comment with the same hook-agnostic copy used in the UI.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…-state gate

test 01 previously booted with channelsReadError and expected the card
to show while the relay was still connected — relying on the old behavior
where a stale query error alone could pin the card open.

After the connected-state authority change, the card is suppressed when
state === 'connected', even if channelsReadError is present. Drive the
relay to 'reconnecting' (which IS in isRelayConnectionStateDegraded) so
the card appears via the state path, independent of query-error timing.
Drop the connection-banner count 0 assertion: the banner correctly shows
for any degraded state, including reconnecting.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
sidebar-relay-card.spec.ts and relay-connectivity.spec.ts both tested
the old contract where a stale channelsReadError alone could pin the
relay card visible while connection state was "connected". The new gate
(hasActiveRelayUnreachableError && !isRelayConnectionConnected) correctly
suppresses stale errors when connected is authoritative.

Add __BUZZ_E2E_GET_RELAY_CONNECTION_STATE__ seam to e2eBridge.ts so
specs can observe the relay state before overriding it. Gate the seam
wait on the target state: when driving to a degraded state, wait for
the relay to reach "connected" first (prevents the async auth handshake
from racing back over the override); when driving to "connected",
just wait for the seam to be installed — no baseline required.

All 15 targeted E2E tests pass locally (6 smoke + 9 integration).

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Collapse waitForFunction arrow style to match biome's expected format
(remove trailing comma wrapping around the callback).

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 enabled auto-merge (squash) July 2, 2026 18:36

@wesbillman wesbillman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two blocking correctness issues:

  1. relayReconnectController.ts:197-220 — synchronous connection-state delivery leaks the phase-3 machinery after success. RelayClientSession.subscribeToConnectionState() immediately invokes its listener with the current state. If recovery lands after the fast-path failure/hook but before this subscription, the callback calls finish() synchronously, before unsubscribeConnectionState receives the returned cleanup and before the interval/backstop are installed. Execution then continues and installs both timers even though the controller already finished. The interval remains alive for the app lifetime (its callback only no-ops because resolved is true), and the connection-state subscription is never removed. I reproduced this against the production controller with a synchronously emitted connected: onSuccess=1, state idle, but one interval and one subscription remain. Subscribe/install timers in an order that cannot schedule work after synchronous success, and add this immediate-current-state case to the controller tests.

  2. ProfileStep.tsx:80-103 + relayReconnectController.ts:222-223 — onboarding never records an eventual phase-3 reconnect as success. start() intentionally returns false as soon as polling begins, and runConnectivityAction interprets exactly false as “do not call markSuccess().” The controller can later reconnect and clear its shared pending state, but this component has no connection-state subscription or completion callback, so its original save error card remains in the failure state (and becomes clickable again) while the relay is healthy. The sidebar has separate connection-state success handling; onboarding needs equivalent eventual-success handling or a completion contract from the controller.

Validation: local desktop unit suite passes 1496/1496; CI is green. The first issue was additionally reproduced with a focused production-controller script, so existing tests are missing the synchronous subscriber path rather than disproving it.

Comment thread desktop/src/shared/api/relayReconnectController.ts
Comment thread desktop/src/features/onboarding/ui/ProfileStep.tsx
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits July 2, 2026 14:56
Finding 1 (relayReconnectController.ts): subscribeToConnectionState can invoke
its listener synchronously with the current state before returning. When that
sync emission signals connected, onConnected() → finish() → cancelTimers() runs
while unsubscribeConnectionState is still null — then the caller assigns all
three handles after finish, leaking them for the app lifetime.

Fix: subscribe first, then check resolved immediately after the return value is
assigned. If resolved, unsubscribe the handle (which cancelTimers couldn't reach
because the assignment hadn't happened yet) and return before installing the
poll interval and backstop timer.

Finding 2 (ProfileStep.tsx): reconnect() always returns false in phase 3, so
runConnectivityAction never calls markSuccess(), and the onboarding component
had no connection-state subscription. The error card stayed showing failure
even after the relay healed.

Fix: observe the shared relayConnectivitySuccess store (already used by the
sidebar) via useSyncExternalStore + useEffect in OnboardingRelayConnectionErrorCard.
Export subscribeRelayConnectivitySuccess and getRelayConnectivitySuccessSnapshot
from useSidebarRelayConnectionCard.ts so both surfaces share exactly one
success-signalling mechanism.

Test: add a sync-emission unit test with a subscribeToConnectionState fake that
calls the listener before returning its cleanup handle — asserts onSuccess fires
once, no interval/backstop installed, cleanup called.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
ProfileStep was observing the sidebar's relayConnectivitySuccess store to
detect phase-3 reconnect success. That store is only ever produced by
useSidebarRelayConnectionCard, which never mounts during onboarding —
AppReady returns OnboardingFlow before the sidebar path renders.

Replace the dead cross-surface coupling with a local approach:
- Add useRelayConnection() to OnboardingRelayConnectionErrorCard
- Add hadActiveReconnectRef to gate markSuccess() on a genuine attempt
- Set the ref true when runConnectivityAction starts; clear it and call
  markSuccess() synchronously on phase-1 success, or leave it armed for
  the connection-state effect to fire on phase-3 connected transition
- Remove the two exports added to useSidebarRelayConnectionCard (no dead
  API surface)

Also fix the controller comment: the first comment incorrectly implied
cancelTimers() reached the subscription handle during sync emission. In
the assignment-order race, cancelTimers() ran before the handle was
assigned; cleanup happens in the if (resolved) block. Say that directly.

Add OnboardingRelayConnectionErrorCard.test.mjs: 5 tests covering the
guard contract — phase-3 async success, disconnected no-op, no-attempt
guard, phase-1 sync success idempotency, and double-connected no-op.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The OnboardingRelayConnectionErrorCard.test.mjs unit test was a mirror
state machine — it duplicated the component's ref/effect logic in the test
file itself, meaning it would pass even if ProfileStep.tsx regressed. Thufir
flagged this as not exercising the production path.

Replace with two E2E tests in onboarding.spec.ts that drive the real
component against the real relay state seam:

- 'click shows Connected then auto-dismisses': produces the error card via
  a relay-unreachable profile save error, drives relay to disconnected,
  clicks onboarding-reconnect-relay, drives relay to connected, and asserts
  the card transitions to 'Connected' then auto-dismisses. Exercises the full
  hadActiveReconnectRef guard and markSuccess() path through real code.

- 'connected without a prior click does not show Connected': same card setup,
  drives disconnected → connected with no click, asserts the card stays in
  error state. Guards against spurious success on background recovery without
  user action.

Also adds the setRelayConnectionState helper to onboarding.spec.ts, copied
from sidebar-relay-card.spec.ts (same seam, same bidirectional guard).

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 merged commit 02ff06c into main Jul 2, 2026
25 checks passed
@wpfleger96 wpfleger96 deleted the duncan/relay-reconnect-resilience branch July 2, 2026 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants