Skip to content

fix(realtime): reconnect WebSocket and catch up on missed events after iOS Safari backgrounding#4590

Open
ekumanov wants to merge 2 commits intoflarum:2.xfrom
ekumanov:fix/realtime-ios-reconnect
Open

fix(realtime): reconnect WebSocket and catch up on missed events after iOS Safari backgrounding#4590
ekumanov wants to merge 2 commits intoflarum:2.xfrom
ekumanov:fix/realtime-ios-reconnect

Conversation

@ekumanov
Copy link
Copy Markdown

@ekumanov ekumanov commented Apr 19, 2026

Closes #4588.

What

Add visibilitychange + pageshow handlers in realtime/js/src/forum/extend/Application.ts that force a WebSocket reconnect after iOS Safari backgrounding, and — once the new connection is live — refresh the visible discussions list to catch up on events that fired while the socket was dead.

Why

See the linked issue for the full diagnosis. Two compounding iOS Safari behaviors produce the same "realtime is broken after switching apps" symptom, and both need handling:

Plus one iOS-specific return path that a visibilitychange-only fix misses: iOS bfcaches pages on app-switch. Return via bfcache restores the page with pageshow (persisted: true) and does NOT fire visibilitychange. So both events need to be hooked.

How

const forceReconnect = (): void => {
  if (!app.websocket) return;
  const connection = (app.websocket as any).connection;

  const onReconnected = (): void => {
    connection?.unbind('connected', onReconnected);
    (app as any).discussions?.refresh?.();
  };
  connection?.bind('connected', onReconnected);

  app.websocket.disconnect();
  setTimeout(() => app.websocket?.connect(), 100);
};

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') { hiddenSince = Date.now(); return; }
  if (hiddenSince === null) return;
  const wasHiddenFor = Date.now() - hiddenSince;
  hiddenSince = null;
  if (wasHiddenFor > 5_000) forceReconnect();
});

window.addEventListener('pageshow', (event) => {
  if (event.persisted) forceReconnect();
});

Design choices

  • 5 s hidden threshold on visibilitychange. Avoids churning the socket on brief app-switches; in practice iOS doesn't kill a socket that quickly.
  • Also pageshow with persisted: true. Covers the bfcache-restore return path that visibilitychange alone misses on iOS.
  • disconnect() then connect() with a 100 ms gap. pusher-js will no-op connect() when it still believes the socket is alive (the exact symptom we're working around) — an explicit teardown is required. The 100 ms gap prevents a "connecting zombie" race where the connect() fires while the previous teardown is still in flight. Channels are preserved in pusher-js's internal map and automatically re-subscribed on the new socket.
  • discussions.refresh() on 'connected', not immediately after connect(). This was the critical fix — an earlier iteration that called refresh() synchronously after connect() regressed scenario 1 (new pushes stopped arriving to the newly-reconnected socket). The immediate refresh triggers a Mithril redraw that race-conditions with pusher-js's subscribeAll() on the fresh socket, sometimes leaving the client bound but receiving no events. Gating on the connection's 'connected' event ensures the refresh only fires after resubscription has completed, which is race-free.
  • No enabledTransports change. Leaving ['wss', 'ws'] alone — HTTP polling fallback would mask the real reconnect and isn't needed once visibility-driven recovery is in place.
  • Listeners installed once, at Application.mount(). Comment in-code documents that mount() runs once per page load, so there's no teardown path needed for the event listeners.

Scope

  • One source file changed (extensions/realtime/js/src/forum/extend/Application.ts, +55 / -0).
  • No new dependencies; no API changes.
  • Bundled output committed in a follow-up commit, matching this repo's convention.
  • Catch-up refresh covers the discussion list only (via app.discussions.refresh()). The single-discussion view ("I'm reading a thread, switch apps, return — show me posts that appeared while I was away") is explicitly out of scope for v1; it needs a stream-level re-hydration that is a different code path and can land in a follow-up.

Testing

  • yarn build in extensions/realtime/js/ succeeds.
  • Tests have been added, or are not appropriate here — this is a runtime fix for a mobile-browser behavior (iOS Safari WebSocket teardown + bfcache return), which has no practical headless/unit-test surface. Verified on real hardware instead; see below.
  • Real-device verified on iPhone Safari against a v2.0.0-beta.8 staging forum:
    • Scenario 1 (app-switch → return → post from another device → expect push): ✅ push arrives.
    • Scenario 2 (app-switch → post from another device while away → return): ✅ on return, discussions.refresh() fires after the new socket reconnects and the missed content appears.
    • bfcache path specifically verified: returning to the page via app-switch (which fires pageshow persisted=true, not visibilitychange) triggers reconnect.
  • Desktop browsers: behaves as before; the 5 s threshold means desktop tab-switches under that duration are unaffected; longer absences reconnect silently.

Iteration history (for reviewer context, can collapse)

Why the code looks the way it does
  • v1: Just disconnect() + connect() on visibilitychange. Fixed scenario 1 but missed scenario 2 (events posted while away were gone) and missed bfcache returns.
  • v2: Added a non-conditional app.discussions.refresh() immediately after connect(). Fixed scenario 2.
  • v3: Added pageshow handler for bfcache.
  • v4 (this PR): Regression from v2/v3 — scenario 1 broke (new pushes stopped arriving). Root cause: the immediate refresh() triggered a Mithril redraw that race-conditioned with pusher-js's subscribeAll() on the new socket. Moved refresh() into an onReconnected callback bound to the 'connected' event on the connection object, so it only fires after re-subscription. All three behaviors now work.

Marking ready for review.

@ekumanov ekumanov force-pushed the fix/realtime-ios-reconnect branch from bd5b0b2 to 3fde090 Compare April 19, 2026 19:38
@ekumanov ekumanov changed the title fix(realtime): reconnect WebSocket when tab regains visibility after long hide fix(realtime): reconnect WebSocket and catch up on missed events after iOS Safari backgrounding Apr 19, 2026
@ekumanov ekumanov marked this pull request as ready for review April 19, 2026 19:38
@ekumanov ekumanov requested a review from a team as a code owner April 19, 2026 19:38
Copy link
Copy Markdown
Member

@imorland imorland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough write-up and the real-device verification — the visibility/pageshow reconnect logic is well-reasoned and addresses a real iOS Safari issue. I'd like to see this land, but the PR as it currently stands has one significant issue and a few smaller ones that should be addressed first, especially given we're in 2.0.0-rc territory.

Main concern: undisclosed second change

The PR description covers the forceReconnect + visibilitychange + pageshow logic clearly. But commit a2e2c8d ("neutralise pusher-js WebSocket lives kill-switch") adds a second, more invasive change — patchPusherTransportManagers — which walks the pusher-js object graph up to 10 levels deep, duck-types TransportManager instances, and mutates their internals to disable the transport-death budget. This isn't mentioned in the PR body at all.

Concerns with this part:

  1. Reaches into pusher-js private API. reportDeath, isAlive, livesLeft are internal to pusher-js's transport strategy, not public API. A patch-level pusher-js upgrade that renames or restructures these will silently stop matching the duck-type detector, at which point the iOS reconnect regresses with no loud failure.
  2. The kill-switch exists for a reason. pusher-js's lives: 2 budget lets the client stop hammering a genuinely broken transport (bad wsHost, blocked by network middlebox, TLS failure). Pinning livesLeft = Infinity disables that for every failure mode, not just iOS backgrounding. The justification "our app layer owns reconnect end to end" doesn't quite hold — the app layer only reconnects on visibility/pageshow events, not on genuine transport failures.
  3. Indiscriminate walk. patchPusherTransportManagers patches anything that duck-types correctly, to depth 10. Low probability of collateral damage, but still a footgun.
  4. No reproduction steps. The issue describes scenarios 1 and 2, neither of which exercises the lives-budget path. If the "code 1006 unclean close on the same transport" behaviour the comment describes is real and reproducible, it deserves its own issue + repro + test.
  5. RC risk. The visibility/pageshow logic is surgical and easy to reason about. The transport-manager patch changes pusher-js failure behaviour globally via runtime introspection — not something I'd want to ship on an RC without more bake time.

Suggestion: split this into two PRs. Land the forceReconnect + visibility/pageshow handlers for rc.2 (clear win, well-tested, scoped to the reported scenarios). Open a separate PR for the transport-manager patch with its own repro steps, ideally targeting post-GA. If the lives-budget issue is genuinely blocking real users, consider a less invasive fix first — e.g. constructing a new Pusher instance in forceReconnect() rather than reusing the existing one, which sidesteps the private-field mutation entirely.

Smaller notes on the described change (non-blocking)

  • Listener cleanup. The visibilitychange and pageshow handlers are added inside Application.prototype.mount extension with no removeEventListener. On a normal page load this is fine, but a second mount() call (test environments, any future SPA re-bootstrap) will stack handlers. Worth a one-line comment noting the assumption.
  • Refresh scope. app.discussions?.refresh?.() covers the index page, but users viewing a single discussion won't see new posts via this path until they navigate. That's an acceptable v1 scope, but worth documenting as a known limitation so a follow-up can broaden it to whichever PaginatedListState is currently visible.
  • Checklist — please tick "tests have been added, or are not appropriate here" if tests aren't appropriate, and confirm composer test status (even though this is JS-only, the box is on the template).

Recommendation

  • Split the PR; keep the visibility/pageshow reconnect here.
  • Open a separate PR for the transport-manager patch with repro steps and justification.
  • Once split, I'm happy to approve the reconnect half for rc.2.

@ekumanov ekumanov force-pushed the fix/realtime-ios-reconnect branch from e022ca0 to d5ff3f4 Compare April 20, 2026 21:44
@ekumanov
Copy link
Copy Markdown
Author

Thanks for the detailed review — all points are fair, and I agree with the split.

I've force-pushed fix/realtime-ios-reconnect back to just the visibility/pageshow reconnect: the a2e2c8d transport-manager patch and its bundle are gone, the branch is now a single source commit + bundle covering only forceReconnect + visibilitychange + pageshow. New HEAD is d5ff3f4.

On the smaller notes:

  • Listener cleanup. Added an in-code comment noting that Application.mount() runs once per page load, so the handlers are installed once and don't need a teardown path. Line is next to the addEventListener calls so the assumption is visible at the point it matters.
  • Refresh scope. Documented in the PR body (new bullet in ### Scope) — app.discussions.refresh() covers the index only; the single-discussion view needs a stream-level re-hydration which is a different code path, explicitly v1-out-of-scope for a follow-up.
  • Checklist. Ticked "tests have been added, or are not appropriate here" with a note that this is a mobile-browser behavior fix (iOS Safari WebSocket teardown + bfcache) without a practical headless/unit-test surface; verification is real-device on a staging forum.

For the transport-manager patch: you're right that lives-budget depletion is a separate scenario that isn't exercised by #4588's repro, and the fix as written is too blunt for an RC. I'll open a follow-up issue with staging-log excerpts from the cycle-2+ regression I hit during testing (which is how I found the lives: 2 kill-switch in the first place), and explore the "new Pusher in forceReconnect" approach you suggested — that's a much cleaner angle than the runtime walk. I'll link it back here once it's filed.

Ready for another look when you have a moment.

@ekumanov
Copy link
Copy Markdown
Author

Phase-2 issue filed: #4597 — includes the cycle-2 repro, pusher-js default_strategy.ts reference, staging logs, and a sketch of the Option A refactor you suggested.

…r iOS Safari backgrounding

iOS Safari silently drops WebSocket connections when the tab is backgrounded
or the device sleeps; no `close` event fires, so pusher-js's built-in recovery
never triggers and realtime updates go missing until reload. Additionally, iOS
bfcaches pages on app-switch, which restores via `pageshow` (persisted=true)
and does NOT fire `visibilitychange` on return — so handlers bound only to
visibilitychange miss the common app-switch path.

Add a `forceReconnect()` that hooks both `visibilitychange` (after >5s hidden)
and `pageshow` (bfcache). It disconnects, waits 100ms to avoid a connecting-
zombie race in pusher-js, then reconnects. After the new connection reports
`'connected'` (i.e. after pusher-js's `subscribeAll()` has re-subscribed the
channels on the fresh socket), it calls `app.discussions.refresh()` so the
visible list catches up on events that fired while the socket was dead —
Pusher is fire-and-forget with no server-side buffering.

Refresh must be gated on the `'connected'` event rather than fired immediately
after `connect()`: an immediate refresh triggers a Mithril redraw that races
with pusher-js's channel resubscription on the new socket and leaves the
client receiving no further push events.

Fixes flarum#4588.
Includes transpiled JS/TS.
@ekumanov ekumanov force-pushed the fix/realtime-ios-reconnect branch from d5ff3f4 to 84ef143 Compare April 22, 2026 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flarum/realtime: iOS Safari — realtime stops after backgrounding (silent socket death + missed-while-away events)

2 participants