fix(perf): boot-stall instrumentation + dial hub early + stale-blob cleanup (0229) #280
Merged
crs48 merged 6 commits on Jun 26, 2026
…to ground truth Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The offline-queue load was awaited before connection.connect(), so when the single SQLite worker is stalled at boot that load (~18s) delayed the hub handshake by the same amount even though the hub answers in ~200ms (exploration 0229). The queue now loads in the background; the connect-time drain re-runs once entries are loaded, and stop() waits for the load before persisting. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
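The reordering in this commit can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual code: `SyncManagerSketch`, `QueueLike`, and `ConnectionLike` are hypothetical names, and error handling for a failed load is omitted.

```typescript
// Hypothetical shapes; the real runtime types are not shown in this PR.
interface QueueLike {
  load(): Promise<string[]>; // resolves with the persisted offline entries
  persist(entries: string[]): Promise<void>;
}
interface ConnectionLike {
  connect(): Promise<void>;
  send(entry: string): void;
}

class SyncManagerSketch {
  private readonly queue: QueueLike;
  private readonly connection: ConnectionLike;
  private entries: string[] = [];
  private connected = false;
  private loadDone: Promise<void> = Promise.resolve();

  constructor(queue: QueueLike, connection: ConnectionLike) {
    this.queue = queue;
    this.connection = connection;
  }

  async start(): Promise<void> {
    // Start the load, but do not await it before dialing the hub.
    this.loadDone = this.queue.load().then((loaded) => {
      this.entries.push(...loaded);
      // Re-run the drain: the connect-time drain may have seen an empty queue.
      this.drain();
    });
    await this.connection.connect();
    this.connected = true;
    this.drain();
  }

  private drain(): void {
    if (!this.connected) return;
    for (const entry of this.entries.splice(0)) this.connection.send(entry);
  }

  async stop(): Promise<void> {
    // Wait for the load so persisting cannot overwrite entries not yet read.
    await this.loadDone;
    this.connected = false;
    await this.queue.persist(this.entries);
  }
}
```

With this shape, a slow `load()` delays only the drain of queued entries, not the handshake itself.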
…tall Adds boot-debug-gated diagnostics inside the SQLite worker (exploration 0229): each scheduled op logs its queue-wait vs execution time (the split that finally separates head-of-line queueing from real SQL/OPFS cost), plus a one-shot db-stats line at open (file size, page/freelist counts, storage mode). Threaded via a bootDebug open-config flag since workers can't read localStorage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
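The queue-wait vs execution split can be illustrated with a small serial scheduler. This is a sketch under assumptions: `SerialSchedulerSketch` and `OpTiming` are hypothetical, and only the `queueMs`/`execMs` field names and the `onOp` callback name are taken from this PR's description.

```typescript
// Hypothetical report shape; field names follow the PR text (queueMs / execMs).
interface OpTiming {
  label: string;
  queueMs: number; // time spent waiting behind earlier ops
  execMs: number; // time spent actually running
}

class SerialSchedulerSketch {
  private readonly onOp: (t: OpTiming) => void;
  private readonly now: () => number;
  private tail: Promise<unknown> = Promise.resolve();

  constructor(onOp: (t: OpTiming) => void, now: () => number = () => Date.now()) {
    this.onOp = onOp;
    this.now = now;
  }

  schedule<T>(label: string, op: () => Promise<T> | T): Promise<T> {
    const enqueuedAt = this.now();
    const run = async (): Promise<T> => {
      const startedAt = this.now();
      try {
        return await op();
      } finally {
        // Report both halves so head-of-line waiting is not mistaken for slow SQL.
        this.onOp({
          label,
          queueMs: startedAt - enqueuedAt,
          execMs: this.now() - startedAt,
        });
      }
    };
    // Chain onto the tail so ops run one at a time, even after a failure.
    const result = this.tail.then(run);
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

In this sketch, a cheap op queued behind an 18s op reports a large `queueMs` and a small `execMs`, while the 18s op itself reports the large `execMs`.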
…nect The hub now connects early (0229 item B), so a single boot-timeline log at hub:connected would miss the residual time-to-first-paint. logBootTimeline now logs once per distinct reason and the landing surface logs again at query:first-rows, keeping firstPaint visible so a future storage stall can't hide inside the connect phase. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
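A once-per-reason gate like the one described here might look like the sketch below. The real `logBootTimeline` signature is not shown in this PR, so the factory `createBootTimelineLogger`, the `marks` argument, and the log line format are assumptions for illustration.

```typescript
// Hypothetical helper; only the function name logBootTimeline and the
// once-per-distinct-reason behavior come from the commit message.
function createBootTimelineLogger(log: (line: string) => void) {
  const seen = new Set<string>();
  return function logBootTimeline(
    reason: string,
    marks: Record<string, number>,
  ): boolean {
    if (seen.has(reason)) return false; // already logged for this reason
    seen.add(reason);
    log(`[boot] ${reason} ${JSON.stringify(marks)}`);
    return true;
  };
}
```

Keying the gate on the reason, rather than using a single boolean, is what lets a later `query:first-rows` call log again after `hub:connected` already has.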
0227 stopped writing the gc:false presence Yjs doc but never deleted the existing blob, which still bloats the OPFS xnet.db file and raises every cold read (exploration 0229). Deletes presence-* yjs_state rows once per origin and VACUUMs only when a row was removed, scheduled on requestIdleCallback so the heavy VACUUM never lands on the boot critical path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
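The gating described in this commit could be sketched as below. Only the SQL statements come from the PR description; `DbLike`, `LatchLike`, `LATCH_KEY`, and both function names are hypothetical, and the fallback to `setTimeout` where `requestIdleCallback` is unavailable is an assumption.

```typescript
// Hypothetical interfaces standing in for the SQLite client and a per-origin store.
interface DbLike {
  run(sql: string): Promise<{ changes: number }>;
}
interface LatchLike {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

const LATCH_KEY = "presence-blob-cleanup-done"; // hypothetical key name

async function cleanupPresenceBlobs(
  db: DbLike,
  latch: LatchLike,
): Promise<"skipped" | "deleted" | "nothing"> {
  // Once-per-origin latch: later boots skip the work entirely.
  if (latch.get(LATCH_KEY) === "1") return "skipped";
  const { changes } = await db.run(
    "DELETE FROM yjs_state WHERE node_id LIKE 'presence-%'",
  );
  // VACUUM rewrites the whole file, so run it only if something was removed.
  if (changes > 0) await db.run("VACUUM");
  latch.set(LATCH_KEY, "1");
  return changes > 0 ? "deleted" : "nothing";
}

function scheduleWhenIdle(task: () => void): void {
  const g = globalThis as unknown as {
    requestIdleCallback?: (cb: () => void) => number;
  };
  if (typeof g.requestIdleCallback === "function") g.requestIdleCallback(task);
  else setTimeout(task, 0); // assumed fallback, not taken from the PR
}
```

A caller would wrap the cleanup as `scheduleWhenIdle(() => void cleanupPresenceBlobs(db, latch))`. The latch is set only after the statements succeed, so a failed attempt is retried on a later boot.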
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements the code items of exploration 0229 — the 3rd/4th look at the ~18s cold-start stall.
The finding
0227 "fixed" the stall by taking presence off the critical path, but the latest logs show it just migrated to the next storage call (`offlineQueue.load`), proving it's not any one caller: one operation monopolizes the single SQLite worker, and everything drains together. And the hub isn't slow: it connects in ~232ms and answers in ~190ms; it was just dialed 18s late because `connection.connect()` was sequenced after `await offlineQueue.load()`. We've guessed the root cause wrong several times because no log measured per-operation worker execution time.
What landed (4/8 checklist items, the codeable "now" set)
- `@xnetjs/sqlite`: each scheduled op logs its `queueMs` vs `execMs` (head-of-line wait vs real SQL/OPFS cost), plus a one-shot db-stats line at open (file size, page/freelist counts, storage mode). Threaded via a new `bootDebug` open-config flag since workers can't read `localStorage`. This is the log split that will finally name the 18s op in one capture.
- `@xnetjs/runtime`: the offline-queue load now runs in the background so the WS handshake isn't serialized behind local storage; the connect-time drain re-runs once entries load, and `stop()` waits for the load before persisting. Directly fixes "remote sync is slow."
- `apps/web`: a one-time, idle-scheduled `DELETE FROM yjs_state WHERE node_id LIKE 'presence-%'` + `VACUUM` (only when a row is removed) to reclaim the bloat 0227 left behind. Never touches the boot critical path.
- `logBootTimeline` now logs once per reason and the landing surface logs again at `query:first-rows`, so the residual time-to-first-paint stays visible now that the hub connects early.
New tests: scheduler `onOp` timing + coalesced-no-double-report; sync-manager connect-before-load drain; boot-timeline per-reason; presence-blob-cleanup gating/VACUUM/latch. runtime/sqlite/apps/web typecheck + suites green.
Deliberately deferred (per the doc's measure-first thesis)
- … `xnet:boot:debug` on to record which op shows the 18s `execMs`. Runtime diagnostic, not code.
- … `INVALID_HASH` skew (0224).
🤖 Generated with Claude Code