fix(perf): boot-stall instrumentation + dial hub early + stale-blob cleanup (0229) #280

Merged
crs48 merged 6 commits into main from claude/0229-the-migrating-18s-boot-stall-instrument-to-groun
Jun 26, 2026

@crs48 crs48 commented Jun 26, 2026

Implements the code items of exploration 0229 — the 3rd/4th look at the ~18s cold-start stall.

The finding

0227 "fixed" the stall by taking presence off the critical path — but the latest logs show it just migrated to the next storage call (offlineQueue.load), proving it's not any one caller: one operation monopolizes the single SQLite worker, and everything drains together. And the hub isn't slow — it connects in ~232ms and answers in ~190ms; it was just dialed 18s late because connection.connect() was sequenced after await offlineQueue.load(). We've guessed the root cause wrong several times because no log measured per-operation worker execution time.

What landed (4/8 checklist items — the codeable "now" set)

  • A — worker instrumentation (@xnetjs/sqlite): each scheduled op logs its queueMs vs execMs (head-of-line wait vs real SQL/OPFS cost), plus a one-shot db-stats line at open (file size, page/freelist counts, storage mode). Threaded via a new bootDebug open-config flag since workers can't read localStorage. This is the log split that will finally name the 18s op in one capture.
  • B — dial the hub early (@xnetjs/runtime): the offline-queue load now runs in the background so the WS handshake isn't serialized behind local storage; the connect-time drain re-runs once entries load, and stop() waits for the load before persisting. Directly fixes "remote sync is slow."
  • C — stale-blob cleanup (apps/web): a one-time, idle-scheduled DELETE FROM yjs_state WHERE node_id LIKE 'presence-%' + VACUUM (only when a row is removed) to reclaim the bloat 0227 left behind. Never touches the boot critical path.
  • Boot-timeline split: logBootTimeline now logs once per reason and the landing surface logs again at query:first-rows, so the residual time-to-first-paint stays visible now that the hub connects early.
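The queueMs vs execMs split in item A can be sketched roughly as below. This is a minimal illustration, not the @xnetjs/sqlite scheduler: `TimedScheduler`, `OpTiming`, and the injectable `now` clock are hypothetical names, and only `onOp`, `queueMs`, and `execMs` are taken from the description above. Ops run strictly one at a time, as they would on a single worker.

```typescript
// Hypothetical sketch: a serial op queue that reports, per op, how long it
// waited behind earlier ops (queueMs) and how long it actually ran (execMs).
type OpTiming = { label: string; queueMs: number; execMs: number };

class TimedScheduler {
  // Tail of the serial chain; it never rejects, so later ops always run.
  private tail: Promise<unknown> = Promise.resolve();
  private readonly onOp: (timing: OpTiming) => void;
  private readonly now: () => number;

  constructor(
    onOp: (timing: OpTiming) => void,
    now: () => number = () => performance.now(),
  ) {
    this.onOp = onOp;
    this.now = now;
  }

  schedule<T>(label: string, op: () => Promise<T>): Promise<T> {
    const enqueuedAt = this.now();
    const result = this.tail.then(() => {
      const startedAt = this.now();
      // Report timing whether the op resolves or rejects.
      return op().finally(() => {
        const finishedAt = this.now();
        this.onOp({
          label,
          queueMs: startedAt - enqueuedAt, // head-of-line wait
          execMs: finishedAt - startedAt, // real cost of this op
        });
      });
    });
    // Swallow the rejection on the chain only; the caller still sees it.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

With a split like this, an op stuck behind a slow neighbour shows a large queueMs and a small execMs, while the op that actually monopolizes the worker shows the reverse.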

New tests: scheduler onOp timing + coalesced-no-double-report; sync-manager connect-before-load drain; boot-timeline per-reason; presence-blob-cleanup gating/VACUUM/latch. Typecheck and test suites are green for runtime, sqlite, and apps/web.

Deliberately deferred (per the doc's measure-first thesis)

  • A capture — a live cold boot with xnet:boot:debug on to record which op shows the 18s execMs. Runtime diagnostic, not code.
  • D (cache_size/mmap) and E (prewarm) — explicitly gated on that capture; shipping them now would be another guess.
  • Hub redeploy — ops action for the INVALID_HASH skew (0224).

🤖 Generated with Claude Code

xNet Test and others added 6 commits June 25, 2026 19:28
…to ground truth

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The offline-queue load was awaited before connection.connect(), so when the
single SQLite worker is stalled at boot that load (~18s) delayed the hub
handshake by the same amount even though the hub answers in ~200ms (exploration
0229). The queue now loads in the background; the connect-time drain re-runs once
entries are loaded, and stop() waits for the load before persisting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
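A rough sketch of the ordering this commit describes, under stated assumptions: the `Connection` and `OfflineQueue` interfaces and the `SyncManagerSketch` class are hypothetical stand-ins, not the @xnetjs/runtime types. It shows only the three behaviours named above: connect without awaiting the load, re-run the drain once entries arrive, and have stop() wait for the load before persisting. Error handling for a failed load is left out.

```typescript
// Hypothetical interfaces; the real runtime types are not shown in this PR.
interface Connection {
  connect(): Promise<void>;
  send(entry: string): void;
}
interface OfflineQueue {
  load(): Promise<string[]>;
  persist(entries: string[]): Promise<void>;
}

class SyncManagerSketch {
  private entries: string[] = [];
  private connected = false;
  private loadDone: Promise<void> = Promise.resolve();
  private readonly connection: Connection;
  private readonly queue: OfflineQueue;

  constructor(connection: Connection, queue: OfflineQueue) {
    this.connection = connection;
    this.queue = queue;
  }

  async start(): Promise<void> {
    // Start the load but do not await it, so slow local storage cannot
    // delay the hub handshake.
    this.loadDone = this.queue.load().then((loaded) => {
      this.entries.push(...loaded);
      // If the hub is already up, re-run the connect-time drain.
      if (this.connected) this.drain();
    });
    await this.connection.connect();
    this.connected = true;
    this.drain(); // a no-op while the load is still pending
  }

  private drain(): void {
    // splice(0) empties the queue and returns what was in it.
    for (const entry of this.entries.splice(0)) this.connection.send(entry);
  }

  async stop(): Promise<void> {
    await this.loadDone; // avoid persisting a queue that is only half loaded
    await this.queue.persist(this.entries);
  }
}
```

The trade-off is that the first drain can run against an empty queue, which is why the load callback has to trigger a second one.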
…tall

Adds boot-debug-gated diagnostics inside the SQLite worker (exploration 0229):
each scheduled op logs its queue-wait vs execution time (the split that finally
separates head-of-line queueing from real SQL/OPFS cost), plus a one-shot
db-stats line at open (file size, page/freelist counts, storage mode). Threaded
via a bootDebug open-config flag since workers can't read localStorage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nect

The hub now connects early (0229 item B), so a single boot-timeline log at
hub:connected would miss the residual time-to-first-paint. logBootTimeline now
logs once per distinct reason and the landing surface logs again at
query:first-rows, keeping firstPaint visible so a future storage stall can't hide
inside the connect phase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
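The once-per-reason behaviour could look something like the sketch below. `logBootTimeline` is the name used in this PR, but the factory `createBootTimelineLogger`, the `BootMarks` shape, and the log line format are assumptions made for illustration.

```typescript
// Hypothetical sketch: log the boot timeline at most once per distinct reason.
type BootMarks = Record<string, number>; // mark name -> ms since boot start

function createBootTimelineLogger(log: (line: string) => void) {
  const loggedReasons = new Set<string>();

  // Returns true when a line was emitted, false when the reason was a repeat.
  return function logBootTimeline(reason: string, marks: BootMarks): boolean {
    if (loggedReasons.has(reason)) return false;
    loggedReasons.add(reason);
    const parts = Object.entries(marks).map(
      ([name, ms]) => `${name}=${Math.round(ms)}ms`,
    );
    log(`[boot] ${reason} ${parts.join(" ")}`);
    return true;
  };
}
```

Keying the latch on the reason rather than using a single boolean is what lets hub:connected and query:first-rows each produce their own line.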
0227 stopped writing the gc:false presence Yjs doc but never deleted the
existing blob, which still bloats the OPFS xnet.db file and raises every cold
read (exploration 0229). Deletes presence-* yjs_state rows once per origin and
VACUUMs only when a row was removed, scheduled on requestIdleCallback so the
heavy VACUUM never lands on the boot critical path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
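A minimal sketch of the gating described in this commit, assuming a database handle that reports the number of changed rows and a localStorage-like latch. `SqlDb`, `Latch`, `LATCH_KEY`, `cleanupPresenceBlobs`, and `scheduleWhenIdle` are hypothetical names; only the SQL, the VACUUM-when-a-row-was-removed rule, and the use of requestIdleCallback come from the text above.

```typescript
// Hypothetical interfaces standing in for the app's storage handle and latch.
interface SqlDb {
  run(sql: string): Promise<{ changes: number }>;
}
interface Latch {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const LATCH_KEY = "xnet:presence-blob-cleanup:done"; // assumed key name

type CleanupResult = "skipped" | "clean" | "vacuumed";

async function cleanupPresenceBlobs(db: SqlDb, latch: Latch): Promise<CleanupResult> {
  // One-time latch: localStorage is per origin, so this runs once per origin.
  if (latch.getItem(LATCH_KEY) !== null) return "skipped";
  const { changes } = await db.run(
    "DELETE FROM yjs_state WHERE node_id LIKE 'presence-%'",
  );
  // VACUUM rewrites the whole file, so only pay for it when rows were removed.
  if (changes > 0) await db.run("VACUUM");
  latch.setItem(LATCH_KEY, "1");
  return changes > 0 ? "vacuumed" : "clean";
}

function scheduleWhenIdle(task: () => void): void {
  const g = globalThis as unknown as {
    requestIdleCallback?: (cb: () => void) => unknown;
  };
  // Fall back to a zero-delay timeout where requestIdleCallback is missing.
  if (typeof g.requestIdleCallback === "function") g.requestIdleCallback(() => task());
  else setTimeout(task, 0);
}
```

A caller would wrap the cleanup as `scheduleWhenIdle(() => void cleanupPresenceBlobs(db, localStorage))`. Setting the latch only after both statements finish means an interrupted run is retried on the next boot.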
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@crs48 crs48 temporarily deployed to pr-280 June 26, 2026 02:50 — with GitHub Actions Inactive
@github-actions

🖼️ UI changes in this PR

Interactions

🎬 Create a page and use the editor

▶ Watch MP4

Auto-captured by CI · run. Informational — not a blocking check.

github-actions Bot added a commit that referenced this pull request Jun 26, 2026
github-actions Bot commented Jun 26, 2026

Preview removed for PR #280.

github-actions Bot added a commit that referenced this pull request Jun 26, 2026
@crs48 crs48 merged commit 4f52aef into main Jun 26, 2026
15 of 16 checks passed
@crs48 crs48 deleted the claude/0229-the-migrating-18s-boot-stall-instrument-to-groun branch June 26, 2026 03:06
github-actions Bot added a commit that referenced this pull request Jun 26, 2026