fix(perf): boot-stall instrumentation + dial hub early + stale-blob cleanup (0229) by crs48 · Pull Request #280 · crs48/xNet

crs48 · 2026-06-26T02:50:52Z

Implements the code items of exploration 0229 — the 3rd/4th look at the ~18s cold-start stall.

The finding

0227 "fixed" the stall by taking presence off the critical path, but the latest logs show the stall simply migrated to the next storage call (offlineQueue.load). That proves no single caller is at fault: one operation monopolizes the single SQLite worker, and everything queued behind it drains together. The hub isn't slow either: it connects in ~232ms and answers in ~190ms. It was dialed 18s late only because connection.connect() was sequenced after await offlineQueue.load(). We've guessed the root cause wrong several times because no log measured per-operation worker execution time.

What landed (4/8 checklist items — the codeable "now" set)

A — worker instrumentation (@xnetjs/sqlite): each scheduled op logs its queueMs vs execMs (head-of-line wait vs real SQL/OPFS cost), plus a one-shot db-stats line at open (file size, page/freelist counts, storage mode). Threaded via a new bootDebug open-config flag since workers can't read localStorage. This is the log split that will finally name the 18s op in one capture.
B — dial the hub early (@xnetjs/runtime): the offline-queue load now runs in the background so the WS handshake isn't serialized behind local storage; the connect-time drain re-runs once entries load, and stop() waits for the load before persisting. Directly fixes "remote sync is slow."
C — stale-blob cleanup (apps/web): a one-time, idle-scheduled DELETE FROM yjs_state WHERE node_id LIKE 'presence-%' + VACUUM (only when a row is removed) to reclaim the bloat 0227 left behind. Never touches the boot critical path.
Boot-timeline split: logBootTimeline now logs once per reason and the landing surface logs again at query:first-rows, so the residual time-to-first-paint stays visible now that the hub connects early.
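
For the one-shot db-stats line in item A, a rough sketch of how the numbers relate may help when reading a capture. It assumes the stats come from SQLite's `PRAGMA page_size`, `PRAGMA page_count`, and `PRAGMA freelist_count`; the `DbStatsInput` shape, the `formatDbStats` name, and the log format are hypothetical and not taken from `@xnetjs/sqlite`.

```typescript
// Hypothetical input shape: values as SQLite's page_size, page_count and
// freelist_count pragmas would report them, plus a storage-mode label.
interface DbStatsInput {
  pageSize: number
  pageCount: number
  freelistCount: number
  storageMode: string
}

// Formats a single stats line. The main database file is roughly
// pageSize * pageCount bytes; freelist pages are allocated but unused,
// so they approximate what a VACUUM could reclaim.
function formatDbStats(stats: DbStatsInput): string {
  const fileBytes = stats.pageSize * stats.pageCount
  const freeBytes = stats.pageSize * stats.freelistCount
  const freePct =
    stats.pageCount === 0 ? 0 : Math.round((stats.freelistCount / stats.pageCount) * 100)
  return (
    `[db-stats] mode=${stats.storageMode} fileBytes=${fileBytes} ` +
    `pages=${stats.pageCount} freelist=${stats.freelistCount} ` +
    `(${freePct}% reclaimable, ${freeBytes} bytes)`
  )
}

// Illustrative values only, not measurements from this PR.
const dbStatsLine = formatDbStats({
  pageSize: 4096,
  pageCount: 1000,
  freelistCount: 250,
  storageMode: 'opfs',
})
```

A large freelist share would suggest the file is bloated by deleted rows, which is the case item C targets.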

New tests: scheduler onOp timing + coalesced-no-double-report; sync-manager connect-before-load drain; boot-timeline per-reason; presence-blob-cleanup gating/VACUUM/latch. runtime/sqlite/apps/web typecheck + suites green.

Deliberately deferred (per the doc's measure-first thesis)

A capture — a live cold boot with xnet:boot:debug on to record which op shows the 18s execMs. Runtime diagnostic, not code.
D (cache_size/mmap) and E (prewarm) — explicitly gated on that capture; shipping them now would be another guess.
Hub redeploy — ops action for the INVALID_HASH skew (0224).

🤖 Generated with Claude Code

The offline-queue load was awaited before connection.connect(), so when the single SQLite worker is stalled at boot that load (~18s) delayed the hub handshake by the same amount even though the hub answers in ~200ms (exploration 0229). The queue now loads in the background; the connect-time drain re-runs once entries are loaded, and stop() waits for the load before persisting. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
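
The reordering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `@xnetjs/runtime` code: the `Connection` and `OfflineQueue` interfaces and the `SyncManagerSketch` class are hypothetical stand-ins, and queue entries are modeled as plain strings.

```typescript
// Hypothetical minimal interfaces for illustration only.
interface Connection {
  connect(): Promise<void>
  send(entry: string): void
}
interface OfflineQueue {
  load(): Promise<string[]>
  persist(entries: string[]): Promise<void>
}

class SyncManagerSketch {
  private entries: string[] = []
  private connected = false
  private loadPromise: Promise<void> | null = null

  constructor(
    private readonly connection: Connection,
    private readonly queue: OfflineQueue
  ) {}

  async start(): Promise<void> {
    // Start the load but do NOT await it, so a stalled storage worker
    // cannot delay the handshake below.
    this.loadPromise = this.queue
      .load()
      .then((loaded) => {
        this.entries.push(...loaded)
        // Re-run the drain if the connection came up before the load finished.
        if (this.connected) this.drain()
      })
      .catch((err) => {
        console.warn('offline queue load failed', err)
      })

    await this.connection.connect()
    this.connected = true
    this.drain() // No-op if nothing has loaded yet.
  }

  // Sends every entry currently held and empties the in-memory list.
  private drain(): void {
    for (const entry of this.entries.splice(0)) this.connection.send(entry)
  }

  async stop(): Promise<void> {
    // Wait for the background load so persisting cannot clobber unloaded entries.
    if (this.loadPromise) await this.loadPromise
    await this.queue.persist(this.entries)
  }
}
```

With this shape, the time to connect depends only on the network handshake; the cost of the load shows up later as a delayed drain rather than a delayed dial.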

…tall Adds boot-debug-gated diagnostics inside the SQLite worker (exploration 0229): each scheduled op logs its queue-wait vs execution time (the split that finally separates head-of-line queueing from real SQL/OPFS cost), plus a one-shot db-stats line at open (file size, page/freelist counts, storage mode). Threaded via a bootDebug open-config flag since workers can't read localStorage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
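
To make the queue-wait versus execution-time split concrete, here is a small sketch of a serial scheduler that reports both numbers per op. The `onOp` callback name and the `queueMs`/`execMs` fields follow the PR text; everything else (`SerialSchedulerSketch`, the injected clock, synchronous ops) is an assumption made to keep the example deterministic and is not the real scheduler.

```typescript
interface OpTiming {
  label: string
  queueMs: number // Time spent waiting behind earlier ops (head-of-line wait).
  execMs: number // Time the op itself took to run.
}

interface QueuedOp {
  label: string
  enqueuedAt: number
  run: () => void
}

class SerialSchedulerSketch {
  private readonly pending: QueuedOp[] = []

  constructor(
    private readonly now: () => number, // Injected clock, so tests can fake time.
    private readonly onOp: (timing: OpTiming) => void
  ) {}

  schedule(label: string, run: () => void): void {
    this.pending.push({ label, enqueuedAt: this.now(), run })
  }

  // Runs queued ops one at a time, as a single worker would.
  drain(): void {
    while (this.pending.length > 0) {
      const op = this.pending.shift() as QueuedOp
      const startedAt = this.now()
      try {
        op.run()
      } finally {
        // Report even if the op throws; the error then propagates to the caller.
        const finishedAt = this.now()
        this.onOp({
          label: op.label,
          queueMs: startedAt - op.enqueuedAt,
          execMs: finishedAt - startedAt,
        })
      }
    }
  }
}

// Illustrative run with a fake clock: one slow op ahead of a fast one.
let fakeNow = 0
const opTimings: OpTiming[] = []
const sketchScheduler = new SerialSchedulerSketch(
  () => fakeNow,
  (timing) => { opTimings.push(timing) }
)
sketchScheduler.schedule('slow-op', () => { fakeNow += 18000 })
sketchScheduler.schedule('offlineQueue.load', () => { fakeNow += 5 })
sketchScheduler.drain()
```

In this run the fast op reports a large `queueMs` and a small `execMs`, while the slow op reports the reverse. That is the distinction a wall-clock timer around the caller cannot make.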

…nect The hub now connects early (0229 item B), so a single boot-timeline log at hub:connected would miss the residual time-to-first-paint. logBootTimeline now logs once per distinct reason and the landing surface logs again at query:first-rows, keeping firstPaint visible so a future storage stall can't hide inside the connect phase. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
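
The once-per-reason behavior can be sketched as a small latch keyed by reason. This is an assumption-laden illustration: `createBootTimelineLogger`, the `BootMarks` shape, and the log format are hypothetical, and the real `logBootTimeline` signature is not shown in this PR.

```typescript
// Hypothetical: named marks in milliseconds since boot start.
type BootMarks = Record<string, number>

// Returns a logger that emits at most one line per distinct reason.
function createBootTimelineLogger(log: (line: string) => void) {
  const seenReasons = new Set<string>()
  return function logBootTimelineSketch(reason: string, marks: BootMarks): boolean {
    if (seenReasons.has(reason)) return false // Already logged for this reason.
    seenReasons.add(reason)
    const parts = Object.keys(marks).map((name) => `${name}=${marks[name]}ms`)
    log(`[boot-timeline] reason=${reason} ${parts.join(' ')}`)
    return true
  }
}

// Illustrative values only.
const bootLines: string[] = []
const logTimeline = createBootTimelineLogger((line) => { bootLines.push(line) })
logTimeline('hub:connected', { hubConnected: 232 })
logTimeline('hub:connected', { hubConnected: 232 }) // Ignored: same reason.
logTimeline('query:first-rows', { hubConnected: 232, firstRows: 18000 })
```

Keying the latch on the reason rather than using a single global flag is what lets the later `query:first-rows` line through after `hub:connected` has already fired.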

0227 stopped writing the gc:false presence Yjs doc but never deleted the existing blob, which still bloats the OPFS xnet.db file and raises every cold read (exploration 0229). Deletes presence-* yjs_state rows once per origin and VACUUMs only when a row was removed, scheduled on requestIdleCallback so the heavy VACUUM never lands on the boot critical path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
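
A minimal sketch of the gating described above: delete, VACUUM only if something was removed, then latch so it never runs again. The SQL statement is the one quoted in the PR description; the `SqlDb` and `LatchStore` interfaces, the latch key, and the function name are hypothetical, and the real apps/web code may differ.

```typescript
// Hypothetical minimal database handle.
interface SqlDb {
  /** Runs a statement and returns the number of rows it changed. */
  run(sql: string): number
}

// Hypothetical per-origin latch with a localStorage-like shape.
interface LatchStore {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
}

const CLEANUP_LATCH_KEY = 'presence-blob-cleanup:done' // Hypothetical key name.

type CleanupResult = 'skipped' | 'nothing-to-remove' | 'vacuumed'

function cleanupPresenceBlobs(db: SqlDb, latch: LatchStore): CleanupResult {
  // One-time latch: do nothing if a previous run already completed.
  if (latch.getItem(CLEANUP_LATCH_KEY) === '1') return 'skipped'

  const removed = db.run("DELETE FROM yjs_state WHERE node_id LIKE 'presence-%'")

  // VACUUM rewrites the whole database file, so only pay for it
  // when the DELETE actually freed something.
  if (removed > 0) db.run('VACUUM')

  latch.setItem(CLEANUP_LATCH_KEY, '1')
  return removed > 0 ? 'vacuumed' : 'nothing-to-remove'
}

// Illustrative run against a fake database that records statements.
const ranStatements: string[] = []
let fakePresenceRows = 3
const fakeDb: SqlDb = {
  run(sql) {
    ranStatements.push(sql)
    if (sql.indexOf('DELETE') === 0) {
      const removed = fakePresenceRows
      fakePresenceRows = 0
      return removed
    }
    return 0
  },
}
const latchValues: Record<string, string> = {}
const fakeLatch: LatchStore = {
  getItem: (key) => (key in latchValues ? latchValues[key] : null),
  setItem: (key, value) => { latchValues[key] = value },
}
const firstCleanup = cleanupPresenceBlobs(fakeDb, fakeLatch)
const secondCleanup = cleanupPresenceBlobs(fakeDb, fakeLatch)
```

In the browser, the call would be wrapped in `requestIdleCallback` as the commit message describes, so neither the DELETE nor the VACUUM competes with boot-time queries.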

github-actions · 2026-06-26T02:55:26Z

🖼️ UI changes in this PR

Interactions

🎬 Create a page and use the editor

Auto-captured by CI · run. Informational, not a blocking check.

github-actions · 2026-06-26T02:55:57Z

Preview removed for PR #280.

xNet Test and others added 6 commits June 25, 2026 19:28

docs(exploration): explore the migrating 18s boot stall — instrument …

06cd831

…to ground truth Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(changeset): add changeset + changelog for boot-stall fixes (0229)

985ac8f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

crs48 temporarily deployed to pr-280 June 26, 2026 02:50 — with GitHub Actions Inactive

github-actions Bot added a commit that referenced this pull request Jun 26, 2026

deploy(visuals): publish PR #280 UI captures

b3fc1c9

github-actions Bot added a commit that referenced this pull request Jun 26, 2026

deploy(preview): publish PR #280 preview

78e08b0

crs48 merged commit 4f52aef into main Jun 26, 2026
15 of 16 checks passed

crs48 deleted the claude/0229-the-migrating-18s-boot-stall-instrument-to-groun branch June 26, 2026 03:06

github-actions Bot added a commit that referenced this pull request Jun 26, 2026

deploy(preview): remove PR #280 preview

b0fa084

xnet-changelog-bot Bot pushed a commit that referenced this pull request Jun 26, 2026

docs(changelog): link PR #280 to its fragment [skip ci]

ad68420

github-actions Bot mentioned this pull request Jun 26, 2026

chore(release): version packages #281

Open

crs48 merged 6 commits into main from claude/0229-the-migrating-18s-boot-stall-instrument-to-groun
