
fix(sync): eliminate the 18s blank-page boot stall + registry persistence (0227)#278

Merged: crs48 merged 7 commits into main from claude/0227-boot-stall-sqlite-worker-head-of-line-blocking on Jun 26, 2026

crs48 (Owner) commented Jun 26, 2026

Implements exploration 0227: the ~18s blank-page cold-start stall, plus two latent sync bugs surfaced in the same logs.

Root cause

Every storage op, both landing read queries and Yjs document I/O, goes through one FIFO SQLite worker. At boot the workspace presence doc (presence-main) was cold-loaded first; that one read sat at the head of the queue and head-of-line blocked every landing query for ~18s. All queries reported ~18403ms yet drained within a 56ms burst, which points to queue wait rather than SQL time. Presence is gc:false (Yjs garbage collection disabled, so deleted content is retained) and was persisted on every tick, so its yjs_state blob grew without bound.
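
To make the "queue wait, not SQL time" reading concrete, here is a minimal model of a single FIFO worker. It is a sketch with made-up job names and timings, not the xNet worker code: one slow read at the head of the queue makes every later query report roughly the same large latency, even though each query needs only a few milliseconds of actual work and they all finish in a tight burst.

```typescript
// Hypothetical model of one FIFO worker draining jobs in arrival order.
// Job names and timings are illustrative, not measurements from xNet.
interface Job {
  name: string;
  enqueuedAt: number; // ms since boot
  serviceMs: number; // time the worker actually spends on the job
}

interface Completed {
  name: string;
  queueWaitMs: number; // time spent waiting behind earlier jobs
  latencyMs: number; // what the caller observes: wait + service
  finishedAt: number;
}

// Jobs are assumed to be sorted by enqueuedAt; the worker runs one at a time.
function drainFifo(jobs: Job[]): Completed[] {
  let clock = 0;
  const done: Completed[] = [];
  for (const job of jobs) {
    const startedAt = Math.max(clock, job.enqueuedAt);
    const finishedAt = startedAt + job.serviceMs;
    done.push({
      name: job.name,
      queueWaitMs: startedAt - job.enqueuedAt,
      latencyMs: finishedAt - job.enqueuedAt,
      finishedAt,
    });
    clock = finishedAt;
  }
  return done;
}

const results = drainFifo([
  { name: "cold-load presence-main", enqueuedAt: 0, serviceMs: 18000 },
  { name: "landing query 1", enqueuedAt: 5, serviceMs: 10 },
  { name: "landing query 2", enqueuedAt: 6, serviceMs: 10 },
  { name: "landing query 3", enqueuedAt: 7, serviceMs: 10 },
]);

// The landing queries all finish within a short burst once the head job completes.
const queries = results.slice(1);
const burstMs = queries[queries.length - 1].finishedAt - queries[0].finishedAt;

for (const r of queries) {
  console.log(`${r.name}: latency=${r.latencyMs}ms, of which queue wait=${r.queueWaitMs}ms`);
}
console.log(`burst width: ${burstMs}ms`);
```

In this model nearly all of each query's latency is `queueWaitMs`, which matches the signature described above: near-identical reported durations and a narrow completion burst.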

What landed (code-complete: 6/8 checklist items)

  • B — ephemeral presence is in-memory only (node-pool.ts): presence-* docs are never cold-loaded from yjs_state nor persisted back. Removes the head-of-line read at its source and bounds the blob. New NodePoolConfig.isEphemeral / largeDocWarnBytes.
  • A — defer presence join off first paint (CommsContext.tsx): workspace presence joins on requestIdleCallback, so landing reads paint first.
  • E + guardrail (node-pool.ts): boot-debug loadDoc timing log + a >5MiB blob warning tripwire.
  • Boot timeline (boot-timeline.ts / BootTimelineProbe.tsx): a new docwarm phase (observes the runtime's xnet:docpool:first-acquire mark) so a storage stall is attributed to storage, not mislabelled as network connect time.
  • Registry FK (sqlite-adapter.ts, types.ts, sync-manager.ts): the registry persisted its tracked-node set under the synthetic key _xnet_tracked_nodes via setDocumentContent, hitting yjs_state's node_id → nodes(id) FK (SQLITE_CONSTRAINT_FOREIGNKEY 787) — so it silently never persisted. Adds FK-free getAppState/setAppState (backed by sync_state) and routes the registry through it.
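
A rough sketch of the ephemeral-doc rule from item B and the blob guardrail from item E. Only the names `isEphemeral` and `largeDocWarnBytes` come from this PR; their real signatures, the `BlobStore` interface, `createSketchPool`, and the `warn` hook are assumptions made for illustration and are not the actual node-pool.ts API.

```typescript
// Hypothetical sketch. Only isEphemeral and largeDocWarnBytes are names from
// the PR; everything else here is assumed for illustration.
interface BlobStore {
  load(docId: string): Uint8Array | undefined;
  save(docId: string, blob: Uint8Array): void;
}

interface SketchPoolConfig {
  isEphemeral: (docId: string) => boolean;
  largeDocWarnBytes: number;
  warn: (message: string) => void;
}

function createSketchPool(store: BlobStore, config: SketchPoolConfig) {
  return {
    // Returns the persisted blob, or undefined when the doc should start empty.
    loadDoc(docId: string): Uint8Array | undefined {
      if (config.isEphemeral(docId)) return undefined; // never touches storage
      const blob = store.load(docId);
      if (blob && blob.byteLength > config.largeDocWarnBytes) {
        config.warn(`${docId}: persisted blob is ${blob.byteLength} bytes`);
      }
      return blob;
    },
    // Returns true when the blob was written to storage.
    persistDoc(docId: string, blob: Uint8Array): boolean {
      if (config.isEphemeral(docId)) return false; // in-memory only
      store.save(docId, blob);
      return true;
    },
  };
}

// Usage with an in-memory store that counts reads.
const blobs = new Map<string, Uint8Array>();
let storageReads = 0;
const warnings: string[] = [];
const countingStore: BlobStore = {
  load: (id) => {
    storageReads += 1;
    return blobs.get(id);
  },
  save: (id, blob) => {
    blobs.set(id, blob);
  },
};

const pool = createSketchPool(countingStore, {
  isEphemeral: (id) => id.startsWith("presence-"),
  largeDocWarnBytes: 5 * 1024 * 1024, // mirrors the >5MiB tripwire
  warn: (message) => {
    warnings.push(message);
  },
});

pool.persistDoc("presence-main", new Uint8Array(8)); // skipped
pool.loadDoc("presence-main"); // starts empty, no storage read
console.log(`storage reads after presence load: ${storageReads}`);
```

The point of the predicate is that a presence doc never enqueues a read or write on the storage worker at all, so it cannot sit at the head of the queue.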

Tests

New: node-pool.test.ts (5), boot-timeline docwarm (2), sqlite-adapter app state (4). All touched suites are green; data, runtime, and apps/web typecheck clean.

Deliberately out of scope (the other 2 checklist items)

  • Hub redeploy — the INVALID_HASH flood is the tenant hub running an incompatible @xnetjs/sync (the client circuit breaker already handles it correctly). That's an ops/redeploy action, not a repo change.
  • Dedicated read worker (item C) — explicitly a follow-up exploration in the doc.

Validation items in the doc are field/manual QA (real cold-boot timing, throttled CPU, hub redeploy) pending a deployed build.

🤖 Generated with Claude Code

xNet Test and others added 7 commits June 25, 2026 17:45
…orker head-of-line blocking)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iming + large-blob guardrail

Presence docs (presence-*) are now never cold-loaded from yjs_state nor
persisted back, removing the 18s presence-doc read that head-of-line blocked
every landing query on the single SQLite worker at boot (exploration 0227).
Adds a boot-debug loadDoc timing log, a >5MiB blob guardrail, and a one-shot
xnet:docpool:first-acquire performance mark for the boot timeline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Joins the workspace presence room on requestIdleCallback instead of during the
initial render burst, so presence-doc warming never competes with the landing
read queries on the single SQLite worker (exploration 0227).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
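
For readers unfamiliar with the deferral pattern in the commit above, a minimal sketch follows. `runWhenIdle`, `IdleHost`, and the 2000ms timeout are hypothetical; the PR only states that the workspace presence join moved to requestIdleCallback. The host object is injected so the sketch also runs where requestIdleCallback does not exist; in a browser one would pass `window`.

```typescript
// Hypothetical helper, not the CommsContext.tsx implementation.
interface IdleHost {
  requestIdleCallback?: (cb: () => void, options?: { timeout: number }) => number;
  cancelIdleCallback?: (handle: number) => void;
}

// Schedules task for an idle period and returns a cancel function.
function runWhenIdle(task: () => void, host: IdleHost, timeoutMs = 2000): () => void {
  if (typeof host.requestIdleCallback === "function") {
    // The timeout caps how long the task can be starved on a busy main thread.
    const handle = host.requestIdleCallback(task, { timeout: timeoutMs });
    return () => host.cancelIdleCallback?.(handle);
  }
  // Fallback for hosts without requestIdleCallback (some Safari versions, Node).
  const timer = setTimeout(task, 0);
  return () => clearTimeout(timer);
}

// Demo with a fake host that queues idle callbacks until told to run them.
const order: string[] = [];
const idleQueue: Array<() => void> = [];
const fakeHost: IdleHost = {
  requestIdleCallback: (cb) => {
    idleQueue.push(cb);
    return idleQueue.length;
  },
  cancelIdleCallback: () => {
    idleQueue.length = 0;
  },
};

runWhenIdle(() => {
  order.push("join workspace presence");
}, fakeHost);
order.push("landing reads issued");

// The host goes idle: deferred work runs only now.
idleQueue.forEach((cb) => cb());
console.log(order.join(" -> "));
```

In a React effect the returned cancel function would typically be called from the cleanup, so an unmount before idle does not trigger a late join.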
…contention

Splits the first doc-warm out of the connect window: BootTimelineProbe observes
the runtime's xnet:docpool:first-acquire mark and records a docwarm phase
(store:ready -> first doc acquired), so a storage stall is attributed to storage
rather than mislabelled as network connect time (exploration 0227).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The registry stored its tracked-node set under the synthetic key
_xnet_tracked_nodes via setDocumentContent, which writes yjs_state and violates
its node_id -> nodes(id) foreign key (SQLITE_CONSTRAINT_FOREIGNKEY 787) — so the
registry silently never persisted. Adds FK-free getAppState/setAppState
(backed by sync_state) to the storage adapter and routes the registry through it
(exploration 0227).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
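
The registry foreign-key fix (the getAppState/setAppState commit) can be pictured with a small in-memory model. This is a sketch only: `createModelAdapter` and its maps are hypothetical stand-ins for the SQLite tables. The method names setDocumentContent, getAppState, and setAppState and the key _xnet_tracked_nodes are the only parts taken from the PR, and their real signatures may differ.

```typescript
// Hypothetical in-memory model of the two storage paths; not sqlite-adapter.ts.
function createModelAdapter() {
  const nodes = new Set<string>(); // stands in for nodes(id)
  const yjsState = new Map<string, string>(); // stands in for yjs_state (FK to nodes)
  const appState = new Map<string, string>(); // stands in for the FK-free key-value path
  return {
    createNode(id: string): void {
      nodes.add(id);
    },
    // Document path: the row must reference an existing node.
    setDocumentContent(nodeId: string, content: string): void {
      if (!nodes.has(nodeId)) {
        throw new Error(`FOREIGN KEY constraint failed for node_id=${nodeId}`);
      }
      yjsState.set(nodeId, content);
    },
    // App-state path: plain key-value storage with no reference to nodes.
    setAppState(key: string, value: string): void {
      appState.set(key, value);
    },
    getAppState(key: string): string | undefined {
      return appState.get(key);
    },
  };
}

const adapter = createModelAdapter();
const TRACKED_KEY = "_xnet_tracked_nodes";
const tracked = JSON.stringify(["node-a", "node-b"]);

// Old path: the synthetic key is not a real node id, so the write is rejected.
let oldPathError = "";
try {
  adapter.setDocumentContent(TRACKED_KEY, tracked);
} catch (err) {
  oldPathError = err instanceof Error ? err.message : String(err);
}

// New path: the same payload persists under the FK-free app-state API.
adapter.setAppState(TRACKED_KEY, tracked);
const restored: string[] = JSON.parse(adapter.getAppState(TRACKED_KEY) ?? "[]");

console.log(`old path error: ${oldPathError}`);
console.log(`restored tracked nodes: ${restored.join(", ")}`);
```

If the caller swallows the error on the old path, the tracked-node set is simply never saved, which is the silent failure the commit describes.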
crs48 temporarily deployed to pr-278 June 26, 2026 01:31 with GitHub Actions (Inactive)
github-actions Bot added a commit that referenced this pull request Jun 26, 2026
github-actions Bot (Contributor) commented

🖼️ UI changes in this PR

Interactions

🎬 Open a channel and post a message

Auto-captured by CI · run. Informational — not a blocking check.

github-actions Bot (Contributor) commented Jun 26, 2026

Preview removed for PR #278.

github-actions Bot added a commit that referenced this pull request Jun 26, 2026
crs48 merged commit b89a11f into main Jun 26, 2026
14 of 15 checks passed
crs48 deleted the claude/0227-boot-stall-sqlite-worker-head-of-line-blocking branch June 26, 2026 01:42
github-actions Bot added a commit that referenced this pull request Jun 26, 2026