Retry SQLite open on timeout instead of failing the boot (0253) by crs48 · Pull Request #355 · crs48/xNet

crs48 · 2026-07-01T14:46:09Z

Why — the actual failure, finally captured

[WebSQLiteProxy] Calling proxy.open()...
[App] Initialization failed: Error: Worker initialization timeout after 15s

The SQLite worker open() exceeded 15s and the timeout fired, failing the whole boot to an error screen. This disproves the load-bearing argument the whole investigation leaned on ("the 15s timeout never fires, so open must be fast") — it does fire. The open is intermittently slow: 292ms on the fast localStorage boots, ~17s on the JSON captures, >15s here. That intermittency is the fingerprint of OPFS sync-access-handle contention, most likely self-inflicted — a boot whose open times out leaks its worker (pre-#351 the graceful close() queued behind the stuck open and never released the handle), so the next boot's createSyncAccessHandle() on the large 318k-change DB file blocks on the held handle → cascade.

What

WebSQLiteProxy.open() no longer hard-fails on the first timeout. It terminates the stuck worker (freeing the OPFS handle, building on #351's R1) and retries with a fresh worker up to 3 attempts, via the new pure/tested openWithTimeoutRetry (open-retry.ts). The leaked-handle cascade recovers instead of erroring; a genuinely broken OPFS still fails cleanly after bounded attempts. Adds SQLiteConfig.openTimeoutMs (default 15000).
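The bounded retry described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `open-retry.ts`: the PR description does not show the real `openWithTimeoutRetry` signature, so the `OpenAttempt` shape, the option names, and the return value here are hypothetical. Only the 15000 ms default and the 3-attempt bound come from the PR.

```typescript
// Hypothetical shapes for illustration; the real open-retry.ts may differ.
interface OpenAttempt {
  opened: Promise<void>; // settles when the worker's open() finishes
  terminate: () => void; // kills the worker, releasing its OPFS handle
}

interface OpenRetryOptions {
  timeoutMs: number; // per-attempt budget (SQLiteConfig.openTimeoutMs, default 15000)
  maxAttempts: number; // the PR bounds this at 3
  onRetry?: (failedAttempt: number) => void;
}

// Resolves with the attempt number that succeeded; rejects once the bound is hit.
async function openWithTimeoutRetry(
  startAttempt: () => OpenAttempt,
  opts: OpenRetryOptions,
): Promise<number> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const { opened, terminate } = startAttempt();

    // Race the open against the timer. A non-timeout open error rejects
    // and propagates to the caller without a retry.
    const timedOut = await new Promise<boolean>((resolve, reject) => {
      const timer = setTimeout(() => resolve(true), opts.timeoutMs);
      opened.then(
        () => {
          clearTimeout(timer);
          resolve(false);
        },
        (err) => {
          clearTimeout(timer);
          reject(err);
        },
      );
    });

    if (!timedOut) return attempt;

    // Never reuse a stuck worker: it may still hold the exclusive handle.
    terminate();
    if (attempt < opts.maxAttempts) opts.onRetry?.(attempt);
  }
  throw new Error(
    `Worker initialization timeout after ${opts.maxAttempts} attempts of ${opts.timeoutMs}ms`,
  );
}
```

Keeping the helper free of any `Worker` reference is what makes it testable with fake attempts: the caller supplies a factory that spawns a fresh worker per attempt and a `terminate` that kills it.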

Tests

@xnetjs/sqlite: 155/158 (4 new open-retry tests — first-try success, timeout→retry→recover, exhaustion, onRetry cadence); typecheck + lint clean. Changeset: @xnetjs/sqlite minor.

Not in this PR (root cause)

The open is only slow because the append-only changes table holds ~318k rows (constant across every capture) → a large DB-body file → slow cold createSyncAccessHandle(). Compacting the change log (snapshot + truncate history) is the durable fix (0233/0249's deferred "F3"); this PR + deploying #351 stop the failure and the cascade now.

🤖 Generated with Claude Code

…he boot (0253)

A fresh capture caught the actual user-facing failure:

[App] Initialization failed: Error: Worker initialization timeout after 15s

The cold installOpfsSAHPoolVfs() on the large 318k-change DB file intermittently exceeds the 15s open timeout, usually because a PRIOR boot's open timed out and leaked a worker still holding the file's exclusive OPFS sync access handle, so this boot's createSyncAccessHandle() blocks on the contended handle. The old code hard-failed on the first timeout and showed an error screen.

WebSQLiteProxy.open() now terminates the stuck worker (releasing the handle, building on #351's R1) and retries with a fresh worker up to 3 attempts, via the new pure/tested openWithTimeoutRetry helper. The leaked-handle cascade recovers instead of failing; a genuinely broken OPFS still fails cleanly after bounded attempts. Adds SQLiteConfig.openTimeoutMs (default 15000) to tune it.

Corrects exploration 0253: the open is NOT innocent. It is intermittently slow (292ms usually, >15s sometimes). The '15s timeout never fires' argument that earlier ruled out the open was simply never true; this capture shows it firing.

Root-cause follow-up: compact the 318k-row change log to shrink the DB file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xNet Test <test@xnet.dev>

github-actions · 2026-07-01T14:51:50Z

Preview removed for PR #355.

## The ~5s cold-open freeze

Exploration **0253**'s main-thread stall detector (#354) caught a **~5s uninterrupted `window` long task** *after* `hub:connected`:

```json
{
  "blockMs": 4980,
  "atOffsetMs": 21069,
  "phaseBefore": "hub:connected",
  "phaseAfter": "hub:connected",
  "longTaskAttribution": "window"
}
```

This is the freeze that hid behind the intermittent slow SQLite open (fixed by #355). Its source is the **first outbound resync**: when the persisted sync cursor lags far behind the local change log (the hub never confirmed the tail: `INVALID_HASH` skew, 0224), `syncLocalChanges()` fetches *every* change since the cursor and processes the whole slice **synchronously** right after the sync-response resolves. On the 318k-change dataset, that's seconds of unbroken main-thread work.

## Fix

`packages/runtime/src/sync/node-store-sync-provider.ts`, `syncLocalChanges()`:

- **Code-unit tie-break, not `localeCompare`.** The DB query already returns `lamport`-ASC order, so the sort only orders equal-lamport ties. `String.localeCompare` over a large tie-heavy slice is orders of magnitude slower than a code-unit compare *and* violated the repo's code-unit collation invariant (the inbound apply path already orders by code units). This alone removes the dominant cost.
- **Yield every 1024 changes.** The enqueue loop yields a macrotask between batches, so a large first-sync slice can never monopolise a frame regardless of size (it bails cleanly if the connection drops mid-yield).
- **Self-gating diagnostic.** A one-line `[NodeStoreSync] heavy outbound resync: N changes, fetch+deserialize Xms, sort Yms` warn fires only when a resync is actually heavy, naming the residual synchronous cost (the per-row `JSON.parse` deserialize *inside* `getChangesSince`) so the next capture confirms the fix and sizes the durable follow-up.

The durable root fix, **compacting the 318k-row `changes` log** (F3) so the slice is small, is the next step (a separate `/explore`). This PR removes the freeze now.

## Tests

`node-store-sync-provider.test.ts` gains two cases (18 total, all green):

- **Code-unit tie ordering**: equal-lamport changes from `did:key:zB` vs `did:key:za` publish `B` before `a` (code-unit), which the old `localeCompare` would have reversed. Guards the fix.
- **Yield-boundary integrity**: a 1100-change resync (> the 1024 batch) publishes every change exactly once; nothing dropped or duped at the seam.

Full `runtime` project: 151/151 pass. No public API or wire-contract change → `@xnetjs/runtime: patch`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
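The two main changes described in the Fix section above (code-unit tie-break and a yield every 1024 changes) can be sketched roughly as below. This is an illustration under assumptions, not the code in `node-store-sync-provider.ts`: the `OutboundChange` shape, the choice of `authorDid` as the tie-break key, and the `enqueueInBatches` and `isConnected` names are hypothetical. Only the lamport-then-code-unit ordering and the 1024 batch size come from the description.

```typescript
// Hypothetical change shape; the real record has more fields.
interface OutboundChange {
  lamport: number;
  authorDid: string; // assumed tie-break key for equal-lamport changes
}

// JavaScript's relational operators compare strings by UTF-16 code units,
// so this is locale-independent and much cheaper than localeCompare.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function compareChanges(a: OutboundChange, b: OutboundChange): number {
  return a.lamport - b.lamport || compareCodeUnits(a.authorDid, b.authorDid);
}

const YIELD_EVERY = 1024;

// setTimeout(0) schedules a macrotask, letting the main thread handle
// rendering and input between batches.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Enqueues changes in order and returns how many were enqueued. Stops early
// if the connection is found dropped after a yield.
async function enqueueInBatches<T>(
  changes: readonly T[],
  enqueue: (change: T) => void,
  isConnected: () => boolean = () => true,
): Promise<number> {
  let sent = 0;
  for (const change of changes) {
    if (sent > 0 && sent % YIELD_EVERY === 0) {
      await yieldToEventLoop();
      if (!isConnected()) break;
    }
    enqueue(change);
    sent++;
  }
  return sent;
}
```

With this comparator, `did:key:zB` sorts before `did:key:za` at equal lamport because `B` (0x42) has a lower code unit than `a` (0x61), which matches the ordering the new test case guards.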

…pen fix (0254) (#357)

## What

Exploration **0254 (F3)**: the design for the durable root fix of the cold-open saga, **compacting the ~424k-row local `changes` log** that makes both cold-open costs slow. This is the fix `0233` → `0249` → `0253` kept deferring. It builds on the two already-merged failure-mode fixes ([#355](#355) resilient open, [#356](#356) resync yield), which made boot *survivable* but did not shrink the log.

## How it was produced

Grounded in three code-mapped subsystems (kernel hash-chain, sync/convergence, state materialization), then hardened by an adversarial design-panel + 5-lens red-team pass before writing.

## Key findings

- **State is fully materialized** in `nodes`/`node_properties`; **live queries never replay `changes`**. The hub is the **authoritative** full log. So the local `changes` table is a mostly-redundant cache, safe to GC.
- **The adversarial pass falsified the obvious rule.** "Drop everything below the sync cursor" is **unsafe**: the cursor = the hub `highWaterMark` across *all* authors, which is *not* proof this client's own low-lamport (offline/concurrent) changes were pushed. That rule would delete the only recoverable copy of a stranded self-authored change → **permanent data loss + divergence** (confirmed across convergence, lost-write, cursor-regression, hub-rollback, and BYO-hub lenses).

## The verified-safe design

**Superseded-history GC**, keyed on **live-value lineage** rather than the cursor: prune a `changes` row only if it is below the safe floor, **not** a per-node tip, and **backs no currently-winning LWW value** (`node_properties` provenance). Keep per-node tips (chaining), the unconfirmed tail, and every live-value backer; gate the whole pass on rollback detection and stable hub identity. **Client-only, zero hub/protocol/wire/DDL change, behind a kill-switch.** Hub-assisted signed-snapshot bootstrap (the only fix for *fresh-device* cost) is recorded as a deliberate follow-up.

Doc includes Options, the adversarial narrative, the recommendation with 9 correctness invariants, example code, risks, and Implementation + Validation checklists (including the convergence conformance test that would currently fail without the lineage guard).

Docs-only; no changeset. Next step: `/implement` Phase 1 (the client-only GC).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
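The pruning rule in the verified-safe design above could be expressed as a predicate along these lines. This is only a sketch of the stated rule, not the example code from the 0254 exploration doc: every name here (`ChangeRow`, `GcGuards`, `GcInputs`, `safeFloorLamport`, and so on) is hypothetical, and it assumes the safe floor is a lamport value and that the unconfirmed tail sits at or above it.

```typescript
// Hypothetical row shape for the local `changes` table.
interface ChangeRow {
  hash: string;
  nodeId: string;
  lamport: number;
}

// Gates for the whole GC pass.
interface GcGuards {
  killSwitchOn: boolean; // true disables GC entirely
  rollbackDetected: boolean; // hub rollback or cursor regression observed
  hubIdentityStable: boolean; // same hub as when the floor was computed
}

interface GcInputs {
  safeFloorLamport: number; // assumption: rows at or above this are the kept tail
  nodeTipHashes: ReadonlySet<string>; // per-node tips needed for hash chaining
  liveValueBackerHashes: ReadonlySet<string>; // changes backing a winning LWW value
}

function gcMayRun(guards: GcGuards): boolean {
  return !guards.killSwitchOn && !guards.rollbackDetected && guards.hubIdentityStable;
}

// A row is prunable only if all three conditions from the design hold.
function isPrunable(row: ChangeRow, inputs: GcInputs): boolean {
  return (
    row.lamport < inputs.safeFloorLamport &&
    !inputs.nodeTipHashes.has(row.hash) &&
    !inputs.liveValueBackerHashes.has(row.hash)
  );
}

function selectPrunable(
  rows: readonly ChangeRow[],
  guards: GcGuards,
  inputs: GcInputs,
): ChangeRow[] {
  if (!gcMayRun(guards)) return [];
  return rows.filter((row) => isPrunable(row, inputs));
}
```

Note that the sync cursor does not appear in the predicate at all, which reflects the finding that "below the cursor" is not proof a self-authored change reached the hub.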

xNet Test and others added 2 commits July 1, 2026 07:43

docs(changelog): add fragment for resilient SQLite open retry

12fd5b1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: xNet Test <test@xnet.dev>

crs48 temporarily deployed to pr-355 July 1, 2026 14:46 — with GitHub Actions Inactive

github-actions Bot added a commit that referenced this pull request Jul 1, 2026

deploy(preview): publish PR #355 preview

95ddc05

crs48 merged commit f350c9b into main Jul 1, 2026
16 checks passed

crs48 deleted the claude/0253-resilient-sqlite-open branch July 1, 2026 15:01

github-actions Bot added a commit that referenced this pull request Jul 1, 2026

deploy(preview): remove PR #355 preview

e14e95c

xnet-changelog-bot Bot pushed a commit that referenced this pull request Jul 1, 2026

docs(changelog): link PR #355 to its fragment [skip ci]

e2a8aa5

github-actions Bot mentioned this pull request Jul 1, 2026

chore(release): version packages #281

Open

crs48 mentioned this pull request Jul 1, 2026

perf(sync): yield outbound resync off the main thread (0253) #356

Merged

crs48 mentioned this pull request Jul 1, 2026

docs(explorations): design change-log compaction — the durable cold-open fix (0254) #357

Merged
