Retry SQLite open on timeout instead of failing the boot (0253) #355
Merged
…he boot (0253)

A fresh capture caught the actual user-facing failure:

    [App] Initialization failed: Error: Worker initialization timeout after 15s

The cold installOpfsSAHPoolVfs() on the large 318k-change DB file intermittently exceeds the 15s open timeout, usually because a PRIOR boot's open timed out and leaked a worker still holding the file's exclusive OPFS sync access handle, so this boot's createSyncAccessHandle() blocks on the contended handle. The old code hard-failed on the first timeout and showed an error screen.

WebSQLiteProxy.open() now terminates the stuck worker (releasing the handle, building on #351's R1) and retries with a fresh worker, up to 3 attempts, via the new pure/tested openWithTimeoutRetry helper. The leaked-handle cascade recovers instead of failing; a genuinely broken OPFS still fails cleanly after bounded attempts. Adds SQLiteConfig.openTimeoutMs (default 15000) to tune it.

Corrects exploration 0253: the open is NOT innocent; it is intermittently slow (292ms usually, >15s sometimes). The '15s timeout never fires' argument that earlier ruled out the open was simply never true; this capture shows it firing.

Root-cause follow-up: compact the 318k-row change log to shrink the DB file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xNet Test <test@xnet.dev>
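The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of the real `open-retry.ts`: the `Attempt` and `RetryOptions` shapes, the `withTimeout` helper, and the choice to retry only timeouts (rethrowing any other error immediately) are all assumptions made for the example.

```typescript
// Hypothetical sketch: names and signatures are illustrative only.
interface RetryOptions {
  attempts: number; // bounded number of attempts, e.g. 3
  timeoutMs: number; // per-attempt budget, e.g. 15000
  onRetry?: (attempt: number) => void; // called before each further attempt
}

// One attempt: `start` begins the open, `abort` tears down the stuck worker.
interface Attempt<T> {
  start: () => Promise<T>;
  abort: () => void;
}

function timeoutError(ms: number): Error {
  const e = new Error(`open timed out after ${ms}ms`);
  e.name = "OpenTimeoutError";
  return e;
}

function isTimeout(e: unknown): boolean {
  return e instanceof Error && e.name === "OpenTimeoutError";
}

// Reject if `p` has not settled within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(timeoutError(ms)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function openWithTimeoutRetry<T>(
  makeAttempt: () => Attempt<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    const { start, abort } = makeAttempt(); // fresh worker per attempt
    try {
      return await withTimeout(start(), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      abort(); // terminate the worker so it cannot keep holding the handle
      if (!isTimeout(err)) throw err; // assumption: only timeouts are retried
      if (attempt < opts.attempts && opts.onRetry) opts.onRetry(attempt);
    }
  }
  throw lastError; // bounded: fail cleanly once attempts are exhausted
}
```

Keeping the helper free of worker and OPFS specifics (the caller supplies `start` and `abort`) is what would make it testable as a pure function, which matches the "pure/tested" description above; how the real helper is factored may differ.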
Preview removed for PR #355.
crs48 added a commit that referenced this pull request on Jul 1, 2026
## The ~5s cold-open freeze

Exploration **0253**'s main-thread stall detector (#354) caught a **~5s uninterrupted `window` long task** *after* `hub:connected`:

```json
{
  "blockMs": 4980,
  "atOffsetMs": 21069,
  "phaseBefore": "hub:connected",
  "phaseAfter": "hub:connected",
  "longTaskAttribution": "window"
}
```

This is the freeze that hid behind the intermittent slow SQLite open (fixed by #355). Its source is the **first outbound resync**: when the persisted sync cursor lags far behind the local change log (the hub never confirmed the tail: `INVALID_HASH` skew, 0224), `syncLocalChanges()` fetches *every* change since the cursor and processes the whole slice **synchronously** right after the sync-response resolves. On the 318k-change dataset, that's seconds of unbroken main-thread work.

## Fix

`packages/runtime/src/sync/node-store-sync-provider.ts`, `syncLocalChanges()`:

- **Code-unit tie-break, not `localeCompare`.** The DB query already returns `lamport`-ASC order, so the sort only orders equal-lamport ties. `String.localeCompare` over a large tie-heavy slice is orders of magnitude slower than a code-unit compare *and* violated the repo's code-unit collation invariant (the inbound apply path already orders by code units). This alone removes the dominant cost.
- **Yield every 1024 changes.** The enqueue loop yields a macrotask between batches, so a large first-sync slice can never monopolise a frame regardless of size (it bails cleanly if the connection drops mid-yield).
- **Self-gating diagnostic.** A one-line `[NodeStoreSync] heavy outbound resync: N changes, fetch+deserialize Xms, sort Yms` warn fires only when a resync is actually heavy, naming the residual synchronous cost (the per-row `JSON.parse` deserialize *inside* `getChangesSince`) so the next capture confirms the fix and sizes the durable follow-up.

The durable root fix, **compacting the 318k-row `changes` log** (F3) so the slice is small, is the next step (a separate `/explore`). This PR removes the freeze now.

## Tests

`node-store-sync-provider.test.ts` gains two cases (18 total, all green):

- **Code-unit tie ordering**: equal-lamport changes from `did:key:zB` vs `did:key:za` publish `B` before `a` (code-unit), which the old `localeCompare` would have reversed. Guards the fix.
- **Yield-boundary integrity**: a 1100-change resync (> the 1024 batch) publishes every change exactly once; nothing dropped or duped at the seam.

Full `runtime` project: 151/151 pass. No public API or wire-contract change → `@xnetjs/runtime: patch`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
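The two main changes in the fix above (the code-unit tie-break and the yield every 1024 changes) can be illustrated with a small sketch. It is not the code in `node-store-sync-provider.ts`: the `Change` shape, the `author` tie-break field, and the `enqueueInBatches`, `enqueue`, and `isConnected` names are assumptions chosen for the example.

```typescript
// Hypothetical change shape; the real type in @xnetjs/runtime may differ.
interface Change {
  lamport: number;
  author: string; // e.g. a did:key identifier (assumed tie-break field)
}

// `<` and `>` on strings compare UTF-16 code units, with no locale tables.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Lamport ascending first; equal lamports fall back to code-unit order.
function compareChanges(a: Change, b: Change): number {
  return a.lamport - b.lamport || compareCodeUnits(a.author, b.author);
}

const BATCH = 1024;

// setTimeout(0) schedules a macrotask, letting the event loop run other work.
function yieldMacrotask(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

// Sorts a copy of the slice, then enqueues it in batches of BATCH,
// yielding between batches. Returns how many changes were enqueued.
async function enqueueInBatches(
  changes: Change[],
  enqueue: (c: Change) => void,
  isConnected: () => boolean,
): Promise<number> {
  const sorted = changes.slice().sort(compareChanges);
  let sent = 0;
  for (const change of sorted) {
    if (sent > 0 && sent % BATCH === 0) {
      await yieldMacrotask();
      if (!isConnected()) break; // bail out if the connection dropped mid-yield
    }
    enqueue(change);
    sent++;
  }
  return sent;
}
```

With this comparator, `did:key:zB` sorts before `did:key:za` because uppercase `B` has a lower code unit than lowercase `a`, which is the ordering the tie-ordering test above describes.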
crs48 added a commit that referenced this pull request on Jul 1, 2026
…pen fix (0254) (#357)

## What

Exploration **0254 (F3)**: the design for the durable root fix of the cold-open saga, **compacting the ~424k-row local `changes` log** that makes both cold-open costs slow. This is the fix `0233` → `0249` → `0253` kept deferring. It builds on the two already-merged failure-mode fixes ([#355](#355) resilient open, [#356](#356) resync yield), which made boot *survivable* but did not shrink the log.

## How it was produced

Grounded in three code-mapped subsystems (kernel hash-chain, sync/convergence, state materialization), then hardened by an adversarial design-panel + 5-lens red-team pass before writing.

## Key findings

- **State is fully materialized** in `nodes`/`node_properties`; **live queries never replay `changes`**. The hub is the **authoritative** full log. So the local `changes` table is a mostly-redundant cache, safe to GC.
- **The adversarial pass falsified the obvious rule.** "Drop everything below the sync cursor" is **unsafe**: the cursor = the hub `highWaterMark` across *all* authors, which is *not* proof this client's own low-lamport (offline/concurrent) changes were pushed. That rule would delete the only recoverable copy of a stranded self-authored change → **permanent data loss + divergence** (confirmed across convergence, lost-write, cursor-regression, hub-rollback, and BYO-hub lenses).

## The verified-safe design

**Superseded-history GC**, keyed on **live-value lineage** rather than the cursor: prune a `changes` row only if it is below the safe floor, **not** a per-node tip, and **backs no currently-winning LWW value** (`node_properties` provenance). Keep per-node tips (chaining), the unconfirmed tail, and every live-value backer; gate the whole pass on rollback detection and stable hub identity. **Client-only, zero hub/protocol/wire/DDL change, behind a kill-switch.** Hub-assisted signed-snapshot bootstrap (the only fix for *fresh-device* cost) is recorded as a deliberate follow-up.

Doc includes Options, the adversarial narrative, the recommendation with 9 correctness invariants, example code, risks, and Implementation + Validation checklists (including the convergence conformance test that would currently fail without the lineage guard). Docs-only; no changeset. Next step: `/implement` Phase 1 (the client-only GC).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
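The pruning rule in the verified-safe design above reads naturally as a per-row predicate. The sketch below only illustrates that rule and is not taken from the 0254 document: `ChangeRow`, `GcContext`, and every field name are hypothetical, and how the safe floor, the unconfirmed tail, and the live-value backers are actually computed is left out.

```typescript
// All names below are hypothetical; they illustrate the rule, not the real schema.
interface ChangeRow {
  hash: string;
  nodeId: string;
  lamport: number;
}

interface GcContext {
  gateOpen: boolean; // kill-switch on, no rollback detected, hub identity stable
  safeFloor: number; // lamport below which pruning is considered at all
  tipHashByNode: Map<string, string>; // per-node chain tips, kept for chaining
  unconfirmed: Set<string>; // hashes in the unconfirmed tail
  liveBackers: Set<string>; // hashes backing a currently-winning LWW value
}

// A row is prunable only if every guard passes; any doubt keeps the row.
function isPrunable(row: ChangeRow, ctx: GcContext): boolean {
  if (!ctx.gateOpen) return false; // whole pass is gated
  if (row.lamport >= ctx.safeFloor) return false; // not below the safe floor
  if (ctx.tipHashByNode.get(row.nodeId) === row.hash) return false; // per-node tip
  if (ctx.unconfirmed.has(row.hash)) return false; // unconfirmed tail
  if (ctx.liveBackers.has(row.hash)) return false; // backs a live value
  return true; // superseded history: safe to drop under the stated assumptions
}
```

Note that the sync cursor appears nowhere in the predicate; that is the point of keying the rule on live-value lineage, since the cursor alone does not prove a self-authored change reached the hub.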
Why — the actual failure, finally captured
The SQLite worker `open()` exceeded 15s and the timeout fired, failing the whole boot to an error screen. This disproves the load-bearing argument the whole investigation leaned on ("the 15s timeout never fires, so open must be fast"): it does fire. The open is intermittently slow: 292ms on the fast localStorage boots, ~17s on the JSON captures, >15s here. That intermittency is the fingerprint of OPFS sync-access-handle contention, most likely self-inflicted: a boot whose open times out leaks its worker (pre-#351 the graceful `close()` queued behind the stuck open and never released the handle), so the next boot's `createSyncAccessHandle()` on the large 318k-change DB file blocks on the held handle → cascade.

What

`WebSQLiteProxy.open()` no longer hard-fails on the first timeout. It terminates the stuck worker (freeing the OPFS handle, building on #351's R1) and retries with a fresh worker, up to 3 attempts, via the new pure/tested `openWithTimeoutRetry` (`open-retry.ts`). The leaked-handle cascade recovers instead of erroring; a genuinely broken OPFS still fails cleanly after bounded attempts. Adds `SQLiteConfig.openTimeoutMs` (default 15000).

Tests

`@xnetjs/sqlite`: 155/158 (4 new `open-retry` tests: first-try success, timeout→retry→recover, exhaustion, onRetry cadence); typecheck + lint clean. Changeset: `@xnetjs/sqlite` minor.

Not in this PR (root cause)

The open is only slow because the append-only `changes` table holds ~318k rows (constant across every capture) → a large DB-body file → slow cold `createSyncAccessHandle()`. Compacting the change log (snapshot + truncate history) is the durable fix (0233/0249's deferred "F3"); this PR + deploying #351 stop the failure and the cascade now.

🤖 Generated with Claude Code