Retry SQLite open on timeout instead of failing the boot (0253) #355

Merged
crs48 merged 2 commits into main from claude/0253-resilient-sqlite-open
Jul 1, 2026

@crs48 crs48 commented Jul 1, 2026

Owner

Why — the actual failure, finally captured

[WebSQLiteProxy] Calling proxy.open()...
[App] Initialization failed: Error: Worker initialization timeout after 15s

The SQLite worker's open() exceeded 15s, so the timeout fired and failed the whole boot to an error screen. This disproves the load-bearing argument the investigation leaned on ("the 15s timeout never fires, so open must be fast"): it does fire. The open is intermittently slow: 292ms on the fast localStorage boots, ~17s on the JSON captures, and >15s here. That intermittency is the fingerprint of OPFS sync-access-handle contention, most likely self-inflicted. A boot whose open times out leaks its worker (pre-#351, the graceful close() queued behind the stuck open and never released the handle), so the next boot's createSyncAccessHandle() on the large 318k-change DB file blocks on the held handle, and the failure cascades.

What

WebSQLiteProxy.open() no longer hard-fails on the first timeout. It terminates the stuck worker (freeing the OPFS handle, building on #351's R1) and retries with a fresh worker up to 3 attempts, via the new pure/tested openWithTimeoutRetry (open-retry.ts). The leaked-handle cascade recovers instead of erroring; a genuinely broken OPFS still fails cleanly after bounded attempts. Adds SQLiteConfig.openTimeoutMs (default 15000).
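
For readers who want the shape of the retry loop without opening open-retry.ts, here is a minimal sketch of the terminate-and-retry idea. It is not the code in this PR: the WorkerAttempt interface, the option names maxAttempts and onRetry, and the final error message are assumptions made for illustration. Only the behaviour described above is taken from the PR (a bounded number of attempts, a fresh worker per attempt, the stuck worker terminated on timeout).

```typescript
// Hypothetical sketch; names and signatures are assumptions, not the real open-retry.ts.

// One open attempt against a fresh worker.
interface WorkerAttempt<T> {
  open(): Promise<T>; // starts the open on this worker
  terminate(): void;  // kills the worker, releasing any OPFS handle it holds
}

interface OpenRetryOptions {
  maxAttempts: number; // the PR retries up to 3 attempts
  timeoutMs: number;   // per-attempt timeout; the PR's default is 15000
  onRetry?: (attempt: number) => void; // called after a timed-out attempt that will be retried
}

// Sentinel compared by identity, so a timeout cannot be confused with a real result.
const TIMED_OUT = { timedOut: true } as const;

async function openWithTimeoutRetry<T>(
  makeAttempt: () => WorkerAttempt<T>,
  opts: OpenRetryOptions,
): Promise<T> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const worker = makeAttempt();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
      timer = setTimeout(() => resolve(TIMED_OUT), opts.timeoutMs);
    });
    try {
      const result = await Promise.race([worker.open(), timeout]);
      if (result !== TIMED_OUT) return result as T; // opened in time
    } catch (err) {
      // A non-timeout failure is not retried: release the worker and surface the error.
      worker.terminate();
      throw err;
    } finally {
      clearTimeout(timer);
    }
    // Timed out: terminate the stuck worker so the next attempt is not blocked by it.
    worker.terminate();
    if (attempt < opts.maxAttempts) opts.onRetry?.(attempt);
  }
  throw new Error(`Worker initialization timeout after ${opts.maxAttempts} attempts`);
}
```

Keeping the helper free of Worker and OPFS specifics is what makes it testable with fake attempts: a test can hand it an open() that never settles and assert that terminate() and onRetry fire the expected number of times.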

Tests

@xnetjs/sqlite: 155/158 (4 new open-retry tests — first-try success, timeout→retry→recover, exhaustion, onRetry cadence); typecheck + lint clean. Changeset: @xnetjs/sqlite minor.

Not in this PR (root cause)

The open is only slow because the append-only changes table holds ~318k rows (constant across every capture) → a large DB-body file → slow cold createSyncAccessHandle(). Compacting the change log (snapshot + truncate history) is the durable fix (0233/0249's deferred "F3"); this PR + deploying #351 stop the failure and the cascade now.

🤖 Generated with Claude Code

xNet Test and others added 2 commits July 1, 2026 07:43
…he boot (0253)

A fresh capture caught the actual user-facing failure:
  [App] Initialization failed: Error: Worker initialization timeout after 15s

The cold installOpfsSAHPoolVfs() on the large 318k-change DB file intermittently
exceeds the 15s open timeout — usually because a PRIOR boot's open timed out and
leaked a worker still holding the file's exclusive OPFS sync access handle, so
this boot's createSyncAccessHandle() blocks on the contended handle. The old code
hard-failed on the first timeout and showed an error screen.

WebSQLiteProxy.open() now terminates the stuck worker (releasing the handle,
building on #351's R1) and retries with a fresh worker up to 3 attempts, via the
new pure/tested openWithTimeoutRetry helper. The leaked-handle cascade recovers
instead of failing; a genuinely broken OPFS still fails cleanly after bounded
attempts. Adds SQLiteConfig.openTimeoutMs (default 15000) to tune it.

Corrects exploration 0253: the open is NOT innocent — it is intermittently slow
(292ms usually, >15s sometimes). The '15s timeout never fires' argument that
earlier ruled out the open was simply never true; this capture shows it firing.
Root-cause follow-up: compact the 318k-row change log to shrink the DB file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: xNet Test <test@xnet.dev>
@crs48 crs48 temporarily deployed to pr-355 July 1, 2026 14:46 — with GitHub Actions Inactive
github-actions Bot commented Jul 1, 2026

Contributor

Preview removed for PR #355.

github-actions Bot added a commit that referenced this pull request Jul 1, 2026
@crs48 crs48 merged commit f350c9b into main Jul 1, 2026
16 checks passed
@crs48 crs48 deleted the claude/0253-resilient-sqlite-open branch July 1, 2026 15:01
github-actions Bot added a commit that referenced this pull request Jul 1, 2026
crs48 added a commit that referenced this pull request Jul 1, 2026
## The ~5s cold-open freeze

Exploration **0253**'s main-thread stall detector (#354) caught a **~5s
uninterrupted `window` long task** *after* `hub:connected`:

```json
{ "blockMs": 4980, "atOffsetMs": 21069, "phaseBefore": "hub:connected",
  "phaseAfter": "hub:connected", "longTaskAttribution": "window" }
```

This is the freeze that hid behind the intermittent slow SQLite open
(fixed by #355). Its source is the **first outbound resync**: when the
persisted sync cursor lags far behind the local change log (the hub
never confirmed the tail — `INVALID_HASH` skew, 0224),
`syncLocalChanges()` fetches *every* change since the cursor and
processes the whole slice **synchronously** right after the
sync-response resolves — on the 318k-change dataset, that's seconds of
unbroken main-thread work.

## Fix

`packages/runtime/src/sync/node-store-sync-provider.ts` —
`syncLocalChanges()`:

- **Code-unit tie-break, not `localeCompare`.** The DB query already
returns `lamport`-ASC order, so the sort only orders equal-lamport ties.
`String.localeCompare` over a large tie-heavy slice is orders of
magnitude slower than a code-unit compare *and* violated the repo's
code-unit collation invariant (the inbound apply path already orders by
code units). This alone removes the dominant cost.
- **Yield every 1024 changes.** The enqueue loop yields a macrotask
between batches, so a large first-sync slice can never monopolise a
frame regardless of size (it bails cleanly if the connection drops
mid-yield).
- **Self-gating diagnostic.** A one-line `[NodeStoreSync] heavy outbound
resync: N changes, fetch+deserialize Xms, sort Yms` warn fires only when
a resync is actually heavy — naming the residual synchronous cost (the
per-row `JSON.parse` deserialize *inside* `getChangesSince`) so the next
capture confirms the fix and sizes the durable follow-up.
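
The yield-every-1024 behaviour in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `node-store-sync-provider.ts`: the names `enqueueInBatches`, `yieldToEventLoop`, and `isConnected` are hypothetical, and `setTimeout(…, 0)` stands in for whatever macrotask yield the provider actually uses.

```typescript
// Hypothetical sketch; not the provider's real enqueue loop.
const YIELD_EVERY = 1024; // batch size named in this PR

// Resolve on a later macrotask so rendering and input handling can run in between.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, 0);
  });
}

// Enqueue every change in order, yielding between batches.
// Returns how many changes were enqueued before finishing or bailing out.
async function enqueueInBatches<T>(
  changes: readonly T[],
  enqueue: (change: T) => void,
  isConnected: () => boolean,
  batchSize: number = YIELD_EVERY,
): Promise<number> {
  let sent = 0;
  for (const change of changes) {
    if (sent > 0 && sent % batchSize === 0) {
      await yieldToEventLoop();
      // The connection may have dropped while we were yielded: stop cleanly.
      if (!isConnected()) break;
    }
    enqueue(change);
    sent++;
  }
  return sent;
}
```

Because the loop only yields at batch boundaries, a small resync pays no extra latency, while a large one is split into bounded slices of main-thread work.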

The durable root fix — **compacting the 318k-row `changes` log** (F3) so
the slice is small — is the next step (a separate `/explore`). This PR
removes the freeze now.

## Tests

`node-store-sync-provider.test.ts` gains two cases (18 total, all
green):
- **Code-unit tie ordering** — equal-lamport changes from `did:key:zB`
vs `did:key:za` publish `B` before `a` (code-unit), which the old
`localeCompare` would have reversed. Guards the fix.
- **Yield-boundary integrity** — a 1100-change resync (> the 1024 batch)
publishes every change exactly once; nothing dropped or duped at the
seam.
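
To make the tie-ordering case concrete, here is a small sketch of a code-unit comparator. The `ChangeKey` shape and the choice of the author DID as the tie-break field are assumptions for illustration; the PR only states that equal-lamport ties are ordered by code units. JavaScript's `<` and `>` on strings compare UTF-16 code units, which is why uppercase `B` (0x42) sorts before lowercase `a` (0x61).

```typescript
// Hypothetical sketch; field names are assumptions, not the provider's real types.
interface ChangeKey {
  lamport: number;
  author: string; // e.g. a did:key identifier
}

// Plain relational operators on strings compare UTF-16 code units,
// with no locale tables involved.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Order by lamport first; fall back to the code-unit compare only on ties.
function compareChanges(x: ChangeKey, y: ChangeKey): number {
  return x.lamport - y.lamport || compareCodeUnits(x.author, y.author);
}

const ordered = [
  { lamport: 7, author: "did:key:za" },
  { lamport: 7, author: "did:key:zB" },
  { lamport: 3, author: "did:key:zz" },
]
  .sort(compareChanges)
  .map((c) => c.author);
// ordered is ["did:key:zz", "did:key:zB", "did:key:za"]
```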

Full `runtime` project: 151/151 pass. No public API or wire-contract
change → `@xnetjs/runtime: patch`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
crs48 added a commit that referenced this pull request Jul 1, 2026
…pen fix (0254) (#357)

## What

Exploration **0254 (F3)** — the design for the durable root fix of the
cold-open saga: **compacting the ~424k-row local `changes` log** that
makes both cold-open costs slow. This is the fix `0233` → `0249` →
`0253` kept deferring. It builds on the two already-merged failure-mode
fixes ([#355](#355) resilient open,
[#356](#356) resync yield), which made
boot *survivable* but did not shrink the log.

## How it was produced

Grounded in three code-mapped subsystems (kernel hash-chain,
sync/convergence, state materialization), then hardened by an
adversarial design-panel + 5-lens red-team pass before writing.

## Key findings

- **State is fully materialized** in `nodes`/`node_properties`; **live
queries never replay `changes`**. The hub is the **authoritative** full
log. So the local `changes` table is a mostly-redundant cache — safe to
GC.
- **The adversarial pass falsified the obvious rule.** "Drop everything
below the sync cursor" is **unsafe**: the cursor = the hub
`highWaterMark` across *all* authors, which is *not* proof this client's
own low-lamport (offline/concurrent) changes were pushed. That rule
would delete the only recoverable copy of a stranded self-authored
change → **permanent data loss + divergence** (confirmed across
convergence, lost-write, cursor-regression, hub-rollback, and BYO-hub
lenses).

## The verified-safe design

**Superseded-history GC**, keyed on **live-value lineage** rather than
the cursor: prune a `changes` row only if it is below the safe floor,
**not** a per-node tip, and **backs no currently-winning LWW value**
(`node_properties` provenance). Keep per-node tips (chaining), the
unconfirmed tail, and every live-value backer; gate the whole pass on
rollback detection and stable hub identity. **Client-only, zero
hub/protocol/wire/DDL change, behind a kill-switch.** Hub-assisted
signed-snapshot bootstrap (the only fix for *fresh-device* cost) is
recorded as a deliberate follow-up.
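
As a rough illustration of the rule above, the prune decision could be expressed as a pure predicate. Everything named here is hypothetical: `ChangeRow`, `GcContext`, and `canPrune` are not taken from the design doc, the computation of the safe floor is not described in this summary, and the sketch assumes the unconfirmed tail sits at or above that floor.

```typescript
// Hypothetical sketch of the superseded-history GC predicate; all names are assumptions.
interface ChangeRow {
  hash: string;
  nodeId: string;
  lamport: number;
}

interface GcContext {
  gcEnabled: boolean;         // kill-switch
  rollbackDetected: boolean;  // hub rollback seen: skip the whole pass
  hubIdentityStable: boolean; // hub identity unchanged since the last sync
  safeFloor: number;          // rows at or above this lamport are never pruned (assumed to cover the unconfirmed tail)
  nodeTips: ReadonlySet<string>;         // hashes of per-node chain tips
  liveValueBackers: ReadonlySet<string>; // hashes backing a currently winning LWW value
}

function canPrune(row: ChangeRow, ctx: GcContext): boolean {
  // Gate the whole pass first.
  if (!ctx.gcEnabled || ctx.rollbackDetected || !ctx.hubIdentityStable) return false;
  // Keep anything at or above the safe floor.
  if (row.lamport >= ctx.safeFloor) return false;
  // Keep per-node tips so later changes can still chain onto them.
  if (ctx.nodeTips.has(row.hash)) return false;
  // Keep every row that backs a live value in node_properties.
  if (ctx.liveValueBackers.has(row.hash)) return false;
  return true;
}
```

Note that the sync cursor does not appear anywhere in the predicate, which is the point of the design: a row is dropped because it is superseded, not because the hub is assumed to have seen it.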

Doc includes Options, the adversarial narrative, the recommendation with
9 correctness invariants, example code, risks, and Implementation +
Validation checklists (including the convergence conformance test that
would currently fail without the lineage guard).

Docs-only; no changeset. Next step: `/implement` Phase 1 (the
client-only GC).

🤖 Generated with [Claude Code](https://claude.com/claude-code)