Skip to content

Fix CRR PK retrofit corruption, sync schema skew, and failed brain connect wipe#367

Merged
arul28 merged 1 commit into
mainfrom
cursor/critical-bug-detection-03c4
May 28, 2026
Merged

Fix CRR PK retrofit corruption, sync schema skew, and failed brain connect wipe#367
arul28 merged 1 commit into
mainfrom
cursor/critical-bug-detection-03c4

Conversation

@cursor
Copy link
Copy Markdown
Contributor

@cursor cursor Bot commented May 26, 2026

Bug and impact

CRR primary-key retrofit corrupts replicated tables (data corruption)
retrofitLegacyPrimaryKeyNotNullSchema could DROP TABLE + rebuild CRR-managed tables (e.g. lanes) without crsql_begin_alter / rebuildCrrTableWithBackfill. The FK retrofit pass already skipped tables with __crsql_clock companions; the PK pass did not. Legacy DBs needing PK repair on open could break CRDT clocks and cause desktop/iOS divergence.

Unknown-table changesets ACKed without applying (sync data loss)
applyChanges silently skipped rows for tables missing from the local schema, but syncHostService / syncPeerService ACKed appliedCount = changes.length and advanced the sync cursor. A viewer on an older build receiving batches that include newer tables would permanently skip those rows after upgrading.

Failed connectToBrain wiped local device registry (data loss)
clearClusterRegistryForViewerJoin() ran before the WebSocket handshake. On auth/timeout failure, local devices / sync_cluster_state rows were already deleted and not restored.

Root cause

  • Asymmetric guards between FK and PK retrofit passes in kvDb.ts.
  • Host/peer ACK paths ignored applyChanges return value; implementation contradicted crdt-model.md (unknown tables must be rejected, not skipped-and-ACKed).
  • Viewer join cleanup ordered before connect success.

Fix

  • Skip PK retrofit when ${table}__crsql_clock exists.
  • Throw unknown_sync_table:<name> for non-retired missing tables; report real appliedCount in changeset ACKs.
  • Move registry clear to after successful syncPeerService.connect().

Validation

  • npx vitest run src/main/services/state/kvDb.migrations.test.ts
  • Sync regression tests in kvDb.sync.test.ts (run when cr-sqlite is available on macOS CI)

Does not overlap open draft PRs #363#366 (CRR alter leak, orchestration approval/manifest, viewer join DELETE broadcast).

Open in Web View Automation 

ADE   Open in ADE  ·  cursor/critical-bug-detection-03c4 branch  ·  PR #367

Greptile Summary

This PR fixes three distinct data-integrity bugs: the CRR PK retrofit could DROP and rebuild cr-sqlite-managed tables, applyChanges silently advanced the sync cursor for unknown tables, and clearClusterRegistryForViewerJoin() ran before the WebSocket handshake so a failed connectToBrain wiped the device registry.

  • kvDb.ts: Adds a __crsql_clock companion check to skip CRR tables in the PK retrofit pass, and replaces the silent-skip on unknown tables with a thrown unknown_sync_table:<name> error (retired tables continue to be ignored via SYNC_RETIRED_TABLES).
  • syncHostService.ts / syncPeerService.ts: Both changeset-batch ACK paths now report applyResult.appliedCount instead of echoing changes.length, so the remote cursor only advances for rows that were actually written.
  • syncService.ts: clearClusterRegistryForViewerJoin() is moved to after a successful syncPeerService.connect(), with a new integration test covering the failure path.

Confidence Score: 3/5

The three core fixes are logically correct, but the connectToBrain reordering introduces a new failure mode: if anything after connect() throws, the catch block clears the saved draft without closing the live WebSocket, leaving the device syncing silently while the caller believes the connection failed.

The ACK-count and unknown-table fixes are straightforward and well-tested. The registry-wipe reordering solves the stated bug cleanly for the happy path. However, the catch block in connectToBrain does not call syncPeerService.disconnect(), so any exception thrown by clearClusterRegistryForViewerJoin(), touchLocalDevice(), or flushLocalChanges() leaves an open WebSocket connection processing changesets in the background while the UI shows a failed-connect state. The migration regression test also may not reach the guarded code path on a fresh schema.

apps/ade-cli/src/services/sync/syncService.ts (missing disconnect in catch) and apps/desktop/src/main/services/state/kvDb.migrations.test.ts (test fixture may not exercise the retrofit branch)

Important Files Changed

Filename Overview
apps/desktop/src/main/services/state/kvDb.ts Adds CRR clock-companion guard to PK retrofit, throws on unknown non-retired tables in applyChanges, and introduces SYNC_RETIRED_TABLES; the retired set has a redundant entry but logic is correct.
apps/ade-cli/src/services/sync/syncService.ts Moves clearClusterRegistryForViewerJoin() after connect() to prevent premature wipe on failure, but the catch block doesn't disconnect the peer service, leaving an orphaned WebSocket if any post-connect step throws.
apps/ade-cli/src/services/sync/syncHostService.ts Correctly plumbs applyResult.appliedCount into the changeset ACK instead of echoing changes.length; error path was already present and is untouched.
apps/ade-cli/src/services/sync/syncPeerService.ts Mirrors the host-side ACK fix — reports real appliedCount instead of the raw batch length, with correct zero fallback when changes array is empty.
apps/desktop/src/main/services/state/kvDb.migrations.test.ts Adds a CRR clock-companion skip test, but the assertion may pass trivially on a fresh DB where the PK retrofit branch is never entered; test needs a legacy schema fixture to fully cover the regression.
apps/desktop/src/main/services/state/kvDb.sync.test.ts Updates the retired-table skip test to cover two FTS variants, and correctly updates the future-table test to assert that applyChanges now throws instead of silently dropping rows.
apps/ade-cli/src/services/sync/syncService.test.ts New end-to-end test for registry preservation on failed connect; exercises the TCP-refused path via an unused port, correctly verifies the registry survives the failure.

Sequence Diagram

sequenceDiagram
    participant UI
    participant SyncService
    participant PeerService
    participant DeviceRegistry
    participant Brain

    UI->>SyncService: connectToBrain(draft)
    SyncService->>PeerService: connect(draft)
    PeerService->>Brain: WebSocket open + hello
    Brain-->>PeerService: hello_ok (handshake complete)
    PeerService-->>SyncService: connect() resolves

    Note over SyncService,DeviceRegistry: Registry cleared AFTER successful connect (fix)
    SyncService->>DeviceRegistry: clearClusterRegistryForViewerJoin()
    SyncService->>PeerService: flushLocalChanges()

    Brain->>PeerService: changeset_batch
    alt all tables known
        PeerService->>PeerService: applyChanges() to appliedCount
        PeerService-->>Brain: "changeset_ack(ok=true, appliedCount)"
    else unknown non-retired table
        PeerService->>PeerService: applyChanges() throws unknown_sync_table
        PeerService-->>Brain: "changeset_ack(ok=false, appliedCount=0)"
    end
Loading

Comments Outside Diff (2)

  1. apps/ade-cli/src/services/sync/syncService.ts, line 1023-1036 (link)

    P1 Orphaned WebSocket when post-connect operation throws

    syncPeerService.connect(draft) opens the WebSocket and completes the handshake before returning. If any of the subsequent calls — clearClusterRegistryForViewerJoin(), touchLocalDevice(), or flushLocalChanges() — throw, the catch block clears the saved draft via setSavedDraft(null) but never calls syncPeerService.disconnect(). The live WebSocket connection remains open and keeps receiving changesets from the brain, while the caller receives an error indicating that the connection failed. The mismatch leaves the device syncing silently in the background in a state the UI believes is disconnected.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: apps/ade-cli/src/services/sync/syncService.ts
    Line: 1023-1036
    
    Comment:
    **Orphaned WebSocket when post-connect operation throws**
    
    `syncPeerService.connect(draft)` opens the WebSocket and completes the handshake before returning. If any of the subsequent calls — `clearClusterRegistryForViewerJoin()`, `touchLocalDevice()`, or `flushLocalChanges()` — throw, the catch block clears the saved draft via `setSavedDraft(null)` but never calls `syncPeerService.disconnect()`. The live WebSocket connection remains open and keeps receiving changesets from the brain, while the caller receives an error indicating that the connection failed. The mismatch leaves the device syncing silently in the background in a state the UI believes is disconnected.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Codex Fix in Claude Code

  2. apps/desktop/src/main/services/state/kvDb.ts

    P2 "unified_memories_fts" is already covered by the startsWith("unified_memories_") guard below, making its entry in SYNC_RETIRED_TABLES redundant. The constant only needs to hold names that would NOT match the startsWith predicate — currently only "unified_memories" itself.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: apps/desktop/src/main/services/state/kvDb.ts
    Line: ?
    
    Comment:
    `"unified_memories_fts"` is already covered by the `startsWith("unified_memories_")` guard below, making its entry in `SYNC_RETIRED_TABLES` redundant. The constant only needs to hold names that would NOT match the `startsWith` predicate — currently only `"unified_memories"` itself.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Cursor Fix in Codex Fix in Claude Code

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
apps/ade-cli/src/services/sync/syncService.ts:1023-1036
**Orphaned WebSocket when post-connect operation throws**

`syncPeerService.connect(draft)` opens the WebSocket and completes the handshake before returning. If any of the subsequent calls — `clearClusterRegistryForViewerJoin()`, `touchLocalDevice()`, or `flushLocalChanges()` — throw, the catch block clears the saved draft via `setSavedDraft(null)` but never calls `syncPeerService.disconnect()`. The live WebSocket connection remains open and keeps receiving changesets from the brain, while the caller receives an error indicating that the connection failed. The mismatch leaves the device syncing silently in the background in a state the UI believes is disconnected.

### Issue 2 of 3
apps/desktop/src/main/services/state/kvDb.migrations.test.ts:157-179
**Test may not exercise the actual PK-retrofit path for `lanes`**

The probe index (`temp_ade_pk_retrofit_probe`) is dropped during the table-rebuild in the old (unfixed) code path, so its presence after reopen could serve as a witness. However, `retrofitLegacyPrimaryKeyNotNullSchema` only attempts a rebuild when a PK column in the stored `CREATE TABLE` SQL is missing the `NOT NULL` keyword. A fresh `openKvDb` produces a fully-constrained `lanes` schema, so the retrofit branch is never entered for `lanes` at all — the assertion that `lanes__crsql_clock` still exists would pass trivially even without the `if (rawHasTable(db, ...)) continue;` guard. To actually cover the regression, the test should inject a legacy schema for `lanes` (strip `NOT NULL` from the stored DDL, or use a fixture DB created with the old migration) so the migration genuinely reaches the guarded skip.

### Issue 3 of 3
apps/desktop/src/main/services/state/kvDb.ts:?
`"unified_memories_fts"` is already covered by the `startsWith("unified_memories_")` guard below, making its entry in `SYNC_RETIRED_TABLES` redundant. The constant only needs to hold names that would NOT match the `startsWith` predicate — currently only `"unified_memories"` itself.

```suggestion
/** CRDT tables removed from the schema; inbound tombstones are ignored. */
const SYNC_RETIRED_TABLES = new Set(["unified_memories"]);
```

Reviews (1): Last reviewed commit: "Fix CRR PK retrofit corruption, sync sch..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@vercel
Copy link
Copy Markdown

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
ade Ignored Ignored Preview May 28, 2026 8:59am

@arul28
Copy link
Copy Markdown
Owner

arul28 commented May 28, 2026

@copilot review but do not make fixes

Copy link
Copy Markdown
Contributor

Copilot AI commented May 28, 2026

@copilot review but do not make fixes

Reviewed c426318 and ran the targeted tests touched by this PR:

  • npm --prefix apps/ade-cli run test -- src/services/sync/syncService.test.ts
  • npm --prefix apps/desktop run test -- src/main/services/state/kvDb.migrations.test.ts src/main/services/state/kvDb.sync.test.ts

No blocking issues found in this pass, and the targeted tests passed (with CR-SQLite-gated tests skipped as expected).

Copilot finished work on behalf of arul28 May 28, 2026 07:23
Copilot AI requested a review from arul28 May 28, 2026 07:23
@arul28 arul28 force-pushed the cursor/critical-bug-detection-03c4 branch from c426318 to 5deafde Compare May 28, 2026 07:49
@arul28
Copy link
Copy Markdown
Owner

arul28 commented May 28, 2026

@copilot review but do not make fixes

@arul28 arul28 force-pushed the cursor/critical-bug-detection-03c4 branch from 5deafde to 6b3f3c4 Compare May 28, 2026 08:21
@arul28
Copy link
Copy Markdown
Owner

arul28 commented May 28, 2026

@copilot review but do not make fixes

@arul28
Copy link
Copy Markdown
Owner

arul28 commented May 28, 2026

@copilot review but do not make fixes

…n connect wipe

- Skip primary-key retrofit for tables with __crsql_clock companions (same guard as FK retrofit)
- Reject unknown sync tables instead of silently skipping; use applyChanges appliedCount in ACKs
- Defer device registry clear until connectToBrain succeeds so failed connects keep local pairing data

Co-authored-by: Arul Sharma <arul28@users.noreply.github.com>
@arul28 arul28 force-pushed the cursor/critical-bug-detection-03c4 branch from ce6457b to b003b82 Compare May 28, 2026 08:59
@arul28
Copy link
Copy Markdown
Owner

arul28 commented May 28, 2026

@copilot review but do not make fixes

@arul28 arul28 marked this pull request as ready for review May 28, 2026 09:29
@arul28 arul28 merged commit 7140d8e into main May 28, 2026
28 of 29 checks passed
Copy link
Copy Markdown
Contributor Author

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review

Scope: 7 file(s), +158 / −19
Verdict: Looks good

This PR hardens CRDT sync and local DB migration: it skips wholesale PK retrofit on CRR-managed tables, reports honest appliedCount in changeset ACKs, rejects unknown inbound sync tables (while still ignoring retired unified_memories* tombstones), and defers clearing the device registry until connectToBrain succeeds. The changes are tightly scoped with matching tests and correct transactional rollback in applyChanges.


Notes

  • Good patterns: applyChanges wraps the loop in begin/rollback/commit, so a mid-batch unknown_sync_table error cannot leave a partially applied batch (kvDb.ts:L3048-L3079). Host and peer services already map apply failures to ok: false ACKs with retry/disconnect handling.
  • Operational tradeoff: Rejecting unknown sync tables (instead of silently skipping) will hard-fail sync when peers run mismatched schema versions until all devices upgrade. That is stricter than before but fixes the prior ACK skew where changes.length could over-report applied rows.
  • Could not verify locally: Vitest could not run in this environment (node_modules not installed), so test pass/fail was not executed here.
Open in Web View Automation 

Sent by Cursor Automation: BUGBOT in Versic

Comment on lines 157 to 179
}
});

it.skipIf(!isCrsqliteAvailable())("skips primary-key retrofit for tables that already have __crsql_clock companions", async () => {
const dbPath = makeDbPath("ade-kvdb-pk-retrofit-skip-crr-");
const first = await openKvDb(dbPath, createLogger());
first.run("create unique index if not exists temp_ade_pk_retrofit_probe on lanes(project_id, name)");
first.close();

const reopened = await openKvDb(dbPath, createLogger());
try {
expect(
reopened.get<{ name: string }>(
"select name from sqlite_master where type = 'table' and name = 'lanes__crsql_clock' limit 1",
)?.name,
).toBe("lanes__crsql_clock");
reopened.run("drop index if exists temp_ade_pk_retrofit_probe");
} finally {
reopened.close();
}
});

it("coalesces duplicate lane_linear_issue_links rows during migrate", async () => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test may not exercise the actual PK-retrofit path for lanes

The probe index (temp_ade_pk_retrofit_probe) is dropped during the table-rebuild in the old (unfixed) code path, so its presence after reopen could serve as a witness. However, retrofitLegacyPrimaryKeyNotNullSchema only attempts a rebuild when a PK column in the stored CREATE TABLE SQL is missing the NOT NULL keyword. A fresh openKvDb produces a fully-constrained lanes schema, so the retrofit branch is never entered for lanes at all — the assertion that lanes__crsql_clock still exists would pass trivially even without the if (rawHasTable(db, ...)) continue; guard. To actually cover the regression, the test should inject a legacy schema for lanes (strip NOT NULL from the stored DDL, or use a fixture DB created with the old migration) so the migration genuinely reaches the guarded skip.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/services/state/kvDb.migrations.test.ts
Line: 157-179

Comment:
**Test may not exercise the actual PK-retrofit path for `lanes`**

The probe index (`temp_ade_pk_retrofit_probe`) is dropped during the table-rebuild in the old (unfixed) code path, so its presence after reopen could serve as a witness. However, `retrofitLegacyPrimaryKeyNotNullSchema` only attempts a rebuild when a PK column in the stored `CREATE TABLE` SQL is missing the `NOT NULL` keyword. A fresh `openKvDb` produces a fully-constrained `lanes` schema, so the retrofit branch is never entered for `lanes` at all — the assertion that `lanes__crsql_clock` still exists would pass trivially even without the `if (rawHasTable(db, ...)) continue;` guard. To actually cover the regression, the test should inject a legacy schema for `lanes` (strip `NOT NULL` from the stored DDL, or use a fixture DB created with the old migration) so the migration genuinely reaches the guarded skip.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Codex Fix in Claude Code

@arul28 arul28 deleted the cursor/critical-bug-detection-03c4 branch May 28, 2026 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants