
CS-11156: cross-replica clearLocalCaches broadcast via NOTIFY #4842

Draft
lukemelia wants to merge 5 commits into cs-11125-per-realm-advisory-lock-for-data-plane-write-paths from cs-11156-publish-realm-cross-replica-clearlocalcaches-broadcast

lukemelia commented May 14, 2026

Summary

  • Before CS-11156, the publish-realm handler invalidates #sourceCache / #moduleCache only on the replica that processes the request. With two or more replicas, peers keep pre-swap bytes, and the reindex's prerender HTTP fan-out writes stale source into boxel_index.isolated_html, where it is served indefinitely. Unpublish and delete have the same problem during the window between the registry-row commit and the peer-side reconciler unmount.
  • Extends the existing per-path realm_file_changes NOTIFY channel with a bulk payload <realmURL>:* meaning "drop every cached path for this realm." Wired into publish, unpublish, and delete realm handlers; on receive, peers call Realm.clearLocalCaches().
  • While there, replace the publish handler's hand-rolled DELETE FROM modules WHERE resolved_realm_url = $1 with definitionLookup.clearRealmCache(url) so the modules cache invalidation gets the same cross-replica treatment (generation bump + in-flight drop + DELETE + NOTIFY) instead of just the DB delete.
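
As a rough illustration of the bulk payload described above, here is a minimal sketch of building and parsing `<realmURL>:<path>` payloads. Only the `'*'` sentinel and the payload shape come from this PR; `buildPayload`, `isWildcard`, the regex, and the assumption that a realm URL has a colon-free path are hypothetical and may differ from the real parser.

```typescript
// Sketch only. The '*' sentinel and the `<realmURL>:<path>` payload shape come
// from the PR description; every name and the regex are assumptions.
const REALM_FILE_CHANGES_WILDCARD = '*';

interface FileChangePayload {
  realmURL: string;
  path: string;
}

// Build a payload for one path, or the wildcard when no path is given.
function buildPayload(
  realmURL: string,
  path: string = REALM_FILE_CHANGES_WILDCARD,
): string {
  return `${realmURL}:${path}`;
}

// Skip the scheme and authority (which may carry a port colon), then split on
// the first colon after that. Assumes the realm URL has a path (at least '/')
// with no colon in it.
const PAYLOAD_PATTERN = /^(https?:\/\/[^/]+\/[^:]*):(.+)$/;

function parsePayload(payload: string): FileChangePayload | undefined {
  const match = PAYLOAD_PATTERN.exec(payload);
  if (!match) {
    return undefined;
  }
  const realmURL = match[1];
  const path = match[2];
  if (realmURL === undefined || path === undefined) {
    return undefined;
  }
  return { realmURL, path };
}

// True when the payload means "drop every cached path for this realm".
function isWildcard(payload: FileChangePayload): boolean {
  return payload.path === REALM_FILE_CHANGES_WILDCARD;
}
```

Under these assumptions the port colon in `http://localhost:4201/test/` never collides with the separator, because the authority is consumed before the split.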

Stacks on #4840.

Linear: CS-11156.

API surface

Mirrors CachingDefinitionLookup.clearRealmCache(url) — one method bundles local invalidation + cross-instance NOTIFY so handlers don't have to remember both steps. Three entry points after this PR:

| Caller need | API |
| --- | --- |
| Local clear AND peer broadcast (realm staying up: publish handler) | realm.clearLocalCachesAndBroadcast() |
| Peer broadcast ONLY (realm being torn down: unpublish / delete) | notifyAllFileChanges(dbAdapter, url) (free function) |
| Receive-side replay (LISTEN handler, no broadcast, no NOTIFY loop) | realm.clearLocalCaches() |

The modules cache (separate from the in-process byte caches above) keeps its existing definitionLookup.clearRealmCache(url) entry point — this PR just stops bypassing it from the publish handler.
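
The three entry points can be sketched roughly as follows. This is a simplified stand-in, not the code in runtime-common/realm.ts: the `DBAdapter` interface, the `REALM_FILE_CHANGES_CHANNEL` constant name, the `cacheSource` and `cachedEntryCount` helpers, and the use of plain `Map` fields are all assumptions made so the example is self-contained.

```typescript
// Sketch only. DBAdapter, the channel constant, and the cache fields are
// simplified stand-ins, not the real runtime-common types.
interface DBAdapter {
  notify(channel: string, payload: string): Promise<void>;
}

const REALM_FILE_CHANGES_CHANNEL = 'realm_file_changes';
const REALM_FILE_CHANGES_WILDCARD = '*';

// Peer broadcast only (unpublish / delete). Best-effort: a failed NOTIFY is
// swallowed, because a missed broadcast is bounded staleness, not corruption.
async function notifyAllFileChanges(
  dbAdapter: DBAdapter,
  realmURL: string,
): Promise<void> {
  try {
    await dbAdapter.notify(
      REALM_FILE_CHANGES_CHANNEL,
      `${realmURL}:${REALM_FILE_CHANGES_WILDCARD}`,
    );
  } catch {
    // Intentionally ignored in this sketch.
  }
}

class Realm {
  readonly url: string;
  private dbAdapter: DBAdapter;
  private sourceCache = new Map<string, string>();
  private moduleCache = new Map<string, string>();

  constructor(url: string, dbAdapter: DBAdapter) {
    this.url = url;
    this.dbAdapter = dbAdapter;
  }

  // Hypothetical helper so the example has something to clear.
  cacheSource(path: string, bytes: string): void {
    this.sourceCache.set(path, bytes);
  }

  // Hypothetical helper for inspecting the caches from outside.
  cachedEntryCount(): number {
    return this.sourceCache.size + this.moduleCache.size;
  }

  // Receive-side replay: local only, idempotent, never emits a NOTIFY.
  clearLocalCaches(): void {
    this.sourceCache.clear();
    this.moduleCache.clear();
  }

  // Publish handler: local clear plus peer broadcast in one call.
  async clearLocalCachesAndBroadcast(): Promise<void> {
    this.clearLocalCaches();
    await notifyAllFileChanges(this.dbAdapter, this.url);
  }
}
```

Because `clearLocalCaches()` never emits, a replica that receives its own NOTIFY just clears already-empty caches, which is why the self-NOTIFY is harmless.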

What's in

  • runtime-common/realm.ts: REALM_FILE_CHANGES_WILDCARD = '*' sentinel, standalone notifyAllFileChanges(dbAdapter, realmURL) emitter (the single cross-replica emit surface, so Realm doesn't need to know about channel names or payload formats), and Realm.clearLocalCachesAndBroadcast() instance method that bundles clearLocalCaches() + the free-function emit. Same best-effort fire-and-forget shape as Realm.#notifyFileChange; a missed NOTIFY is a bounded staleness window per §9 of docs/db-authoritative-realm-registry.md, not data corruption.
  • realm-file-changes-listener.ts — dispatch branches on path === '*' to Realm.clearLocalCaches(). Existing regex parser + realm lookup reused as-is (the wildcard payload parses cleanly with path = '*').
  • handle-publish-realm.ts
    • Replaces the raw DELETE FROM modules WHERE resolved_realm_url = $1 with await definitionLookup.clearRealmCache(publishedRealmURL) so the modules-cache invalidation also bumps the per-realm generation counter, drops in-flight prerender promises, and broadcasts on module_cache_invalidated — the modules-table analog of the byte-cache fix this PR is making. Without those extra steps an in-flight prerender that started before the DELETE could re-insert a stale row at persist time, and peer replicas would keep their cached rows + generation counters until their own next invalidation arrived. clearRealmCache already runs via the post-fullIndex completion path (realm.ts:1068) but that's at the end of the reindex — too late for the prerender fan-out at the start.
    • One call (await mountedRealmForCacheClear.clearLocalCachesAndBroadcast()) for the byte-cache wipe + cross-replica broadcast before the reindex enqueue. Self-NOTIFY is a no-op since clearLocalCaches is idempotent.
  • handle-unpublish-realm.ts and handle-delete-realm.ts — call the standalone notifyAllFileChanges(dbAdapter, url) after the FS removal. No local clear needed: the realm is about to be unmounted, so the in-process cache will be garbage-collected with the Realm instance. Defense-in-depth against the brief window before peers unmount via NOTIFY realm_registry. (Per-file deleteAll in unpublish already emits per-path NOTIFYs; this is the catch-all for the registry-commit-to-unmount window.)
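
The receive-side branch in the listener might look something like the sketch below. `PeerRealm`, `dispatchFileChange`, and the `Map`-based realm lookup are hypothetical; the real listener reuses its existing parser and lookup, and only the `path === '*'` branch to `clearLocalCaches()` is taken from this PR.

```typescript
// Sketch only. PeerRealm and dispatchFileChange are hypothetical names; the
// real listener reuses its existing parser and realm lookup.
interface PeerRealm {
  clearLocalCaches(): void;
  invalidateCache(path: string): void;
}

const REALM_FILE_CHANGES_WILDCARD = '*';

// Returns true when a mounted realm handled the notification.
function dispatchFileChange(
  mountedRealms: Map<string, PeerRealm>,
  realmURL: string,
  path: string,
): boolean {
  const realm = mountedRealms.get(realmURL);
  if (!realm) {
    return false; // realm is not mounted on this replica; nothing to clear
  }
  if (path === REALM_FILE_CHANGES_WILDCARD) {
    realm.clearLocalCaches(); // bulk payload: drop every cached path
  } else {
    realm.invalidateCache(path); // per-path payload: drop one entry
  }
  return true;
}
```

Neither branch emits a NOTIFY of its own, so a received broadcast cannot trigger a loop.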

Tests

packages/realm-server/tests/realm-file-changes-listener-test.ts (12 existing pass; 4 new):

  • parsePayload round-trips <realmURL>:* to path: '*' for both port-bearing and port-less URLs.
  • Dispatch test: wildcard payload calls clearLocalCaches() exactly once and never invalidateCache().
  • End-to-end through the live LISTEN client: notifyAllFileChanges emitter → Postgres NOTIFY → listener → clearLocalCaches on a fake peer-side Realm.

Why stack on CS-11125

The advisory lock from #4840 is what makes the broadcast's "after the swap" ordering meaningful. Without serialization, a concurrent same-realm write could land in the staleness window between the registry pointer flip and the NOTIFY landing on a peer.

Compatibility

  • Single-replica deploys: the local clearLocalCaches() inside clearLocalCachesAndBroadcast() still runs; the NOTIFY is a no-op when no other replicas are LISTENing. clearRealmCache was already in use in the post-fullIndex completion path; calling it pre-reindex too is purely additive.
  • SQLite (host) is a passthrough — notify is a no-op there.

Test plan

  • realm-file-changes-listener-test.ts (16/16) including 4 new wildcard tests
  • tsc on packages/runtime-common + packages/realm-server — no new errors
  • Prettier clean on all touched files
  • CI realm-server suite (should clear once CS-11125: per-realm advisory lock on data-plane write paths #4840 lands the clearLocalCaches() restoration on its branch)

🤖 Generated with Claude Code

The CS-11043 publish-realm fix invalidates the publish-handling
replica's #sourceCache / #moduleCache before the reindex enqueues so
the reindex's prerender doesn't see pre-swap bytes. That fix is
correct on one replica. On two+ replicas behind a load balancer,
peers still hold pre-swap bytes in their own caches and the
reindex's HTTP fan-out to peers serves stale source — back into
boxel_index.isolated_html, served forever.

Extends the existing per-path `realm_file_changes` NOTIFY channel
with a bulk payload `<realmURL>:*` meaning "drop every cached path
for this realm". Wired into publish, unpublish, and delete realm
handlers; on receive, peers call `Realm.clearLocalCaches()`.

* runtime-common/realm.ts: `REALM_FILE_CHANGES_WILDCARD` sentinel,
  standalone `notifyAllFileChanges(dbAdapter, realmURL)` emitter,
  and `Realm.notifyAllFileChanges()` instance form. Same
  fire-and-forget semantics as `Realm.#notifyFileChange`; missed
  NOTIFY is a bounded staleness window per §9 of the registry doc,
  not data corruption.
* realm-file-changes-listener.ts: dispatch branches on the wildcard
  payload to `Realm.clearLocalCaches()`. Existing per-path parser +
  realm lookup reused as-is.
* handle-publish-realm.ts: keeps the sync local `clearLocalCaches()`
  before the reindex enqueue (replica's own prerender fan-out must
  bypass its cache) and adds the broadcast after. Self-NOTIFY is a
  no-op since clearLocalCaches is idempotent.
* handle-unpublish-realm.ts and handle-delete-realm.ts: broadcast
  after the FS removal. Defense-in-depth against the brief window
  before peers unmount via `NOTIFY realm_registry`.

Tests in realm-file-changes-listener-test.ts:
* parsePayload returns `path: '*'` for both `host:port` and
  port-less URLs
* dispatch routes wildcard to `clearLocalCaches`, not
  `invalidateCache`
* end-to-end through the live LISTEN client: the new emitter →
  Postgres NOTIFY → the listener → `clearLocalCaches` on a fake
  peer-side realm

Stacks on #4840 (CS-11125 — per-realm advisory locks on the data
plane). The lock is what makes the broadcast's "after the swap"
ordering meaningful — without serialization a concurrent same-realm
write could land in the staleness window.

Linear: https://linear.app/cardstack/issue/CS-11156

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 14, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   1h 37m 53s ⏱️ + 1m 37s
2 658 tests ±0  2 642 ✅  - 1  15 💤 ±0  0 ❌ ±0  1 🔥 +1 
2 677 runs  ±0  2 660 ✅  - 2  15 💤 ±0  1 ❌ +1  1 🔥 +1 

Results for commit bbdeef8. ± Comparison against earlier commit 4691944.

For more details on these errors, see this check.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   9m 28s ⏱️ -10s
1 381 tests ±0  1 381 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 462 runs  ±0  1 462 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit bbdeef8. ± Comparison against earlier commit 4691944.

lukemelia and others added 4 commits May 14, 2026 22:35
Follow-up to the initial CS-11156 PR. The publish-realm handler had to
call two methods in sequence to fully invalidate the publishing
replica's cache plus all peers' caches:

    mountedRealmForCacheClear.clearLocalCaches();
    await mountedRealmForCacheClear.notifyAllFileChanges();

Every future emitter would have to remember both lines. Mirroring
`CachingDefinitionLookup.clearRealmCache(url)` — which bundles local
generation bump + DB DELETE + cross-instance NOTIFY in one method —
introduce `Realm.clearLocalCachesAndBroadcast()` that does both steps
and let the handler make one call.

Also drop `Realm.notifyAllFileChanges()`. It was a thin wrapper around
the standalone free function `notifyAllFileChanges(dbAdapter, url)` and
they were used inconsistently — publish used the method, unpublish
used the free function despite having a Realm instance in scope. The
two surfaces collapse to one clear rule:

  - Need local clear AND broadcast (publish handler, realm staying up):
    `realm.clearLocalCachesAndBroadcast()`.
  - Need ONLY the peer broadcast (unpublish/delete handlers, realm
    being torn down — local cache is about to be GC'd with the Realm
    instance): `notifyAllFileChanges(dbAdapter, url)`.

`Realm.clearLocalCaches()` stays as the local-only primitive the
LISTEN handler calls on receive (no broadcast, no NOTIFY loop). The
free function `notifyAllFileChanges` is the single cross-replica emit
surface — the Realm class no longer needs to know about channel names
or payload formats.

No behavior change. All 16 realm-file-changes-listener tests still
pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The publish-realm handler had a hand-rolled `DELETE FROM modules WHERE
resolved_realm_url = $1` to drop stale error entries before the reindex
fan-out. That covers the DB rows but is strictly weaker than
`CachingDefinitionLookup.clearRealmCache(url)`, which:

  1. bumps the per-realm generation counter so in-flight prerenders
     on this replica that started before the DELETE see a mismatch at
     persist time and discard their result instead of re-inserting a
     row this invalidation just removed,
  2. drops in-flight prerender promises for the realm so new callers
     install their own pending against post-swap state rather than
     joining a stale shared transpile,
  3. runs the same DELETE, and
  4. broadcasts on `module_cache_invalidated` so peer realm-server
     replicas perform 1-3 on their own state.

The raw DELETE did only step 3. The reindex worker's prerender fan-out
fires immediately after this code path through HTTP into both this
realm-server and its peers, so missing steps 1, 2, and 4 was exactly
the modules-cache analog of the byte-cache staleness this PR fixes via
`clearLocalCachesAndBroadcast()`.
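
To make step 1 concrete, here is a minimal in-memory sketch of a generation-guarded persist. It is not the `CachingDefinitionLookup` implementation: the class name, method names, and the nested `Map` standing in for the `modules` table are assumptions, and steps 2 and 4 (in-flight promise drop and the NOTIFY) are left out.

```typescript
// Sketch only: an in-memory stand-in for the modules table plus the per-realm
// generation counter. Class and method names are hypothetical.
class GenerationGuardedCache<V> {
  private generations = new Map<string, number>();
  private rows = new Map<string, Map<string, V>>();

  // Callers capture this before starting slow work such as a prerender.
  currentGeneration(realmURL: string): number {
    return this.generations.get(realmURL) ?? 0;
  }

  // Step 1 (bump the generation) and step 3 (delete the realm's rows).
  clearRealm(realmURL: string): void {
    this.generations.set(realmURL, this.currentGeneration(realmURL) + 1);
    this.rows.delete(realmURL);
  }

  // Persist only if no invalidation happened since the work started.
  persist(
    realmURL: string,
    key: string,
    value: V,
    startedAtGeneration: number,
  ): boolean {
    if (startedAtGeneration !== this.currentGeneration(realmURL)) {
      return false; // stale result: discard instead of re-inserting
    }
    let realmRows = this.rows.get(realmURL);
    if (!realmRows) {
      realmRows = new Map<string, V>();
      this.rows.set(realmURL, realmRows);
    }
    realmRows.set(key, value);
    return true;
  }

  get(realmURL: string, key: string): V | undefined {
    return this.rows.get(realmURL)?.get(key);
  }
}
```

A bare DELETE corresponds to calling `rows.delete` without the bump: work that captured its generation earlier would still pass the check and write its stale row back.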

`clearRealmCache` already runs via the post-fullIndex completion path
in `Realm.startReindex` (realm.ts:1068), but that's at the *end* of
the reindex — too late for the prerender fan-out at the start. Running
it pre-reindex ensures the rebuild starts against a coherent cache on
every replica.

`definitionLookup` is already plumbed through `CreateRoutesArgs`; the
handler just needed to destructure it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two completely different caches were both called "the module cache" in
this codebase:
  - `Realm.#moduleCache` — in-process bytes of transpiled JS (the
    prerender's input)
  - `CachingDefinitionLookup`'s `modules` DB table — assembled card
    definitions (the prerender's output)

Both even had a type named `ModuleCacheEntry` with different shapes. The
juxtaposition in `handle-publish-realm.ts` after #4842
(`definitionLookup.clearRealmCache(url)` next to
`realm.clearLocalCachesAndBroadcast()`) made the collision impossible
to ignore.

This commit renames the Realm-side cache to make the "transpiled JS
bytes" framing explicit at the API surface, and renames the public
cache-wipe methods so each call site self-documents which cache it
touches.

  - `Realm.#moduleCache` → `Realm.#transpiledModuleCache`
  - Type `ModuleCacheEntry` (in `realm.ts`, local to that file) →
    `TranspiledModuleEntry`
  - `Realm.clearLocalCaches()` → `Realm.clearLocalSourceCaches()`
  - `Realm.clearLocalCachesAndBroadcast()` →
    `Realm.clearLocalSourceCachesAndBroadcast()`
  - Internal helpers renamed consistently
    (`#dropModuleCacheEntry`, `#bumpModuleCacheGeneration`, the
    generation maps, etc.)

Mechanical rename — no behavior change. 16/16 listener tests pass.
Tier 2 (DefinitionLookup-side renames: `ModuleCacheEntry` →
`DefinitionCacheEntry`, `clearRealmCache` → `clearRealmDefinitions`,
`clearAllModules` → `clearAllDefinitions`, etc.) is a separate
follow-up commit. Tier 3 (DB column + NOTIFY channel rename, needs a
deploy plan for rolling-update compatibility) is deliberately deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CachingDefinitionLookup` caches assembled card *definitions* (per-export
results, error entries, dependency lists). It's not a module-byte cache.
But every public type and method on it was named "module cache" or
"modules" — which collided directly with `Realm.#transpiledModuleCache`
(renamed last commit), the actual JS-bytes cache.

Public API now reads as what it does:
  - `ModuleCacheEntry`      → `DefinitionCacheEntry`
  - `ModuleCacheEntries`    → `DefinitionCacheEntries`
  - `ModuleCacheEntryQuery` → `DefinitionCacheEntryQuery`
  - `getModuleCacheEntry`   → `getCachedDefinitions`
  - `getModuleCacheEntries` → `getCachedDefinitionsBatch`
  - `clearAllModules`       → `clearAllDefinitions`
  - `clearRealmCache`       → `clearRealmDefinitions`

Plus internal-consistency renames on the notify-emitter helpers
(`notifyModuleCacheInvalidations` → `notifyDefinitionCacheInvalidations`,
etc.).

What deliberately did NOT move (Tier 3, deferred — needs a deploy
plan for rolling-update compatibility between replicas listening on
the old vs. new channel name):
  - `modules` DB table name and the `MODULES_TABLE` JS constant
  - `module_cache_invalidated` NOTIFY channel name and the
    `MODULE_CACHE_INVALIDATED_CHANNEL` constant
  - File names containing "module-cache-*"

All 16 realm-file-changes-listener tests, 21 module-cache-invalidation-
listener tests, and 9 module-cache-coordination tests pass after the
rename. `tsc` clean across runtime-common / realm-server / host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>