
CS-10953: cross-process populate coordination for CachingDefinitionLookup #4645

Draft
lukemelia wants to merge 2 commits into cs-10952-cross-process-invalidation-broadcast from cs-10953-cross-process-populate-coordination

Conversation

@lukemelia
Contributor

Stacked on #4644 (CS-10952). Second sub-issue under the CS-10950 umbrella. Closes the populate-coalescing half of the cross-process coordination story.

Note

This PR's base is cs-10952-cross-process-invalidation-broadcast (the CS-10952 PR's branch), not main. Once CS-10952 lands, GitHub will retarget this PR's base to main; alternatively, rebase it manually.

Why

CS-10948 added in-process generation discard. CS-10952 broadcasts invalidations across processes. But cross-process populate — coalescing the prerender work itself — still has a gap: each of N realm-server processes independently misses the modules cache on cold fan-out and fires its own prerender for the same URL. Prerender servers see N× the work they need. Measurable today during from-scratch indexing.

This PR adds a pg_try_advisory_xact_lock + NOTIFY-wait coalescing layer between CachingDefinitionLookup's #inFlight coalescer and the prerenderer, so at most one realm-server process per coalesce key reaches the prerenderer. Peer processes block on NOTIFY and re-read the populated row.

What changes

runtime-common/definition-lookup.ts

  • Exports MODULE_CACHE_POPULATED_CHANNEL and a new PopulateCoordinator interface (two methods: tryAcquireAndRun(coalesceKey, fn) and waitForKey(coalesceKey, timeoutMs)).
  • CachingDefinitionLookup's constructor takes an optional populateCoordinator (5th arg). When provided, loadModuleCacheEntry routes through a new loadModuleCacheEntryCoordinated that wraps the work in an outer loop of up to COALESCE_MAX_ITERATIONS iterations (sketched after this list):
    1. Optimistic cache read (avoid contending the lock on hits).
    2. Try the advisory lock via the coordinator.
    3. Winner: run the existing uncoordinated body inside the lock — its existing cache double-check + prerender + generation-check + persist. Coordinator emits NOTIFY on commit.
    4. Loser: wait for peer's NOTIFY (180 s timeout). Loop. The next iteration's optimistic cache read picks up the peer's row.
    5. Throws after COALESCE_MAX_ITERATIONS (4) so a pathological peer crash-loop or NOTIFY-drop sequence surfaces as an error instead of silently hanging.
  • When no coordinator is provided (default; sqlite/in-memory deployments; the vast majority of test setups), the original uncoordinated path runs unchanged.
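
For orientation, here is a minimal TypeScript sketch of that loop and the PopulateCoordinator surface, inferred from the description above rather than copied from the diff. The helper names (readCache, populateUncoordinated), the PEER_POPULATE_TIMEOUT_MS constant, and the exact row type are assumptions.

```ts
// Sketch only: readCache / populateUncoordinated stand in for
// CachingDefinitionLookup's internal cache read and the existing
// uncoordinated populate body (double-check + prerender + persist).
export const MODULE_CACHE_POPULATED_CHANNEL = 'module_cache_populated';

export interface PopulateCoordinator {
  // Runs `fn` while holding a per-key advisory lock; resolves
  // { acquired: false } immediately when a peer already holds it.
  tryAcquireAndRun(
    coalesceKey: string,
    fn: () => Promise<void>,
  ): Promise<{ acquired: boolean }>;
  // Resolves on a NOTIFY for `coalesceKey`, or after timeoutMs.
  waitForKey(coalesceKey: string, timeoutMs: number): Promise<void>;
}

const COALESCE_MAX_ITERATIONS = 4;
const PEER_POPULATE_TIMEOUT_MS = 180_000; // assumed constant name

type ModuleCacheRow = Record<string, unknown>; // placeholder row shape

export async function loadModuleCacheEntryCoordinated(
  coalesceKey: string,
  coordinator: PopulateCoordinator,
  readCache: (key: string) => Promise<ModuleCacheRow | undefined>,
  populateUncoordinated: (key: string) => Promise<void>,
): Promise<ModuleCacheRow | undefined> {
  for (let i = 0; i < COALESCE_MAX_ITERATIONS; i++) {
    // 1. Optimistic read: cache hits never contend the advisory lock.
    let hit = await readCache(coalesceKey);
    if (hit) {
      return hit;
    }
    // 2-3. Winner: run the existing uncoordinated body inside the lock's
    // transaction; the coordinator emits NOTIFY on commit.
    let { acquired } = await coordinator.tryAcquireAndRun(coalesceKey, () =>
      populateUncoordinated(coalesceKey),
    );
    if (acquired) {
      return await readCache(coalesceKey);
    }
    // 4. Loser: park until the winner's NOTIFY (or timeout), then loop so
    // the next optimistic read picks up the peer's freshly persisted row.
    await coordinator.waitForKey(coalesceKey, PEER_POPULATE_TIMEOUT_MS);
  }
  // 5. A peer crash-loop or dropped-NOTIFY sequence surfaces as an error
  // instead of a silent hang.
  throw new Error(
    `gave up populating ${coalesceKey} after ${COALESCE_MAX_ITERATIONS} attempts`,
  );
}
```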

realm-server/lib/module-cache-coordination.ts (new)

ModuleCacheCoordinator implements PopulateCoordinator. It mirrors the withRealmWriteLock pattern but uses pg_try_advisory_xact_lock (non-blocking), so losers don't pin pool clients for the full prerender wall time (which can approach 150 s in production). waitForKey registers a callback on a per-key Set; the LISTEN handler (PgAdapter.listen on module_cache_populated) dispatches NOTIFYs into the matching set.

pg_notify is emitted inside the same transaction as the lock, so peers only see the signal on commit. The persist itself ran on the shared dbAdapter and is already visible by the time peers re-read on wake.
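
Roughly, the lock-and-notify shape looks like the sketch below. It is a hand-written approximation: the Queryable/withClient shapes, the sha-256 hashing of the coalesce key into an advisory-lock bigint, and the waiter bookkeeping details are assumptions standing in for whatever PgAdapter actually exposes.

```ts
// Approximate shape of ModuleCacheCoordinator (not the real implementation).
import { createHash } from 'crypto';

const CHANNEL = 'module_cache_populated';

// Assumed minimal query surface; the real code goes through PgAdapter.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

export class ModuleCacheCoordinator {
  private waiters = new Map<string, Set<() => void>>();

  constructor(
    // Advisory xact locks are connection-scoped, so BEGIN..COMMIT must run
    // on one dedicated client rather than be spread across a pool.
    private withClient: <T>(fn: (client: Queryable) => Promise<T>) => Promise<T>,
  ) {}

  // Wired up to the LISTEN handler (PgAdapter.listen on module_cache_populated);
  // the NOTIFY payload is assumed to carry the coalesce key.
  handleNotification(coalesceKey: string) {
    for (let wake of this.waiters.get(coalesceKey) ?? []) {
      wake();
    }
  }

  async tryAcquireAndRun(
    coalesceKey: string,
    fn: () => Promise<void>,
  ): Promise<{ acquired: boolean }> {
    return this.withClient(async (client) => {
      await client.query('BEGIN');
      try {
        // Non-blocking: losers learn immediately instead of queueing on the
        // lock for the winner's entire prerender wall time.
        let { rows } = await client.query(
          'SELECT pg_try_advisory_xact_lock($1) AS acquired',
          [advisoryLockId(coalesceKey).toString()],
        );
        if (!rows[0].acquired) {
          await client.query('ROLLBACK');
          return { acquired: false };
        }
        await fn();
        // NOTIFY inside the same transaction: peers only see it on COMMIT,
        // by which point the persisted row is already visible to them.
        await client.query('SELECT pg_notify($1, $2)', [CHANNEL, coalesceKey]);
        await client.query('COMMIT');
        return { acquired: true };
      } catch (e) {
        await client.query('ROLLBACK');
        throw e;
      }
    });
  }

  waitForKey(coalesceKey: string, timeoutMs: number): Promise<void> {
    return new Promise((resolve) => {
      let set = this.waiters.get(coalesceKey) ?? new Set<() => void>();
      this.waiters.set(coalesceKey, set);
      let timer = setTimeout(done, timeoutMs);
      function done() {
        clearTimeout(timer);
        set.delete(done);
        resolve();
      }
      set.add(done);
    });
  }
}

// Advisory locks key on a bigint, so hash the string key down to 63 bits.
function advisoryLockId(key: string): bigint {
  let digest = createHash('sha256').update(key).digest();
  return digest.readBigInt64BE(0) & 0x7fffffffffffffffn;
}
```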

realm-server/main.ts

Behind a PRERENDER_COALESCE_ACROSS_PROCESSES=true env flag (default off). When enabled, main.ts constructs and starts a ModuleCacheCoordinator and passes it to CachingDefinitionLookup; the coordinator is added to the shutdown Promise.all alongside the other listeners.
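
In outline, the wiring looks like the sketch below; everything declared at the top is a placeholder for objects main.ts already has, and the other CachingDefinitionLookup constructor arguments are elided.

```ts
// Hypothetical wiring sketch for realm-server/main.ts; the declare statements
// stand in for the real objects and signatures.
declare const dbAdapter: unknown;
declare const otherLookupArgs: unknown[]; // the existing first four constructor args
declare const shutdownTasks: Array<() => Promise<void>>; // joined in the shutdown Promise.all
declare class ModuleCacheCoordinator {
  constructor(dbAdapter: unknown);
  start(): Promise<void>;
  shutDown(): Promise<void>;
}
declare class CachingDefinitionLookup {
  constructor(...args: unknown[]);
}

let populateCoordinator: ModuleCacheCoordinator | undefined;
if (process.env.PRERENDER_COALESCE_ACROSS_PROCESSES === 'true') {
  populateCoordinator = new ModuleCacheCoordinator(dbAdapter);
  await populateCoordinator.start(); // begins LISTENing on module_cache_populated
  shutdownTasks.push(() => populateCoordinator!.shutDown());
}
// Leaving the optional 5th argument undefined keeps the original uncoordinated path.
let lookup = new CachingDefinitionLookup(...otherLookupArgs, populateCoordinator);
```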

Two small spec divergences

1. NOTIFY on every winner outcome (including missing-module). The CS-10953 spec says winners don't notify when all populationCandidates produced missing-module errors. We notify regardless. Trade-off: notifying costs peers one extra harmless wake, while following the spec would leave parallel callers of a nonexistent URL sitting out the full 180 s timeout. The wake is the cheaper choice.

2. ModuleCacheCoordinator lives in realm-server/lib/, not runtime-common/. Same reason as CS-10952: runtime-common doesn't depend on @cardstack/postgres, and adding the dep would be circular (@cardstack/postgres already depends on runtime-common via DBAdapter). The PopulateCoordinator interface is in runtime-common; the implementation is in realm-server.

Behavior

N=1 (today's production): effectively inert. Try-lock always succeeds uncontended; loser path is never taken; self-NOTIFY is dropped (no waiters registered). Overhead is one extra BEGIN; SELECT pg_try_advisory_xact_lock; pg_notify; COMMIT per cache miss — measurable but sub-millisecond. The PRERENDER_COALESCE_ACROSS_PROCESSES flag is off by default so this overhead doesn't ship to production until we explicitly flip it.

N>1 (with the flag on): N× prerender-server load reduction on cold fan-out. 1 prerender per unique module across the whole fleet instead of N.

Test plan

  • CI Realm Server suite green
  • CI Software Factory job green
  • CI Host suite green

New tests in realm-server/tests/module-cache-coordination-test.ts:

Coordinator unit tests (operate on ModuleCacheCoordinator directly):

  • tryAcquireAndRun uncontended → acquired:true, fn runs, peer waiter sees the NOTIFY on commit.
  • tryAcquireAndRun contended → second caller gets acquired:false immediately (does not pin the pool).
  • waitForKey resolves on NOTIFY before timeout.
  • waitForKey resolves on timeout when no NOTIFY arrives.
  • waitForKey ignores NOTIFYs for unrelated keys.
  • shutDown wakes parked waiters so callers don't hang during teardown.

Integration tests (full CachingDefinitionLookup with coordinator on a real PgAdapter; a sketch of the first case follows the list):

  • Two instances + concurrent same-module lookup → exactly one prerender call total. B parks on NOTIFY, wakes after A persists, re-reads cache, returns A's row.
  • Coordinator-less single instance still works (sqlite/in-memory deployment guard).
  • Cache hit short-circuits before contending the lock (fresh second instance reading an already-cached row never calls its prerenderer).
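
A sketch of the first integration case, assuming the QUnit style used elsewhere in the realm-server suite and a hypothetical makeCoordinatedLookup fixture in place of the real setupDB wiring:

```ts
// Hypothetical test sketch; makeCoordinatedLookup and loadDefinition are
// illustrative names, not the real fixture or API.
import { module, test } from 'qunit';

declare function makeCoordinatedLookup(opts: {
  prerender: (url: string) => Promise<unknown>;
}): { loadDefinition(url: string): Promise<unknown> };

module('CachingDefinitionLookup coordinated path (integration)', function () {
  test('concurrent same-module lookup coalesces to one prerender', async function (assert) {
    let prerenderCalls = 0;
    let prerender = async (_url: string) => {
      prerenderCalls++;
      return {}; // stub prerender result
    };
    // Two lookup instances sharing one Postgres database, each with its own coordinator.
    let a = makeCoordinatedLookup({ prerender });
    let b = makeCoordinatedLookup({ prerender });

    let url = 'http://test-realm/person.gts'; // illustrative URL
    let [rowA, rowB] = await Promise.all([a.loadDefinition(url), b.loadDefinition(url)]);

    assert.strictEqual(prerenderCalls, 1, 'only the lock winner reached the prerenderer');
    assert.deepEqual(rowB, rowA, "the loser woke on NOTIFY and re-read the winner's row");
  });
});
```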

What's NOT in scope (per ticket)

  • Unified DB-backed response cache replacing #moduleCache.
  • ALB sticky routing (infrastructure config; independently tunable).
  • Per-waiter AbortController cancellation.
  • Strict cross-process invalidation closure beyond what NOTIFY latency provides.

Related

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions Bot commented May 4, 2026

Host Test Results

    1 files  ±0      1 suites  ±0   1h 44m 4s ⏱️ +36s
2 563 tests +1  2 548 ✅ +1  15 💤 ±0  0 ❌ ±0 
2 582 runs  +1  2 567 ✅ +1  15 💤 ±0  0 ❌ ±0 

Results for commit 030d174. ± Comparison against earlier commit bffbfea.

Realm Server Test Results

    1 files  ± 0      1 suites  ±0   16m 20s ⏱️ -5s
1 259 tests +16  1 259 ✅ +16  0 💤 ±0  0 ❌ ±0 
1 332 runs  +16  1 332 ✅ +16  0 💤 ±0  0 ❌ ±0 

Results for commit 030d174. ± Comparison against earlier commit bffbfea.

@lukemelia lukemelia force-pushed the cs-10953-cross-process-populate-coordination branch from cd00c51 to bffbfea Compare May 4, 2026 21:27
@lukemelia lukemelia force-pushed the cs-10952-cross-process-invalidation-broadcast branch from 33049fc to 7378223 Compare May 4, 2026 23:35
lukemelia and others added 2 commits May 4, 2026 19:35
Stacks on CS-10952. Adds a `pg_try_advisory_xact_lock` + NOTIFY-wait
coalescing layer between CachingDefinitionLookup's #inFlight coalescer
and the prerenderer, so at most one realm-server process per coalesce
key reaches the prerenderer; peers block on NOTIFY and re-read the
populated row.

* runtime-common/definition-lookup.ts:
  - exports MODULE_CACHE_POPULATED_CHANNEL + the PopulateCoordinator
    interface (`tryAcquireAndRun`, `waitForKey`)
  - CachingDefinitionLookup constructor takes an optional
    `populateCoordinator` (5th arg). When provided,
    loadModuleCacheEntry routes through a new
    loadModuleCacheEntryCoordinated that does an outer
    `for COALESCE_MAX_ITERATIONS` loop: optimistic cache read → try
    lock via coordinator → on win, run uncoordinated body inside the
    lock (the body's existing cache double-check + prerender +
    generation-check + persist) → on loss, wait for peer's NOTIFY
    (180s timeout) and loop. Throws after MAX_ITERATIONS so a
    pathological peer crash-loop or NOTIFY-drop sequence surfaces.
  - When no coordinator is provided (default; sqlite/in-memory
    deployments; the vast majority of test setups), the original
    uncoordinated path runs unchanged.

* realm-server/lib/module-cache-coordination.ts (new):
  ModuleCacheCoordinator implements PopulateCoordinator. Mirrors the
  withRealmWriteLock pattern but with `pg_try_advisory_xact_lock` (non-
  blocking) so losers don't pin pool clients for the duration of a
  peer's prerender. `waitForKey` registers a callback on a per-key
  Set, the LISTEN handler (PgAdapter.listen on
  module_cache_populated) dispatches NOTIFYs into the matching set.
  pg_notify is emitted INSIDE the same tx as the lock so peers only
  see the signal on commit (the persist itself ran on the shared
  dbAdapter and is already visible by then). Always notifies on
  success, even when fn returned undefined (all populationCandidates
  produced missing-module errors), so peers don't sit on the 180s
  timeout for a "no row" outcome — small spec divergence documented
  in the file.

* realm-server/main.ts: behind PRERENDER_COALESCE_ACROSS_PROCESSES=true
  env flag (default off). When on, constructs + starts a
  ModuleCacheCoordinator and passes it to CachingDefinitionLookup.
  Added to the shutdown Promise.all alongside the other listeners.

Behavior at N=1: inert. The try-lock always succeeds uncontended; the
loser path is never taken; self-NOTIFY is dropped (no waiters
registered).

Behavior at N>1 (with the flag on): N× prerender-server load reduction
on cold fan-out — 1 prerender per unique module across the whole
fleet instead of N.

Tests forthcoming in a follow-up commit.

Linear: https://linear.app/cardstack/issue/CS-10953

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the lightweight pattern from module-cache-invalidation-listener-test:
real PgAdapter via setupDB, stub prerenderer/virtualNetwork, no realm-server
fixture. Two modules:

* `ModuleCacheCoordinator unit` — exercises the coordinator surface
  directly:
  - tryAcquireAndRun uncontended → acquired:true, fn runs, peer waiter
    sees NOTIFY on commit
  - tryAcquireAndRun contended → second caller gets acquired:false
    immediately (loser does not pin the pool client)
  - waitForKey resolves on NOTIFY before timeout
  - waitForKey resolves on timeout when no NOTIFY arrives
  - waitForKey ignores NOTIFYs for unrelated keys
  - shutDown wakes parked waiters so callers don't hang during teardown

* `CachingDefinitionLookup coordinated path (integration)` — exercises
  the wired-up lookup with two instances on one DB:
  - concurrent same-module lookup across two instances → exactly one
    prerender call total (B parks on NOTIFY, wakes after A persists,
    re-reads cache, returns row)
  - coordinator-less single instance still works (sqlite/in-memory
    deployment guard)
  - cache-hit short-circuits before contending the lock (fresh second
    instance reading an already-cached row never calls its prerenderer)

Linear: https://linear.app/cardstack/issue/CS-10953

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia force-pushed the cs-10953-cross-process-populate-coordination branch from bffbfea to 030d174 Compare May 4, 2026 23:35