Break prerender deadlock during indexing-time federated search#4777
Conversation
## Why

During card indexing the worker fires a prerender request to the prerender server, which navigates a Chrome tab to the host SPA's render route. While Glimmer renders the card, the host fires `_federated-search` calls back to the realm server to populate computed query-fields (linksTo / linksToMany with a generated query). The realm server's `RealmIndexQueryEngine.loadLinks` walks each linked resource and calls `populateQueryFields`. If a resource's `adoptsFrom` definition isn't already in the local module cache, `lookupDefinition` triggers a *new* prerender so the definition can be extracted. That nested prerender enters the same tab queue the original render is occupying.

Under parallel indexing — and even under serial indexing on a card with a wide query-field fan-out — this can pin every page in the affinity-scoped pool against a sub-prerender that's waiting for a page slot that will never come. The render-timeout machinery eventually fires, but every file the indexer would have rendered during that window gets recorded as a failed render. Bench runs of the prudent-octopus realm showed 4 of these timeouts per parallel pass; on the larger ambitious-piranha realm the indexer was rejected outright after exhausting the 12-attempt prerender retry budget.

The recovery option `cacheOnlyDefinitions` was added to `RealmIndexQueryEngine.Options` for exactly this case — its existing docstring says "avoids deadlocks when file-meta responses are served during card prerendering" — but no caller was setting it. This patch plumbs the signal end-to-end.

## How the signal travels

1. `page-pool.ts` injects `globalThis.__boxelDuringPrerender = true` into every Chrome tab it creates, via puppeteer's `evaluateOnNewDocument`. The host SPA reads it synchronously at boot, and the value survives across the prerender server's `transitionTo`-driven render flow.
2. The host's federated-search fetch wrapper in `services/store.ts` reads that flag and adds an `x-boxel-during-prerender` header to its outbound search request. The header is narrowly scoped — only fetches that travel to the realm server's search endpoint get it, so adjacent services (icons server, vite dev) never see it on a CORS preflight.
3. `handleSearch` reads the header and threads `cacheOnlyDefinitions: true` through `searchRealms → Realm.search → RealmIndexQueryEngine.searchCards`. The same plumbing is added to the single-realm `_search` route in `realm.ts`.
4. `RealmIndexQueryEngine.populateQueryFields` already branched on `cacheOnlyDefinitions`: when a resource carries pre-extracted `queryFieldDefs` in its meta (the indexer writes this during file extract), it uses `populateQueryFieldsFromMeta`, which doesn't need a live `lookupDefinition`. When metadata is missing, `lookupDefinitionForOpts` falls through to `lookupCachedDefinition` — a pure DB read with no prerender side effect. Either branch keeps the original render's tab free.

## Behavioral notes

- The host's outbound user-facing searches (not from inside a prerender tab) don't carry the header and behave exactly as before. This change is invisible outside the indexing path.
- `cacheOnlyDefinitions` opting out of `lookupDefinition` only matters for cards whose linked-resource `adoptsFrom` isn't yet cached. In a steady-state indexed realm those definitions are always cached; in a fresh from-scratch reindex the indexer's own pre-warm phase populates them before the parallel visits begin. The edge case where neither path holds — a brand-new realm with cold modules — degrades the query-field walk to "use whatever metadata was extracted at indexing time," which is what the existing `populateQueryFieldsFromMeta` path was designed to do.
- The header constant lives in runtime-common's `realm.ts` so the Realm class can read it without taking a realm-server dependency. `realm-server/prerender/prerender-constants.ts` re-exports it for the host SPA's import locality. The realm server's CORS allow-list gains `X-Boxel-During-Prerender` so the preflight succeeds for the host's outbound call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
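The two ends of the signal can be sketched as follows. This models the contract only — a boolean window global set before any page script runs, and a fetch wrapper that stamps the header when the flag is present; the helper name and the header-building shape are hypothetical stand-ins for the real code in `page-pool.ts` and `services/store.ts`.

```typescript
// Assumed constant name; the real one lives in runtime-common's realm.ts.
const DURING_PRERENDER_HEADER = "x-boxel-during-prerender";

// Realm-server side: what page-pool.ts asks puppeteer to run in every tab.
// With a real puppeteer Page this would be roughly:
//   await page.evaluateOnNewDocument(() => {
//     (globalThis as any).__boxelDuringPrerender = true;
//   });
// which executes before any of the host SPA's own scripts.

// Host side (illustrative): build headers for an outbound _federated-search
// fetch, adding the marker only when the injected global is set.
function federatedSearchHeaders(
  base: Record<string, string> = {},
): Record<string, string> {
  const headers = { ...base };
  if ((globalThis as any).__boxelDuringPrerender === true) {
    headers[DURING_PRERENDER_HEADER] = "true";
  }
  return headers;
}

// Outside a prerender tab the flag is unset and the header is never added.
console.log(federatedSearchHeaders()[DURING_PRERENDER_HEADER]); // → undefined

// Inside a prerender tab the injected global flips it on.
(globalThis as any).__boxelDuringPrerender = true;
console.log(federatedSearchHeaders()[DURING_PRERENDER_HEADER]); // → "true"
```

Because the injection runs via `evaluateOnNewDocument`, no user-facing browser ever sees the flag, so the header can only originate from a prerender tab.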
node-postgres ships with a 10-connection pool default. Under INDEX_RUNNER_MAX_CONCURRENCY=4 each in-flight file render fans out into several federated-search calls (primaryQuery plus per-layer loadLinks queries that each hold a connection for the round-trip), and during the ambitious-piranha benchmark we observed primaryQuery=73s for queries returning 3 rows — pure node-postgres acquire-wait, indistinguishable from slow SQL in diagnostics.

40 covers the observed peak (~20-25) plus headroom for the realm server's non-search work — advisory locks, indexer writes, NOTIFY dispatch — so a search burst doesn't crowd out the indexer's own commits. It stays well under hosted RDS defaults (staging db.r7g.large ≈ 1700, prod db.r7g.xlarge ≈ 3500 max_connections); 4-6 client processes × 40 = 160-240 connections total, ~6-12% of that budget. The PG_POOL_MAX env override is still honored for per-fleet tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
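The override-plus-default rule described above can be sketched as below. The helper name is hypothetical; the real change lives in `pg-adapter.ts`, where the resolved value would be handed to node-postgres as `new Pool({ max, ... })`.

```typescript
// New default replacing node-postgres' built-in 10 (see sizing notes above).
const DEFAULT_PG_POOL_MAX = 40;

// Honor PG_POOL_MAX when it is set to a usable number, otherwise fall back
// to the default. Helper name and validation details are illustrative.
function resolvePoolMax(env: Record<string, string | undefined>): number {
  const raw = env.PG_POOL_MAX;
  if (raw !== undefined) {
    const parsed = Number(raw);
    if (Number.isFinite(parsed) && parsed > 0) {
      return Math.floor(parsed);
    }
  }
  return DEFAULT_PG_POOL_MAX;
}

console.log(resolvePoolMax({})); // → 40 (new default)
console.log(resolvePoolMax({ PG_POOL_MAX: "15" })); // → 15 (per-fleet override)
```

In production the `env` argument would be `process.env`, so per-fleet tuning keeps working without a code change.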
realm-test-harness was explicitly setting PG_POOL_MAX=2 for the isolated test stacks it spawns, far below the new production default of 40. With parallel indexing in tests, that small pool would create connection-wait masquerading as slow SQL — a flake source that's hard to diagnose because the symptom looks like the query itself being slow.

Match the production pg-adapter default (40). The test pg's max_connections=50 still fits because pool max is a ceiling, not a steady-state allocation — typical test fixtures (a handful of cards) keep concurrent connection usage well below the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d12e6b49ce
Preview deployments

Host Test Results: 1 files ±0, 1 suites ±0, 2h 5m 58s ⏱️ +1m 23s. Results for commit b3a2ab7, compared against earlier commit fdd15f9.

Realm Server Test Results: 1 files ±0, 1 suites ±0, 13m 3s ⏱️ +44s. Results for commit b3a2ab7, compared against earlier commit fdd15f9.
Pull request overview
This PR aims to unblock parallel realm indexing by preventing indexing-time _federated-search calls (originating from inside prerender tabs) from triggering recursive prerender/module-definition lookups that can deadlock the prerender page pool, and by increasing Postgres connection pool capacity to avoid connection-acquire stalls being misdiagnosed as slow SQL.
Changes:
- Plumbs an “in prerender tab” signal from the prerender server → host SPA → realm-server search handlers, enabling `cacheOnlyDefinitions: true` during indexing-time searches.
- Increases the default node-postgres pool max from the library default to 40 (while preserving env overrides), and aligns the test harness default accordingly.
- Updates realm-server CORS allow-list to permit the new marker header.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/runtime-common/search-utils.ts | Extends searchRealms() to pass optional cacheOnlyDefinitions to realm searches. |
| packages/runtime-common/realm.ts | Adds DURING_PRERENDER_HEADER and threads cacheOnlyDefinitions into _search handling. |
| packages/host/app/services/store.ts | Adds the marker header to federated search requests when running inside prerender tabs. |
| packages/realm-server/handlers/handle-search.ts | Detects marker header and forwards cacheOnlyDefinitions into multi-realm search flow. |
| packages/realm-server/server.ts | Adds X-Boxel-During-Prerender to CORS allowHeaders. |
| packages/realm-server/prerender/page-pool.ts | Injects a global flag into new pages via evaluateOnNewDocument to let the host SPA detect prerender context. |
| packages/realm-server/prerender/prerender-constants.ts | Re-exports the marker header constant for realm-server-side locality. |
| packages/postgres/pg-adapter.ts | Raises default pg pool max to 40 and always passes max into Pool config. |
| packages/realm-test-harness/src/shared.ts | Updates harness default pool max to 40 to mirror production defaults. |
- prettier: rewrap the long globalThis cast in #markPageAsInPrerender
- page-pool.ts: untangle the split block comments — restore the #attachPageObservability preamble in one piece and give #markPageAsInPrerender its own self-contained comment
- realm.ts: rewrite the DURING_PRERENDER_HEADER comment to describe the actual mechanism (puppeteer evaluateOnNewDocument injects a window global; the host fetch wrapper reads it and adds the header on outbound _federated-search / _search calls only). The previous comment described a setExtraHTTPHeaders approach that was never shipped.
- presence check: unify the two DURING_PRERENDER_HEADER sites on `.length > 0` so handle-search.ts (koa's ctxt.get returns '') and realm.ts (Request.headers.get returns null) have the same grep-able shape.
- realm-test-harness: drop DEFAULT_PG_POOL_MAX back to 20 and document the math against the seeded test pg's max_connections=50 ceiling (one stack = 2 pg-pool clients × 20 = 40 peak, leaving headroom for the harness's own client). Production stays at the pg-adapter default of 40 because hosted RDS has 1700+ connections.
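The presence-check unification asked for above can be sketched like this. The helper name is hypothetical; the point is that koa's `ctx.get()` returns `''` for a missing header while the Fetch API's `Headers.get()` returns `null`, so normalizing to `''` first lets both call sites share the same `.length > 0` shape.

```typescript
// Assumed constant name, mirroring the header discussed in the review.
const DURING_PRERENDER_HEADER = "x-boxel-during-prerender";

// Works for both raw-value shapes: '' (koa ctxt.get) and null (Headers.get).
function hasDuringPrerenderHeader(rawValue: string | null): boolean {
  return (rawValue ?? "").length > 0;
}

// koa site: ctxt.get(DURING_PRERENDER_HEADER) yields '' when absent.
console.log(hasDuringPrerenderHeader("")); // → false
// Fetch Request site: request.headers.get(...) yields null when absent.
console.log(hasDuringPrerenderHeader(null)); // → false
// Either site, header present.
console.log(hasDuringPrerenderHeader("true")); // → true
```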
prerender-deadlock-test.ts uses a hand-rolled minimal Page object that doesn't include every puppeteer method — `evaluateOnNewDocument` isn't in the surface those tests care about, and calling it produces `TypeError: page.evaluateOnNewDocument is not a function` from inside #createStandby, which surfaces as the misleading "No standby page available for prerender" the tests then trip over. The window-global injection is observability — if it can't be set, the host fetch wrapper will simply not stamp the during-prerender header, which under any sane test environment is harmless. Guard the call and skip it on stubs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
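The guard described in that commit can be sketched as below. The interface and return value are illustrative, not the PR's actual code: the idea is simply to skip the injection when a hand-rolled stub page omits `evaluateOnNewDocument`, since a missing flag only means the during-prerender header is never stamped.

```typescript
// Minimal shape of a page that may or may not support script injection.
interface MaybeInjectablePage {
  evaluateOnNewDocument?: (fn: () => void) => Promise<void> | void;
}

// Returns true when the flag was injected, false when skipped on a stub.
async function markPageAsInPrerender(
  page: MaybeInjectablePage,
): Promise<boolean> {
  if (typeof page.evaluateOnNewDocument !== "function") {
    return false; // stub page from a test harness: skip injection
  }
  await page.evaluateOnNewDocument(() => {
    (globalThis as any).__boxelDuringPrerender = true;
  });
  return true;
}

// A minimal stub (like the test harness's Page) is skipped, not crashed on:
markPageAsInPrerender({}).then((injected) => console.log(injected)); // → false
```

On a real puppeteer `Page` the method exists and the injection proceeds normally, so production behavior is unchanged.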
Summary

This PR removes two load-related bottlenecks in the indexing-time path that fires `_federated-search` from inside a prerender tab:

- Plumbs an in-prerender signal from the prerender server to the host SPA mid-prerender, and short-circuits the recursive `lookupDefinition` walk in `populateQueryFields` so an indexing render can't fan out into self-referential prerender chains.
- Gives the pg pool enough headroom that the realm server's concurrent search workload actually exercises SQL instead of queueing on the pg client's internal connection-acquire queue.
Background — the deadlock pattern

During card indexing, the worker calls the prerender server to render a card. The prerender server navigates a Chrome tab to the host SPA's render route. As Glimmer renders the card, the host fires `_federated-search` to the realm server to populate computed query-fields (linksTo / linksToMany with a generated query). `RealmIndexQueryEngine.loadLinks` walks each linked resource and calls `populateQueryFields`. For a resource whose `adoptsFrom` definition isn't already in the module cache, `lookupDefinition` fires a new prerender to extract the definition. That nested prerender enters the same tab queue the original render is sitting in. The original render's tab is busy waiting on the data; the nested prerender's tab can't be allocated because every page in the affinity-scoped pool is taken; the system stalls until the render-timeout machinery fires.
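The deadlock shape can be reduced to a toy model (all names hypothetical, not the PR's page-pool code): a fixed-size pool where a render holding a page triggers a nested acquisition while every slot is occupied.

```typescript
// Toy pool: tryAcquire returning null models "would queue forever" in the
// real affinity-scoped pool, since nothing will release while we wait.
class PagePool {
  private free: number;
  constructor(size: number) {
    this.free = size;
  }
  tryAcquire(): number | null {
    if (this.free === 0) return null;
    this.free -= 1;
    return this.free; // token: remaining free slots (illustrative)
  }
  release(): void {
    this.free += 1;
  }
}

const pool = new PagePool(1); // pathological size to make the shape obvious

const outer = pool.tryAcquire(); // original render takes the only tab
console.log(outer !== null); // → true

// Mid-render, lookupDefinition wants a tab for a nested prerender:
const nested = pool.tryAcquire();
console.log(nested); // → null — the outer render never releases while it
// waits on this very request; that circular wait is the deadlock

// With cacheOnlyDefinitions the nested prerender is never requested, so the
// outer render finishes and releases its tab normally.
pool.release();
console.log(pool.tryAcquire() !== null); // → true
```

Larger pools only delay the stall: under parallel indexing every slot can end up holding an outer render whose nested request is queued behind the others.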
The recovery option `cacheOnlyDefinitions` had already been added to `RealmIndexQueryEngine.Options` for this exact case — its existing docstring literally says "avoids deadlocks when file-meta responses are served during card prerendering" — but no caller was setting it. This PR plumbs the signal so the right caller does.

The trigger condition (a wide query-field fan-out under concurrent search activity) can be reached during serial indexing too. The parallel-indexing work that exposed it in benchmark runs is being landed separately.
How the in-prerender signal travels

The host SPA cannot self-identify as running inside a prerender tab — it's the same code that runs in a real user browser. The signal has to come from the prerender server.

1. `page-pool.ts` injects `globalThis.__boxelDuringPrerender = true` into every Chrome tab it creates, via puppeteer's `evaluateOnNewDocument`. The host SPA reads the flag once at `/_standby` boot and the value survives across the prerender server's later `transitionTo`-driven renders.
2. `services/store.ts` in the host reads that flag and adds an `x-boxel-during-prerender` header to its outbound `_federated-search` requests. The header is added on those fetches only — not on icon-server requests, vite asset requests, or anything else the host happens to load — so no other service has to widen its CORS allow-list.
3. `handleSearch` reads the header and threads `cacheOnlyDefinitions: true` through `searchRealms → Realm.search → RealmIndexQueryEngine.searchCards`. The same plumbing is added to the single-realm `_search` route in `realm.ts`. The realm server's CORS allow-list gains `X-Boxel-During-Prerender` so the preflight succeeds.
4. `RealmIndexQueryEngine.populateQueryFields` already had two paths. When the resource carries pre-extracted `queryFieldDefs` in its meta (the indexer writes this during file extract), it uses `populateQueryFieldsFromMeta`, which doesn't need a live `lookupDefinition` at all. When metadata is missing, `lookupDefinitionForOpts` falls through to `lookupCachedDefinition` — pure DB read, no prerender side-effect. Either branch keeps the original render's tab free for the work it was already doing.

User-facing searches (not from inside a prerender tab) don't carry the header and behave exactly as before. This change is invisible outside the indexing path.
pg pool sizing

node-postgres ships with a 10-connection default. Under any concurrent search workload (whether that's a single render's `loadLinks` walk or two simultaneous user-facing queries) each in-flight search runs primaryQuery plus a stack of per-layer `loadLinks` queries; each query holds a connection for the full SQL round-trip. With a few concurrent renders, peak demand quickly exceeds 20. Any waiter past `max` queues in node-postgres' internal acquire queue — and that wait is indistinguishable from slow SQL in diagnostic logs.

Indexing-job diagnostics on a 461-card realm made this visible: slow-search logs showed primaryQuery=70-75s for queries returning 1-3 rows. Not the query — connection-acquire wait. After bumping the pool default to 40 the same queries returned in 5-16s (the actual SQL latency under heavy concurrent load).

Sizing logic, captured in the new comment in `pg-adapter.ts`:

- Observed peak search demand is roughly ~25 connections; headroom above that for the realm server's non-search work (advisory locks, indexer writes, NOTIFY dispatch) keeps a search burst from crowding out the indexer's own commits.
- 40 stays well under hosted RDS default `max_connections` (staging db.r7g.large ≈ 1700, prod db.r7g.xlarge ≈ 3500). At ~6 realm-server-side processes per fleet × 40 = 240 connections, the new default uses 6-12% of that budget.
- The `PG_POOL_MAX` env var override is still honored for per-fleet tuning.

The realm-test-harness used to pin pool size to 2 to fit the seeded test-pg's `max_connections=50`. Bumping the harness default to 40 in lockstep keeps tests aligned with production behavior so that test runs don't surface (or hide) connection-acquire patterns that production wouldn't see. Test pg stays at `max_connections=50` because pool max is a ceiling, not a steady-state allocation — typical test fixtures (a handful of cards) keep concurrent connection usage well below the cap.
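The demand model behind the "peak quickly exceeds 20" figure can be written out. The numbers here are illustrative (per-render query counts vary by card shape), not measurements from the PR.

```typescript
// Each in-flight render holds one connection per concurrently running query:
// a primaryQuery plus a few per-layer loadLinks queries.
function peakConnections(
  inFlightRenders: number,
  concurrentQueriesPerRender: number,
): number {
  return inFlightRenders * concurrentQueriesPerRender;
}

// 4 concurrent renders × ~5-6 concurrent queries each lands in the observed
// ~20-25 range — past node-postgres' default max of 10, so waiters queue;
// a max of 40 covers the peak with headroom for non-search work.
console.log(peakConnections(4, 5)); // → 20
console.log(peakConnections(4, 6)); // → 24
```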
Test plan

- A full reindex at the default `INDEX_RUNNER_MAX_CONCURRENCY=1` (serial) completes in the same wall-clock as main (~190s, 0 errors), confirming no regression on the existing baseline.
- Slow-search diagnostics (`realm:index-query-engine`) confirm `primaryQuery` time bounds dropped from 70s+ to under 10s under concurrent search load.
- Requests from prerender tabs to non-realm-server origins (icons, vite) don't carry the marker header.
- Searches not made from inside a prerender tab don't get `cacheOnlyDefinitions: true`.

🤖 Generated with Claude Code