
Break prerender deadlock during indexing-time federated search #4777

Merged
habdelra merged 5 commits into main from indexer-perf/prerender-search-cacheonly on May 12, 2026

Conversation

@habdelra (Contributor) commented May 12, 2026

Summary

This PR removes two load-related bottlenecks in the indexing-time
path that fires _federated-search from inside a prerender tab:

  1. The realm server now learns when a search request is coming from
    the host SPA mid-prerender, and short-circuits the recursive
    lookupDefinition walk in populateQueryFields so an indexing
    render can't fan out into self-referential prerender chains.
  2. The default node-postgres pool size grows from 10 to 40 — enough
    headroom that the realm server's concurrent search workload
    actually exercises SQL instead of queueing on the pg client's
    internal connection-acquire queue.

Background — the deadlock pattern

During card indexing, the worker calls the prerender server to
render a card. The prerender server navigates a Chrome tab to the
host SPA's render route. As Glimmer renders the card, the host fires
_federated-search to the realm server to populate computed
query-fields (linksTo / linksToMany with a generated query).

RealmIndexQueryEngine.loadLinks walks each linked resource and
calls populateQueryFields. For a resource whose adoptsFrom
definition isn't already in the module cache, lookupDefinition
fires a new prerender to extract the definition. That nested
prerender enters the same tab queue the original render is sitting
in. The original render's tab is busy waiting on the data; the
nested prerender's tab can't be allocated because every page in the
affinity-scoped pool is taken; the system stalls until the
render-timeout machinery fires.

The recovery option cacheOnlyDefinitions had already been added to
RealmIndexQueryEngine.Options for this exact case — its existing
docstring literally says "avoids deadlocks when file-meta responses
are served during card prerendering" — but no caller was setting
it. This PR plumbs the signal so the right caller does.

The trigger condition (a wide query-field fan-out under concurrent
search activity) can be reached during serial indexing too. The
parallel-indexing work that exposed it in benchmark runs is being
landed separately.

How the in-prerender signal travels

The host SPA cannot self-identify as running inside a prerender tab —
it's the same code that runs in a real user browser. The signal has
to come from the prerender server.

  1. page-pool.ts injects globalThis.__boxelDuringPrerender = true into
    every Chrome tab it creates, via puppeteer's evaluateOnNewDocument.
    The host SPA reads the flag once at
    /_standby boot and the value survives across the prerender
    server's later transitionTo-driven renders.

  2. services/store.ts in the host reads that flag and adds an
    x-boxel-during-prerender header to its outbound
    _federated-search requests. The header is added on those
    fetches only — not on icon-server requests, vite asset requests,
    or anything else the host happens to load — so no other service
    has to widen its CORS allow-list. (Steps 1-2 are sketched just
    after this list.)

  3. handleSearch reads the header and threads
    cacheOnlyDefinitions: true through
    searchRealms → Realm.search → RealmIndexQueryEngine.searchCards.
    The same plumbing is added to the single-realm _search route
    in realm.ts. The realm server's CORS allow-list gains
    X-Boxel-During-Prerender so the preflight succeeds.

  4. RealmIndexQueryEngine.populateQueryFields already had two
    paths. When the resource carries pre-extracted queryFieldDefs
    in its meta (the indexer writes this during file extract), it
    uses populateQueryFieldsFromMeta which doesn't need a live
    lookupDefinition at all. When metadata is missing,
    lookupDefinitionForOpts falls through to
    lookupCachedDefinition — pure DB read, no prerender
    side-effect. Either branch keeps the original render's tab free
    for the work it was already doing.
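
A minimal sketch of the two client-side legs (steps 1 and 2), assuming the
flag and header names quoted above; the surrounding function shapes are
illustrative, and the federatedSearchHeaders wrapper is hypothetical:

```ts
import type { Page } from 'puppeteer';

// Prerender-server side (page-pool.ts): stamp the flag before any page
// script runs, so the host SPA sees it already set at /_standby boot.
async function markPageAsInPrerender(page: Page): Promise<void> {
  await page.evaluateOnNewDocument(() => {
    (globalThis as unknown as { __boxelDuringPrerender?: boolean }).__boxelDuringPrerender =
      true;
  });
}

// Host side (services/store.ts): read the flag once and stamp the marker
// header on outbound _federated-search fetches only.
const DURING_PRERENDER_HEADER = 'x-boxel-during-prerender';

const duringPrerender =
  (globalThis as unknown as { __boxelDuringPrerender?: boolean })
    .__boxelDuringPrerender === true;

function federatedSearchHeaders(
  base: Record<string, string> = {},
): Record<string, string> {
  // User-facing searches (flag unset) keep their headers unchanged.
  return duringPrerender
    ? { ...base, [DURING_PRERENDER_HEADER]: 'true' }
    : base;
}
```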

User-facing searches (not from inside a prerender tab) don't carry
the header and behave exactly as before. This change is invisible
outside the indexing path.
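
On the realm-server side (step 3), a minimal sketch of the header check and
option threading, assuming a fetch-style Request; the handler shape is
illustrative, and only the header name and the cacheOnlyDefinitions option
come from this PR:

```ts
const DURING_PRERENDER_HEADER = 'x-boxel-during-prerender';

interface SearchableRealm {
  search(
    query: unknown,
    opts?: { cacheOnlyDefinitions?: boolean },
  ): Promise<unknown>;
}

// Works for both Request.headers.get() (null when absent) and koa's
// ctxt.get() ('' when absent): both normalize to a length check.
function isDuringPrerender(headerValue: string | null | undefined): boolean {
  return (headerValue ?? '').length > 0;
}

async function handleFederatedSearch(
  request: Request,
  realms: SearchableRealm[],
): Promise<unknown[]> {
  let cacheOnlyDefinitions = isDuringPrerender(
    request.headers.get(DURING_PRERENDER_HEADER),
  );
  let query: unknown = await request.json();
  // cacheOnlyDefinitions keeps populateQueryFields on cache/meta-only
  // paths, so the search never requests a nested prerender from inside
  // a render.
  return Promise.all(
    realms.map((realm) => realm.search(query, { cacheOnlyDefinitions })),
  );
}
```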

pg pool sizing

node-postgres ships with a 10-connection default. Under any
concurrent search workload (whether that's a single render's
loadLinks walk or two simultaneous user-facing queries), each
in-flight search runs primaryQuery plus a stack of per-layer
loadLinks queries; each query holds a connection for the full SQL
round-trip. With a few concurrent renders, peak demand quickly
exceeds 20. Any waiter past max queues in node-postgres' internal
acquire queue — and that wait is indistinguishable from slow SQL in
diagnostic logs.

Indexing-job diagnostics on a 461-card realm made this visible:
slow-search logs showed primaryQuery=70-75s for queries returning
1-3 rows. That wasn't the query itself; it was connection-acquire
wait. After bumping the pool default to 40, the same queries
returned in 5-16s (the actual SQL latency under heavy concurrent
load).

Sizing logic, captured in the new comment in pg-adapter.ts (a config
sketch follows the list):

  • Peak observed connection demand under indexing-pass concurrency:
    ~25 connections
  • Headroom for realm-server non-search work (advisory locks,
    indexer writes, NOTIFY dispatch) keeps a search burst from
    crowding out the indexer's own commits
  • 40 leaves comfortable margin without approaching hosted RDS
    default max_connections (staging db.r7g.large ≈ 1700, prod
    db.r7g.xlarge ≈ 3500). At ~6 realm-server-side processes per
    fleet × 40 = 240 connections, the new default uses 6-12% of
    that budget.
  • The PG_POOL_MAX env var override is still honored for
    per-fleet tuning.
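
As a rough sketch of how that sizing might be expressed in pg-adapter.ts
(the default value and the PG_POOL_MAX variable name come from this PR;
the surrounding config shape is illustrative):

```ts
import { Pool } from 'pg';

// ~25 observed at indexing-pass peak, plus headroom for advisory locks,
// indexer writes, and NOTIFY dispatch; still far below hosted RDS
// max_connections.
const DEFAULT_PG_POOL_MAX = 40;

function poolMax(): number {
  // PG_POOL_MAX still wins for per-fleet tuning.
  let fromEnv = Number(process.env.PG_POOL_MAX);
  return Number.isFinite(fromEnv) && fromEnv > 0
    ? fromEnv
    : DEFAULT_PG_POOL_MAX;
}

const pool = new Pool({
  max: poolMax(),
  // connection settings (host, database, user, ...) come from the usual
  // PG* environment variables
});
```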

The realm-test-harness used to pin pool size to 2 to fit the
seeded test-pg's max_connections=50. Bumping the harness default
to 40 in lockstep keeps tests aligned with production behavior so
that test runs don't surface (or hide) connection-acquire patterns
that production wouldn't see. Test pg stays at max_connections=50
because pool max is a ceiling, not a steady-state allocation —
typical test fixtures (a handful of cards) keep concurrent
connection usage well below the cap.

Test plan

  • Lint passes per-package on every package touched
  • runtime-common typecheck clean
  • realm-server typecheck clean
  • host typecheck clean
  • Local from-scratch reindex of a 100-card realm with the
    default INDEX_RUNNER_MAX_CONCURRENCY=1 (serial) completes
    in the same wall-clock as main (~190s, 0 errors), confirming
    no regression on the existing baseline
  • Slow-search diagnostic logs (info level on
    realm:index-query-engine) confirm primaryQuery time
    bounds dropped from 70s+ to under 10s under concurrent search
    load
  • CORS preflight verified clean — outbound requests from
    prerender tabs to non-realm-server origins (icons, vite)
    don't carry the marker header
  • User-facing search behavior unchanged: searches not made from
    inside a prerender tab don't get cacheOnlyDefinitions: true

🤖 Generated with Claude Code

habdelra and others added 3 commits May 11, 2026 23:33
## Why

During card indexing the worker fires a prerender request to the
prerender server, which navigates a Chrome tab to the host SPA's
render route. While Glimmer renders the card, the host fires
`_federated-search` calls back to the realm server to populate
computed query-fields (linksTo / linksToMany with a generated query).

The realm server's `RealmIndexQueryEngine.loadLinks` walks each linked
resource and calls `populateQueryFields`. If a resource's
`adoptsFrom` definition isn't already in the local module cache,
`lookupDefinition` triggers a *new* prerender so the definition can
be extracted. That nested prerender enters the same tab queue the
original render is occupying.

Under parallel indexing — and even under serial indexing on a card
with a wide query-field fan-out — this can pin every page in the
affinity-scoped pool against a sub-prerender that's waiting for a
page slot that will never come. The render-timeout machinery
eventually fires, but every file the indexer would have rendered
during that window gets recorded as a failed render. Bench runs of
the prudent-octopus realm showed 4 of these timeouts per parallel
pass; on the larger ambitious-piranha realm the indexer was rejected
outright after exhausting the 12-attempt prerender retry budget.

The recovery option `cacheOnlyDefinitions` was added to
`RealmIndexQueryEngine.Options` for exactly this case — its existing
docstring says "avoids deadlocks when file-meta responses are served
during card prerendering" — but no caller was setting it. This patch
plumbs the signal end-to-end.

## How the signal travels

1. `page-pool.ts` injects `globalThis.__boxelDuringPrerender = true`
   into every Chrome tab it creates, via puppeteer's
   `evaluateOnNewDocument`. The host SPA reads it synchronously at
   boot and the value survives across the prerender server's
   `transitionTo`-driven render flow.

2. The host's federated-search fetch wrapper in `services/store.ts`
   reads that flag and adds an `x-boxel-during-prerender` header to
   its outbound search request. Narrowly scoped — only fetches that
   travel to the realm server's search endpoint get the header, so
   adjacent services (icons server, vite dev) never see it on a
   CORS preflight.

3. `handleSearch` reads the header and threads
   `cacheOnlyDefinitions: true` through `searchRealms → Realm.search
   → RealmIndexQueryEngine.searchCards`. The same plumbing is added
   to the single-realm `_search` route in `realm.ts`.

4. `RealmIndexQueryEngine.populateQueryFields` already had branching
   on `cacheOnlyDefinitions`: when a resource carries pre-extracted
   `queryFieldDefs` in its meta (the indexer writes this during
   file extract) it uses `populateQueryFieldsFromMeta`, which
   doesn't need a live `lookupDefinition`. When metadata is missing,
   `lookupDefinitionForOpts` falls through to `lookupCachedDefinition`
   — pure DB read, no prerender side-effect. Either branch keeps
   the original render's tab free (a branch sketch follows).
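
A minimal sketch of that branching, using the function and option names
quoted in this item; the resource shape and the lookup helpers are stubbed
for illustration and are not the real signatures:

```ts
interface IndexedResource {
  meta: {
    adoptsFrom: unknown;
    queryFieldDefs?: unknown;
  };
}

declare function populateQueryFieldsFromMeta(resource: IndexedResource): Promise<void>;
declare function lookupCachedDefinition(ref: unknown): Promise<unknown>; // pure DB read
declare function lookupDefinition(ref: unknown): Promise<unknown>; // may fire a prerender
declare function applyDefinition(resource: IndexedResource, def: unknown): Promise<void>;

async function populateQueryFields(
  resource: IndexedResource,
  opts: { cacheOnlyDefinitions?: boolean } = {},
): Promise<void> {
  if (resource.meta.queryFieldDefs) {
    // Indexer wrote these during file extract: no live lookup needed.
    return populateQueryFieldsFromMeta(resource);
  }
  let definition = opts.cacheOnlyDefinitions
    ? await lookupCachedDefinition(resource.meta.adoptsFrom)
    : await lookupDefinition(resource.meta.adoptsFrom);
  return applyDefinition(resource, definition);
}
```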

## Behavioral notes

- The host's outbound user-facing searches (not from inside a
  prerender tab) don't carry the header and behave exactly as
  before. This change is invisible outside the indexing path.
- `cacheOnlyDefinitions` opting out of `lookupDefinition` only
  matters for cards whose linked-resource `adoptsFrom` isn't yet
  cached. In a steady-state indexed realm those definitions are
  always cached; in a fresh from-scratch reindex the indexer's
  own pre-warm phase populates them before the parallel visits
  begin. The edge case where neither path holds — a brand-new
  realm with cold modules — degrades the query-field walk to
  "use whatever metadata was extracted at indexing time," which
  is what the existing `populateQueryFieldsFromMeta` path was
  designed to do.
- The header constant lives in runtime-common's `realm.ts` so the
  Realm class can read it without taking a realm-server dependency.
  `realm-server/prerender/prerender-constants.ts` re-exports it
  for the host SPA's import locality. Realm server's CORS
  allow-list gains `X-Boxel-During-Prerender` so the preflight
  succeeds for the host's outbound call.
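
For the CORS piece specifically, a sketch assuming a @koa/cors-style
middleware on the realm server (the existing allow-list entries shown here
are placeholders; only the new header name comes from this change):

```ts
import Koa from 'koa';
import cors from '@koa/cors';

const app = new Koa();

app.use(
  cors({
    allowHeaders: [
      'Authorization',
      'Content-Type',
      // new marker header so the host's in-prerender preflight succeeds
      'X-Boxel-During-Prerender',
    ],
  }),
);
```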

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
node-postgres ships with a 10-connection pool default. Under
INDEX_RUNNER_MAX_CONCURRENCY=4 each in-flight file render fans out
into several federated-search calls (primaryQuery + per-layer
loadLinks queries that each hold a connection for the round-trip),
and during the ambitious-piranha benchmark we observed
primaryQuery=73s for queries returning 3 rows — pure node-postgres
acquire-wait, indistinguishable from slow SQL in diagnostics.

40 covers the observed peak (~20-25) plus headroom for the
realm-server's non-search work — advisory locks, indexer writes,
NOTIFY dispatch — so a search burst doesn't crowd out the indexer's
own commits. Stays well under hosted RDS defaults (staging
db.r7g.large ≈ 1700, prod db.r7g.xlarge ≈ 3500 max_connections);
4-6 client processes × 40 = 160-240 connections total, ~6-12% of
that budget. PG_POOL_MAX env override is still honored for
per-fleet tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realm-test-harness was explicitly setting PG_POOL_MAX=2 for the
isolated test stacks it spawns, far below the new production default
of 40. With parallel indexing in tests, that small pool would create
connection-wait masquerading as slow SQL — a flake source that's
hard to diagnose because the symptom looks like the query itself
being slow.

Match the production pg-adapter default (40). The test pg's
max_connections=50 still fits because pool max is a ceiling, not a
steady-state allocation — typical test fixtures (a handful of cards)
keep concurrent connection usage well below the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d12e6b49ce


Comment thread packages/realm-test-harness/src/shared.ts Outdated
@github-actions (Bot) commented May 12, 2026

Preview deployments

Host Test Results

1 files ±0   1 suites ±0   2h 5m 58s ⏱️ (+1m 23s)
2,654 tests ±0   2,639 ✅ ±0   15 💤 ±0   0 ❌ ±0
2,673 runs ±0   2,658 ✅ ±0   15 💤 ±0   0 ❌ ±0

Results for commit b3a2ab7. ± Comparison against earlier commit fdd15f9.

Realm Server Test Results

1 files ±0   1 suites ±0   13m 3s ⏱️ (+44s)
1,321 tests ±0   1,321 ✅ +34   0 💤 ±0   0 ❌ -34
1,400 runs ±0   1,400 ✅ +34   0 💤 ±0   0 ❌ -34

Results for commit b3a2ab7. ± Comparison against earlier commit fdd15f9.

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to unblock parallel realm indexing by preventing indexing-time _federated-search calls (originating from inside prerender tabs) from triggering recursive prerender/module-definition lookups that can deadlock the prerender page pool, and by increasing Postgres connection pool capacity to avoid connection-acquire stalls being misdiagnosed as slow SQL.

Changes:

  • Plumbs an “in prerender tab” signal from the prerender server → host SPA → realm-server search handlers, enabling cacheOnlyDefinitions:true during indexing-time searches.
  • Increases the default node-postgres pool max from the library default to 40 (while preserving env overrides), and aligns the test harness default accordingly.
  • Updates realm-server CORS allow-list to permit the new marker header.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| packages/runtime-common/search-utils.ts | Extends searchRealms() to pass optional cacheOnlyDefinitions to realm searches. |
| packages/runtime-common/realm.ts | Adds DURING_PRERENDER_HEADER and threads cacheOnlyDefinitions into _search handling. |
| packages/host/app/services/store.ts | Adds the marker header to federated search requests when running inside prerender tabs. |
| packages/realm-server/handlers/handle-search.ts | Detects marker header and forwards cacheOnlyDefinitions into multi-realm search flow. |
| packages/realm-server/server.ts | Adds X-Boxel-During-Prerender to CORS allowHeaders. |
| packages/realm-server/prerender/page-pool.ts | Injects a global flag into new pages via evaluateOnNewDocument to let the host SPA detect prerender context. |
| packages/realm-server/prerender/prerender-constants.ts | Re-exports the marker header constant for realm-server-side locality. |
| packages/postgres/pg-adapter.ts | Raises default pg pool max to 40 and always passes max into Pool config. |
| packages/realm-test-harness/src/shared.ts | Updates harness default pool max to 40 to mirror production defaults. |


Comment thread packages/runtime-common/realm.ts Outdated
Comment thread packages/realm-server/prerender/page-pool.ts Outdated
Comment thread packages/realm-server/handlers/handle-search.ts Outdated
Comment thread packages/realm-test-harness/src/shared.ts Outdated
habdelra and others added 2 commits May 12, 2026 00:00
- prettier: rewrap the long globalThis cast in #markPageAsInPrerender
- page-pool.ts: untangle the split block comments — restore the
  #attachPageObservability preamble in one piece and give
  #markPageAsInPrerender its own self-contained comment
- realm.ts: rewrite the DURING_PRERENDER_HEADER comment to describe
  the actual mechanism (puppeteer evaluateOnNewDocument injects a
  window global; host fetch wrapper reads it and adds the header on
  outbound _federated-search / _search calls only). The previous
  comment described a setExtraHTTPHeaders approach that was never
  shipped.
- presence check: unify the two DURING_PRERENDER_HEADER sites on
  `.length > 0` so handle-search.ts (koa ctxt.get returns '') and
  realm.ts (Request.headers.get returns null) have the same grep-able
  shape.
- realm-test-harness: drop DEFAULT_PG_POOL_MAX back to 20 and
  document the math against the seeded test pg's max_connections=50
  ceiling (one stack = 2 pg-pool clients × 20 = 40 peak, leaving
  headroom for the harness's own client). Production stays at the
  pg-adapter default of 40 because hosted RDS has 1700+ connections.
prerender-deadlock-test.ts uses a hand-rolled minimal Page object
that doesn't include every puppeteer method — `evaluateOnNewDocument`
isn't in the surface those tests care about, and calling it produces
`TypeError: page.evaluateOnNewDocument is not a function` from
inside #createStandby, which surfaces as the misleading "No standby
page available for prerender" error that the tests then trip over.

The window-global injection is observability — if it can't be set,
the host fetch wrapper will simply not stamp the during-prerender
header, which under any sane test environment is harmless. Guard the
call and skip it on stubs.
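
A minimal sketch of that guard, assuming the hand-rolled Page stub described
above (names illustrative):

```ts
async function markPageAsInPrerender(page: {
  evaluateOnNewDocument?: (fn: () => void) => Promise<unknown>;
}): Promise<void> {
  if (typeof page.evaluateOnNewDocument !== 'function') {
    // Minimal test stubs don't implement this puppeteer method; without the
    // flag the host simply never stamps the during-prerender header, which
    // is harmless in tests.
    return;
  }
  await page.evaluateOnNewDocument(() => {
    (globalThis as unknown as { __boxelDuringPrerender?: boolean }).__boxelDuringPrerender =
      true;
  });
}
```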

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 12, 2026 08:46
@habdelra habdelra merged commit a0e1eb0 into main May 12, 2026
78 of 79 checks passed