
Break prerender deadlock during indexing-time federated search #4777

Merged
habdelra merged 5 commits into main from indexer-perf/prerender-search-cacheonly on May 12, 2026

Conversation

@habdelra (Contributor) commented May 12, 2026

Summary

This PR removes two load-related bottlenecks in the indexing-time
path that fires _federated-search from inside a prerender tab:

  1. The realm server now learns when a search request is coming from
    the host SPA mid-prerender, and short-circuits the recursive
    lookupDefinition walk in populateQueryFields so an indexing
    render can't fan out into self-referential prerender chains.
  2. The default node-postgres pool size grows from 10 to 40 — enough
    headroom that the realm server's concurrent search workload
    actually exercises SQL instead of queueing on the pg client's
    internal connection-acquire queue.

Background — the deadlock pattern

During card indexing, the worker calls the prerender server to
render a card. The prerender server navigates a Chrome tab to the
host SPA's render route. As Glimmer renders the card, the host fires
_federated-search to the realm server to populate computed
query-fields (linksTo / linksToMany with a generated query).

RealmIndexQueryEngine.loadLinks walks each linked resource and
calls populateQueryFields. For a resource whose adoptsFrom
definition isn't already in the module cache, lookupDefinition
fires a new prerender to extract the definition. That nested
prerender enters the same tab queue the original render is sitting
in. The original render's tab is busy waiting on the data; the
nested prerender's tab can't be allocated because every page in the
affinity-scoped pool is taken; the system stalls until the
render-timeout machinery fires.

The recovery option cacheOnlyDefinitions had already been added to
RealmIndexQueryEngine.Options for this exact case — its existing
docstring literally says "avoids deadlocks when file-meta responses
are served during card prerendering" — but no caller was setting
it. This PR plumbs the signal so the right caller does.

The trigger condition (a wide query-field fan-out under concurrent
search activity) can be reached during serial indexing too. The
parallel-indexing work that exposed it in benchmark runs is being
landed separately.

How the in-prerender signal travels

The host SPA cannot self-identify as running inside a prerender tab —
it's the same code that runs in a real user browser. The signal has
to come from the prerender server.

  1. page-pool.ts injects globalThis.__boxelDuringPrerender = true into
    every Chrome tab it creates, via puppeteer's evaluateOnNewDocument.
    The host SPA reads the flag once at
    /_standby boot and the value survives across the prerender
    server's later transitionTo-driven renders.

  2. services/store.ts in the host reads that flag and adds an
    x-boxel-during-prerender header to its outbound
    _federated-search requests. The header is added on those
    fetches only — not on icon-server requests, vite asset requests,
    or anything else the host happens to load — so no other service
    has to widen its CORS allow-list. (Steps 1-2 are sketched just
    after this list.)

  3. handleSearch reads the header and threads
    cacheOnlyDefinitions: true through
    searchRealms → Realm.search → RealmIndexQueryEngine.searchCards.
    The same plumbing is added to the single-realm _search route
    in realm.ts. The realm server's CORS allow-list gains
    X-Boxel-During-Prerender so the preflight succeeds.

  4. RealmIndexQueryEngine.populateQueryFields already had two
    paths. When the resource carries pre-extracted queryFieldDefs
    in its meta (the indexer writes this during file extract), it
    uses populateQueryFieldsFromMeta which doesn't need a live
    lookupDefinition at all. When metadata is missing,
    lookupDefinitionForOpts falls through to
    lookupCachedDefinition — pure DB read, no prerender
    side-effect. Either branch keeps the original render's tab free
    for the work it was already doing.
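
A minimal sketch of the two client-side legs (steps 1 and 2), assuming the
flag and header names quoted above; the surrounding function shapes are
illustrative, and the federatedSearchHeaders wrapper is hypothetical:

```ts
import type { Page } from 'puppeteer';

// Prerender-server side (page-pool.ts): stamp the flag before any page
// script runs, so the host SPA sees it already set at /_standby boot.
async function markPageAsInPrerender(page: Page): Promise<void> {
  await page.evaluateOnNewDocument(() => {
    (globalThis as unknown as { __boxelDuringPrerender?: boolean }).__boxelDuringPrerender =
      true;
  });
}

// Host side (services/store.ts): read the flag once and stamp the marker
// header on outbound _federated-search fetches only.
const DURING_PRERENDER_HEADER = 'x-boxel-during-prerender';

const duringPrerender =
  (globalThis as unknown as { __boxelDuringPrerender?: boolean })
    .__boxelDuringPrerender === true;

function federatedSearchHeaders(
  base: Record<string, string> = {},
): Record<string, string> {
  // User-facing searches (flag unset) keep their headers unchanged.
  return duringPrerender
    ? { ...base, [DURING_PRERENDER_HEADER]: 'true' }
    : base;
}
```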

User-facing searches (not from inside a prerender tab) don't carry
the header and behave exactly as before. This change is invisible
outside the indexing path.
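
On the realm-server side (step 3), a minimal sketch of the header check and
option threading, assuming a fetch-style Request; the handler shape is
illustrative, and only the header name and the cacheOnlyDefinitions option
come from this PR:

```ts
const DURING_PRERENDER_HEADER = 'x-boxel-during-prerender';

interface SearchableRealm {
  search(
    query: unknown,
    opts?: { cacheOnlyDefinitions?: boolean },
  ): Promise<unknown>;
}

// Works for both Request.headers.get() (null when absent) and koa's
// ctxt.get() ('' when absent): both normalize to a length check.
function isDuringPrerender(headerValue: string | null | undefined): boolean {
  return (headerValue ?? '').length > 0;
}

async function handleFederatedSearch(
  request: Request,
  realms: SearchableRealm[],
): Promise<unknown[]> {
  let cacheOnlyDefinitions = isDuringPrerender(
    request.headers.get(DURING_PRERENDER_HEADER),
  );
  let query: unknown = await request.json();
  // cacheOnlyDefinitions keeps populateQueryFields on cache/meta-only
  // paths, so the search never requests a nested prerender from inside
  // a render.
  return Promise.all(
    realms.map((realm) => realm.search(query, { cacheOnlyDefinitions })),
  );
}
```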

pg pool sizing

node-postgres ships with a 10-connection default. Under any
concurrent search workload (whether that's a single render's
loadLinks walk or two simultaneous user-facing queries), each
in-flight search runs primaryQuery plus a stack of per-layer
loadLinks queries; each query holds a connection for the full SQL
round-trip. With a few concurrent renders, peak demand quickly
exceeds 20. Any waiter past max queues in node-postgres' internal
acquire queue — and that wait is indistinguishable from slow SQL in
diagnostic logs.

Indexing-job diagnostics on a 461-card realm made this visible:
slow-search logs showed primaryQuery=70-75s for queries returning
1-3 rows. That wasn't the query itself; it was connection-acquire
wait. After bumping the pool default to 40, the same queries
returned in 5-16s (the actual SQL latency under heavy concurrent
load).

Sizing logic, captured in the new comment in pg-adapter.ts (a config
sketch follows the list):

  • Peak observed connection demand under indexing-pass concurrency:
    ~25 connections
  • Headroom for realm-server non-search work (advisory locks,
    indexer writes, NOTIFY dispatch) keeps a search burst from
    crowding out the indexer's own commits
  • 40 leaves comfortable margin without approaching hosted RDS
    default max_connections (staging db.r7g.large ≈ 1700, prod
    db.r7g.xlarge ≈ 3500). At ~6 realm-server-side processes per
    fleet × 40 = 240 connections, the new default uses 6-12% of
    that budget.
  • The PG_POOL_MAX env var override is still honored for
    per-fleet tuning.
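
As a rough sketch of how that sizing might be expressed in pg-adapter.ts
(the default value and the PG_POOL_MAX variable name come from this PR;
the surrounding config shape is illustrative):

```ts
import { Pool } from 'pg';

// ~25 observed at indexing-pass peak, plus headroom for advisory locks,
// indexer writes, and NOTIFY dispatch; still far below hosted RDS
// max_connections.
const DEFAULT_PG_POOL_MAX = 40;

function poolMax(): number {
  // PG_POOL_MAX still wins for per-fleet tuning.
  let fromEnv = Number(process.env.PG_POOL_MAX);
  return Number.isFinite(fromEnv) && fromEnv > 0
    ? fromEnv
    : DEFAULT_PG_POOL_MAX;
}

const pool = new Pool({
  max: poolMax(),
  // connection settings (host, database, user, ...) come from the usual
  // PG* environment variables
});
```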

The realm-test-harness used to pin pool size to 2 to fit the
seeded test-pg's max_connections=50. Bumping the harness default
to 40 in lockstep keeps tests aligned with production behavior so
that test runs don't surface (or hide) connection-acquire patterns
that production wouldn't see. Test pg stays at max_connections=50
because pool max is a ceiling, not a steady-state allocation —
typical test fixtures (a handful of cards) keep concurrent
connection usage well below the cap.

Test plan

  • Lint passes per-package on every package touched
  • runtime-common typecheck clean
  • realm-server typecheck clean
  • host typecheck clean
  • Local from-scratch reindex of a 100-card realm with the
    default INDEX_RUNNER_MAX_CONCURRENCY=1 (serial) completes
    in the same wall-clock as main (~190s, 0 errors), confirming
    no regression on the existing baseline
  • Slow-search diagnostic logs (info level on
    realm:index-query-engine) confirm primaryQuery time
    bounds dropped from 70s+ to under 10s under concurrent search
    load
  • CORS preflight verified clean — outbound requests from
    prerender tabs to non-realm-server origins (icons, vite)
    don't carry the marker header
  • User-facing search behavior unchanged: searches not made from
    inside a prerender tab don't get cacheOnlyDefinitions: true

🤖 Generated with Claude Code

habdelra and others added 3 commits May 11, 2026 23:33
## Why

During card indexing the worker fires a prerender request to the
prerender server, which navigates a Chrome tab to the host SPA's
render route. While Glimmer renders the card, the host fires
`_federated-search` calls back to the realm server to populate
computed query-fields (linksTo / linksToMany with a generated query).

The realm server's `RealmIndexQueryEngine.loadLinks` walks each linked
resource and calls `populateQueryFields`. If a resource's
`adoptsFrom` definition isn't already in the local module cache,
`lookupDefinition` triggers a *new* prerender so the definition can
be extracted. That nested prerender enters the same tab queue the
original render is occupying.

Under parallel indexing — and even under serial indexing on a card
with a wide query-field fan-out — this can pin every page in the
affinity-scoped pool against a sub-prerender that's waiting for a
page slot that will never come. The render-timeout machinery
eventually fires, but every file the indexer would have rendered
during that window gets recorded as a failed render. Bench runs of
the prudent-octopus realm showed 4 of these timeouts per parallel
pass; on the larger ambitious-piranha realm the indexer was rejected
outright after exhausting the 12-attempt prerender retry budget.

The recovery option `cacheOnlyDefinitions` was added to
`RealmIndexQueryEngine.Options` for exactly this case — its existing
docstring says "avoids deadlocks when file-meta responses are served
during card prerendering" — but no caller was setting it. This patch
plumbs the signal end-to-end.

## How the signal travels

1. `page-pool.ts` injects `globalThis.__boxelDuringPrerender = true`
   into every Chrome tab it creates, via puppeteer's
   `evaluateOnNewDocument`. The host SPA reads it synchronously at
   boot and the value survives across the prerender server's
   `transitionTo`-driven render flow.

2. The host's federated-search fetch wrapper in `services/store.ts`
   reads that flag and adds an `x-boxel-during-prerender` header to
   its outbound search request. Narrowly scoped — only fetches that
   travel to the realm server's search endpoint get the header, so
   adjacent services (icons server, vite dev) never see it on a
   CORS preflight.

3. `handleSearch` reads the header and threads
   `cacheOnlyDefinitions: true` through `searchRealms → Realm.search
   → RealmIndexQueryEngine.searchCards`. The same plumbing is added
   to the single-realm `_search` route in `realm.ts`.

4. `RealmIndexQueryEngine.populateQueryFields` already had branching
   on `cacheOnlyDefinitions`: when a resource carries pre-extracted
   `queryFieldDefs` in its meta (the indexer writes this during
   file extract) it uses `populateQueryFieldsFromMeta`, which
   doesn't need a live `lookupDefinition`. When metadata is missing,
   `lookupDefinitionForOpts` falls through to `lookupCachedDefinition`
   — pure DB read, no prerender side-effect. Either branch keeps
   the original render's tab free (a branch sketch follows).
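
A minimal sketch of that branching, using the function and option names
quoted in this item; the resource shape and the lookup helpers are stubbed
for illustration and are not the real signatures:

```ts
interface IndexedResource {
  meta: {
    adoptsFrom: unknown;
    queryFieldDefs?: unknown;
  };
}

declare function populateQueryFieldsFromMeta(resource: IndexedResource): Promise<void>;
declare function lookupCachedDefinition(ref: unknown): Promise<unknown>; // pure DB read
declare function lookupDefinition(ref: unknown): Promise<unknown>; // may fire a prerender
declare function applyDefinition(resource: IndexedResource, def: unknown): Promise<void>;

async function populateQueryFields(
  resource: IndexedResource,
  opts: { cacheOnlyDefinitions?: boolean } = {},
): Promise<void> {
  if (resource.meta.queryFieldDefs) {
    // Indexer wrote these during file extract: no live lookup needed.
    return populateQueryFieldsFromMeta(resource);
  }
  let definition = opts.cacheOnlyDefinitions
    ? await lookupCachedDefinition(resource.meta.adoptsFrom)
    : await lookupDefinition(resource.meta.adoptsFrom);
  return applyDefinition(resource, definition);
}
```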

## Behavioral notes

- The host's outbound user-facing searches (not from inside a
  prerender tab) don't carry the header and behave exactly as
  before. This change is invisible outside the indexing path.
- `cacheOnlyDefinitions` opting out of `lookupDefinition` only
  matters for cards whose linked-resource `adoptsFrom` isn't yet
  cached. In a steady-state indexed realm those definitions are
  always cached; in a fresh from-scratch reindex the indexer's
  own pre-warm phase populates them before the parallel visits
  begin. The edge case where neither path holds — a brand-new
  realm with cold modules — degrades the query-field walk to
  "use whatever metadata was extracted at indexing time," which
  is what the existing `populateQueryFieldsFromMeta` path was
  designed to do.
- The header constant lives in runtime-common's `realm.ts` so the
  Realm class can read it without taking a realm-server dependency.
  `realm-server/prerender/prerender-constants.ts` re-exports it
  for the host SPA's import locality. Realm server's CORS
  allow-list gains `X-Boxel-During-Prerender` so the preflight
  succeeds for the host's outbound call.
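
For the CORS piece specifically, a sketch assuming a @koa/cors-style
middleware on the realm server (the existing allow-list entries shown here
are placeholders; only the new header name comes from this change):

```ts
import Koa from 'koa';
import cors from '@koa/cors';

const app = new Koa();

app.use(
  cors({
    allowHeaders: [
      'Authorization',
      'Content-Type',
      // new marker header so the host's in-prerender preflight succeeds
      'X-Boxel-During-Prerender',
    ],
  }),
);
```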

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
node-postgres ships with a 10-connection pool default. Under
INDEX_RUNNER_MAX_CONCURRENCY=4 each in-flight file render fans out
into several federated-search calls (primaryQuery + per-layer
loadLinks queries that each hold a connection for the round-trip),
and during the ambitious-piranha benchmark we observed
primaryQuery=73s for queries returning 3 rows — pure node-postgres
acquire-wait, indistinguishable from slow SQL in diagnostics.

40 covers the observed peak (~20-25) plus headroom for the
realm-server's non-search work — advisory locks, indexer writes,
NOTIFY dispatch — so a search burst doesn't crowd out the indexer's
own commits. Stays well under hosted RDS defaults (staging
db.r7g.large ≈ 1700, prod db.r7g.xlarge ≈ 3500 max_connections);
4-6 client processes × 40 = 160-240 connections total, ~6-12% of
that budget. PG_POOL_MAX env override is still honored for
per-fleet tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realm-test-harness was explicitly setting PG_POOL_MAX=2 for the
isolated test stacks it spawns, far below the new production default
of 40. With parallel indexing in tests, that small pool would create
connection-wait masquerading as slow SQL — a flake source that's
hard to diagnose because the symptom looks like the query itself
being slow.

Match the production pg-adapter default (40). The test pg's
max_connections=50 still fits because pool max is a ceiling, not a
steady-state allocation — typical test fixtures (a handful of cards)
keep concurrent connection usage well below the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d12e6b49ce


Comment thread packages/realm-test-harness/src/shared.ts Outdated
@github-actions (Bot) commented May 12, 2026

Preview deployments

Host Test Results

1 files ±0   1 suites ±0   2h 5m 58s ⏱️ (+1m 23s)
2,654 tests ±0   2,639 ✅ ±0   15 💤 ±0   0 ❌ ±0
2,673 runs ±0   2,658 ✅ ±0   15 💤 ±0   0 ❌ ±0

Results for commit b3a2ab7. ± Comparison against earlier commit fdd15f9.

Realm Server Test Results

1 files ±0   1 suites ±0   13m 3s ⏱️ (+44s)
1,321 tests ±0   1,321 ✅ +34   0 💤 ±0   0 ❌ -34
1,400 runs ±0   1,400 ✅ +34   0 💤 ±0   0 ❌ -34

Results for commit b3a2ab7. ± Comparison against earlier commit fdd15f9.

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to unblock parallel realm indexing by preventing indexing-time _federated-search calls (originating from inside prerender tabs) from triggering recursive prerender/module-definition lookups that can deadlock the prerender page pool, and by increasing Postgres connection pool capacity to avoid connection-acquire stalls being misdiagnosed as slow SQL.

Changes:

  • Plumbs an “in prerender tab” signal from the prerender server → host SPA → realm-server search handlers, enabling cacheOnlyDefinitions:true during indexing-time searches.
  • Increases the default node-postgres pool max from the library default to 40 (while preserving env overrides), and aligns the test harness default accordingly.
  • Updates realm-server CORS allow-list to permit the new marker header.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| packages/runtime-common/search-utils.ts | Extends searchRealms() to pass optional cacheOnlyDefinitions to realm searches. |
| packages/runtime-common/realm.ts | Adds DURING_PRERENDER_HEADER and threads cacheOnlyDefinitions into _search handling. |
| packages/host/app/services/store.ts | Adds the marker header to federated search requests when running inside prerender tabs. |
| packages/realm-server/handlers/handle-search.ts | Detects marker header and forwards cacheOnlyDefinitions into multi-realm search flow. |
| packages/realm-server/server.ts | Adds X-Boxel-During-Prerender to CORS allowHeaders. |
| packages/realm-server/prerender/page-pool.ts | Injects a global flag into new pages via evaluateOnNewDocument to let the host SPA detect prerender context. |
| packages/realm-server/prerender/prerender-constants.ts | Re-exports the marker header constant for realm-server-side locality. |
| packages/postgres/pg-adapter.ts | Raises default pg pool max to 40 and always passes max into Pool config. |
| packages/realm-test-harness/src/shared.ts | Updates harness default pool max to 40 to mirror production defaults. |


Comment thread packages/runtime-common/realm.ts Outdated
Comment thread packages/realm-server/prerender/page-pool.ts Outdated
Comment thread packages/realm-server/handlers/handle-search.ts Outdated
Comment thread packages/realm-test-harness/src/shared.ts Outdated
habdelra and others added 2 commits May 12, 2026 00:00
- prettier: rewrap the long globalThis cast in #markPageAsInPrerender
- page-pool.ts: untangle the split block comments — restore the
  #attachPageObservability preamble in one piece and give
  #markPageAsInPrerender its own self-contained comment
- realm.ts: rewrite the DURING_PRERENDER_HEADER comment to describe
  the actual mechanism (puppeteer evaluateOnNewDocument injects a
  window global; host fetch wrapper reads it and adds the header on
  outbound _federated-search / _search calls only). The previous
  comment described a setExtraHTTPHeaders approach that was never
  shipped.
- presence check: unify the two DURING_PRERENDER_HEADER sites on
  `.length > 0` so handle-search.ts (koa ctxt.get returns '') and
  realm.ts (Request.headers.get returns null) have the same grep-able
  shape.
- realm-test-harness: drop DEFAULT_PG_POOL_MAX back to 20 and
  document the math against the seeded test pg's max_connections=50
  ceiling (one stack = 2 pg-pool clients × 20 = 40 peak, leaving
  headroom for the harness's own client). Production stays at the
  pg-adapter default of 40 because hosted RDS has 1700+ connections.
prerender-deadlock-test.ts uses a hand-rolled minimal Page object
that doesn't include every puppeteer method — `evaluateOnNewDocument`
isn't in the surface those tests care about, and calling it produces
`TypeError: page.evaluateOnNewDocument is not a function` from
inside #createStandby, which surfaces as the misleading "No standby
page available for prerender" error that the tests then trip over.

The window-global injection is observability — if it can't be set,
the host fetch wrapper will simply not stamp the during-prerender
header, which under any sane test environment is harmless. Guard the
call and skip it on stubs.
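
A minimal sketch of that guard, assuming the hand-rolled Page stub described
above (names illustrative):

```ts
async function markPageAsInPrerender(page: {
  evaluateOnNewDocument?: (fn: () => void) => Promise<unknown>;
}): Promise<void> {
  if (typeof page.evaluateOnNewDocument !== 'function') {
    // Minimal test stubs don't implement this puppeteer method; without the
    // flag the host simply never stamps the during-prerender header, which
    // is harmless in tests.
    return;
  }
  await page.evaluateOnNewDocument(() => {
    (globalThis as unknown as { __boxelDuringPrerender?: boolean }).__boxelDuringPrerender =
      true;
  });
}
```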

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 12, 2026 08:46
@habdelra habdelra merged commit a0e1eb0 into main May 12, 2026
78 of 79 checks passed