Skip to content

Phase 2: job-scoped same-realm search cache during indexing#4791

Merged
habdelra merged 10 commits into
mainfrom
cs-11115-phase2-job-cache
May 13, 2026
Merged

Phase 2: job-scoped same-realm search cache during indexing#4791
habdelra merged 10 commits into
mainfrom
cs-11115-phase2-job-cache

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 12, 2026

Summary

Wraps searchRealms in _federated-search with a per-process JobScopedSearchCache keyed by (jobId, normalizedQuery, normalizedOpts). Within an indexing batch, boxel_index is snapshot-stable for same-realm reads (the writer touches boxel_index_working until applyBatchUpdates commits) — so once a batch's first _federated-search for a given query resolves, every later same-key call within the same jobId short-circuits to the cached doc.

The cache stores resolved docs, not in-flight promises. Concurrent same-key callers each run their own populate; Phase 1's in-flight dedup at RealmIndexQueryEngine.searchCards (#4784) already coalesces the inner SQL + loadLinks walk for same-realm calls arriving concurrently. Storing promises was tried first and produced a tail-latency stall — a slow first populate blocked every later same-key caller past its render-timeout window. Resolved-only avoids that failure mode and keeps the sequential-dedup win, which is what the cohort-render pattern actually needs.

Plumbing

  • New X_BOXEL_CONSUMING_REALM_HEADER + X_BOXEL_JOB_ID_HEADER constants (with sanitizers) in packages/runtime-common/prerender-headers.ts. Re-exported from realm-server's prerender-constants.ts so existing imports of PRERENDER_JOB_ID_HEADER keep working.
  • Host's render route writes globalThis.__boxelConsumingRealm in model(); prerender's render-runner sets globalThis.__boxelJobId via page.evaluate (folded into the existing auth setup, one CDP round-trip total). The host's federated-search wrapper in services/store.ts reads both globals and attaches the headers alongside the existing x-boxel-during-prerender.
  • CORS allow-list gains X-Boxel-Consuming-Realm + X-Boxel-Job-Id and sets Access-Control-Max-Age: 86400 — without maxAge, @koa/cors falls back to Chrome's ~5 s default and the new headers force a fresh OPTIONS round-trip in front of nearly every QUERY.

Cache gate

Three conditions all hold to consult the cache:

  1. x-boxel-job-id present + well-formed (only the indexer worker stamps this; user / API callers never carry it and always see fresh data).
  2. x-boxel-consuming-realm present + well-formed (host's render route only sets it during prerender).
  3. Request realms array is exactly [consumingRealm] — cross-realm reads bypass because a peer realm can swap its boxel_index mid-batch and a cached value would freeze a stale snapshot.

Cache bounds

  • TTL: 10 min from initial populate. Hits don't refresh the timer — tying TTL to populate time bounds the leak deterministically, where touch-refresh would let a hot entry survive indefinitely past job completion.
  • Hard cap: 5000 total entries with FIFO eviction by populate-order sequence. Bounds worst-case memory under a synthetic-jobId flood (an authenticated reader could otherwise mint arbitrary <digits>.<digits> jobId values).
  • Future NOTIFY-driven clearJob(jobId) eviction (hook exists) will release entries the moment the worker completes a job; for now TTL is the bound.

Bench (local, vite-dev host, cold tabs)

Fresh cold stack after mise run kill-all + TRUNCATE job_reservations, jobs, job_progress + mise run dev-all + drain + lazy-mount.

Realm Variant Wall fed-search hits Status
user/prudent-octopus (~118 files) main 0119bac76d 617 s 120 resolved (91/91 instances)
Phase 2 merged c543f6846f 246 s 354 resolved (91/91 instances)
user/ambitious-piranha (~461 files) main 652 s 614 rejected at cht-*.json cohort, attempt=1/12, 150 s prerender abort
Phase 2 merged 183 s 18 rejected at same cohort, same 150 s abort

Octopus: −60 % wall-clock, no errors, identical output.

Piranha: both variants reject on the same single-card cohort-render timeout (attempt=1/12 of a cht-*.json Cohort card, 150 s prerender abort). Main lasts 652 s because more non-cohort cards process before the indexer reaches the cohort; Phase 2 lasts 183 s because the cache makes non-cohort renders so much faster that the indexer reaches the cohort sooner. The terminal failure is identical and orthogonal to this PR — see CS-11123 (pre-warm modules table) which targets the cohort-render slowness directly. Phase 2 + CS-11123 should stack: pre-warm gets the cohort under 150 s, Phase 2 dedupes the 86 sub-searches it fires.

vite-dev rig is ~3–4× slower than the prod-built host setup that produced the ticket-quoted baselines (190 s octopus / ~585 s piranha). Absolute numbers aren't directly comparable to the ticket — the same-rig deltas across Main → Phase 2 are the load-bearing signal.

Follow-up: horizontal scaling

CS-11119 refactors this cache to DB-backed under the DB-Authoritative Realm Registry project. The current in-process Map works correctly for single-replica deployments (the launch configuration) but degrades to ~1/N hit rate under multi-replica because each replica only sees its own cache. The DB-backed version stores entries in a search_cache table keyed by (jobId, query-digest, opts-digest, consuming_realm) and trivially survives horizontal scaling.

Tests

  • JobScopedSearchCache unit tests: hit/miss by (jobId, query, opts), miss across different jobIds, sequential hit, concurrent each-runs-own-populate (last-write-wins), clearJob isolation, TTL eviction, opts variance, FIFO cap eviction (new in latest commit), realm-agnostic at the cache layer (the gate lives in handle-search).
  • sanitizeConsumingRealmHeader unit tests: accepts plain http(s) URLs, trims whitespace, rejects non-http(s) schemes, empty/null, control characters (log-injection defence), pathologically long values, non-string inputs.

Test plan

  • CI cache + sanitizer unit tests green.
  • CI realm-perf bench job (prod-built host) — confirm the octopus win on the rig the ticket baselines come from.
  • Verify Phase 2 + CS-11123 stack: pre-warm should let piranha complete; this cache should then trim its wall.

Stacked PR — this is Phase 2 of CS-11115. Phase 1 (#4784) is merged.

🤖 Generated with Claude Code

habdelra and others added 3 commits May 12, 2026 12:52
Wrap `searchRealms` in `_federated-search` with a per-process
`JobScopedSearchCache` keyed by (jobId, normalizedQuery, normalizedOpts).
Entries store the *resolved* doc only (not in-flight promises) so a
slow first populate cannot tail-latency-block concurrent peers past
their render-timeout window. Concurrent same-key callers each run
their own populate; Phase 1's in-flight dedup at
`RealmIndexQueryEngine.searchCards` absorbs the duplicate inner
SQL+loadLinks work. This cache only optimises *sequential* repeats
within the same indexing job.

Gated on three conditions all holding:
- `x-boxel-job-id` present (only the indexer worker stamps this; live
  user / API callers never carry it and therefore always see fresh
  data).
- `x-boxel-consuming-realm` present (the host's render route only sets
  it during prerender; new constant + sanitizer live in
  runtime-common's prerender-headers.ts so the host SPA and realm-
  server can both import it).
- The request's `realms` list is exactly `[consumingRealm]` — cross-
  realm reads bypass the cache because a peer realm can swap its
  `boxel_index` mid-batch and a cached value would freeze a stale
  snapshot.

Cache lifetime: 10-minute TTL with `setTimeout(unref)` cleanup. A
`clearJob(jobId)` method is exposed for future NOTIFY-driven eviction
on worker job done/reject; until then, leaked entries persist at most
TTL after the job ends (no correctness impact — `jobId` keying means
no future job hits a stale entry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ob-cache

# Conflicts:
#	packages/host/app/services/store.ts
#	packages/realm-server/handlers/handle-search.ts
#	packages/realm-server/server.ts
Prettier line wrapping + replace assert.ok with logical operators by
binding the boolean to a local first (qunit lint rule
no-assert-logical-expression).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1501a8cfdb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/host/app/services/store.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 12, 2026

Preview deployments

Host Test Results

    1 files  ±    0      1 suites  ±0   4h 9m 24s ⏱️ + 2h 8m 3s
2 658 tests ±    0  2 640 ✅  -     3  15 💤 ± 0  0 ❌ ±0  3 🔥 +3 
5 354 runs  +2 677  5 318 ✅ +2 656  30 💤 +15  3 ❌ +3  3 🔥 +3 

Results for commit c6f1e2c. ± Comparison against earlier commit 82fb488.

For more details on these errors, see this check.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   11m 16s ⏱️ + 2m 2s
1 363 tests ±0  1 363 ✅ +1  0 💤 ±0  0 ❌  - 1 
1 442 runs  ±0  1 442 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit c6f1e2c. ± Comparison against earlier commit 82fb488.

@habdelra habdelra marked this pull request as draft May 12, 2026 23:05
- Introduce `X_BOXEL_JOB_ID_HEADER` in runtime-common's
  prerender-headers.ts as the canonical definition; realm-server's
  prerender-constants.ts re-exports as `PRERENDER_JOB_ID_HEADER` so
  existing imports keep working.
- prerender-app's `/prerender-visit` route forwards the inbound
  `x-boxel-job-id` header into `prerenderer.prerenderVisit({...jobId})`.
- Prerenderer + RenderRunner thread `jobId` through to
  `prerenderVisitAttempt`, which `page.evaluate`s
  `globalThis.__boxelJobId = jobId` before the render begins.
- Host's `_federated-search` fetch wrapper reads the global and
  attaches the header alongside `consumingRealmHeader()` /
  `duringPrerenderHeaders()`.
- CORS allow-list adds `X-Boxel-Job-Id` (preflight blocker on the
  prior commit).

With this in place, `handle-search`'s cache gate
(`jobId && consumingRealm && realms === [consumingRealm]`) can
actually evaluate true during indexing.
@habdelra habdelra changed the base branch from cs-11115-phase1-in-flight-dedup to main May 12, 2026 23:28
- Set `Access-Control-Max-Age: 86400` on the CORS middleware. Without
  this @koa/cors omits the header and Chrome falls back to its ~5 s
  default preflight TTL, which forces a fresh OPTIONS round-trip in
  front of nearly every cross-origin QUERY the host fires during a
  long indexing run. Doubles realm-server arrival count and adds a
  serial RTT in front of each QUERY.

- Fold the new `__boxelJobId` assignment into the existing
  `localStorage.setItem('boxel-session', ...)` `page.evaluate` so we
  pay one CDP round-trip per visit instead of two. Same effect:
  global is set before transitionTo, host's `_federated-search`
  wrapper reads it and stamps `x-boxel-job-id` on outbound calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a job-scoped, same-realm-only cache for _federated-search during indexing/prerender to eliminate sequential duplicate searches within a single indexing job, plus the header plumbing needed to safely gate that cache.

Changes:

  • Introduces JobScopedSearchCache in realm-server and wires it into _federated-search behind a strict “same realm + job id + consuming realm” gate.
  • Adds x-boxel-consuming-realm / x-boxel-job-id header constants + sanitization in runtime-common, and plumbs globals/headers through host prerender renders.
  • Extends CORS allow-headers and adds unit tests for both the cache and header sanitizer.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/runtime-common/tests/consuming-realm-header-test.ts Shared test suite for consuming-realm header sanitizer.
packages/runtime-common/prerender-headers.ts Adds header constants and sanitizeConsumingRealmHeader().
packages/runtime-common/index.ts Re-exports prerender-headers from runtime-common entrypoint.
packages/realm-server/tests/job-scoped-search-cache-test.ts Unit tests for JobScopedSearchCache.
packages/realm-server/tests/index.ts Registers the new realm-server tests.
packages/realm-server/tests/consuming-realm-header-test.ts Runs the shared runtime-common sanitizer tests under realm-server’s test runner.
packages/realm-server/server.ts Adds new headers to CORS allow-list.
packages/realm-server/routes.ts Instantiates a process-wide JobScopedSearchCache and passes it to _federated-search.
packages/realm-server/prerender/render-runner.ts Injects globalThis.__boxelJobId into the page before render so host can stamp the job header.
packages/realm-server/prerender/prerenderer.ts Threads jobId through prerender visit plumbing into the render runner.
packages/realm-server/prerender/prerender-constants.ts Re-exports job-id header constant from runtime-common; keeps sanitizer local.
packages/realm-server/prerender/prerender-app.ts Reads/sanitizes inbound job id header and forwards it into prerenderer visit args.
packages/realm-server/job-scoped-search-cache.ts New cache implementation keyed by (jobId, normalized query, normalized opts) with TTL eviction.
packages/realm-server/handlers/handle-search.ts Adds cache gate and searchCache.getOrPopulate() wrapping searchRealms().
packages/host/app/services/store.ts Stamps consuming-realm + job-id headers onto _federated-search requests when globals are present.
packages/host/app/routes/render.ts Sets globalThis.__boxelConsumingRealm for the card being rendered.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/handlers/handle-search.ts
Comment thread packages/realm-server/routes.ts Outdated
Comment thread packages/realm-server/tests/job-scoped-search-cache-test.ts Outdated
habdelra and others added 2 commits May 12, 2026 20:33
- Hard cap on total entries with FIFO eviction by populate-order
  sequence. Bounds worst-case memory under a synthetic-jobId flood:
  an authenticated reader could otherwise mint arbitrary
  `<digits>.<digits>` jobId values and grow the cache without bound
  for the full TTL window. Default 5000 entries.

- Routes.ts comment said "TTL-evicts entries 10 min after last
  touch" — actually the timer fires 10 min after the initial
  populate; hits don't refresh it. Tying TTL to populate time bounds
  the leak deterministically, so the code stays; only the comment
  changes.

- job-scoped-search-cache-test.ts: dropped the stale "handler-level
  integration test below covers the gate" claim and replaced it with
  a `TODO` pointing at the right home for that coverage
  (`realm-endpoints/search-test.ts`). Added a unit test for the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra marked this pull request as ready for review May 13, 2026 01:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c543f6846f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/prerender/prerenderer.ts
habdelra added 2 commits May 12, 2026 21:35
The first attempt and the post-restart retry call `prerenderVisitAttempt`
with the same args, but the retry was missing `jobId`. Without it,
`render-runner` injects `undefined` into `__boxelJobId` on the
restarted page → the host's federated-search wrapper omits the header
on outbound calls → `handle-search` bypasses the cache for the whole
retried render even though those calls still belong to the same
indexing job.

Render-runner's `page.evaluate` is already a single statement that
overwrites both globals atomically, so this fix only needs to pass the
ID through the missing destructure site.
The cap-eviction test asserted "B was a cache hit (not evicted)"
after re-requesting A on a full cache. That expectation was LRU-shaped
— under the actual strict-FIFO semantics, B is the oldest survivor
after A is evicted by the D insert; re-inserting A pushes B out next.

Rewrite the assertions to match strict FIFO:
- After A is evicted by D and re-inserted (seq 4), the map holds
  C, D, A (B is now evicted).
- D was the youngest survivor before A re-entered, so D should still
  be a hit. Pick D rather than C as the "still hit" assertion so the
  test is stable under future cap-math changes.
- B is the entry the re-insert pushed out, so B should miss now —
  add that assertion to make the eviction observable.

Test now correctly characterises the implementation; impl unchanged.
@habdelra habdelra requested a review from a team May 13, 2026 03:00
@habdelra habdelra merged commit 1b4e47a into main May 13, 2026
100 of 103 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants