Skip to content

Indexer: bounded-parallel pre-warm (concurrency=4)#4871

Closed
habdelra wants to merge 1 commit into
prerender-deadlock-pre-warm-realm-modulesfrom
bounded-parallel-prewarm
Closed

Indexer: bounded-parallel pre-warm (concurrency=4)#4871
habdelra wants to merge 1 commit into
prerender-deadlock-pre-warm-realm-modulesfrom
bounded-parallel-prewarm

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 18, 2026

Stacked on top of #4864. Merge that first (or retarget this PR's base to main after #4864 merges).

⚠️ Bench regression — do not merge. First local bench against ctse/ambitious-piranha showed a wall-time regression vs. #4864's serial baseline; the job was rejected after 12 min with dashboards stuck in renderStage = waiting-stability carrying ~99 in-flight queries. Same fingerprint family as the self-referential prerender deadlock #4863 addresses, but firing at a different point in the lifecycle. Investigation needed before re-trying. See the Bench section below.

Why

preWarmModulesTable's loop is currently strictly serial: each getCachedDefinitions(moduleUrl) round-trips through the prerender server before the next URL fires. On a realm whose pre-warm set is dominated by the realm-wide .gts/.gjs sweep (the layer #4864 introduces), wall-time is roughly |toWarm| × per-module-prerender-cost. For piranha-class realms that's hundreds of modules in series before the visit phase even begins.

The prerender server has spare capacity during pre-warm — the visit phase hasn't started yet so the realm's affinity-tab budget is fully idle. Sweeping with bounded parallelism uses that headroom.

What changes

Replaces the for-loop in preWarmModulesTable with a worker-pool helper at concurrency = 4. The bound is chosen to match the prerender server's default #fileAdmissionCap (affinityTabMax − 1, default 4 with the default PRERENDER_AFFINITY_TAB_MAX=5). The intent was:

  • Keep pre-warm from oversubscribing the per-affinity tab budget. At most #fileAdmissionCap file/render tabs can be in flight in the realm's affinity; pre-warm borrows that same headroom and releases it before the visit phase starts.
  • Bound contention. The existing serial shape already absorbs a prior experiment that regressed under wider parallelism. Capping at 4 was meant to keep the experiment's failure mode out of reach.

No semantic change to the cache contract: same toWarm set, same per-call getCachedDefinitions invocation, same failed += 1 accounting. Only the loop shape differs. The pre-warm complete perfLog line now reports concurrency=N for ops attribution.

Bench

Result: regressed. Do not merge until investigated.

Local cold full-reindex against https://localhost:4201/user/ambitious-piranha/ on this branch (with #4863 + #4864 merged in, CORS fix applied):

approach wall time outcome
serial pre-warm (#4864 baseline) ~665 s (~11 min) 0 errored rows
bounded-parallel pre-warm, concurrency=4 (this PR) rejected at 12+ min dashboards stuck in waiting-stability

The rejected run's slowest rows pulled via boxel_index_working.timing_diagnostics:

url                                                | render_ms | stage              | queries_stuck
Tracking/Cohort/cohort-2023-plus.json              |   131 407 | waiting-stability  |             0
PolicyTrackingDashboard/dashboard-2-3.json         |   111 417 | waiting-stability  |            99
Tracking/Cohort/cht-2023-2023-2.json               |   111 156 |                    |             0

Snapshot of same-affinity activity on the stuck dashboard:

portfolio.gts          kind=module  state=running  priority=0  ageMs=2193
book-of-business.gts   kind=module  state=running  priority=0  ageMs=2190
cohort.gts             kind=module  state=queued   priority=0  ageMs=1949
policy.gts             kind=module  state=queued   priority=0  ageMs=1925
claim.gts              kind=module  state=queued   priority=0  ageMs=1541

Five module renders against the realm's affinity at the time the dashboard timed out — two running, three queued. Same fingerprint family as the original self-referential prerender deadlock #4863 addresses, except these modules are firing during the dashboard's <Search> block instead of from a fresh visit. Pre-warm completed (18 passes=fileExtract visits before the visit phase kicked off), and the modules table did have cached definitions for the heavy .gts files, but something during the dashboard's search path is still triggering module sub-renders in the same affinity.

The clear directive from the plan applies: roll back to serial, investigate before re-trying.

Hypotheses worth checking:

  1. The 4 concurrent pre-warm fileExtract tabs leave the prerender server in a Chrome-state that's not equivalent to what serial pre-warm produces — e.g., LRU eviction patterns, GC pressure, or warm-tab affinity ordering. This would manifest as visit-phase renders being slower-than-serial despite a fully populated modules table.
  2. A search-time lookupDefinition cache-scope mismatch — pre-warm caches modules under cacheScope=realm-auth, authUserId=<realm owner>, but the dashboard's <Search> triggers a search whose lookupDefinition resolves a different cacheScope/authUserId tuple, missing the cache and firing a fresh prerenderModule. Would explain why the modules are running in the same affinity after pre-warm completed.
  3. A race between pre-warm completion and the visit phase: parallel pre-warm's await Promise.all(...) returns after the slowest worker, but the modules table writes may not have propagated to the lookup path used by the visit phase by the time the first card visit fires. Less plausible since the writes are synchronous within getCachedDefinitions, but worth ruling out.

Files

  • packages/runtime-common/index-runner.ts — single function body change inside preWarmModulesTable.

Test plan

  • Investigate the hypothesis above and identify the root cause before merging.
  • If addressable: re-run cold full-reindex on prudent-octopus and ambitious-piranha, three trials each, both branches.
  • If not addressable cleanly: close this PR. The serial baseline in Indexer: pre-warm every realm .gts/.gjs module, not just per-row deps #4864 is already a major win over main.
  • CI green.

🤖 Generated with Claude Code

Replaces the serial `for (let moduleUrl of toWarm)` loop in
`preWarmModulesTable` with a bounded-parallel sweep at
concurrency = 4.

**Bound rationale.** Matches the prerender server's default
`#fileAdmissionCap` (`affinityTabMax − 1`, default 4 with the
default `PRERENDER_AFFINITY_TAB_MAX=5`). Tying parallelism to the
per-affinity file-admission cap keeps pre-warm from oversubscribing
the realm's tab budget — at most `#fileAdmissionCap` file/render
tabs can be in flight in the realm's affinity, and pre-warm borrows
the same headroom strictly *before* the visit phase starts. The
visit phase pays no penalty because the pre-warm has fully drained
by the time the first visit fires.

Higher concurrency re-introduces the prerender-server contention
that motivated the original serial shape (an earlier experimental
parallel pre-warm regressed reindex time on these realms — that
experiment ran with the *wrong* set of modules in `toWarm`, but the
contention story still applied for the modules it did warm).
Lower concurrency leaves prerender tabs idle for the duration of
the sweep.

**No semantic change.** The set of URLs in `toWarm` is unchanged,
the per-call `getCachedDefinitions` invocation is unchanged, and
the failure handling (`failed += 1`, log on any non-zero) is
unchanged. Only the loop shape differs.

The `pre-warm complete` perfLog line now also reports
`concurrency=N` so wall-time deltas between serial and parallel
runs are easy to attribute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

Closing — bench result showed a clear wall-time regression vs. the serial baseline (#4864), with dashboards stuck in waiting-stability while pre-warmed modules were re-rendering inside the same affinity. The serial pre-warm in #4864 is already the production target. If we revisit parallel pre-warm later, the regression analysis in the closed PR description plus the three hypotheses there are the starting point.

@habdelra habdelra closed this May 18, 2026
@habdelra habdelra deleted the bounded-parallel-prewarm branch May 18, 2026 22:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the indexer’s preWarmModulesTable phase to run module-cache warmups with bounded parallelism (worker pool) instead of strictly serial awaits, aiming to reduce wall time when many realm modules need warming.

Changes:

  • Replace the serial for..of await pre-warm loop with a bounded-parallel worker pool (intended concurrency = 4).
  • Add detailed in-code rationale for the chosen concurrency bound relative to prerender PagePool behavior.
  • Extend the pre-warm completion perf log to include a concurrency= field.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +678 to +692
// Bounded-parallel sweep. The bound matches the default
// `#fileAdmissionCap` of the prerender server's PagePool
// (`affinityTabMax − 1`, default 4). Tying it to the
// file-admission cap keeps the parallel pre-warm from
// oversubscribing the per-affinity tab budget — at most
// `#fileAdmissionCap` file/render tabs will be in flight in the
// realm's affinity, and pre-warm modules borrow the same headroom
// before the visit phase starts. The visit phase pays no penalty
// because pre-warm runs strictly before it. Concurrency above 4
// re-introduces the contention that motivated the original serial
// shape; below 4 leaves the prerender server's idle tabs unused
// for the duration of the sweep.
let prewarmConcurrency = 4;
let urls = [...toWarm];
let cursor = 0;
Comment on lines +706 to 717
await Promise.all(
Array.from({ length: Math.min(prewarmConcurrency, urls.length) }, worker),
);
if (failed > 0) {
this.#log.warn(
`${jobIdentity(this.#jobInfo)} ${failed} of ${toWarm.size} module pre-warm lookups failed; the visit phase will retry on-demand if needed`,
);
}

this.#perfLog.debug(
`${jobIdentity(this.#jobInfo)} pre-warm complete in ${Date.now() - preWarmStart} ms (candidates=${toWarm.size} failed=${failed})`,
`${jobIdentity(this.#jobInfo)} pre-warm complete in ${Date.now() - preWarmStart} ms (candidates=${toWarm.size} failed=${failed} concurrency=${prewarmConcurrency})`,
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same — index keys are fine for a read-only list; ${type}:${email} triggers dup-key warnings Copilot earlier asked us to avoid.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +690 to +691
let prewarmConcurrency = 4;
let urls = [...toWarm];
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants