Skip to content

PagePool: non-blocking eviction + fix shared-context bookkeeping (CS-11140)#4823

Merged
habdelra merged 7 commits into
mainfrom
cs-11140-prerender-recover-after-render-timeout
May 14, 2026
Merged

PagePool: non-blocking eviction + fix shared-context bookkeeping (CS-11140)#4823
habdelra merged 7 commits into
mainfrom
cs-11140-prerender-recover-after-render-timeout

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Summary

  • #evictLRUAffinity no longer blocks on page.close() of a stuck render — switches to disposeAffinity({ awaitIdle: false }) so the slot is logically freed for #prepareSlotForStandby immediately. Adds #selectEvictionCandidate that prefers idle affinities over busy ones within LRU order.
  • disposeAffinity is now race-safe against concurrent getPage on the same affinity. Marks oldShared.closing = true synchronously and closes the snapshotted BrowserContext (idempotent) at the end, only deleting the map entry if it's still our snapshot.
  • #assignStandbyToAffinity / #reassignAffinityTab only register the entry's context as the affinity's primary shared context when no active primary exists. Supplementary tabs (typical: same-affinity module sub-call concurrent with a file render) stay entry-owned; their close path is unchanged.

Together these eliminate the Shared-context invariant violated warning and prevent a single stuck render from corrupting the container's bookkeeping for the duration of its host-side timeout.

Closes CS-11140. Complements CS-11139: CS-11139 prevents one slow refill from amplifying into a fleet-wide stall; CS-11140 prevents the slow refill from happening in the first place (and from corrupting the container's state when it does).

The 2026-05-13 incident, mapped to these fixes

A render of …ctse/ambitious-piranha/Tracking/Cohort/cht-2018-2025-2.json went into a Glimmer tracking loop and timed out at 90 s in stage=waiting-stability. The container then:

  • Held the page in page.close()-waiting state, gating the standby-refill loop (fix (a)).
  • Fired Shared-context invariant violated for ctse/personal/ and …/annual-cicada/ log.errors during the in-flight disposeAffinity race (fix (b)).
  • Inflated the affinity's primary pageCount every time a second concurrent tab adopted a standby, preventing orphan-claim recovery (fix (c)).

The container "recovered" 90 minutes later only because the bug was self-limiting (the stuck renders eventually aged out). After this PR, the recovery is bounded by the host-side render timeout itself, not by the operational backpressure that the stuck render caused.

Test plan

  • pnpm exec qunit … tests/page-pool-eviction-recovery-test.ts — 2/2 pass (background BrowserContext.close is non-blocking; concurrent getPage mid-disposeAffinity produces zero Shared-context invariant violated warnings).
  • pnpm exec qunit … tests/page-pool-expansion-test.ts tests/page-pool-priority-test.ts tests/page-pool-eviction-recovery-test.ts tests/prerender-deadlock-test.ts tests/prerender-cancellation-test.ts — 47/47 pass. The 5+ Shared-context invariant violated warnings that previously fired in prerender-deadlock-test.ts no longer fire at all.
  • pnpm lint:js clean on touched files.
  • Soak in staging: zero Shared-context invariant violated warnings in 24h; recover-after-stuck-render windows bounded to ~render-timeout duration, not multi-minute.

🤖 Generated with Claude Code

A single stuck render on a prerender server used to corrupt the
container's state for tens of minutes after the offending render
finally timed out. Three interlocking bugs:

(a) `#evictLRUAffinity` awaited `disposeAffinity(awaitIdle: true)`,
    which `await`ed each entry's `page.close()`. When the LRU
    candidate was a Glimmer-tracking-loop render, that close waited
    on Chrome — gating the standby-refill loop for the full host-
    side render timeout. Switch to `awaitIdle: false` and add
    `#selectEvictionCandidate` that prefers idle affinities over
    busy ones (preserving LRU order within each bucket).

(b) `disposeAffinity` left `oldShared.closing` unset while the
    per-entry close loop awaited. A concurrent caller arriving in
    that window adopted a fresh standby, and
    `#recordSharedContextForFirstPage` fired
    `Shared-context invariant violated` because the affinity's
    map entry still pointed to the in-flight-closing context. Mark
    `oldShared.closing = true` synchronously at the start of
    `disposeAffinity`. Capture the snapshot and close the specific
    `BrowserContext` (idempotent) at the end, only deleting the
    map entry if it's still our snapshot. `disposeAffinity` is now
    race-safe against concurrent `getPage` on the same affinity.

(c) `#assignStandbyToAffinity` and `#reassignAffinityTab` always
    called `#recordSharedContextForFirstPage`, even when the
    affinity already had an active primary shared context. The
    second concurrent caller (typical: file render + same-affinity
    module sub-call) fired the invariant warning AND permanently
    inflated `pageCount` on the existing primary, preventing
    orphan-claim recovery for the affinity. Guard both call sites
    so supplementary tabs stay entry-owned; `#closeEntry`'s
    `entryOwnsContext` path already handles their teardown
    without affecting the primary.

Together these eliminate the `Shared-context invariant violated`
warning that fired throughout the 2026-05-13 incident (verified by
the 5+ warnings in `prerender-deadlock-test.ts` going to 0), and
prevent a single stuck render from holding up the standby-refill
loop and the affinity's bookkeeping for the duration of the host
timeout.

Adds `page-pool-eviction-recovery-test.ts` covering the
non-blocking eviction and the no-warning-under-concurrent-dispose
contracts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1b81e52f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/prerender/page-pool.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates prerender PagePool eviction and shared-context bookkeeping to avoid blocking standby refill on slow page/context closes and to reduce shared-context invariant warnings during concurrent affinity disposal/reuse.

Changes:

  • Makes LRU eviction call disposeAffinity with non-blocking close behavior and adds idle-preferred eviction candidate selection.
  • Adjusts shared-context registration/disposal logic for concurrent affinity reuse and supplementary tabs.
  • Adds a new QUnit regression test file and includes it in the realm-server test index.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
packages/realm-server/prerender/page-pool.ts Updates affinity disposal, eviction candidate selection, and shared-context registration behavior.
packages/realm-server/tests/page-pool-eviction-recovery-test.ts Adds tests for non-blocking disposal and concurrent getPage during disposal.
packages/realm-server/tests/index.ts Registers the new eviction recovery test file.
Comments suppressed due to low confidence (4)

packages/realm-server/prerender/page-pool.ts:1300

  • transitioning entries are not idle; they are in the middle of being acquired for cross-affinity reassignment. Skipping them leaves allIdle true when an affinity contains only transitioning/closing entries, so eviction can dispose the affinity and mark the tab closing underneath the caller that is about to reassign it. Treat transitioning entries as a busy/disqualifying candidate instead of an idle one.
      let allIdle = true;
      for (let entry of entries) {
        if (entry.closing || entry.transitioning) continue;
        if (entry.currentQueue !== undefined) {

packages/realm-server/prerender/page-pool.ts:2119

  • After the new guard allows supplementary tabs to remain unregistered in #sharedContexts, this reassignment path still unconditionally calls #transferSharedContextBookkeeping(oldAffinityKey) for the source affinity. Moving an untracked supplementary tab will decrement (and possibly delete) the source affinity's primary shared-context row even though that tab was never counted there, leaving the still-active primary context untracked.
    let existing = this.#sharedContexts.get(affinityKey);
    if (!existing || existing.closing) {
      this.#recordSharedContextForFirstPage(entry.context, affinityKey);
    }

packages/realm-server/tests/page-pool-eviction-recovery-test.ts:203

  • This test blocks BrowserContext.close, but the pre-fix race window was while disposeAffinity was still awaiting the per-entry close loop before #closeSharedContext marked/deleted the shared row. With the stubbed page.close() completing immediately, the old code would already be inside #closeSharedContext (with the map entry deleted) by this point, so the test can pass without exercising the reported invariant-warning race. Block the entry/page close path instead to make the concurrent getPage arrive while the old shared row is still present and not closing.
      // Block context-close so disposeAffinity's await spans long
      // enough for a concurrent getPage('A') to race in.
      control.blockContextClose = true;

      // Start disposing A (awaitIdle:true so the close path is the
      // one that historically held the race window open).
      let disposePromise = pool.disposeAffinity('A', { awaitIdle: true });

      // Yield to let disposeAffinity reach the await on context.close.
      // After this microtask, #affinityPages.delete has run,
      // #sharedContexts.get('A').closing === true, and the in-flight
      // close is awaiting (gated by our blockContextClose flag).
      await new Promise<void>((r) => setTimeout(r, 5));

packages/realm-server/prerender/page-pool.ts:1099

  • When this check detects that a concurrent caller has already replaced the shared-context row, the affinity is live again, but the method still falls through to #notifyManagerAffinityEvicted and #onAffinityDisposed. That sends a stale DELETE to the prerender manager (and clears batch ownership) for an active affinity, so routing/bookkeeping can be removed immediately after the replacement render starts.
        if (this.#sharedContexts.get(affinityKey) === oldShared) {
          this.#sharedContexts.delete(affinityKey);
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/prerender/page-pool.ts
Comment thread packages/realm-server/tests/page-pool-eviction-recovery-test.ts
Comment thread packages/realm-server/prerender/page-pool.ts Outdated
habdelra and others added 2 commits May 13, 2026 18:42
`#transferSharedContextBookkeeping` and `#releaseSharedContextForClosedPage`
read `#sharedContexts[affinityKey]` at call time, but a concurrent
`getPage` racing a `disposeAffinity` (or a sibling caller registering a
fresh primary) can replace that map entry while we're awaiting
`page.close()` or `#reassignAffinityTab`-ing across affinities. The
pre-fix shape decremented whichever context was currently in the map —
which could drive a fresh, active primary's `pageCount` to 0,
silently turning it into a spurious "orphan" eligible for next-caller
reclaim.

Pass the closing/moving entry's own `BrowserContext` into both helpers
and skip the decrement when the map's current context doesn't match.
That keeps `pageCount` semantics scoped to the specific entry being
torn down or moved, regardless of what's happened to the map entry in
the meantime. The pre-existing snapshot-based cleanup in
`disposeAffinity` already owns teardown of the old primary, so no
work is dropped.

Also includes a test for cross-affinity reuse of a supplementary tab —
the path Codex flagged where a multi-tab affinity's primary could
lose its bookkeeping when one of its supplementary tabs was
commandeered by a new affinity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…test

The idle-over-busy candidate-selection logic in `#evictLRUAffinity`
was protecting busy LRU candidates from eviction on the assumption
that their `page.close()` would block. With CS-11140's
`awaitIdle: false` switch the close is non-blocking regardless of
LRU choice, so the busy-vs-idle preference adds code (and a tricky
test surface) without an operational win. Revert to plain LRU
eviction — simpler and the actual fix (`awaitIdle: false` + the
shared-context bookkeeping fixes) is unaffected.

Strengthens the `disposeAffinity(awaitIdle: false)` test with a
`getWarmAffinities()` assertion that pins down the synchronous-
delete contract directly. A regression that re-added the per-entry
`.finally` deferred-delete shape would still satisfy
`elapsed < 100ms` (with the fast stub), but it'd leave the affinity
in `#affinityPages` past the return — the new assertion catches that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Host Test Results

    1 files  ±    0      1 suites  ±0   3h 19m 9s ⏱️ + 1h 32m 27s
2 658 tests  -     1  2 642 ✅  -     2  15 💤 ± 0  0 ❌ ±0  1 🔥 +1 
5 354 runs  +2 676  5 322 ✅ +2 659  30 💤 +15  1 ❌ +1  1 🔥 +1 

Results for commit d35cd98. ± Comparison against earlier commit a76c682.

For more details on these errors, see this check.

Realm Server Test Results

    1 files  ± 0      1 suites  ±0   9m 10s ⏱️ - 5m 17s
1 375 tests + 9  1 375 ✅ + 9  0 💤 ±0  0 ❌ ±0 
1 456 runs  +11  1 456 ✅ +11  0 💤 ±0  0 ❌ ±0 

Results for commit d35cd98. ± Comparison against earlier commit a76c682.

The previous attempt at slot-freeing did the `#affinityPages.delete`
synchronously and let `#closeEntry` run in the background. That
oversubscribed the pool: `#totalContextCount` dropped immediately
(reflecting routing intent, not memory truth), so the next refill
loop's `#prepareSlotForStandby` saw headroom and created a new
standby while the old contexts were still alive in memory. Pre-fix
tests `does not oversubscribe contexts during async eviction` and
`queued cross-realm requests realign the tab per request` both
caught it.

Revert to the deferred-delete shape: each entry is marked
`closing = true` synchronously (so `#selectEntryForAffinity`'s
filter skips it for routing) and removed from `#affinityPages`
only when its own `#closeEntry` settles. `#totalContextCount`
keeps counting closing entries, so refill won't oversubscribe.

The CS-11140 win is preserved — `#evictLRUAffinity` calls
`disposeAffinity(awaitIdle: false)` which still returns
synchronously after kicking off the closes. The caller doesn't
block on Chrome acknowledging a stuck page; new refill cycles
just wait for the closes to settle naturally before creating a
new standby (which is the right behaviour — oversubscribing on a
wedged page would only make the resource pressure worse).

Test updated to assert the contract that's actually meaningful:
disposeAffinity returns fast while the background context.close is
still blocked, and `#affinityPages` clears once the close
completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Two CI failures on the previous commit (queued cross-realm requests realign the tab per request and does not oversubscribe contexts during async eviction) traced to my eager #affinityPages.delete in the awaitIdle: false branch. The synchronous delete dropped #totalContextCount immediately (reflecting routing intent, not memory truth), so the next refill loop saw headroom and created a fresh standby while the old contexts were still alive — oversubscribing the pool past maxPages + 1.

Fixed in 6055eb3 by reverting to the deferred-delete shape: each entry is marked closing = true synchronously (so #selectEntryForAffinity's filter skips it for routing) and removed from #affinityPages only when its own #closeEntry settles. #totalContextCount keeps counting closing entries, so refill won't oversubscribe.

The CS-11140 win is preserved — #evictLRUAffinity still calls disposeAffinity(awaitIdle: false) and returns synchronously after kicking off the closes. The eviction caller doesn't block on Chrome acknowledging a stuck page; new refill cycles just naturally wait for the closes to settle before creating a new standby (the right behaviour — oversubscribing on a wedged page would only make resource pressure worse).

Test updated to assert the meaningful contract: disposeAffinity returns fast while the background context.close is gated, and #affinityPages clears once the close completes. 48/48 page-pool + prerender tests pass locally.

The previous revert fixed oversubscription but introduced a new
regression: with closing entries kept in `#affinityPages`,
`getWarmAffinities` and routing decisions saw evicted affinities
as still warm — for an unpredictable duration (until Chrome's
async `page.close()` fired the per-entry `.finally`). Earlier
tests' evictions would silently shrink the warm list mid-way
through later tests, breaking
`prerender-server-test.ts > tracks warmed affinities for heartbeat`.

Split the two concerns:
  - `#affinityPages.delete(affinityKey)` happens SYNCHRONOUSLY in
    `disposeAffinity({ awaitIdle: false })`. `getWarmAffinities`,
    `#selectEntryForAffinity`, and `#poolEntryCount` see the
    affinity as gone immediately.
  - In-flight close work is counted via a new
    `#closingPoolEntriesCount` field that participates in
    `#totalContextCount`. `#prepareSlotForStandby` still sees the
    real memory pressure of the still-closing contexts and won't
    oversubscribe past `maxPages + 1`. The count is decremented in
    the per-entry `.finally` once Chrome releases the page.

This restores CS-11140's slot-freed-for-routing semantic without
the routing/memory-truth divergence the revert introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Another CI failure (tracks warmed affinities for heartbeat) exposed the design hole in my previous revert. Splitting the two concerns properly in 63355b5:

  • Routing: disposeAffinity({ awaitIdle: false }) synchronously removes the affinity from #affinityPages. getWarmAffinities, #selectEntryForAffinity, and #poolEntryCount see it as gone immediately.
  • Memory truth: a new #closingPoolEntriesCount field tracks in-flight closes. #totalContextCount includes it, so #prepareSlotForStandby still sees the real pressure of unclosed contexts and won't oversubscribe past maxPages + 1. Decremented in the per-entry .finally once Chrome releases the page.

This is the proper shape: routing semantic is eagerly correct, memory accounting stays truthful. 48/48 page-pool + prerender tests pass locally.

The async eviction path interacts badly with the refill loop's
dedup: with `awaitIdle: false`, the refill loop terminates before
the close completes, leaving concurrent `getPage` callers (queued
on the shared `#ensuringStandbys` promise) without a fresh standby
when they wake up. They fall through to cross-affinity steal on
the only remaining busy tab — and the second caller throws
`'No standby page available for prerender'` because the first set
`transitioning: true` on the candidate.

`prerendering-test.ts > queued cross-realm requests realign the
tab per request` captured this contract: pre-fix, the refill's
synchronous eviction guaranteed that concurrent awaiting callers
saw the post-close state (including a fresh standby). The
non-blocking eviction broke that guarantee.

Revert to `awaitIdle: true`. CS-11139's structural fix already
keeps independent `getPage` callers off the dedup'd
`#ensuringStandbys` promise, so only the refill loop pays for the
stuck-page close — which is the right tradeoff. The CS-11140 wins
that remain:

  * Orphan-claim race fix (`oldShared.closing = true` upfront +
    snapshot-based cleanup in `disposeAffinity`).
  * Supplementary-tab bookkeeping in `#assignStandbyToAffinity`
    and `#reassignAffinityTab` (the second-concurrent-tab adopting
    a fresh standby no longer inflates the primary's `pageCount`).
  * Context-aware `#transferSharedContextBookkeeping` and
    `#releaseSharedContextForClosedPage` (don't decrement a fresh
    primary's `pageCount` when a concurrent caller has overwritten
    the map entry).

Drops the `#closingPoolEntriesCount` tracker — no longer needed
with the synchronous eviction. The test for non-blocking dispose
is kept but simplified to assert what `disposeAffinity(awaitIdle:
false)` still guarantees as a public API (returns fast, background
close eventually completes, affinity gone once close finishes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code] Multiple iterations on the eviction path have surfaced a fundamental tension I was missing. The summary in a76c682:

The pre-fix disposeAffinity(awaitIdle: true) from #evictLRUAffinity wasn't just blocking — it was load-bearing serialization. The refill loop's #ensuringStandbys dedup meant concurrent getPage callers awaiting it would all wake up to the post-close state (including a fresh standby). When I switched to awaitIdle: false, the refill loop terminated before the close completed — leaving those concurrent callers without a fresh standby when they woke up, falling through to cross-affinity steal on the only remaining busy tab, with the second caller throwing 'No standby page available for prerender'. Captured by prerendering-test.ts > queued cross-realm requests realign the tab per request.

Reverted #evictLRUAffinity to awaitIdle: true. CS-11139's structural fix already keeps independent getPage callers off the dedup'd #ensuringStandbys promise, so only the refill loop itself pays for the stuck-page close — which is the right tradeoff. The other CS-11140 wins remain intact:

  • Orphan-claim race fix (oldShared.closing = true upfront + snapshot-based cleanup in disposeAffinity).
  • Supplementary-tab bookkeeping in #assignStandbyToAffinity / #reassignAffinityTab (the warning that fired throughout the 2026-05-13 incident no longer fires).
  • Context-aware #transferSharedContextBookkeeping / #releaseSharedContextForClosedPage (don't corrupt a fresh primary's pageCount when concurrent callers race the map).

48/48 page-pool + prerender tests pass locally.

@habdelra habdelra requested a review from a team May 14, 2026 01:14
…cover-after-render-timeout

# Conflicts:
#	packages/realm-server/tests/index.ts
@habdelra habdelra merged commit 561952f into main May 14, 2026
93 of 95 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants