Skip to content

CS-11182 follow-up: widen cache invalidation to every reindex entry point#4900

Merged
lukemelia merged 2 commits into
mainfrom
cs-11182-broaden-cache-drop-to-worker
May 20, 2026
Merged

CS-11182 follow-up: widen cache invalidation to every reindex entry point#4900
lukemelia merged 2 commits into
mainfrom
cs-11182-broaden-cache-drop-to-worker

Conversation

@lukemelia
Copy link
Copy Markdown
Contributor

Summary

  • The original CS-11182 fix only fired the L2 module_transpile_cache bulk tombstone from Realm.startReindex's post-completion .then, which covers POST <realm>/_full-reindex and <realm>/_reindex but nothing else.
  • Production reindexes routed through /_grafana-reindex, /_grafana-full-reindex, /_post-deployment, the publish-realm flow (Realm.fullIndex), and direct enqueueReindexRealmJob calls all bypass startReindex. On staging this morning a Grafana-button reindex of the base realm completed cleanly but never tombstoned the L2 rows — clients kept being served pre-deploy bytes (the originating symptom for CS-11182: "All Files" still not visible after the deploy + reindex).
  • Move the cross-replica invalidation to the worker side of the from-scratch task: emit notifyAllFileChanges(dbAdapter, realmURL) after IndexRunner.fromScratch returns. Every replica's existing realm_file_changes wildcard listener picks it up and calls realm.clearLocalSourceCaches() (sync L1 wipe + fire-and-forget L2 bulk tombstone). One chokepoint covers every from-scratch trigger uniformly — current and future.

The original Realm.startReindex .then is now strictly belt-and-suspenders for the POST /_full-reindex path, which is fine to leave in place.

Test plan

  • pnpm test-module module-cache-race-test.ts — new test asserts L2 rows are tombstoned after a reindex triggered via realm.realmIndexUpdater.fullIndex (the bypass path that doesn't wire up startReindex's .then).
  • Verified the test fails without the fix (1 live L2 row vs. 0 expected) by stashing the indexer.ts change and re-running just the new test.
  • After deploy to staging: hit the Grafana reindex button on base; confirm cards-grid.gts flips from 13579 bytes (stale) to ~15573 bytes (new "All Files" code) and module_transpile_cache.body IS NULL for the realm's rows.

🤖 Generated with Claude Code

The original CS-11182 fix only fired the L2 `module_transpile_cache`
bulk tombstone from `Realm.startReindex`'s post-completion `.then` —
which covered POST `<realm>/_full-reindex` and `<realm>/_reindex` but
nothing else.

Production reindexes triggered via the operator-action endpoints
(`/_grafana-reindex`, `/_grafana-full-reindex`, `/_post-deployment`),
the publish-realm flow (`Realm.fullIndex`), and direct
`enqueueReindexRealmJob` calls all bypassed `startReindex`. On staging
today, a Grafana-button reindex of the base realm completed without
ever tombstoning the L2 rows, so clients continued to be served
pre-deploy bytes.

Emit `notifyAllFileChanges(dbAdapter, realmURL)` from the worker side
of `fromScratchIndex`, right after `IndexRunner.fromScratch` returns.
The existing `realm_file_changes` wildcard listener on every replica
then calls `realm.clearLocalSourceCaches()` — synchronous L1 wipe plus
the fire-and-forget L2 bulk tombstone. One chokepoint covers every
from-scratch trigger uniformly, including future ones.

The regression test in `module-cache-race-test.ts` drives a reindex
through `realm.realmIndexUpdater.fullIndex` (the bypass path that
never wires up the original `startReindex` callback) and asserts the
L2 rows are tombstoned. Verified to fail without the fix and pass
with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Host Test Results

    1 files  ±0      1 suites  ±0   1h 47m 51s ⏱️ - 1m 9s
2 712 tests ±0  2 697 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 731 runs  ±0  2 716 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 6d491f3. ± Comparison against earlier commit 20dd768.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   8m 4s ⏱️ -25s
1 453 tests ±0  1 453 ✅ +1  0 💤 ±0  0 ❌  - 1 
1 544 runs  ±0  1 544 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit 6d491f3. ± Comparison against earlier commit 20dd768.

@lukemelia lukemelia requested review from a team and habdelra May 20, 2026 10:53
Two CI-only failures on the previous commit:

- `while (true)` in `waitForZeroLiveRows` tripped `no-constant-condition`.
  Rewrote the loop to test the condition explicitly so the lint rule is
  satisfied (and the code reads more directly).
- The pre-existing failure-isolation test
  (`reindex still tombstones L2 rows when clearRealmDefinitions throws`)
  asserted `countLiveRowsForRealm() === 0` immediately after
  `await testRealm.reindex()` returned, but `#dropAllTranspiledModuleCacheEntries`
  fires the L2 bulk UPDATE as `void` (fire-and-forget). Locally the
  UPDATE landed before the next query; on slower CI runners it didn't,
  so the test was flaky. Both tests in that sub-module now go through
  the same poll-with-timeout helper, matching the new worker-side test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia merged commit 9ab7284 into main May 20, 2026
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant