Skip to content

from-scratch coalesce: fall back to in-flight candidates (CS-11157)#4850

Open
habdelra wants to merge 4 commits into
mainfrom
cs-11157-coalesce-in-flight
Open

from-scratch coalesce: fall back to in-flight candidates (CS-11157)#4850
habdelra wants to merge 4 commits into
mainfrom
cs-11157-coalesce-in-flight

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Summary

Defense-in-depth fix from CS-11157. Independent of PR #4849 (the realm-creation fix), but useful any time two from-scratch publishes for the same realm race a worker claim.

chooseFromScratchCoalesceDecision (packages/runtime-common/tasks/indexer.ts) previously consulted only the pending candidates bucket. If a worker claimed the first row between two near-simultaneous publishes for the same realm, the row moved out of candidates into inFlightCandidates before the second publish's coalesce ran — so the pending lookup found nothing and the second publish minted a fresh row at its own priority, even though the in-flight job was going to produce exactly the result the second caller wanted.

chooseIncrementalCoalesceDecision in the same file already has an in-flight fallback (packages/runtime-common/tasks/indexer.ts:220-234). This PR adds the equivalent for from-scratch. The from-scratch case is simpler than incremental: same concurrency group + same jobType is sufficient because a from-scratch reindex subsumes any other from-scratch for that realm by definition — no per-args coverage check needed.

Test plan

  • pnpm lint:types clean

  • pnpm lint:js clean on touched files

  • New regression test in packages/realm-server/tests/queue-test.ts: publish a from-scratch job, wait for the worker to actually claim it (moving it from candidates into inFlightCandidates), then publish a second from-scratch for the same realm. Asserts:

    • second.id === first.id (waiter attaches to the running job)
    • exactly one row in the concurrency group

    Locally: all 30 queue tests pass (TEST_MODULES=queue-test.ts ./tests/scripts/run-qunit-with-test-pg.sh).

  • Full realm-server suite in CI.

Relationship to PR #4849

PR #4849 (stacked on the refactor #4846) prevents the realm-creation flow from ever creating a second from-scratch enqueue in the first place, by mounting through lookupOrMount(..., { fromScratchIndexPriority: userInitiatedPriority }) so the realm's own #startup produces the single canonical job. That change is sufficient for the realm-creation hang.

This PR is the orthogonal hardening: any other code path that enqueues a from-scratch and then races a worker claim against another publish (e.g. handle-publish-realm.ts and the realm.start() enqueue it triggers via lookupOrMount) gets correct coalesce behaviour without each call site needing to thread a priority through.

🤖 Generated with Claude Code

chooseFromScratchCoalesceDecision previously consulted only the
pending bucket (candidates). A worker claim arriving between two
near-simultaneous from-scratch publishes for the same realm moved
the first row into inFlightCandidates, so the second publish's
pending lookup found nothing and minted a fresh row at its own
priority — even though the in-flight job would produce exactly the
result the second caller wanted.

Mirror chooseIncrementalCoalesceDecision's in-flight fallback. The
from-scratch case is simpler than incremental: same concurrency
group + same jobType is sufficient because a from-scratch reindex
subsumes any other from-scratch for that realm by definition; no
per-args coverage check is needed.

Regression test: publish first job, wait for the worker to claim
(moving it to in-flight), then publish a second. Assert second.id
=== first.id and exactly one row exists in the concurrency group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1fb17b7e62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/runtime-common/tasks/indexer.ts Outdated
habdelra and others added 2 commits May 15, 2026 10:49
The in-flight fallback added in the previous commit is unsafe for a
publish that has just nulled boxel_index.last_modified for the realm
(handle-publish-realm, handle-reindex, full-reindex). The running
job's mtimes snapshot pre-dates the clear, so attaching the
clearing publish to it would let the caller observe a successful
job that did NOT re-render the swapped files.

Surface the clearing intent as args.clearLastModified on
from-scratch-index jobs (always present; required field on
FromScratchArgs so the JSON-shape index signature on WorkerArgs is
satisfied). The coalesce decision checks the flag and forces a fresh
row instead of joining an in-flight candidate when set.

Pending coalesce is unchanged: a pending join is still safe because
the pending job hasn't read its mtimes snapshot yet.

Regression test: publish a from-scratch and wait for the worker to
claim it; publish a second with clearLastModified=true; assert the
second got its own row (not attached to the running job).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ex args

The grafana reindex path uses clearLastModified: true so every row in
boxel_index re-renders even when its mtime hasn't changed. The flag
is now surfaced in args (so the from-scratch coalesce can refuse to
attach this publish to a running same-realm from-scratch), which
means the strict args shape check has to include it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Host Test Results

    1 files  ±0      1 suites  ±0   1h 45m 2s ⏱️ + 2m 2s
2 659 tests ±0  2 644 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 678 runs  ±0  2 663 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   9m 36s ⏱️ +22s
1 379 tests ±0  1 379 ✅ +1  0 💤 ±0  0 ❌  - 1 
1 460 runs  ±0  1 460 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.

The full-reindex task enqueues from-scratch with clearLastModified:
true. The flag is now surfaced in args (so the from-scratch coalesce
can refuse to attach this kind of publish to a running same-realm
from-scratch), which means the strict args shape check has to include
it for both the source and published realm jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested review from a team and Copilot May 15, 2026 16:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens queue coalescing for from-scratch-index jobs so a second near-simultaneous publish for the same realm can correctly attach to an already-claimed (in-flight) job, avoiding duplicate rows and priority skew during worker-claim races (CS-11157).

Changes:

  • Add an inFlightCandidates fallback to chooseFromScratchCoalesceDecision, mirroring the existing incremental behavior.
  • Introduce/propagate clearLastModified in from-scratch job args to prevent incorrectly joining a “forced refresh” publish onto an already-running job whose mtime snapshot predates the DB clear.
  • Add queue regression tests to cover both the in-flight dedup path and the clearLastModified exception.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
packages/runtime-common/tasks/indexer.ts Adds in-flight fallback coalescing for from-scratch jobs and introduces clearLastModified args flag + guard.
packages/runtime-common/jobs/reindex-realm.ts Plumbs clearLastModified into published from-scratch job args (and normalizes the clear condition).
packages/realm-server/tests/server-endpoints/maintenance-endpoints-test.ts Updates expected job args to include clearLastModified: true for Grafana-triggered reindex.
packages/realm-server/tests/queue-test.ts Adds regression tests for in-flight from-scratch dedup and for the clearLastModified non-join behavior.
packages/realm-server/tests/full-reindex-test.ts Updates expected job args to include clearLastModified: true for full-reindex enqueues.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants