CS-11125: per-realm advisory lock on data-plane write paths#4840
lukemelia wants to merge 3 commits into
Conversation
Pull request overview
This PR extends the per-realm advisory write lock (added to DBAdapter.withWriteLock in #4839) to the data-plane mutation paths in runtime-common so that concurrent writers on different realm-server replicas can’t race on the same realm URL (preventing lost updates and TOCTOU issues around PATCH and /_atomic).
Changes:
- Wrap `Realm.write`, `writeMany`, `delete`, and `deleteAll` in `dbAdapter.withWriteLock(this.url, …)` and introduce unlocked inner helpers (e.g. `_batchWriteUnlocked`) for callers that already hold the lock.
- Take the lock at the handler boundary for `/_atomic` and `PATCH` so reads + validation + writes are serialized in one critical section.
- Add a new realm-server regression test module and register it in the test index.
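The wrap-plus-unlocked-inner pattern can be sketched with an in-memory stand-in for the advisory lock. All names here (`FakeDBAdapter`, `FakeRealm`, `_writeUnlocked`) are illustrative, not the PR's actual code:

```typescript
// A per-key promise chain stands in for the Postgres advisory lock.
class FakeDBAdapter {
  private chains = new Map<string, Promise<void>>();

  // Serialize all callbacks that share the same key (the realm URL).
  async withWriteLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.chains.get(key) ?? Promise.resolve();
    let release!: () => void;
    this.chains.set(key, new Promise<void>((r) => (release = r)));
    await prev;
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

class FakeRealm {
  log: string[] = [];
  constructor(private db: FakeDBAdapter, readonly url: string) {}

  // The public entry point takes the lock...
  async write(path: string): Promise<void> {
    await this.db.withWriteLock(this.url, () => this._writeUnlocked(path));
  }

  // ...while a handler that already wrapped a wider critical section in the
  // lock calls the unlocked inner helper directly, avoiding re-entry.
  private async _writeUnlocked(path: string): Promise<void> {
    this.log.push(`start ${path}`);
    await new Promise((r) => setTimeout(r, 5)); // simulate async I/O
    this.log.push(`end ${path}`);
  }
}

async function demo(): Promise<string[]> {
  const realm = new FakeRealm(new FakeDBAdapter(), 'http://realm-a/');
  await Promise.all([realm.write('a.json'), realm.write('b.json')]);
  return realm.log; // start/end pairs stay adjacent: writes never interleave
}
```

The same shape applies per realm URL, so writes to different realms still run in parallel.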
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/runtime-common/realm.ts | Threads withWriteLock through public mutation entry points and wraps /_atomic + PATCH critical sections; renames inner write/delete helpers to avoid lock re-entrancy. |
| packages/realm-server/tests/index.ts | Registers the new data-plane lock regression test module. |
| packages/realm-server/tests/data-plane-write-lock-test.ts | Adds regression tests for concurrent PATCH lost-update and /_atomic add TOCTOU serialization. |
Force-pushed 450e521 to 87eb4d8
The advisory lock from CS-10891 wrapped only the realm-lifecycle handlers (create / publish / unpublish / delete realm). The data-plane mutation routes that normal client traffic actually hits (`POST/PATCH/DELETE <realm>/<path>`, `POST <realm>/_atomic`) didn't take it — fine on a single replica because the Node event loop is the implicit serializer, but a real lost-update race the moment a second replica appears behind an LB.

This PR threads the per-realm `withWriteLock` (newly exposed on `DBAdapter` by the parent refactor) through the data-plane mutations in `packages/runtime-common/realm.ts`:

- `write` / `writeMany` / `delete` / `deleteAll` wrap their inner work in the lock. Inner methods are renamed `_batchWriteUnlocked` / `_deleteUnlocked` / `_deleteAllUnlocked` so callers that need the lock around a wider critical section can reuse the inner helper without re-entering the lock (which would block on a different pinned pool connection).
- `handleAtomicOperations` takes the lock at the handler boundary so the `add`/`update` precheck is in the same critical section as the `writeMany` — fixes the `/_atomic` TOCTOU spelled out in the ticket.
- `patchCardInstance` takes the lock around the `indexEntry` read + merge + write, so two concurrent PATCHes from different replicas can't both read the same `original` and clobber each other's merge.

Lock granularity stays per-realm — concurrent writes to different realms remain fully parallel. SQLite (host) is a passthrough (no cross-connection concurrency to coordinate), so this is a no-op there.

## Regression test

`packages/realm-server/tests/data-plane-write-lock-test.ts`:

1. Two concurrent PATCHes against the same card with non-overlapping attribute changes — both must persist. Without the lock the second writer would compute its merge from pre-first state and drop the first writer's field.
2. Two concurrent `/_atomic` `add` operations on the same href — must yield 201 + 409, not 201 + 201. Without the lock both pass the precheck and both write.

Compatible with single-replica deploy: the lock is a no-op without contention. Lands before CS-10899 (atomicity follow-up) and the cache-invalidation tickets that assume writes are already serialized. Stacks on #4839 (`DBAdapter.withWriteLock` refactor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
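The PATCH lost-update race that the regression test targets can be reproduced in miniature: a toy read-merge-write store, with a promise-chain lock standing in for `withWriteLock`. Nothing below is the PR's actual code; it only demonstrates the mechanism:

```typescript
type Attrs = Record<string, string>;

let stored: Attrs = { title: 'old', body: 'old' };

// read + merge + write, with an await between read and write so a second
// writer can sneak in when nothing serializes them (modeling the window
// between the indexEntry read and the write in patchCardInstance).
async function patch(merge: Attrs): Promise<void> {
  const original = { ...stored };                 // read
  await new Promise((r) => setTimeout(r, 1));     // simulate read/validation I/O
  stored = { ...original, ...merge };             // write, from the stale snapshot
}

// Promise chain standing in for withWriteLock(realmUrl, ...).
let chain: Promise<void> = Promise.resolve();
function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const run = chain.then(fn);
  chain = run.then(() => undefined, () => undefined);
  return run;
}

async function demo(): Promise<{ unlocked: Attrs; locked: Attrs }> {
  // Unserialized: both writers read {old, old}; the second clobbers the first.
  stored = { title: 'old', body: 'old' };
  await Promise.all([patch({ title: 'new' }), patch({ body: 'new' })]);
  const unlocked = { ...stored };

  // Serialized: the second writer merges on top of the first's result.
  stored = { title: 'old', body: 'old' };
  await Promise.all([
    withLock(() => patch({ title: 'new' })),
    withLock(() => patch({ body: 'new' })),
  ]);
  const locked = { ...stored };
  return { unlocked, locked };
}
```

With the lock, both non-overlapping attribute changes persist; without it, one writer's field is silently dropped — exactly the behavior the test pins.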
Force-pushed 9de891b to 68680de
Host Test Results: 1 files ±0, 1 suites ±0, 3h 47m 48s ⏱️ +1h 51m 1s. Results for commit 65b0596, compared against earlier commit e8f6a07. For more details on these errors, see this check.

Realm Server Test Results: 1 files ±0, 1 suites ±0, 9m 36s ⏱️ +25s. Results for commit 65b0596, compared against earlier commit e8f6a07.
68680de unintentionally removed `Realm.clearLocalCaches()` — the CS-11043 publish-realm cache invalidation primitive. The advisory-lock work on this branch had no reason to touch it; it's a rebase casualty. Two callers were still on this API:

* packages/realm-server/handlers/handle-publish-realm.ts:575 (production — the synchronous local cache drop that keeps the reindex's prerender from seeing pre-swap bytes).
* packages/realm-server/tests/card-source-endpoints-test.ts:263 (the regression test pinning that production behavior).

Restoring exactly the CS-11043 implementation. The added note points at [CS-11156](https://linear.app/cardstack/issue/CS-11156), the follow-up that replaces the publish handler's local-only call with a cross-replica NOTIFY broadcast — out of scope here.

Addresses Copilot review feedback on PR #4840.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runtime-common/realm.ts: collapse a `buildCardJsonEtag` call back onto one line — prettier nit caught by Lint CI, unrelated to the advisory-lock work but blocking merge.
* tests/data-plane-write-lock-test.ts: add `?waitForIndex=true` to the two concurrent `/_atomic` POSTs. `/_atomic` returns on durability, not on index completion, so the immediate GET that asserts on the winning card's attributes could race the indexer and 404 / read stale. The PATCH test elsewhere in this file uses `writeMany`, which defaults to `waitForIndex: true`, so only `/_atomic` needed the explicit opt-in.

Addresses the Copilot comment at PR #4840 review r3243846106.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
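The durability-vs-index timing that motivates `?waitForIndex=true` can be modeled in a few lines. This is a toy simulation with hypothetical names (`ToyRealm`, `atomicAdd`), not the realm server's API:

```typescript
// The write promise resolves on durability; indexing finishes on a later tick.
class ToyRealm {
  private indexed = new Map<string, string>();
  private indexJobs: Promise<void>[] = [];

  // Resolves immediately (bytes durable); indexing continues in the background.
  async atomicAdd(href: string, doc: string): Promise<void> {
    const job = new Promise<void>((r) =>
      setTimeout(() => {
        this.indexed.set(href, doc);
        r();
      }, 5),
    );
    this.indexJobs.push(job);
  }

  // waitForIndex=true: block until pending index jobs have drained.
  async get(href: string, waitForIndex = false): Promise<string | undefined> {
    if (waitForIndex) await Promise.all(this.indexJobs);
    return this.indexed.get(href);
  }
}

async function demo() {
  const realm = new ToyRealm();
  await realm.atomicAdd('http://realm-a/pet-1', '{"name":"Mango"}');
  const stale = await realm.get('http://realm-a/pet-1');       // races the indexer
  const fresh = await realm.get('http://realm-a/pet-1', true); // waits for index
  return { stale, fresh };
}
```

An immediate read after the POST can miss the write entirely; waiting for index completion makes the follow-up assertion deterministic, which is all the test change needs.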
Addressed both Copilot review comments and the Lint failure:
Follow-up CS-11156 (cross-replica cache invalidation via NOTIFY broadcast) is open as draft PR #4842, stacked on this branch.
Summary
Threads `withWriteLock` (now on `DBAdapter`, via #4839 "Refactor: move per-realm advisory lock into DBAdapter.withWriteLock") through the data-plane mutations in `packages/runtime-common/realm.ts`, so two replicas behind a load balancer can no longer race on the same realm URL. `POST/PATCH/DELETE <realm>/<path>` and `POST <realm>/_atomic` — the routes real client traffic actually hits — did not take it. Single-replica deploys were fine because the Node event loop was the implicit serializer; the moment a second replica appears, you get last-writer-wins on EFS and interleaved `boxel_index` rows.

Stacks on #4839.
Linear: CS-11125.
What's in
- `write` / `writeMany` / `delete` / `deleteAll` wrap their inner work in `dbAdapter.withWriteLock(this.url, ...)`. Inner methods are renamed `_batchWriteUnlocked` / `_deleteUnlocked` / `_deleteAllUnlocked` so handlers that need the lock around a wider critical section reuse the inner helper without re-entering the lock (a second `pg_advisory_xact_lock` on the same key would block on a different pinned pool connection).
- `handleAtomicOperations` takes the lock at the handler boundary so the `add`/`update` precheck is in the same critical section as the `writeMany` — fixes the `/_atomic` TOCTOU spelled out in the ticket.
- `patchCardInstance` takes the lock around the `indexEntry` read + merge + write, so two concurrent PATCHes from different replicas can't both read the same `original` and clobber each other's merge.
- Lock granularity stays per-realm — concurrent writes to different realms remain fully parallel. SQLite (host) is a passthrough, so this is a no-op there.
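`pg_advisory_xact_lock` keys on a signed bigint, not a string, so a per-realm lock implies some URL-to-key mapping. One plausible sketch is a stable 64-bit FNV-1a hash of the realm URL — an assumption for illustration; the PR's actual derivation isn't shown here:

```typescript
// Map a realm URL to a signed 64-bit advisory-lock key (illustrative only).
function realmLockKey(realmUrl: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let hash = FNV_OFFSET;
  for (const byte of new TextEncoder().encode(realmUrl)) {
    hash ^= BigInt(byte);
    hash = (hash * FNV_PRIME) & 0xffffffffffffffffn; // keep it in 64 bits
  }
  // Reinterpret the unsigned 64-bit value in Postgres's signed bigint range.
  return BigInt.asIntN(64, hash);
}
```

The resulting key would then be passed to `SELECT pg_advisory_xact_lock($1)` inside the transaction pinned to one pool connection; the lock releases automatically at commit or rollback, which is why re-entering it from the same logical operation on a *different* pinned connection would self-deadlock.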
Regression test
`packages/realm-server/tests/data-plane-write-lock-test.ts`: two concurrent `/_atomic` `add` ops on the same href — must yield 201 + 409, not 201 + 201. Without the lock both pass the precheck and both write.

Compatibility
One `pg_advisory_xact_lock` round-trip per data-plane mutation.

Test plan
- `realm-advisory-locks-test` (7/7) + `realm-cleanup-transaction-test` (3/3) pass after the rebase on #4839 (Refactor: move per-realm advisory lock into DBAdapter.withWriteLock).
- `pnpm exec tsc -p packages/runtime-common --noEmit` and the realm-server / postgres / host projects — zero new type errors vs. main.
- `pnpm test:wait-for-servers` against a running stack to drive the new `data-plane-write-lock-test` end-to-end. (Requires Synapse + base realm-server up — not run locally, deferred to CI.)
- `patchCardInstance` / `handleAtomicOperations` re-indentation diffs are heavy because everything is wrapped in a closure — please review the structural changes carefully.

🤖 Generated with Claude Code