CS-11125: per-realm advisory lock on data-plane write paths#4840
lukemelia wants to merge 3 commits into
Conversation
Pull request overview
This PR extends the per-realm advisory write lock (added to DBAdapter.withWriteLock in #4839) to the data-plane mutation paths in runtime-common so that concurrent writers on different realm-server replicas can’t race on the same realm URL (preventing lost updates and TOCTOU issues around PATCH and /_atomic).
Changes:
- Wrap `Realm.write`, `writeMany`, `delete`, and `deleteAll` in `dbAdapter.withWriteLock(this.url, …)` and introduce unlocked inner helpers (e.g. `_batchWriteUnlocked`) for callers that already hold the lock.
- Take the lock at the handler boundary for `/_atomic` and `PATCH` so reads + validation + writes are serialized in one critical section.
- Add a new realm-server regression test module and register it in the test index.
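The wrap-plus-unlocked-inner pattern can be sketched with an in-memory stand-in for the advisory lock. All names here (`FakeDBAdapter`, `FakeRealm`, `_writeUnlocked`) are illustrative, not the PR's actual code:

```typescript
// A per-key promise chain stands in for the Postgres advisory lock.
class FakeDBAdapter {
  private chains = new Map<string, Promise<void>>();

  // Serialize all callbacks that share the same key (the realm URL).
  async withWriteLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.chains.get(key) ?? Promise.resolve();
    let release!: () => void;
    this.chains.set(key, new Promise<void>((r) => (release = r)));
    await prev;
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

class FakeRealm {
  log: string[] = [];
  constructor(private db: FakeDBAdapter, readonly url: string) {}

  // The public entry point takes the lock...
  async write(path: string): Promise<void> {
    await this.db.withWriteLock(this.url, () => this._writeUnlocked(path));
  }

  // ...while a handler that already wrapped a wider critical section in the
  // lock calls the unlocked inner helper directly, avoiding re-entry.
  private async _writeUnlocked(path: string): Promise<void> {
    this.log.push(`start ${path}`);
    await new Promise((r) => setTimeout(r, 5)); // simulate async I/O
    this.log.push(`end ${path}`);
  }
}

async function demo(): Promise<string[]> {
  const realm = new FakeRealm(new FakeDBAdapter(), 'http://realm-a/');
  await Promise.all([realm.write('a.json'), realm.write('b.json')]);
  return realm.log; // start/end pairs stay adjacent: writes never interleave
}
```

The same shape applies per realm URL, so writes to different realms still run in parallel.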
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/runtime-common/realm.ts | Threads withWriteLock through public mutation entry points and wraps /_atomic + PATCH critical sections; renames inner write/delete helpers to avoid lock re-entrancy. |
| packages/realm-server/tests/index.ts | Registers the new data-plane lock regression test module. |
| packages/realm-server/tests/data-plane-write-lock-test.ts | Adds regression tests for concurrent PATCH lost-update and /_atomic add TOCTOU serialization. |
Force-pushed 450e521 to 87eb4d8
The advisory lock from CS-10891 wrapped only the realm-lifecycle handlers (create / publish / unpublish / delete realm). The data-plane mutation routes that normal client traffic actually hits (`POST/PATCH/DELETE <realm>/<path>`, `POST <realm>/_atomic`) didn't take it — fine on a single replica because the Node event loop is the implicit serializer, but a real lost-update race the moment a second replica appears behind an LB.

This PR threads the per-realm `withWriteLock` (newly exposed on `DBAdapter` by the parent refactor) through the data-plane mutations in `packages/runtime-common/realm.ts`:

- `write` / `writeMany` / `delete` / `deleteAll` wrap their inner work in the lock. Inner methods are renamed `_batchWriteUnlocked` / `_deleteUnlocked` / `_deleteAllUnlocked` so callers that need the lock around a wider critical section can reuse the inner helper without re-entering the lock (which would block on a different pinned pool connection).
- `handleAtomicOperations` takes the lock at the handler boundary so the `add`/`update` precheck is in the same critical section as the `writeMany` — fixes the `/_atomic` TOCTOU spelled out in the ticket.
- `patchCardInstance` takes the lock around the `indexEntry` read + merge + write, so two concurrent PATCHes from different replicas can't both read the same `original` and clobber each other's merge.

Lock granularity stays per-realm — concurrent writes to different realms remain fully parallel. SQLite (host) is a passthrough (no cross-connection concurrency to coordinate), so this is a no-op there.

## Regression test

`packages/realm-server/tests/data-plane-write-lock-test.ts`:

1. Two concurrent PATCHes against the same card with non-overlapping attribute changes — both must persist. Without the lock the second writer would compute its merge from pre-first state and drop the first writer's field.
2. Two concurrent `/_atomic` `add` operations on the same href — must yield 201 + 409, not 201 + 201. Without the lock both pass the precheck and both write.

Compatible with single-replica deploy: the lock is a no-op without contention. Lands before CS-10899 (atomicity follow-up) and the cache-invalidation tickets that assume writes are already serialized. Stacks on #4839 (`DBAdapter.withWriteLock` refactor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
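The PATCH lost-update race that the regression test targets can be reproduced in miniature: a toy read-merge-write store, with a promise-chain lock standing in for `withWriteLock`. Nothing below is the PR's actual code; it only demonstrates the mechanism:

```typescript
type Attrs = Record<string, string>;

let stored: Attrs = { title: 'old', body: 'old' };

// read + merge + write, with an await between read and write so a second
// writer can sneak in when nothing serializes them (modeling the window
// between the indexEntry read and the write in patchCardInstance).
async function patch(merge: Attrs): Promise<void> {
  const original = { ...stored };                 // read
  await new Promise((r) => setTimeout(r, 1));     // simulate read/validation I/O
  stored = { ...original, ...merge };             // write, from the stale snapshot
}

// Promise chain standing in for withWriteLock(realmUrl, ...).
let chain: Promise<void> = Promise.resolve();
function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const run = chain.then(fn);
  chain = run.then(() => undefined, () => undefined);
  return run;
}

async function demo(): Promise<{ unlocked: Attrs; locked: Attrs }> {
  // Unserialized: both writers read {old, old}; the second clobbers the first.
  stored = { title: 'old', body: 'old' };
  await Promise.all([patch({ title: 'new' }), patch({ body: 'new' })]);
  const unlocked = { ...stored };

  // Serialized: the second writer merges on top of the first's result.
  stored = { title: 'old', body: 'old' };
  await Promise.all([
    withLock(() => patch({ title: 'new' })),
    withLock(() => patch({ body: 'new' })),
  ]);
  const locked = { ...stored };
  return { unlocked, locked };
}
```

With the lock, both non-overlapping attribute changes persist; without it, one writer's field is silently dropped — exactly the behavior the test pins.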
Force-pushed 9de891b to 68680de
Host Test Results: 1 files ±0, 1 suites ±0, 3h 47m 48s ⏱️ +1h 51m 1s. Results for commit 65b0596, compared against earlier commit e8f6a07. For more details on these errors, see this check.

Realm Server Test Results: 1 files ±0, 1 suites ±0, 9m 36s ⏱️ +25s. Results for commit 65b0596, compared against earlier commit e8f6a07.
68680de unintentionally removed `Realm.clearLocalCaches()` — the CS-11043 publish-realm cache invalidation primitive. The advisory-lock work on this branch had no reason to touch it; it's a rebase casualty. Two callers were still on this API:

* packages/realm-server/handlers/handle-publish-realm.ts:575 (production — the synchronous local cache drop that keeps the reindex's prerender from seeing pre-swap bytes).
* packages/realm-server/tests/card-source-endpoints-test.ts:263 (the regression test pinning that production behavior).

Restoring exactly the CS-11043 implementation. The added note points at [CS-11156](https://linear.app/cardstack/issue/CS-11156), the follow-up that replaces the publish handler's local-only call with a cross-replica NOTIFY broadcast — out of scope here.

Addresses Copilot review feedback on PR #4840.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runtime-common/realm.ts: collapse a `buildCardJsonEtag` call back onto one line — prettier nit caught by Lint CI, unrelated to the advisory-lock work but blocking merge.
* tests/data-plane-write-lock-test.ts: add `?waitForIndex=true` to the two concurrent `/_atomic` POSTs. `/_atomic` returns on durability, not on index completion, so the immediate GET that asserts on the winning card's attributes could race the indexer and 404 / read stale. The PATCH test elsewhere in this file uses `writeMany`, which defaults to `waitForIndex: true`, so only `/_atomic` needed the explicit opt-in.

Addresses the Copilot comment at PR #4840 review r3243846106.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
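The durability-vs-index timing that motivates `?waitForIndex=true` can be modeled in a few lines. This is a toy simulation with hypothetical names (`ToyRealm`, `atomicAdd`), not the realm server's API:

```typescript
// The write promise resolves on durability; indexing finishes on a later tick.
class ToyRealm {
  private indexed = new Map<string, string>();
  private indexJobs: Promise<void>[] = [];

  // Resolves immediately (bytes durable); indexing continues in the background.
  async atomicAdd(href: string, doc: string): Promise<void> {
    const job = new Promise<void>((r) =>
      setTimeout(() => {
        this.indexed.set(href, doc);
        r();
      }, 5),
    );
    this.indexJobs.push(job);
  }

  // waitForIndex=true: block until pending index jobs have drained.
  async get(href: string, waitForIndex = false): Promise<string | undefined> {
    if (waitForIndex) await Promise.all(this.indexJobs);
    return this.indexed.get(href);
  }
}

async function demo() {
  const realm = new ToyRealm();
  await realm.atomicAdd('http://realm-a/pet-1', '{"name":"Mango"}');
  const stale = await realm.get('http://realm-a/pet-1');       // races the indexer
  const fresh = await realm.get('http://realm-a/pet-1', true); // waits for index
  return { stale, fresh };
}
```

An immediate read after the POST can miss the write entirely; waiting for index completion makes the follow-up assertion deterministic, which is all the test change needs.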
Addressed both Copilot review comments and the Lint failure:
Follow-up CS-11156 (cross-replica cache invalidation via NOTIFY broadcast) is open as draft PR #4842, stacked on this branch.
Summary
Threads `withWriteLock` (now on `DBAdapter`, via #4839 "Refactor: move per-realm advisory lock into DBAdapter.withWriteLock") through the data-plane mutations in `packages/runtime-common/realm.ts`, so two replicas behind a load balancer can no longer race on the same realm URL. `POST/PATCH/DELETE <realm>/<path>` and `POST <realm>/_atomic` — the routes real client traffic actually hits — did not take it. Single-replica deploys were fine because the Node event loop was the implicit serializer; the moment a second replica appears, you get last-writer-wins on EFS and interleaved `boxel_index` rows.

Stacks on #4839.
Linear: CS-11125.
What's in
- `write` / `writeMany` / `delete` / `deleteAll` wrap their inner work in `dbAdapter.withWriteLock(this.url, ...)`. Inner methods are renamed `_batchWriteUnlocked` / `_deleteUnlocked` / `_deleteAllUnlocked` so handlers that need the lock around a wider critical section reuse the inner helper without re-entering the lock (a second `pg_advisory_xact_lock` on the same key would block on a different pinned pool connection).
- `handleAtomicOperations` takes the lock at the handler boundary so the `add`/`update` precheck is in the same critical section as the `writeMany` — fixes the `/_atomic` TOCTOU spelled out in the ticket.
- `patchCardInstance` takes the lock around the `indexEntry` read + merge + write, so two concurrent PATCHes from different replicas can't both read the same `original` and clobber each other's merge.
- Lock granularity stays per-realm — concurrent writes to different realms remain fully parallel. SQLite (host) is a passthrough, so this is a no-op there.
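`pg_advisory_xact_lock` keys on a signed bigint, not a string, so a per-realm lock implies some URL-to-key mapping. One plausible sketch is a stable 64-bit FNV-1a hash of the realm URL — an assumption for illustration; the PR's actual derivation isn't shown here:

```typescript
// Map a realm URL to a signed 64-bit advisory-lock key (illustrative only).
function realmLockKey(realmUrl: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let hash = FNV_OFFSET;
  for (const byte of new TextEncoder().encode(realmUrl)) {
    hash ^= BigInt(byte);
    hash = (hash * FNV_PRIME) & 0xffffffffffffffffn; // keep it in 64 bits
  }
  // Reinterpret the unsigned 64-bit value in Postgres's signed bigint range.
  return BigInt.asIntN(64, hash);
}
```

The resulting key would then be passed to `SELECT pg_advisory_xact_lock($1)` inside the transaction pinned to one pool connection; the lock releases automatically at commit or rollback, which is why re-entering it from the same logical operation on a *different* pinned connection would self-deadlock.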
Regression test
`packages/realm-server/tests/data-plane-write-lock-test.ts`: two concurrent `/_atomic` `add` ops on the same href — must yield 201 + 409, not 201 + 201. Without the lock both pass the precheck and both write.

Compatibility
One `pg_advisory_xact_lock` round-trip per data-plane mutation.

Test plan
- `realm-advisory-locks-test` (7/7) + `realm-cleanup-transaction-test` (3/3) pass after the rebase on #4839 (Refactor: move per-realm advisory lock into DBAdapter.withWriteLock).
- `pnpm exec tsc -p packages/runtime-common --noEmit` and the realm-server / postgres / host projects — zero new type errors vs. main.
- `pnpm test:wait-for-servers` against a running stack to drive the new `data-plane-write-lock-test` end-to-end. (Requires Synapse + base realm-server up — not run locally, deferred to CI.)
- `patchCardInstance` / `handleAtomicOperations` re-indentation diffs are heavy because everything is wrapped in a closure — please review the structural changes carefully.

🤖 Generated with Claude Code