docs(design): propose Composed-1 cross-group commit-time ownership guard#870
Conversation
Forward-looking design proposal. Today the implementation is
vacuously safe wrt Composed-1 (the TLA+ cross-module invariant
proved in tla/composed) because SplitRange is same-group only.
The doc designs the apply-time ownership re-validation that any
future cross-group route mutation (cross-group SplitRange,
MoveRange, group decommission) would need *before* it can land —
review-and-ready instead of patched in under time pressure.
What the doc covers:
* The Composed-1 invariant from Composed.tla and the canonical
4-step gap counterexample.
* Why the current implementation is vacuously safe.
* The hazard if cross-group route mutation is introduced —
structurally a G1c-style lost-write across a group boundary,
invisible to current Jepsen workloads because the RPC surface
doesn't expose the trigger.
* The proposed plumbing: ObservedRouteVersion field on
OperationGroup → propagated through dispatchTxn into
pb.Request → FSM verifyComposed1 check at apply time →
coordinator-side retry on ErrComposed1Violation.
* Catalog version retention (small versioned-snapshot ring in
distribution/Engine, default depth 32, soft-failure on miss).
* TLA+-to-implementation correspondence — the apply check
refines exactly the Commit precondition the spec asserts.
* Five milestones M1..M5 (plumbing → ring → check → retry →
Jepsen workload), each independently testable.
* Five-lens self-review per CLAUDE.md.
* Three open questions (deprecation of unpinned txns,
routeHistoryDepth default, hard-vs-soft missing-version
behaviour).
This is the "design-doc-first" artefact required by CLAUDE.md
for changes touching routing / OCC / FSM apply semantics. No
implementation lands with this PR.
|
@claude review |
|
Warning Review limit reached
More reviews will be available in 15 minutes and 19 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @bootjp's task in 2m 46s —— View job PR Review: Composed-1 Cross-Group Commit-Time Ownership GuardTasks
Overall AssessmentThis is a well-reasoned forward-looking proposal. The TLA+ alignment is correct, the vacuous-safety argument is sound, and the milestone ordering is clean. Several implementation gaps in the pseudocode would need to be resolved before M3 starts; raising them here so the implementation PR doesn't surface them as surprises. TLA+ Alignment ✅The spec's The proposed The Issues1.
|
There was a problem hiding this comment.
Code Review
This pull request proposes a design for a cross-group commit-time ownership guard (Composed-1) to prevent lost-write anomalies (G1c) when cross-group route mutations are introduced. The review feedback points out that the proposed verifyComposed1 method relies on fields (routes and shardGroupID) that do not currently exist on the kvFSM struct, requiring structural and constructor updates. Additionally, it is suggested to treat a missing version snapshot as a hard, retryable error rather than a soft failure to strictly guarantee safety under aggressive catalog garbage collection.
| func (f *kvFSM) verifyComposed1(r *pb.Request, observedVer uint64) error { | ||
| if observedVer == 0 { | ||
| return nil // unpinned — caller opted out (read-only, legacy) | ||
| } | ||
| snap := f.routes.SnapshotAt(observedVer) | ||
| if snap == nil { | ||
| // We don't retain this version anymore — the catalog GC | ||
| // ran. This is a soft failure: legacy and fits the | ||
| // "treat as unpinned" fallback. Logged for observability. | ||
| return nil | ||
| } | ||
| for _, mut := range r.Mutations { | ||
| if isTxnInternalKey(mut.Key) { | ||
| continue | ||
| } | ||
| owner := snap.OwnerOf(mut.Key) | ||
| if owner != f.shardGroupID { | ||
| return errors.Wrapf(ErrComposed1Violation, | ||
| "key %q owned by group %d at version %d but committing on group %d", | ||
| mut.Key, owner, observedVer, f.shardGroupID) | ||
| } | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The proposed verifyComposed1 method references f.routes and f.shardGroupID, but neither of these fields currently exists on the kvFSM struct in kv/fsm.go.
To implement this:
- Extend
kvFSM: Addroutes(e.g., a history-retaining route engine or snapshot provider) andshardGroupID(the ID of the shard group this FSM belongs to) to thekvFSMstruct. - Update Constructor: Update
NewKvFSMWithHLC(or introduce a new option/constructor) to initialize these fields.
Without these fields, the proposed FSM-level check cannot be compiled or executed.
// Note: This requires adding routes (history provider) and shardGroupID fields to kvFSM.
func (f *kvFSM) verifyComposed1(r *pb.Request, observedVer uint64) error {
if observedVer == 0 {
return nil // unpinned — caller opted out (read-only, legacy)
}
snap := f.routes.SnapshotAt(observedVer)
if snap == nil {
// We don't retain this version anymore — the catalog GC
// ran. This is a soft failure: legacy and fits the
// "treat as unpinned" fallback. Logged for observability.
return nil
}
for _, mut := range r.Mutations {
if isTxnInternalKey(mut.Key) {
continue
}
owner := snap.OwnerOf(mut.Key)
if owner != f.shardGroupID {
return errors.Wrapf(ErrComposed1Violation,
"key %q owned by group %d at version %d but committing on group %d",
mut.Key, owner, observedVer, f.shardGroupID)
}
}
return nil
}| - **Q3.** Should the soft-failure on `SnapshotAt(v) == nil` be a | ||
| hard error instead? Today it is "permit and log" because the | ||
| catalog GC is operator-controlled and unlikely to drop a recent | ||
| version; revisit if this becomes a routine path. |
There was a problem hiding this comment.
To strictly guarantee the Composed-1 safety invariant and prevent G1c anomalies under all conditions (including long-running transactions or aggressive catalog GC), a missing version snapshot should be treated as a hard error (triggering a transaction abort and retry) rather than a soft failure.
Since the coordinator already handles retries on ErrComposed1Violation by re-reading the catalog and re-routing, treating a missing version as a retryable error is safe, robust, and avoids silent safety degradation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7f9f9bf1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| owner := snap.OwnerOf(mut.Key) | ||
| if owner != f.shardGroupID { | ||
| return errors.Wrapf(ErrComposed1Violation, |
There was a problem hiding this comment.
Fence late commits to the old owner
In the §3 sequence where a transaction observes k1 on g1, MoveRange changes the current owner to g2, and the transaction still commits on g1, this check passes because SnapshotAt(observedVer).OwnerOf(k1) returns g1. The write is then accepted on the old group, while new readers route to g2 and can miss it, which is the G1c case the proposal says this guard prevents. The design needs an additional fence against route changes after the observed version, or it must specify a move protocol that drains/forwards in-flight old-owner commits.
Useful? React with 👍 / 👎.
| Versions older than the ring fall through to the soft-failure | ||
| path (§4.2): logged but not rejected. This matches the design | ||
| decision in the partial hotspot-split doc — the catalog is | ||
| authoritative at the apply boundary, and stale-version commits | ||
| are an acceptable retry pattern. |
There was a problem hiding this comment.
Fail closed when the observed route version is gone
Allowing SnapshotAt(observedVer) == nil to fall through means the guard is bypassed exactly for long-running transactions or high catalog churn that evicts the observed version from the proposed ring. With cross-group movement enabled, a transaction that observed an evicted version can commit to a non-owning group without any ErrComposed1Violation, so the stated Composed-1 protection depends on a best-effort retention depth rather than the invariant. This path should be a hard retry/error or the catalog must pin versions while transactions are in flight.
Useful? React with 👍 / 👎.
…Cd versions (codex P1 / P2 + gemini mediums on PR #870) Round-2 revision of the Composed-1 cross-group commit-time ownership guard, addressing four substantive comments on the first revision of the doc: 1. **codex P1 — late-commit fence to old owner is missing.** The first revision proposed only an observed-version check `routes[txnObservedVer[t]][k] = txnCommitGroup[t]`. This refines the spec's literal `Commit` precondition but does not prevent the §3 G1c trace: a txn that observes `routes[v_obs]` commits on `g1` (`routes[v_obs][k1] = g1` — observed check passes!) while `MoveRange` has shifted `k1` to `g2` at `v_obs+1`, leaving readers at `v_obs+1` routed to `g2` and blind to the write. Added a new §4.4 specifying a **current-version cross-version- read fence** that the apply-time FSM also runs alongside the observed-version check. The two together cover both the spec-level Composed-1 invariant AND the implementation-level cross-version-read consistency. Documented the `MoveRange` drain protocol as the rejected alternative. Updated §3 to call out explicitly that an observed-version-only check is insufficient and §5 to clarify that §4.4 is a strictly stronger contract than the literal spec predicate (with a note that extending `Composed.tla` to model `currentVersion` is a tracked follow-up). 2. **gemini medium / codex P2 — fail closed on GCd version.** The first revision proposed soft-passing `SnapshotAt(observedVer) == nil` (catalog GC dropped the version). That bypasses the guard exactly in the cases (long- running txns, high catalog churn) where the cross-version-read hazard is most likely. §4.2 and §4.3 now return a new `ErrComposed1VersionGCd` sentinel on the retention miss; the coordinator's retry path handles it identically to `ErrComposed1Violation` — re-read catalog, re-route, retry. The retention depth becomes a *liveness* knob (too small → more retries), not a *safety* knob. 3. **gemini medium — kvFSM fields don't exist.** The first revision's `verifyComposed1` snippet referenced `f.routes` and `f.shardGroupID`, neither of which exist on today's `kvFSM` struct. Added an explicit prerequisite comment block in §4.2 calling out the struct extension, and merged the engine ring and `kvFSM` extension into a single M2 milestone (they are tightly coupled through the constructor wiring) — was previously M2 (engine ring) and an implicit M3 dependency. 4. **Self-review and open-question maintenance.** §8 lens 1 updated to describe the three-way guard and credit the failing-closed fix. §9 Q2 clarified as a liveness-only knob. §9 Q3 replaced — the soft-failure question is resolved (hard error), and the new open question is "apply-time fence vs. MoveRange drain protocol" pending real operational data from the first cross-group RPC. No code changes. Implementation deliberately not landing with this PR — the doc is the artefact under review.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
All four review comments addressed in 124a36e: codex P1 — late commit to old owner. Single biggest finding — the first revision's observed-version-only check refines the spec's literal gemini medium #2 + codex P2 — fail closed on GCd version. §4.2 / §4.3 now return a new gemini medium #1 — kvFSM fields don't exist. Added explicit prerequisite comment block in §4.2 covering the struct extension (routes kv.RouteHistory + shardGroupID uint64). Merged the engine ring and kvFSM extension into a single M2 milestone since they're tightly coupled through constructor wiring. Self-review / open questions. §8 lens 1 updated to describe the three-way guard. §9 Q2 clarified as liveness-only. §9 Q3 replaced — soft-failure question is resolved (hard error); new Q3 is apply-time fence vs. MoveRange drain protocol, pending real operational data. @claude review |
|
Claude finished @bootjp's task in 3m 39s —— View job PR Review: Composed-1 Cross-Group Guard (r2)Tasks
Overall AssessmentAll four findings called out in the r2 commit message are correctly addressed. The fail-closed approach for GC'd versions is the right call, and the new §4.4 correctly identifies the gap in the observed-version-only check. Two concerns remain: one is new and introduced by §4.4's reliance on the watcher; the other carries over from the r1 review (two-phase transactions). r2 Findings Verified ✅codex P1 — late commit to old owner. §3 now explicitly flags the insufficiency of the observed-version check for the gemini medium #2 + codex P2 — fail closed on GC'd version. gemini medium #1 — kvFSM fields. The prerequisite block in §4.2 explicitly names the two missing fields and calls out Self-review / Q2 / Q3. Lens 1 in §8 correctly describes the three-way guard. Q2 as liveness-only is accurate. Q3 (apply-time fence vs. drain protocol) is now properly framed as an operational decision deferred to when real data is available. Issues1. §4.4 has a watcher-propagation lag window (new — introduced in r2)The The G1c trace §4.4 was designed to prevent can still occur within that 100 ms window: The doc's §4.4 rationale ("the FSM is the single serialisation point that observes the authoritative catalog state via the route-engine watcher") is correct in principle but the via the watcher clause means the FSM sees catalog state with a bounded lag, not instantaneously. The fence closes the gap asymptotically but not completely under the current architecture. The doc already documents the drain protocol as an alternative (§4.4 last paragraph), and that alternative would close this window (MoveRange quiesces g1's in-flight commits before activating g2 as owner). I'm not asking for the design to change — but the doc should acknowledge the lag window explicitly and explain why it is acceptable or how it is bounded. As written, §4.4 reads as if the fence is complete, when it is actually best-effort within the watcher interval. Suggested addition (one sentence in §4.4): "The FSM's view of the current catalog version is bounded by the watcher poll interval ( 2. Two-phase transaction path still unaddressed (carried from r1)The M3 unit test scenarios in §6 all describe single-shard (one-phase) transactions. The doc doesn't specify whether
The TLA+ spec's Minor Points
Checklist per PR test plan
The doc is approvable for a proposed state. The watcher-lag acknowledgment (issue #1) is the one addition that would prevent the §4.4 guarantee from being over-stated when the eventual implementation PR opens. |
…w-up) (#878) ## Summary #870 follow-up: extends the M5 Composed module so the design doc §4.4 apply-time cross-version-read fence is spec-checked by TLC. The first revision of `tla/composed` proved the spec-level Composed-1 predicate (commit goes to the owning group at the txn's observed catalog version) but — as codex P1 surfaced on PR #870 — that predicate alone is **not** sufficient to prevent the G1c trace where `ProposeRouteChange` moves a key after `WriteIntent`: observed-version check passes because `routes[v_obs][k]` still records the old owner, yet readers at `v_obs+1` route to the new owner and miss the write. This PR adds the strictly stronger **Composed-1a** invariant: ``` Composed1a_CommitToCurrentOwner == \A t \in CommittedTxns : \A k \in txnWriteSet[t] : OwnerAt(currentVersionAtCommit[t], k) = txnCommitGroup[t] ``` New state variable `currentVersionAtCommit[t]` records the catalog version the FSM observed at apply time (set on `Commit`). A new CONSTANT `EnableCurrentFence` gates the matching Commit-action guard so the implementation-level fence can be toggled independently of the spec-level `EnableSafety` guard. ## Three configurations now exercise the module | Config | Toggles | Outcome | |---|---|---| | `MCComposed.cfg` | safety: ON, currentfence: ON | PASS | | `MCComposed_gap.cfg` | safety: OFF, currentfence: ON | FAIL on `Composed1_CommitToOwningGroup` (canonical 4-step counterexample) | | `MCComposed_currentfence_gap.cfg` (**new**) | safety: ON, currentfence: OFF | FAIL on `Composed1a_CommitToCurrentOwner` — **the codex P1 trace at depth 5** | The codex P1 trace TLC found: 1. `BeginTxn(t1)` → `txnObservedVer[t1] = 0` 2. `WriteIntent(t1, k1)` 3. `ProposeRouteChange(k1, g2)` → `catalogVersion = 1`, `routes[1][k1] = g2` 4. `Commit(t1, g1)` — observed-version check passes (`routes[0][k1] = g1`); current-fence gate is OFF, so Commit fires 5. Invariant fails: `currentVersionAtCommit[t1] = 1`, `OwnerAt(1, k1) = g2 ≠ g1` ## Other changes - `scripts/tla-check.sh` refactored from a single-gap-per-module to a list-of-gaps-per-module. Per-module entry point `gap_configs_for` returns `<cfg-stem>|<invariant-string>` lines, one per gap; modules with a single guard (HLC, OCC, MVCC, Routes) are unchanged. - `tla/README.md` documents Composed-1a alongside Composed-1 in the module description and the invariant table. ## Verification ``` $ make tla-check HLC safe pass + HLC-4 gap fail OCC safe pass + OCC-1 gap fail MVCC safe pass + MVCC-4 gap fail Routes safe pass + Routes-4 gap fail Composed safe pass + Composed-1 gap fail + Composed-1a currentfence_gap fail (depth 5) tla-check: all model-check outcomes match the design contract. ``` 11 outcomes total (5 safe PASS + 6 gap FAIL), including the new codex P1 trace. ## Self-review (5 lenses) 1. **Data loss** — out of scope (spec-only). 2. **Concurrency** — Composed-1a captures the concurrency seam between OCC commit and Routes route mutation that Composed-1 alone missed; the gap counterexample IS an interleaving. 3. **Performance** — TLC state space grew from 1,684 → 5,170 distinct states on the safe config (still ~2s wall-clock), well under §8.1's <10 min target. 4. **Data consistency** — Composed-1a is strictly stronger than Composed-1. Spec → impl correspondence: §4.4 of the design doc. 5. **Test coverage** — `make tla-check` runs the new gap config automatically; no new Go tests (spec-only PR). ## Test plan - [x] `make tla-check` PASS for all 11 outcomes - [x] codex P1 counterexample at depth 5 reproduced on `MCComposed_currentfence_gap.cfg` - [ ] Reviewer cross-checks `Composed1a` against §4.4 of the design doc - [ ] Reviewer confirms `EnableCurrentFence = TRUE` on the safe config preserves the existing `MCComposed.cfg` outcome (yes — TLC PASS verified locally) ## Resolves The "extending `Composed.tla` to model `currentVersion`" follow-up flagged in §5 of `docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md`. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a stronger apply-time ownership fence safety property that verifies committed writes' ownership against the current catalog version at commit time. * **Testing** * Enhanced the verification framework to support multiple gap-check configurations per module, enabling more comprehensive safety verification. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/878?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Forward-looking design proposal for the Composed-1 cross-group commit-time ownership guard, the implementation that would be required before introducing any cross-group route mutation (cross-group SplitRange, MoveRange, group decommission).
Today the implementation is vacuously safe with respect to Composed-1 (proved in
tla/composed/Composed.tla/MCComposed.cfg) becauseSplitRangeis same-group only and no other route-mutating RPC exists. This doc designs the apply-time check that any cross-group RPC would need to land alongside — review-and-ready instead of patched in under time pressure.What the doc covers
Composed.tlaand the canonical 4-step gap counterexample at depth 4.ObservedRouteVersionfield onOperationGrouppropagated throughdispatchTxnintopb.Request.verifyComposed1at apply time + newErrComposed1Violationsentinel.distribution/Engine(default depth 32; soft-failure on miss).Commitprecondition.Why now, why this PR
CLAUDE.md mandates "design-doc-first" for changes touching routing, OCC, or FSM apply semantics. The implementation does not land with this PR — only the proposal. Landing the doc first lets the eventual implementation PR cite a reviewed contract instead of inventing one inline.
Test plan
Commitpredicate intla/composed/Composed.tla.