Skip to content

tla: M2 — OCC.tla (OCC-1..OCC-5 + 2 action properties) + module-aware tla-check#858

Merged
bootjp merged 4 commits into
mainfrom
feat/tla-m2-occ-spec
May 29, 2026
Merged

tla: M2 — OCC.tla (OCC-1..OCC-5 + 2 action properties) + module-aware tla-check#858
bootjp merged 4 commits into
mainfrom
feat/tla-m2-occ-spec

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 28, 2026

Summary

M2 of the TLA+ safety spec — the OCC layer encoded as a Percolator-style 2PC model.

Follows PR #856 (M1 HLC spec).

What lands

  • tla/occ/OCC.tla — transaction lifecycle Idle → Active → Prepared → Committed/Aborted, lock map (key, lock_ts) → start_ts, LockResolve action that drains abandoned locks.
  • tla/occ/MCOCC.tla — TLC model module with txn + key symmetry.
  • tla/occ/MCOCC.cfg — safe config (EnableSafety = TRUE; expected PASS).
  • tla/occ/MCOCC_gap.cfg — Commit's OCC-1 guard removed (expected FAIL on OCC1_CommitTsAboveStart).
  • scripts/tla-check.sh — module-aware orchestrator. Adding M3/M4/M5 needs only a new entry in TLA_MODULES + a one-line case for the gap-invariant string.
  • Makefile tla-check delegates to the script.
  • .gitignore TLA artefact glob widened from MCHLC_TTrace_* to MC*_TTrace_* (M2 + M3..M5 covered).
  • tla/README.md: M2 status Not startedLanded; OCC module description block + invariant table.

Invariants encoded

# Statement Form
OCC-1 commit_ts(T) > start_ts(T) for every committed txn INVARIANT
OCC-2 Two committed txns sharing a write key have distinct commit_ts and one started after the other committed INVARIANT
OCC-3 Snapshot-read stability under the lock encoding (reads blocked by conflicting uncommitted locks; LockResolve drains them) INVARIANT
OCC-4 No stranded lock at quiescence (bounded-safety form) INVARIANT
OCC-5 start_ts(T) == read_ts(T) for every operation INVARIANT
OCC5_Action start_ts[t] is assigned once at BeginTxn, never updated PROPERTY
CommitTsAssignedOnce commit_ts[t] is assigned once at Commit, never updated PROPERTY

OCC-L1 (liveness) is deferred to M6 per §8.1.

Verification (local)

$ make tla-check
... TLC: tla/hlc/MCHLC.cfg  (correct design, expected PASS)
... 3594 distinct states found, 0 left on queue
... TLC: tla/hlc/MCHLC_gap.cfg  (...expected FAIL on HLC4...)
OK: MCHLC_gap.cfg failed as designed
... TLC: tla/occ/MCOCC.cfg  (correct design, expected PASS)
... 4217 distinct states found, 0 left on queue
... TLC: tla/occ/MCOCC_gap.cfg  (...expected FAIL on OCC1...)
OK: MCOCC_gap.cfg failed as designed (Invariant OCC1_CommitTsAboveStart is violated)
tla-check: all model-check outcomes match the design contract.

Modelling notes

  • HLC abstraction. OCC.tla uses a single global monotonic tsCounter, not the full 48/16 HLC. M5 (composed) will INSTANCE HLC.tla for the real layout. This keeps the M2 state space manageable.
  • Commit borrows ts. commit_ts = tsCounter (does NOT advance). Only Tick advances ts, so the bounded model stays inside 0..MaxTs. OCC-1's commit_ts > start_ts then requires a Tick between BeginTxn and Commit.
  • OCC-2 non-strict serialization. A txn with start_ts = T sees every version with commit_ts ≤ T, so start_ts = commit_ts(earlier) is a valid "later runs in the world after earlier committed" boundary (<=, not <).
  • OCC-3 lock encoding is enforced as an ENABLED guard on ReadKey — TLC simply does not explore reads in a state where a conflicting lock exists, mirroring the spec's "OCC-3 is asserted only in states where no uncommitted lock at lock_ts ≤ read_ts exists".

Self-review (5-lens per CLAUDE.md)

  1. Data loss — n/a, spec-only PR, no persistence paths.
  2. Concurrency / distributed failures — the spec is exactly modelling these. Safe config (4,217 states) explored without invariant violation; gap config produces the documented counterexample.
  3. Performancemake tla-check end-to-end runs in well under 5 seconds on a developer laptop at default bounds.
  4. Data consistency — OCC-1..OCC-5 + the two PROPERTY transitions are the M2 contract from the design doc. The non-strict OCC-2 form is justified in the spec comment.
  5. Test coverage — no Go tests added (spec-only). make tla-check is the new coverage layer.

Test plan

  • tla-check CI workflow runs green (it watches tla/**)
  • tla-spec-ai-review does NOT fire — this PR touches no Go anchor (correct: spec-only changes don't need an AI impl-divergence review)
  • Reviewer cross-checks OCC-1..OCC-5 invariants against §5.2 of the design doc
  • Reviewer runs make tla-check and confirms 4 outcomes (HLC safe pass, HLC gap fail, OCC safe pass, OCC gap fail)

Out of scope

  • OCC-L1 (liveness) — deferred to M6.
  • M3 (MVCC), M4 (Routes), M5 (Composed) — separate future PRs.
  • Integrating OCC with the real HLC.tla — happens in M5.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OCC (Optimistic Concurrency Control) specification module with two-phase commit lifecycle, five core safety invariants, and automated model-checking verification suite
  • Documentation

    • Updated project README with detailed OCC module documentation and marked the OCC milestone as completed
  • Chores

    • Refactored verification build infrastructure with improved scripts and updated trace-file ignore patterns

Review Change Stack

Per §8.1 of docs/design/2026_05_28_partial_tla_safety_spec.md (M2
deliverable): the OCC layer encoded as a Percolator-style 2PC model.

New module:
- tla/occ/OCC.tla — transaction lifecycle Idle → Active → Prepared →
  Committed/Aborted, lock map (key, lock_ts) → start_ts, LockResolve
  action that drains abandoned locks (motivates OCC-4 / OCC-L1).
- tla/occ/MCOCC.tla — TLC model module with txn + key symmetry hints.
- tla/occ/MCOCC.cfg — safe config (EnableSafety = TRUE; expected PASS).
- tla/occ/MCOCC_gap.cfg — Commit's OCC-1 guard removed (expected FAIL
  on OCC1_CommitTsAboveStart with a counterexample motivating the
  spec, per the same gap-evidence pattern used for HLC in M1).

Invariants encoded:
- OCC-1  commit_ts > start_ts for every committed txn.
- OCC-2  no write-write conflict on intersecting write sets (distinct
  commit_ts and one txn started after the other committed).
- OCC-3  snapshot-read stability under the lock encoding (reads are
  blocked by uncommitted locks at lock_ts ≤ start_ts; LockResolve
  drains them).
- OCC-4  no stranded lock at quiescence (bounded-safety form).
- OCC-5  start_ts == read_ts for every operation (structural).
- OCC5_Action / CommitTsAssignedOnce (PROPERTIES) — start_ts and
  commit_ts are each assigned once and never updated.

Tooling refactor:
- scripts/tla-check.sh — module-aware orchestrator.  Adding M3..M5
  requires only an entry in TLA_MODULES + a `case` line for the gap-
  invariant string; Makefile target shrinks to one `bash` invocation.
- Makefile tla-check now delegates to that script.
- .gitignore: TLA artefact glob widened from MCHLC_TTrace_* to
  MC*_TTrace_* so M2/M3/M4/M5 are covered without further edits.
- tla/README.md: M2 status moved from "Not started" to "Landed";
  added the OCC module description block + invariant table mirroring
  the HLC entry.

Verified locally with `make tla-check`:
- HLC safe: 3,594 distinct states, no errors.
- HLC gap: counterexample on HLC4_NoRegressionAcrossTerms at depth 5.
- OCC safe: 4,217 distinct states, no errors.
- OCC gap: counterexample on OCC1_CommitTsAboveStart at depth ≈ 5.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 28 minutes and 2 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf59d7e8-ea5e-4ce4-937f-f238602f7309

📥 Commits

Reviewing files that changed from the base of the PR and between 4b8d1b9 and b602383.

📒 Files selected for processing (5)
  • scripts/tla-check.sh
  • tla/README.md
  • tla/occ/MCOCC.cfg
  • tla/occ/MCOCC_gap.cfg
  • tla/occ/OCC.tla
📝 Walkthrough

Walkthrough

This PR adds an OCC (Optimistic Concurrency Control) specification in TLA+ with five safety invariants modeling 2PC commit flow, generalizes the TLC test infrastructure from hardcoded Makefile rules to a modular Bash script supporting multiple modules, and updates documentation to mark the OCC milestone as complete.

Changes

OCC specification and generalized model-check infrastructure

Layer / File(s) Summary
Generalized TLA+ test infrastructure
.gitignore, Makefile, scripts/tla-check.sh
TLC orchestration delegates to a new Bash harness (scripts/tla-check.sh) that runs safe and gap configs for all modules in TLA_MODULES array, enforces exit-code expectations, and validates gap-mode output for module-specific invariant strings. .gitignore patterns generalize from MCHLC_TTrace_* to MC*_TTrace_* to support all modules.
OCC specification with state, actions, and invariants
tla/occ/OCC.tla
Defines 2PC optimistic concurrency control layer with transaction lifecycle (Idle→Active→Prepared/Committed/Aborted), per-key versioning, lock maps, and snapshot-read conflict detection. Five OCC safety invariants: commit_ts > start_ts (gated by EnableSafety), write-write conflict avoidance, snapshot-read consistency, no stranded locks, and start_ts/read consistency. Actions progress time, begin txns, gate reads on conflicts, buffer writes, prepare/commit with optional strict-greater guard, abort, and resolve orphan locks.
OCC model-check instance and configurations
tla/occ/MCOCC.tla, tla/occ/MCOCC.cfg, tla/occ/MCOCC_gap.cfg
MCOCC.tla extends OCC with TLC symmetry permutations for transaction IDs and keys. MCOCC.cfg (safe mode) verifies all OCC invariants and OCC5_Action property. MCOCC_gap.cfg disables EnableSafety to intentionally violate OCC-1 and validate the harness detects the correct counterexample. Both disable deadlock checking and share constant bounds.
Status and specification documentation
tla/README.md
Updates project status table marking OCC (M2) as "Landed". Adds "What each module proves" section documenting OCC layer model (2PC lifecycle, five invariants, EnableSafety gating), MCOCC model-check setup, and expected gap-mode OCC-1 counterexample.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through time and key,
With stamps and locks and guarantees,
Five invariants, strong and true—
OCC proves what systems do! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main changes: OCC.tla with five invariants and action properties, plus module-aware tla-check tooling. It matches the core deliverables of M2.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tla-m2-occ-spec

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the TLA+ specification for the Optimistic Concurrency Control (OCC) layer (Milestone M2), including the core model OCC.tla, model-checking configurations, and a new orchestrator script scripts/tla-check.sh to run TLC checks. Feedback on these changes highlights several critical issues: the LockResolve action in OCC.tla is currently unreachable (dead code) because locks are cleared atomically during commit and abort; a bash scripting bug in tla-check.sh could cause gap checks to silently succeed if a command substitution fails; the OCC3_ReadSnapshotStability invariant is defined too weakly and does not assert the intended equality consistency; and the TypeOK invariant is missing type checks for versions and lockMap.lockTs.

Comment thread tla/occ/OCC.tla Outdated
Comment on lines +199 to +209
LockResolve(k) ==
/\ HasLock(k)
/\ LET t == lockMap[k].owner IN
/\ txnState[t] \in {"Committed", "Aborted"}
/\ IF txnState[t] = "Committed"
THEN versions' = [versions EXCEPT ![k] = @ \cup
{[commitTs |-> commitTs[t], owner |-> t]}]
ELSE versions' = versions
/\ lockMap' = [lockMap EXCEPT ![k] = NoLock]
/\ UNCHANGED <<tsCounter, txnState, startTs, commitTs, writeSet,
readObs, opCount>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In the current TLA+ specification, the LockResolve action is completely dead (it can never be enabled in any reachable state).

Why this happens:

  1. Commit(t) (lines 178-179) atomically clears all locks owned by t by setting them to NoLock at the exact same step it transitions txnState[t] to "Committed".
  2. Abort(t) (lines 188-189) also atomically clears all locks owned by t when transitioning txnState[t] to "Aborted".
  3. Consequently, there is no reachable state where txnState[t] \in {"Committed", "Aborted"} and lockMap[k].owner = t for any key k.
  4. Therefore, the guard txnState[t] \in {"Committed", "Aborted"} in LockResolve(k) (line 202) can never be satisfied, making the entire action unreachable.

Impact:

This means the model does not actually exercise or verify the asynchronous lock resolution protocol (where locks are left behind and resolved by concurrent readers or a background resolver). Instead, locks are always cleaned up atomically during the commit/abort step. This also makes the OCC4_NoStrandedLockAtQuiescence invariant trivially satisfied by the atomic cleanup rather than by the lock resolution logic.

Recommendation:

To properly model Percolator-style 2PC where locks can be orphaned and resolved asynchronously:

  • Define a "primary key" for each transaction (or designate one key in the write set as the primary).
  • In Commit(t), only transition txnState[t] to "Committed" and write the version/clear the lock for the primary key.
  • Leave the locks on the secondary keys intact.
  • Allow secondary locks to be resolved asynchronously either by a separate commit-secondary action or by the LockResolve action.
  • Alternatively, if this simplified atomic model is intentional for M2 to keep the state space small, this limitation should be explicitly documented in the comments, and LockResolve should either be removed or marked as a placeholder for M5/M6.

Comment thread scripts/tla-check.sh Outdated
mc="$(mc_basename "$module")"
safe_cfg="${mc}.cfg"
gap_cfg="${mc}_gap.cfg"
gap_inv="$(gap_invariant_for "$module")"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Because set -e is not set in this script, a failure inside the command substitution gap_inv="$(gap_invariant_for "$module")" will not cause the script to terminate.

If gap_invariant_for hits the wildcard * case and calls exit 64, only the subshell exits. The main script will continue executing with gap_inv set to an empty string. Later, grep -qF "$gap_inv" (line 99) will run with an empty string, which always matches, causing the gap check to silently succeed.

Recommendation:

Check the exit status of the command substitution to ensure errors are caught immediately.

Suggested change
gap_inv="$(gap_invariant_for "$module")"
if ! gap_inv="$(gap_invariant_for "$module")"; then
overall_rc=1
continue
fi

Comment thread tla/occ/OCC.tla Outdated
Comment on lines +276 to +279
OCC3_ReadSnapshotStability ==
\A t \in TxnIds, k \in Keys :
(txnState[t] \in {"Active", "Prepared", "Committed"} /\ readObs[t][k] # 0) =>
readObs[t][k] <= startTs[t]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The invariant OCC3_ReadSnapshotStability is defined as readObs[t][k] <= startTs[t], which is virtually identical to (but weaker than) OCC5_StartTsConsistency (lines 294-296).

According to the comment on lines 266-275, this invariant is intended to assert that the recorded observation is consistent with the current visible version:

"We assert the simpler equality form: the recorded observation is consistent with the current visible version..."

However, the actual formula written does not assert equality or consistency with LatestVisible. It only asserts that the observed version's timestamp is bounded by startTs[t].

Recommendation:

To actually assert snapshot stability (i.e., that the read value remains consistent with the latest visible version at the transaction's snapshot timestamp), the invariant should be defined using equality.

OCC3_ReadSnapshotStability ==
    \A t \in TxnIds, k \in Keys :
        (txnState[t] \in {"Active", "Prepared", "Committed"} /\ readObs[t][k] # 0) =>
            readObs[t][k] = LatestVisible(k, startTs[t])

Comment thread tla/occ/OCC.tla Outdated
Comment on lines +230 to +238
TypeOK ==
/\ tsCounter \in 0..MaxTs
/\ txnState \in [TxnIds -> States]
/\ startTs \in [TxnIds -> 0..MaxTs]
/\ commitTs \in [TxnIds -> 0..MaxTs]
/\ writeSet \in [TxnIds -> SUBSET Keys]
/\ readObs \in [TxnIds -> [Keys -> 0..MaxTs]]
/\ opCount \in 0..MaxOps
/\ \A k \in Keys : lockMap[k].owner \in TxnIds \cup {"none"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The TypeOK invariant is missing type checks for the versions variable and the lockTs field of lockMap.

To ensure complete type safety and prevent any unexpected state-space issues during model checking, TypeOK should be updated to cover all variables.

TypeOK ==
    /\ tsCounter \in 0..MaxTs
    /\ txnState  \in [TxnIds -> States]
    /\ startTs   \in [TxnIds -> 0..MaxTs]
    /\ commitTs  \in [TxnIds -> 0..MaxTs]
    /\ writeSet  \in [TxnIds -> SUBSET Keys]
    /\ readObs   \in [TxnIds -> [Keys -> 0..MaxTs]]
    /\ opCount   \in 0..MaxOps
    /\ \A k \in Keys :
        /\ lockMap[k].owner \in TxnIds \cup {"none"}
        /\ lockMap[k].lockTs \in 0..MaxTs
    /\ versions \in [Keys -> SUBSET [commitTs: 0..MaxTs, owner: TxnIds]]

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tla/README.md (1)

49-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update run-behavior wording to match module-aware tla-check.

This section still describes a single module pair (“both configurations” for HLC), but make tla-check now runs safe+gap configs per registered module (currently HLC and OCC). Please update wording so operators know to expect multiple module runs.

Suggested doc patch
-This runs TLC against both configurations and prints whether the outcome
-matches the design contract:
+This runs TLC across each registered module, executing both the safe and
+gap configurations per module, and prints whether outcomes match the
+design contract:
 
-1. `tla/hlc/MCHLC.cfg` — the **correct design**, with HLC-4
+1. `tla/hlc/MCHLC.cfg` — the **correct design**, with HLC-4
    preconditions encoded as ASSUMEs / action guards. TLC must finish
    with no invariant violation.
-2. `tla/hlc/MCHLC_gap.cfg` — the **gap configuration**, with the
+2. `tla/hlc/MCHLC_gap.cfg` — the **gap configuration**, with the
    preconditions disabled (`EnableSafety = FALSE`). TLC must produce a
    counterexample on `HLC4_NoRegressionAcrossTerms` — this is the
    motivating evidence that strategy (c) handoff + the ceiling fence
    are necessary. The `tla-check` target inverts the exit code for
    this run so a TLC FAILURE here counts as PASS for CI.
+3. `tla/occ/MCOCC.cfg` / `tla/occ/MCOCC_gap.cfg` follow the same
+   pattern for OCC (`OCC1_CommitTsAboveStart` in gap mode).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tla/README.md` around lines 49 - 60, Update the README text describing the
tla-check behavior to indicate that make tla-check runs the safe and gap
configurations for each registered module (not just the HLC pair), e.g. it will
run MCHLC.cfg and MCHLC_gap.cfg for HLC and the corresponding safe+gap configs
for OCC (and any other registered modules), and that for each module the gap run
is expected to produce a counterexample on HLC4_NoRegressionAcrossTerms (or the
module's equivalent) with tla-check inverting the exit code so a TLC failure on
the gap run counts as a CI pass; replace phrases like “both configurations” and
“the HLC” with wording that operators should expect multiple module runs
(safe+gap per registered module).
🧹 Nitpick comments (1)
tla/occ/OCC.tla (1)

43-43: ⚡ Quick win

Fix contradictory tsCounter semantics in the variable comment.

Line 43 says tsCounter advances on Tick + Commit, but Commit explicitly leaves tsCounter unchanged (and Lines 157-159 document that correctly). Please align the Line 43 comment to avoid misleading spec reasoning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tla/occ/OCC.tla` at line 43, The file comment for tsCounter is misleading:
update the variable comment for tsCounter (referenced as tsCounter, and the
actions Tick and Commit) so it matches the spec semantics documented later—i.e.,
state that tsCounter is an abstract HLC counter that advances on Tick and is
left unchanged by Commit (or simply "advances on Tick only"), ensuring
consistency with the Commit definition and lines 157–159.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tla/README.md`:
- Around line 49-60: Update the README text describing the tla-check behavior to
indicate that make tla-check runs the safe and gap configurations for each
registered module (not just the HLC pair), e.g. it will run MCHLC.cfg and
MCHLC_gap.cfg for HLC and the corresponding safe+gap configs for OCC (and any
other registered modules), and that for each module the gap run is expected to
produce a counterexample on HLC4_NoRegressionAcrossTerms (or the module's
equivalent) with tla-check inverting the exit code so a TLC failure on the gap
run counts as a CI pass; replace phrases like “both configurations” and “the
HLC” with wording that operators should expect multiple module runs (safe+gap
per registered module).

---

Nitpick comments:
In `@tla/occ/OCC.tla`:
- Line 43: The file comment for tsCounter is misleading: update the variable
comment for tsCounter (referenced as tsCounter, and the actions Tick and Commit)
so it matches the spec semantics documented later—i.e., state that tsCounter is
an abstract HLC counter that advances on Tick and is left unchanged by Commit
(or simply "advances on Tick only"), ensuring consistency with the Commit
definition and lines 157–159.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c93b99d4-2903-4597-bd9e-cf234952be18

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6cafe and 4b8d1b9.

📒 Files selected for processing (8)
  • .gitignore
  • Makefile
  • scripts/tla-check.sh
  • tla/README.md
  • tla/occ/MCOCC.cfg
  • tla/occ/MCOCC.tla
  • tla/occ/MCOCC_gap.cfg
  • tla/occ/OCC.tla

…on removal

Four findings from gemini on PR #858 commit 4b8d1b9.

- HIGH (OCC.tla LockResolve dead-action): Commit and Abort atomically
  drain every lock they own, so LockResolve(k) — gated on
  HasLock(k) /\ txnState[owner] in {Committed, Aborted} — is
  unreachable in this model.  Removed LockResolve entirely.  Added a
  block-comment in its place documenting the M2 abstraction: the
  atomic-drain subsumes the resolver's effect on the durable state
  without modelling the asynchronous interleaving.  M5 (composed) is
  the right place to model the real async resolver — cross-module
  interactions with OCC-3 (lock encoding for reads) and Routes
  (where the resolver issues secondary RPCs) belong together.

- HIGH (scripts/tla-check.sh gap_invariant_for swallowed exit):
  Without `set -e`, a non-zero exit from gap_invariant_for inside
  the command substitution only terminated the subshell — the parent
  loop continued with gap_inv="" and the later `grep -qF "$gap_inv"`
  matched the empty pattern, silently passing the gap check.
  Added an explicit `if !` check so a future undeclared module fails
  the harness instead of fake-passing.

- medium (OCC.tla OCC-3 too weak): the previous
  `readObs[t][k] <= startTs[t]` overlapped with OCC-5 and didn't
  actually assert snapshot stability.  Strengthened to the canonical
  form `readObs[t][k] = LatestVisible(k, startTs[t])`.
  Soundness in the safe config now rests on monotonic ts allocation:
  BeginTxn and Commit both advance tsCounter, so every fresh
  startTs > all prior startTs, and every commit_ts > all prior
  startTs.  This means no future Commit can produce a commit_ts <=
  startTs[t] for any t past BeginTxn, so LatestVisible(k, startTs[t])
  cannot change after the read — strict equality is preserved.

- medium (OCC.tla TypeOK incomplete): extended TypeOK to cover
  `lockMap[k].lockTs \in 0..MaxTs` and `versions \in [Keys -> SUBSET
  [commitTs: 0..MaxTs, owner: TxnIds]]`.  Without these, a buggy
  action could smuggle in a malformed lock-ts or version record
  undetected.

Structural changes to support strict OCC-3:

- Removed the `Tick` action; tsCounter now advances only as part of
  BeginTxn and Commit, which makes the monotonic-ts allocation
  faithful to the real HLC behaviour.  This also collapses the safe
  config's state space from 4,217 to 150 distinct states — the
  removed nondeterminism was Tick interleaving, not safety-relevant
  exploration.
- BeginTxn now sets `tsCounter' = tsCounter + 1` and `startTs[t] =
  tsCounter + 1`.
- Commit under EnableSafety advances tsCounter and uses the new
  value as commit_ts (yielding commit_ts > start_ts as a consequence
  of the allocation rather than an explicit guard); under the gap
  toggle, Commit borrows the *current* tsCounter without advancing,
  which is exactly == startTs[t] (no Tick to bump it), violating
  OCC-1 as the spec doc predicts.
- MCOCC.cfg / MCOCC_gap.cfg: MaxTs raised from 4 to 6 to keep the
  bounded model from saturating tsCounter under auto-advance.

Verified `make tla-check`:
- HLC safe (3,594 states, unchanged) + HLC gap (HLC-4 ce at depth 5).
- OCC safe (150 distinct states, no invariant violation).
- OCC gap: OCC1_CommitTsAboveStart counterexample, matches the
  pinned string scripts/tla-check.sh greps for.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 29, 2026

Round 1 fixes in a26f650f. All four gemini findings on 4b8d1b9f:

  1. HIGH (OCC.tla LockResolve dead-action) — Commit and Abort atomically drain locks, so LockResolve(k) (gated on HasLock(k) /\ txnState[owner] ∈ {Committed, Aborted}) was unreachable. Removed LockResolve and added a block comment explaining the M2 abstraction: atomic-drain subsumes the resolver's effect on durable state. M5 (composed) is the right milestone for the real async resolver — cross-module interactions with OCC-3 (lock encoding) and Routes (resolver's secondary RPCs) belong together.

  2. HIGH (scripts/tla-check.sh swallowed exit) — without set -e, a non-zero exit from gap_invariant_for inside the command substitution only terminated the subshell; the parent loop continued with gap_inv="" and the later grep -qF "" always matched, silently passing the gap check. Added an explicit if ! gap_inv="$(...)" check.

  3. medium (OCC.tla OCC-3 too weak) — strengthened to canonical strict equality readObs[t][k] = LatestVisible(k, startTs[t]). To make this hold, restructured BeginTxn and Commit to advance tsCounter (monotonic ts allocation), so no future Commit can produce commit_ts ≤ startTs[t] for any t past BeginTxn. The Tick action was removed (tsCounter now advances only as part of BeginTxn / Commit). Safe config state space collapses from 4,217 to 150 distinct states because the removed nondeterminism was Tick interleaving, not safety-relevant exploration.

  4. medium (OCC.tla TypeOK incomplete) — extended TypeOK to cover lockMap[k].lockTs ∈ 0..MaxTs and versions ∈ [Keys → SUBSET [commitTs: 0..MaxTs, owner: TxnIds]].

Bumped MaxTs from 4 to 6 in both MCOCC.cfg and MCOCC_gap.cfg to keep the bounded model from saturating tsCounter under auto-advance.

Verified make tla-check:

  • HLC safe (3,594 states, unchanged) + HLC gap (HLC-4 ce at depth 5)
  • OCC safe (150 distinct states, no violation)
  • OCC gap: OCC1_CommitTsAboveStart counterexample — matches the pinned string that scripts/tla-check.sh greps for

CI Analyze (go) FAILURE on the prior commit was an infrastructure flake (codeql-action@v4 download failed mid-checkout) — unrelated to spec content.

Spec-only PR, no Go code touched, no caller audit required.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @bootjp's task in 6m 6s —— View job


Code Review — Round 2 (after a26f650f)

  • Gather context and understand request
  • Read tla/occ/OCC.tla
  • Read scripts/tla-check.sh
  • Read tla/occ/MCOCC.tla, MCOCC.cfg, MCOCC_gap.cfg
  • Read Makefile, .gitignore, tla/README.md
  • Analyze and write review

The round 1 fixes (commit a26f650f) correctly addressed all four Gemini findings. The spec's core correctness is sound. The following observations remain for this review pass.


Stale documentation (fixes recommended before merge)

1. OCC.tla:43 — tsCounter comment still mentions removed Tick action

tsCounter,        \* Nat. Abstract HLC counter; advances on Tick + Commit.

Tick was removed in round 1; the counter now advances in BeginTxn and Commit. Should read:

tsCounter,        \* Nat. Abstract HLC counter; advances on BeginTxn + Commit.

Fix this →

2. OCC.tla:6-9 — module header still references the removed LockResolve action

(* Models the transaction lifecycle Idle -> Active -> Prepared ->          *)
(* Committed/Aborted, the lock-map (key, lock_ts) -> start_ts, and the     *)
(* LockResolve action that turns abandoned locks into versions or         *)
(* clears them.  ...

LockResolve was removed in round 1. The phrase "and the LockResolve action that turns abandoned locks into versions or clears them" should be dropped from the header. The per-action block comment at line 199 already explains the removal correctly.

3. OCC.tla:309 — OCC-4 comment body is stale

\* OCC-4 — no stranded lock at quiescence ... LockResolve drains them.

Last sentence is stale (LockResolve removed). The actual mechanism is now: Commit/Abort atomically drain locks, so at quiescence all locks are already gone.

4. tla/README.md:127-128 — OCC module description references LockResolve

Lines 127-128:

the LockResolve action that turns abandoned locks into versions or clears them

Same stale reference. The block comment in OCC.tla at line 199-221 has the correct explanation; the README should be updated to match.

5. tla/README.md:140 — OCC-3 invariant table describes the old weak form

Current table entry:

| `OCC3_ReadSnapshotStability` | Every read observation's `commit_ts <= start_ts` of the reader (lock-encoded reads only) |

After round 1's fix, the invariant is strict equality: readObs[t][k] = LatestVisible(k, startTs[t]). The description should say "equals LatestVisible(k, startTs[t])" rather than "commit_ts ≤ start_ts".

6. tla/README.md:49-60 — "Run" section still describes only HLC configs

This runs TLC against both configurations…

  1. tla/hlc/MCHLC.cfg — …
  2. tla/hlc/MCHLC_gap.cfg — …

Since tla-check.sh now runs safe+gap for all modules, the section should enumerate HLC and OCC configs (and note that M3/M4/M5 will add more). Also the manual-run snippet below (lines 64-75) only shows the HLC module and should note that the pattern extends to OCC/other modules.


Spec correctness (no issues found)

  • OCC-1 under EnableSafety: BeginTxn allocates startTs[t] = tsCounter + 1 (post-increment). Commit under EnableSafety allocates commitTs[t] = tsCounter + 1 (another post-increment). So commitTs[t] > startTs[t] holds by construction — any interleaving between BeginTxn and Commit only raises tsCounter further. ✓

  • OCC-1 gap counterexample: With EnableSafety = FALSE, Commit borrows tsCounter without advancing. If no other BeginTxn fires between BeginTxn(t) and Commit(t), then commitTs[t] = tsCounter = startTs[t], violating OCC-1. TLC finds this short trace. ✓

  • OCC-3 stability soundness: The comment on lines 292-301 correctly argues that LatestVisible(k, startTs[t]) is stable after allocation because all future Commit invocations produce commitTs = tsCounter + 1 > tsCounter >= startTs[t]. The precondition readObs[t][k] # 0 conflating "not yet read" with "read but saw nothing" is acceptable for the same reason (no future version can appear with commitTs ≤ startTs[t]). ✓

  • OCC-2 enforcement: Prepare requires ~HasLock(k) for every key in the write set. Two transactions cannot both hold a lock on the same key simultaneously, so concurrent overlapping writes are serialized at Prepare time. OCC-2 invariant correctly reflects this. ✓

  • OCC-4 trivially satisfied: With atomic lock drain at Commit/Abort, quiescence (all txns terminal) always means no locks remain. The invariant is not vacuous in the spec sense — it is correctly checked — but is guaranteed by construction rather than by a resolver action. The block comment at line 199 explains this well. ✓

  • TypeOK completeness: Now covers lockMap[k].lockTs ∈ 0..MaxTs and versions ∈ [Keys → SUBSET [commitTs: 0..MaxTs, owner: TxnIds]]. Complete. ✓

  • tla-check.sh exit handling: The if ! gap_inv="$(gap_invariant_for "$module")" guard (line 73) correctly catches subshell exits. The safe-config path uses if ! run_tlc ...; the gap-config path captures both output and exit code separately. Both patterns are correct. ✓


Minor observation — symmetry group completeness

MCOCC.tla:

TxnSymmetry == Permutations(TxnIds)
KeySymmetry == Permutations(Keys)
Symmetry    == TxnSymmetry \cup KeySymmetry

Permutations({t1, t2}) \cup Permutations({k1, k2}) (after TLC's universe-extension) gives {id, txn-swap, key-swap}missing the combined txn-swap ∘ key-swap element. The full product group of independent txn- and key-symmetries has 4 elements; the union only generates 3. TLC applies each permutation independently without closing under composition, so the combined equivalence class is never folded. This is safe (no unsound reduction) but means TLC explores a small number of redundant state pairs that differ only by simultaneous txn+key swap. For a 150-state model this is negligible; worth noting for M3/M4 if the state space grows.

The standard fix for TLC is to declare model values as a named SYMMETRY type directly in the .cfg file or to explicitly form the product set in TLA+. Neither is strictly required for M2.


Summary

Round 1 correctly addressed all Gemini HIGH and MEDIUM findings. The spec logic, gap-config counterexample evidence, and harness exit-code handling are all correct. The remaining items above (1-6) are documentation drift from the LockResolve removal and OCC-3 strengthening — I'd recommend fixing the stale comments and README entries before merge since they'll be the primary reference for M3/M4 contributors.

Six stale-documentation findings from claude bot on commit a26f650.
All are doc-only drift from round 1's structural changes (Tick removal,
LockResolve removal, OCC-3 strengthened to strict equality).

OCC.tla:
- Module header (lines 6-9) no longer describes the LockResolver
  action — replaced with a forward reference to the M2 atomic-drain
  abstraction and the M5 deferral note already living next to where
  LockResolve used to be.
- tsCounter variable comment now says "advances on BeginTxn + Commit"
  instead of "Tick + Commit".
- OCC4_NoStrandedLockAtQuiescence comment no longer says "LockResolve
  drains them"; instead points at the atomic-drain abstraction and
  the M5 deferral.

tla/README.md:
- "Run" section generalised: enumerates safe + gap per module rather
  than naming MCHLC alone; manual-run snippet now notes the same
  pattern applies to OCC and future M3/M4/M5 modules.
- OCC module description (under "What each module proves") no longer
  mentions LockResolve; explicit forward reference to the OCC.tla
  block comment + M5 deferral.
- OCC3_ReadSnapshotStability table entry updated to the strict
  equality form readObs[t][k] = LatestVisible(k, start_ts[t]);
  the previous commit_ts <= start_ts description matched the
  pre-round-1 weak form that round 1 strengthened.

Verified make tla-check end-to-end still passes:
- HLC safe (3,594 states) + HLC gap (HLC-4 ce).
- OCC safe (150 states) + OCC gap (OCC-1 ce).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 29, 2026

Round 2 doc polish in 372d9ca8. claude bot's review on a26f650f confirmed the spec correctness ("Round 1 correctly addressed all Gemini findings… The spec logic, gap-config counterexample evidence, and harness exit-code handling are all correct") and flagged 6 stale documentation drift from round 1's structural changes (Tick removal, LockResolve removal, OCC-3 strict equality):

  1. OCC.tla:6-9 module header — removed the "LockResolver action that turns abandoned locks into versions" sentence; replaced with a forward reference to the M2 atomic-drain abstraction.
  2. OCC.tla:43 tsCounter variable comment — "advances on Tick + Commit" → "advances on BeginTxn + Commit".
  3. OCC.tla:309 OCC4_NoStrandedLockAtQuiescence comment — removed "LockResolve drains them"; points at the atomic-drain abstraction + M5 deferral.
  4. tla/README.md:49-60 "Run" section — generalised to enumerate safe + gap per module rather than naming MCHLC alone; manual-run snippet now notes the same pattern applies to OCC and future modules.
  5. tla/README.md:127-128 OCC description — removed the LockResolve reference, added forward pointer to the OCC.tla block comment + M5 deferral.
  6. tla/README.md:140 OCC-3 table entry — updated to strict equality readObs[t][k] = LatestVisible(k, start_ts(t)); the previous commit_ts <= start_ts matched the round-0 weak form.

Skipped (per claude bot's "not strictly required for M2"): the minor SYMMETRY observation that Permutations(TxnIds) \cup Permutations(Keys) doesn't fold the combined txn-swap ∘ key-swap element. The reduction is sound (no missed counterexamples); the cost is at most a few redundant state pairs in a 150-state model. Will revisit if M3/M4 state-space growth exposes the inefficiency.

Verified make tla-check: same 4 outcomes (HLC safe 3,594, HLC gap HLC-4 ce, OCC safe 150, OCC gap OCC-1 ce).

Spec-only doc fixes, no semantics changed. Pre-commit make lint (which scans the whole tree) flagged 3 issues in unrelated cmd/dynamodb-legacy-migrate/main.go left over from another local branch; those files are NOT part of this PR.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @bootjp's task in 1m 38s —— View job


Code Review — Round 3 (after 372d9ca8)

  • Gather context and understand request
  • Read tla/occ/OCC.tla — verify 3 doc fixes (module header, tsCounter comment, OCC-4 comment)
  • Read tla/README.md — verify 3 doc fixes (Run section, OCC description, OCC-3 table)
  • Cross-check all 6 fixes from round 2 report
  • Check for any remaining issues
  • Write final review

All six round-2 doc fixes landed correctly. Confirming each:

OCC.tla fixes

  1. Module header (OCC.tla:6-11)LockResolve description removed; replaced with forward reference to M2 atomic-drain abstraction and M5 deferral. ✓ Correct.

  2. tsCounter variable comment (OCC.tla:45) — Now reads "advances on BeginTxn + Commit". ✓ Matches BeginTxn (tsCounter' = tsCounter + 1 at line 107) and Commit under EnableSafety (line 178).

  3. OCC4_NoStrandedLockAtQuiescence comment (OCC.tla:309-315) — "LockResolve drains them" removed; now correctly points at the atomic-drain abstraction and M5 deferral. ✓

tla/README.md fixes

  1. "Run" section (README.md:49-65) — Generalized to "safe + gap configuration pair for every module declared in scripts/tla-check.sh". Both safe and gap bullet entries now enumerate HLC and OCC examples. Manual-run snippet notes the OCC pattern at line 82-83. ✓

  2. OCC module description (README.md:133-142)LockResolve mention removed; replaced with forward reference to the block comment in OCC.tla and the M5 deferral. ✓

  3. OCC-3 table entry (README.md:151) — Now reads "Every read observation equals LatestVisible(k, start_ts(t))". ✓ Matches the strict-equality invariant at OCC.tla:304-307.


One remaining minor stale reference

OCC.tla:116ReadKey action comment still says "LockResolve fires first":

(* outstanding, the read is blocked (modelled here as the action being    *)
(* disabled; LockResolve fires first).                                    *)

With LockResolve removed, the lock holder's Commit or Abort is what eventually unblocks the read (by draining the lock in that same atomic step). A one-word fix:

(* outstanding, the read is blocked (modelled here as the action being    *)
(* disabled until the lock holder's Commit or Abort drains the lock).     *)

This is a nit — the spec behavior is unambiguous from the actual action guard ~ConflictingLockExists(k, startTs[t]) — so it's fine to fix as a follow-on or fold into a later commit.


Spec correctness (unchanged from round 2 assessment)

All prior correctness findings carry over: OCC-1 through OCC-5 invariant logic is sound, the gap-config counterexample evidence is correct, tla-check.sh exit-code handling is robust, and TypeOK is complete. No new concerns introduced by the round 2 doc-only changes.


Overall: The PR is ready to merge. The one remaining stale phrase at OCC.tla:116 is the only loose end — low enough risk to address in a follow-on or squash before merge per your preference.

claude bot round 3 confirmed all 6 round-2 doc fixes landed and the
PR is "ready to merge" with one remaining loose end at OCC.tla:116.

The ReadKey action comment described the block path as "LockResolve
fires first" — but LockResolve was removed in round 1 alongside Tick.
With the atomic-drain abstraction, the lock holder's Commit or Abort
is what eventually unblocks the read by clearing the lock in the same
atomic step.  Updated the comment to say so.

Verified make tla-check end-to-end unchanged:
- HLC safe (3,594 states) + HLC gap (HLC-4 ce).
- OCC safe (150 states) + OCC gap (OCC-1 ce).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 29, 2026

Round 3 fix in b6023839. claude bot Round 3 confirmed all 6 round-2 fixes landed and rated the PR "ready to merge" with one remaining loose end at OCC.tla:116 — the ReadKey action comment said "LockResolve fires first" but LockResolve was removed in round 1. Updated to:

the read is blocked (modelled here as the action being disabled until the lock holder's Commit or Abort drains the lock in that same atomic step).

make tla-check end-to-end unchanged (HLC 3,594 / HLC-4 ce / OCC 150 / OCC-1 ce).

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @bootjp's task in 1m 19s —— View job


Code Review — Round 4 (after b6023839)

  • Read tla/occ/OCC.tla — verify Round 3 fix at lines 112-117 (ReadKey comment)
  • Final pass over all changed files (scripts/tla-check.sh, tla/README.md, MCOCC.tla)
  • Write review

Round 3 fix confirmed. OCC.tla:112-117 now reads:

(* ReadKey: snapshot read on key k at startTs[t].  Gated by OCC-3 lock     *)
(* encoding — if any uncommitted lock with lockTs <= startTs[t] is         *)
(* outstanding, the read is blocked (modelled here as the action being    *)
(* disabled until the lock holder's Commit or Abort drains the lock in    *)
(* that same atomic step).                                                 *)

The stale "LockResolve fires first" phrase is gone; the replacement correctly describes the M2 atomic-drain semantics. ✓


Full-history reconciliation (all four rounds)

Round Finding Status
R1 — Gemini HIGH LockResolve dead action — removed, block comment added
R1 — Gemini HIGH tla-check.sh swallowed subshell exit — if ! gap_inv="$(…)" guard added
R1 — Gemini MEDIUM OCC3_ReadSnapshotStability too weak — strengthened to strict equality LatestVisible(k, startTs[t])
R1 — Gemini MEDIUM TypeOK missing lockMap[k].lockTs + versions — both added
R2 — doc drift OCC.tla module header still named LockResolve — removed, M5 forward ref added
R2 — doc drift OCC.tla:45 tsCounter comment "advances on Tick + Commit" — updated to "BeginTxn + Commit"
R2 — doc drift OCC.tla OCC-4 comment "LockResolve drains them" — removed, points to atomic-drain abstraction
R2 — doc drift README.md Run section described only HLC configs — generalized to safe+gap per module
R2 — doc drift README.md OCC description named LockResolve — replaced with block-comment forward ref
R2 — doc drift README.md OCC-3 table entry described old weak <= form — updated to strict equality
R3 — stale OCC.tla:116 ReadKey comment "LockResolve fires first" — updated to atomic-drain
R2 — minor (deferred) MCOCC.tla symmetry group missing combined txn+key swap element — noted, deferred to M3/M4 deferred (acceptable)

Final correctness re-check (no changes needed)

  • OCC-1: BeginTxntsCounter + 1, Commit (safe) → tsCounter + 1. Strict-greater by construction, any interleaving only raises the counter further. Gap config borrows tsCounter without advancing, reproducing commit_ts == start_ts. ✓
  • OCC-2: Prepare requires ~HasLock(k) for every write key; two concurrent writers cannot both acquire the same key. ✓
  • OCC-3 stability: Monotonic ts allocation in BeginTxn / Commit guarantees no future Commit can produce commit_ts ≤ startTs[t], so LatestVisible(k, startTs[t]) is immutable after assignment. ✓
  • OCC-4: Trivially satisfied by construction (atomic drain in Commit/Abort); the block comment at lines 201-224 correctly explains this and defers the async resolver to M5. ✓
  • OCC-5: readObs[t][k] ≤ startTs[t] holds because ReadKey assigns LatestVisible(k, startTs[t]) and that value is by definition ≤ startTs[t]. ✓
  • tla-check.sh exit handling: if ! gap_inv="$(gap_invariant_for "$module")" (line 73) catches subshell exits; gap path captures both output and RC separately; specific invariant-string match prevents parse errors or JVM crashes from silently passing. ✓
  • TypeOK: Covers tsCounter, txnState, startTs, commitTs, writeSet, readObs, opCount, lockMap[k].owner, lockMap[k].lockTs, versions. Complete. ✓

This PR is ready to merge. All findings from Rounds 1-3 are resolved. No new issues found in Round 4.

@bootjp bootjp merged commit ebce174 into main May 29, 2026
10 checks passed
@bootjp bootjp deleted the feat/tla-m2-occ-spec branch May 29, 2026 08:09
bootjp added a commit that referenced this pull request May 29, 2026
gemini MEDIUM on PR #862 line 61: without set -e, an exit 64 from mc_basename (added for the routes module) only terminates the subshell. The parent loop would continue with mc=\"\" and the downstream files .tla / .cfg would be syntactically empty file paths that TLC mishandles silently. Same subshell-exit propagation gap that the gap_invariant_for if-check (gemini HIGH on PR #858) closed for the gap-invariant case.

Added explicit if-checks at both call sites:
- main loop where mc / safe_cfg / gap_cfg are derived
- run_tlc inner call (used for both safe and gap runs)

Verified make tla-check end-to-end: HLC 3594 / OCC 150 / MVCC 79 / Routes 29 distinct states on safe; all 4 gap configs surface their pinned invariant.
bootjp added a commit that referenced this pull request May 29, 2026
## Summary

M3 of the [TLA+ safety
spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md)
— the MVCC layer encoded as a per-key version chain with a `Compact`
action and a ghost variable that preserves the original `Write` history
so MVCC-4 is checkable.

Follows PR #858 (M2 OCC spec).

## What lands

- `tla/mvcc/MVCC.tla` — `Write(k, v)` allocates a fresh `tsCounter` and
appends to both `versions[k]` and the ghost `originalVersions[k]`;
`Compact(newMin)` advances the retention watermark and (under
`EnableSafety`) retains the largest `commit_ts` below `newMin` per key
so reads at `read_ts ≥ minRetained` still see the original value.
- `tla/mvcc/MCMVCC.tla` — TLC model module with key + value symmetry.
- `tla/mvcc/MCMVCC.cfg` / `MCMVCC_gap.cfg` — safe (PASS) + gap (FAIL on
MVCC-4).
- `scripts/tla-check.sh` — `TLA_MODULES` gains `"mvcc"`;
`gap_invariant_for` gains the pinned MVCC-4 string.
- `tla/README.md` — M3 status `Not started` → `Landed`; MVCC module
description + invariant table.

## Invariants

| # | Statement | Form |
|---|---|---|
| MVCC-1 | No two distinct version records on the same key share a
`commit_ts` | INVARIANT |
| MVCC-4 | For every key `k` and every `read_ts ∈ [minRetained, MaxTs]`,
`LatestBefore(versions[k], ts) = LatestBefore(originalVersions[k], ts)`
| INVARIANT |
| `MVCC_TsMonotonic` | `tsCounter` weakly increases across every step |
PROPERTY |
| `MVCC_GhostMonotonic` | `originalVersions[k]` only grows; `Compact`
leaves it `UNCHANGED` | PROPERTY |

## Deferred to M5

- **MVCC-2** (no version below the HLC physical ceiling) is properly an
`HLC.tla` property — the ceiling discipline lives there. M5 will check
the cross-spec form once `HLC.tla` is `INSTANCE`-d into the composed
model.
- **MVCC-3** (cross-node read consistency) requires a multi-node model
that the single-node M3 spec cannot express.

## Verification (local)

```text
$ make tla-check
... TLC: tla/hlc/MCHLC.cfg  (correct design, expected PASS)
... 3594 distinct states found, 0 left on queue
... TLC: tla/hlc/MCHLC_gap.cfg  (...expected FAIL on HLC4...)
OK: MCHLC_gap.cfg failed as designed (Invariant HLC4_NoRegressionAcrossTerms is violated)

... TLC: tla/occ/MCOCC.cfg  (correct design, expected PASS)
... 150 distinct states found, 0 left on queue
... TLC: tla/occ/MCOCC_gap.cfg  (...expected FAIL on OCC1...)
OK: MCOCC_gap.cfg failed as designed (Invariant OCC1_CommitTsAboveStart is violated)

... TLC: tla/mvcc/MCMVCC.cfg  (correct design, expected PASS)
... 79 distinct states found, 0 left on queue
... TLC: tla/mvcc/MCMVCC_gap.cfg  (...expected FAIL on MVCC4...)
OK: MCMVCC_gap.cfg failed as designed (Invariant MVCC4_NoLostCommitOnSnapshotInstall is violated)

tla-check: all model-check outcomes match the design contract.
```

The MVCC-4 counterexample is short (depth 3): one `Write` at `commit_ts
= 1`, one `Compact(newMin = 2)`, and the read at `read_ts ∈ [2, MaxTs]`
now sees `NotFound` where the ghost still shows the `Write`'s value.

## Modelling notes

- **`Compact` retention semantics.** Under `EnableSafety` the per-key
prune keeps every version with `commit_ts ≥ newMin` AND the single
largest version with `commit_ts < newMin` (so any later read in
`[newMin, MaxTs]` still sees the same value the original log would have
shown). Under the gap toggle, the prune drops every version with
`commit_ts < newMin`, including the latest-before-newMin retention.
- **`originalVersions` as a ghost.** Carrying every `Write` in a
parallel set that `Compact` never touches lets MVCC-4 be a direct
equality between the actual store and the canonical "what should be
visible" set. It is the simplest faithful encoding of "no commit is
lost" for a single-node spec.
- **HLC abstraction.** `tsCounter` is a single monotonic counter; M5
will replace it with an `INSTANCE` of `HLC.tla`'s real 48/16 layout.
This keeps M3's state space tight (79 distinct states under the default
bounds).
- **Symmetry.** Keys and values are both symmetric in MVCC.tla because
they participate only as indices in the invariants — neither has an
ordering. `MCMVCC.tla` declares both via `Permutations`.

## Self-review (5-lens per CLAUDE.md)

1. **Data loss** — exactly the property MVCC-4 captures. Verified by
gap-config counterexample.
2. **Concurrency** — single-node M3 model; multi-node interactions
deferred to M5.
3. **Performance** — `make tla-check` end-to-end runs in under 5 seconds
at default bounds.
4. **Data consistency** — MVCC-1 (uniqueness) + MVCC-4 (preserved
visibility) + two PROPERTY transitions. MVCC-2 / MVCC-3 deferred to M5
with rationale.
5. **Test coverage** — no Go tests added (spec-only). `make tla-check`
is the new coverage layer.

## Test plan

- [ ] `tla-check` CI workflow runs green (watches `tla/**`)
- [ ] `tla-spec-ai-review` does NOT fire — spec-only PR
- [ ] Reviewer cross-checks MVCC-1 / MVCC-4 invariants against §5.3 of
the design doc
- [ ] Reviewer runs `make tla-check` and confirms 6 outcomes (HLC / OCC
/ MVCC safe pass + gap fail)

## Out of scope

- MVCC-2 (HLC ceiling discipline) — checked in M5.
- MVCC-3 (multi-node read consistency) — checked in M5.
- M4 (Routes) and M5 (Composed) — separate future PRs.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a formal MVCC model capturing per-key version chains, compaction
behavior, and snapshot safety checks.

* **Documentation**
* Marked milestone M3 (MVCC) as landed and expanded docs describing
model scope, invariants, modes, and expected gap-counterexample
behavior.

* **Tests**
* Extended verification harness to run MVCC in both safety and gap
modes, including an expected counterexample check for the MVCC-4
invariant.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/861?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request May 29, 2026
…#865)

## Summary

M5 of the [TLA+ safety
spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md)
— the cross-module safety properties, capping the spec-only roadmap
(M1..M5).

M1..M4 each modelled a single subsystem in isolation. M5 captures the
**seams** between them — the two safety properties that no single-module
spec can express.

## What lands

- `tla/composed/Composed.tla` — recreates the minimal cross-module state
needed for Composed-1 (catalog history `routes[v]`, txn observed catalog
version `txnObservedVer[t]`, committing group `txnCommitGroup[t]`) and
Composed-3 (monotonic `tsCounter` advancing on every Commit). **Does NOT
`INSTANCE` M1..M4** — that would explode the state space well past M5's
`<10 min at default bounds` target from §8.1.
- `tla/composed/MCComposed.tla` — TLC model with key + group + txn
symmetry.
- `tla/composed/MCComposed.cfg` / `MCComposed_gap.cfg` — safe (PASS) +
gap (FAIL on Composed-1).
- `scripts/tla-check.sh` — `TLA_MODULES` gains `"composed"`;
`gap_invariant_for` + `mc_basename` gain the corresponding cases.
- `tla/README.md` — M5 status `Not started` → `Landed`; Composed module
description + invariant table.

## Invariants

| # | Statement | Form |
|---|---|---|
| Composed-1 | Every committed write key was owned by the committing
group at the txn's observed catalog version | INVARIANT |
| Composed-2 | Vacuously TRUE in this abstraction — `SplitRange` is
same-group only per CLAUDE.md | INVARIANT |
| Composed-3 | Distinct committed txns have distinct `commit_ts` |
INVARIANT |
| `Composed_CatalogMonotonic` | `catalogVersion` weakly increases on
every step | PROPERTY |
| `Composed_TsMonotonic` | `tsCounter` weakly increases on every step |
PROPERTY |
| `Composed3_TsAction` | Every Commit strictly raises `tsCounter` |
PROPERTY |

## Why no `INSTANCE`?

Each M1..M4 module tightly bounds its own state for tractable TLC.
Bringing all four `INSTANCE`s into one product spec would multiply the
state spaces — even a 100x growth lands well past M5's `<10 min at
default bounds` target. M5 instead recreates the minimal cross-module
state needed to express the two seam invariants. The integration claim
is that this projection preserves the invariants the full product would
assert; reviewers can sanity-check each Composed action against the
corresponding M1..M4 action (e.g., `ProposeRouteChange` mirrors
`Routes.tla`, `BeginTxn` + `Commit` mirror `OCC.tla` with a cross-module
observed-catalog-version pin).

## Verification (local)

```text
$ make tla-check
HLC      safe 3,594 distinct states + HLC-4 gap ce
OCC      safe   150 distinct states + OCC-1 gap ce
MVCC     safe    79 distinct states + MVCC-4 gap ce
Routes   safe    29 distinct states + Routes-4 gap ce
Composed safe 1,684 distinct states + Composed-1 gap ce
tla-check: all model-check outcomes match the design contract.
```

The Composed-1 counterexample is the canonical 4-step regression:

1. `BeginTxn(t)` — `txn` observes `catalogVersion = 0`, where
`routes[0][k1] = g1`.
2. `WriteIntent(t, k1)`.
3. `ProposeRouteChange(k1, g2)` — `catalogVersion → 1`, `routes[1][k1] =
g2`.
4. `Commit(t, g2)` (gap mode) — committing group is `g2`, but
`routes[0][k1] = g1 ≠ g2`. **Composed-1 fails.**

## Self-review (5-lens per CLAUDE.md)

1. **Data loss** — out of scope (cross-module invariants check group
ownership, not durability).
2. **Concurrency** — Composed-1 is exactly the concurrency seam between
OCC and Routes; the gap counterexample is one such interleaving.
3. **Performance** — `make tla-check` end-to-end runs in ~10 s on a dev
laptop, well under the 10-min target.
4. **Data consistency** — Composed-1 (cross-module write/route
consistency) + Composed-3 (cross-txn ts uniqueness) + 3 PROPERTY
transitions.
5. **Test coverage** — no Go tests added (spec-only). `make tla-check`
is the coverage layer.

## Roadmap status (post-merge)

| Milestone | Status |
|---|---|
| M1 HLC | Landed (PR #856) |
| M2 OCC | Landed (PR #858) |
| M3 MVCC | Landed (PR #861) |
| M4 Routes | Landed (PR #861 via stacked #862) |
| **M5 Composed** | **This PR** |
| M6 liveness (OPTIONAL) | Not started — can follow as a separate PR |
| M1 Go follow-up (ceiling fence) | Not started — separate Go PR with
caller audit |

## Test plan

- [ ] `tla-check` CI runs green (watches `tla/**`)
- [ ] `tla-spec-ai-review` does NOT fire — spec-only PR
- [ ] Reviewer cross-checks Composed-1 / Composed-3 against §5.5 of the
design doc
- [ ] Reviewer runs `make tla-check` and confirms 10 outcomes (5 safe
pass + 5 gap fail)
- [ ] Reviewer sanity-checks the `INSTANCE`-free projection claim — each
Composed action mirrors the corresponding M1..M4 action

## Out of scope

- M6 liveness (OCC-L1, Routes-L1) — separate PR if needed
- HLC-4 (iii) ceiling fence Go implementation — separate code PR
- Cross-group `SplitRange` in the real implementation (currently
same-group only)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant