tla: M1 — HLC.tla + lib + make tla-check (HLC-1..4 invariants + gap counterexample) by bootjp · Pull Request #856 · bootjp/elastickv

bootjp · 2026-05-28T17:03:25Z

Summary

M1 of the TLA+ safety spec proposal lands:

tla/lib/Env.tla + tla/lib/Raft.tla (shared scaffolding).
tla/hlc/HLC.tla with invariants HLC-1..HLC-4 + TypeOK.
- Precondition (i) bounded clock skew encoded as a TLA+ ASSUME in Env.tla.
- Precondition (ii) logical-counter handoff — strategy (c) — encoded as a state update inside BecomeLeader_HLC (hlc.Observe(maxAppliedHLC) analogue).
- Precondition (iii) ceiling fence encoded as an enabling action guard on IssueTimestamp (wallNow[n] < physicalCeiling[n]).
tla/hlc/MCHLC.tla + MCHLC.cfg (correct design — TLC passes) and MCHLC_gap.cfg (preconditions disabled — TLC produces the HLC-4 counterexample the design doc anticipated).
make tla-check Makefile target: downloads checksum-pinned tla2tools.jar v1.8.0 (SHA-256 237332bd…) to .cache/tla/ and runs both configurations, inverting the exit code on the gap config so the expected-FAIL counts as PASS for CI.
tla/README.md covering install / run / what each module proves / reading list / how to map a TLC counterexample back to the implementation, per §7.1 of the design doc.
Design doc renamed proposed_ → partial_ (per CLAUDE.md lifecycle), header Status: updated, in a dedicated commit.

Verification (local)

$ make tla-check
================================================================
  TLC: tla/hlc/MCHLC.cfg  (correct design, expected PASS)
================================================================
... Model checking completed. No error has been found.
... 61 states generated, 28 distinct states found, 0 states left on queue.

================================================================
  TLC: tla/hlc/MCHLC_gap.cfg  (no preconditions, expected FAIL)
================================================================
... Error: Invariant HLC4_NoRegressionAcrossTerms is violated.
... 44 states generated, 28 distinct states found, 14 states left on queue.
... depth 5

OK: MCHLC_gap.cfg failed as designed (HLC-4 counterexample).

tla-check: all model-check outcomes match the design contract.

The gap-config counterexample (depth 5):

BecomeLeader(n1) — n1 elected term 1
IssueTimestamp(n1) — commits (wall=0, logical=0) in term 1
IssueTimestamp(n1) — commits (0, 1) in term 1 (logical increments)
BecomeLeader_HLC(n2) — n2 elected term 2; EnableSafety = FALSE so strategy (c) does NOT raise n2's hlcLast to maxAppliedHLC = (0, 1)
IssueTimestamp(n2) — commits (0, 1) in term 2 (same as term 1's last commit; HLC-4 requires strict >)

This is exactly the bug the design doc §5.1 HLC-4 (ii) said the spec would surface.

Self-review (per CLAUDE.md five-lens checklist)

Data loss — n/a: spec-only PR, no persistence paths touched.
Concurrency / distributed failures — the spec itself models the concurrency bugs (HLC-4 cross-leader regression); both make tla-check configurations explore the full bounded state space.
Performance — make tla-check runs in <1s on a developer laptop at default bounds.
Data consistency — MCHLC.cfg proves HLC-1..HLC-4 hold under preconditions (i)/(ii)/(iii); MCHLC_gap.cfg proves the preconditions are necessary by surfacing the documented counterexample.
Test coverage — Go code untouched, no Go tests added. make tla-check is the new coverage layer; running it is the test.

Out of scope

This PR is spec-only. The Go code changes that strategy (c) and the ceiling fence require (per §5.1 HLC-4 (ii) and (iii)) are not in this PR — they need:

Add MaxAppliedHLC() to kv/fsm.go (running max across applied write entries).
Call hlc.Observe(fsm.MaxAppliedHLC()) on new-term detection in kv/sharded_coordinator.go RunHLCLeaseRenewal, before any persistence Next() is served.
Add wallNow < physicalCeiling fence in kv/hlc.go Next() with a fail-closed return path and operator-visible metrics — caller audits of every site that calls Next() will be required (per the loop-feedback discipline) since Next() changes from infallible to error-returning.

Tracked as a follow-up issue; this PR lands the design contract that the follow-up must satisfy.

Test plan

CI runs make tla-check (CI integration is deferred to a future PR per §7.3 of the design doc — local invocation is the contract for now)
Reviewer runs make tla-check locally and confirms the safe pass + gap-config counterexample
Reviewer cross-checks each invariant in HLC.tla against the corresponding entry in §5.1 of the design doc
Reviewer cross-checks the implementation anchors in tla/README.md "How to interpret a TLC failure" against the current Go source

Per docs/design/2026_05_28_proposed_tla_safety_spec.md §8.1 (M1 deliverables): adds the TLA+ specification suite directory layout and the HLC module, runnable via `make tla-check`. Files: - tla/lib/Env.tla — shared constants and the HLC-4 precondition (i) ASSUME (MaxClockSkewMs < HlcPhysicalWindowMs). - tla/lib/Raft.tla — abstract Raft interface (currentTerm, leaderOf, BecomeLeader, IsLeader). Log matching and leader completeness are treated as assumed properties per NG4. - tla/hlc/HLC.tla — HLC layer with all five invariants: - HLC-1 (per-node monotonic), HLC-2 (ceiling monotonic), HLC-3 (leader-only issuance), HLC-4 (no regression across leader change), TypeOK. Strategy (c) handoff lives in BecomeLeader_HLC; the ceiling fence is an enabling guard on IssueTimestamp. Both are gated by CONSTANT EnableSafety so the same spec drives the correct-design and gap configurations. - tla/hlc/MCHLC.tla — TLC model-check instance. - tla/hlc/MCHLC.cfg — EnableSafety = TRUE (expected to PASS all invariants at default bounds: 3 nodes / 2 terms / 3 ops). - tla/hlc/MCHLC_gap.cfg — EnableSafety = FALSE (expected to FAIL on HLC-4 with a counterexample, demonstrating that strategy (c) + ceiling fence are necessary). - tla/README.md — install, run, what each module proves, reading list, how to map a TLC counterexample back to the implementation. Tooling: - Makefile: tla-check target downloads tla2tools.jar v1.8.0 to .cache/tla/ with checksum pinning (SHA-256 237332bdcc79a35c7d26efa7b82c77c85c2744591c5598673a8a45085ff2a4fb), runs both configurations, and inverts the exit code on the gap configuration so the expected-FAIL counts as PASS for CI. - .gitignore: TLC counterexample-trace artefacts (MCHLC_TTrace_*.{tla,bin}) and the per-run states/ directory. Verified locally: - make tla-check passes end-to-end. - MCHLC.cfg: 28 distinct states, 0 invariant violations. - MCHLC_gap.cfg: HLC-4 counterexample at depth 5 (n1 commits (0,1) in term 1; n2 elected term 2 without strategy (c), commits (0,1) again, tying the prior commit — HLC-4 requires strictly greater). Follow-up (separate PRs): - Rename the design doc proposed -> partial (next commit in this PR). - Implement strategy (c) and the ceiling fence in Go: - Add MaxAppliedHLC() to kv/fsm.go. - Call hlc.Observe(fsm.MaxAppliedHLC()) on new-term detection in kv/sharded_coordinator.go RunHLCLeaseRenewal. - Add wallNow < physicalCeiling fence to HLC.Next() with a fail-closed return path, plus operator-visible metrics.

M1 (HLC.tla + lib + make tla-check) lands in the prior commit on this branch, satisfying the first row of §8.1. Per CLAUDE.md docs/design/ lifecycle, rename proposed -> partial and update the header / Status note. Filename propose date and slug are preserved; only the status token changed. M2..M5 remain open; M6 (liveness) remains optional.

coderabbitai · 2026-05-28T17:03:33Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 8 minutes and 27 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c2f771a7-8716-475d-8e51-673ef52a21b4

📥 Commits

Reviewing files that changed from the base of the PR and between fc8e4bd and 2a2238a.

📒 Files selected for processing (11)

.github/workflows/tla-check.yml
.gitignore
Makefile
docs/design/2026_05_28_partial_tla_safety_spec.md
tla/README.md
tla/hlc/HLC.tla
tla/hlc/MCHLC.cfg
tla/hlc/MCHLC.tla
tla/hlc/MCHLC_gap.cfg
tla/lib/Env.tla
tla/lib/Raft.tla

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/tla-m1-hlc-spec

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces the TLA+ safety specification for the Hybrid Logical Clock (HLC) as part of Milestone 1, including the core specification files, model-checking configurations, and a Makefile target to automate TLC verification. The review feedback highlights two key improvements: rewriting subtraction-based inequalities in the TLA+ specification to addition to prevent evaluation errors when clock skew is enabled, and updating the Makefile to support both sha256sum and shasum for better portability across different environments.

gemini-code-assist · 2026-05-28T17:05:09Z

+    /\ \A m \in Nodes : (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs
+                     /\ wallNow[m] - (wallNow[n] + 1) <= MaxClockSkewMs


In TLA+, subtraction on natural numbers (Naturals) is only defined when the left operand is greater than or equal to the right operand. If MaxClockSkewMs is configured to be greater than 0 (which is the core scenario for verifying clock skew safety), a state can be reached where wallNow[m] > wallNow[n] + 1. In this state, evaluating (wallNow[n] + 1) - wallNow[m] will result in a negative number, causing TLC to halt with an evaluation error.

To prevent this, we can rewrite the inequalities using addition instead of subtraction, which is always safe and closed under Naturals.

/\ \\A m \\in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

gemini-code-assist · 2026-05-28T17:05:09Z

+	@actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \
+	if [ "$$actual" != "$(TLA_SHA256)" ]; then \
+		echo "ERROR: tla2tools.jar SHA-256 mismatch."; \
+		echo "  expected: $(TLA_SHA256)"; \
+		echo "  actual:   $$actual"; \
+		rm -f $(TLA_JAR).tmp; \
+		exit 1; \
+	fi


The shasum command is not installed by default on many Linux distributions (such as Ubuntu/Debian), whereas sha256sum is part of GNU coreutils and is universally available. To ensure the Makefile is portable and robust across both macOS and Linux environments (especially in minimal CI runners), we should check for the availability of both commands and use whichever is present.

@if command -v sha256sum >/dev/null 2>&1; then \ actual=$$(sha256sum $(TLA_JAR).tmp | awk '{print $$1}'); \ elif command -v shasum >/dev/null 2>&1; then \ actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \ else \ echo "ERROR: neither sha256sum nor shasum found."; \ rm -f $(TLA_JAR).tmp; \ exit 1; \ fi; \ if [ "$$actual" != "$(TLA_SHA256)" ]; then \ echo "ERROR: tla2tools.jar SHA-256 mismatch."; \ echo " expected: $(TLA_SHA256)"; \ echo " actual: $$actual"; \ rm -f $(TLA_JAR).tmp; \ exit 1; \ fi

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66b515392e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T17:05:44Z

+HLC1_PerNodeMonotonic ==
+    \A r1, r2 \in committedTS :
+        r1.node = r2.node => HLCLE(r1.ts, r2.ts) \/ HLCLE(r2.ts, r1.ts)


Track issuance order for HLC-1

In this spec, committedTS is a set and HLCLE is a total lexicographic order, so this invariant is true for any two timestamps from the same node regardless of which one was issued first. For example, if a later change allowed a node to commit (5,0) and then regress to (4,0), the reversed disjunct HLCLE(r2.ts, r1.ts) would still make HLC-1 pass. Since the design doc says HLC-1 proves monotonicity across Next/Observe, the model needs to record issuance order/history or assert the monotonicity in the actions rather than just comparability.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-28T17:05:44Z

+\* with a max(), and there is no action that lowers physicalCeiling.  The
+\* invariant is therefore a stuttering-step consequence, asserted here as
+\* a sanity check rather than a behavioural property.
+HLC2_CeilingMonotonic == \A n \in Nodes : physicalCeiling[n] >= 0


Encode actual ceiling monotonicity

This invariant only checks that each ceiling is a natural number, which is already implied by TypeOK, so TLC would still pass if a future action lowered physicalCeiling[n] to any smaller non-negative value or made nodes diverge after applying the same ceiling entry. Because HLC-2 is meant to prove per-node monotonicity and same-entry equality, this needs history/action-level checking (or previous values in the state) rather than >= 0.

Useful? React with 👍 / 👎.

…rtability Four findings from PR #856 review on commit 66b5153: - HLC.tla TickWall skew check: rewrite subtraction as addition. Nat subtraction in TLA+ underflows when LHS < RHS, which would halt TLC the moment MaxClockSkewMs > 0 enabled an out-of-order TickWall. Inequalities equivalently expressed with `+` stay inside Nat. (gemini HIGH on PR #856). - HLC1_PerNodeMonotonic now uses the per-commit `seq` ordinal (the global opCount value at issuance time) so the invariant compares same-node commits in their actual issuance order: `r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts)`. The earlier `HLCLE(r1, r2) \/ HLCLE(r2, r1)` form was vacuously true on a totally ordered set and would not have caught a future regression (codex P2 on PR #856). - HLC1_Action + HLC2_Action: new action-level safety properties registered under PROPERTIES (not INVARIANTS) in both .cfg files — TLC checks `[][P]_vars` as a transition property without requiring fairness. These pin the contract that hlcLast and physicalCeiling weakly increase across every step. HLC2_CeilingMonotonic's old `>= 0` form was vacuously implied by TypeOK (codex P2 on PR #856) and is replaced by HLC2_NonNeg (the residual TypeOK content) plus HLC2_Action (the real monotonicity claim). - Makefile sha256 portability: prefer GNU sha256sum (Linux universal), fall back to shasum -a 256 (macOS default), error out if neither is available. Without this the target failed on minimal Linux CI images (gemini MEDIUM on PR #856). - committedTS records now carry `seq` (= opCount at issuance), and TypeOK is extended to require it. Verified: `make tla-check` still PASSES on MCHLC.cfg (now also exercising the two action properties) and FAILS on MCHLC_gap.cfg as designed, with the HLC-4 counterexample at depth 5 unchanged.

bootjp · 2026-05-28T17:30:00Z

Addressed PR #856 round 1 review in 791c65b3. Four findings:

gemini HIGH (HLC.tla line 97) — Nat subtraction underflow. Rewrote the TickWall skew check using addition instead of subtraction. The old form (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs would underflow in TLA+ Naturals the moment MaxClockSkewMs > 0 admitted an out-of-order tick, halting TLC. The equivalent additive form stays inside Nat.
gemini MEDIUM (Makefile) — sha256sum portability. Prefer GNU sha256sum (universal on Linux), fall back to macOS shasum -a 256, error out if neither is available. Without this make tla-check failed on minimal Linux CI images.
codex P2 (HLC.tla line 180) — HLC-1 was vacuously true. Old form HLCLE(r1, r2) \/ HLCLE(r2, r1) is just totality on a totally ordered set, so a regression (5, 0) → (4, 0) would have passed. Added a per-commit seq ordinal (= global opCount at issuance time) and changed the invariant to r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts). Also added the transition-level HLC1_Action property to pin "hlcLast weakly increases on every step" so future actions cannot silently regress.
codex P2 (HLC.tla line 186) — HLC-2 was vacuously implied by TypeOK. Old form physicalCeiling[n] >= 0 was just type non-negativity. Replaced with HLC2_NonNeg (the residual TypeOK content; harmless) plus the real HLC2_Action == [][\A n: physicalCeiling[n] <= physicalCeiling'[n]]_vars transition property registered under PROPERTIES.

The TLA+ "same set of ceiling entries → equal across nodes" half of HLC-2 is structurally satisfied in M1 because ApplyCeiling propagates atomically (per the action comment); per-node async propagation is an M5 concern when route-catalog/HLC interactions cross multiple groups.

Verification. make tla-check still PASSES on MCHLC.cfg (now also exercising the two action properties) and FAILS on MCHLC_gap.cfg as designed, with the HLC-4 counterexample at depth 5 unchanged. No Go code semantics changed — spec-only PR, no caller audit required.

@claude review

claude · 2026-05-28T17:30:14Z

Claude finished @bootjp's task in 5m 31s —— View job

Review of Round 2 (`791c65b`)

Todo list:

Load tools and start review
Read HLC.tla — verify Nat subtraction fix, HLC-1 fix (seq ordinal), HLC-2 fix (action property)
Read Env.tla — verify skew check rewrite
Read Raft.tla — general review
Read MCHLC.tla / .cfg / gap.cfg — verify action properties are registered
Read Makefile — verify sha256sum portability fix
Read tla/README.md and design doc — consistency check
Synthesize and post findings

Round-1 fixes: all correct ✓

Fix 1 — Nat subtraction underflow (TickWall HLC.tla:100-101)
The rewrite from (wallNow[n]+1) - wallNow[m] <= MaxClockSkewMs to:

/\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
                    /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

is semantically equivalent and stays inside Nat. Confirmed correct.

Fix 2 — sha256sum portability (Makefile:39-54)
The if sha256sum...elif shasum...else error fallback chain is correct Make syntax (single @if block with \ continuations, proper $$ escaping). Works on both Linux (sha256sum) and macOS (shasum -a 256).

Fix 3 — HLC-1 was vacuous (HLC.tla:193-195)
Adding seq = opCount at issuance and changing the invariant to r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts) is correct. Since only the active leader can call IssueTimestamp, the global opCount faithfully captures per-node temporal ordering. Combined with HLC1_Action (PROPERTIES), this pins both the per-commit set invariant and the per-step transition. Confirmed non-vacuous.

Fix 4 — HLC-2 was vacuous (HLC.tla:203-221)
HLC2_NonNeg is a residual TypeOK subsumption (harmless, noted in the comment). The load-bearing HLC2_Action == [][\A n \in Nodes : physicalCeiling[n] <= physicalCeiling'[n]]_vars is the real monotonicity claim, registered under PROPERTIES in both .cfg files. Confirmed the gap config correctly keeps both action properties (they don't fail in the gap scenario — only HLC4 does).

New findings

1. Stale `proposed_` path in TLA module headers and Makefile — medium

Four files still reference docs/design/2026_05_28_proposed_tla_safety_spec.md after the rename:

File	Line
`tla/hlc/HLC.tla`	4
`tla/lib/Env.tla`	4
`tla/lib/Raft.tla`	4
`Makefile`	17

tla/README.md already has the correct partial_ path. The referenced file no longer exists under the old name — the links are broken.

Fix this →

2. README invariant table names HLC-2 incorrectly — low

tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator does not exist in HLC.tla. The actual operators are HLC2_NonNeg (INVARIANTS) and HLC2_Action (PROPERTIES).

Fix this →

3. `TickWall` is unreachable with `MaxClockSkewMs = 0` — informational

The skew guard \A m \in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs includes m = n. For m = n, condition 1 reduces to wallNow[n] + 1 <= wallNow[n] + 0 = 1 <= 0, which is always false. So with MaxClockSkewMs = 0 (both current configs), TickWall is never enabled, all wallNow[n] stay at 0 throughout the model check, and MaxWallTime = 3 is never exercised.

Physical-time advancement is modeled exclusively via ApplyCeiling → physicalCeiling. The invariants still hold under this restriction, so the M1 model check is not unsound — but the wall-clock ticking path is untested. A future config with MaxClockSkewMs >= 1 (and the corresponding tighter ASSUME MaxClockSkewMs < HlcPhysicalWindowMs) would exercise TickWall.

No fix needed for M1; worth tracking as a follow-up config for deeper clock-skew exploration.

Summary

The round-1 fixes are all correctly implemented. The spec is sound for M1. Two doc-consistency issues (stale proposed_ paths, wrong invariant name in README) should be fixed before merging. The TickWall/MaxClockSkewMs=0 observation is informational and out of scope for this PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 791c65b31b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T17:31:37Z

+    /\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
+                        /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs


Exclude self from skew check

With the checked configs setting MaxClockSkewMs = 0, this universal quantifier also evaluates the m = n case, which requires wallNow[n] + 1 <= wallNow[n] + 0 before any tick can occur. That is impossible, so TickWall is never enabled and the model never explores wall-clock advancement, ceiling expiry, or skew schedules despite the MaxWallTime bounds. Exclude the ticking node from this comparison or compare it against its updated self value.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-28T17:31:37Z

+	else \
+		echo; \
+		echo "OK: MCHLC_gap.cfg failed as designed (HLC-4 counterexample)."; \


Validate the gap failure reason

In this tla-check target, every non-zero exit from the gap TLC run is accepted as the expected HLC-4 counterexample. If MCHLC_gap.cfg starts failing because of a parse/type error, a different invariant, a deadlock, or a JVM/tooling crash, CI will still print OK and pass, masking regressions in the newly added model-check harness. Check the TLC output for the specific HLC4_NoRegressionAcrossTerms invariant violation before treating the failure as success.

Useful? React with 👍 / 👎.

…reason Two P2 findings from codex on commit 791c65b. - HLC.tla TickWall: exclude self from the inter-node skew check. The `\A m \in Nodes:` quantifier included `m = n`, so under MaxClockSkewMs = 0 the constraint forced `wallNow[n] + 1 <= wallNow[n] + 0`, which is unsatisfiable. TickWall was therefore never enabled and the model never explored wall-clock advancement, ceiling expiry, or skew schedules — the safe config was passing on a degenerate 28-state subspace. Restrict the universal to `Nodes \ {n}` so self-comparison is excluded. (codex PR #856 round 2.) - Raise MaxClockSkewMs from 0 to 1 (and HlcPhysicalWindowMs from 1 to 2 so the ASSUME `MaxClockSkewMs < HlcPhysicalWindowMs` still holds) in both MCHLC.cfg and MCHLC_gap.cfg so TickWall is actually enabled and the skew dimension is part of the explored space. After this change the safe config explores 3,594 distinct states (vs 28 previously) and the gap config still surfaces the HLC-4 counterexample at depth 5. - Makefile tla-check: validate the gap-run failure reason instead of blindly accepting any non-zero TLC exit. Capture TLC stdout/stderr, print it, then grep for the exact string `Invariant HLC4_NoRegressionAcrossTerms is violated` before declaring the gap evidence intact. Without this check a parse error, deadlock, JVM crash, or different invariant violation would have silently counted as "expected gap counterexample" and masked regressions in the harness itself. (codex PR #856 round 2.)

bootjp · 2026-05-28T17:37:45Z

PR #856 round 2 review (codex P2 ×2 on 791c65b3) addressed in 1123a0e1:

codex P2 (HLC.tla line 101) — TickWall self-skew block. The \A m \in Nodes: quantifier included m = n, so under MaxClockSkewMs = 0 the constraint became wallNow[n] + 1 <= wallNow[n] + 0 (unsatisfiable). TickWall was never enabled and the model never explored wall-clock advancement, ceiling expiry, or skew schedules — the safe config was passing on a degenerate 28-state subspace. Excluded self from the universal: \A m \in Nodes \ {n}: ….

Also raised MaxClockSkewMs from 0 to 1 and HlcPhysicalWindowMs from 1 to 2 (so the ASSUME MaxClockSkewMs < HlcPhysicalWindowMs still holds) in both .cfg files so the skew dimension is actually part of the explored space.

codex P2 (Makefile line 81) — Validate gap failure reason. tla-check previously accepted any non-zero exit from the gap TLC run as the expected HLC-4 counterexample. Capture stdout/stderr, print it, then grep for Invariant HLC4_NoRegressionAcrossTerms is violated before declaring the gap evidence intact. A parse error, deadlock, JVM crash, or different invariant violation now correctly fails the target instead of masking a harness regression.

Verification. Safe config explores 3,594 distinct states (vs 28 previously) and completes without invariant violation. Gap config still surfaces the HLC-4 counterexample at depth 5; the Makefile correctly identifies it by string match.

Spec/Makefile-only change, no Go code touched. No caller audit required.

@claude review

claude · 2026-05-28T17:38:02Z

Claude finished @bootjp's task in 2m 9s —— View job

Review of Round 3 (`1123a0e`)

Todo list:

Load tools
Read HLC.tla — verify TickWall self-exclusion fix, MaxClockSkewMs=1 bump
Read Makefile — verify gap failure reason validation
Read MCHLC.cfg + MCHLC_gap.cfg — verify constant updates
Check round-2 Claude findings (stale proposed_ paths, wrong README invariant name)
Read tla/README.md and Env.tla, Raft.tla — consistency check
Synthesize and post findings

Round-3 fixes: both correct ✓

Fix 1 — TickWall self-exclusion (HLC.tla:107)

/\ \A m \in Nodes \ {n} :
    /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
    /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

Nodes \ {n} is the correct fix. Under MaxClockSkewMs = 0, \A m \in {} : ... is vacuously true, so TickWall is now enabled up to MaxWallTime. With the new MaxClockSkewMs = 1, condition 1 reduces to wallNow[n] <= wallNow[m] for each peer — n can only tick forward if it's not already ahead of any other node. This is a symmetric skew model and correct for the design intent. The in-comment explanation at lines 100–106 is clear.

Both .cfg files now carry MaxClockSkewMs = 1 and HlcPhysicalWindowMs = 2, keeping the Env.tla ASSUME MaxClockSkewMs < HlcPhysicalWindowMs (1 < 2) intact. State space goes from 28 to 3,594 distinct states — the wall-clock and skew dimensions are now genuinely exercised. Confirmed correct.

Fix 2 — Gap failure reason validation (Makefile:69-102)

The implementation is correct and well-structured:

Captures both stdout+stderr via 2>&1 into out=.
Separates the "unexpected clean pass" check (exit code 0 → error) from the "wrong failure reason" check (grep miss → error).
Uses grep -qF (fixed-string, not regex) — the right tool for matching a literal TLC error message.
Prints the full TLC output before the checks so the failure reason is always visible in CI logs regardless of which path is taken.

A JVM crash, parse error, deadlock, or wrong invariant violation will now correctly fail make tla-check instead of silently masking itself as expected gap evidence. Confirmed correct.

Outstanding from round-2 Claude review (still not addressed)

These two were raised in round 2 and remain unfixed after round 3.

1. Stale `proposed_` path in four files — medium

All four files still reference the pre-rename path docs/design/2026_05_28_proposed_tla_safety_spec.md, which was renamed to 2026_05_28_partial_tla_safety_spec.md in the M1 commit. The file no longer exists under the old name — the references are broken dead links.

File	Line	Current (broken)	Should be
`tla/hlc/HLC.tla`	4	`…proposed_tla_safety_spec.md §5.1`	`…partial_tla_safety_spec.md §5.1`
`tla/lib/Env.tla`	4	`…proposed_tla_safety_spec.md §6.1`	`…partial_tla_safety_spec.md §6.1`
`tla/lib/Raft.tla`	4	`…proposed_tla_safety_spec.md §3 / §6.1`	`…partial_tla_safety_spec.md §3 / §6.1`
`Makefile`	17	`…proposed_tla_safety_spec.md §7.2`	`…partial_tla_safety_spec.md §7.2`

tla/README.md already uses the correct partial_ path.

Fix this →

2. README invariant table names `HLC2_CeilingMonotonic` — low

tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator name does not exist in HLC.tla. The correct names are HLC2_NonNeg (in INVARIANTS) and HLC2_Action (in PROPERTIES). A reviewer cross-checking the README against the spec file — as the PR test plan requests — will not find the stated name.

Fix this →

Summary

Both round-3 codex P2 fixes are correctly implemented. The spec is now sound and the make tla-check harness is robust. The two doc-consistency issues from round 2 (stale proposed_ paths in four module headers, wrong invariant name in the README table) remain open and should be fixed before merge — particularly finding 1, since the linked design doc is the primary reference a reviewer uses to cross-check the invariants.

Two doc-consistency findings from claude bot's round 2 review of commit 791c65b (after the proposed -> partial rename in 66b5153 those references became broken links): - Update 4 stale references to the old proposed_ doc path: - tla/hlc/HLC.tla line 4 - tla/lib/Env.tla line 4 - tla/lib/Raft.tla line 4 - Makefile line 17 All now point at docs/design/2026_05_28_partial_tla_safety_spec.md. - tla/README.md "What each module proves" table referenced HLC2_CeilingMonotonic, which does not exist in HLC.tla after the round-1 rework. Replace with the actual operators: HLC1_PerNodeMonotonic (state, with strict seq ordering), HLC1_Action (PROPERTY, transition form), HLC2_NonNeg (state, residual TypeOK content), HLC2_Action (PROPERTY, transition form). Also clarify HLC-1's state form is the strict-seq version, not the earlier vacuous total-order form. Docs-only changes. make tla-check still passes on MCHLC.cfg (3,594 distinct states) and fails on MCHLC_gap.cfg as designed.

chatgpt-codex-connector · 2026-05-28T17:41:11Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

bootjp · 2026-05-28T17:41:19Z

Addressed claude bot's round 2 review (on 791c65b3) in 9901e6ca.

MEDIUM (stale proposed_ path): Updated 4 broken references to the design doc path that became invalid after the proposed_ → partial_ rename in 66b51539:
- tla/hlc/HLC.tla line 4
- tla/lib/Env.tla line 4
- tla/lib/Raft.tla line 4
- Makefile line 17
LOW (README invariant table): tla/README.md listed HLC2_CeilingMonotonic, which no longer exists after round 1's rework. Replaced with the actual operators: HLC1_PerNodeMonotonic, HLC1_Action (PROPERTY), HLC2_NonNeg, HLC2_Action (PROPERTY); also clarified HLC-1's state form uses the strict seq ordering, not the earlier vacuous total-order form.
INFO (TickWall MaxClockSkewMs = 0): claude bot was reviewing round-1 commit 791c65b3; this finding is already addressed in round-2 commit 1123a0e1 (TickWall now excludes self, MaxClockSkewMs raised to 1, safe config explores 3,594 distinct states).

Docs-only changes. make tla-check still passes on MCHLC.cfg (3,594 distinct states) and fails on MCHLC_gap.cfg as designed.

@claude review

claude · 2026-05-28T17:42:12Z

Claude finished @bootjp's task in 3m 55s —— View job

Review of Round 3 (`9901e6c`)

Todo list:

Load tools
Read HLC.tla — verify proposed_ → partial_ path fix
Read Env.tla — verify proposed_ → partial_ path fix
Read Raft.tla — verify proposed_ → partial_ path fix
Read Makefile — verify proposed_ → partial_ path fix
Read tla/README.md — verify invariant table fix (HLC2_CeilingMonotonic → correct names)
Read MCHLC.cfg + MCHLC_gap.cfg — verify constant updates still correct
Read MCHLC.tla — general review
Synthesize and post findings

Round-2 fixes: both correct ✓

Fix 1 — Stale proposed_ path in four files

All four broken references are now updated:

File	Line	Confirmed
`tla/hlc/HLC.tla`	4	`…partial_tla_safety_spec.md §5.1` ✓
`tla/lib/Env.tla`	4	`…partial_tla_safety_spec.md §6.1` ✓
`tla/lib/Raft.tla`	3	`…partial_tla_safety_spec.md §3 / §6.1` ✓
`Makefile`	17	`…partial_tla_safety_spec.md §7.2` ✓

docs/design/2026_05_28_partial_tla_safety_spec.md exists on disk; the old proposed_ path is gone. All links are live.

Fix 2 — README invariant table (tla/README.md:108-118)

The table now correctly lists all operators that exist in HLC.tla:

README entry	Exists in HLC.tla	Correct
`HLC1_PerNodeMonotonic`	line 201 ✓	✓
`HLC1_Action` (PROPERTY)	line 223 ✓	✓
`HLC2_NonNeg`	line 211 ✓	✓
`HLC2_Action` (PROPERTY)	line 228 ✓	✓
`HLC3_LeaderOnly`	line 234 ✓	✓
`HLC4_NoRegressionAcrossTerms`	line 242 ✓	✓

No more phantom HLC2_CeilingMonotonic. The note about "strict-seq ordering, not the earlier vacuous total-order form" is accurate.

Spec correctness pass — no issues found

TickWall skew model (HLC.tla:107-109)

With MaxClockSkewMs = 1:

Condition 1: wallNow[n] + 1 <= wallNow[m] + 1 = wallNow[n] <= wallNow[m] — n can tick only if not ahead of any peer.
Condition 2: wallNow[m] <= wallNow[n] + 2 — no peer is more than MaxClockSkewMs+1 ahead (post-tick skew ≤ MaxClockSkewMs).

Together: |wallNow[n]+1 − wallNow[m]| ≤ 1 for all peers — correct symmetric bounded-skew model. State space is non-degenerate (verified: 3,594 distinct states).

HLC1_Action holds on all actions

TickWall: UNCHANGED hlcLast → HLCLE(hlcLast[n], hlcLast[n]) trivially TRUE.
ApplyCeiling: UNCHANGED hlcLast → trivially TRUE.
BecomeLeader_HLC: either raises hlcLast[n] to maxAppliedHLC (HLCLT precondition guarantees > → HLCLE TRUE) or leaves it unchanged.
IssueTimestamp(n): ComputeNextHLC(prev, FlooredNow(n)) is always >= prev across all three branches (wall advance, logical increment, logical overflow). hlcLast[m] for m ≠ n is UNCHANGED.

HLC2_Action holds on all actions

Only ApplyCeiling modifies physicalCeiling, and it uses IF physicalCeiling[m] < newC THEN newC ELSE physicalCeiling[m] — strict max(), never decreases. All other actions UNCHANGED physicalCeiling.

HLC4_NoRegressionAcrossTerms with EnableSafety = TRUE

New leader n via BecomeLeader_HLC:

hlcLast[n] raised to maxAppliedHLC (≥ every previous commit across all terms).
IssueTimestamp gated on wallNow[n] < physicalCeiling[n], so FlooredNow(n) = physicalCeiling[n].
ComputeNextHLC(hlcLast[n], physicalCeiling[n]) produces a value strictly above maxAppliedHLC (either the ceiling is larger than hlcLast[n].wall, or the logical counter increments above maxAppliedHLC.logical). Every new-term commit is therefore strictly above every prior-term commit.

Both .cfg files register HLC1_Action + HLC2_Action under PROPERTIES — correctly separating temporal properties from state invariants. Both configs share MaxClockSkewMs = 1, HlcPhysicalWindowMs = 2, maintaining ASSUME 1 < 2 from Env.tla.

HLC3_LeaderOnly: IssueTimestamp(n) is gated on IsLeader(n), which guarantees activeTerm > 0 /\ leaderOf[activeTerm] = n. The committed record stores term = activeTerm and node = n, so the invariant trivially holds.

TypeOK: committedTS records carry seq \in 0..MaxOps via opCount at issuance time. The constraint opCount < MaxOps before IssueTimestamp and the increment opCount' = opCount + 1 after ensure seq stays in range.

MCHLC.tla: Clean module — just EXTENDS HLC, TLC with a symmetry hint Permutations(Nodes). Correct to share one module file between both configs.

Summary

All four rounds of review findings have been fully addressed:

Round 1 (gemini HIGH + MEDIUM, codex P2 ×2): Nat subtraction, sha256sum portability, vacuous HLC-1 and HLC-2.
Round 2 (codex P2 ×2): TickWall self-exclusion, gap failure reason validation.
Round 3 (claude MEDIUM + LOW): stale proposed_ paths, wrong invariant name in README.

The spec is sound and internally consistent. No new findings. Ready to merge.

Per docs/design/2026_05_28_partial_tla_safety_spec.md §7.3 — the M1 PR is the right place to land the CI workflow alongside the concrete spec. Adds .github/workflows/tla-check.yml: - Triggers on PR (opened/synchronize/reopened) and on push to main, with a path filter matching exactly the §7.3 list: tla/**, Makefile, kv/hlc*.go, kv/coordinator.go, kv/sharded_coordinator.go, kv/transaction.go, kv/fsm.go, kv/lock_resolver.go, store/mvcc_store.go, distribution/**, and the workflow file itself. - Uses Temurin JDK 21 (consistent with jepsen-test.yml). - Caches .cache/tla/tla2tools.jar keyed on the Makefile hash so any TLA_VERSION / TLA_SHA256 change automatically invalidates the cache. - Pre-fetches the jar via `make tla-tools` so the download is in a separate log group from the model check. - Runs `make tla-check`, which inherits the round-2 gap-config validation: a non-zero TLC exit on MCHLC_gap.cfg only counts as success if the output contains the exact HLC-4 violation string. Updates §7.3 of the design doc: removes the "(deferred)" qualifier and names the workflow file as the now-active CI binding. actionlint clean. The full M1 run completes in well under a minute locally (3,594 distinct states on the safe config, HLC-4 counter- example at depth 5 on the gap config).

bootjp · 2026-05-28T17:46:52Z

Wired make tla-check into GitHub Actions in ca0ecfc0.

New file: .github/workflows/tla-check.yml

Triggers on PR (opened / synchronize / reopened) and push to main, with a path filter matching §7.3 of the design doc (tla/**, Makefile, the HLC/coordinator/transaction/fsm/lock_resolver Go files, store/mvcc_store.go, distribution/**, and the workflow file).
Temurin JDK 21 (consistent with jepsen-test.yml).
Caches .cache/tla/tla2tools.jar keyed on Makefile hash so any TLA_VERSION / TLA_SHA256 change invalidates the cache automatically.
Pre-fetches the jar via make tla-tools so the download is logged separately from the model check.
Runs make tla-check, which keeps the round-2 gap-config validation: a non-zero TLC exit on MCHLC_gap.cfg only counts as success if the output contains the exact Invariant HLC4_NoRegressionAcrossTerms is violated string.

Updated §7.3 of the design doc: removed the (deferred) qualifier and named the workflow file as the now-active CI binding.

actionlint clean. The workflow is already running on this PR (ca0ecfc0): https://github.com/bootjp/elastickv/actions/runs/26591495196

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca0ecfc0e7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T17:49:44Z

+          restore-keys: |
+            ${{ runner.os }}-tla-tools-


Verify restored TLA jar before use

When TLA_VERSION or TLA_SHA256 changes, the primary cache key misses, and GitHub's cache docs say restore-keys are checked for partial matches; this can restore an older .cache/tla/tla2tools.jar. Because make tla-tools only depends on the existing file target and the checksum recipe in the Makefile is skipped when that file already exists, CI can run an old/unverified TLC jar despite the checksum pin. Drop the prefix restore or make the target always verify/remove stale jars after cache restore.

Useful? React with 👍 / 👎.

codex P2 on PR #856 ca0ecfc: the prefix-match restore-keys could bring back an older tla2tools.jar after a TLA_VERSION/TLA_SHA256 bump in the Makefile. Because $(TLA_JAR) is a file-existence target, the restored stale jar would skip the Makefile's checksum verification entirely — defeating the pin. A full cache miss is the correct behaviour on a version-pin change: the recipe runs, downloads from GitHub, and verifies the SHA-256 before serving the jar to TLC. Documented inline so the next reviewer doesn't re-add restore-keys.

bootjp · 2026-05-28T17:52:36Z

Addressed codex P2 on ca0ecfc0 in 2a2238af.

Finding: .github/workflows/tla-check.yml cache restore-keys could bring back an older tla2tools.jar after a TLA_VERSION / TLA_SHA256 bump in the Makefile. Because the Makefile's $(TLA_JAR) target is gated on file existence, the restored stale jar would skip the checksum verification entirely — defeating the pin.

Fix: Dropped the restore-keys block. A full cache miss is the correct behaviour on a version-pin change: the recipe runs, downloads from GitHub, and verifies the SHA-256 before serving the jar to TLC. Documented inline so the next reviewer doesn't re-add it.

This is a workflow-only change. The Makefile's checksum verification path is unchanged — and is now load-bearing for every version-pin bump.

@claude review

claude · 2026-05-28T17:52:56Z

Claude finished @bootjp's task in 1m 24s —— View job

Todo list:

Load tools
Read .github/workflows/tla-check.yml — verify restore-keys removal
Read Makefile — confirm tla-tools / TLA_JAR target is still the verification gatekeeper
Check overall workflow correctness after the change
Post findings

Review of Round 4 (`2a2238a`)

The diff is a 7-line addition (explanatory comment) + 2-line deletion (restore-keys block). The fix is correct.

The root cause and fix

The $(TLA_JAR) Make target is a standard file-existence guard:

$(TLA_JAR):
    @curl -fsSL -o $(TLA_JAR).tmp $(TLA_URL)
    @# ... SHA-256 verification ...
    @mv $(TLA_JAR).tmp $(TLA_JAR)

This recipe only runs when .cache/tla/tla2tools.jar does not exist. With the old restore-keys: ${{ runner.os }}-tla-tools-, a prefix match after a TLA_VERSION/TLA_SHA256 bump would restore an older jar (file now exists), the Make target would skip entirely, and TLC would silently run against the wrong version — with no checksum verification. Removing restore-keys means the only non-miss path is an exact hashFiles('Makefile') match, which guarantees the jar on disk is exactly the one the current Makefile specifies.

Correctness verified

Scenario	Cache behaviour	`make tla-tools`	Result
Same Makefile hash as last run	Exact hit → jar restored	no-op (file exists)	correct jar, no download
Makefile changed (e.g. TLA_VERSION bump)	Miss (no restore-keys)	downloads + SHA-256 verified	correct jar, download logged separately
First run ever	Miss	downloads + SHA-256 verified	correct jar

Comment quality

The inline comment (lines 70-75 of the final file) names the codex P2 finding, explains the $(TLA_JAR) file-existence mechanism, and states the correct semantics. This is sufficient to stop a future reviewer from re-adding restore-keys thinking it's a free performance win. Good.

Everything else unchanged and correct

key: ${{ runner.os }}-tla-tools-${{ hashFiles('Makefile') }} — both TLA_VERSION and TLA_SHA256 are in the Makefile, so bumping either one rotates the key.
cancel-in-progress: false — correct for a correctness check (don't abandon a running TLC invocation on push).
make tla-tools pre-fetch step still present — separates download logs from model-check logs, making failure attribution straightforward.
make tla-check still greps for the specific Invariant HLC4_NoRegressionAcrossTerms is violated string from round 2, so the gap evidence is validated by content, not just exit code.

No new findings. All five rounds of review findings are now fully addressed. Ready to merge.

@claude

…857) ## Summary Auto-trigger AI review (Claude + Codex) whenever a PR touches a file the TLA+ safety spec has an anchor on, so a spec-divergence review **always** fires — never relies on a human remembering to ping. Follows PR #856 (M1 spec + `tla-check` workflow). ## How it works New workflow: `.github/workflows/tla-spec-ai-review.yml` - Fires on `pull_request` (opened / synchronize / reopened / ready_for_review). - Path filter is an exact copy of `tla-check.yml`'s — the two workflows stay paired so every anchored-file change triggers **both** the TLC run AND the AI-review request. - Computes which anchored files actually changed (via `git diff` against the PR base) and skips the comment if none. - Posts a fresh PR comment per push containing both `@claude review` and `@codex review` mentions, plus a per-subsystem checklist of which spec invariants to verify (HLC-4 preconditions, OCC lock encoding, MVCC visibility, Routes catalog versioning). - **Always** `createComment` — never `updateComment`. The upstream `claude.yml` listens on `issue_comment: created` only; editing would not re-trigger Claude. Same for Codex. ## Chain ``` anchored file change in PR -> tla-check.yml runs TLC (catches *spec* bugs) -> tla-spec-ai-review.yml posts comment (with @claude + @codex pings) -> claude.yml fires on @claude → Claude reviews diff -> chatgpt-codex-connector bot fires → Codex reviews diff (both focused on TLA+ divergence) ``` The two workflows are complementary: `tla-check` verifies the spec passes; the AI reviewers verify the implementation matches the spec. ## Docs §7.3 of `docs/design/2026_05_28_partial_tla_safety_spec.md` is updated with the chain diagram, the two-workflow split, and a reminder that the path filter must stay in sync across the two YAML files and the doc § list. ## Verification - `actionlint` clean. - `make tla-check` still passes locally on `main` (3,594 distinct states on safe config, HLC-4 counterexample at depth 5 on gap config). - The workflow itself doesn't run TLC — it only posts a comment — so the trigger chain can be tested by opening a follow-up PR that touches e.g. `kv/hlc.go`. ## Test plan - [ ] CI green (`tla-check` runs again because the workflow path filter touches `tla/**`? No — this PR only adds `.github/workflows/`, which is NOT in the path filter, so `tla-check` will NOT fire on this PR. That's correct.) - [ ] Reviewer opens a tiny follow-up PR touching e.g. `kv/hlc.go` to verify the comment+bot chain fires - [ ] Reviewer confirms the comment template lists the right spec sections per subsystem ## Out of scope - Auto-collapsing older marker comments after a fresh push — embedded as a marker (``) so a future PR can add a "minimize older" step if the comment noise becomes a problem. - Triggering AI review when the spec itself changes (no impl divergence to check then — `tla-check` is sufficient).  ## Summary by CodeRabbit * **Documentation** * Updated design documentation describing the new automated specification review workflow, including trigger conditions, review coordination processes, and how verification ensures implementations maintain alignment with formal specifications * **Chores** * Added automated workflow that monitors pull requests for specification file changes and automatically coordinates review requests to verify specification compliance  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/857?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

…#859) ## Summary Implements the **M1 spec follow-up** per [`docs/design/2026_05_28_partial_tla_safety_spec.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) §5.1 HLC-4 precondition **(ii)** — strategy (c) "scan the FSM for max committed HLC and `Observe` before serving any persistence `Next()`". Follows PR #856 (M1 TLA+ spec). ## What lands One line in `kv/fsm.go` `applyRequestErr` after a successful `handleRequest`: ```go if f.hlc != nil && commitTS > 0 { f.hlc.Observe(commitTS) } ``` `HLC.Observe` is the existing CAS-to-max method (`kv/hlc.go:129-140`), so this is a **pure addition** with no signature change and no caller audit required. ## Why "observe on every apply" vs the spec doc's "observe on election" The spec doc describes strategy (c) as "on election, scan the FSM and call `Observe`". The implementation refines this to "Observe on every apply". The two are semantically equivalent for HLC-4 because etcd/raft applies all uncommitted prior-term entries before a new leader serves a write — so by the time `HLC.Next()` runs in the new term, `hlc.last >= every prior commit`. Two advantages over the on-election form: 1. **No protocol change** to `ShardedCoordinator` or `RunHLCLeaseRenewal`. 2. **No race window** between `BecomeLeader` and the first `Observe` — a concurrent client request landing in that window cannot see a stale `hlc.last`. ## Tests `kv/fsm_hlc_observe_test.go`: - `TestApplyObservesCommitTSIntoHLC` — verifies Apply advances `hlc.last` via `Observe(commitTS)`, and preserves monotonicity on a stale-ts apply. - `TestApplyHLCObserveAfterRestart` — simulates the spec's new-leader-after-restart scenario: a fresh HLC (`last = 0`) re-plays a prior leader's write through the FSM and the resulting `Next()` is strictly greater than the prior `commit_ts`. Both tests pass with `-race`. ## Deferred to a follow-up PR The **ceiling-fence half** of the M1 spec contract (HLC-4 precondition **(iii)** — making `HLC.Next()` fail-closed when `wall_now >= physicalCeiling`) is intentionally NOT in this PR. That change alters `HLC.Next()`'s return signature from `uint64` to `(uint64, error)` and requires auditing **18 production call sites** across `kv/`, `adapter/`, and `store/`. Splitting it from the strategy-(c) PR keeps each diff individually reviewable per the loop-discipline reminder for semantic changes. ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — n/a; `Observe` only advances an in-memory clock counter, no persistence path touched. 2. **Concurrency** — `Observe` is the existing CAS-to-max loop; safe under concurrent `Apply` (which is serial per Raft FSM contract anyway) and concurrent `Next` (the CAS handles it). 3. **Performance** — adds one atomic load + CAS to the FSM apply hot path. The CAS only fires when `ts > current` so steady-state is one load. 4. **Data consistency** — `Observe` is monotone (only advances). No semantic change to any other code path. 5. **Test coverage** — two new tests in `kv/fsm_hlc_observe_test.go` covering the basic apply→observe path and the post-restart catchup case. Existing kv + store tests pass. ## Test plan - [ ] CI green (the `tla-check` workflow does NOT fire — this PR is Go-only) - [ ] The new `tla-spec-ai-review` workflow (PR #857) auto-pings Claude + Codex for spec-divergence review - [ ] Reviewer cross-checks the Observe call against HLC-4 (ii) strategy (c) in the design doc and `BecomeLeader_HLC` in `tla/hlc/HLC.tla` ## Out of scope - HLC-4 precondition (iii) ceiling fence (`HLC.Next()` fail-closed) — separate follow-up PR. - Operator-visible metrics for the fence behaviour — lands with (iii).

… tla-check (#858) ## Summary M2 of the [TLA+ safety spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) — the OCC layer encoded as a Percolator-style 2PC model. Follows PR #856 (M1 HLC spec). ## What lands - `tla/occ/OCC.tla` — transaction lifecycle `Idle → Active → Prepared → Committed/Aborted`, lock map `(key, lock_ts) → start_ts`, `LockResolve` action that drains abandoned locks. - `tla/occ/MCOCC.tla` — TLC model module with txn + key symmetry. - `tla/occ/MCOCC.cfg` — safe config (`EnableSafety = TRUE`; expected PASS). - `tla/occ/MCOCC_gap.cfg` — Commit's OCC-1 guard removed (expected FAIL on `OCC1_CommitTsAboveStart`). - `scripts/tla-check.sh` — module-aware orchestrator. Adding M3/M4/M5 needs only a new entry in `TLA_MODULES` + a one-line `case` for the gap-invariant string. - `Makefile tla-check` delegates to the script. - `.gitignore` TLA artefact glob widened from `MCHLC_TTrace_*` to `MC*_TTrace_*` (M2 + M3..M5 covered). - `tla/README.md`: M2 status `Not started` → `Landed`; OCC module description block + invariant table. ## Invariants encoded | # | Statement | Form | |---|---|---| | OCC-1 | `commit_ts(T) > start_ts(T)` for every committed txn | INVARIANT | | OCC-2 | Two committed txns sharing a write key have distinct `commit_ts` and one started after the other committed | INVARIANT | | OCC-3 | Snapshot-read stability under the lock encoding (reads blocked by conflicting uncommitted locks; LockResolve drains them) | INVARIANT | | OCC-4 | No stranded lock at quiescence (bounded-safety form) | INVARIANT | | OCC-5 | `start_ts(T) == read_ts(T)` for every operation | INVARIANT | | OCC5_Action | `start_ts[t]` is assigned once at BeginTxn, never updated | PROPERTY | | CommitTsAssignedOnce | `commit_ts[t]` is assigned once at Commit, never updated | PROPERTY | OCC-L1 (liveness) is deferred to M6 per §8.1. ## Verification (local) ```text $ make tla-check ... TLC: tla/hlc/MCHLC.cfg (correct design, expected PASS) ... 3594 distinct states found, 0 left on queue ... TLC: tla/hlc/MCHLC_gap.cfg (...expected FAIL on HLC4...) OK: MCHLC_gap.cfg failed as designed ... TLC: tla/occ/MCOCC.cfg (correct design, expected PASS) ... 4217 distinct states found, 0 left on queue ... TLC: tla/occ/MCOCC_gap.cfg (...expected FAIL on OCC1...) OK: MCOCC_gap.cfg failed as designed (Invariant OCC1_CommitTsAboveStart is violated) tla-check: all model-check outcomes match the design contract. ``` ## Modelling notes - **HLC abstraction.** OCC.tla uses a single global monotonic `tsCounter`, not the full 48/16 HLC. M5 (composed) will INSTANCE `HLC.tla` for the real layout. This keeps the M2 state space manageable. - **`Commit` borrows ts.** `commit_ts = tsCounter` (does NOT advance). Only `Tick` advances ts, so the bounded model stays inside `0..MaxTs`. OCC-1's `commit_ts > start_ts` then requires a Tick between BeginTxn and Commit. - **OCC-2 non-strict serialization.** A txn with `start_ts = T` sees every version with `commit_ts ≤ T`, so `start_ts = commit_ts(earlier)` is a valid "later runs in the world after earlier committed" boundary (`<=`, not `<`). - **OCC-3 lock encoding** is enforced as an `ENABLED` guard on `ReadKey` — TLC simply does not explore reads in a state where a conflicting lock exists, mirroring the spec's "OCC-3 is asserted only in states where no uncommitted lock at lock_ts ≤ read_ts exists". ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — n/a, spec-only PR, no persistence paths. 2. **Concurrency / distributed failures** — the spec is exactly modelling these. Safe config (4,217 states) explored without invariant violation; gap config produces the documented counterexample. 3. **Performance** — `make tla-check` end-to-end runs in well under 5 seconds on a developer laptop at default bounds. 4. **Data consistency** — OCC-1..OCC-5 + the two PROPERTY transitions are the M2 contract from the design doc. The non-strict OCC-2 form is justified in the spec comment. 5. **Test coverage** — no Go tests added (spec-only). `make tla-check` is the new coverage layer. ## Test plan - [ ] `tla-check` CI workflow runs green (it watches `tla/**`) - [ ] `tla-spec-ai-review` does NOT fire — this PR touches no Go anchor (correct: spec-only changes don't need an AI impl-divergence review) - [ ] Reviewer cross-checks OCC-1..OCC-5 invariants against §5.2 of the design doc - [ ] Reviewer runs `make tla-check` and confirms 4 outcomes (HLC safe pass, HLC gap fail, OCC safe pass, OCC gap fail) ## Out of scope - OCC-L1 (liveness) — deferred to M6. - M3 (MVCC), M4 (Routes), M5 (Composed) — separate future PRs. - Integrating OCC with the real HLC.tla — happens in M5.  ## Summary by CodeRabbit ## Release Notes * **New Features** * Added OCC (Optimistic Concurrency Control) specification module with two-phase commit lifecycle, five core safety invariants, and automated model-checking verification suite * **Documentation** * Updated project README with detailed OCC module documentation and marked the OCC milestone as completed * **Chores** * Refactored verification build infrastructure with improved scripts and updated trace-file ignore patterns  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/858?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

…#865) ## Summary M5 of the [TLA+ safety spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) — the cross-module safety properties, capping the spec-only roadmap (M1..M5). M1..M4 each modelled a single subsystem in isolation. M5 captures the **seams** between them — the two safety properties that no single-module spec can express. ## What lands - `tla/composed/Composed.tla` — recreates the minimal cross-module state needed for Composed-1 (catalog history `routes[v]`, txn observed catalog version `txnObservedVer[t]`, committing group `txnCommitGroup[t]`) and Composed-3 (monotonic `tsCounter` advancing on every Commit). **Does NOT `INSTANCE` M1..M4** — that would explode the state space well past M5's `<10 min at default bounds` target from §8.1. - `tla/composed/MCComposed.tla` — TLC model with key + group + txn symmetry. - `tla/composed/MCComposed.cfg` / `MCComposed_gap.cfg` — safe (PASS) + gap (FAIL on Composed-1). - `scripts/tla-check.sh` — `TLA_MODULES` gains `"composed"`; `gap_invariant_for` + `mc_basename` gain the corresponding cases. - `tla/README.md` — M5 status `Not started` → `Landed`; Composed module description + invariant table. ## Invariants | # | Statement | Form | |---|---|---| | Composed-1 | Every committed write key was owned by the committing group at the txn's observed catalog version | INVARIANT | | Composed-2 | Vacuously TRUE in this abstraction — `SplitRange` is same-group only per CLAUDE.md | INVARIANT | | Composed-3 | Distinct committed txns have distinct `commit_ts` | INVARIANT | | `Composed_CatalogMonotonic` | `catalogVersion` weakly increases on every step | PROPERTY | | `Composed_TsMonotonic` | `tsCounter` weakly increases on every step | PROPERTY | | `Composed3_TsAction` | Every Commit strictly raises `tsCounter` | PROPERTY | ## Why no `INSTANCE`? Each M1..M4 module tightly bounds its own state for tractable TLC. Bringing all four `INSTANCE`s into one product spec would multiply the state spaces — even a 100x growth lands well past M5's `<10 min at default bounds` target. M5 instead recreates the minimal cross-module state needed to express the two seam invariants. The integration claim is that this projection preserves the invariants the full product would assert; reviewers can sanity-check each Composed action against the corresponding M1..M4 action (e.g., `ProposeRouteChange` mirrors `Routes.tla`, `BeginTxn` + `Commit` mirror `OCC.tla` with a cross-module observed-catalog-version pin). ## Verification (local) ```text $ make tla-check HLC safe 3,594 distinct states + HLC-4 gap ce OCC safe 150 distinct states + OCC-1 gap ce MVCC safe 79 distinct states + MVCC-4 gap ce Routes safe 29 distinct states + Routes-4 gap ce Composed safe 1,684 distinct states + Composed-1 gap ce tla-check: all model-check outcomes match the design contract. ``` The Composed-1 counterexample is the canonical 4-step regression: 1. `BeginTxn(t)` — `txn` observes `catalogVersion = 0`, where `routes[0][k1] = g1`. 2. `WriteIntent(t, k1)`. 3. `ProposeRouteChange(k1, g2)` — `catalogVersion → 1`, `routes[1][k1] = g2`. 4. `Commit(t, g2)` (gap mode) — committing group is `g2`, but `routes[0][k1] = g1 ≠ g2`. **Composed-1 fails.** ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — out of scope (cross-module invariants check group ownership, not durability). 2. **Concurrency** — Composed-1 is exactly the concurrency seam between OCC and Routes; the gap counterexample is one such interleaving. 3. **Performance** — `make tla-check` end-to-end runs in ~10 s on a dev laptop, well under the 10-min target. 4. **Data consistency** — Composed-1 (cross-module write/route consistency) + Composed-3 (cross-txn ts uniqueness) + 3 PROPERTY transitions. 5. **Test coverage** — no Go tests added (spec-only). `make tla-check` is the coverage layer. ## Roadmap status (post-merge) | Milestone | Status | |---|---| | M1 HLC | Landed (PR #856) | | M2 OCC | Landed (PR #858) | | M3 MVCC | Landed (PR #861) | | M4 Routes | Landed (PR #861 via stacked #862) | | **M5 Composed** | **This PR** | | M6 liveness (OPTIONAL) | Not started — can follow as a separate PR | | M1 Go follow-up (ceiling fence) | Not started — separate Go PR with caller audit | ## Test plan - [ ] `tla-check` CI runs green (watches `tla/**`) - [ ] `tla-spec-ai-review` does NOT fire — spec-only PR - [ ] Reviewer cross-checks Composed-1 / Composed-3 against §5.5 of the design doc - [ ] Reviewer runs `make tla-check` and confirms 10 outcomes (5 safe pass + 5 gap fail) - [ ] Reviewer sanity-checks the `INSTANCE`-free projection claim — each Composed action mirrors the corresponding M1..M4 action ## Out of scope - M6 liveness (OCC-L1, Routes-L1) — separate PR if needed - HLC-4 (iii) ceiling fence Go implementation — separate code PR - Cross-group `SplitRange` in the real implementation (currently same-group only)

…ini medium on PR #878) Closes a silent-false-positive class in the gap-config loop introduced by the multi-gap refactor in this PR's parent commit. The loop does: while IFS='|' read -r gap_stem gap_inv; do ... if printf '%s\n' "$gap_out" | grep -qF "$gap_inv"; then OK fi done If a gap_configs_for() entry is malformed (missing the '|' delimiter, e.g. a typo dropped the invariant string), `read` puts the entire line in gap_stem and leaves gap_inv empty. Then `grep -qF ""` matches every line of TLC output unconditionally, so the gap config 'passes' even when TLC failed for an unrelated reason — exactly the silent-false-positive class codex P2 on PR #856 round 2 introduced the gap-invariant-string match to prevent. Fix: validate both fields are non-empty before the body runs. A trailing blank line (the existing skip case) is still skipped; a malformed entry now produces a clear ERROR with the stem / inv it read so the maintainer can fix gap_configs_for() directly. Verified: `make tla-check` still produces the 11 expected outcomes (5 safe PASS + 5 gap FAIL + 1 currentfence gap FAIL on Composed-1a) after the guard.

bootjp added 2 commits May 29, 2026 02:02

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 28, 2026

View reviewed changes

bootjp merged commit 32ef69d into main May 28, 2026
10 checks passed

bootjp deleted the feat/tla-m1-hlc-spec branch May 28, 2026 18:08

bootjp mentioned this pull request May 28, 2026

ci(tla): auto-request Claude + Codex review on anchored-file changes #857

Merged

3 tasks

This was referenced May 28, 2026

tla: M2 — OCC.tla (OCC-1..OCC-5 + 2 action properties) + module-aware tla-check #858

Merged

kv(hlc): observe applied commit_ts into HLC.last — HLC-4 strategy (c) #859

Merged

bootjp mentioned this pull request May 29, 2026

tla: M5 — Composed.tla (Composed-1 + Composed-3) wired into tla-check #865

Merged

5 tasks

		/\ \A m \in Nodes : (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs
		/\ wallNow[m] - (wallNow[n] + 1) <= MaxClockSkewMs

		/\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
		/\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

Conversation

bootjp commented May 28, 2026

Summary

Verification (local)

Self-review (per CLAUDE.md five-lens checklist)

Out of scope

Test plan

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 28, 2026

Uh oh!

claude Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of Round 2 (791c65b)

Round-1 fixes: all correct ✓

New findings

1. Stale proposed_ path in TLA module headers and Makefile — medium

2. README invariant table names HLC-2 incorrectly — low

3. TickWall is unreachable with MaxClockSkewMs = 0 — informational

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 28, 2026

Uh oh!

claude Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of Round 3 (1123a0e)

Round-3 fixes: both correct ✓

Outstanding from round-2 Claude review (still not addressed)

1. Stale proposed_ path in four files — medium

2. README invariant table names HLC2_CeilingMonotonic — low

Summary

Uh oh!

chatgpt-codex-connector Bot commented May 28, 2026

Uh oh!

bootjp commented May 28, 2026

Uh oh!

claude Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of Round 3 (9901e6c)

Round-2 fixes: both correct ✓

Spec correctness pass — no issues found

Summary

Uh oh!

bootjp commented May 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

coderabbitai Bot commented May 28, 2026 •

edited

Loading

claude Bot commented May 28, 2026 •

edited

Loading

Review of Round 2 (`791c65b`)

1. Stale `proposed_` path in TLA module headers and Makefile — medium

3. `TickWall` is unreachable with `MaxClockSkewMs = 0` — informational

claude Bot commented May 28, 2026 •

edited

Loading

Review of Round 3 (`1123a0e`)

1. Stale `proposed_` path in four files — medium

2. README invariant table names `HLC2_CeilingMonotonic` — low

claude Bot commented May 28, 2026 •

edited

Loading

Review of Round 3 (`9901e6c`)

claude Bot commented May 28, 2026 •

edited

Loading

Review of Round 4 (`2a2238a`)