Skip to content

tla: M1 — HLC.tla + lib + make tla-check (HLC-1..4 invariants + gap counterexample)#856

Merged
bootjp merged 7 commits into
mainfrom
feat/tla-m1-hlc-spec
May 28, 2026
Merged

tla: M1 — HLC.tla + lib + make tla-check (HLC-1..4 invariants + gap counterexample)#856
bootjp merged 7 commits into
mainfrom
feat/tla-m1-hlc-spec

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 28, 2026

Summary

M1 of the TLA+ safety spec proposal lands:

  • tla/lib/Env.tla + tla/lib/Raft.tla (shared scaffolding).
  • tla/hlc/HLC.tla with invariants HLC-1..HLC-4 + TypeOK.
    • Precondition (i) bounded clock skew encoded as a TLA+ ASSUME in Env.tla.
    • Precondition (ii) logical-counter handoff — strategy (c) — encoded as a state update inside BecomeLeader_HLC (hlc.Observe(maxAppliedHLC) analogue).
    • Precondition (iii) ceiling fence encoded as an enabling action guard on IssueTimestamp (wallNow[n] < physicalCeiling[n]).
  • tla/hlc/MCHLC.tla + MCHLC.cfg (correct design — TLC passes) and MCHLC_gap.cfg (preconditions disabled — TLC produces the HLC-4 counterexample the design doc anticipated).
  • make tla-check Makefile target: downloads checksum-pinned tla2tools.jar v1.8.0 (SHA-256 237332bd…) to .cache/tla/ and runs both configurations, inverting the exit code on the gap config so the expected-FAIL counts as PASS for CI.
  • tla/README.md covering install / run / what each module proves / reading list / how to map a TLC counterexample back to the implementation, per §7.1 of the design doc.
  • Design doc renamed proposed_partial_ (per CLAUDE.md lifecycle), header Status: updated, in a dedicated commit.

Verification (local)

$ make tla-check
================================================================
  TLC: tla/hlc/MCHLC.cfg  (correct design, expected PASS)
================================================================
... Model checking completed. No error has been found.
... 61 states generated, 28 distinct states found, 0 states left on queue.

================================================================
  TLC: tla/hlc/MCHLC_gap.cfg  (no preconditions, expected FAIL)
================================================================
... Error: Invariant HLC4_NoRegressionAcrossTerms is violated.
... 44 states generated, 28 distinct states found, 14 states left on queue.
... depth 5

OK: MCHLC_gap.cfg failed as designed (HLC-4 counterexample).

tla-check: all model-check outcomes match the design contract.

The gap-config counterexample (depth 5):

  1. BecomeLeader(n1) — n1 elected term 1
  2. IssueTimestamp(n1) — commits (wall=0, logical=0) in term 1
  3. IssueTimestamp(n1) — commits (0, 1) in term 1 (logical increments)
  4. BecomeLeader_HLC(n2) — n2 elected term 2; EnableSafety = FALSE so strategy (c) does NOT raise n2's hlcLast to maxAppliedHLC = (0, 1)
  5. IssueTimestamp(n2) — commits (0, 1) in term 2 (same as term 1's last commit; HLC-4 requires strict >)

This is exactly the bug the design doc §5.1 HLC-4 (ii) said the spec would surface.

Self-review (per CLAUDE.md five-lens checklist)

  1. Data loss — n/a: spec-only PR, no persistence paths touched.
  2. Concurrency / distributed failures — the spec itself models the concurrency bugs (HLC-4 cross-leader regression); both make tla-check configurations explore the full bounded state space.
  3. Performancemake tla-check runs in <1s on a developer laptop at default bounds.
  4. Data consistencyMCHLC.cfg proves HLC-1..HLC-4 hold under preconditions (i)/(ii)/(iii); MCHLC_gap.cfg proves the preconditions are necessary by surfacing the documented counterexample.
  5. Test coverage — Go code untouched, no Go tests added. make tla-check is the new coverage layer; running it is the test.

Out of scope

This PR is spec-only. The Go code changes that strategy (c) and the ceiling fence require (per §5.1 HLC-4 (ii) and (iii)) are not in this PR — they need:

  • Add MaxAppliedHLC() to kv/fsm.go (running max across applied write entries).
  • Call hlc.Observe(fsm.MaxAppliedHLC()) on new-term detection in kv/sharded_coordinator.go RunHLCLeaseRenewal, before any persistence Next() is served.
  • Add wallNow < physicalCeiling fence in kv/hlc.go Next() with a fail-closed return path and operator-visible metrics — caller audits of every site that calls Next() will be required (per the loop-feedback discipline) since Next() changes from infallible to error-returning.

Tracked as a follow-up issue; this PR lands the design contract that the follow-up must satisfy.

Test plan

  • CI runs make tla-check (CI integration is deferred to a future PR per §7.3 of the design doc — local invocation is the contract for now)
  • Reviewer runs make tla-check locally and confirms the safe pass + gap-config counterexample
  • Reviewer cross-checks each invariant in HLC.tla against the corresponding entry in §5.1 of the design doc
  • Reviewer cross-checks the implementation anchors in tla/README.md "How to interpret a TLC failure" against the current Go source

bootjp added 2 commits May 29, 2026 02:02
Per docs/design/2026_05_28_proposed_tla_safety_spec.md §8.1 (M1
deliverables): adds the TLA+ specification suite directory layout and
the HLC module, runnable via `make tla-check`.

Files:
- tla/lib/Env.tla — shared constants and the HLC-4 precondition (i)
  ASSUME (MaxClockSkewMs < HlcPhysicalWindowMs).
- tla/lib/Raft.tla — abstract Raft interface (currentTerm, leaderOf,
  BecomeLeader, IsLeader). Log matching and leader completeness are
  treated as assumed properties per NG4.
- tla/hlc/HLC.tla — HLC layer with all five invariants:
  - HLC-1 (per-node monotonic), HLC-2 (ceiling monotonic), HLC-3
    (leader-only issuance), HLC-4 (no regression across leader change),
    TypeOK. Strategy (c) handoff lives in BecomeLeader_HLC; the ceiling
    fence is an enabling guard on IssueTimestamp. Both are gated by
    CONSTANT EnableSafety so the same spec drives the correct-design
    and gap configurations.
- tla/hlc/MCHLC.tla — TLC model-check instance.
- tla/hlc/MCHLC.cfg — EnableSafety = TRUE (expected to PASS all
  invariants at default bounds: 3 nodes / 2 terms / 3 ops).
- tla/hlc/MCHLC_gap.cfg — EnableSafety = FALSE (expected to FAIL on
  HLC-4 with a counterexample, demonstrating that strategy (c) +
  ceiling fence are necessary).
- tla/README.md — install, run, what each module proves, reading list,
  how to map a TLC counterexample back to the implementation.

Tooling:
- Makefile: tla-check target downloads tla2tools.jar v1.8.0 to
  .cache/tla/ with checksum pinning (SHA-256
  237332bdcc79a35c7d26efa7b82c77c85c2744591c5598673a8a45085ff2a4fb),
  runs both configurations, and inverts the exit code on the gap
  configuration so the expected-FAIL counts as PASS for CI.
- .gitignore: TLC counterexample-trace artefacts
  (MCHLC_TTrace_*.{tla,bin}) and the per-run states/ directory.

Verified locally:
- make tla-check passes end-to-end.
- MCHLC.cfg: 28 distinct states, 0 invariant violations.
- MCHLC_gap.cfg: HLC-4 counterexample at depth 5 (n1 commits (0,1) in
  term 1; n2 elected term 2 without strategy (c), commits (0,1) again,
  tying the prior commit — HLC-4 requires strictly greater).

Follow-up (separate PRs):
- Rename the design doc proposed -> partial (next commit in this PR).
- Implement strategy (c) and the ceiling fence in Go:
  - Add MaxAppliedHLC() to kv/fsm.go.
  - Call hlc.Observe(fsm.MaxAppliedHLC()) on new-term detection in
    kv/sharded_coordinator.go RunHLCLeaseRenewal.
  - Add wallNow < physicalCeiling fence to HLC.Next() with a
    fail-closed return path, plus operator-visible metrics.
M1 (HLC.tla + lib + make tla-check) lands in the prior commit on this
branch, satisfying the first row of §8.1.  Per CLAUDE.md docs/design/
lifecycle, rename proposed -> partial and update the header / Status
note.  Filename propose date and slug are preserved; only the status
token changed.

M2..M5 remain open; M6 (liveness) remains optional.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 8 minutes and 27 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c2f771a7-8716-475d-8e51-673ef52a21b4

📥 Commits

Reviewing files that changed from the base of the PR and between fc8e4bd and 2a2238a.

📒 Files selected for processing (11)
  • .github/workflows/tla-check.yml
  • .gitignore
  • Makefile
  • docs/design/2026_05_28_partial_tla_safety_spec.md
  • tla/README.md
  • tla/hlc/HLC.tla
  • tla/hlc/MCHLC.cfg
  • tla/hlc/MCHLC.tla
  • tla/hlc/MCHLC_gap.cfg
  • tla/lib/Env.tla
  • tla/lib/Raft.tla
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tla-m1-hlc-spec

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the TLA+ safety specification for the Hybrid Logical Clock (HLC) as part of Milestone 1, including the core specification files, model-checking configurations, and a Makefile target to automate TLC verification. The review feedback highlights two key improvements: rewriting subtraction-based inequalities in the TLA+ specification to addition to prevent evaluation errors when clock skew is enabled, and updating the Makefile to support both sha256sum and shasum for better portability across different environments.

Comment thread tla/hlc/HLC.tla Outdated
Comment on lines +96 to +97
/\ \A m \in Nodes : (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs
/\ wallNow[m] - (wallNow[n] + 1) <= MaxClockSkewMs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In TLA+, subtraction on natural numbers (Naturals) is only defined when the left operand is greater than or equal to the right operand. If MaxClockSkewMs is configured to be greater than 0 (which is the core scenario for verifying clock skew safety), a state can be reached where wallNow[m] > wallNow[n] + 1. In this state, evaluating (wallNow[n] + 1) - wallNow[m] will result in a negative number, causing TLC to halt with an evaluation error.

To prevent this, we can rewrite the inequalities using addition instead of subtraction, which is always safe and closed under Naturals.

    /\ \\A m \\in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
                     /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

Comment thread Makefile Outdated
Comment on lines +36 to +43
@actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \
if [ "$$actual" != "$(TLA_SHA256)" ]; then \
echo "ERROR: tla2tools.jar SHA-256 mismatch."; \
echo " expected: $(TLA_SHA256)"; \
echo " actual: $$actual"; \
rm -f $(TLA_JAR).tmp; \
exit 1; \
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The shasum command is not installed by default on many Linux distributions (such as Ubuntu/Debian), whereas sha256sum is part of GNU coreutils and is universally available. To ensure the Makefile is portable and robust across both macOS and Linux environments (especially in minimal CI runners), we should check for the availability of both commands and use whichever is present.

	@if command -v sha256sum >/dev/null 2>&1; then \
		actual=$$(sha256sum $(TLA_JAR).tmp | awk '{print $$1}'); \
	elif command -v shasum >/dev/null 2>&1; then \
		actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \
	else \
		echo "ERROR: neither sha256sum nor shasum found."; \
		rm -f $(TLA_JAR).tmp; \
		exit 1; \
	fi; \
	if [ "$$actual" != "$(TLA_SHA256)" ]; then \
		echo "ERROR: tla2tools.jar SHA-256 mismatch."; \
		echo "  expected: $(TLA_SHA256)"; \
		echo "  actual:   $$actual"; \
		rm -f $(TLA_JAR).tmp; \
		exit 1; \
	fi

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66b515392e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tla/hlc/HLC.tla Outdated
Comment on lines +178 to +180
HLC1_PerNodeMonotonic ==
\A r1, r2 \in committedTS :
r1.node = r2.node => HLCLE(r1.ts, r2.ts) \/ HLCLE(r2.ts, r1.ts)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Track issuance order for HLC-1

In this spec, committedTS is a set and HLCLE is a total lexicographic order, so this invariant is true for any two timestamps from the same node regardless of which one was issued first. For example, if a later change allowed a node to commit (5,0) and then regress to (4,0), the reversed disjunct HLCLE(r2.ts, r1.ts) would still make HLC-1 pass. Since the design doc says HLC-1 proves monotonicity across Next/Observe, the model needs to record issuance order/history or assert the monotonicity in the actions rather than just comparability.

Useful? React with 👍 / 👎.

Comment thread tla/hlc/HLC.tla Outdated
\* with a max(), and there is no action that lowers physicalCeiling. The
\* invariant is therefore a stuttering-step consequence, asserted here as
\* a sanity check rather than a behavioural property.
HLC2_CeilingMonotonic == \A n \in Nodes : physicalCeiling[n] >= 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Encode actual ceiling monotonicity

This invariant only checks that each ceiling is a natural number, which is already implied by TypeOK, so TLC would still pass if a future action lowered physicalCeiling[n] to any smaller non-negative value or made nodes diverge after applying the same ceiling entry. Because HLC-2 is meant to prove per-node monotonicity and same-entry equality, this needs history/action-level checking (or previous values in the state) rather than >= 0.

Useful? React with 👍 / 👎.

…rtability

Four findings from PR #856 review on commit 66b5153:

- HLC.tla TickWall skew check: rewrite subtraction as addition.  Nat
  subtraction in TLA+ underflows when LHS < RHS, which would halt TLC
  the moment MaxClockSkewMs > 0 enabled an out-of-order TickWall.
  Inequalities equivalently expressed with `+` stay inside Nat.
  (gemini HIGH on PR #856).

- HLC1_PerNodeMonotonic now uses the per-commit `seq` ordinal (the
  global opCount value at issuance time) so the invariant compares
  same-node commits in their actual issuance order:
  `r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts)`.
  The earlier `HLCLE(r1, r2) \/ HLCLE(r2, r1)` form was vacuously
  true on a totally ordered set and would not have caught a future
  regression (codex P2 on PR #856).

- HLC1_Action + HLC2_Action: new action-level safety properties
  registered under PROPERTIES (not INVARIANTS) in both .cfg files —
  TLC checks `[][P]_vars` as a transition property without requiring
  fairness.  These pin the contract that hlcLast and physicalCeiling
  weakly increase across every step.  HLC2_CeilingMonotonic's old
  `>= 0` form was vacuously implied by TypeOK (codex P2 on PR #856)
  and is replaced by HLC2_NonNeg (the residual TypeOK content) plus
  HLC2_Action (the real monotonicity claim).

- Makefile sha256 portability: prefer GNU sha256sum (Linux universal),
  fall back to shasum -a 256 (macOS default), error out if neither is
  available.  Without this the target failed on minimal Linux CI
  images (gemini MEDIUM on PR #856).

- committedTS records now carry `seq` (= opCount at issuance), and
  TypeOK is extended to require it.

Verified: `make tla-check` still PASSES on MCHLC.cfg (now also
exercising the two action properties) and FAILS on MCHLC_gap.cfg as
designed, with the HLC-4 counterexample at depth 5 unchanged.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 28, 2026

Addressed PR #856 round 1 review in 791c65b3. Four findings:

  1. gemini HIGH (HLC.tla line 97) — Nat subtraction underflow. Rewrote the TickWall skew check using addition instead of subtraction. The old form (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs would underflow in TLA+ Naturals the moment MaxClockSkewMs > 0 admitted an out-of-order tick, halting TLC. The equivalent additive form stays inside Nat.

  2. gemini MEDIUM (Makefile) — sha256sum portability. Prefer GNU sha256sum (universal on Linux), fall back to macOS shasum -a 256, error out if neither is available. Without this make tla-check failed on minimal Linux CI images.

  3. codex P2 (HLC.tla line 180) — HLC-1 was vacuously true. Old form HLCLE(r1, r2) \/ HLCLE(r2, r1) is just totality on a totally ordered set, so a regression (5, 0)(4, 0) would have passed. Added a per-commit seq ordinal (= global opCount at issuance time) and changed the invariant to r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts). Also added the transition-level HLC1_Action property to pin "hlcLast weakly increases on every step" so future actions cannot silently regress.

  4. codex P2 (HLC.tla line 186) — HLC-2 was vacuously implied by TypeOK. Old form physicalCeiling[n] >= 0 was just type non-negativity. Replaced with HLC2_NonNeg (the residual TypeOK content; harmless) plus the real HLC2_Action == [][\A n: physicalCeiling[n] <= physicalCeiling'[n]]_vars transition property registered under PROPERTIES.

The TLA+ "same set of ceiling entries → equal across nodes" half of HLC-2 is structurally satisfied in M1 because ApplyCeiling propagates atomically (per the action comment); per-node async propagation is an M5 concern when route-catalog/HLC interactions cross multiple groups.

Verification. make tla-check still PASSES on MCHLC.cfg (now also exercising the two action properties) and FAILS on MCHLC_gap.cfg as designed, with the HLC-4 counterexample at depth 5 unchanged. No Go code semantics changed — spec-only PR, no caller audit required.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Claude finished @bootjp's task in 5m 31s —— View job


Review of Round 2 (791c65b)

Todo list:

  • Load tools and start review
  • Read HLC.tla — verify Nat subtraction fix, HLC-1 fix (seq ordinal), HLC-2 fix (action property)
  • Read Env.tla — verify skew check rewrite
  • Read Raft.tla — general review
  • Read MCHLC.tla / .cfg / gap.cfg — verify action properties are registered
  • Read Makefile — verify sha256sum portability fix
  • Read tla/README.md and design doc — consistency check
  • Synthesize and post findings

Round-1 fixes: all correct ✓

Fix 1 — Nat subtraction underflow (TickWall HLC.tla:100-101)
The rewrite from (wallNow[n]+1) - wallNow[m] <= MaxClockSkewMs to:

/\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
                    /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

is semantically equivalent and stays inside Nat. Confirmed correct.

Fix 2 — sha256sum portability (Makefile:39-54)
The if sha256sum...elif shasum...else error fallback chain is correct Make syntax (single @if block with \ continuations, proper $$ escaping). Works on both Linux (sha256sum) and macOS (shasum -a 256).

Fix 3 — HLC-1 was vacuous (HLC.tla:193-195)
Adding seq = opCount at issuance and changing the invariant to r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts) is correct. Since only the active leader can call IssueTimestamp, the global opCount faithfully captures per-node temporal ordering. Combined with HLC1_Action (PROPERTIES), this pins both the per-commit set invariant and the per-step transition. Confirmed non-vacuous.

Fix 4 — HLC-2 was vacuous (HLC.tla:203-221)
HLC2_NonNeg is a residual TypeOK subsumption (harmless, noted in the comment). The load-bearing HLC2_Action == [][\A n \in Nodes : physicalCeiling[n] <= physicalCeiling'[n]]_vars is the real monotonicity claim, registered under PROPERTIES in both .cfg files. Confirmed the gap config correctly keeps both action properties (they don't fail in the gap scenario — only HLC4 does).


New findings

1. Stale proposed_ path in TLA module headers and Makefile — medium

Four files still reference docs/design/2026_05_28_proposed_tla_safety_spec.md after the rename:

File Line
tla/hlc/HLC.tla 4
tla/lib/Env.tla 4
tla/lib/Raft.tla 4
Makefile 17

tla/README.md already has the correct partial_ path. The referenced file no longer exists under the old name — the links are broken.

Fix this →

2. README invariant table names HLC-2 incorrectly — low

tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator does not exist in HLC.tla. The actual operators are HLC2_NonNeg (INVARIANTS) and HLC2_Action (PROPERTIES).

Fix this →

3. TickWall is unreachable with MaxClockSkewMs = 0 — informational

The skew guard \A m \in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs includes m = n. For m = n, condition 1 reduces to wallNow[n] + 1 <= wallNow[n] + 0 = 1 <= 0, which is always false. So with MaxClockSkewMs = 0 (both current configs), TickWall is never enabled, all wallNow[n] stay at 0 throughout the model check, and MaxWallTime = 3 is never exercised.

Physical-time advancement is modeled exclusively via ApplyCeilingphysicalCeiling. The invariants still hold under this restriction, so the M1 model check is not unsound — but the wall-clock ticking path is untested. A future config with MaxClockSkewMs >= 1 (and the corresponding tighter ASSUME MaxClockSkewMs < HlcPhysicalWindowMs) would exercise TickWall.

No fix needed for M1; worth tracking as a follow-up config for deeper clock-skew exploration.


Summary

The round-1 fixes are all correctly implemented. The spec is sound for M1. Two doc-consistency issues (stale proposed_ paths, wrong invariant name in README) should be fixed before merging. The TickWall/MaxClockSkewMs=0 observation is informational and out of scope for this PR.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 791c65b31b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tla/hlc/HLC.tla Outdated
Comment on lines +100 to +101
/\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
/\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude self from skew check

With the checked configs setting MaxClockSkewMs = 0, this universal quantifier also evaluates the m = n case, which requires wallNow[n] + 1 <= wallNow[n] + 0 before any tick can occur. That is impossible, so TickWall is never enabled and the model never explores wall-clock advancement, ceiling expiry, or skew schedules despite the MaxWallTime bounds. Exclude the ticking node from this comparison or compare it against its updated self value.

Useful? React with 👍 / 👎.

Comment thread Makefile Outdated
Comment on lines +79 to +81
else \
echo; \
echo "OK: MCHLC_gap.cfg failed as designed (HLC-4 counterexample)."; \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate the gap failure reason

In this tla-check target, every non-zero exit from the gap TLC run is accepted as the expected HLC-4 counterexample. If MCHLC_gap.cfg starts failing because of a parse/type error, a different invariant, a deadlock, or a JVM/tooling crash, CI will still print OK and pass, masking regressions in the newly added model-check harness. Check the TLC output for the specific HLC4_NoRegressionAcrossTerms invariant violation before treating the failure as success.

Useful? React with 👍 / 👎.

…reason

Two P2 findings from codex on commit 791c65b.

- HLC.tla TickWall: exclude self from the inter-node skew check.  The
  `\A m \in Nodes:` quantifier included `m = n`, so under
  MaxClockSkewMs = 0 the constraint forced
  `wallNow[n] + 1 <= wallNow[n] + 0`, which is unsatisfiable.  TickWall
  was therefore never enabled and the model never explored wall-clock
  advancement, ceiling expiry, or skew schedules — the safe config was
  passing on a degenerate 28-state subspace.  Restrict the universal to
  `Nodes \ {n}` so self-comparison is excluded.  (codex PR #856 round 2.)

- Raise MaxClockSkewMs from 0 to 1 (and HlcPhysicalWindowMs from 1 to 2
  so the ASSUME `MaxClockSkewMs < HlcPhysicalWindowMs` still holds) in
  both MCHLC.cfg and MCHLC_gap.cfg so TickWall is actually enabled and
  the skew dimension is part of the explored space.  After this change
  the safe config explores 3,594 distinct states (vs 28 previously) and
  the gap config still surfaces the HLC-4 counterexample at depth 5.

- Makefile tla-check: validate the gap-run failure reason instead of
  blindly accepting any non-zero TLC exit.  Capture TLC stdout/stderr,
  print it, then grep for the exact string
  `Invariant HLC4_NoRegressionAcrossTerms is violated` before declaring
  the gap evidence intact.  Without this check a parse error, deadlock,
  JVM crash, or different invariant violation would have silently
  counted as "expected gap counterexample" and masked regressions in
  the harness itself.  (codex PR #856 round 2.)
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 28, 2026

PR #856 round 2 review (codex P2 ×2 on 791c65b3) addressed in 1123a0e1:

  1. codex P2 (HLC.tla line 101) — TickWall self-skew block. The \A m \in Nodes: quantifier included m = n, so under MaxClockSkewMs = 0 the constraint became wallNow[n] + 1 <= wallNow[n] + 0 (unsatisfiable). TickWall was never enabled and the model never explored wall-clock advancement, ceiling expiry, or skew schedules — the safe config was passing on a degenerate 28-state subspace. Excluded self from the universal: \A m \in Nodes \ {n}: ….

Also raised MaxClockSkewMs from 0 to 1 and HlcPhysicalWindowMs from 1 to 2 (so the ASSUME MaxClockSkewMs < HlcPhysicalWindowMs still holds) in both .cfg files so the skew dimension is actually part of the explored space.

  1. codex P2 (Makefile line 81) — Validate gap failure reason. tla-check previously accepted any non-zero exit from the gap TLC run as the expected HLC-4 counterexample. Capture stdout/stderr, print it, then grep for Invariant HLC4_NoRegressionAcrossTerms is violated before declaring the gap evidence intact. A parse error, deadlock, JVM crash, or different invariant violation now correctly fails the target instead of masking a harness regression.

Verification. Safe config explores 3,594 distinct states (vs 28 previously) and completes without invariant violation. Gap config still surfaces the HLC-4 counterexample at depth 5; the Makefile correctly identifies it by string match.

Spec/Makefile-only change, no Go code touched. No caller audit required.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Claude finished @bootjp's task in 2m 9s —— View job


Review of Round 3 (1123a0e)

Todo list:

  • Load tools
  • Read HLC.tla — verify TickWall self-exclusion fix, MaxClockSkewMs=1 bump
  • Read Makefile — verify gap failure reason validation
  • Read MCHLC.cfg + MCHLC_gap.cfg — verify constant updates
  • Check round-2 Claude findings (stale proposed_ paths, wrong README invariant name)
  • Read tla/README.md and Env.tla, Raft.tla — consistency check
  • Synthesize and post findings

Round-3 fixes: both correct ✓

Fix 1 — TickWall self-exclusion (HLC.tla:107)

/\ \A m \in Nodes \ {n} :
    /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
    /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs

Nodes \ {n} is the correct fix. Under MaxClockSkewMs = 0, \A m \in {} : ... is vacuously true, so TickWall is now enabled up to MaxWallTime. With the new MaxClockSkewMs = 1, condition 1 reduces to wallNow[n] <= wallNow[m] for each peer — n can only tick forward if it's not already ahead of any other node. This is a symmetric skew model and correct for the design intent. The in-comment explanation at lines 100–106 is clear.

Both .cfg files now carry MaxClockSkewMs = 1 and HlcPhysicalWindowMs = 2, keeping the Env.tla ASSUME MaxClockSkewMs < HlcPhysicalWindowMs (1 < 2) intact. State space goes from 28 to 3,594 distinct states — the wall-clock and skew dimensions are now genuinely exercised. Confirmed correct.

Fix 2 — Gap failure reason validation (Makefile:69-102)

The implementation is correct and well-structured:

  1. Captures both stdout+stderr via 2>&1 into out=.
  2. Separates the "unexpected clean pass" check (exit code 0 → error) from the "wrong failure reason" check (grep miss → error).
  3. Uses grep -qF (fixed-string, not regex) — the right tool for matching a literal TLC error message.
  4. Prints the full TLC output before the checks so the failure reason is always visible in CI logs regardless of which path is taken.

A JVM crash, parse error, deadlock, or wrong invariant violation will now correctly fail make tla-check instead of silently masking itself as expected gap evidence. Confirmed correct.


Outstanding from round-2 Claude review (still not addressed)

These two were raised in round 2 and remain unfixed after round 3.

1. Stale proposed_ path in four files — medium

All four files still reference the pre-rename path docs/design/2026_05_28_proposed_tla_safety_spec.md, which was renamed to 2026_05_28_partial_tla_safety_spec.md in the M1 commit. The file no longer exists under the old name — the references are broken dead links.

File Line Current (broken) Should be
tla/hlc/HLC.tla 4 …proposed_tla_safety_spec.md §5.1 …partial_tla_safety_spec.md §5.1
tla/lib/Env.tla 4 …proposed_tla_safety_spec.md §6.1 …partial_tla_safety_spec.md §6.1
tla/lib/Raft.tla 4 …proposed_tla_safety_spec.md §3 / §6.1 …partial_tla_safety_spec.md §3 / §6.1
Makefile 17 …proposed_tla_safety_spec.md §7.2 …partial_tla_safety_spec.md §7.2

tla/README.md already uses the correct partial_ path.

Fix this →

2. README invariant table names HLC2_CeilingMonotonic — low

tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator name does not exist in HLC.tla. The correct names are HLC2_NonNeg (in INVARIANTS) and HLC2_Action (in PROPERTIES). A reviewer cross-checking the README against the spec file — as the PR test plan requests — will not find the stated name.

Fix this →


Summary

Both round-3 codex P2 fixes are correctly implemented. The spec is now sound and the make tla-check harness is robust. The two doc-consistency issues from round 2 (stale proposed_ paths in four module headers, wrong invariant name in the README table) remain open and should be fixed before merge — particularly finding 1, since the linked design doc is the primary reference a reviewer uses to cross-check the invariants.

Two doc-consistency findings from claude bot's round 2 review of
commit 791c65b (after the proposed -> partial rename in 66b5153
those references became broken links):

- Update 4 stale references to the old proposed_ doc path:
  - tla/hlc/HLC.tla line 4
  - tla/lib/Env.tla line 4
  - tla/lib/Raft.tla line 4
  - Makefile line 17
  All now point at docs/design/2026_05_28_partial_tla_safety_spec.md.

- tla/README.md "What each module proves" table referenced
  HLC2_CeilingMonotonic, which does not exist in HLC.tla after the
  round-1 rework.  Replace with the actual operators:
  HLC1_PerNodeMonotonic (state, with strict seq ordering),
  HLC1_Action (PROPERTY, transition form),
  HLC2_NonNeg (state, residual TypeOK content),
  HLC2_Action (PROPERTY, transition form).
  Also clarify HLC-1's state form is the strict-seq version, not the
  earlier vacuous total-order form.

Docs-only changes.  make tla-check still passes on MCHLC.cfg
(3,594 distinct states) and fails on MCHLC_gap.cfg as designed.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 28, 2026

Addressed claude bot's round 2 review (on 791c65b3) in 9901e6ca.

  1. MEDIUM (stale proposed_ path): Updated 4 broken references to the design doc path that became invalid after the proposed_partial_ rename in 66b51539:

    • tla/hlc/HLC.tla line 4
    • tla/lib/Env.tla line 4
    • tla/lib/Raft.tla line 4
    • Makefile line 17
  2. LOW (README invariant table): tla/README.md listed HLC2_CeilingMonotonic, which no longer exists after round 1's rework. Replaced with the actual operators: HLC1_PerNodeMonotonic, HLC1_Action (PROPERTY), HLC2_NonNeg, HLC2_Action (PROPERTY); also clarified HLC-1's state form uses the strict seq ordering, not the earlier vacuous total-order form.

  3. INFO (TickWall MaxClockSkewMs = 0): claude bot was reviewing round-1 commit 791c65b3; this finding is already addressed in round-2 commit 1123a0e1 (TickWall now excludes self, MaxClockSkewMs raised to 1, safe config explores 3,594 distinct states).

Docs-only changes. make tla-check still passes on MCHLC.cfg (3,594 distinct states) and fails on MCHLC_gap.cfg as designed.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Claude finished @bootjp's task in 3m 55s —— View job


Review of Round 3 (9901e6c)

Todo list:

  • Load tools
  • Read HLC.tla — verify proposed_ → partial_ path fix
  • Read Env.tla — verify proposed_ → partial_ path fix
  • Read Raft.tla — verify proposed_ → partial_ path fix
  • Read Makefile — verify proposed_ → partial_ path fix
  • Read tla/README.md — verify invariant table fix (HLC2_CeilingMonotonic → correct names)
  • Read MCHLC.cfg + MCHLC_gap.cfg — verify constant updates still correct
  • Read MCHLC.tla — general review
  • Synthesize and post findings

Round-2 fixes: both correct ✓

Fix 1 — Stale proposed_ path in four files

All four broken references are now updated:

File Line Confirmed
tla/hlc/HLC.tla 4 …partial_tla_safety_spec.md §5.1
tla/lib/Env.tla 4 …partial_tla_safety_spec.md §6.1
tla/lib/Raft.tla 3 …partial_tla_safety_spec.md §3 / §6.1
Makefile 17 …partial_tla_safety_spec.md §7.2

docs/design/2026_05_28_partial_tla_safety_spec.md exists on disk; the old proposed_ path is gone. All links are live.

Fix 2 — README invariant table (tla/README.md:108-118)

The table now correctly lists all operators that exist in HLC.tla:

README entry Exists in HLC.tla Correct
HLC1_PerNodeMonotonic line 201 ✓
HLC1_Action (PROPERTY) line 223 ✓
HLC2_NonNeg line 211 ✓
HLC2_Action (PROPERTY) line 228 ✓
HLC3_LeaderOnly line 234 ✓
HLC4_NoRegressionAcrossTerms line 242 ✓

No more phantom HLC2_CeilingMonotonic. The note about "strict-seq ordering, not the earlier vacuous total-order form" is accurate.


Spec correctness pass — no issues found

TickWall skew model (HLC.tla:107-109)

With MaxClockSkewMs = 1:

  • Condition 1: wallNow[n] + 1 <= wallNow[m] + 1 = wallNow[n] <= wallNow[m] — n can tick only if not ahead of any peer.
  • Condition 2: wallNow[m] <= wallNow[n] + 2 — no peer is more than MaxClockSkewMs+1 ahead (post-tick skew ≤ MaxClockSkewMs).

Together: |wallNow[n]+1 − wallNow[m]| ≤ 1 for all peers — correct symmetric bounded-skew model. State space is non-degenerate (verified: 3,594 distinct states).

HLC1_Action holds on all actions

  • TickWall: UNCHANGED hlcLastHLCLE(hlcLast[n], hlcLast[n]) trivially TRUE.
  • ApplyCeiling: UNCHANGED hlcLast → trivially TRUE.
  • BecomeLeader_HLC: either raises hlcLast[n] to maxAppliedHLC (HLCLT precondition guarantees > → HLCLE TRUE) or leaves it unchanged.
  • IssueTimestamp(n): ComputeNextHLC(prev, FlooredNow(n)) is always >= prev across all three branches (wall advance, logical increment, logical overflow). hlcLast[m] for m ≠ n is UNCHANGED.

HLC2_Action holds on all actions

  • Only ApplyCeiling modifies physicalCeiling, and it uses IF physicalCeiling[m] < newC THEN newC ELSE physicalCeiling[m] — strict max(), never decreases. All other actions UNCHANGED physicalCeiling.

HLC4_NoRegressionAcrossTerms with EnableSafety = TRUE

New leader n via BecomeLeader_HLC:

  1. hlcLast[n] raised to maxAppliedHLC (≥ every previous commit across all terms).
  2. IssueTimestamp gated on wallNow[n] < physicalCeiling[n], so FlooredNow(n) = physicalCeiling[n].
  3. ComputeNextHLC(hlcLast[n], physicalCeiling[n]) produces a value strictly above maxAppliedHLC (either the ceiling is larger than hlcLast[n].wall, or the logical counter increments above maxAppliedHLC.logical). Every new-term commit is therefore strictly above every prior-term commit.

Both .cfg files register HLC1_Action + HLC2_Action under PROPERTIES — correctly separating temporal properties from state invariants. Both configs share MaxClockSkewMs = 1, HlcPhysicalWindowMs = 2, maintaining ASSUME 1 < 2 from Env.tla.

HLC3_LeaderOnly: IssueTimestamp(n) is gated on IsLeader(n), which guarantees activeTerm > 0 /\ leaderOf[activeTerm] = n. The committed record stores term = activeTerm and node = n, so the invariant trivially holds.

TypeOK: committedTS records carry seq \in 0..MaxOps via opCount at issuance time. The constraint opCount < MaxOps before IssueTimestamp and the increment opCount' = opCount + 1 after ensure seq stays in range.

MCHLC.tla: Clean module — just EXTENDS HLC, TLC with a symmetry hint Permutations(Nodes). Correct to share one module file between both configs.


Summary

All four rounds of review findings have been fully addressed:

  • Round 1 (gemini HIGH + MEDIUM, codex P2 ×2): Nat subtraction, sha256sum portability, vacuous HLC-1 and HLC-2.
  • Round 2 (codex P2 ×2): TickWall self-exclusion, gap failure reason validation.
  • Round 3 (claude MEDIUM + LOW): stale proposed_ paths, wrong invariant name in README.

The spec is sound and internally consistent. No new findings. Ready to merge.

Per docs/design/2026_05_28_partial_tla_safety_spec.md §7.3 — the M1 PR
is the right place to land the CI workflow alongside the concrete
spec.  Adds .github/workflows/tla-check.yml:

- Triggers on PR (opened/synchronize/reopened) and on push to main,
  with a path filter matching exactly the §7.3 list: tla/**, Makefile,
  kv/hlc*.go, kv/coordinator.go, kv/sharded_coordinator.go,
  kv/transaction.go, kv/fsm.go, kv/lock_resolver.go,
  store/mvcc_store.go, distribution/**, and the workflow file itself.
- Uses Temurin JDK 21 (consistent with jepsen-test.yml).
- Caches .cache/tla/tla2tools.jar keyed on the Makefile hash so any
  TLA_VERSION / TLA_SHA256 change automatically invalidates the cache.
- Pre-fetches the jar via `make tla-tools` so the download is in a
  separate log group from the model check.
- Runs `make tla-check`, which inherits the round-2 gap-config
  validation: a non-zero TLC exit on MCHLC_gap.cfg only counts as
  success if the output contains the exact HLC-4 violation string.

Updates §7.3 of the design doc: removes the "(deferred)" qualifier and
names the workflow file as the now-active CI binding.

actionlint clean.  The full M1 run completes in well under a minute
locally (3,594 distinct states on the safe config, HLC-4 counter-
example at depth 5 on the gap config).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 28, 2026

Wired make tla-check into GitHub Actions in ca0ecfc0.

New file: .github/workflows/tla-check.yml

  • Triggers on PR (opened / synchronize / reopened) and push to main, with a path filter matching §7.3 of the design doc (tla/**, Makefile, the HLC/coordinator/transaction/fsm/lock_resolver Go files, store/mvcc_store.go, distribution/**, and the workflow file).
  • Temurin JDK 21 (consistent with jepsen-test.yml).
  • Caches .cache/tla/tla2tools.jar keyed on Makefile hash so any TLA_VERSION / TLA_SHA256 change invalidates the cache automatically.
  • Pre-fetches the jar via make tla-tools so the download is logged separately from the model check.
  • Runs make tla-check, which keeps the round-2 gap-config validation: a non-zero TLC exit on MCHLC_gap.cfg only counts as success if the output contains the exact Invariant HLC4_NoRegressionAcrossTerms is violated string.

Updated §7.3 of the design doc: removed the (deferred) qualifier and named the workflow file as the now-active CI binding.

actionlint clean. The workflow is already running on this PR (ca0ecfc0): https://github.com/bootjp/elastickv/actions/runs/26591495196

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca0ecfc0e7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/tla-check.yml Outdated
Comment on lines +74 to +75
restore-keys: |
${{ runner.os }}-tla-tools-
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify restored TLA jar before use

When TLA_VERSION or TLA_SHA256 changes, the primary cache key misses, and GitHub's cache docs say restore-keys are checked for partial matches; this can restore an older .cache/tla/tla2tools.jar. Because make tla-tools only depends on the existing file target and the checksum recipe in the Makefile is skipped when that file already exists, CI can run an old/unverified TLC jar despite the checksum pin. Drop the prefix restore or make the target always verify/remove stale jars after cache restore.

Useful? React with 👍 / 👎.

codex P2 on PR #856 ca0ecfc: the prefix-match restore-keys could
bring back an older tla2tools.jar after a TLA_VERSION/TLA_SHA256 bump
in the Makefile.  Because $(TLA_JAR) is a file-existence target, the
restored stale jar would skip the Makefile's checksum verification
entirely — defeating the pin.

A full cache miss is the correct behaviour on a version-pin change:
the recipe runs, downloads from GitHub, and verifies the SHA-256
before serving the jar to TLC.  Documented inline so the next
reviewer doesn't re-add restore-keys.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 28, 2026

Addressed codex P2 on ca0ecfc0 in 2a2238af.

Finding: .github/workflows/tla-check.yml cache restore-keys could bring back an older tla2tools.jar after a TLA_VERSION / TLA_SHA256 bump in the Makefile. Because the Makefile's $(TLA_JAR) target is gated on file existence, the restored stale jar would skip the checksum verification entirely — defeating the pin.

Fix: Dropped the restore-keys block. A full cache miss is the correct behaviour on a version-pin change: the recipe runs, downloads from GitHub, and verifies the SHA-256 before serving the jar to TLC. Documented inline so the next reviewer doesn't re-add it.

This is a workflow-only change. The Makefile's checksum verification path is unchanged — and is now load-bearing for every version-pin bump.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Claude finished @bootjp's task in 1m 24s —— View job


Todo list:

  • Load tools
  • Read .github/workflows/tla-check.yml — verify restore-keys removal
  • Read Makefile — confirm tla-tools / TLA_JAR target is still the verification gatekeeper
  • Check overall workflow correctness after the change
  • Post findings

Review of Round 4 (2a2238a)

The diff is a 7-line addition (explanatory comment) + 2-line deletion (restore-keys block). The fix is correct.

The root cause and fix

The $(TLA_JAR) Make target is a standard file-existence guard:

$(TLA_JAR):
    @curl -fsSL -o $(TLA_JAR).tmp $(TLA_URL)
    @# ... SHA-256 verification ...
    @mv $(TLA_JAR).tmp $(TLA_JAR)

This recipe only runs when .cache/tla/tla2tools.jar does not exist. With the old restore-keys: ${{ runner.os }}-tla-tools-, a prefix match after a TLA_VERSION/TLA_SHA256 bump would restore an older jar (file now exists), the Make target would skip entirely, and TLC would silently run against the wrong version — with no checksum verification. Removing restore-keys means the only non-miss path is an exact hashFiles('Makefile') match, which guarantees the jar on disk is exactly the one the current Makefile specifies.

Correctness verified

Scenario Cache behaviour make tla-tools Result
Same Makefile hash as last run Exact hit → jar restored no-op (file exists) correct jar, no download
Makefile changed (e.g. TLA_VERSION bump) Miss (no restore-keys) downloads + SHA-256 verified correct jar, download logged separately
First run ever Miss downloads + SHA-256 verified correct jar

Comment quality

The inline comment (lines 70-75 of the final file) names the codex P2 finding, explains the $(TLA_JAR) file-existence mechanism, and states the correct semantics. This is sufficient to stop a future reviewer from re-adding restore-keys thinking it's a free performance win. Good.

Everything else unchanged and correct

  • key: ${{ runner.os }}-tla-tools-${{ hashFiles('Makefile') }} — both TLA_VERSION and TLA_SHA256 are in the Makefile, so bumping either one rotates the key.
  • cancel-in-progress: false — correct for a correctness check (don't abandon a running TLC invocation on push).
  • make tla-tools pre-fetch step still present — separates download logs from model-check logs, making failure attribution straightforward.
  • make tla-check still greps for the specific Invariant HLC4_NoRegressionAcrossTerms is violated string from round 2, so the gap evidence is validated by content, not just exit code.

No new findings. All five rounds of review findings are now fully addressed. Ready to merge.

@bootjp bootjp merged commit 32ef69d into main May 28, 2026
10 checks passed
@bootjp bootjp deleted the feat/tla-m1-hlc-spec branch May 28, 2026 18:08
bootjp added a commit that referenced this pull request May 28, 2026
…857)

## Summary

Auto-trigger AI review (Claude + Codex) whenever a PR touches a file the
TLA+ safety spec has an anchor on, so a spec-divergence review
**always** fires — never relies on a human remembering to ping.

Follows PR #856 (M1 spec + `tla-check` workflow).

## How it works

New workflow: `.github/workflows/tla-spec-ai-review.yml`

- Fires on `pull_request` (opened / synchronize / reopened /
ready_for_review).
- Path filter is an exact copy of `tla-check.yml`'s — the two workflows
stay paired so every anchored-file change triggers **both** the TLC run
AND the AI-review request.
- Computes which anchored files actually changed (via `git diff` against
the PR base) and skips the comment if none.
- Posts a fresh PR comment per push containing both `@claude review` and
`@codex review` mentions, plus a per-subsystem checklist of which spec
invariants to verify (HLC-4 preconditions, OCC lock encoding, MVCC
visibility, Routes catalog versioning).
- **Always** `createComment` — never `updateComment`. The upstream
`claude.yml` listens on `issue_comment: created` only; editing would not
re-trigger Claude. Same for Codex.

## Chain

```
anchored file change in PR
  -> tla-check.yml runs TLC                    (catches *spec* bugs)
  -> tla-spec-ai-review.yml posts comment      (with @claude + @codex pings)
       -> claude.yml fires on @claude          → Claude reviews diff
       -> chatgpt-codex-connector bot fires    → Codex reviews diff
            (both focused on TLA+ divergence)
```

The two workflows are complementary: `tla-check` verifies the spec
passes; the AI reviewers verify the implementation matches the spec.

## Docs

§7.3 of `docs/design/2026_05_28_partial_tla_safety_spec.md` is updated
with the chain diagram, the two-workflow split, and a reminder that the
path filter must stay in sync across the two YAML files and the doc §
list.

## Verification

- `actionlint` clean.
- `make tla-check` still passes locally on `main` (3,594 distinct states
on safe config, HLC-4 counterexample at depth 5 on gap config).
- The workflow itself doesn't run TLC — it only posts a comment — so the
trigger chain can be tested by opening a follow-up PR that touches e.g.
`kv/hlc.go`.

## Test plan

- [ ] CI green (`tla-check` runs again because the workflow path filter
touches `tla/**`? No — this PR only adds `.github/workflows/`, which is
NOT in the path filter, so `tla-check` will NOT fire on this PR. That's
correct.)
- [ ] Reviewer opens a tiny follow-up PR touching e.g. `kv/hlc.go` to
verify the comment+bot chain fires
- [ ] Reviewer confirms the comment template lists the right spec
sections per subsystem

## Out of scope

- Auto-collapsing older marker comments after a fresh push — embedded as
a marker (`<!-- tla-spec-ai-review-marker -->`) so a future PR can add a
"minimize older" step if the comment noise becomes a problem.
- Triggering AI review when the spec itself changes (no impl divergence
to check then — `tla-check` is sufficient).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated design documentation describing the new automated
specification review workflow, including trigger conditions, review
coordination processes, and how verification ensures implementations
maintain alignment with formal specifications

* **Chores**
* Added automated workflow that monitors pull requests for specification
file changes and automatically coordinates review requests to verify
specification compliance

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/857?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request May 28, 2026
…#859)

## Summary

Implements the **M1 spec follow-up** per
[`docs/design/2026_05_28_partial_tla_safety_spec.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md)
§5.1 HLC-4 precondition **(ii)** — strategy (c) "scan the FSM for max
committed HLC and `Observe` before serving any persistence `Next()`".

Follows PR #856 (M1 TLA+ spec).

## What lands

One line in `kv/fsm.go` `applyRequestErr` after a successful
`handleRequest`:

```go
if f.hlc != nil && commitTS > 0 {
    f.hlc.Observe(commitTS)
}
```

`HLC.Observe` is the existing CAS-to-max method (`kv/hlc.go:129-140`),
so this is a **pure addition** with no signature change and no caller
audit required.

## Why "observe on every apply" vs the spec doc's "observe on election"

The spec doc describes strategy (c) as "on election, scan the FSM and
call `Observe`". The implementation refines this to "Observe on every
apply". The two are semantically equivalent for HLC-4 because etcd/raft
applies all uncommitted prior-term entries before a new leader serves a
write — so by the time `HLC.Next()` runs in the new term, `hlc.last >=
every prior commit`.

Two advantages over the on-election form:

1. **No protocol change** to `ShardedCoordinator` or
`RunHLCLeaseRenewal`.
2. **No race window** between `BecomeLeader` and the first `Observe` — a
concurrent client request landing in that window cannot see a stale
`hlc.last`.

## Tests

`kv/fsm_hlc_observe_test.go`:

- `TestApplyObservesCommitTSIntoHLC` — verifies Apply advances
`hlc.last` via `Observe(commitTS)`, and preserves monotonicity on a
stale-ts apply.
- `TestApplyHLCObserveAfterRestart` — simulates the spec's
new-leader-after-restart scenario: a fresh HLC (`last = 0`) re-plays a
prior leader's write through the FSM and the resulting `Next()` is
strictly greater than the prior `commit_ts`.

Both tests pass with `-race`.

## Deferred to a follow-up PR

The **ceiling-fence half** of the M1 spec contract (HLC-4 precondition
**(iii)** — making `HLC.Next()` fail-closed when `wall_now >=
physicalCeiling`) is intentionally NOT in this PR. That change alters
`HLC.Next()`'s return signature from `uint64` to `(uint64, error)` and
requires auditing **18 production call sites** across `kv/`, `adapter/`,
and `store/`. Splitting it from the strategy-(c) PR keeps each diff
individually reviewable per the loop-discipline reminder for semantic
changes.

## Self-review (5-lens per CLAUDE.md)

1. **Data loss** — n/a; `Observe` only advances an in-memory clock
counter, no persistence path touched.
2. **Concurrency** — `Observe` is the existing CAS-to-max loop; safe
under concurrent `Apply` (which is serial per Raft FSM contract anyway)
and concurrent `Next` (the CAS handles it).
3. **Performance** — adds one atomic load + CAS to the FSM apply hot
path. The CAS only fires when `ts > current` so steady-state is one
load.
4. **Data consistency** — `Observe` is monotone (only advances). No
semantic change to any other code path.
5. **Test coverage** — two new tests in `kv/fsm_hlc_observe_test.go`
covering the basic apply→observe path and the post-restart catchup case.
Existing kv + store tests pass.

## Test plan

- [ ] CI green (the `tla-check` workflow does NOT fire — this PR is
Go-only)
- [ ] The new `tla-spec-ai-review` workflow (PR #857) auto-pings Claude
+ Codex for spec-divergence review
- [ ] Reviewer cross-checks the Observe call against HLC-4 (ii) strategy
(c) in the design doc and `BecomeLeader_HLC` in `tla/hlc/HLC.tla`

## Out of scope

- HLC-4 precondition (iii) ceiling fence (`HLC.Next()` fail-closed) —
separate follow-up PR.
- Operator-visible metrics for the fence behaviour — lands with (iii).
bootjp added a commit that referenced this pull request May 29, 2026
… tla-check (#858)

## Summary

M2 of the [TLA+ safety
spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md)
— the OCC layer encoded as a Percolator-style 2PC model.

Follows PR #856 (M1 HLC spec).

## What lands

- `tla/occ/OCC.tla` — transaction lifecycle `Idle → Active → Prepared →
Committed/Aborted`, lock map `(key, lock_ts) → start_ts`, `LockResolve`
action that drains abandoned locks.
- `tla/occ/MCOCC.tla` — TLC model module with txn + key symmetry.
- `tla/occ/MCOCC.cfg` — safe config (`EnableSafety = TRUE`; expected
PASS).
- `tla/occ/MCOCC_gap.cfg` — Commit's OCC-1 guard removed (expected FAIL
on `OCC1_CommitTsAboveStart`).
- `scripts/tla-check.sh` — module-aware orchestrator. Adding M3/M4/M5
needs only a new entry in `TLA_MODULES` + a one-line `case` for the
gap-invariant string.
- `Makefile tla-check` delegates to the script.
- `.gitignore` TLA artefact glob widened from `MCHLC_TTrace_*` to
`MC*_TTrace_*` (M2 + M3..M5 covered).
- `tla/README.md`: M2 status `Not started` → `Landed`; OCC module
description block + invariant table.

## Invariants encoded

| # | Statement | Form |
|---|---|---|
| OCC-1 | `commit_ts(T) > start_ts(T)` for every committed txn |
INVARIANT |
| OCC-2 | Two committed txns sharing a write key have distinct
`commit_ts` and one started after the other committed | INVARIANT |
| OCC-3 | Snapshot-read stability under the lock encoding (reads blocked
by conflicting uncommitted locks; LockResolve drains them) | INVARIANT |
| OCC-4 | No stranded lock at quiescence (bounded-safety form) |
INVARIANT |
| OCC-5 | `start_ts(T) == read_ts(T)` for every operation | INVARIANT |
| OCC5_Action | `start_ts[t]` is assigned once at BeginTxn, never
updated | PROPERTY |
| CommitTsAssignedOnce | `commit_ts[t]` is assigned once at Commit,
never updated | PROPERTY |

OCC-L1 (liveness) is deferred to M6 per §8.1.

## Verification (local)

```text
$ make tla-check
... TLC: tla/hlc/MCHLC.cfg  (correct design, expected PASS)
... 3594 distinct states found, 0 left on queue
... TLC: tla/hlc/MCHLC_gap.cfg  (...expected FAIL on HLC4...)
OK: MCHLC_gap.cfg failed as designed
... TLC: tla/occ/MCOCC.cfg  (correct design, expected PASS)
... 4217 distinct states found, 0 left on queue
... TLC: tla/occ/MCOCC_gap.cfg  (...expected FAIL on OCC1...)
OK: MCOCC_gap.cfg failed as designed (Invariant OCC1_CommitTsAboveStart is violated)
tla-check: all model-check outcomes match the design contract.
```

## Modelling notes

- **HLC abstraction.** OCC.tla uses a single global monotonic
`tsCounter`, not the full 48/16 HLC. M5 (composed) will INSTANCE
`HLC.tla` for the real layout. This keeps the M2 state space manageable.
- **`Commit` borrows ts.** `commit_ts = tsCounter` (does NOT advance).
Only `Tick` advances ts, so the bounded model stays inside `0..MaxTs`.
OCC-1's `commit_ts > start_ts` then requires a Tick between BeginTxn and
Commit.
- **OCC-2 non-strict serialization.** A txn with `start_ts = T` sees
every version with `commit_ts ≤ T`, so `start_ts = commit_ts(earlier)`
is a valid "later runs in the world after earlier committed" boundary
(`<=`, not `<`).
- **OCC-3 lock encoding** is enforced as an `ENABLED` guard on `ReadKey`
— TLC simply does not explore reads in a state where a conflicting lock
exists, mirroring the spec's "OCC-3 is asserted only in states where no
uncommitted lock at lock_ts ≤ read_ts exists".

## Self-review (5-lens per CLAUDE.md)

1. **Data loss** — n/a, spec-only PR, no persistence paths.
2. **Concurrency / distributed failures** — the spec is exactly
modelling these. Safe config (4,217 states) explored without invariant
violation; gap config produces the documented counterexample.
3. **Performance** — `make tla-check` end-to-end runs in well under 5
seconds on a developer laptop at default bounds.
4. **Data consistency** — OCC-1..OCC-5 + the two PROPERTY transitions
are the M2 contract from the design doc. The non-strict OCC-2 form is
justified in the spec comment.
5. **Test coverage** — no Go tests added (spec-only). `make tla-check`
is the new coverage layer.

## Test plan

- [ ] `tla-check` CI workflow runs green (it watches `tla/**`)
- [ ] `tla-spec-ai-review` does NOT fire — this PR touches no Go anchor
(correct: spec-only changes don't need an AI impl-divergence review)
- [ ] Reviewer cross-checks OCC-1..OCC-5 invariants against §5.2 of the
design doc
- [ ] Reviewer runs `make tla-check` and confirms 4 outcomes (HLC safe
pass, HLC gap fail, OCC safe pass, OCC gap fail)

## Out of scope

- OCC-L1 (liveness) — deferred to M6.
- M3 (MVCC), M4 (Routes), M5 (Composed) — separate future PRs.
- Integrating OCC with the real HLC.tla — happens in M5.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added OCC (Optimistic Concurrency Control) specification module with
two-phase commit lifecycle, five core safety invariants, and automated
model-checking verification suite

* **Documentation**
* Updated project README with detailed OCC module documentation and
marked the OCC milestone as completed

* **Chores**
* Refactored verification build infrastructure with improved scripts and
updated trace-file ignore patterns

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/858?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request May 29, 2026
…#865)

## Summary

M5 of the [TLA+ safety
spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md)
— the cross-module safety properties, capping the spec-only roadmap
(M1..M5).

M1..M4 each modelled a single subsystem in isolation. M5 captures the
**seams** between them — the two safety properties that no single-module
spec can express.

## What lands

- `tla/composed/Composed.tla` — recreates the minimal cross-module state
needed for Composed-1 (catalog history `routes[v]`, txn observed catalog
version `txnObservedVer[t]`, committing group `txnCommitGroup[t]`) and
Composed-3 (monotonic `tsCounter` advancing on every Commit). **Does NOT
`INSTANCE` M1..M4** — that would explode the state space well past M5's
`<10 min at default bounds` target from §8.1.
- `tla/composed/MCComposed.tla` — TLC model with key + group + txn
symmetry.
- `tla/composed/MCComposed.cfg` / `MCComposed_gap.cfg` — safe (PASS) +
gap (FAIL on Composed-1).
- `scripts/tla-check.sh` — `TLA_MODULES` gains `"composed"`;
`gap_invariant_for` + `mc_basename` gain the corresponding cases.
- `tla/README.md` — M5 status `Not started` → `Landed`; Composed module
description + invariant table.

## Invariants

| # | Statement | Form |
|---|---|---|
| Composed-1 | Every committed write key was owned by the committing
group at the txn's observed catalog version | INVARIANT |
| Composed-2 | Vacuously TRUE in this abstraction — `SplitRange` is
same-group only per CLAUDE.md | INVARIANT |
| Composed-3 | Distinct committed txns have distinct `commit_ts` |
INVARIANT |
| `Composed_CatalogMonotonic` | `catalogVersion` weakly increases on
every step | PROPERTY |
| `Composed_TsMonotonic` | `tsCounter` weakly increases on every step |
PROPERTY |
| `Composed3_TsAction` | Every Commit strictly raises `tsCounter` |
PROPERTY |

## Why no `INSTANCE`?

Each M1..M4 module tightly bounds its own state for tractable TLC.
Bringing all four `INSTANCE`s into one product spec would multiply the
state spaces — even a 100x growth lands well past M5's `<10 min at
default bounds` target. M5 instead recreates the minimal cross-module
state needed to express the two seam invariants. The integration claim
is that this projection preserves the invariants the full product would
assert; reviewers can sanity-check each Composed action against the
corresponding M1..M4 action (e.g., `ProposeRouteChange` mirrors
`Routes.tla`, `BeginTxn` + `Commit` mirror `OCC.tla` with a cross-module
observed-catalog-version pin).

## Verification (local)

```text
$ make tla-check
HLC      safe 3,594 distinct states + HLC-4 gap ce
OCC      safe   150 distinct states + OCC-1 gap ce
MVCC     safe    79 distinct states + MVCC-4 gap ce
Routes   safe    29 distinct states + Routes-4 gap ce
Composed safe 1,684 distinct states + Composed-1 gap ce
tla-check: all model-check outcomes match the design contract.
```

The Composed-1 counterexample is the canonical 4-step regression:

1. `BeginTxn(t)` — `txn` observes `catalogVersion = 0`, where
`routes[0][k1] = g1`.
2. `WriteIntent(t, k1)`.
3. `ProposeRouteChange(k1, g2)` — `catalogVersion → 1`, `routes[1][k1] =
g2`.
4. `Commit(t, g2)` (gap mode) — committing group is `g2`, but
`routes[0][k1] = g1 ≠ g2`. **Composed-1 fails.**

## Self-review (5-lens per CLAUDE.md)

1. **Data loss** — out of scope (cross-module invariants check group
ownership, not durability).
2. **Concurrency** — Composed-1 is exactly the concurrency seam between
OCC and Routes; the gap counterexample is one such interleaving.
3. **Performance** — `make tla-check` end-to-end runs in ~10 s on a dev
laptop, well under the 10-min target.
4. **Data consistency** — Composed-1 (cross-module write/route
consistency) + Composed-3 (cross-txn ts uniqueness) + 3 PROPERTY
transitions.
5. **Test coverage** — no Go tests added (spec-only). `make tla-check`
is the coverage layer.

## Roadmap status (post-merge)

| Milestone | Status |
|---|---|
| M1 HLC | Landed (PR #856) |
| M2 OCC | Landed (PR #858) |
| M3 MVCC | Landed (PR #861) |
| M4 Routes | Landed (PR #861 via stacked #862) |
| **M5 Composed** | **This PR** |
| M6 liveness (OPTIONAL) | Not started — can follow as a separate PR |
| M1 Go follow-up (ceiling fence) | Not started — separate Go PR with
caller audit |

## Test plan

- [ ] `tla-check` CI runs green (watches `tla/**`)
- [ ] `tla-spec-ai-review` does NOT fire — spec-only PR
- [ ] Reviewer cross-checks Composed-1 / Composed-3 against §5.5 of the
design doc
- [ ] Reviewer runs `make tla-check` and confirms 10 outcomes (5 safe
pass + 5 gap fail)
- [ ] Reviewer sanity-checks the `INSTANCE`-free projection claim — each
Composed action mirrors the corresponding M1..M4 action

## Out of scope

- M6 liveness (OCC-L1, Routes-L1) — separate PR if needed
- HLC-4 (iii) ceiling fence Go implementation — separate code PR
- Cross-group `SplitRange` in the real implementation (currently
same-group only)
bootjp added a commit that referenced this pull request May 30, 2026
…ini medium on PR #878)

Closes a silent-false-positive class in the gap-config loop introduced
by the multi-gap refactor in this PR's parent commit.

The loop does:
  while IFS='|' read -r gap_stem gap_inv; do
    ...
    if printf '%s\n' "$gap_out" | grep -qF "$gap_inv"; then
      OK
    fi
  done

If a gap_configs_for() entry is malformed (missing the '|' delimiter,
e.g. a typo dropped the invariant string), `read` puts the entire
line in gap_stem and leaves gap_inv empty.  Then `grep -qF ""` matches
every line of TLC output unconditionally, so the gap config 'passes'
even when TLC failed for an unrelated reason — exactly the
silent-false-positive class codex P2 on PR #856 round 2 introduced
the gap-invariant-string match to prevent.

Fix: validate both fields are non-empty before the body runs.  A
trailing blank line (the existing skip case) is still skipped; a
malformed entry now produces a clear ERROR with the stem / inv it
read so the maintainer can fix gap_configs_for() directly.

Verified: `make tla-check` still produces the 11 expected outcomes
(5 safe PASS + 5 gap FAIL + 1 currentfence gap FAIL on Composed-1a)
after the guard.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant