tla: M1 — HLC.tla + lib + make tla-check (HLC-1..4 invariants + gap counterexample)#856
Conversation
Per docs/design/2026_05_28_proposed_tla_safety_spec.md §8.1 (M1
deliverables): adds the TLA+ specification suite directory layout and
the HLC module, runnable via `make tla-check`.
Files:
- tla/lib/Env.tla — shared constants and the HLC-4 precondition (i)
ASSUME (MaxClockSkewMs < HlcPhysicalWindowMs).
- tla/lib/Raft.tla — abstract Raft interface (currentTerm, leaderOf,
BecomeLeader, IsLeader). Log matching and leader completeness are
treated as assumed properties per NG4.
- tla/hlc/HLC.tla — HLC layer with all five invariants:
- HLC-1 (per-node monotonic), HLC-2 (ceiling monotonic), HLC-3
(leader-only issuance), HLC-4 (no regression across leader change),
TypeOK. Strategy (c) handoff lives in BecomeLeader_HLC; the ceiling
fence is an enabling guard on IssueTimestamp. Both are gated by
CONSTANT EnableSafety so the same spec drives the correct-design
and gap configurations.
- tla/hlc/MCHLC.tla — TLC model-check instance.
- tla/hlc/MCHLC.cfg — EnableSafety = TRUE (expected to PASS all
invariants at default bounds: 3 nodes / 2 terms / 3 ops).
- tla/hlc/MCHLC_gap.cfg — EnableSafety = FALSE (expected to FAIL on
HLC-4 with a counterexample, demonstrating that strategy (c) +
ceiling fence are necessary).
- tla/README.md — install, run, what each module proves, reading list,
how to map a TLC counterexample back to the implementation.
Tooling:
- Makefile: tla-check target downloads tla2tools.jar v1.8.0 to
.cache/tla/ with checksum pinning (SHA-256
237332bdcc79a35c7d26efa7b82c77c85c2744591c5598673a8a45085ff2a4fb),
runs both configurations, and inverts the exit code on the gap
configuration so the expected-FAIL counts as PASS for CI.
- .gitignore: TLC counterexample-trace artefacts
(MCHLC_TTrace_*.{tla,bin}) and the per-run states/ directory.
Verified locally:
- make tla-check passes end-to-end.
- MCHLC.cfg: 28 distinct states, 0 invariant violations.
- MCHLC_gap.cfg: HLC-4 counterexample at depth 5 (n1 commits (0,1) in
term 1; n2 elected term 2 without strategy (c), commits (0,1) again,
tying the prior commit — HLC-4 requires strictly greater).
Follow-up (separate PRs):
- Rename the design doc proposed -> partial (next commit in this PR).
- Implement strategy (c) and the ceiling fence in Go:
- Add MaxAppliedHLC() to kv/fsm.go.
- Call hlc.Observe(fsm.MaxAppliedHLC()) on new-term detection in
kv/sharded_coordinator.go RunHLCLeaseRenewal.
- Add wallNow < physicalCeiling fence to HLC.Next() with a
fail-closed return path, plus operator-visible metrics.
M1 (HLC.tla + lib + make tla-check) lands in the prior commit on this branch, satisfying the first row of §8.1. Per CLAUDE.md docs/design/ lifecycle, rename proposed -> partial and update the header / Status note. Filename propose date and slug are preserved; only the status token changed. M2..M5 remain open; M6 (liveness) remains optional.
|
Warning Review limit reached
More reviews will be available in 8 minutes and 27 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the TLA+ safety specification for the Hybrid Logical Clock (HLC) as part of Milestone 1, including the core specification files, model-checking configurations, and a Makefile target to automate TLC verification. The review feedback highlights two key improvements: rewriting subtraction-based inequalities in the TLA+ specification to addition to prevent evaluation errors when clock skew is enabled, and updating the Makefile to support both sha256sum and shasum for better portability across different environments.
| /\ \A m \in Nodes : (wallNow[n] + 1) - wallNow[m] <= MaxClockSkewMs | ||
| /\ wallNow[m] - (wallNow[n] + 1) <= MaxClockSkewMs |
There was a problem hiding this comment.
In TLA+, subtraction on natural numbers (Naturals) is only defined when the left operand is greater than or equal to the right operand. If MaxClockSkewMs is configured to be greater than 0 (which is the core scenario for verifying clock skew safety), a state can be reached where wallNow[m] > wallNow[n] + 1. In this state, evaluating (wallNow[n] + 1) - wallNow[m] will result in a negative number, causing TLC to halt with an evaluation error.
To prevent this, we can rewrite the inequalities using addition instead of subtraction, which is always safe and closed under Naturals.
/\ \\A m \\in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
/\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs
| @actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \ | ||
| if [ "$$actual" != "$(TLA_SHA256)" ]; then \ | ||
| echo "ERROR: tla2tools.jar SHA-256 mismatch."; \ | ||
| echo " expected: $(TLA_SHA256)"; \ | ||
| echo " actual: $$actual"; \ | ||
| rm -f $(TLA_JAR).tmp; \ | ||
| exit 1; \ | ||
| fi |
There was a problem hiding this comment.
The shasum command is not installed by default on many Linux distributions (such as Ubuntu/Debian), whereas sha256sum is part of GNU coreutils and is universally available. To ensure the Makefile is portable and robust across both macOS and Linux environments (especially in minimal CI runners), we should check for the availability of both commands and use whichever is present.
@if command -v sha256sum >/dev/null 2>&1; then \
actual=$$(sha256sum $(TLA_JAR).tmp | awk '{print $$1}'); \
elif command -v shasum >/dev/null 2>&1; then \
actual=$$(shasum -a 256 $(TLA_JAR).tmp | awk '{print $$1}'); \
else \
echo "ERROR: neither sha256sum nor shasum found."; \
rm -f $(TLA_JAR).tmp; \
exit 1; \
fi; \
if [ "$$actual" != "$(TLA_SHA256)" ]; then \
echo "ERROR: tla2tools.jar SHA-256 mismatch."; \
echo " expected: $(TLA_SHA256)"; \
echo " actual: $$actual"; \
rm -f $(TLA_JAR).tmp; \
exit 1; \
fi
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66b515392e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| HLC1_PerNodeMonotonic == | ||
| \A r1, r2 \in committedTS : | ||
| r1.node = r2.node => HLCLE(r1.ts, r2.ts) \/ HLCLE(r2.ts, r1.ts) |
There was a problem hiding this comment.
Track issuance order for HLC-1
In this spec, committedTS is a set and HLCLE is a total lexicographic order, so this invariant is true for any two timestamps from the same node regardless of which one was issued first. For example, if a later change allowed a node to commit (5,0) and then regress to (4,0), the reversed disjunct HLCLE(r2.ts, r1.ts) would still make HLC-1 pass. Since the design doc says HLC-1 proves monotonicity across Next/Observe, the model needs to record issuance order/history or assert the monotonicity in the actions rather than just comparability.
Useful? React with 👍 / 👎.
| \* with a max(), and there is no action that lowers physicalCeiling. The | ||
| \* invariant is therefore a stuttering-step consequence, asserted here as | ||
| \* a sanity check rather than a behavioural property. | ||
| HLC2_CeilingMonotonic == \A n \in Nodes : physicalCeiling[n] >= 0 |
There was a problem hiding this comment.
Encode actual ceiling monotonicity
This invariant only checks that each ceiling is a natural number, which is already implied by TypeOK, so TLC would still pass if a future action lowered physicalCeiling[n] to any smaller non-negative value or made nodes diverge after applying the same ceiling entry. Because HLC-2 is meant to prove per-node monotonicity and same-entry equality, this needs history/action-level checking (or previous values in the state) rather than >= 0.
Useful? React with 👍 / 👎.
…rtability Four findings from PR #856 review on commit 66b5153: - HLC.tla TickWall skew check: rewrite subtraction as addition. Nat subtraction in TLA+ underflows when LHS < RHS, which would halt TLC the moment MaxClockSkewMs > 0 enabled an out-of-order TickWall. Inequalities equivalently expressed with `+` stay inside Nat. (gemini HIGH on PR #856). - HLC1_PerNodeMonotonic now uses the per-commit `seq` ordinal (the global opCount value at issuance time) so the invariant compares same-node commits in their actual issuance order: `r1.node = r2.node /\ r1.seq < r2.seq => HLCLT(r1.ts, r2.ts)`. The earlier `HLCLE(r1, r2) \/ HLCLE(r2, r1)` form was vacuously true on a totally ordered set and would not have caught a future regression (codex P2 on PR #856). - HLC1_Action + HLC2_Action: new action-level safety properties registered under PROPERTIES (not INVARIANTS) in both .cfg files — TLC checks `[][P]_vars` as a transition property without requiring fairness. These pin the contract that hlcLast and physicalCeiling weakly increase across every step. HLC2_CeilingMonotonic's old `>= 0` form was vacuously implied by TypeOK (codex P2 on PR #856) and is replaced by HLC2_NonNeg (the residual TypeOK content) plus HLC2_Action (the real monotonicity claim). - Makefile sha256 portability: prefer GNU sha256sum (Linux universal), fall back to shasum -a 256 (macOS default), error out if neither is available. Without this the target failed on minimal Linux CI images (gemini MEDIUM on PR #856). - committedTS records now carry `seq` (= opCount at issuance), and TypeOK is extended to require it. Verified: `make tla-check` still PASSES on MCHLC.cfg (now also exercising the two action properties) and FAILS on MCHLC_gap.cfg as designed, with the HLC-4 counterexample at depth 5 unchanged.
|
Addressed PR #856 round 1 review in
The TLA+ "same set of ceiling entries → equal across nodes" half of HLC-2 is structurally satisfied in M1 because Verification. @claude review |
|
Claude finished @bootjp's task in 5m 31s —— View job Review of Round 2 (791c65b)Todo list:
Round-1 fixes: all correct ✓Fix 1 — Nat subtraction underflow ( /\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
/\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMsis semantically equivalent and stays inside Fix 2 — Fix 3 — HLC-1 was vacuous (HLC.tla:193-195) Fix 4 — HLC-2 was vacuous (HLC.tla:203-221) New findings1. Stale
|
| File | Line |
|---|---|
tla/hlc/HLC.tla |
4 |
tla/lib/Env.tla |
4 |
tla/lib/Raft.tla |
4 |
Makefile |
17 |
tla/README.md already has the correct partial_ path. The referenced file no longer exists under the old name — the links are broken.
2. README invariant table names HLC-2 incorrectly — low
tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator does not exist in HLC.tla. The actual operators are HLC2_NonNeg (INVARIANTS) and HLC2_Action (PROPERTIES).
3. TickWall is unreachable with MaxClockSkewMs = 0 — informational
The skew guard \A m \in Nodes : wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs includes m = n. For m = n, condition 1 reduces to wallNow[n] + 1 <= wallNow[n] + 0 = 1 <= 0, which is always false. So with MaxClockSkewMs = 0 (both current configs), TickWall is never enabled, all wallNow[n] stay at 0 throughout the model check, and MaxWallTime = 3 is never exercised.
Physical-time advancement is modeled exclusively via ApplyCeiling → physicalCeiling. The invariants still hold under this restriction, so the M1 model check is not unsound — but the wall-clock ticking path is untested. A future config with MaxClockSkewMs >= 1 (and the corresponding tighter ASSUME MaxClockSkewMs < HlcPhysicalWindowMs) would exercise TickWall.
No fix needed for M1; worth tracking as a follow-up config for deeper clock-skew exploration.
Summary
The round-1 fixes are all correctly implemented. The spec is sound for M1. Two doc-consistency issues (stale proposed_ paths, wrong invariant name in README) should be fixed before merging. The TickWall/MaxClockSkewMs=0 observation is informational and out of scope for this PR.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 791c65b31b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /\ \A m \in Nodes : /\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs | ||
| /\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs |
There was a problem hiding this comment.
With the checked configs setting MaxClockSkewMs = 0, this universal quantifier also evaluates the m = n case, which requires wallNow[n] + 1 <= wallNow[n] + 0 before any tick can occur. That is impossible, so TickWall is never enabled and the model never explores wall-clock advancement, ceiling expiry, or skew schedules despite the MaxWallTime bounds. Exclude the ticking node from this comparison or compare it against its updated self value.
Useful? React with 👍 / 👎.
| else \ | ||
| echo; \ | ||
| echo "OK: MCHLC_gap.cfg failed as designed (HLC-4 counterexample)."; \ |
There was a problem hiding this comment.
Validate the gap failure reason
In this tla-check target, every non-zero exit from the gap TLC run is accepted as the expected HLC-4 counterexample. If MCHLC_gap.cfg starts failing because of a parse/type error, a different invariant, a deadlock, or a JVM/tooling crash, CI will still print OK and pass, masking regressions in the newly added model-check harness. Check the TLC output for the specific HLC4_NoRegressionAcrossTerms invariant violation before treating the failure as success.
Useful? React with 👍 / 👎.
…reason Two P2 findings from codex on commit 791c65b. - HLC.tla TickWall: exclude self from the inter-node skew check. The `\A m \in Nodes:` quantifier included `m = n`, so under MaxClockSkewMs = 0 the constraint forced `wallNow[n] + 1 <= wallNow[n] + 0`, which is unsatisfiable. TickWall was therefore never enabled and the model never explored wall-clock advancement, ceiling expiry, or skew schedules — the safe config was passing on a degenerate 28-state subspace. Restrict the universal to `Nodes \ {n}` so self-comparison is excluded. (codex PR #856 round 2.) - Raise MaxClockSkewMs from 0 to 1 (and HlcPhysicalWindowMs from 1 to 2 so the ASSUME `MaxClockSkewMs < HlcPhysicalWindowMs` still holds) in both MCHLC.cfg and MCHLC_gap.cfg so TickWall is actually enabled and the skew dimension is part of the explored space. After this change the safe config explores 3,594 distinct states (vs 28 previously) and the gap config still surfaces the HLC-4 counterexample at depth 5. - Makefile tla-check: validate the gap-run failure reason instead of blindly accepting any non-zero TLC exit. Capture TLC stdout/stderr, print it, then grep for the exact string `Invariant HLC4_NoRegressionAcrossTerms is violated` before declaring the gap evidence intact. Without this check a parse error, deadlock, JVM crash, or different invariant violation would have silently counted as "expected gap counterexample" and masked regressions in the harness itself. (codex PR #856 round 2.)
|
PR #856 round 2 review (codex P2 ×2 on
Also raised
Verification. Safe config explores 3,594 distinct states (vs 28 previously) and completes without invariant violation. Gap config still surfaces the HLC-4 counterexample at depth 5; the Makefile correctly identifies it by string match. Spec/Makefile-only change, no Go code touched. No caller audit required. @claude review |
|
Claude finished @bootjp's task in 2m 9s —— View job Review of Round 3 (1123a0e)Todo list:
Round-3 fixes: both correct ✓Fix 1 — TickWall self-exclusion ( /\ \A m \in Nodes \ {n} :
/\ wallNow[n] + 1 <= wallNow[m] + MaxClockSkewMs
/\ wallNow[m] <= wallNow[n] + 1 + MaxClockSkewMs
Both Fix 2 — Gap failure reason validation ( The implementation is correct and well-structured:
A JVM crash, parse error, deadlock, or wrong invariant violation will now correctly fail Outstanding from round-2 Claude review (still not addressed)These two were raised in round 2 and remain unfixed after round 3. 1. Stale
|
| File | Line | Current (broken) | Should be |
|---|---|---|---|
tla/hlc/HLC.tla |
4 | …proposed_tla_safety_spec.md §5.1 |
…partial_tla_safety_spec.md §5.1 |
tla/lib/Env.tla |
4 | …proposed_tla_safety_spec.md §6.1 |
…partial_tla_safety_spec.md §6.1 |
tla/lib/Raft.tla |
4 | …proposed_tla_safety_spec.md §3 / §6.1 |
…partial_tla_safety_spec.md §3 / §6.1 |
Makefile |
17 | …proposed_tla_safety_spec.md §7.2 |
…partial_tla_safety_spec.md §7.2 |
tla/README.md already uses the correct partial_ path.
2. README invariant table names HLC2_CeilingMonotonic — low
tla/README.md:114 lists HLC2_CeilingMonotonic in the "What each module proves" table, but that operator name does not exist in HLC.tla. The correct names are HLC2_NonNeg (in INVARIANTS) and HLC2_Action (in PROPERTIES). A reviewer cross-checking the README against the spec file — as the PR test plan requests — will not find the stated name.
Summary
Both round-3 codex P2 fixes are correctly implemented. The spec is now sound and the make tla-check harness is robust. The two doc-consistency issues from round 2 (stale proposed_ paths in four module headers, wrong invariant name in the README table) remain open and should be fixed before merge — particularly finding 1, since the linked design doc is the primary reference a reviewer uses to cross-check the invariants.
Two doc-consistency findings from claude bot's round 2 review of commit 791c65b (after the proposed -> partial rename in 66b5153 those references became broken links): - Update 4 stale references to the old proposed_ doc path: - tla/hlc/HLC.tla line 4 - tla/lib/Env.tla line 4 - tla/lib/Raft.tla line 4 - Makefile line 17 All now point at docs/design/2026_05_28_partial_tla_safety_spec.md. - tla/README.md "What each module proves" table referenced HLC2_CeilingMonotonic, which does not exist in HLC.tla after the round-1 rework. Replace with the actual operators: HLC1_PerNodeMonotonic (state, with strict seq ordering), HLC1_Action (PROPERTY, transition form), HLC2_NonNeg (state, residual TypeOK content), HLC2_Action (PROPERTY, transition form). Also clarify HLC-1's state form is the strict-seq version, not the earlier vacuous total-order form. Docs-only changes. make tla-check still passes on MCHLC.cfg (3,594 distinct states) and fails on MCHLC_gap.cfg as designed.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Addressed claude bot's round 2 review (on
Docs-only changes. @claude review |
|
Claude finished @bootjp's task in 3m 55s —— View job Review of Round 3 (9901e6c)Todo list:
Round-2 fixes: both correct ✓Fix 1 — Stale All four broken references are now updated:
Fix 2 — README invariant table ( The table now correctly lists all operators that exist in
No more phantom Spec correctness pass — no issues found
With
Together:
New leader
Both
SummaryAll four rounds of review findings have been fully addressed:
The spec is sound and internally consistent. No new findings. Ready to merge. |
Per docs/design/2026_05_28_partial_tla_safety_spec.md §7.3 — the M1 PR is the right place to land the CI workflow alongside the concrete spec. Adds .github/workflows/tla-check.yml: - Triggers on PR (opened/synchronize/reopened) and on push to main, with a path filter matching exactly the §7.3 list: tla/**, Makefile, kv/hlc*.go, kv/coordinator.go, kv/sharded_coordinator.go, kv/transaction.go, kv/fsm.go, kv/lock_resolver.go, store/mvcc_store.go, distribution/**, and the workflow file itself. - Uses Temurin JDK 21 (consistent with jepsen-test.yml). - Caches .cache/tla/tla2tools.jar keyed on the Makefile hash so any TLA_VERSION / TLA_SHA256 change automatically invalidates the cache. - Pre-fetches the jar via `make tla-tools` so the download is in a separate log group from the model check. - Runs `make tla-check`, which inherits the round-2 gap-config validation: a non-zero TLC exit on MCHLC_gap.cfg only counts as success if the output contains the exact HLC-4 violation string. Updates §7.3 of the design doc: removes the "(deferred)" qualifier and names the workflow file as the now-active CI binding. actionlint clean. The full M1 run completes in well under a minute locally (3,594 distinct states on the safe config, HLC-4 counter- example at depth 5 on the gap config).
|
Wired New file:
Updated §7.3 of the design doc: removed the
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca0ecfc0e7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| restore-keys: | | ||
| ${{ runner.os }}-tla-tools- |
There was a problem hiding this comment.
Verify restored TLA jar before use
When TLA_VERSION or TLA_SHA256 changes, the primary cache key misses, and GitHub's cache docs say restore-keys are checked for partial matches; this can restore an older .cache/tla/tla2tools.jar. Because make tla-tools only depends on the existing file target and the checksum recipe in the Makefile is skipped when that file already exists, CI can run an old/unverified TLC jar despite the checksum pin. Drop the prefix restore or make the target always verify/remove stale jars after cache restore.
Useful? React with 👍 / 👎.
codex P2 on PR #856 ca0ecfc: the prefix-match restore-keys could bring back an older tla2tools.jar after a TLA_VERSION/TLA_SHA256 bump in the Makefile. Because $(TLA_JAR) is a file-existence target, the restored stale jar would skip the Makefile's checksum verification entirely — defeating the pin. A full cache miss is the correct behaviour on a version-pin change: the recipe runs, downloads from GitHub, and verifies the SHA-256 before serving the jar to TLC. Documented inline so the next reviewer doesn't re-add restore-keys.
|
Addressed codex P2 on Finding: Fix: Dropped the This is a workflow-only change. The Makefile's checksum verification path is unchanged — and is now load-bearing for every version-pin bump. @claude review |
|
Claude finished @bootjp's task in 1m 24s —— View job Todo list:
Review of Round 4 (2a2238a)The diff is a 7-line addition (explanatory comment) + 2-line deletion ( The root cause and fix The $(TLA_JAR):
@curl -fsSL -o $(TLA_JAR).tmp $(TLA_URL)
@# ... SHA-256 verification ...
@mv $(TLA_JAR).tmp $(TLA_JAR)This recipe only runs when Correctness verified
Comment quality The inline comment (lines 70-75 of the final file) names the codex P2 finding, explains the Everything else unchanged and correct
No new findings. All five rounds of review findings are now fully addressed. Ready to merge. |
…857) ## Summary Auto-trigger AI review (Claude + Codex) whenever a PR touches a file the TLA+ safety spec has an anchor on, so a spec-divergence review **always** fires — never relies on a human remembering to ping. Follows PR #856 (M1 spec + `tla-check` workflow). ## How it works New workflow: `.github/workflows/tla-spec-ai-review.yml` - Fires on `pull_request` (opened / synchronize / reopened / ready_for_review). - Path filter is an exact copy of `tla-check.yml`'s — the two workflows stay paired so every anchored-file change triggers **both** the TLC run AND the AI-review request. - Computes which anchored files actually changed (via `git diff` against the PR base) and skips the comment if none. - Posts a fresh PR comment per push containing both `@claude review` and `@codex review` mentions, plus a per-subsystem checklist of which spec invariants to verify (HLC-4 preconditions, OCC lock encoding, MVCC visibility, Routes catalog versioning). - **Always** `createComment` — never `updateComment`. The upstream `claude.yml` listens on `issue_comment: created` only; editing would not re-trigger Claude. Same for Codex. ## Chain ``` anchored file change in PR -> tla-check.yml runs TLC (catches *spec* bugs) -> tla-spec-ai-review.yml posts comment (with @claude + @codex pings) -> claude.yml fires on @claude → Claude reviews diff -> chatgpt-codex-connector bot fires → Codex reviews diff (both focused on TLA+ divergence) ``` The two workflows are complementary: `tla-check` verifies the spec passes; the AI reviewers verify the implementation matches the spec. ## Docs §7.3 of `docs/design/2026_05_28_partial_tla_safety_spec.md` is updated with the chain diagram, the two-workflow split, and a reminder that the path filter must stay in sync across the two YAML files and the doc § list. ## Verification - `actionlint` clean. - `make tla-check` still passes locally on `main` (3,594 distinct states on safe config, HLC-4 counterexample at depth 5 on gap config). - The workflow itself doesn't run TLC — it only posts a comment — so the trigger chain can be tested by opening a follow-up PR that touches e.g. `kv/hlc.go`. ## Test plan - [ ] CI green (`tla-check` runs again because the workflow path filter touches `tla/**`? No — this PR only adds `.github/workflows/`, which is NOT in the path filter, so `tla-check` will NOT fire on this PR. That's correct.) - [ ] Reviewer opens a tiny follow-up PR touching e.g. `kv/hlc.go` to verify the comment+bot chain fires - [ ] Reviewer confirms the comment template lists the right spec sections per subsystem ## Out of scope - Auto-collapsing older marker comments after a fresh push — embedded as a marker (`<!-- tla-spec-ai-review-marker -->`) so a future PR can add a "minimize older" step if the comment noise becomes a problem. - Triggering AI review when the spec itself changes (no impl divergence to check then — `tla-check` is sufficient). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated design documentation describing the new automated specification review workflow, including trigger conditions, review coordination processes, and how verification ensures implementations maintain alignment with formal specifications * **Chores** * Added automated workflow that monitors pull requests for specification file changes and automatically coordinates review requests to verify specification compliance <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/857?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…#859) ## Summary Implements the **M1 spec follow-up** per [`docs/design/2026_05_28_partial_tla_safety_spec.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) §5.1 HLC-4 precondition **(ii)** — strategy (c) "scan the FSM for max committed HLC and `Observe` before serving any persistence `Next()`". Follows PR #856 (M1 TLA+ spec). ## What lands One line in `kv/fsm.go` `applyRequestErr` after a successful `handleRequest`: ```go if f.hlc != nil && commitTS > 0 { f.hlc.Observe(commitTS) } ``` `HLC.Observe` is the existing CAS-to-max method (`kv/hlc.go:129-140`), so this is a **pure addition** with no signature change and no caller audit required. ## Why "observe on every apply" vs the spec doc's "observe on election" The spec doc describes strategy (c) as "on election, scan the FSM and call `Observe`". The implementation refines this to "Observe on every apply". The two are semantically equivalent for HLC-4 because etcd/raft applies all uncommitted prior-term entries before a new leader serves a write — so by the time `HLC.Next()` runs in the new term, `hlc.last >= every prior commit`. Two advantages over the on-election form: 1. **No protocol change** to `ShardedCoordinator` or `RunHLCLeaseRenewal`. 2. **No race window** between `BecomeLeader` and the first `Observe` — a concurrent client request landing in that window cannot see a stale `hlc.last`. ## Tests `kv/fsm_hlc_observe_test.go`: - `TestApplyObservesCommitTSIntoHLC` — verifies Apply advances `hlc.last` via `Observe(commitTS)`, and preserves monotonicity on a stale-ts apply. - `TestApplyHLCObserveAfterRestart` — simulates the spec's new-leader-after-restart scenario: a fresh HLC (`last = 0`) re-plays a prior leader's write through the FSM and the resulting `Next()` is strictly greater than the prior `commit_ts`. Both tests pass with `-race`. ## Deferred to a follow-up PR The **ceiling-fence half** of the M1 spec contract (HLC-4 precondition **(iii)** — making `HLC.Next()` fail-closed when `wall_now >= physicalCeiling`) is intentionally NOT in this PR. That change alters `HLC.Next()`'s return signature from `uint64` to `(uint64, error)` and requires auditing **18 production call sites** across `kv/`, `adapter/`, and `store/`. Splitting it from the strategy-(c) PR keeps each diff individually reviewable per the loop-discipline reminder for semantic changes. ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — n/a; `Observe` only advances an in-memory clock counter, no persistence path touched. 2. **Concurrency** — `Observe` is the existing CAS-to-max loop; safe under concurrent `Apply` (which is serial per Raft FSM contract anyway) and concurrent `Next` (the CAS handles it). 3. **Performance** — adds one atomic load + CAS to the FSM apply hot path. The CAS only fires when `ts > current` so steady-state is one load. 4. **Data consistency** — `Observe` is monotone (only advances). No semantic change to any other code path. 5. **Test coverage** — two new tests in `kv/fsm_hlc_observe_test.go` covering the basic apply→observe path and the post-restart catchup case. Existing kv + store tests pass. ## Test plan - [ ] CI green (the `tla-check` workflow does NOT fire — this PR is Go-only) - [ ] The new `tla-spec-ai-review` workflow (PR #857) auto-pings Claude + Codex for spec-divergence review - [ ] Reviewer cross-checks the Observe call against HLC-4 (ii) strategy (c) in the design doc and `BecomeLeader_HLC` in `tla/hlc/HLC.tla` ## Out of scope - HLC-4 precondition (iii) ceiling fence (`HLC.Next()` fail-closed) — separate follow-up PR. - Operator-visible metrics for the fence behaviour — lands with (iii).
… tla-check (#858) ## Summary M2 of the [TLA+ safety spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) — the OCC layer encoded as a Percolator-style 2PC model. Follows PR #856 (M1 HLC spec). ## What lands - `tla/occ/OCC.tla` — transaction lifecycle `Idle → Active → Prepared → Committed/Aborted`, lock map `(key, lock_ts) → start_ts`, `LockResolve` action that drains abandoned locks. - `tla/occ/MCOCC.tla` — TLC model module with txn + key symmetry. - `tla/occ/MCOCC.cfg` — safe config (`EnableSafety = TRUE`; expected PASS). - `tla/occ/MCOCC_gap.cfg` — Commit's OCC-1 guard removed (expected FAIL on `OCC1_CommitTsAboveStart`). - `scripts/tla-check.sh` — module-aware orchestrator. Adding M3/M4/M5 needs only a new entry in `TLA_MODULES` + a one-line `case` for the gap-invariant string. - `Makefile tla-check` delegates to the script. - `.gitignore` TLA artefact glob widened from `MCHLC_TTrace_*` to `MC*_TTrace_*` (M2 + M3..M5 covered). - `tla/README.md`: M2 status `Not started` → `Landed`; OCC module description block + invariant table. ## Invariants encoded | # | Statement | Form | |---|---|---| | OCC-1 | `commit_ts(T) > start_ts(T)` for every committed txn | INVARIANT | | OCC-2 | Two committed txns sharing a write key have distinct `commit_ts` and one started after the other committed | INVARIANT | | OCC-3 | Snapshot-read stability under the lock encoding (reads blocked by conflicting uncommitted locks; LockResolve drains them) | INVARIANT | | OCC-4 | No stranded lock at quiescence (bounded-safety form) | INVARIANT | | OCC-5 | `start_ts(T) == read_ts(T)` for every operation | INVARIANT | | OCC5_Action | `start_ts[t]` is assigned once at BeginTxn, never updated | PROPERTY | | CommitTsAssignedOnce | `commit_ts[t]` is assigned once at Commit, never updated | PROPERTY | OCC-L1 (liveness) is deferred to M6 per §8.1. ## Verification (local) ```text $ make tla-check ... TLC: tla/hlc/MCHLC.cfg (correct design, expected PASS) ... 3594 distinct states found, 0 left on queue ... TLC: tla/hlc/MCHLC_gap.cfg (...expected FAIL on HLC4...) OK: MCHLC_gap.cfg failed as designed ... TLC: tla/occ/MCOCC.cfg (correct design, expected PASS) ... 4217 distinct states found, 0 left on queue ... TLC: tla/occ/MCOCC_gap.cfg (...expected FAIL on OCC1...) OK: MCOCC_gap.cfg failed as designed (Invariant OCC1_CommitTsAboveStart is violated) tla-check: all model-check outcomes match the design contract. ``` ## Modelling notes - **HLC abstraction.** OCC.tla uses a single global monotonic `tsCounter`, not the full 48/16 HLC. M5 (composed) will INSTANCE `HLC.tla` for the real layout. This keeps the M2 state space manageable. - **`Commit` borrows ts.** `commit_ts = tsCounter` (does NOT advance). Only `Tick` advances ts, so the bounded model stays inside `0..MaxTs`. OCC-1's `commit_ts > start_ts` then requires a Tick between BeginTxn and Commit. - **OCC-2 non-strict serialization.** A txn with `start_ts = T` sees every version with `commit_ts ≤ T`, so `start_ts = commit_ts(earlier)` is a valid "later runs in the world after earlier committed" boundary (`<=`, not `<`). - **OCC-3 lock encoding** is enforced as an `ENABLED` guard on `ReadKey` — TLC simply does not explore reads in a state where a conflicting lock exists, mirroring the spec's "OCC-3 is asserted only in states where no uncommitted lock at lock_ts ≤ read_ts exists". ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — n/a, spec-only PR, no persistence paths. 2. **Concurrency / distributed failures** — the spec is exactly modelling these. Safe config (4,217 states) explored without invariant violation; gap config produces the documented counterexample. 3. **Performance** — `make tla-check` end-to-end runs in well under 5 seconds on a developer laptop at default bounds. 4. **Data consistency** — OCC-1..OCC-5 + the two PROPERTY transitions are the M2 contract from the design doc. The non-strict OCC-2 form is justified in the spec comment. 5. **Test coverage** — no Go tests added (spec-only). `make tla-check` is the new coverage layer. ## Test plan - [ ] `tla-check` CI workflow runs green (it watches `tla/**`) - [ ] `tla-spec-ai-review` does NOT fire — this PR touches no Go anchor (correct: spec-only changes don't need an AI impl-divergence review) - [ ] Reviewer cross-checks OCC-1..OCC-5 invariants against §5.2 of the design doc - [ ] Reviewer runs `make tla-check` and confirms 4 outcomes (HLC safe pass, HLC gap fail, OCC safe pass, OCC gap fail) ## Out of scope - OCC-L1 (liveness) — deferred to M6. - M3 (MVCC), M4 (Routes), M5 (Composed) — separate future PRs. - Integrating OCC with the real HLC.tla — happens in M5. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added OCC (Optimistic Concurrency Control) specification module with two-phase commit lifecycle, five core safety invariants, and automated model-checking verification suite * **Documentation** * Updated project README with detailed OCC module documentation and marked the OCC milestone as completed * **Chores** * Refactored verification build infrastructure with improved scripts and updated trace-file ignore patterns <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/858?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…#865) ## Summary M5 of the [TLA+ safety spec](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_28_partial_tla_safety_spec.md) — the cross-module safety properties, capping the spec-only roadmap (M1..M5). M1..M4 each modelled a single subsystem in isolation. M5 captures the **seams** between them — the two safety properties that no single-module spec can express. ## What lands - `tla/composed/Composed.tla` — recreates the minimal cross-module state needed for Composed-1 (catalog history `routes[v]`, txn observed catalog version `txnObservedVer[t]`, committing group `txnCommitGroup[t]`) and Composed-3 (monotonic `tsCounter` advancing on every Commit). **Does NOT `INSTANCE` M1..M4** — that would explode the state space well past M5's `<10 min at default bounds` target from §8.1. - `tla/composed/MCComposed.tla` — TLC model with key + group + txn symmetry. - `tla/composed/MCComposed.cfg` / `MCComposed_gap.cfg` — safe (PASS) + gap (FAIL on Composed-1). - `scripts/tla-check.sh` — `TLA_MODULES` gains `"composed"`; `gap_invariant_for` + `mc_basename` gain the corresponding cases. - `tla/README.md` — M5 status `Not started` → `Landed`; Composed module description + invariant table. ## Invariants | # | Statement | Form | |---|---|---| | Composed-1 | Every committed write key was owned by the committing group at the txn's observed catalog version | INVARIANT | | Composed-2 | Vacuously TRUE in this abstraction — `SplitRange` is same-group only per CLAUDE.md | INVARIANT | | Composed-3 | Distinct committed txns have distinct `commit_ts` | INVARIANT | | `Composed_CatalogMonotonic` | `catalogVersion` weakly increases on every step | PROPERTY | | `Composed_TsMonotonic` | `tsCounter` weakly increases on every step | PROPERTY | | `Composed3_TsAction` | Every Commit strictly raises `tsCounter` | PROPERTY | ## Why no `INSTANCE`? Each M1..M4 module tightly bounds its own state for tractable TLC. Bringing all four `INSTANCE`s into one product spec would multiply the state spaces — even a 100x growth lands well past M5's `<10 min at default bounds` target. M5 instead recreates the minimal cross-module state needed to express the two seam invariants. The integration claim is that this projection preserves the invariants the full product would assert; reviewers can sanity-check each Composed action against the corresponding M1..M4 action (e.g., `ProposeRouteChange` mirrors `Routes.tla`, `BeginTxn` + `Commit` mirror `OCC.tla` with a cross-module observed-catalog-version pin). ## Verification (local) ```text $ make tla-check HLC safe 3,594 distinct states + HLC-4 gap ce OCC safe 150 distinct states + OCC-1 gap ce MVCC safe 79 distinct states + MVCC-4 gap ce Routes safe 29 distinct states + Routes-4 gap ce Composed safe 1,684 distinct states + Composed-1 gap ce tla-check: all model-check outcomes match the design contract. ``` The Composed-1 counterexample is the canonical 4-step regression: 1. `BeginTxn(t)` — `txn` observes `catalogVersion = 0`, where `routes[0][k1] = g1`. 2. `WriteIntent(t, k1)`. 3. `ProposeRouteChange(k1, g2)` — `catalogVersion → 1`, `routes[1][k1] = g2`. 4. `Commit(t, g2)` (gap mode) — committing group is `g2`, but `routes[0][k1] = g1 ≠ g2`. **Composed-1 fails.** ## Self-review (5-lens per CLAUDE.md) 1. **Data loss** — out of scope (cross-module invariants check group ownership, not durability). 2. **Concurrency** — Composed-1 is exactly the concurrency seam between OCC and Routes; the gap counterexample is one such interleaving. 3. **Performance** — `make tla-check` end-to-end runs in ~10 s on a dev laptop, well under the 10-min target. 4. **Data consistency** — Composed-1 (cross-module write/route consistency) + Composed-3 (cross-txn ts uniqueness) + 3 PROPERTY transitions. 5. **Test coverage** — no Go tests added (spec-only). `make tla-check` is the coverage layer. ## Roadmap status (post-merge) | Milestone | Status | |---|---| | M1 HLC | Landed (PR #856) | | M2 OCC | Landed (PR #858) | | M3 MVCC | Landed (PR #861) | | M4 Routes | Landed (PR #861 via stacked #862) | | **M5 Composed** | **This PR** | | M6 liveness (OPTIONAL) | Not started — can follow as a separate PR | | M1 Go follow-up (ceiling fence) | Not started — separate Go PR with caller audit | ## Test plan - [ ] `tla-check` CI runs green (watches `tla/**`) - [ ] `tla-spec-ai-review` does NOT fire — spec-only PR - [ ] Reviewer cross-checks Composed-1 / Composed-3 against §5.5 of the design doc - [ ] Reviewer runs `make tla-check` and confirms 10 outcomes (5 safe pass + 5 gap fail) - [ ] Reviewer sanity-checks the `INSTANCE`-free projection claim — each Composed action mirrors the corresponding M1..M4 action ## Out of scope - M6 liveness (OCC-L1, Routes-L1) — separate PR if needed - HLC-4 (iii) ceiling fence Go implementation — separate code PR - Cross-group `SplitRange` in the real implementation (currently same-group only)
…ini medium on PR #878) Closes a silent-false-positive class in the gap-config loop introduced by the multi-gap refactor in this PR's parent commit. The loop does: while IFS='|' read -r gap_stem gap_inv; do ... if printf '%s\n' "$gap_out" | grep -qF "$gap_inv"; then OK fi done If a gap_configs_for() entry is malformed (missing the '|' delimiter, e.g. a typo dropped the invariant string), `read` puts the entire line in gap_stem and leaves gap_inv empty. Then `grep -qF ""` matches every line of TLC output unconditionally, so the gap config 'passes' even when TLC failed for an unrelated reason — exactly the silent-false-positive class codex P2 on PR #856 round 2 introduced the gap-invariant-string match to prevent. Fix: validate both fields are non-empty before the body runs. A trailing blank line (the existing skip case) is still skipped; a malformed entry now produces a clear ERROR with the stem / inv it read so the maintainer can fix gap_configs_for() directly. Verified: `make tla-check` still produces the 11 expected outcomes (5 safe PASS + 5 gap FAIL + 1 currentfence gap FAIL on Composed-1a) after the guard.
Summary
M1 of the TLA+ safety spec proposal lands:
tla/lib/Env.tla+tla/lib/Raft.tla(shared scaffolding).tla/hlc/HLC.tlawith invariants HLC-1..HLC-4 +TypeOK.ASSUMEinEnv.tla.BecomeLeader_HLC(hlc.Observe(maxAppliedHLC)analogue).IssueTimestamp(wallNow[n] < physicalCeiling[n]).tla/hlc/MCHLC.tla+MCHLC.cfg(correct design — TLC passes) andMCHLC_gap.cfg(preconditions disabled — TLC produces the HLC-4 counterexample the design doc anticipated).make tla-checkMakefile target: downloads checksum-pinnedtla2tools.jarv1.8.0 (SHA-256 237332bd…) to.cache/tla/and runs both configurations, inverting the exit code on the gap config so the expected-FAIL counts as PASS for CI.tla/README.mdcovering install / run / what each module proves / reading list / how to map a TLC counterexample back to the implementation, per §7.1 of the design doc.proposed_→partial_(per CLAUDE.md lifecycle), headerStatus:updated, in a dedicated commit.Verification (local)
The gap-config counterexample (depth 5):
BecomeLeader(n1)— n1 elected term 1IssueTimestamp(n1)— commits(wall=0, logical=0)in term 1IssueTimestamp(n1)— commits(0, 1)in term 1 (logical increments)BecomeLeader_HLC(n2)— n2 elected term 2; EnableSafety = FALSE so strategy (c) does NOT raise n2'shlcLasttomaxAppliedHLC = (0, 1)IssueTimestamp(n2)— commits(0, 1)in term 2 (same as term 1's last commit; HLC-4 requires strict>)This is exactly the bug the design doc §5.1 HLC-4 (ii) said the spec would surface.
Self-review (per CLAUDE.md five-lens checklist)
make tla-checkconfigurations explore the full bounded state space.make tla-checkruns in <1s on a developer laptop at default bounds.MCHLC.cfgproves HLC-1..HLC-4 hold under preconditions (i)/(ii)/(iii);MCHLC_gap.cfgproves the preconditions are necessary by surfacing the documented counterexample.make tla-checkis the new coverage layer; running it is the test.Out of scope
This PR is spec-only. The Go code changes that strategy (c) and the ceiling fence require (per §5.1 HLC-4 (ii) and (iii)) are not in this PR — they need:
MaxAppliedHLC()tokv/fsm.go(running max across applied write entries).hlc.Observe(fsm.MaxAppliedHLC())on new-term detection inkv/sharded_coordinator.goRunHLCLeaseRenewal, before any persistenceNext()is served.wallNow < physicalCeilingfence inkv/hlc.goNext()with a fail-closed return path and operator-visible metrics — caller audits of every site that callsNext()will be required (per the loop-feedback discipline) sinceNext()changes from infallible to error-returning.Tracked as a follow-up issue; this PR lands the design contract that the follow-up must satisfy.
Test plan
make tla-check(CI integration is deferred to a future PR per §7.3 of the design doc — local invocation is the contract for now)make tla-checklocally and confirms the safe pass + gap-config counterexampleHLC.tlaagainst the corresponding entry in §5.1 of the design doctla/README.md"How to interpret a TLC failure" against the current Go source