kv(hlc): observe applied commit_ts into HLC.last — HLC-4 strategy (c) by bootjp · Pull Request #859 · bootjp/elastickv

bootjp · 2026-05-28T19:26:46Z

Summary

Implements the M1 spec follow-up per docs/design/2026_05_28_partial_tla_safety_spec.md §5.1 HLC-4 precondition (ii) — strategy (c) "scan the FSM for max committed HLC and Observe before serving any persistence Next()".

Follows PR #856 (M1 TLA+ spec).

What lands

One guarded call in applyRequestErr (kv/fsm.go), placed after a successful handleRequest:

if f.hlc != nil && commitTS > 0 {
    f.hlc.Observe(commitTS)
}

HLC.Observe is the existing CAS-to-max method (kv/hlc.go:129-140), so this is a pure addition with no signature change and no caller audit required.
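For readers unfamiliar with the CAS-to-max pattern, here is a minimal sketch of what such an Observe typically looks like. It is not the code in kv/hlc.go; the `clock` type and its `last` field are hypothetical stand-ins, and the real HLC carries more state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// clock is a hypothetical stand-in for the HLC; only the last-seen
// timestamp is modelled here.
type clock struct {
	last atomic.Uint64
}

// Observe raises last to ts if ts is larger. It never lowers last.
func (c *clock) Observe(ts uint64) {
	for {
		cur := c.last.Load()
		if ts <= cur {
			return // stale or equal timestamp: one load, no CAS
		}
		if c.last.CompareAndSwap(cur, ts) {
			return
		}
		// A concurrent writer changed last; reload and retry.
	}
}

func main() {
	var c clock
	c.Observe(100)
	c.Observe(40) // stale, ignored
	fmt.Println(c.last.Load())
}
```

Because the early return happens before any CompareAndSwap, a stale timestamp costs a single atomic load, which is the steady-state cost on the apply path.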

Why "observe on every apply" vs the spec doc's "observe on election"

The spec doc describes strategy (c) as "on election, scan the FSM and call Observe". The implementation refines this to "Observe on every apply". The two are semantically equivalent for HLC-4 because etcd/raft applies all uncommitted prior-term entries before a new leader serves a write — so by the time HLC.Next() runs in the new term, hlc.last >= every prior commit.

Two advantages over the on-election form:

1. No protocol change to ShardedCoordinator or RunHLCLeaseRenewal.
2. No race window between BecomeLeader and the first Observe: a concurrent client request landing in that window cannot see a stale hlc.last.
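The replay argument can be illustrated with a small, single-threaded model. This is a sketch under simplifying assumptions, not the repository's code: the `hlc` and `fsm` types are hypothetical, and `Next` ignores wall time, which the real clock also folds in.

```go
package main

import "fmt"

// hlc is a hypothetical, single-threaded stand-in for the real clock.
type hlc struct{ last uint64 }

// Observe advances last to ts when ts is larger.
func (h *hlc) Observe(ts uint64) {
	if ts > h.last {
		h.last = ts
	}
}

// Next issues a timestamp strictly above everything seen so far.
// Wall-clock input is omitted in this model.
func (h *hlc) Next() uint64 {
	h.last++
	return h.last
}

// fsm applies committed entries and feeds each commit timestamp to the clock.
type fsm struct{ hlc *hlc }

func (f *fsm) apply(commitTS uint64) {
	// The state mutation would happen here.
	if f.hlc != nil && commitTS > 0 {
		f.hlc.Observe(commitTS)
	}
}

func main() {
	// Entries committed by a prior leader, replayed on a node with a fresh clock.
	priorCommits := []uint64{500, 510, 507}
	f := &fsm{hlc: &hlc{}}
	for _, ts := range priorCommits {
		f.apply(ts)
	}
	fmt.Println(f.hlc.Next() > 510)
}
```

Once every prior entry has passed through apply, the clock already dominates the largest prior commit, so no separate election-time scan is needed in this model.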

Tests

kv/fsm_hlc_observe_test.go:

TestApplyObservesCommitTSIntoHLC — verifies Apply advances hlc.last via Observe(commitTS), and preserves monotonicity on a stale-ts apply.
TestApplyHLCObserveAfterRestart — simulates the spec's new-leader-after-restart scenario: a fresh HLC (last = 0) replays a prior leader's write through the FSM, and the resulting Next() is strictly greater than the prior commit_ts.

Both tests pass with -race.

Deferred to a follow-up PR

The ceiling-fence half of the M1 spec contract (HLC-4 precondition (iii) — making HLC.Next() fail-closed when wall_now >= physicalCeiling) is intentionally NOT in this PR. That change alters HLC.Next()'s return signature from uint64 to (uint64, error) and requires auditing 18 production call sites across kv/, adapter/, and store/. Splitting it from the strategy-(c) PR keeps each diff individually reviewable per the loop-discipline reminder for semantic changes.
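To make the deferred signature change concrete, the following is a rough sketch of a fail-closed Next. Everything here is hypothetical (the `fencedClock` type, the `ErrCeilingReached` name, the injected `wallNow` function); the follow-up PR may look quite different.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCeilingReached is a hypothetical sentinel error.
var ErrCeilingReached = errors.New("hlc: wall clock reached physical ceiling")

// fencedClock is an illustrative, single-threaded stand-in for the HLC.
type fencedClock struct {
	last            uint64
	physicalCeiling uint64        // upper bound granted by the lease, in ms
	wallNow         func() uint64 // injected wall clock, in ms
}

// Next fails closed: once the wall clock is at or past the ceiling it
// returns an error instead of issuing a timestamp.
func (c *fencedClock) Next() (uint64, error) {
	if c.wallNow() >= c.physicalCeiling {
		return 0, ErrCeilingReached
	}
	c.last++
	return c.last, nil
}

func main() {
	now := uint64(999)
	c := &fencedClock{physicalCeiling: 1000, wallNow: func() uint64 { return now }}

	ts, err := c.Next()
	fmt.Println(ts, err) // inside the lease window

	now = 1000
	ts, err = c.Next()
	fmt.Println(ts, err) // at the ceiling: refused
}
```

Every caller then has to handle the error, which is why the change needs a call-site audit rather than a mechanical rename.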

Self-review (5-lens per CLAUDE.md)

Data loss — n/a; Observe only advances an in-memory clock counter, no persistence path touched.
Concurrency — Observe is the existing CAS-to-max loop; safe under concurrent Apply (which is serial per Raft FSM contract anyway) and concurrent Next (the CAS handles it).
Performance — adds one atomic load + CAS to the FSM apply hot path. The CAS only fires when ts > current so steady-state is one load.
Data consistency — Observe is monotone (only advances). No semantic change to any other code path.
Test coverage — two new tests in kv/fsm_hlc_observe_test.go covering the basic apply→observe path and the post-restart catchup case. Existing kv + store tests pass.

Test plan

CI green (the tla-check workflow does NOT fire — this PR is Go-only)
The new tla-spec-ai-review workflow (PR #857, "ci(tla): auto-request Claude + Codex review on anchored-file changes") auto-pings Claude + Codex for spec-divergence review
Reviewer cross-checks the Observe call against HLC-4 (ii) strategy (c) in the design doc and BecomeLeader_HLC in tla/hlc/HLC.tla

Out of scope

HLC-4 precondition (iii) ceiling fence (HLC.Next() fail-closed) — separate follow-up PR.
Operator-visible metrics for the fence behaviour — lands with (iii).

Implements the M1 spec follow-up per docs/design/2026_05_28_partial_tla_safety_spec.md §5.1 HLC-4 precondition (ii), strategy (c): "scan the FSM for max committed HLC and Observe before serving any persistence Next()".

Refinement over the spec doc's strategy (c) description: instead of detecting a new term in ShardedCoordinator and calling Observe once at election time, the FSM observes every commit_ts on apply. This is semantically equivalent for HLC-4 (etcd/raft applies all uncommitted prior-term entries before the new leader serves a write, so by the time HLC.Next() runs in the new term, hlc.last >= every prior commit), and has two advantages over the on-election form:

1. No protocol change to ShardedCoordinator or RunHLCLeaseRenewal.
2. No race window between BecomeLeader and the first Observe: a concurrent client request landing in that window cannot see a stale hlc.last.

Concretely, one line in kv/fsm.go applyRequestErr after a successful handleRequest:

if f.hlc != nil && commitTS > 0 {
    f.hlc.Observe(commitTS)
}

f.hlc.Observe is the existing CAS-to-max method (kv/hlc.go:129-140), so this is a pure addition with no signature change. Apply is the only call site exercised; no Next() caller audit required.

Tests (kv/fsm_hlc_observe_test.go):

- TestApplyObservesCommitTSIntoHLC: verifies Apply advances hlc.last via Observe(commitTS), and preserves monotonicity on a stale-ts apply.
- TestApplyHLCObserveAfterRestart: simulates the spec's new-leader-after-restart scenario, in which a fresh HLC (last = 0) replays a prior leader's write through the FSM and the resulting Next() is strictly greater than the prior commit_ts.

The ceiling-fence half of the M1 spec contract (HLC-4 precondition (iii), making HLC.Next() fail-closed when wall_now >= physicalCeiling) is intentionally deferred to a follow-up PR. That change alters HLC.Next()'s return signature from `uint64` to `(uint64, error)` and requires auditing 18 production call sites across kv/, adapter/, and store/ (per the spec doc and the loop-discipline reminder for semantic changes); splitting it from this strategy-(c) PR keeps each diff individually reviewable.

github-actions · 2026-05-28T19:27:00Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (dafd215):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).
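As background for the 48/16 bit layout mentioned in the kv/hlc*.go item, a timestamp of that shape can be packed and unpacked as below. This is a sketch under an assumption: that the upper 48 bits hold physical milliseconds and the lower 16 bits hold the logical counter, which is the common HLC arrangement but is not confirmed by this PR. The `pack` and `unpack` helpers are hypothetical names.

```go
package main

import "fmt"

const (
	logicalBits = 16                     // assumed width of the logical counter
	logicalMask = (1 << logicalBits) - 1 // low 16 bits set
)

// pack places physical milliseconds in the high bits and the logical
// counter in the low 16 bits of a single uint64.
func pack(physicalMs uint64, logical uint16) uint64 {
	return physicalMs<<logicalBits | uint64(logical)
}

// unpack splits a packed timestamp back into its two components.
func unpack(ts uint64) (physicalMs uint64, logical uint16) {
	return ts >> logicalBits, uint16(ts & logicalMask)
}

func main() {
	ts := pack(1234, 7)
	p, l := unpack(ts)
	fmt.Println(p, l)

	// A later millisecond always compares greater, whatever the logical part.
	fmt.Println(pack(5, 65535) < pack(6, 0))
}
```

With this arrangement plain integer comparison orders timestamps by physical time first and logical counter second, which is what lets a CAS-to-max on a single uint64 work.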

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

gemini-code-assist

Code Review

This pull request implements the HLC-4 strategy (c) in the FSM by observing every applied commit timestamp to ensure the node's Hybrid Logical Clock (HLC) dominates the maximum committed timestamp visible in the FSM. This prevents logical-handoff gaps when a follower is elected as a leader. It also introduces comprehensive unit tests (TestApplyObservesCommitTSIntoHLC and TestApplyHLCObserveAfterRestart) to verify this behavior under normal operation and post-restart recovery scenarios. There are no review comments, so I have no feedback to provide.

gemini-code-assist Bot reviewed May 28, 2026

bootjp merged commit d5ad5d3 into main May 28, 2026
12 checks passed

bootjp deleted the feat/m1-hlc-strategy-c-and-fence branch May 28, 2026 20:22
