Skip to content

docs(design): idempotent FSM snapshot restore on cold start#910

Merged
bootjp merged 10 commits into
mainfrom
docs/idempotent-snapshot-restore-design
Jun 3, 2026
Merged

docs(design): idempotent FSM snapshot restore on cold start#910
bootjp merged 10 commits into
mainfrom
docs/idempotent-snapshot-restore-design

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 2, 2026

Summary

PR #909 (HEALTH_TIMEOUT_SECONDS 60s → 300s) の 根本原因側 の設計提案。コードは無変更、ドキュメント追加のみ。

何を解決したいか

internal/raftengine/etcd/wal_store.go:117 loadWalState が、WAL の persisted snapshot pointer が空でない限りコールド起動の度に
restoreSnapshotState(fsm, snapshot, fsmSnapDir) を呼びます。EKV/token-format snapshot の経路では最終的に
pebbleStore.RestorerestorePebbleNativeAtomicswapInTempDB (store/lsm_store.go:1816, :2002) が走り、
fsm.db の全内容を sibling tempdir に書き戻してから rename で差し替える = 起動コストが O(snapshot size)

5 ノード 192.168.0.x クラスタで実測 ~46 s (Found 2 WALs 012720/012734 のオープン → Found 0 WALs の tempdir 作成 → Found 4 WALs 001309..001315 の rename 後再オープン)。
これが PR #909 で観測された "gRPC port did not come up" の主因です。

問題は この復元の大半が冗長 だという点:

  • 毎回成功した fsm.Apply が変更内容を Pebble の WriteBatch で durable に書き込み済み
  • FSM が applied = Y > snapshot.Metadata.Index = X まで進んだ後の fsm.db は state ≥ X を保持済み
  • にもかかわらず次のコールド起動で その状態を捨てて古い snapshot から再構築 → raft が同じエントリを再 replay

提案アーキテクチャ

(1) StateMachine.Apply(index, data)

// before
Apply(data []byte) any
// after
Apply(index uint64, data []byte) any

Engine の唯一の呼び出しサイト (engine.go:1769 applyNormalEntry) で entry.Index を渡すだけ。
in-tree 実装は機械的変換、外部実装は無し。

(2) pebbleStore の meta key に applied-index を atomic 同梱

kvFSM.Apply(index, data) が leaf MVCC mutation で取得する pebble.Batchkey=metaAppliedIndex, value=BE64(index) を追加。
Pebble の batch は atomic なので torn-write window なし

(3) restoreSnapshotState 条件分岐

if skip, err := fsmAlreadyAtIndex(fsm, tok.Index); err != nil {
    return err
} else if skip {
    // FSM data on disk is at least as fresh as the snapshot.
    return nil
}
return openAndRestoreFSMSnapshot(...)

AppliedIndexReader interface 未実装の FSM、または meta key 不在 (アップグレード直後の最初の再起動) は 現状の full restore にフォールバック → strictly additive。

期待される効果

  • コールド起動の steady-state コスト: pebble.Open + WAL replay + raft follower 化 + gRPC bind の <5 秒 (FSM サイズに依存しない)
  • raft term 暴発 (timeout-induced restart loop による) の解消
  • 将来 HEALTH_TIMEOUT_SECONDS を引き下げ可能 (Branch 5 で実施)

実装シーケンス (別 PR 4 本に分割)

Branch 内容 動作変化
B2 StateMachine.Apply(index, data) interface 変更 + 全実装更新 (index は受け取って捨てるだけ) なし
B3 kvFSM で index を leaf mutation に流し、pebbleStore で meta key を WriteBatch 同梱 + LastAppliedIndex() 公開 + 単体テスト meta key が書き込まれ始める (まだ skip は無効)
B4 restoreSnapshotStatefsmAlreadyAtIndex ガート + metrics + INFO log ユーザー可視な perf 改善 ここで発火
B5 HEALTH_TIMEOUT_SECONDS を tighter な値に引き下げ docker restart loop 抑止

各 branch は対応する invariant を unit test でガード。

Crash-safety / Compatibility

詳細は docs/design/2026_06_02_idempotent_snapshot_restore.md の §3 §5 参照。
要約:

  • meta key とデータ mutation は同一 pebble.Batch で commit されるため、クラッシュ後の Pebble WAL replay は両者を一緒に restore するか両者を一緒に discard する。中間状態はあり得ない。
  • 8-byte length check 失敗時は読み取り側が conservative に "missing" として扱い、full restore にフォールバック。
  • 旧 fsm.db に meta key が無い場合は最初の再起動は現状動作。データ移行不要。
  • ロールバックは wal_store.go の skip 分岐を revert するだけ。meta key は dead data として残るが無害。

観測対象

fsm_cold_start_restore_total{outcome="executed|skipped|fallback"}
fsm_cold_start_applied_index_gap (executed 時の snapshot.Index - LastAppliedIndex)
  • skip 時の INFO log: restoreSnapshotState skipped (FSM already at index N >= snapshot index X)

レビュー観点

  • §3 の Pebble batch 原子性に基づく crash-safety 証明が説得力あるか
  • §4 の "skip 後の replay 冪等性" 議論で見落としは無いか (kvFSM.Apply が現状 idempotent と読めているか)
  • Branch 分割 (4 本) の粒度が妥当か / もっと束ねるべきか
  • HLC lease entry の index 監視に関する Open Question (§Open Questions 1 番目) の解像度

設計が固まり次第 Branch 2 (interface change) から着手します。

Summary by CodeRabbit

  • Documentation
    • Revised design doc detailing idempotent snapshot restore: safe skip of heavy snapshot bodies when on-disk state is proven current, header CRC verification for the skip path, and a durable applied-index write before snapshot persistence. Clarifies crash-safety, observability, compatibility, and implementation plan.

Review Change Stack

Companion to PR #909 (HEALTH_TIMEOUT_SECONDS 60s -> 300s).

PR #909 is a band-aid: the 60s ceiling kept tripping because
restoreSnapshotState unconditionally calls fsm.Restore on every cold
start when the WAL has a non-empty snapshot pointer, which on the
pebbleStore backend dispatches to restorePebbleNativeAtomic +
swapInTempDB and rewrites the entire FSM into a sibling temp directory
before renaming it into place. For multi-GiB FSMs this is ~46 s on the
192.168.0.x cluster's hardware.

The restore is mostly redundant: each successful fsm.Apply already
persists its mutation durably, so after the FSM applied entry Y >
snapshot.Metadata.Index = X, the on-disk fsm.db contains state >= X.
On the next cold start we tear that state down and rebuild it from the
older snapshot just to have raft replay the same entries we already
had on disk.

Proposed structure:

  1. StateMachine.Apply(index, data) — thread the raft applied-index
     into the FSM so kvFSM can persist it atomically.
  2. pebbleStore bundles a meta key (last_applied_index) in the same
     pebble.Batch as the data mutation. Pebble batches are atomic, so
     the index and the data move together; no torn-write window.
  3. restoreSnapshotState gains an AppliedIndexReader gate: skip when
     stored.LastAppliedIndex >= snapshot.Metadata.Index. Fall back to
     the current full restore when the FSM cannot self-report or the
     meta key is missing (first restart after upgrade).

Expected cold start in the steady-state case: pebble.Open + WAL replay
+ raft follower-ization + grpc bind, under 5 s independent of FSM size.

Doc-only PR; no code change yet. Implementation is sequenced as four
follow-up PRs (interface change -> meta key -> skip wiring -> metrics
+ HEALTH_TIMEOUT_SECONDS tightening), each individually reviewable.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 41 minutes and 35 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 338ccfc0-605c-40fd-8fca-71de6562c6f4

📥 Commits

Reviewing files that changed from the base of the PR and between d5a1bd2 and 5339da0.

📒 Files selected for processing (1)
  • docs/design/2026_06_02_idempotent_snapshot_restore.md
📝 Walkthrough

Walkthrough

Round‑6 proposal: expands the idempotent FSM snapshot-restore design to thread tok.CRC32C through the skip path, add CRC-verified header application on skip, and persist a durable applied-index meta-bump at snapshot persist sites so the conservative skip gate is provably safe.

Changes

Design Proposal: Idempotent FSM Snapshot Restore (Round 6)

Layer / File(s) Summary
Title & metadata update
docs/design/2026_06_02_idempotent_snapshot_restore.md
Updates document title/status to "Proposal (Round 6 — thread tok.CRC32C through the skip path)" and date/round metadata.
Problem, Goals, Non-Goals
docs/design/2026_06_02_idempotent_snapshot_restore.md
Rewrites Problem/Goal/Non-Goals to reflect later-round constraints and steady-state skip expectations.
Applied-index plumbing & ApplyIndexAware seam
docs/design/2026_06_02_idempotent_snapshot_restore.md
Re-specifies reuse of ApplyIndexAware, threading pendingApplyIdx into data-apply leaf paths (incl. DEL_PREFIX), and clarifies encryption-only entries not advancing persisted applied-index.
Store reader & conditional restore skip gate
docs/design/2026_06_02_idempotent_snapshot_restore.md
Introduces AppliedIndexReader/LastAppliedIndex access with dbMu lock-ordering and a conservative restoreSnapshotState skip gate that skips only when provable.
Skip-path header preservation & CRC verification
docs/design/2026_06_02_idempotent_snapshot_restore.md
Adds SnapshotHeaderApplier contract: pass tok.CRC32C, verify file size, footer CRC vs tokenCRC, and full-body CRC before applying HLC physical ceiling and Stage 8a cutover on skip.
Durability: AppliedIndexWriter & meta-bump at persist sites
docs/design/2026_06_02_idempotent_snapshot_restore.md
Introduces AppliedIndexWriter/SetDurableAppliedIndex and mandates bumping the applied-index meta key before persist.SaveSnap at both config-snapshot creation and steady-state local snapshot paths.
Correctness & idempotency after skip (OCC cases)
docs/design/2026_06_02_idempotent_snapshot_restore.md
Updates OCC/replay idempotency arguments to align with the revised skip behavior and header verification.
Observability & Implementation Plan
docs/design/2026_06_02_idempotent_snapshot_restore.md
Expands metrics/logging (executed/skipped/fallback, fallback_reason), and updates B2/B3 implementation/test outlines for the added CRC/header applier and meta-bump work.
Open questions, retractions, and notes
docs/design/2026_06_02_idempotent_snapshot_restore.md
Replaces earlier errata framing with round-by-round retractions (Rounds 2–5) and updates open questions/out-of-scope items.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • bootjp/elastickv#877: Related Stage 8a snapshot header/ceiling cutover design that the Round‑6 skip path and header format depend upon.

Poem

I’m a rabbit in docs, nibbling threads of CRC,
I hop through indexes, keeping state tidy and free,
I mark the durable spot before snapshots sleep,
So restores can skip and wake without a leap,
Happy hops — small fixes, giant glee. 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: a design proposal for idempotent FSM snapshot restore during cold start, which is the core focus of the added documentation file.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/idempotent-snapshot-restore-design

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

PR は docs/design/2026_06_02_idempotent_snapshot_restore.md の追加のみ (コード変更なし)。PR #909 の band-aid に対する根本原因側の設計提案です。

レビュー観点:

  • §3 "Crash-safety argument" — meta key とデータ mutation を同一 pebble.Batch で commit する根拠が説得力あるか (Pebble の batch 原子性の引用が正確か)
  • §4 "Idempotency of replay after skip" — kvFSM.Apply が現状すでに raft の snapshot recovery 要件である冪等性を満たしていると読めるか
  • §1 Interface change (StateMachine.Apply(data) → Apply(index, data)) の影響範囲が正しく見積もられているか (在庫した呼び出しサイトの抜けが無いか)
  • 4 分割した実装シーケンス (Branch 2/3/4/5) の粒度が妥当か / 束ねるべきか / 切り過ぎているか
  • HLC lease entry 経由の Apply (kvFSM.applyHLCLease) で meta key 更新をどう扱うかの Open Question (§Open Questions 1 番目) に対する筋の良い解

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 5m 2s —— View job


Design Review: Idempotent FSM Snapshot Restore on Cold Start

Todo List:

  • Load review tools
  • Read design document (docs/design/2026_06_02_idempotent_snapshot_restore.md)
  • Read relevant source files (kv/fsm.go, store/lsm_store.go, internal/raftengine/etcd/wal_store.go, kv/fsm.go)
  • Evaluate §3 crash-safety argument (Pebble batch atomicity)
  • Evaluate §4 idempotency of replay after skip
  • Evaluate §1 interface change scope / call-site inventory
  • Evaluate branch split granularity (B2/B3/B4/B5)
  • Evaluate HLC lease entry open question
  • Post final review

Review

Overall the design is well-motivated and the core mechanism is sound. The Pebble-batch atomicity argument for crash safety holds, and the observability story (metrics + INFO log) is well thought through. The issues below range from a significant architectural redundancy in B2, to a correctness gap in the error-handling path, to a precision issue in the idempotency argument.


§3 — Crash-safety argument (Pebble batch atomicity)

Verdict: Sound, with two caveats the doc should address.

The batch-atomic meta key pattern is already in production in this codebase: applyMutationsWithOpts (lsm_store.go:1324) piggybacks metaLastCommitTSBytes into the exact same pebble.Batch that carries the data mutations, for the same crash-safety reason. The proposed metaAppliedIndex key follows this precedent correctly.

Caveat 1 — ELASTICKV_FSM_SYNC_MODE=nosync: §3 claims "both are durable per the batch's pebble.Sync option" but ELASTICKV_FSM_SYNC_MODE=nosync causes raftApplyWriteOpts() to return pebble.NoSync. In nosync mode neither the data nor the meta key is fsynced to disk; durability comes from raft log replay, not from the batch commit itself. The crash-safety invariant still holds (both meta key and data land in the same Pebble WAL record and either both survive WAL replay or both are discarded), but the durability mechanism is different. The doc should clarify: the batch atomicity argument applies in both modes; in nosync mode the safety boundary shifts to the raft WAL rather than the Pebble fsync. This matters in a codebase that documents ELASTICKV_FSM_SYNC_MODE as a first-class operator lever.

Caveat 2 — DeletePrefixAtRaft path: handleDelPrefixDeletePrefixAtRaft builds its own pebble.Batch inside deletePrefixAtWithOpts (lsm_store.go:1378–1401) and commits it independently of applyAtIndex. If the proposed design threads index only through applyAtIndexApplyMutationsRaft, DEL_PREFIX entries will not bump metaAppliedIndex. After a crash following a DEL_PREFIX apply, LastAppliedIndex could be stale relative to the actual applied index, causing unnecessary (conservative) full restores rather than skips. This should be called out in the design and handled in Branch 3.

Caveat 3 — applyEncryption path: applyEncryption dispatches via WriteSidecar (not a pebble.Batch) and does not go through ApplyMutationsRaft. Like HLC lease entries (already in Open Questions), encryption opcodes won't advance metaAppliedIndex. This is analogous to the HLC lease concern and should be mentioned alongside it.


§4 — Idempotency of replay after skip

Verdict: The invariant holds, but the stated mechanism is imprecise for OCC transactions — the explanation needs correction.

The design claims: "every mutation is keyed by its raft metadata; re-applying the same entry produces the same Pebble write." This is accurate for raw requests (where startTS = commitTS = T, so checkConflicts sees latestCommitTS(T) > T → false → no conflict, and the Pebble write is a deterministic overwrite of the same MVCC cell). It is not accurate for OCC one-phase transactions (startTS < commitTS):

After first apply, latestCommitTS(key) = commitTS. On re-apply, checkConflicts evaluates latestCommitTS(commitTS) > startTStrueErrWriteConflict. The key is NOT re-written.

This is nonetheless safe, for a different reason: ErrWriteConflict is returned as the FSM response value (not as the error return from applyNormalEntry). It does not implement HaltApply. Therefore setApplied advances and the FSM state is already correct from the first apply. The effect is indistinguishable from "produced the same write" from a correctness standpoint — but the mechanism is "conflict detected, no-op, apply advances" not "idempotent Pebble write."

handleCommitRequest has explicit idempotency via applyCommitWithIdempotencyFallback and commitApplyStartTS (commit-record detection). handleAbortRequest is safe via shouldClearAbortKey (lock-missing → no-op) and appendRollbackRecord (marker-present check). handlePrepareRequest is effectively safe because lock/intent keys that were cleaned up by COMMIT would return write conflicts on re-apply (same non-fatal, non-HaltApply path).

Recommended fix: Replace the single sentence in §4 with a two-case argument: (a) raw requests produce an idempotent Pebble overwrite of the same MVCC cell; (b) OCC operations whose effects are already present produce a non-fatal ErrWriteConflict response that does not implement HaltApply, so setApplied advances and FSM state (already correct from the first apply) is unchanged.


§1 — Interface change scope: Apply(index, data) — is Branch 2 necessary?

Verdict: The ApplyIndexAware seam already delivers entry.Index to kvFSM without a breaking interface change. Branch 2 should be justified or eliminated.

Looking at statemachine.go and engine.go:2292–2295:

// engine.go — already in production
if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
    aware.SetApplyIndex(entry.Index)
}
return e.fsm.Apply(payload), nil

And kv/fsm.go:128–130 (also already in production):

func (f *kvFSM) SetApplyIndex(idx uint64) {
    f.pendingApplyIdx = idx
}

SetApplyIndex is called on the same goroutine, immediately before Apply. f.pendingApplyIdx already holds entry.Index when Apply executes. The meta key write in Branch 3 could use f.pendingApplyIdx directly — no interface change is needed.

The ApplyIndexAware seam was added explicitly to avoid changing StateMachine.Apply's signature (see the doc-comment: "The StateMachine.Apply(data) signature does not carry it, so this opt-in interface is the seam that delivers it without forcing every existing implementation to change"). Branch 2 undoes that design decision without explaining why.

Call-site inventory: my grep finds 2 non-test implementations (kvFSM.Apply, TestStateMachine.Apply in raftenginetest/conformance.go) and 4 test fakes (encryption_test.go:22, engine_test.go:33, :140, :171). The doc claims "8 in internal/raftengine/" — I found 5 (1 conformance shim + 4 test fakes), not 8. This discrepancy should be checked before landing B2.

Recommendation: Either (a) drop Branch 2 and use f.pendingApplyIdx throughout Branch 3, or (b) explicitly state in the design why the full interface change is preferred (e.g., to make the API contract self-documenting, to eliminate the implicit pre-call side-effect). Option (a) reduces risk and respects the existing design decision.


§3 — fsmAlreadyAtIndex error handling mismatch

Verdict: The text and pseudocode disagree on error handling — this needs resolution before implementation.

The design text says (§2): "Restore runs when LastAppliedIndex() returns an error." But the pseudocode does:

if skip, err := fsmAlreadyAtIndex(fsm, tok.Index); err != nil {
    return err  // ← propagates error, FAILS startup
} else if skip {
    return nil
}

An error in LastAppliedIndex() (e.g., corrupt meta key) would fail the node's cold start rather than falling back to a full restore. This violates the stated "strictly additive" / conservative fallback guarantee and could turn a corrupt meta key into an unrecoverable node. The conservative behavior should be to fall back to full restore on any read error, matching the !present case. In fsmAlreadyAtIndex:

have, present, err := r.LastAppliedIndex()
if err != nil {
    return false, nil  // conservative: treat as missing, fall back to restore
}

The 8-byte length check in LastAppliedIndex() already catches corruption and returns an error — that error should be swallowed by the conservative fallback, not propagated up to fail startup.


Branch split (B2/B3/B4/B5) granularity

Verdict: B3/B4 split is well-chosen; B2 needs rethinking; B5 placement is correct.

  • B2 (interface change): questionable if ApplyIndexAware seam can substitute (see above). If kept, the PR should be small enough that the risk is low, but it's a breaking API change.
  • B3 (meta key + LastAppliedIndex()): The "data side active, skip not yet wired" split is excellent. It lets you run B3 in production for one release cycle, observe that meta keys are being written correctly (via metrics + pebble inspection), and build confidence before activating B4.
  • B4 (skip gate): Good separation. The 4 test cases listed in the doc (skip/restore/missing/error) fully characterize the gate.
  • B5 (health timeout tightening): Correctly deferred — you need B4 in production with observed skip rates before you can set a tighter bound with confidence.

If B2 is eliminated, consider renaming B3→B2, B4→B3, B5→B4 to keep the numbering clean.


Open Question 1 — HLC lease entry handling

Recommendation: Commit to Option 2 (fold into next data Apply), and document why.

Option 1 (degenerate batch with just the meta key for each HLC lease entry) adds a pebble write per ceiling renewal. hlcRenewalInterval = 1s means ~1 extra pebble batch per second per group, which is measurable under high shard counts.

Option 2 is correct and cheaper: applyHLCLease has no pebble side-effects (it only advances f.hlc). If the node crashes between a HLC lease entry at index K+1 and the next data entry, LastAppliedIndex stays at K. On restart: skip check fires if K >= snapshot.Metadata.Index, raft replays [snapshotIndex+1, K+1]. The HLC lease entry at K+1 is a fresh apply (no pebble conflict, HLC is advanced in-memory, which is how it works on every cold start anyway since f.hlc starts at zero). Correct.

The only scenario where a long run of HLC-lease-only entries could regress the skip effectiveness is if a snapshot was taken at some index S > K (the last data entry). In that case LastAppliedIndex = K < S = snapshot.Metadata.Index and the full restore fires. But this would only happen if the cluster is actively snapshotting during a phase with zero data writes — an unlikely and benign case.

Document the choice in the design with the reasoning: "HLC lease entries are idempotent in-memory operations whose state is reconstructed from raft replay at every cold start regardless of skip; Option 2 avoids extra pebble batch overhead at the cost of conservative fall-back on the rare all-lease-entry case."


Minor

  • metaAppliedIndexKey byte representation: Specify the exact key in the design (e.g., \x00meta\x00applied_index or similar). Confirm it is outside the user key space and doesn't collide with metaLastCommitTSBytes, metaMinRetainedTSBytes, or the reservation table in lsm_store.go.
  • Metrics naming: fsm_cold_start_applied_index_gap is described as "snapshot.Index - lastAppliedIndex when restore is executed" — consider whether this should also be emitted (as a positive value, with outcome="skipped") to confirm the gap closed path. That would make the metric symmetric and useful for regression detection.

Summary

Area Status
§3 Crash-safety (batch atomicity core) ✅ Sound — nosync/DEL_PREFIX/encryption caveats should be addressed
§4 Idempotency of re-apply ⚠️ Mechanically correct but imprecise — OCC write-conflict path needs accurate description
§1 Interface change (B2) ⚠️ ApplyIndexAware seam makes this potentially redundant — justify or drop
Error handling in fsmAlreadyAtIndex ❌ Pseudocode propagates error instead of falling back — must be corrected before implementation
B3/B4/B5 branch granularity ✅ Well-chosen
HLC lease Open Question ✅ Option 2 is the right call — commit to it in the doc

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a design for idempotent FSM snapshot restores on cold starts to optimize startup times. The review feedback identifies several critical issues and design improvements: leveraging the existing ApplyIndexAware interface to avoid breaking public APIs, addressing a missing opcode dispatch for encryption, adding database read locks to prevent race conditions, and resolving a type assertion mismatch by using a decoupled pure-function injection pattern.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +93 to +100
```go
// internal/raftengine/statemachine.go
type StateMachine interface {
Apply(index uint64, data []byte) any // was: Apply(data []byte) any
Snapshot() (Snapshot, error)
Restore(r io.Reader) error
}
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

既存の ApplyIndexAware インターフェースの活用について

StateMachine インターフェースを変更してすべてのインツリー実装やテストシムを修正する(提案されている Branch 2)代わりに、既存の ApplyIndexAware インターフェースを活用することを検討してください。

kvFSM はすでに SetApplyIndex(idx uint64) を通じて ApplyIndexAware を実装しており、同じゴルーチン上で Apply が呼び出される直前にインデックスを f.pendingApplyIdx に保存しています。Apply 内部で f.pendingApplyIdx を直接利用すれば、パブリックインターフェースを破壊することなく、実装計画を大幅に簡素化(Branch 2 の作業を完全に削減)できます。

Comment on lines +119 to +127
func (f *kvFSM) Apply(index uint64, data []byte) any {
if len(data) > 0 && data[0] == raftEncodeHLCLease {
return f.applyHLCLease(index, data[1:])
}
ctx := context.TODO()
reqs, err := decodeRaftRequests(data)
if err != nil { return errors.WithStack(err) }
return f.applyAtIndex(ctx, index, reqs)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

暗号化オプコード処理の欠落

提案されている Apply の実装では、applyReservedOpcode(data) の呼び出しが完全にバイパスされています。現在のコードベースにおいて、applyReservedOpcoderaftEncodeHLCLease と暗号化オプコード(0x03..0x07)の両方を処理するため非常に重要です。この処理をバイパスすると、Stage 6/7/8 の機能に必要な暗号化オプコードのディスパッチが機能しなくなります。新しい Apply のロジックでも applyReservedOpcode を通すか、暗号化ディスパッチが維持されるように設計を修正してください。

Comment on lines +140 to +152
func (s *pebbleStore) LastAppliedIndex() (uint64, bool, error) {
val, closer, err := s.db.Get(metaAppliedIndexKey)
if errors.Is(err, pebble.ErrNotFound) {
return 0, false, nil
}
if err != nil { return 0, false, errors.WithStack(err) }
defer closer.Close()
if len(val) != 8 {
return 0, false, errors.Newf("corrupt applied-index meta key: %d bytes", len(val))
}
return binary.BigEndian.Uint64(val), true, nil
}
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

データベースの読み取りロックの欠落

提案されている LastAppliedIndex() メソッドは、s.dbMu.RLock() を取得せずに s.db に直接アクセスしています。store/lsm_store.go で定義されているロック順序の規則に従い、s.db にアクセスするすべての操作は、並行して行われるデータベースのスワップ(RestoreClose など)との競合状態を防ぐために、少なくとも s.dbMu.RLock() を保持する必要があります。

Comment on lines +191 to +200
func fsmAlreadyAtIndex(fsm StateMachine, want uint64) (bool, error) {
r, ok := fsm.(AppliedIndexReader)
if !ok {
return false, nil // FSM cannot self-report; restore conservatively.
}
have, present, err := r.LastAppliedIndex()
if err != nil { return false, err }
if !present { return false, nil }
return have >= want, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

kvFSM における型アサーションの不整合とインターフェース設計の改善

fsmAlreadyAtIndex 内で、fsmAppliedIndexReader に型アサーションしようとしていますが、kvFSM 自体は AppliedIndexReader を実装していないため、この型アサーションは常に失敗します。

この結合を解決するために、kvFSM に新たなメソッドを追加してインターフェースを肥大化させるのではなく、純粋関数(例:func() uint64)を値として注入するアダプターやインターセプターパターンの採用を検討してください。これにより、ステートフルなインターフェースに不要なメソッドを追加することなく、インターフェースを軽量に保ちながら疎結合な設計を実現できます。

References
  1. When decoupling a generic interface from concrete implementations, prefer injecting pure functions as values (e.g., via an adapter or interceptor) instead of adding stateless methods to stateful interfaces, keeping the interfaces lightweight and focused on stateful operations.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdabb969a7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +317 to +319
to branch 3; the in-memory HLC state is reconstructed at startup from
the lease entries replayed by raft, so this is purely about getting
the index counter monotonic.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve HLC snapshot state when skipping restore

In the cold-start path, loadWalState restores the FSM before opening the WAL at walSnapshotFor(snapshot), so after the proposed skip only entries after the persisted snapshot index are replayed. kvFSM.Restore is currently the place that reads the snapshot header and applies the HLC physical ceiling (kv/fsm.go:513-518), so if the last HLC lease is at or before the snapshot index and a later data entry advances the proposed Pebble meta key, this design skips Restore, does not replay that compacted lease, and restarts with no restored ceiling; a subsequently elected node can issue HLC timestamps below the snapshotted ceiling. The open question’s statement that startup reconstructs HLC from replayed lease entries is therefore unsafe unless the skip path separately reads/persists the HLC snapshot metadata or refuses to skip in this case.

Useful? React with 👍 / 👎.

Address:
- codex P1 (HLC ceiling preservation when skipping body Restore) by
  adding §5 / applyHLCCeilingFromSnapshot / HLCCeilingSetter seam
- gemini medium on dbMu locking by adding s.dbMu.RLock() to
  LastAppliedIndex pseudocode in §3
- claude §3 caveat 1 (ELASTICKV_FSM_SYNC_MODE=nosync) by splitting §6
  crash-safety into Sync/NoSync subsections
- claude §3 caveat 2 (DEL_PREFIX path builds independent batch) by
  threading appliedIndex into both applyMutationsWithOpts AND
  deletePrefixAtWithOpts in §2 + §6
- claude §3 error handling by collapsing fsmAlreadyAtIndex's signature
  to (bool) only; any error / missing / uncertainty -> false (full
  restore) per the strictly-additive guarantee
- claude §4 OCC two-case argument by replacing §7 with raw vs OCC
  reasoning
- claude §1 call-site count by correcting 8 -> 6 in §1's table

Pivot the interface change:
- Round-1 proposed breaking StateMachine.Apply(data) -> Apply(index, data)
- Round-2 introduces a NEW ApplyIndexAware seam (does not exist on
  main today) so StateMachine.Apply's public signature is unchanged
- This adopts the pattern reviewers wrongly believed already existed,
  but with grep-verified honesty about what is actually in main

Add Errata section with grep-verified table of fabricated reviewer
claims (ApplyIndexAware, SetApplyIndex, pendingApplyIdx,
applyReservedOpcode, applyEncryption, WriteSidecar, encryption opcodes
0x03..0x07 — all zero matches at SHA 94579fc) so future readers
understand which round-1 nits are real and which were imagined.

No code change. PR remains docs-only.
bootjp added a commit that referenced this pull request Jun 3, 2026
Address claude[bot] §3 doc nit on PR #909: future maintainers reading
only the shell script have no pointer to the planned structural fix.
Add an inline pointer to PR #910 / docs/design/.../snapshot-restore.md
so the 'why is this 300?' question doesn't get re-opened after #910
ships.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-2 push: df9580b6 — major design revision in response to round-1 feedback.

Incorporated (real findings):

  • 🔴 codex P1 line 319 — HLC ceiling preservation when skipping body Restore. §5 adds applyHLCCeilingFromSnapshot + HLCCeilingSetter seam reading just the 16-byte header from the snapshot file.
  • 🟠 gemini medium line 152 — s.dbMu.RLock() on LastAppliedIndex() per lsm_store.go lock-ordering discipline. §3 fixed.
  • 🟠 claude §3 caveat 1 — ELASTICKV_FSM_SYNC_MODE=nosync. §6 split into Sync / NoSync subsections.
  • 🟠 claude §3 caveat 2 — deletePrefixAtWithOpts builds an independent batch and bundles metaLastCommitTSBytes (lsm_store.go:1231). §2 + §6 thread appliedIndex into BOTH applyMutationsWithOpts AND deletePrefixAtWithOpts.
  • 🟠 claude §3 error handling — fsmAlreadyAtIndex propagating errors would fail cold start. Round-2 collapses signature to bool only: any error / missing / uncertainty → false (conservative full restore).
  • 🟢 claude §4 — OCC two-case argument (raw vs OCC ErrWriteConflict non-HaltApply path). §7 rewritten.
  • 🟢 claude §1 call-site count — corrected 8 → 6 (verified by grep).
  • 🟢 Open Question 1 — committed to Option 2 (don't bundle for HLC lease ticks; rare false-positive full-restore is acceptable cost).

Pushed back with grep evidence (round-1 fabricated claims):

  • ApplyIndexAware / SetApplyIndex / pendingApplyIdxzero matches in repo. Both gemini Configure Renovate #1 and claude §1 wrongly claimed this seam already exists.
  • applyReservedOpcode / applyEncryption / WriteSidecar / encryption opcodes 0x03..0x07 — zero matches. Only raftEncodeSingle=0x00, raftEncodeBatch=0x01, raftEncodeHLCLease=0x02 exist (kv/fsm.go:110-116).
  • engine.go:2292-2295 ApplyIndexAware seam (claude §1 cite) — that range is failPending, unrelated.

The reviewers' reasoning about avoiding a breaking StateMachine.Apply signature change was sound regardless of the factual error. Round-2 adopts the pattern they hallucinated: §1 introduces ApplyIndexAware as a new opt-in interface (not breaking) rather than modifying the existing Apply(data) signature. The Errata section at the doc tail documents this with verification commands so future readers can reproduce.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 6m 42s —— View job


Claude finished @bootjp's task


Design Review — Round 2

Todo List:

  • Load review tools
  • Read updated design document
  • Read relevant source files (kv/fsm.go, store/lsm_store.go, internal/raftengine/etcd/wal_store.go, internal/raftengine/statemachine.go, kv/snapshot.go)
  • Verify round-2 claims against actual source
  • Evaluate new applyHLCCeilingFromSnapshot + HLCCeilingSetter approach
  • Evaluate deletePrefixAtWithOpts threading
  • Evaluate fsmAlreadyAtIndex bool-only signature
  • Evaluate ApplyIndexAware as new opt-in interface
  • Evaluate OCC two-case argument rewrite (§7)
  • Verify call-site count claim (6)
  • Evaluate HLC Option 2 commitment
  • Post final review

The round-2 revision has resolved every round-1 correctness issue and the overall design is now very sound. Two new findings from reading the current source require attention before implementation — one is a correctness bug in applyHLCCeilingFromSnapshot that would silently fail for all Stage-8a (v2) snapshots, and one is an important implementation-planning correction. The rest are minor.


🔴 Critical: applyHLCCeilingFromSnapshot silently skips HLC ceiling for Stage-8a (v2) snapshots

The §5 helper reads a fixed 16-byte buffer and checks only v1 magic (EKVTHLC1). But Stage 8a (already on main — kv/snapshot.go:27) introduced v2 snapshots with magic EKVTHLC2 and a different layout (26-byte minimum header: magic(8) + len(2,BE) + ceiling(8) + cutover(8)). For any node that has taken a snapshot since Stage 8a shipped, the helper's bytes.Equal(hdr[:8], hlcSnapshotMagic[:]) check silently fails (v2 magic ≠ v1 magic), ceilingMs stays at 0, and the HLC floor is never restored — the exact same bug the skip path was designed to guard against.

applyHLCCeilingFromSnapshot must call ReadSnapshotHeader (kv/snapshot.go:187), which already handles v1, v2, unknown-EKVTHLC (fail-closed), and headerless-legacy correctly. The direct fix:

// applyHLCCeilingFromSnapshot — corrected version
func applyHLCCeilingFromSnapshot(fsm StateMachine, snapPath string) error {
    setter, ok := fsm.(HLCCeilingSetter)
    if !ok { return nil }
    f, err := os.Open(snapPath)
    if err != nil { return errors.WithStack(err) }
    defer f.Close()
    // ReadSnapshotHeader requires a *bufio.Reader and handles v1/v2/legacy/unknown.
    // The underlying file is only read for the header; the store body is not consumed.
    ceiling, _, err := kv.ReadSnapshotHeader(bufio.NewReader(f))
    if err != nil { return errors.WithStack(err) }
    if ceiling > 0 {
        setter.SetHLCPhysicalCeiling(int64(ceiling))
    }
    return nil
}

Package boundary note: ReadSnapshotHeader lives in the kv package; applyHLCCeilingFromSnapshot would live in internal/raftengine/etcd. One clean approach to avoid the cross-package import: move the ceiling-read into the HLCCeilingSetter interface itself, giving kvFSM a method ApplyHLCCeilingFromFile(snapPath string) error that opens the file and calls ReadSnapshotHeader internally (it's in the same kv package). That keeps all snapshot-format knowledge inside kv. The B4 test spec should add a v2-format case explicitly to prevent future regressions.


🔴 Critical (implementation-planning): Branch 2 already exists on main

The Errata section is historically accurate for SHA 94579fc0, but subsequent commits have landed all of Branch 2 as a byproduct of the encryption work. Verified by grep on current HEAD:

Symbol Errata claim Actual state
ApplyIndexAware 0 matches at 94579fc0 Exists at internal/raftengine/statemachine.go:46
SetApplyIndex 0 matches Exists at kv/fsm.go:128
pendingApplyIdx 0 matches Exists at kv/fsm.go:53
applyReservedOpcode 0 matches Exists at kv/fsm.go:293, dispatches HLC-lease (0x02) AND encryption (0x03..0x07)
applyEncryption 0 matches Exists at kv/fsm_encryption.go:78
WriteSidecar 0 matches Exists at internal/encryption/sidecar.go:240
engine call site "engine.go:2292 is failPending" engine.go:2292-2293 IS the ApplyIndexAware seam: if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok { aware.SetApplyIndex(entry.Index) }

The round-1 reasoning about "avoid a breaking signature change" was adopted correctly, and the Errata correctly attributes the error. But the implementation plan needs updating:

  • Branch 2 is done. The ApplyIndexAware interface, kvFSM.SetApplyIndex, and the engine call site were wired by encryption Stages 6C/6E/7a.
  • The §1 table row "kvFSM.Apply: Add SetApplyIndex storing into pendingApplyIdx" is already complete.
  • Branch 3 can use f.pendingApplyIdx immediately — it is already populated by the engine before every Apply call.
  • The implementation plan branch numbering should shift: B2 → done (drop or mark as shipped), B3 → new B2, B4 → new B3, B5 → new B4.

The docs branch should also be rebased onto current main before B2 implementation starts, since the branch diverged before all the encryption commits and the kv/fsm.go on this branch already contains SetApplyIndex etc. (making the diff clean).


🟠 Significant: restoredCutover (Stage 8a §3.4) not set when skip fires

kvFSM.Restore sets f.restoredCutover = cutover (kv/fsm.go:522) as the "snapshot-to-applier handoff" for Stage 6E. When the skip path fires, this assignment is bypassed. In most restarts this is harmless: the WAL replay window (snapshot.Index, committed] will re-apply the EnableRaftEnvelope entry and the cutover is restored through applyEncryption. But if the EnableRaftEnvelope entry was at index ≤ snapshot.Metadata.Index (compacted before the snapshot), it won't be replayed and the Stage 6E activation hook won't fire on the next apply.

This is an existing coupling that the skip path introduces as a new failure mode. Recommended resolution (consistent with the C1 fix above): extend HLCCeilingSetter to also deliver the cutover, and have applyHLCCeilingFromSnapshot set both in one pass:

type HLCCeilingSetter interface {
    SetHLCPhysicalCeiling(ms int64)
    SetRestoredCutover(cutover uint64)  // Stage 8a §3.4 handoff
}

Or, if keeping the interface minimal, add a single ApplySnapshotHeader(snapPath string) error method to kvFSM that opens the file, calls ReadSnapshotHeader, and sets both f.hlc.SetPhysicalCeiling and f.restoredCutover — exactly mirroring what Restore does for the header section.


🟠 Significant: Encryption-opcode entries are a present gap, not a future one

The "Out of Scope (future)" note says "once Stages 6/7/8 land, they will need the same treatment as DEL_PREFIX." But applyEncryption for opcodes 0x03-0x07 is already live on main (dispatched through applyReservedOpcode). In a cluster that has run encryption rotations, those entries won't advance metaAppliedIndex (since applyEncryption calls WriteSidecar, not f.store.ApplyMutationsRaft). The same Option 2 reasoning as HLC lease entries applies — safe, conservative full restore — but the doc should state this as a current known gap rather than a future one. Consider a single paragraph under the HLC-lease Option 2 commitment:

"The same reasoning applies to encryption-opcode entries (0x03-0x07, already in-tree): they go through applyEncryption / WriteSidecar and do not touch f.store. metaAppliedIndex will not advance for encryption-only apply windows; the skip gate will conservatively fall back to full restore in those cases."


🟢 Minor

isPebbleMetaKey — function name correction: The doc references isReservedMetaKey (§3). The actual function in lsm_store.go:534 is isPebbleMetaKey. Branch 3 should extend isPebbleMetaKey.

kvFSM.Apply dispatch is now three-way: The §2 pseudocode comment says "leave dispatch (HLC-lease vs request) UNCHANGED." The actual applyReservedOpcode is now three-way: HLC-lease (0x02), encryption (0x03-0x07), MVCC requests. Update the comment.

B4 test spec — add v2-format coverage: The spec says "asserts that applyHLCCeilingFromSnapshot sets f.hlc.PhysicalCeiling() for both skip and restore paths." Once C1 is fixed, the test should also explicitly exercise the v2 snapshot format (hlcSnapshotMagicV2) to prevent regression.


Summary

Area Status
§3 Crash-safety (DEL_PREFIX + nosync, round-1 caveats) ✅ Correctly addressed in round-2
§4 OCC two-case idempotency argument ✅ Correct and complete in round-2
§5 HLC ceiling preservation — v2 snapshot correctness 🔴 applyHLCCeilingFromSnapshot only checks v1 magic; silently skips ceiling for v2 snapshots
§5 restoredCutover (Stage 8a §3.4) on skip path 🟠 Not set when skip fires; EnableRaftEnvelope handoff may be lost if entry is compacted
Error handling in fsmAlreadyAtIndex ✅ Bool-only return, conservative fallback, round-1 critical fixed
Branch 2 (ApplyIndexAware seam) 🔴 Already exists on main (encryption work) — implementation plan needs updating
B3/B4/B5 branch granularity ✅ Well-chosen; numbering shifts now that B2 is done
Encryption-opcode gap as "future" 🟠 Opcodes are live; rephrase as a current known gap
isPebbleMetaKey name 🟢 Cosmetic
HLC Option 2 commitment ✅ Correct and consistent with the code

The two critical items (C1 — v2 HLC ceiling bug, C2 — B2 already done) should be resolved before the implementation branches are opened. Everything else is either already sound or a documentation polish.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df9580b6ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +419 to +420
if !bytes.Equal(hdr[:8], hlcSnapshotMagic[:]) {
return nil // old format, no ceiling.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve v2 snapshot header state when skipping restore

Fresh evidence beyond the earlier HLC concern: current snapshots can be emitted with the v2 EKVTHLC2 header (kv/snapshot.go's ReadSnapshotHeader returns both ceiling and cutover), but this skip-path helper only accepts the v1 hlcSnapshotMagic and returns nil for anything else. In clusters that have crossed the v2 cutover, skipping kvFSM.Restore would therefore apply neither the snapshotted HLC ceiling nor the restored cutover, so a restarted node can lose snapshot header state while believing the body restore was safely elided; the skip path should reuse/export the same header parser and setter semantics as kvFSM.Restore, not a 16-byte v1-only probe.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/design/2026_06_02_idempotent_snapshot_restore.md (1)

487-499: ⚡ Quick win

Add a fallback-reason dimension to metrics for lease-only false positives.

You already call out the HLC-lease-only case as safe-but-overrestoring; add a reason label/counter (e.g., reason="missing_meta|read_error|lease_only_gap|stale") so B5 can be gated on real cause distribution, not just skip rate.

Based on learnings: Ensure FSM apply operations are idempotent and that committed entries cannot be lost; check crash-restart paths.

Also applies to: 553-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_06_02_idempotent_snapshot_restore.md` around lines 487 -
499, Add a "reason" dimension to the skip-gate/restore metrics and increment it
with concrete values ("missing_meta", "read_error", "lease_only_gap", "stale")
at the decision points so B5 can be gated by cause distribution; specifically,
in the code paths around f.applyHLCLease (which is in-memory and doesn't touch
f.store) and the skip-gate fallback logic, emit a metrics counter with
reason="lease_only_gap" when the only committed entries are HLC lease ticks, and
use the other reason labels at the respective error/metadata branches. Also
audit FSM apply code (functions/methods that perform apply operations and rely
on LastAppliedIndex) to ensure apply operations are idempotent and that
committed entries are persisted across crash/restart paths (verify
LastAppliedIndex handling and any f.store writes) so the metric reason
accurately reflects true causes rather than lost commits.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_06_02_idempotent_snapshot_restore.md`:
- Around line 579-580: Update the B5 behaviour description to fix the wording
typo: locate the row for **B5** in the document (the line containing "Lower
`HEALTH_TIMEOUT_SECONDS` default once production data shows steady-state skip
rate ≥ 90 % | Tighter ceiling, but tracker the env override still honoured") and
change the phrase "but tracker the env override still honoured" to "but the env
override is still honoured" so the sentence reads clearly.
- Around line 553-569: Add fenced code block languages to the two examples so
markdown lint passes: update the metric fence to use a language (e.g. ```text)
and the log line fence to use a language (e.g. ```text). Locate the metric
example containing fsm_cold_start_restore_total and
fsm_cold_start_applied_index_gap and the log example containing the message
"restoreSnapshotState skipped (FSM at index %d, snapshot at %d, HLC ceiling
applied from header)" and prepend the appropriate language identifier to each
opening ``` fence.

---

Nitpick comments:
In `@docs/design/2026_06_02_idempotent_snapshot_restore.md`:
- Around line 487-499: Add a "reason" dimension to the skip-gate/restore metrics
and increment it with concrete values ("missing_meta", "read_error",
"lease_only_gap", "stale") at the decision points so B5 can be gated by cause
distribution; specifically, in the code paths around f.applyHLCLease (which is
in-memory and doesn't touch f.store) and the skip-gate fallback logic, emit a
metrics counter with reason="lease_only_gap" when the only committed entries are
HLC lease ticks, and use the other reason labels at the respective
error/metadata branches. Also audit FSM apply code (functions/methods that
perform apply operations and rely on LastAppliedIndex) to ensure apply
operations are idempotent and that committed entries are persisted across
crash/restart paths (verify LastAppliedIndex handling and any f.store writes) so
the metric reason accurately reflects true causes rather than lost commits.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 02558587-5367-43c0-a0eb-3bbbea5d536e

📥 Commits

Reviewing files that changed from the base of the PR and between 94579fc and df9580b.

📒 Files selected for processing (1)
  • docs/design/2026_06_02_idempotent_snapshot_restore.md

Comment thread docs/design/2026_06_02_idempotent_snapshot_restore.md Outdated
Comment thread docs/design/2026_06_02_idempotent_snapshot_restore.md Outdated
… (v2)

Round-2 of this doc wrongly pushed back on gemini and claude review
findings as 'fabricated.' My grep was running against
test/event-driven-leader-readiness (27 commits behind origin/main) so
several origin/main-only entities (ApplyIndexAware, SetApplyIndex,
pendingApplyIdx, applyReservedOpcode, applyEncryption, WriteSidecar,
v2 snapshot header) returned zero matches in my local tree. The
reviewers were correct; round 2 was wrong.

Round 3 rebases against the actual origin/main:

  - Drop the round-2 'create ApplyIndexAware seam' framing.
    ApplyIndexAware already exists at
    internal/raftengine/statemachine.go:46; engine.go:2292 already
    calls aware.SetApplyIndex(entry.Index); kvFSM.SetApplyIndex at
    kv/fsm.go:122 already stashes the index in f.pendingApplyIdx
    (kv/fsm.go:53). Branch 2 shrinks to 'thread f.pendingApplyIdx
    through the data-Apply leaves so applyMutationsWithOpts /
    deletePrefixAtWithOpts can bundle metaAppliedIndex in the same
    pebble.Batch as metaLastCommitTSBytes'.

  - Address codex round-2 P1 (v2 snapshot header). The skip path
    must preserve BOTH the HLC ceiling AND the Stage 8a cutover.
    §5 swaps the round-2 v1-only 16-byte probe for the existing
    kv.ReadSnapshotHeader(*bufio.Reader) parser, which handles v1,
    v2, ErrSnapshotHeaderUnknownMagic, and headerless legacy
    uniformly. SnapshotHeaderApplier is a new opt-in interface
    matching kvFSM's existing Restore side-effects (SetPhysicalCeiling
    + restoredCutover).

  - Keep applyReservedOpcode untouched in the data-Apply path.
    Encryption opcodes 0x03..0x05 already consume f.pendingApplyIdx
    via applyEncryption -> WriteSidecar; metaAppliedIndex correctly
    does NOT advance for sidecar-only entries, leaving the skip
    gate to conservatively fall back on Stage 6/7/8 maintenance
    windows where the only entries are encryption ops.

  - Drop the round-2 Errata table; add a Round-2 Retraction section
    at the end with grep-verifiable origin/main file:line references
    so future readers can see why round-2 was wrong and what round 3
    rebased to.

  - Fix coderabbit nits: 'text' language on the two fenced metric /
    log code blocks; 'tracker the env override' typo in the B4 row
    becomes 'the env override remains honoured'.

Doc-only. No code change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-3 push: 97531054 — correcting a self-error in round-2.

What changed

I owe you and the round-1 reviewers a retraction. Round-2's "Errata" pushed back on gemini-code-assist and claude[bot] as if their references to ApplyIndexAware, SetApplyIndex, pendingApplyIdx, applyReservedOpcode, applyEncryption, WriteSidecar, and the v2 snapshot header were fabricated. They were not. My grep was running against my local working branch (test/event-driven-leader-readiness, 27 commits behind origin/main), so all of these entities returned zero matches in my tree — but they all exist on origin/main:

  • ApplyIndexAware at internal/raftengine/statemachine.go:46
  • kvFSM.SetApplyIndex at kv/fsm.go:122 writing f.pendingApplyIdx (kv/fsm.go:53)
  • engine.applyNormalEntry at internal/raftengine/etcd/engine.go:2292-2293 calling aware.SetApplyIndex(entry.Index)
  • applyReservedOpcode in kv/fsm.go dispatching raftEncodeHLCLease (0x02), OpRegistration (0x03), OpBootstrap (0x04), OpRotation (0x05)
  • applyEncryption(f.pendingApplyIdx, ...) consuming the index for the encryption sidecar's RaftAppliedIndex via WriteSidecar
  • EKVTHLC2 v2 snapshot magic + cutover + ReadSnapshotHeader(*bufio.Reader) parser at kv/snapshot.go

The reviewers were right; round 2 was wrong.

Round-3 substantive changes

  • §1 Branch 2 scope shrinks. ApplyIndexAware is already in place. The new work is just "thread f.pendingApplyIdx through the data-Apply leaves so applyMutationsWithOpts / deletePrefixAtWithOpts can bundle a new metaAppliedIndex key in the same pebble.Batch that already carries metaLastCommitTSBytes." No engine.go change, no new SetApplyIndex.
  • §5 codex P1 (v2 header) is the same bug class as round-1 P1. Both the HLC ceiling AND the Stage 8a cutover live in the snapshot header. The skip path must apply both side-effects. §5 swaps the round-2 v1-only 16-byte probe for the existing kv.ReadSnapshotHeader(*bufio.Reader) parser, which already handles v1, v2, ErrSnapshotHeaderUnknownMagic (fail-closed), and headerless legacy. The new SnapshotHeaderApplier seam on kvFSM mirrors Restore's two side-effects (SetPhysicalCeiling + restoredCutover).
  • applyReservedOpcode preserved. The data-Apply dispatch path is untouched; pendingApplyIdx only ever needed to reach the data-store leaves (encryption opcodes already consume it via WriteSidecar).
  • coderabbit nits: text language on the two fenced metric / log code blocks; "tracker the env override" typo fixed in B4 row.

Apology

The lesson for me: always grep against origin/main (not the working tree) when auditing review feedback for fabrication. Round 2 spent thousands of words "correcting" reviewers who were correct, and proposed building from scratch what already exists. Round 3 rebases the design onto the actual seam.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 3m 57s —— View job


Round-3 Review

Todo List:

  • Load review tools
  • Read updated design document (docs/design/2026_06_02_idempotent_snapshot_restore.md)
  • Verify round-3 source claims against origin/main
  • Evaluate §1 revised Branch 2 scope
  • Evaluate §5 v2 header fix (kv.ReadSnapshotHeader + SnapshotHeaderApplier)
  • Evaluate applyReservedOpcode preservation claim
  • Check for remaining gaps
  • Post final review

Design Review — Round 3

Verdict on the retraction: Correct and well-documented. Every entity the round-2 Errata dismissed as fabricated has been verified on origin/main:

Entity Location
ApplyIndexAware internal/raftengine/statemachine.go:46
kvFSM.SetApplyIndexf.pendingApplyIdx kv/fsm.go:128 + :53
engine.applyNormalEntry seam call internal/raftengine/etcd/engine.go:2292-2293
applyReservedOpcode dispatching 0x02–0x05 kv/fsm.go:293
applyEncryption(f.pendingApplyIdx, ...) kv/fsm.go:301
v2 magic EKVTHLC2 + ReadSnapshotHeader kv/snapshot.go:27 + :187

The Errata section is well-written; it includes the grep commands future readers can reproduce, and the opcode range correction (0x06/0x07 are reserved-for-future, not live — confirmed at internal/encryption/fsmwire/wire.go) is accurate. Overall this is a clean retraction.


🔴 Critical: Import cycle in §5 applyHeaderStateOnSkip

The §5 pseudocode has applyHeaderStateOnSkip in internal/raftengine/etcd/wal_store.go calling kv.ReadSnapshotHeader(br):

// wal_store.go (proposed) — IMPORT CYCLE
ceiling, cutover, err := kv.ReadSnapshotHeader(br)

wal_store.go currently imports only internal/raftengine (plus etcd/raft and standard library — grep wal_store.go head confirms no kv import). kv/fsm.go imports internal/raftengine at line 13. Adding kv to wal_store.go's imports would create a cycle: internal/raftengine/etcd → kv → internal/raftengine.

The SnapshotHeaderApplier interface is almost the right fix — it just needs one more step: move the parsing into the method itself so wal_store.go never names the kv package:

// internal/raftengine/statemachine.go (new interface alongside ApplyIndexAware)
type SnapshotHeaderApplier interface {
    // ApplySnapshotHeaderFromFile opens snapPath, parses the v1/v2 header via
    // the same parser as Restore, and applies the two header side-effects
    // (HLC ceiling + restoredCutover). Returns error only for genuine I/O or
    // unknown-magic failures; missing header (headerless-legacy) is a no-op.
    ApplySnapshotHeaderFromFile(snapPath string) error
}

// kv/fsm.go (new method)
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string) error {
    file, err := os.Open(snapPath)
    if err != nil { return errors.WithStack(err) }
    defer file.Close()
    ceiling, cutover, err := ReadSnapshotHeader(bufio.NewReader(file))
    if err != nil { return errors.WithStack(err) }
    if f.hlc != nil && ceiling > 0 {
        f.hlc.SetPhysicalCeiling(int64(ceiling))
    }
    f.restoredCutover = cutover
    return nil
}

applyHeaderStateOnSkip in wal_store.go then becomes:

func applyHeaderStateOnSkip(fsm StateMachine, snapPath string) error {
    setter, ok := fsm.(SnapshotHeaderApplier)
    if !ok { return nil }
    return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
}

No kv import in wal_store.go. ReadSnapshotHeader stays in kv. The B3 test spec already prescribes a v2-format coverage test; with this shape, that test exercises kvFSM.ApplySnapshotHeaderFromFile directly and verifies both f.hlc.PhysicalCeiling() and f.restoredCutover are set from a v2 snapshot — exactly mirroring the existing Restore test coverage.


🟠 Significant: isReservedMetaKeyisPebbleMetaKey (wrong name, wrong line, Round 2 carry-over)

§3 says: "Branch 2 extends isReservedMetaKey (lsm_store.go:454)". The actual function is isPebbleMetaKey at line 534, not isReservedMetaKey at 454. Line 454 is inside writeValueHeaderBytes (the tombstone-flag encoder). This was a Round-2 minor note that did not make it into Round 3; please fix before Branch 2 opens to avoid confusion when locating the function.


🟠 Minor: Line number annotations for applyMutationsWithOpts / deletePrefixAtWithOpts

§2 references:

  • applyMutationsWithOpts (lsm_store.go:1130) — actual location is :1292
  • deletePrefixAtWithOpts metaLastCommitTSBytes at lsm_store.go:1231 — actual location is :1393 (:1231 is inside deletePrefixAtWithOpts but before metaLastCommitTSBytes)
  • Overload line references lsm_store.go:1162 for metaLastCommitTSBytes in applyMutationsWithOpts — actual is :1324

These are annotation-only errors that don't affect the design; fix them when the doc is finalized to match the actual line numbers so implementors don't hunt.


🟢 Confirmed Sound (no changes needed)

  • §1 Branch 2 scope: Correctly narrowed to "plumb f.pendingApplyIdx through data-Apply leaves" — no new interface or engine.go change required. The ApplyMutationsRaftAt / DeletePrefixAtRaftAt overload approach (rather than changing existing signatures) is the right conservative choice.
  • §2 DEL_PREFIX threading: Both leaves (applyMutationsWithOpts and deletePrefixAtWithOpts) correctly receive the index so LastAppliedIndex never lags behind DEL_PREFIX entries. Confirmed against the actual deletePrefixAtWithOpts which already builds its own pebble.Batch and commits it independently (:1378-1401).
  • §3 dbMu.RLock() on LastAppliedIndex(): Lock ordering is correct per the discipline at lsm_store.go:165. Other readers holding dbMu.RLock() before db.Get (:768, :809, :1161) serve as direct precedent.
  • §4 fsmAlreadyAtIndex bool-only return: Correct — any read error, missing key, or missing interface conservatively returns false and falls back to full restore. The ErrNotFound(0, false, nil) path is correctly distinguished from a read error.
  • §5 SnapshotHeaderApplier shape: Setting both HLC ceiling and restoredCutover is the right two-effect pair, exactly mirroring what kvFSM.Restore does at :517-523. Reusing ReadSnapshotHeader means forward-compat (v3, unknown-magic fail-closed) is inherited automatically.
  • §6 Crash-safety: nosync analysis (durability boundary shifts to raft WAL), DEL_PREFIX batch atomicity, encryption-opcode gap (correctly reclassified as a current known gap, not future) — all sound.
  • §7 OCC two-case idempotency: Raw-request Pebble overwrite path and OCC ErrWriteConflict non-HaltApply path are both accurate. handleCommitRequest idempotency via applyCommitWithIdempotencyFallback and handleAbortRequest via shouldClearAbortKey are correctly cited.
  • §8 Observability fallback_reason label: The reason="behind_snapshot" label for HLC-lease-only and encryption-only windows is a good operational signal — distinguishes "expected false-positive" from "bug" in production dashboards.
  • B2/B3/B4 branch split: Well-chosen. B2 soaks the meta-key write; B3 activates the skip gate; B4 tightens the health timeout only after production data confirms ≥ 90% skip rate. B5 from Round 2 has been correctly collapsed into B4 since there are now only 4 branches.
  • Open Questions: Multi-group (naturally per-shard), hashicorp (satisfiable via log.Index), v3 forward-compat (inherited via ReadSnapshotHeader) — all correctly scoped.

Summary

Area Status
All round-2 errata verified correct
§1 Branch 2 scope correctly narrowed
§5 v1+v2+headerless parity via ReadSnapshotHeader ✅ design shape correct
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) 🔴 Must use ApplySnapshotHeaderFromFile on the interface
isPebbleMetaKey / lsm_store.go:534 wrong name+line in §3 🟠 Carry-over from Round 2 — fix before B2
Line number annotations in §2 🟠 Minor — correct before doc is finalized
§6 crash-safety + OCC idempotency
§8 observability including fallback_reason label
B2/B3/B4 branch granularity

The one blocking item before implementation can start is resolving the import cycle in §5 by using ApplySnapshotHeaderFromFile(snapPath string) error on SnapshotHeaderApplier instead of calling kv.ReadSnapshotHeader directly from wal_store.go. Everything else is either sound or a doc-annotation fix.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9753105440

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +437 to +438
between two snapshots are lease ticks, the skip gate will fall back
to full restore — safe and rare. Adding a synthetic pebble batch per
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Account for periodic HLC leases before skipping restore

This fallback is not rare in the current system: RunHLCLeaseRenewal proposes an HLC lease every second while the node is leader (kv/coordinator.go:641-663, and the sharded path mirrors it), and applyHLCLease only mutates memory. Once a local snapshot is persisted after any lease-only tail, metaAppliedIndex remains at the last data mutation while snapshot.Metadata.Index includes those leases; on restart fsmAlreadyAtIndex will reject the skip, full restore will reinstall the same stale meta key from the snapshot, and opening the WAL at the snapshot index omits the compacted leases, so idle clusters can keep doing O(snapshot size) restores indefinitely. The design needs a way to advance the durable applied index for these no-store entries (or otherwise prove/header-cover them) rather than accepting a permanent fallback.

Useful? React with 👍 / 👎.

Codex round-3 P2 (line 438) correctly observed that the round-3 'safe
and rare' framing of the HLC-lease-only fallback was wrong:

  - RunHLCLeaseRenewal (kv/coordinator.go:650) ticks at 1s while the
    local node is leader.
  - applyHLCLease is memory-only; metaAppliedIndex does not advance.
  - For any cluster with a leader running >1s, lease entries trail
    every snapshot. Snapshot persists at index X, last data-Apply
    index Y < X, metaAppliedIndex stays at Y on restart.
    fsmAlreadyAtIndex(X) checks Y >= X -> false -> full restore.
    Idle clusters never fire the skip; active clusters have a
    meaningful window where it doesn't fire.

Round 4 closes the gap by bumping metaAppliedIndex to
snapshot.Metadata.Index inside persistCreatedSnapshot
(engine.go:2683), BEFORE e.persist.SaveSnap. After a successful
snapshot persist, LastAppliedIndex >= snapshot.Index holds
unconditionally, so the skip fires on the next restart regardless of
how many lease entries trail the snapshot.

Mechanism:

  - New AppliedIndexWriter interface (sibling to AppliedIndexReader)
    with SetDurableAppliedIndex(idx uint64) error.
  - pebbleStore.SetDurableAppliedIndex runs a single-key pebble.Batch
    write under dbMu.RLock with raftApplyWriteOpts (honours
    ELASTICKV_FSM_SYNC_MODE).
  - kvFSM forwards to f.store; engine.persistCreatedSnapshot calls
    the writer seam before SaveSnap.

Crash ordering: SetDurableAppliedIndex runs first. The only
observable states are (metaAppliedIndex,snapshot pointer) in
{(Y,X'<X), (X,X'), (X,X), (Z>X,X)}; none of these can yield
'snapshot pointer = X but metaAppliedIndex < X'. Over-restore
impossible.

Cost: one extra pebble Batch.Commit per snapshot persist (~one extra
fsync per SnapshotCount entries; default 10000) versus the rejected
Option A (1 fsync/sec/group continuously).

Implementation Plan B2 row gains the SetDurableAppliedIndex /
persistCreatedSnapshot bump. B2 tests gain a snapshot-persist
round-trip + a SaveSnap-failure crash-simulation. B3 tests gain an
idle-cluster integration test that proves the codex P2 scenario is
closed end-to-end.

Header status bumped to Round 4. Round-3 retraction section appended
with a meta-lesson on quantifying 'rare' against production timer
cadences rather than intuition.

Doc-only PR; no code change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-4 push: a5eacc77 — addressing codex round-3 P2 (line 438).

You were right that round 3's framing of the HLC-lease-only fallback as "safe and rare" was wrong. Verified the production cadence:

  • RunHLCLeaseRenewal at kv/coordinator.go:650 ticks at hlcRenewalInterval = 1 * time.Second (:46) while the local node is leader
  • applyHLCLease is memory-only → metaAppliedIndex never advances on lease apply
  • Snapshot persists at index X, last data-Apply at Y < X; on restart fsmAlreadyAtIndex(X) checks Y >= X → false → full restore
  • Idle clusters: skip never fires. Active clusters: meaningful window where it doesn't fire either.

Round-4 fix: snapshot-persist meta key bump

persistCreatedSnapshot (engine.go:2683) gains a call to SetDurableAppliedIndex(snap.Metadata.Index) before e.persist.SaveSnap:

if w, ok := e.fsm.(AppliedIndexWriter); ok {
    if err := w.SetDurableAppliedIndex(snap.Metadata.Index); err != nil {
        return errors.WithStack(err)
    }
}
if err := e.persist.SaveSnap(snap); err != nil { ... }

New AppliedIndexWriter interface sibling to AppliedIndexReader. kvFSM forwards to f.store; pebbleStore.SetDurableAppliedIndex does a single-key pebble.Batch commit under dbMu.RLock with raftApplyWriteOpts (honours ELASTICKV_FSM_SYNC_MODE).

Crash ordering proves over-restore impossible

Bump runs BEFORE SaveSnap. Observable states at any crash boundary:

State metaAppliedIndex snapshot pointer post-restart
Before bump Y (last data) X' < X (previous snap) skip if Y ≥ X' (correct)
Bump done, SaveSnap pending X still X' skip succeeds against X'
Both done X X skip succeeds against X ✓
Later data Apply at Z > X Z X skip succeeds against X ✓

No state can yield "snapshot pointer = X but metaAppliedIndex < X".

Cost

~1 extra pebble Batch.Commit per snapshot persist. etcd raft SnapshotCount default is 10000 entries → ~one extra fsync per 10000 entries. Versus the rejected Option A (1 fsync/sec/group continuously): ~10⁴× cheaper.

Tests

  • B2 gains a persistCreatedSnapshot round-trip asserting LastAppliedIndex == snap.Metadata.Index post-call.
  • B2 gains a SaveSnap-failure crash-simulation asserting post-restart LastAppliedIndex is at least as fresh as the previous-snapshot pointer.
  • B3 gains an idle-cluster integration test (no data writes for 2 × SnapshotCount × hlcRenewalInterval, snapshot, restart, assert skip fires) — proves your scenario is closed end-to-end.

Meta-lesson

The doc body has a Round-3 retraction section calling out the lesson: "rare" should be a quantitative claim against the actual production timer cadence, not an intuition. Your review cited kv/coordinator.go:641-663 — a file:line I could have consulted before making the round-3 claim.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 5m 1s —— View job


Round-4 Review

Todo List:

  • Load review tools
  • Read updated design document
  • Read relevant source files (engine.go, wal_store.go, kv/fsm.go, store/lsm_store.go, kv/snapshot.go, statemachine.go)
  • Evaluate AppliedIndexWriter interface and SetDurableAppliedIndex design
  • Verify crash-ordering proof (bump before SaveSnap)
  • Check pebbleStore.SetDurableAppliedIndex lock ordering
  • Verify import cycle concern (carry-forward from round-3)
  • Check idle-cluster test spec for tractability
  • Post review

The round-4 codex P2 fix (snapshot-persist meta key bump) is mechanically sound. The crash-ordering proof is correct. One item carries forward as blocking from round-3 with no change in round-4; the rest are annotation-level issues.


🔴 Critical (carry-forward from round-3, still blocking): Import cycle in §5

This was the single blocking item from round-3 and it is unchanged in round-4.

applyHeaderStateOnSkip is placed in internal/raftengine/etcd/wal_store.go. The §5 pseudocode calls kv.ReadSnapshotHeader(br) from that function:

// wal_store.go (proposed) — still creates an import cycle
ceiling, cutover, err := kv.ReadSnapshotHeader(br)

Verified on origin/main: no non-test file in internal/raftengine/etcd/ currently imports kv (only migrate_test.go does). kv/snapshot.go and kv/fsm.go both import "github.com/bootjp/elastickv/internal/raftengine". Adding kv to wal_store.go would create the cycle internal/raftengine/etcd → kv → internal/raftengine, which the Go toolchain rejects at compile time.

The fix is unchanged from round-3: change SnapshotHeaderApplier so the file-open and kv.ReadSnapshotHeader call live inside kvFSM.ApplySnapshotHeaderFromFile, not in wal_store.go:

// internal/raftengine/statemachine.go (interface)
type SnapshotHeaderApplier interface {
    ApplySnapshotHeaderFromFile(snapPath string) error
}

// wal_store.go — never names the kv package
func applyHeaderStateOnSkip(fsm StateMachine, snapPath string) error {
    setter, ok := fsm.(SnapshotHeaderApplier)
    if !ok { return nil }
    return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
}

// kv/fsm.go — kv.ReadSnapshotHeader stays inside the kv package
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string) error {
    file, err := os.Open(snapPath)
    if err != nil { return errors.WithStack(err) }
    defer file.Close()
    ceiling, cutover, err := ReadSnapshotHeader(bufio.NewReader(file))
    if err != nil { return errors.WithStack(err) }
    if f.hlc != nil && ceiling > 0 {
        f.hlc.SetPhysicalCeiling(int64(ceiling))
    }
    f.restoredCutover = cutover
    return nil
}

Fix this →


✅ Round-4 AppliedIndexWriter — Sound

The new AppliedIndexWriter interface is placed in internal/raftengine/statemachine.go (confirmed). Engine.go already imports internal/raftengine, so the type assertion e.fsm.(AppliedIndexWriter) adds no new import. No cycle. ✅

raftApplyWriteOpts() is a real method (confirmed at lsm_store.go:1257). ✅

✅ Crash-ordering proof — Sound

The four states in the table are verified correct:

State metaAppliedIndex Snapshot pointer fsmAlreadyAtIndex check Result
Before bump Y X' < X Y ≥ X' ? Correct (may skip or restore)
Bump done, SaveSnap pending X X' X ≥ X' → true Skip fires safely — no over-restore
Both done X X X ≥ X → true Correct optimisation
Both + data Apply at Z Z X Z ≥ X → true Correct

The bump can never write a value lower than the current metaAppliedIndex: snap.Metadata.Index = e.applied at snapshot creation time, and e.applied ≥ last data Apply index, so SetDurableAppliedIndex always moves the meta key forward or stays equal. ✅

pebbleStore.SetDurableAppliedIndex lock ordering — Sound

dbMu.RLock() (not dbMu.Lock()) is the correct lock for a single-key batch operation that doesn't swap the db pointer. No applyMu is needed: the raft apply loop is serial (engine.go:2292 seam runs on the same goroutine as Apply), so concurrent applyMutationsWithOpts + SetDurableAppliedIndex cannot occur. swapInTempDB holds dbMu.Lock(), which correctly blocks SetDurableAppliedIndex until the swap completes. ✅

🟠 Minor: Idle-cluster test spec needs ELASTICKV_RAFT_SNAPSHOT_COUNT override note

The B3 test spec (line 656) says:

"An idle-cluster test runs a 3-node cluster with no data writes for 2 * SnapshotCount * hlcRenewalInterval seconds"

With defaultSnapshotEvery = 10_000 (confirmed at engine.go:93) and hlcRenewalInterval = 1 s, that's 20 000 seconds — clearly not a runnable integration test. The spec should explicitly state that the test sets ELASTICKV_RAFT_SNAPSHOT_COUNT=10 (or overrides snapshotEvery via a test-only constructor) to make the scenario tractable. Without this note, the implementer will have no signal that a configuration override is expected.

Fix this →

🟢 Minor: Carry-forward annotation errors (unchanged from round-3)

These were round-3 minor nits and remain in round-4:

  • Wrong function name / line: §3 line 227 says "extends isReservedMetaKey (lsm_store.go:454)". Actual: isPebbleMetaKey at line 534.
  • applyMutationsWithOpts line: Doc says :1130; actual is :1292.
  • metaLastCommitTSBytes in applyMutationsWithOpts: Doc says :1162; actual is :1324.
  • deletePrefixAtWithOpts: Doc says :1196 (actual :1358) and :1231 (actual :1393).

These are cosmetic but will cause confusion when implementers look up the code. Fix before B2 opens.


Summary

Area Status
Round-4 codex P2 fix: AppliedIndexWriter + persistCreatedSnapshot bump ✅ Mechanically sound
Round-4 crash-ordering proof (bump-before-SaveSnap table) ✅ Correct
pebbleStore.SetDurableAppliedIndex lock ordering ✅ Correct
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) 🔴 Still present — blocking from round-3
Idle-cluster test spec (B3) needs ELASTICKV_RAFT_SNAPSHOT_COUNT override note 🟠 Minor
isReservedMetaKey / line number annotations 🟢 Cosmetic carry-forward

The one blocking item before any implementation branch opens remains §5's import cycle, which requires SnapshotHeaderApplier.ApplySnapshotHeaderFromFile to replace the direct kv.ReadSnapshotHeader call in wal_store.go.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5eacc77c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +452 to +455
**Mechanism**: bump `metaAppliedIndex` to `snapshot.Metadata.Index`
when persisting a created snapshot, in `persistCreatedSnapshot`
(`internal/raftengine/etcd/engine.go:2683`), **before**
`e.persist.SaveSnap`:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cover the local snapshot path

This proposed fix only hooks persistCreatedSnapshot, but the HLC-lease-only case being fixed is produced by normal SnapshotCount snapshots, which currently go through maybePersistLocalSnapshotpersistLocalSnapshotpersistLocalSnapshotPayload and call persist.SaveSnap directly in internal/raftengine/etcd/wal_store.go:525, bypassing persistCreatedSnapshot entirely. With that path unchanged, periodic local snapshots can still persist at index X while metaAppliedIndex remains at the last data write Y < X, so the permanent full-restore fallback described above is not closed for the steady-state snapshots this optimization depends on.

Useful? React with 👍 / 👎.

bootjp added 2 commits June 3, 2026 13:22
Addresses claude[bot] round-4 blocking finding + codex round-4 P2.

claude[bot] BLOCKING (carry-forward from round-3 / round-4): §5
pseudocode `ceiling, cutover, err := kv.ReadSnapshotHeader(br)` from
`internal/raftengine/etcd/wal_store.go` would create an import cycle
because `kv/snapshot.go` and `kv/fsm.go` both already import
`github.com/bootjp/elastickv/internal/raftengine`.

Round-5 fix: restructure the SnapshotHeaderApplier seam so the file
open AND the kv.ReadSnapshotHeader call live inside
`kvFSM.ApplySnapshotHeaderFromFile(snapPath string) error`. The
wal_store.go applyHeaderStateOnSkip helper only ever sees the
interface and never names the kv package. This matches the existing
template used by ApplyIndexAware (which delivers entry.Index to kv
without importing kv from the engine).

codex round-4 P2 at :455: the round-4 snapshot-persist hook only
covers `persistCreatedSnapshot` (engine.go:2683), which fires only
for config snapshots (membership changes). The steady-state
SnapshotCount-triggered hot path is:

  maybePersistLocalSnapshot (engine.go:2070)
   -> e.persistLocalSnapshotPayload (engine.go:4032)
   -> free persistLocalSnapshotPayload (wal_store.go:519)
   -> persist.SaveSnap (wal_store.go:525)

This path bypasses persistCreatedSnapshot entirely, so the round-4
fix didn't actually close the round-3 P2 gap for the snapshots the
optimisation depends on.

Round-5 fix: add a second SetDurableAppliedIndex hook in the engine
wrapper `e.persistLocalSnapshotPayload` (engine.go:4032), placed
under `e.snapshotMu.Lock()` and BEFORE the free-function call. The
hook lives in the engine wrapper, not in the free function, so the
free function stays signature-stable and reusable from tests that
bypass the engine. The B2 row of the Implementation Plan now
enumerates both hook sites explicitly.

Minor: idle-cluster test spec annotated with the
ELASTICKV_RAFT_SNAPSHOT_COUNT=10 override (default 10000 +
hlcRenewalInterval=1s would require >= 20 000 seconds, not runnable).

Cosmetic: line-number annotations corrected per claude's verification
against origin/main HEAD:
  - isReservedMetaKey -> isPebbleMetaKey at lsm_store.go:534
  - applyMutationsWithOpts at lsm_store.go:1292 (was :1130)
  - applyMutationsWithOpts meta bundle at :1324 (was :1162)
  - deletePrefixAtWithOpts at lsm_store.go:1358 (was :1196)
  - deletePrefixAtWithOpts meta bundle at :1393 (was :1231)

Header status bumped to Round 5. Round-4 retraction section appended
with grep-verified diagnosis of both round-4 issues + meta-lesson on
verifying import direction before pseudocoding a cross-package call
site.

Doc-only PR; no code change.
…samples

Round-5 corrected the body-prose line numbers
(:1162 -> :1324, :1231 -> :1393) but missed the comments inside the
§2 code samples that colocate the new metaAppliedIndex Set with the
existing metaLastCommitTSBytes Set. Update those comments too so
implementers reading the §2 pseudocode see consistent line numbers.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-5 push: fea230b4 (incorporates d17a4f22)

Addresses claude[bot] round-4 BLOCKING + codex round-4 P2 at :455:

🔴 § 5 import cycle (claude carry-forward, BLOCKING)

Verified on origin/main: kv/snapshot.go and kv/fsm.go both import "github.com/bootjp/elastickv/internal/raftengine", so internal/raftengine/etcd/wal_store.go cannot import kv. The round-3 / round-4 pseudocode kv.ReadSnapshotHeader(br) from wal_store.go would fail compilation.

Round-5 fix: push the file open AND the ReadSnapshotHeader call inside kvFSM.ApplySnapshotHeaderFromFile(snapPath). wal_store.go.applyHeaderStateOnSkip only ever sees the interface:

// wal_store.go — never imports kv
func applyHeaderStateOnSkip(fsm StateMachine, snapPath string) error {
    setter, ok := fsm.(SnapshotHeaderApplier)
    if !ok { return nil }
    return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
}

// kv/fsm.go — kv.ReadSnapshotHeader stays inside kv
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string) error { ... }

This matches the existing ApplyIndexAware template (delivers entry.Index to kv without engine importing kv).

🟡 codex round-4 P2 at :455 (local snapshot path bypass)

Verified on origin/main: maybePersistLocalSnapshot (engine.go:2070) → e.persistLocalSnapshotPayload (engine.go:4032) → free persistLocalSnapshotPayload (wal_store.go:519) → persist.SaveSnap (:525). persistCreatedSnapshot (the round-4 hook) only fires for config snapshots — the SnapshotCount steady-state path bypasses it entirely.

Round-5 fix: add a second SetDurableAppliedIndex hook in the engine wrapper e.persistLocalSnapshotPayload, under e.snapshotMu.Lock(), before the free-function call:

func (e *Engine) persistLocalSnapshotPayload(index uint64, payload []byte) error {
    e.snapshotMu.Lock()
    defer e.snapshotMu.Unlock()
    // ... existing index <= current.Metadata.Index short-circuit ...
    if w, ok := e.fsm.(AppliedIndexWriter); ok {
        if err := w.SetDurableAppliedIndex(index); err != nil {
            return errors.WithStack(err)
        }
    }
    _, err = persistLocalSnapshotPayload(e.storage, e.persist, index, payload)
    // ... existing error switch + purge ...
}

Hook lives in the engine wrapper, not the free function, so the free function stays signature-stable and test-reusable. B2 row now enumerates both hook sites explicitly.

🟠 B3 idle test tractability

claude flagged: default SnapshotCount=10000 × hlcRenewalInterval=1s = 20 000 s idle period — not runnable. Round-5 spec explicitly sets ELASTICKV_RAFT_SNAPSHOT_COUNT=10 to bring the test budget to 20 s.

🟢 Cosmetic line numbers (verified against origin/main)

Symbol Round-3 / round-4 cited Actual (round-5)
isReservedMetaKey lsm_store.go:454 isPebbleMetaKey at :534
applyMutationsWithOpts :1130 :1292
applyMutationsWithOpts meta bundle :1162 :1324
deletePrefixAtWithOpts :1196 :1358
deletePrefixAtWithOpts meta bundle :1231 :1393

Round-4 retraction section appended documenting both round-4 issues with grep-verified diagnoses and the meta-lesson: when a §X seam touches a package boundary, verify import direction explicitly before pseudocoding the call site (round-5's fix structurally mirrors the existing ApplyIndexAware template that round-3 / round-4 overlooked).

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 3m 31s —— View job


Round-5 Review

Todo List:

  • Load review tools
  • Read updated design document
  • Verify §5 import cycle fix
  • Verify local snapshot path fix
  • Verify B3 idle test tractability
  • Verify cosmetic line number corrections
  • Post final review

All round-4 blocking and significant items have been resolved. Verified against origin/main.


✅ Round-4 Critical 1 — Import cycle: Resolved

wal_store.go's applyHeaderStateOnSkip now calls setter.ApplySnapshotHeaderFromFile(snapPath) via the SnapshotHeaderApplier interface only. kv.ReadSnapshotHeader lives exclusively inside kvFSM.ApplySnapshotHeaderFromFile in the kv package. wal_store.go never imports kv. The shape is structurally identical to the existing ApplyIndexAware pattern — a clean precedent. ✅

Import direction confirmed:

  • kv/snapshot.go and kv/fsm.go both import internal/raftengine (confirmed in source)
  • wal_store.go sees only SnapshotHeaderApplier from internal/raftengine/statemachine.go — no kv import needed

✅ Round-4 Critical 2 — Local snapshot path bypass: Resolved

Two persist sites are now correctly identified and both get the SetDurableAppliedIndex hook:

Site Function Engine.go line Confirmed
Config-snapshot persistCreatedSnapshot 2679
Steady-state SnapshotCount e.persistLocalSnapshotPayload 4032

The free function persistLocalSnapshotPayload at wal_store.go:519 (calling persist.SaveSnap at :524) stays signature-stable. The hook correctly lives in the engine wrapper e.persistLocalSnapshotPayload, which already holds e.snapshotMu.Lock() before the free function call. Crash-ordering proof verified: the early-return guard if index <= current.Metadata.Index { return nil } ensures SetDurableAppliedIndex only fires for index values that are strictly greater than the current snapshot pointer, so metaAppliedIndex only moves forward.


✅ B3 Idle Test Tractability: Resolved

The test spec now explicitly sets ELASTICKV_RAFT_SNAPSHOT_COUNT=10 (env var confirmed at engine.go:94 as snapshotEveryEnvVar = "ELASTICKV_RAFT_SNAPSHOT_COUNT", defaultSnapshotEvery = 10_000 at engine.go:93). At ELASTICKV_RAFT_SNAPSHOT_COUNT=10 the idle window is 20 seconds — tractable.


✅ Cosmetic Line Number Corrections: All Verified

Symbol Round-3/4 claimed Round-5 corrected Actual (origin/main)
Function name isReservedMetaKey isPebbleMetaKey lsm_store.go:534
applyMutationsWithOpts :1130 :1292 lsm_store.go:1292
metaLastCommitTSBytes in applyMutationsWithOpts :1162 :1324 lsm_store.go:1324
deletePrefixAtWithOpts :1196 :1358 lsm_store.go:1358
metaLastCommitTSBytes in deletePrefixAtWithOpts :1231 :1393 lsm_store.go:1393

🟢 Concurrent write safety between apply path and snapshot persist path

One thing not explicitly addressed in the doc but worth confirming: applyMutationsWithOpts (apply goroutine) and SetDurableAppliedIndex (snapshot goroutine under snapshotMu) both write metaAppliedIndex and both hold only dbMu.RLock(). They can execute concurrently at the Pebble level. This is safe because:

  • Both writes move metaAppliedIndex to a value ≥ the snapshot index being used for the skip check
  • Snapshot index ≤ applied ≤ any concurrent apply-path index
  • The last-writer-wins outcome always satisfies have >= want in fsmAlreadyAtIndex

No change needed — just confirming soundness.


🟢 Minor: persistCreatedSnapshot line annotation

The B2 implementation plan row and §6 cite persistCreatedSnapshot at engine.go:2683. The function definition is at engine.go:2679; line 2683 is the e.persist.SaveSnap(snap) call inside it. Same class of annotation precision as the fixes already applied. Cosmetic only.

🟢 Minor: wal_store.go:525 in Round-4 retraction section

The Round-4 retraction section says persist.SaveSnap at wal_store.go:525. Actual location is wal_store.go:524. One-line off. Cosmetic.


Summary

Area Status
§5 import cycle — ApplySnapshotHeaderFromFile via SnapshotHeaderApplier ✅ Resolved
§6 local snapshot path — e.persistLocalSnapshotPayload hook ✅ Resolved
B3 idle test tractability — ELASTICKV_RAFT_SNAPSHOT_COUNT=10 ✅ Resolved
Cosmetic line number annotations ✅ All corrected
Concurrent write safety (apply path vs snapshot path on metaAppliedIndex) ✅ Sound
persistCreatedSnapshot line annotation (2683 vs 2679) 🟢 Cosmetic
wal_store.go:525 in retraction section (actual :524) 🟢 Cosmetic

The design is ready for implementation. All blocking and significant issues from rounds 1–4 have been addressed. The two remaining items are single-line annotation imprecisions that can be fixed opportunistically but are not blockers. Branch 2 can proceed.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fea230b40a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +520 to +527
// Round-5: bump metaAppliedIndex BEFORE the free-function
// persistLocalSnapshotPayload (which calls persist.SaveSnap at
// wal_store.go:525). Lives in the engine wrapper so the free
// function stays signature-stable and is reusable from tests
// that bypass the engine. Skipped silently when the FSM does
// not implement AppliedIndexWriter (legacy fakes / test shims).
if w, ok := e.fsm.(AppliedIndexWriter); ok {
if err := w.SetDurableAppliedIndex(index); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move the applied-index bump into the snapshot payload

In the current local-snapshot workflow, persistLocalSnapshot writes the FSM snapshot file/token before it calls this wrapper, so this hook updates only the live Pebble DB and not the snapshot body that was just serialized. If the next cold start takes any conservative full-restore path (missing/corrupt meta key, read error, or restoring this snapshot elsewhere), that restore reinstalls the stale metaAppliedIndex=Y from the snapshot file while opening the WAL at snapshot index X, where the lease-only entries up to X are already compacted; the node then falls back on every restart again. The bump needs to be part of the state captured by the FSM snapshot (or the restore path must synthesize it from snapshot.Metadata.Index), not just written after the snapshot file has already been produced.

Useful? React with 👍 / 👎.

claude[bot] round-5 verdict flagged two 1-line annotation imprecisions
as 'cosmetic, not blockers':

  - persistCreatedSnapshot at engine.go:2683 -> actual function
    definition at :2679; :2683 is the SaveSnap call inside it
  - persist.SaveSnap at wal_store.go:525 -> actual call site at :524

Both are off-by-one references to internal implementation lines that
the implementer will look up anyway. Fixing them to spare the
implementer the verification step. No structural change.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18043f7c4c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if !ok {
return nil // FSM has no header state; skip is harmless.
}
return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Verify snapshot CRC before applying skipped header

When fsmAlreadyAtIndex is true, this skip path reads and applies HLC/cutover header state from snapPath without checking the token CRC. I checked the current openAndRestoreFSMSnapshot path in internal/raftengine/etcd/fsm_snapshot_file.go, and it explicitly compares the file footer to tok.CRC32C before calling fsm.Restore to avoid mutating FSM state from corrupt snapshot bytes. With the proposed ApplySnapshotHeaderFromFile(snapPath) shape, a corrupted/truncated .fsm file or mismatched token can silently install a bogus ceiling/cutover (or treat a corrupt short header as headerless) precisely in the skip case; thread tok.CRC32C through and verify the snapshot file before applying any header side effects.

Useful? React with 👍 / 👎.

Codex round-5 P1: the round-5 ApplySnapshotHeaderFromFile signature
took no tokenCRC and skipped all three of openAndRestoreFSMSnapshot's
fail-closed verifications:

  1. fsmMinFileSize -> ErrFSMSnapshotTooSmall
  2. footer-vs-tokenCRC -> ErrFSMSnapshotTokenCRC
  3. full-body CRC vs footer -> ErrFSMSnapshotFileCRC

A corrupt / wrong-token / truncated snapshot could silently install
bogus HLC ceiling / Stage 8a cutover -- or be parsed by
ReadSnapshotHeader as headerless legacy and silently apply
ceiling=0, cutover=0.

Round-6 fix:

  - applyHeaderStateOnSkip(fsm, snapPath, tokenCRC) -- threads
    tok.CRC32C through from restoreSnapshotState.
  - SnapshotHeaderApplier.ApplySnapshotHeaderFromFile(snapPath,
    tokenCRC) -- new signature.
  - kvFSM.ApplySnapshotHeaderFromFile mirrors
    openAndRestoreFSMSnapshot's three-step verification:
      1. fsmMinFileSize check
      2. readFSMFooter + compare to tokenCRC
      3. crc32 TeeReader over the whole payload; parse the header
         inside the tee'd reader so the header bytes are included
         in the computed CRC; drain remaining payload; compare to
         footer
    Side-effects (SetPhysicalCeiling + restoredCutover assignment)
    only run after all three checks pass.

  - §4 caller signature updated; §5 pseudocode rewritten; B3 test
    list extended with three CRC-corruption tests (one per failure
    mode); Implementation Plan B3 row revised; Round-5 retraction
    section appended documenting the cost note (full-file read on
    skip path still cheaper than restore which also reads + writes
    pebble) and the meta-lesson (mirror existing fail-closed
    checks when relocating side-effects).

  - Open Questions gains an entry for a future follow-up that
    persists HLC ceiling + cutover as durable meta keys, eliding
    the file read entirely. Out of scope here.

Doc-only PR; no code change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-6 push: d5a1bd27 — codex round-5 P1 fix.

You're right: I pre-declared the loop done at round-5 / round-5.1 but missed that the new ApplySnapshotHeaderFromFile skipped all three of openAndRestoreFSMSnapshot's fail-closed verifications (size, footer-vs-tokenCRC, full-body CRC). A corrupt / wrong-token / truncated snapshot could silently install bogus HLC ceiling / Stage 8a cutover. Verified the existing safety contract at internal/raftengine/etcd/fsm_snapshot_file.go — round-5 was structurally fine for the import cycle but lost the implicit verification that fsm.Restore was riding on.

Round-6 fix

Thread tok.CRC32C through:

// wal_store.go
return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)

// statemachine.go
type SnapshotHeaderApplier interface {
    ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error
}

// kv/fsm.go — mirrors openAndRestoreFSMSnapshot's three-step verification
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error {
    // 1. size check          -> ErrFSMSnapshotTooSmall
    // 2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC
    // 3. crc32 TeeReader over the WHOLE payload (header parsed inside the
    //    tee'd reader so its bytes are included) vs footer
    //    -> ErrFSMSnapshotFileCRC
    // ... only after all three pass: SetPhysicalCeiling + restoredCutover
}

Cost note

Step 3 reads the whole snapshot file on the skip path (~6 s for a 6 GiB FSM at 1 GiB/s SSD read), but is still strictly cheaper than the restore path it replaces — the same full-file read happens inside restoreAndComputeCRC, plus restore additionally writes a temp Pebble database via restoreBatchLoopInto (which is what the original 46 s measurement was dominated by). Skipping the writes preserves the bulk of the optimisation.

B3 tests added

Three CRC-corruption tests, one per failure mode (truncate / wrong-token / one-byte body flip) asserting the corresponding typed error AND that f.hlc / f.restoredCutover are unchanged.

Open Questions update

Added a future follow-up: persist HLC ceiling + cutover as durable meta keys (analogous to metaAppliedIndex) so the skip path can elide the file read entirely → cold-start collapses to single-digit ms. Deferred because round-6 is already correct + useful; this is a performance optimisation, not a correctness fix.

Round-5 retraction appended

Round-5 was the right structural fix for the import cycle but lost sight of the CRC verification it was implicitly inheriting from openAndRestoreFSMSnapshot. Lesson: when a §X seam takes over a side-effect previously gated by existing fail-closed checks, inventory those checks first and mirror them in the new path. Reading the existing restore path end-to-end before pseudocoding a replacement would have surfaced this.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 7m 11s —— View job


Round-6 Review

Todo List:

  • Load review tools
  • Read updated design document (round-6 state)
  • Read internal/raftengine/etcd/fsm_snapshot_file.go — verify existing CRC check contract
  • Read kv/fsm.go — verify Restore and restoredCutover handling
  • Read internal/raftengine/statemachine.go — verify ApplyIndexAware seam
  • Verify import graph (kv → internal/raftengine/etcd cycle safety)
  • Evaluate round-6 CRC threading design
  • Verify cost claim (file read only, no write → still faster than full restore)
  • Evaluate TeeReader approach for CRC verification
  • Check B3 test spec adequacy
  • Evaluate codex P2 (snapshot body bump) concern
  • Post final review

The round-6 CRC verification fix is mechanically correct and closes the final blocking issue from round-5. The retraction is well-documented. One implementation-planning gap in the §5 pseudocode needs resolution before Branch 3 opens; everything else is sound.


✅ Round-6 core — CRC verification: Correct

Verified against internal/raftengine/etcd/fsm_snapshot_file.go on origin/main:

The existing openAndRestoreFSMSnapshot (:264) performs three checks before calling fsm.Restore:

  1. Size: info.Size() < fsmMinFileSizeErrFSMSnapshotTooSmall.
  2. Footer vs tokenCRC: readFSMFooter(f, info.Size()) reads the 4-byte footer; compares to tokenCRCErrFSMSnapshotTokenCRC on mismatch.
  3. Full-body CRC: Seeks to 0, restoreAndComputeCRC wraps f in a TeeReader → CRC accumulator, calls fsm.Restore(br) (which parses the header bytes through the tee), drains remaining body via io.Copy(io.Discard, br), compares computed CRC to footer → ErrFSMSnapshotFileCRC.

The proposed ApplySnapshotHeaderFromFile mirrors all three steps exactly:

  • Step 1: same size check before any file I/O beyond the open.
  • Step 2: same readFSMFooter call, same footer != tokenCRC check.
  • Step 3: file.Seek(0, io.SeekStart)io.TeeReader(io.LimitReader(file, payloadSize), h)ReadSnapshotHeader(br) (consumes header bytes into CRC accumulator) → io.Copy(io.Discard, br) (drains body into CRC accumulator) → h.Sum32() != footer check.

The order is correct. The seek-back after reading the footer mirrors openAndRestoreFSMSnapshot:294. Side-effects (SetPhysicalCeiling, restoredCutover =) are placed after all three checks pass. The "MUST NOT mutate FSM state on any verification failure" invariant is satisfied.

Cost: Step 3 reads the full payload once. The doc's claim that this is strictly cheaper than the restore path (which also reads the payload AND writes a temp Pebble database) is correct — the restore-path's wall-clock is dominated by restoreBatchLoopInto's Pebble writes.


🟠 Implementation planning: kv/fsm.go cannot use unexported etcd symbols in the §5 pseudocode

The proposed ApplySnapshotHeaderFromFile lives in kv/fsm.go but references symbols that are unexported from internal/raftengine/etcd:

Symbol Actual location Exported?
statFSMFileError etcd/fsm_snapshot_file.go:438 ❌ no
fsmMinFileSize etcd/fsm_snapshot_file.go:31 ❌ no (= 4)
readFSMFooter etcd/fsm_snapshot_file.go:311 ❌ no
fsmFooterSize etcd/fsm_snapshot_file.go:27 ❌ no (= 4)
fsmRestoreReadAhead etcd/fsm_snapshot_file.go:37 ❌ no (= 1 MiB)
crc32cTable etcd/fsm_snapshot_file.go:51 ❌ no
ErrFSMSnapshotTooSmall etcd/fsm_snapshot_file.go:68 ✅ yes
ErrFSMSnapshotTokenCRC etcd/fsm_snapshot_file.go:61 ✅ yes
ErrFSMSnapshotFileCRC etcd/fsm_snapshot_file.go:57 ✅ yes

Import cycle check: internal/raftengine/etcd does NOT import kv (confirmed: engine.go imports only internal/raftengine, internal/encryption, internal/monoclock; wal_store.go imports no bootjp packages beyond cockroachdb/errors). Therefore kv can safely import internal/raftengine/etcd — no cycle. The exported error sentinels (ErrFSMSnapshotTooSmall etc.) are then accessible from kv/fsm.go.

For the unexported constants and helpers, Branch 3 has three options:

  1. Export the needed symbols from etcd (FsmFooterSize = 4, etc.) — minor naming churn.
  2. Move them to internal/raftengine (parent package) — cleanest architecture since both kv and etcd already import it.
  3. Reimplement locally in kv/fsm.goconst kFsmFooterSize = 4, var kCRC32CTable = crc32.MakeTable(crc32.Castagnoli), inline binary.Read for footer reading. Trivial duplication of ~10 lines.

The design doc should specify the choice. Option 3 is the least risky (no exported symbol churn, no parent-package change) and is straightforward given the values are all trivial constants or one-liner functions. B3's test assertion on ErrFSMSnapshotFileCRC etc. can use the exported sentinels via the kv → etcd import.

Fix this →


✅ Codex P2 (snapshot body) — SetDurableAppliedIndex is write-to-live-DB: sound under the design's goals

The codex comment at 2026-06-03T04:24:12Z (line 614) raises: SetDurableAppliedIndex writes to the live Pebble DB (not the .fsm file), so a conservative full restore that calls swapInTempDB would overwrite the live DB with the .fsm file content, losing the bumped metaAppliedIndex = X and potentially re-degrading the skip gate.

Verified on current source:

  • SetDurableAppliedIndex(X) runs before persist.SaveSnap(snap) → live DB gets X.
  • Live DB persists across normal restarts (no restore). On normal restarts, LastAppliedIndex() reads X → skip fires correctly.
  • After a conservative restore (swapInTempDB), live DB is replaced by .fsm file content → metaAppliedIndex regresses to Y (last data Apply, pre-bump).
  • The NEXT snapshot taken after restore calls SetDurableAppliedIndex(X+K) → live DB gets X+K → subsequent normal restarts skip correctly.
  • Edge case: crash between restore and next snapshot, with only lease entries in (X, X+K] → live DB has Y < X+K → full restore on next cold start.

This is a one-restart-per-conservative-restore degradation (same as pre-optimization behavior) in a narrow scenario (post-restore crash before next snapshot). Not a correctness issue. Not a regression versus the pre-optimization baseline (which always does full restore). The design is sound for its stated goal; the future optimization (durable meta keys for ceiling/cutover) would close this residual gap entirely. The doc's current treatment of this as "out-of-scope future work" is appropriate.


✅ All other round-5/6 verified items remain sound

Item Status
v2 header parity via ReadSnapshotHeader
file.Seek(0, io.SeekStart) before TeeReader pass ✅ correct, matches openAndRestoreFSMSnapshot:294
Headerless legacy (ceil=0, cutover=0) no-op path ceiling > 0 guard prevents HLC mutation
ErrSnapshotHeaderUnknownMagic fail-closed ✅ propagated before any side-effect
restoredCutover set alongside HLC ceiling ✅ mirrors kvFSM.Restore:523 exactly
Both snapshot persist hooks (Sites 1 + 2) ✅ both enumerated in B2 row
B3 idle-cluster test with ELASTICKV_RAFT_SNAPSHOT_COUNT=10 ✅ tractable
fsmAlreadyAtIndex bool-only return (conservative fallback)
B3 three CRC-corruption test cases ✅ cover all three typed errors
Open Questions: future durable meta key optimization ✅ well-scoped as non-blocking
isPebbleMetaKey name + line number corrections ✅ (carried from R3)

Summary

Area Status
Round-6 CRC verification (three-step) ✅ Correct — mirrors openAndRestoreFSMSnapshot exactly
TeeReader + seek pattern for full-body CRC pass ✅ Sound
Unexported etcd symbols in §5 pseudocode 🟠 Needs implementation note — trivially fixable (local constants or export)
Codex P2 (SetDurableAppliedIndex on live DB vs snapshot body) ✅ Sound for design goals; residual edge case is pre-optimization baseline
Import cycle safety (kv → internal/raftengine/etcd) ✅ Cycle-free — etcd does not import kv
B3 test spec (three CRC failure modes + idle cluster) ✅ Adequate

The one item needed before Branch 3 opens: add a note in §5 identifying the unexported symbols and specifying the resolution (local re-implementation or export). Everything else is ready for implementation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5a1bd270e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if err := setPebbleUint64InBatch(b, metaAppliedIndexBytes, idx); err != nil {
return errors.WithStack(err)
}
return errors.WithStack(b.Commit(s.raftApplyWriteOpts()))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Force the snapshot checkpoint write durable

When ELASTICKV_FSM_SYNC_MODE=nosync, raftApplyWriteOpts() is pebble.NoSync, so this checkpoint batch can return before the Pebble WAL record is fsynced. The immediately following SaveSnap can persist the raft snapshot pointer at X; a crash before Pebble flush leaves the pointer durable while metaAppliedIndex rolls back to Y < X, and because WAL replay starts at X, the compacted HLC leases cannot rebuild the key, so the steady-state skip can fall back forever again. The checkpoint that orders before SaveSnap needs pebble.Sync or must be captured in the snapshot payload, independent of the raft-apply sync mode.

Useful? React with 👍 / 👎.

}

// kv/fsm.go (new method on kvFSM) -- kv.ReadSnapshotHeader stays inside kv
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep CRC verification in an importable seam

This method is proposed in kv/fsm.go to avoid importing kv from wal_store.go, but the verifier symbols it uses below (fsmMinFileSize, readFSMFooter, ErrFSMSnapshotTokenCRC, crc32cTable, fsmRestoreReadAhead, statFSMFileError) are currently unexported in internal/raftengine/etcd/fsm_snapshot_file.go. Since kv already imports internal/raftengine, it cannot import the etcd subpackage without a cycle and cannot access those helpers as written; implementing this design either duplicates the snapshot verifier in kv or loses the round-6 CRC parity. The verification needs to live behind an exported/shared engine helper or a callback-based seam that kv can actually use.

Useful? React with 👍 / 👎.

Address codex round-6 P2 x 2:

P2 line 410 -- seam implementability: round 6 placed CRC verification
in kvFSM.ApplySnapshotHeaderFromFile using fsmMinFileSize,
readFSMFooter, crc32cTable, fsmRestoreReadAhead, statFSMFileError,
ErrFSMSnapshotTokenCRC, ErrFSMSnapshotFileCRC -- all unexported in
internal/raftengine/etcd/fsm_snapshot_file.go. Production code has
neither direction of import between kv and internal/raftengine/etcd
on origin/main (both are test-only). Round-6 pseudocode would not
compile.

Round-7 splits the SnapshotHeaderApplier interface into a parse phase
and an apply phase:

  type SnapshotHeaderApplier interface {
      ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error)
      ApplySnapshotHeader(ceiling, cutover uint64)
  }

wal_store.go.applyHeaderStateOnSkip orchestrates size + footer-vs-
tokenCRC + crc32 TeeReader using its own (unexported) helpers, hands
the tee'd reader to setter.ParseSnapshotHeader for the v1/v2 header
read + drain (kv.ReadSnapshotHeader still lives entirely inside kv),
verifies the computed CRC against the footer, and only then calls
setter.ApplySnapshotHeader. Side-effects are pure assignment after
the engine confirms all three checks passed.

No package imports change. kv keeps the header parser; etcd keeps
the CRC verifier. The two-phase split costs nothing in the skip path
because we are not trying to do a single-pass restore here.

P2 line 660 -- nosync durability: round 4 / 5 / 6 all wrote the
SetDurableAppliedIndex checkpoint with s.raftApplyWriteOpts(), which
returns pebble.NoSync under ELASTICKV_FSM_SYNC_MODE=nosync.

Crash sequence:
  1. SetDurableAppliedIndex(X) returns (Pebble WAL buffered).
  2. e.persist.SaveSnap(snap) returns durably (etcd snapshotter fsyncs).
  3. Crash before Pebble flush.

Post-restart: metaAppliedIndex=Y<X (last fsynced data Apply), snapshot
pointer=X, WAL compaction starts at X. The lost applies cannot
rebuild the meta key; fsmAlreadyAtIndex(X)=false forever. The codex
round-3 P2 scenario recurs permanently.

Round-7 pins pebbleStore.SetDurableAppliedIndex to pebble.Sync
unconditionally. The checkpoint is the only durable carrier of
metaAppliedIndex at this point; it must be at least as durable as
the snapshot pointer that immediately follows via SaveSnap. Cost: +1
extra fsync per snapshot persist (default SnapshotCount=10000;
negligible).

Also:
  - §6 'ELASTICKV_FSM_SYNC_MODE=nosync mode' subsection split into
    (a) per-entry data Apply (existing reasoning) and (b) snapshot
    checkpoint (new pebble.Sync rationale).
  - B2 Implementation Plan row updated: 'pebble.Sync unconditionally'.
  - B3 Implementation Plan row updated to reflect the two-phase
    SnapshotHeaderApplier shape.
  - Round-6 retraction section appended documenting both bugs +
    meta-lessons.

Status header bumped to Round 7. Doc-only PR; no code change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-7 push: 54632c2f — addressing codex round-6 P2 × 2.

🟡 P2 line 410 — seam implementability (Resolved)

You're right: fsmMinFileSize, readFSMFooter, crc32cTable, fsmRestoreReadAhead, statFSMFileError, ErrFSMSnapshotTokenCRC, ErrFSMSnapshotFileCRC are all unexported in internal/raftengine/etcd/fsm_snapshot_file.go. Independently verified that on origin/main there's no kv → internal/raftengine/etcd or internal/raftengine/etcd → kv edge in production code (both directions are test-only). Round-6 pseudocode would not compile.

Round-7 fix: split SnapshotHeaderApplier into a parse phase + apply phase.

type SnapshotHeaderApplier interface {
    ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error)
    ApplySnapshotHeader(ceiling, cutover uint64)
}

wal_store.go.applyHeaderStateOnSkip orchestrates size + footer-vs-tokenCRC + crc32 TeeReader (using its own private helpers), hands the tee'd reader to setter.ParseSnapshotHeader (which calls the still-in-kv-package ReadSnapshotHeader and drains), verifies the computed CRC against the footer, and only then calls setter.ApplySnapshotHeader. Side-effects are pure assignment after the engine confirms all three checks passed.

No package imports change. kv keeps the header parser; etcd keeps the CRC verifier. The two-phase split costs nothing in the skip path because we are not trying to do a single-pass restore here.

🟡 P2 line 660 — nosync durability (Resolved)

Round 4 / 5 / 6 all wrote SetDurableAppliedIndex with s.raftApplyWriteOpts(), which is pebble.NoSync under ELASTICKV_FSM_SYNC_MODE=nosync. Crash sequence:

  1. SetDurableAppliedIndex(X) returns (Pebble WAL buffered, not fsynced).
  2. e.persist.SaveSnap(snap) returns durably (etcd snapshotter fsyncs internally).
  3. Crash before Pebble flush.

Post-restart: metaAppliedIndex = Y < X (last fsynced data Apply), snapshot pointer = X, WAL compaction starts at X, lost lease/data applies cannot rebuild the meta key, fsmAlreadyAtIndex(X) returns false forever — the round-3 P2 scenario recurring permanently rather than for a trailing window.

Round-7 fix: pin pebbleStore.SetDurableAppliedIndex to pebble.Sync unconditionally. The checkpoint is the only durable carrier of metaAppliedIndex at this point; it must be at least as durable as the snapshot pointer that immediately follows via SaveSnap. Cost: +1 extra fsync per snapshot persist (default SnapshotCount=10000; negligible).

§6 nosync subsection now splits into (a) per-entry data Apply (existing raft-WAL-as-source-of-truth reasoning) and (b) snapshot checkpoint (new pebble.Sync rationale).

Lessons (Round-6 retraction section appended)

  • Before pseudocoding cross-package helper use, inventory each symbol's export status. Symbols that look obvious from inside a package are often private when crossed.
  • A durability boundary that depends on a sync-mode knob must be examined separately for each write category. The raft-WAL-as-source-of-truth argument applies only to per-entry applies, never to writes that establish a checkpoint other code paths fsync past.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 5m 11s —— View job


Round-7 Review

Todo List:

  • Load review tools
  • Read updated design document (round-7 state)
  • Read relevant source files (fsm_snapshot_file.go, lsm_store.go, kv/fsm.go, kv/snapshot.go)
  • Evaluate two-phase SnapshotHeaderApplier split (P2 line 410)
  • Verify pebble.Sync durability argument for SetDurableAppliedIndex (P2 line 660)
  • Check for new issues introduced
  • Post final review

Both round-6 P2 fixes are correct. No new blocking issues. One minor documentation carry-forward needs fixing before B3 opens.


✅ P2 line 410 — two-phase SnapshotHeaderApplier split: Sound

The round-7 split correctly resolves the implementability constraint. Verified against origin/main:

  • wal_store.go (package etcd) calls only setter.ParseSnapshotHeader(tee) and setter.ApplySnapshotHeader(ceiling, cutover) — both on the SnapshotHeaderApplier interface defined in internal/raftengine/statemachine.go. No kv import needed. ✅
  • All private helpers (fsmMinFileSize, fsmFooterSize, crc32cTable, readFSMFooter, statFSMFileError, ErrFSMSnapshotTooSmall, ErrFSMSnapshotTokenCRC, ErrFSMSnapshotFileCRC) stay in internal/raftengine/etcd/fsm_snapshot_file.go where they already live. ✅
  • kvFSM.ParseSnapshotHeader calls kv.ReadSnapshotHeader(br) — stays inside kv, no cross-package reference to etcd. ✅
  • kvFSM.ApplySnapshotHeader is pure field assignment (f.hlc.SetPhysicalCeiling, f.restoredCutover =) — correctly mirrors the two side-effects of kvFSM.Restore:517-523. ✅

TeeReader CRC coverage verified end-to-end:

applyHeaderStateOnSkip:
  tee = io.TeeReader(io.LimitReader(file, payloadSize), h)
  ceiling, cutover, perr := setter.ParseSnapshotHeader(tee)
    // Inside ParseSnapshotHeader:
    br := bufio.NewReaderSize(r/*=tee*/, 1<<20)
    ReadSnapshotHeader(br) → reads header bytes from br → pulls from tee → hashed by h
    io.Copy(io.Discard, br) → drains rest from br → pulls from tee → hashed by h
    // LimitReader exhausted: exactly payloadSize bytes through h
  if h.Sum32() != footer { → ErrFSMSnapshotFileCRC }
  setter.ApplySnapshotHeader(ceiling, cutover) ← called only after all three checks pass

All payload bytes flow through the TeeReader. h.Sum32() covers the full payload (identical boundary to restoreAndComputeCRC). The headerless-legacy path (ReadSnapshotHeader returning (0, 0, nil) with bytes left in br) is also covered: io.Copy(io.Discard, br) drains the peeked bytes through the TeeReader. ✅

Contract alignment with openAndRestoreFSMSnapshot:

Step openAndRestoreFSMSnapshot applyHeaderStateOnSkip (round-7)
Size check info.Size() < fsmMinFileSize Same
Footer vs token readFSMFooter + footer != tokenCRC Same
Full-body CRC restoreAndComputeCRC TeeReader over payload TeeReader via ParseSnapshotHeader + drain
Header side-effects fsm.RestoreReadSnapshotHeader + f.hlc.SetPhysicalCeiling + f.restoredCutover = ParseSnapshotHeader + ApplySnapshotHeader
State mutation on any failure ❌ (error aborts before fsm.Restore writes anything durable) ❌ (ApplySnapshotHeader only after all checks pass)

Shape is structurally identical to the existing ApplyIndexAware seam precedent: engine orchestrates, FSM implements the kv-package half. ✅


✅ P2 line 660 — pebble.Sync unconditionally for SetDurableAppliedIndex: Sound

The durability analysis in §6(b) is correct and complete. Verified:

  • raftApplyWriteOpts() (lsm_store.go:1257) returns s.fsmApplyWriteOpts, which is pebble.NoSync under ELASTICKV_FSM_SYNC_MODE=nosync. ✅ (round-7 correctly avoids this helper)
  • directApplyWriteOpts() (lsm_store.go:1251) returns pebble.Sync always, with the same rationale documented as for SetDurableAppliedIndex: "direct callers do not have raft-log replay as a durability backstop." The snapshot checkpoint is in the same category. ✅
  • persistLocalSnapshotPayload (wal_store.go:519-524) is a free function; the engine wrapper e.persistLocalSnapshotPayload holds e.snapshotMu before calling SetDurableAppliedIndex. SetDurableAppliedIndex uses s.dbMu.RLock() only — no snapshotMu inside, no lock ordering conflict. ✅

Crash-ordering proof confirmed valid under nosync with pebble.Sync on checkpoint:

With SetDurableAppliedIndexpebble.Sync, the bump is fsynced before SaveSnap returns. The previously identified crash scenario (codex round-6 P2 line 660) is closed: there is no longer a state where snapshot pointer = X is durable while metaAppliedIndex < X. The permanent-fallback regression is impossible. ✅

Nice side-effect under nosync mode: A pebble.Sync commit flushes the shared Pebble WAL, making any preceding pebble.NoSync data Apply batches durable as a side-effect. The snapshot checkpoint therefore also promotes any un-fsynced raft-apply entries into the fsync boundary — a conservative bonus, not relied upon by correctness.


🟢 Minor: Stale ApplySnapshotHeaderFromFile reference in B3 test spec

The B3 test row (around line 892) still says:

"assert the specific typed error surfaces from ApplySnapshotHeaderFromFile WITHOUT mutating f.hlc or f.restoredCutover"

ApplySnapshotHeaderFromFile was the round-6 single-method name. Round-7 replaced it with ParseSnapshotHeader + ApplySnapshotHeader. The test scenario is still correct — inject corruption, verify typed error from applyHeaderStateOnSkip, verify FSM state unchanged — but the test surface function name should be updated to reference applyHeaderStateOnSkip (or the full skip path through restoreSnapshotState) rather than the no-longer-existing ApplySnapshotHeaderFromFile. Purely cosmetic; won't block B3 opening.

Fix this →


✅ All prior blocking/significant items confirmed resolved

Area Status
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) ✅ Resolved — two-phase split keeps both sides in their home package
§5 v2 header parity (v1-only probe was round-5 bug) ReadSnapshotHeader handles v1/v2/headerless/unknown-magic
§5 restoredCutover set alongside HLC ceiling on skip ApplySnapshotHeader(ceiling, cutover) sets both
§5 seam implementability (unexported etcd symbols) ✅ CRC verifier stays in etcd; parser stays in kv; no new exports needed
§6 SetDurableAppliedIndex under nosync (round-6 P2 line 660) pebble.Sync unconditionally
§6 both snapshot persist sites (round-5 fix) ✅ Site 1 persistCreatedSnapshot + Site 2 e.persistLocalSnapshotPayload
fsmAlreadyAtIndex conservative bool-only return ✅ Any error/missing/no-interface → false → full restore
OCC two-case idempotency argument (§7) ✅ Raw overwrite + OCC ErrWriteConflict non-HaltApply
B3 idle-cluster test (ELASTICKV_RAFT_SNAPSHOT_COUNT=10) ✅ Tractable (20 s window)
B3 three CRC-corruption test cases ✅ One per typed error, state unchanged assertion
isPebbleMetaKey / line number annotations ✅ All corrected in round-5

Summary

The design is ready for implementation. All blocking issues from rounds 1–6 have been addressed. The one remaining item (B3 test spec method name) is cosmetic and can be fixed opportunistically. Branch 2 can proceed.

claude[bot] round-7 verdict flagged one cosmetic carry-forward: the B3
test row at line 893 still cites the round-6 single-method
ApplySnapshotHeaderFromFile. Round-7 split that into ParseSnapshotHeader
+ ApplySnapshotHeader; the test exercise surface from the implementer's
viewpoint is applyHeaderStateOnSkip (or restoreSnapshotState with
fsmAlreadyAtIndex returning true). Rename to reflect the round-7 shape;
the test scenarios (inject corruption, assert typed error, assert FSM
state unchanged) are unchanged.
@bootjp bootjp merged commit 6ffdbb2 into main Jun 3, 2026
9 checks passed
@bootjp bootjp deleted the docs/idempotent-snapshot-restore-design branch June 3, 2026 11:52
bootjp added a commit that referenced this pull request Jun 4, 2026
…oth snapshot persist sites (#915)

## Summary

Implements **Branch 2** of the cold-start snapshot-restore skip
optimisation designed in PR #910. After this lands the
`metaAppliedIndex` Pebble meta key is durably written on every
raft-Apply data mutation AND at every snapshot persist — but the skip
gate itself (Branch 3) is NOT yet wired, so behaviour is observationally
identical to `main` except for the new meta key in fsm.db. Branch 2 is
meant to soak in production for at least one release before Branch 3
enables the skip; this PR is intentionally a no-op-from-the-outside
change with comprehensive plumbing.

## Reading order (6 commits, designed to review one-at-a-time)

| # | commit | scope |
|---|---|---|
| 1 | `2339a6f2` | raftengine: opt-in interfaces (`AppliedIndexReader` /
`AppliedIndexWriter`) |
| 2 | `525fc152` | pebbleStore: `metaAppliedIndex` const +
`LastAppliedIndex` + `SetDurableAppliedIndex` (with `pebble.Sync`
UNCONDITIONALLY) |
| 3 | `aa9b8acc` | `MVCCStore` interface extension:
`ApplyMutationsRaftAt` / `DeletePrefixAtRaftAt` overloads, threading
appliedIndex through `applyMutationsWithOpts` + `deletePrefixAtWithOpts`
|
| 4 | `7cd72bda` | kvFSM seam wiring: `AppliedIndexReader()` /
`SetDurableAppliedIndex()` accessors + all 7 data-Apply leaves switched
to `*RaftAt` with `f.pendingApplyIdx` |
| 5 | `f1e8748c` | engine hooks at BOTH snapshot persist sites:
`persistCreatedSnapshot` + `e.persistLocalSnapshotPayload` call
`SetDurableAppliedIndex` BEFORE `persist.SaveSnap` |
| 6 | `2c42f7d6` | tests (10 new tests across store + engine) |

## Design constraints honoured

All from `docs/design/2026_06_02_idempotent_snapshot_restore.md`:

- **§2 "Why both leaves"**: meta key bundle in BOTH
`applyMutationsWithOpts` AND `deletePrefixAtWithOpts` so DEL_PREFIX
entries don't silently leave `LastAppliedIndex` behind. Tested by
`TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex`.
- **§3 `dbMu.RLock()`**: both `LastAppliedIndex` and
`SetDurableAppliedIndex` acquire the read-lock, matching the
lock-ordering discipline at `lsm_store.go:153 / :553 / :675`.
- **§4 fallback policy**: `AppliedIndexReader()` returns nil when the
store doesn't implement the seam; `LastAppliedIndex` returns `(0, false,
nil)` for missing OR truncated meta key. Branch 3 will then fall back to
full restore conservatively.
- **§6 `ELASTICKV_FSM_SYNC_MODE=nosync` mode**: `SetDurableAppliedIndex`
is **pinned to `pebble.Sync` unconditionally**. Rationale documented at
length in the method's doc-comment — once `persist.SaveSnap` returns,
WAL compaction discards every log entry ≤ `snap.Metadata.Index`, so
there's no source to replay the meta key bump from. +1 fsync per
snapshot persist (rare; default `SnapshotCount=10000`). Tested by
`TestSetDurableAppliedIndex_UsesPebbleSync`.
- **§6 "HLC lease entries — checkpoint at snapshot persist"**: BOTH
`persistCreatedSnapshot` (config snapshots) AND
`e.persistLocalSnapshotPayload` (steady-state `SnapshotCount`-triggered
hot path) call the hook. Both crash-ordering tested by
`TestPersistCreatedSnapshot_*`.
- **§8 compatibility**: `StateMachine.Apply`'s public signature is
unchanged. New interfaces are opt-in. Old call sites
(`ApplyMutationsRaft` without `*At`) still work, just pass
`appliedIndex=0` to opt out of the meta key bump.

## Test results

```
go vet ./...                                                    → 0 issues
go test ./store/ -short                                         → ok 29.4s
go test ./kv/    -short                                         → ok 10.4s
go test ./internal/raftengine/... -short                        → ok 32.8s
go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s
go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s
```

10 new tests added (see commit `2c42f7d6` for the full inventory).

## What this does NOT do

- **Does NOT enable the skip gate.** `restoreSnapshotState` still always
restores. Branch 3 wires the `fsmAlreadyAtIndex` check +
`applyHeaderStateOnSkip` + the two-phase `SnapshotHeaderApplier` seam.
- **Does NOT change `HEALTH_TIMEOUT_SECONDS=300`.** Branch 4 lowers it
once Branch 3 has soaked.
- **Does NOT touch the snapshot-install hot path**
(`Engine.applySnapshot`) per Non-Goals in the design.

## Soak plan

Branch 2 should run in production for at least one release before Branch
3 opens. Operators can verify the meta key is being written via:

```bash
# Inspect a pebble fsm.db (read-only)
ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex
# Expected: 8 little-endian bytes equal to the current applied index
```

## Refs

- PR #910 (design) — round 1..7 design history + retraction sections
explaining the design constraints this PR honours
- PR #909 — `HEALTH_TIMEOUT_SECONDS` band-aid that this series
eventually obviates


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Durable tracking of Raft-applied indexes to ensure consistent
snapshot/save ordering.

* **Bug Fixes**
* Improved snapshot persistence reliability by pinning durable applied
index before snapshot writes.
* Stronger durability for writes bundled with Raft entry indices,
reducing restore/recovery surprises.

* **Tests**
* Added comprehensive tests covering applied-index ordering, failure
handling, and persistence behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant