docs(design): idempotent FSM snapshot restore on cold start#910
Conversation
Companion to PR #909 (HEALTH_TIMEOUT_SECONDS 60s -> 300s). PR #909 is a band-aid: the 60s ceiling kept tripping because restoreSnapshotState unconditionally calls fsm.Restore on every cold start when the WAL has a non-empty snapshot pointer, which on the pebbleStore backend dispatches to restorePebbleNativeAtomic + swapInTempDB and rewrites the entire FSM into a sibling temp directory before renaming it into place. For multi-GiB FSMs this is ~46 s on the 192.168.0.x cluster's hardware. The restore is mostly redundant: each successful fsm.Apply already persists its mutation durably, so after the FSM applied entry Y > snapshot.Metadata.Index = X, the on-disk fsm.db contains state >= X. On the next cold start we tear that state down and rebuild it from the older snapshot just to have raft replay the same entries we already had on disk. Proposed structure: 1. StateMachine.Apply(index, data) — thread the raft applied-index into the FSM so kvFSM can persist it atomically. 2. pebbleStore bundles a meta key (last_applied_index) in the same pebble.Batch as the data mutation. Pebble batches are atomic, so the index and the data move together; no torn-write window. 3. restoreSnapshotState gains an AppliedIndexReader gate: skip when stored.LastAppliedIndex >= snapshot.Metadata.Index. Fall back to the current full restore when the FSM cannot self-report or the meta key is missing (first restart after upgrade). Expected cold start in the steady-state case: pebble.Open + WAL replay + raft follower-ization + grpc bind, under 5 s independent of FSM size. Doc-only PR; no code change yet. Implementation is sequenced as four follow-up PRs (interface change -> meta key -> skip wiring -> metrics + HEALTH_TIMEOUT_SECONDS tightening), each individually reviewable.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughRound‑6 proposal: expands the idempotent FSM snapshot-restore design to thread tok.CRC32C through the skip path, add CRC-verified header application on skip, and persist a durable applied-index meta-bump at snapshot persist sites so the conservative skip gate is provably safe. ChangesDesign Proposal: Idempotent FSM Snapshot Restore (Round 6)
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review PR は docs/design/2026_06_02_idempotent_snapshot_restore.md の追加のみ (コード変更なし)。PR #909 の band-aid に対する根本原因側の設計提案です。 レビュー観点:
|
|
Claude finished @bootjp's task in 5m 2s —— View job Design Review: Idempotent FSM Snapshot Restore on Cold StartTodo List:
ReviewOverall the design is well-motivated and the core mechanism is sound. The Pebble-batch atomicity argument for crash safety holds, and the observability story (metrics + INFO log) is well thought through. The issues below range from a significant architectural redundancy in B2, to a correctness gap in the error-handling path, to a precision issue in the idempotency argument. §3 — Crash-safety argument (Pebble batch atomicity)Verdict: Sound, with two caveats the doc should address. The batch-atomic meta key pattern is already in production in this codebase: Caveat 1 — Caveat 2 — Caveat 3 — §4 — Idempotency of replay after skipVerdict: The invariant holds, but the stated mechanism is imprecise for OCC transactions — the explanation needs correction. The design claims: "every mutation is keyed by its raft metadata; re-applying the same entry produces the same Pebble write." This is accurate for raw requests (where After first apply, This is nonetheless safe, for a different reason:
Recommended fix: Replace the single sentence in §4 with a two-case argument: (a) raw requests produce an idempotent Pebble overwrite of the same MVCC cell; (b) OCC operations whose effects are already present produce a non-fatal §1 — Interface change scope:
|
| Area | Status |
|---|---|
| §3 Crash-safety (batch atomicity core) | ✅ Sound — nosync/DEL_PREFIX/encryption caveats should be addressed |
| §4 Idempotency of re-apply | |
| §1 Interface change (B2) | ApplyIndexAware seam makes this potentially redundant — justify or drop |
Error handling in fsmAlreadyAtIndex |
❌ Pseudocode propagates error instead of falling back — must be corrected before implementation |
| B3/B4/B5 branch granularity | ✅ Well-chosen |
| HLC lease Open Question | ✅ Option 2 is the right call — commit to it in the doc |
There was a problem hiding this comment.
Code Review
This pull request proposes a design for idempotent FSM snapshot restores on cold starts to optimize startup times. The review feedback identifies several critical issues and design improvements: leveraging the existing ApplyIndexAware interface to avoid breaking public APIs, addressing a missing opcode dispatch for encryption, adding database read locks to prevent race conditions, and resolving a type assertion mismatch by using a decoupled pure-function injection pattern.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| ```go | ||
| // internal/raftengine/statemachine.go | ||
| type StateMachine interface { | ||
| Apply(index uint64, data []byte) any // was: Apply(data []byte) any | ||
| Snapshot() (Snapshot, error) | ||
| Restore(r io.Reader) error | ||
| } | ||
| ``` |
There was a problem hiding this comment.
既存の ApplyIndexAware インターフェースの活用について
StateMachine インターフェースを変更してすべてのインツリー実装やテストシムを修正する(提案されている Branch 2)代わりに、既存の ApplyIndexAware インターフェースを活用することを検討してください。
kvFSM はすでに SetApplyIndex(idx uint64) を通じて ApplyIndexAware を実装しており、同じゴルーチン上で Apply が呼び出される直前にインデックスを f.pendingApplyIdx に保存しています。Apply 内部で f.pendingApplyIdx を直接利用すれば、パブリックインターフェースを破壊することなく、実装計画を大幅に簡素化(Branch 2 の作業を完全に削減)できます。
| func (f *kvFSM) Apply(index uint64, data []byte) any { | ||
| if len(data) > 0 && data[0] == raftEncodeHLCLease { | ||
| return f.applyHLCLease(index, data[1:]) | ||
| } | ||
| ctx := context.TODO() | ||
| reqs, err := decodeRaftRequests(data) | ||
| if err != nil { return errors.WithStack(err) } | ||
| return f.applyAtIndex(ctx, index, reqs) | ||
| } |
There was a problem hiding this comment.
暗号化オプコード処理の欠落
提案されている Apply の実装では、applyReservedOpcode(data) の呼び出しが完全にバイパスされています。現在のコードベースにおいて、applyReservedOpcode は raftEncodeHLCLease と暗号化オプコード(0x03..0x07)の両方を処理するため非常に重要です。この処理をバイパスすると、Stage 6/7/8 の機能に必要な暗号化オプコードのディスパッチが機能しなくなります。新しい Apply のロジックでも applyReservedOpcode を通すか、暗号化ディスパッチが維持されるように設計を修正してください。
| func (s *pebbleStore) LastAppliedIndex() (uint64, bool, error) { | ||
| val, closer, err := s.db.Get(metaAppliedIndexKey) | ||
| if errors.Is(err, pebble.ErrNotFound) { | ||
| return 0, false, nil | ||
| } | ||
| if err != nil { return 0, false, errors.WithStack(err) } | ||
| defer closer.Close() | ||
| if len(val) != 8 { | ||
| return 0, false, errors.Newf("corrupt applied-index meta key: %d bytes", len(val)) | ||
| } | ||
| return binary.BigEndian.Uint64(val), true, nil | ||
| } | ||
| ``` |
| func fsmAlreadyAtIndex(fsm StateMachine, want uint64) (bool, error) { | ||
| r, ok := fsm.(AppliedIndexReader) | ||
| if !ok { | ||
| return false, nil // FSM cannot self-report; restore conservatively. | ||
| } | ||
| have, present, err := r.LastAppliedIndex() | ||
| if err != nil { return false, err } | ||
| if !present { return false, nil } | ||
| return have >= want, nil | ||
| } |
There was a problem hiding this comment.
kvFSM における型アサーションの不整合とインターフェース設計の改善
fsmAlreadyAtIndex 内で、fsm を AppliedIndexReader に型アサーションしようとしていますが、kvFSM 自体は AppliedIndexReader を実装していないため、この型アサーションは常に失敗します。
この結合を解決するために、kvFSM に新たなメソッドを追加してインターフェースを肥大化させるのではなく、純粋関数(例:func() uint64)を値として注入するアダプターやインターセプターパターンの採用を検討してください。これにより、ステートフルなインターフェースに不要なメソッドを追加することなく、インターフェースを軽量に保ちながら疎結合な設計を実現できます。
References
- When decoupling a generic interface from concrete implementations, prefer injecting pure functions as values (e.g., via an adapter or interceptor) instead of adding stateless methods to stateful interfaces, keeping the interfaces lightweight and focused on stateful operations.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdabb969a7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| to branch 3; the in-memory HLC state is reconstructed at startup from | ||
| the lease entries replayed by raft, so this is purely about getting | ||
| the index counter monotonic. |
There was a problem hiding this comment.
Preserve HLC snapshot state when skipping restore
In the cold-start path, loadWalState restores the FSM before opening the WAL at walSnapshotFor(snapshot), so after the proposed skip only entries after the persisted snapshot index are replayed. kvFSM.Restore is currently the place that reads the snapshot header and applies the HLC physical ceiling (kv/fsm.go:513-518), so if the last HLC lease is at or before the snapshot index and a later data entry advances the proposed Pebble meta key, this design skips Restore, does not replay that compacted lease, and restarts with no restored ceiling; a subsequently elected node can issue HLC timestamps below the snapshotted ceiling. The open question’s statement that startup reconstructs HLC from replayed lease entries is therefore unsafe unless the skip path separately reads/persists the HLC snapshot metadata or refuses to skip in this case.
Useful? React with 👍 / 👎.
Address: - codex P1 (HLC ceiling preservation when skipping body Restore) by adding §5 / applyHLCCeilingFromSnapshot / HLCCeilingSetter seam - gemini medium on dbMu locking by adding s.dbMu.RLock() to LastAppliedIndex pseudocode in §3 - claude §3 caveat 1 (ELASTICKV_FSM_SYNC_MODE=nosync) by splitting §6 crash-safety into Sync/NoSync subsections - claude §3 caveat 2 (DEL_PREFIX path builds independent batch) by threading appliedIndex into both applyMutationsWithOpts AND deletePrefixAtWithOpts in §2 + §6 - claude §3 error handling by collapsing fsmAlreadyAtIndex's signature to (bool) only; any error / missing / uncertainty -> false (full restore) per the strictly-additive guarantee - claude §4 OCC two-case argument by replacing §7 with raw vs OCC reasoning - claude §1 call-site count by correcting 8 -> 6 in §1's table Pivot the interface change: - Round-1 proposed breaking StateMachine.Apply(data) -> Apply(index, data) - Round-2 introduces a NEW ApplyIndexAware seam (does not exist on main today) so StateMachine.Apply's public signature is unchanged - This adopts the pattern reviewers wrongly believed already existed, but with grep-verified honesty about what is actually in main Add Errata section with grep-verified table of fabricated reviewer claims (ApplyIndexAware, SetApplyIndex, pendingApplyIdx, applyReservedOpcode, applyEncryption, WriteSidecar, encryption opcodes 0x03..0x07 — all zero matches at SHA 94579fc) so future readers understand which round-1 nits are real and which were imagined. No code change. PR remains docs-only.
|
round-2 push: Incorporated (real findings):
Pushed back with grep evidence (round-1 fabricated claims):
The reviewers' reasoning about avoiding a breaking @claude review |
|
Claude finished @bootjp's task in 6m 42s —— View job Claude finished @bootjp's task — Design Review — Round 2Todo List:
The round-2 revision has resolved every round-1 correctness issue and the overall design is now very sound. Two new findings from reading the current source require attention before implementation — one is a correctness bug in 🔴 Critical:
|
| Symbol | Errata claim | Actual state |
|---|---|---|
ApplyIndexAware |
0 matches at 94579fc0 |
Exists at internal/raftengine/statemachine.go:46 |
SetApplyIndex |
0 matches | Exists at kv/fsm.go:128 |
pendingApplyIdx |
0 matches | Exists at kv/fsm.go:53 |
applyReservedOpcode |
0 matches | Exists at kv/fsm.go:293, dispatches HLC-lease (0x02) AND encryption (0x03..0x07) |
applyEncryption |
0 matches | Exists at kv/fsm_encryption.go:78 |
WriteSidecar |
0 matches | Exists at internal/encryption/sidecar.go:240 |
| engine call site | "engine.go:2292 is failPending" |
engine.go:2292-2293 IS the ApplyIndexAware seam: if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok { aware.SetApplyIndex(entry.Index) } |
The round-1 reasoning about "avoid a breaking signature change" was adopted correctly, and the Errata correctly attributes the error. But the implementation plan needs updating:
- Branch 2 is done. The
ApplyIndexAwareinterface,kvFSM.SetApplyIndex, and the engine call site were wired by encryption Stages 6C/6E/7a. - The §1 table row "kvFSM.Apply: Add
SetApplyIndexstoring intopendingApplyIdx" is already complete. - Branch 3 can use
f.pendingApplyIdximmediately — it is already populated by the engine before everyApplycall. - The implementation plan branch numbering should shift: B2 → done (drop or mark as shipped), B3 → new B2, B4 → new B3, B5 → new B4.
The docs branch should also be rebased onto current main before B2 implementation starts, since the branch diverged before all the encryption commits and the kv/fsm.go on this branch already contains SetApplyIndex etc. (making the diff clean).
🟠 Significant: restoredCutover (Stage 8a §3.4) not set when skip fires
kvFSM.Restore sets f.restoredCutover = cutover (kv/fsm.go:522) as the "snapshot-to-applier handoff" for Stage 6E. When the skip path fires, this assignment is bypassed. In most restarts this is harmless: the WAL replay window (snapshot.Index, committed] will re-apply the EnableRaftEnvelope entry and the cutover is restored through applyEncryption. But if the EnableRaftEnvelope entry was at index ≤ snapshot.Metadata.Index (compacted before the snapshot), it won't be replayed and the Stage 6E activation hook won't fire on the next apply.
This is an existing coupling that the skip path introduces as a new failure mode. Recommended resolution (consistent with the C1 fix above): extend HLCCeilingSetter to also deliver the cutover, and have applyHLCCeilingFromSnapshot set both in one pass:
type HLCCeilingSetter interface {
SetHLCPhysicalCeiling(ms int64)
SetRestoredCutover(cutover uint64) // Stage 8a §3.4 handoff
}Or, if keeping the interface minimal, add a single ApplySnapshotHeader(snapPath string) error method to kvFSM that opens the file, calls ReadSnapshotHeader, and sets both f.hlc.SetPhysicalCeiling and f.restoredCutover — exactly mirroring what Restore does for the header section.
🟠 Significant: Encryption-opcode entries are a present gap, not a future one
The "Out of Scope (future)" note says "once Stages 6/7/8 land, they will need the same treatment as DEL_PREFIX." But applyEncryption for opcodes 0x03-0x07 is already live on main (dispatched through applyReservedOpcode). In a cluster that has run encryption rotations, those entries won't advance metaAppliedIndex (since applyEncryption calls WriteSidecar, not f.store.ApplyMutationsRaft). The same Option 2 reasoning as HLC lease entries applies — safe, conservative full restore — but the doc should state this as a current known gap rather than a future one. Consider a single paragraph under the HLC-lease Option 2 commitment:
"The same reasoning applies to encryption-opcode entries (0x03-0x07, already in-tree): they go through
applyEncryption/WriteSidecarand do not touchf.store.metaAppliedIndexwill not advance for encryption-only apply windows; the skip gate will conservatively fall back to full restore in those cases."
🟢 Minor
isPebbleMetaKey — function name correction: The doc references isReservedMetaKey (§3). The actual function in lsm_store.go:534 is isPebbleMetaKey. Branch 3 should extend isPebbleMetaKey.
kvFSM.Apply dispatch is now three-way: The §2 pseudocode comment says "leave dispatch (HLC-lease vs request) UNCHANGED." The actual applyReservedOpcode is now three-way: HLC-lease (0x02), encryption (0x03-0x07), MVCC requests. Update the comment.
B4 test spec — add v2-format coverage: The spec says "asserts that applyHLCCeilingFromSnapshot sets f.hlc.PhysicalCeiling() for both skip and restore paths." Once C1 is fixed, the test should also explicitly exercise the v2 snapshot format (hlcSnapshotMagicV2) to prevent regression.
Summary
| Area | Status |
|---|---|
| §3 Crash-safety (DEL_PREFIX + nosync, round-1 caveats) | ✅ Correctly addressed in round-2 |
| §4 OCC two-case idempotency argument | ✅ Correct and complete in round-2 |
| §5 HLC ceiling preservation — v2 snapshot correctness | 🔴 applyHLCCeilingFromSnapshot only checks v1 magic; silently skips ceiling for v2 snapshots |
§5 restoredCutover (Stage 8a §3.4) on skip path |
🟠 Not set when skip fires; EnableRaftEnvelope handoff may be lost if entry is compacted |
Error handling in fsmAlreadyAtIndex |
✅ Bool-only return, conservative fallback, round-1 critical fixed |
Branch 2 (ApplyIndexAware seam) |
🔴 Already exists on main (encryption work) — implementation plan needs updating |
| B3/B4/B5 branch granularity | ✅ Well-chosen; numbering shifts now that B2 is done |
| Encryption-opcode gap as "future" | 🟠 Opcodes are live; rephrase as a current known gap |
isPebbleMetaKey name |
🟢 Cosmetic |
| HLC Option 2 commitment | ✅ Correct and consistent with the code |
The two critical items (C1 — v2 HLC ceiling bug, C2 — B2 already done) should be resolved before the implementation branches are opened. Everything else is either already sound or a documentation polish.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df9580b6ae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !bytes.Equal(hdr[:8], hlcSnapshotMagic[:]) { | ||
| return nil // old format, no ceiling. |
There was a problem hiding this comment.
Preserve v2 snapshot header state when skipping restore
Fresh evidence beyond the earlier HLC concern: current snapshots can be emitted with the v2 EKVTHLC2 header (kv/snapshot.go's ReadSnapshotHeader returns both ceiling and cutover), but this skip-path helper only accepts the v1 hlcSnapshotMagic and returns nil for anything else. In clusters that have crossed the v2 cutover, skipping kvFSM.Restore would therefore apply neither the snapshotted HLC ceiling nor the restored cutover, so a restarted node can lose snapshot header state while believing the body restore was safely elided; the skip path should reuse/export the same header parser and setter semantics as kvFSM.Restore, not a 16-byte v1-only probe.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/design/2026_06_02_idempotent_snapshot_restore.md (1)
487-499: ⚡ Quick winAdd a fallback-reason dimension to metrics for lease-only false positives.
You already call out the HLC-lease-only case as safe-but-overrestoring; add a reason label/counter (e.g.,
reason="missing_meta|read_error|lease_only_gap|stale") so B5 can be gated on real cause distribution, not just skip rate.Based on learnings: Ensure FSM apply operations are idempotent and that committed entries cannot be lost; check crash-restart paths.
Also applies to: 553-563
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_02_idempotent_snapshot_restore.md` around lines 487 - 499, Add a "reason" dimension to the skip-gate/restore metrics and increment it with concrete values ("missing_meta", "read_error", "lease_only_gap", "stale") at the decision points so B5 can be gated by cause distribution; specifically, in the code paths around f.applyHLCLease (which is in-memory and doesn't touch f.store) and the skip-gate fallback logic, emit a metrics counter with reason="lease_only_gap" when the only committed entries are HLC lease ticks, and use the other reason labels at the respective error/metadata branches. Also audit FSM apply code (functions/methods that perform apply operations and rely on LastAppliedIndex) to ensure apply operations are idempotent and that committed entries are persisted across crash/restart paths (verify LastAppliedIndex handling and any f.store writes) so the metric reason accurately reflects true causes rather than lost commits.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_06_02_idempotent_snapshot_restore.md`:
- Around line 579-580: Update the B5 behaviour description to fix the wording
typo: locate the row for **B5** in the document (the line containing "Lower
`HEALTH_TIMEOUT_SECONDS` default once production data shows steady-state skip
rate ≥ 90 % | Tighter ceiling, but tracker the env override still honoured") and
change the phrase "but tracker the env override still honoured" to "but the env
override is still honoured" so the sentence reads clearly.
- Around line 553-569: Add fenced code block languages to the two examples so
markdown lint passes: update the metric fence to use a language (e.g. ```text)
and the log line fence to use a language (e.g. ```text). Locate the metric
example containing fsm_cold_start_restore_total and
fsm_cold_start_applied_index_gap and the log example containing the message
"restoreSnapshotState skipped (FSM at index %d, snapshot at %d, HLC ceiling
applied from header)" and prepend the appropriate language identifier to each
opening ``` fence.
---
Nitpick comments:
In `@docs/design/2026_06_02_idempotent_snapshot_restore.md`:
- Around line 487-499: Add a "reason" dimension to the skip-gate/restore metrics
and increment it with concrete values ("missing_meta", "read_error",
"lease_only_gap", "stale") at the decision points so B5 can be gated by cause
distribution; specifically, in the code paths around f.applyHLCLease (which is
in-memory and doesn't touch f.store) and the skip-gate fallback logic, emit a
metrics counter with reason="lease_only_gap" when the only committed entries are
HLC lease ticks, and use the other reason labels at the respective
error/metadata branches. Also audit FSM apply code (functions/methods that
perform apply operations and rely on LastAppliedIndex) to ensure apply
operations are idempotent and that committed entries are persisted across
crash/restart paths (verify LastAppliedIndex handling and any f.store writes) so
the metric reason accurately reflects true causes rather than lost commits.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 02558587-5367-43c0-a0eb-3bbbea5d536e
📒 Files selected for processing (1)
docs/design/2026_06_02_idempotent_snapshot_restore.md
… (v2)
Round-2 of this doc wrongly pushed back on gemini and claude review
findings as 'fabricated.' My grep was running against
test/event-driven-leader-readiness (27 commits behind origin/main) so
several origin/main-only entities (ApplyIndexAware, SetApplyIndex,
pendingApplyIdx, applyReservedOpcode, applyEncryption, WriteSidecar,
v2 snapshot header) returned zero matches in my local tree. The
reviewers were correct; round 2 was wrong.
Round 3 rebases against the actual origin/main:
- Drop the round-2 'create ApplyIndexAware seam' framing.
ApplyIndexAware already exists at
internal/raftengine/statemachine.go:46; engine.go:2292 already
calls aware.SetApplyIndex(entry.Index); kvFSM.SetApplyIndex at
kv/fsm.go:122 already stashes the index in f.pendingApplyIdx
(kv/fsm.go:53). Branch 2 shrinks to 'thread f.pendingApplyIdx
through the data-Apply leaves so applyMutationsWithOpts /
deletePrefixAtWithOpts can bundle metaAppliedIndex in the same
pebble.Batch as metaLastCommitTSBytes'.
- Address codex round-2 P1 (v2 snapshot header). The skip path
must preserve BOTH the HLC ceiling AND the Stage 8a cutover.
§5 swaps the round-2 v1-only 16-byte probe for the existing
kv.ReadSnapshotHeader(*bufio.Reader) parser, which handles v1,
v2, ErrSnapshotHeaderUnknownMagic, and headerless legacy
uniformly. SnapshotHeaderApplier is a new opt-in interface
matching kvFSM's existing Restore side-effects (SetPhysicalCeiling
+ restoredCutover).
- Keep applyReservedOpcode untouched in the data-Apply path.
Encryption opcodes 0x03..0x05 already consume f.pendingApplyIdx
via applyEncryption -> WriteSidecar; metaAppliedIndex correctly
does NOT advance for sidecar-only entries, leaving the skip
gate to conservatively fall back on Stage 6/7/8 maintenance
windows where the only entries are encryption ops.
- Drop the round-2 Errata table; add a Round-2 Retraction section
at the end with grep-verifiable origin/main file:line references
so future readers can see why round-2 was wrong and what round 3
rebased to.
- Fix coderabbit nits: 'text' language on the two fenced metric /
log code blocks; 'tracker the env override' typo in the B4 row
becomes 'the env override remains honoured'.
Doc-only. No code change.
|
round-3 push: What changedI owe you and the round-1 reviewers a retraction. Round-2's "Errata" pushed back on gemini-code-assist and claude[bot] as if their references to
The reviewers were right; round 2 was wrong. Round-3 substantive changes
ApologyThe lesson for me: always grep against @claude review |
|
Claude finished @bootjp's task in 3m 57s —— View job Round-3 Review Todo List:
Design Review — Round 3Verdict on the retraction: Correct and well-documented. Every entity the round-2 Errata dismissed as fabricated has been verified on
The Errata section is well-written; it includes the 🔴 Critical: Import cycle in §5
|
| Area | Status |
|---|---|
| All round-2 errata verified correct | ✅ |
| §1 Branch 2 scope correctly narrowed | ✅ |
§5 v1+v2+headerless parity via ReadSnapshotHeader |
✅ design shape correct |
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) |
🔴 Must use ApplySnapshotHeaderFromFile on the interface |
isPebbleMetaKey / lsm_store.go:534 wrong name+line in §3 |
🟠 Carry-over from Round 2 — fix before B2 |
| Line number annotations in §2 | 🟠 Minor — correct before doc is finalized |
| §6 crash-safety + OCC idempotency | ✅ |
§8 observability including fallback_reason label |
✅ |
| B2/B3/B4 branch granularity | ✅ |
The one blocking item before implementation can start is resolving the import cycle in §5 by using ApplySnapshotHeaderFromFile(snapPath string) error on SnapshotHeaderApplier instead of calling kv.ReadSnapshotHeader directly from wal_store.go. Everything else is either sound or a doc-annotation fix.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9753105440
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| between two snapshots are lease ticks, the skip gate will fall back | ||
| to full restore — safe and rare. Adding a synthetic pebble batch per |
There was a problem hiding this comment.
Account for periodic HLC leases before skipping restore
This fallback is not rare in the current system: RunHLCLeaseRenewal proposes an HLC lease every second while the node is leader (kv/coordinator.go:641-663, and the sharded path mirrors it), and applyHLCLease only mutates memory. Once a local snapshot is persisted after any lease-only tail, metaAppliedIndex remains at the last data mutation while snapshot.Metadata.Index includes those leases; on restart fsmAlreadyAtIndex will reject the skip, full restore will reinstall the same stale meta key from the snapshot, and opening the WAL at the snapshot index omits the compacted leases, so idle clusters can keep doing O(snapshot size) restores indefinitely. The design needs a way to advance the durable applied index for these no-store entries (or otherwise prove/header-cover them) rather than accepting a permanent fallback.
Useful? React with 👍 / 👎.
Codex round-3 P2 (line 438) correctly observed that the round-3 'safe
and rare' framing of the HLC-lease-only fallback was wrong:
- RunHLCLeaseRenewal (kv/coordinator.go:650) ticks at 1s while the
local node is leader.
- applyHLCLease is memory-only; metaAppliedIndex does not advance.
- For any cluster with a leader running >1s, lease entries trail
every snapshot. Snapshot persists at index X, last data-Apply
index Y < X, metaAppliedIndex stays at Y on restart.
fsmAlreadyAtIndex(X) checks Y >= X -> false -> full restore.
Idle clusters never fire the skip; active clusters have a
meaningful window where it doesn't fire.
Round 4 closes the gap by bumping metaAppliedIndex to
snapshot.Metadata.Index inside persistCreatedSnapshot
(engine.go:2683), BEFORE e.persist.SaveSnap. After a successful
snapshot persist, LastAppliedIndex >= snapshot.Index holds
unconditionally, so the skip fires on the next restart regardless of
how many lease entries trail the snapshot.
Mechanism:
- New AppliedIndexWriter interface (sibling to AppliedIndexReader)
with SetDurableAppliedIndex(idx uint64) error.
- pebbleStore.SetDurableAppliedIndex runs a single-key pebble.Batch
write under dbMu.RLock with raftApplyWriteOpts (honours
ELASTICKV_FSM_SYNC_MODE).
- kvFSM forwards to f.store; engine.persistCreatedSnapshot calls
the writer seam before SaveSnap.
Crash ordering: SetDurableAppliedIndex runs first. The only
observable states are (metaAppliedIndex,snapshot pointer) in
{(Y,X'<X), (X,X'), (X,X), (Z>X,X)}; none of these can yield
'snapshot pointer = X but metaAppliedIndex < X'. Over-restore
impossible.
Cost: one extra pebble Batch.Commit per snapshot persist (~one extra
fsync per SnapshotCount entries; default 10000) versus the rejected
Option A (1 fsync/sec/group continuously).
Implementation Plan B2 row gains the SetDurableAppliedIndex /
persistCreatedSnapshot bump. B2 tests gain a snapshot-persist
round-trip + a SaveSnap-failure crash-simulation. B3 tests gain an
idle-cluster integration test that proves the codex P2 scenario is
closed end-to-end.
Header status bumped to Round 4. Round-3 retraction section appended
with a meta-lesson on quantifying 'rare' against production timer
cadences rather than intuition.
Doc-only PR; no code change.
|
round-4 push: You were right that round 3's framing of the HLC-lease-only fallback as "safe and rare" was wrong. Verified the production cadence:
Round-4 fix: snapshot-persist meta key bump
if w, ok := e.fsm.(AppliedIndexWriter); ok {
if err := w.SetDurableAppliedIndex(snap.Metadata.Index); err != nil {
return errors.WithStack(err)
}
}
if err := e.persist.SaveSnap(snap); err != nil { ... }New Crash ordering proves over-restore impossibleBump runs BEFORE SaveSnap. Observable states at any crash boundary:
No state can yield "snapshot pointer = X but metaAppliedIndex < X". Cost~1 extra pebble Tests
Meta-lessonThe doc body has a Round-3 retraction section calling out the lesson: "rare" should be a quantitative claim against the actual production timer cadence, not an intuition. Your review cited @claude review |
|
Claude finished @bootjp's task in 5m 1s —— View job Round-4 ReviewTodo List:
The round-4 codex P2 fix (snapshot-persist meta key bump) is mechanically sound. The crash-ordering proof is correct. One item carries forward as blocking from round-3 with no change in round-4; the rest are annotation-level issues. 🔴 Critical (carry-forward from round-3, still blocking): Import cycle in §5This was the single blocking item from round-3 and it is unchanged in round-4.
// wal_store.go (proposed) — still creates an import cycle
ceiling, cutover, err := kv.ReadSnapshotHeader(br)Verified on The fix is unchanged from round-3: change // internal/raftengine/statemachine.go (interface)
type SnapshotHeaderApplier interface {
ApplySnapshotHeaderFromFile(snapPath string) error
}
// wal_store.go — never names the kv package
func applyHeaderStateOnSkip(fsm StateMachine, snapPath string) error {
setter, ok := fsm.(SnapshotHeaderApplier)
if !ok { return nil }
return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
}
// kv/fsm.go — kv.ReadSnapshotHeader stays inside the kv package
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string) error {
file, err := os.Open(snapPath)
if err != nil { return errors.WithStack(err) }
defer file.Close()
ceiling, cutover, err := ReadSnapshotHeader(bufio.NewReader(file))
if err != nil { return errors.WithStack(err) }
if f.hlc != nil && ceiling > 0 {
f.hlc.SetPhysicalCeiling(int64(ceiling))
}
f.restoredCutover = cutover
return nil
}✅ Round-4
|
| State | metaAppliedIndex | Snapshot pointer | fsmAlreadyAtIndex check |
Result |
|---|---|---|---|---|
| Before bump | Y | X' < X | Y ≥ X' ? | Correct (may skip or restore) |
| Bump done, SaveSnap pending | X | X' | X ≥ X' → true | Skip fires safely — no over-restore |
| Both done | X | X | X ≥ X → true | Correct optimisation |
| Both + data Apply at Z | Z | X | Z ≥ X → true | Correct |
The bump can never write a value lower than the current metaAppliedIndex: snap.Metadata.Index = e.applied at snapshot creation time, and e.applied ≥ last data Apply index, so SetDurableAppliedIndex always moves the meta key forward or stays equal. ✅
✅ pebbleStore.SetDurableAppliedIndex lock ordering — Sound
dbMu.RLock() (not dbMu.Lock()) is the correct lock for a single-key batch operation that doesn't swap the db pointer. No applyMu is needed: the raft apply loop is serial (engine.go:2292 seam runs on the same goroutine as Apply), so concurrent applyMutationsWithOpts + SetDurableAppliedIndex cannot occur. swapInTempDB holds dbMu.Lock(), which correctly blocks SetDurableAppliedIndex until the swap completes. ✅
🟠 Minor: Idle-cluster test spec needs ELASTICKV_RAFT_SNAPSHOT_COUNT override note
The B3 test spec (line 656) says:
"An idle-cluster test runs a 3-node cluster with no data writes for
2 * SnapshotCount * hlcRenewalIntervalseconds"
With defaultSnapshotEvery = 10_000 (confirmed at engine.go:93) and hlcRenewalInterval = 1 s, that's 20 000 seconds — clearly not a runnable integration test. The spec should explicitly state that the test sets ELASTICKV_RAFT_SNAPSHOT_COUNT=10 (or overrides snapshotEvery via a test-only constructor) to make the scenario tractable. Without this note, the implementer will have no signal that a configuration override is expected.
🟢 Minor: Carry-forward annotation errors (unchanged from round-3)
These were round-3 minor nits and remain in round-4:
- Wrong function name / line: §3 line 227 says "extends
isReservedMetaKey(lsm_store.go:454)". Actual:isPebbleMetaKeyat line 534. applyMutationsWithOptsline: Doc says:1130; actual is:1292.metaLastCommitTSBytesinapplyMutationsWithOpts: Doc says:1162; actual is:1324.deletePrefixAtWithOpts: Doc says:1196(actual:1358) and:1231(actual:1393).
These are cosmetic but will cause confusion when implementers look up the code. Fix before B2 opens.
Summary
| Area | Status |
|---|---|
Round-4 codex P2 fix: AppliedIndexWriter + persistCreatedSnapshot bump |
✅ Mechanically sound |
| Round-4 crash-ordering proof (bump-before-SaveSnap table) | ✅ Correct |
pebbleStore.SetDurableAppliedIndex lock ordering |
✅ Correct |
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) |
🔴 Still present — blocking from round-3 |
Idle-cluster test spec (B3) needs ELASTICKV_RAFT_SNAPSHOT_COUNT override note |
🟠 Minor |
isReservedMetaKey / line number annotations |
🟢 Cosmetic carry-forward |
The one blocking item before any implementation branch opens remains §5's import cycle, which requires SnapshotHeaderApplier.ApplySnapshotHeaderFromFile to replace the direct kv.ReadSnapshotHeader call in wal_store.go.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5eacc77c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| **Mechanism**: bump `metaAppliedIndex` to `snapshot.Metadata.Index` | ||
| when persisting a created snapshot, in `persistCreatedSnapshot` | ||
| (`internal/raftengine/etcd/engine.go:2683`), **before** | ||
| `e.persist.SaveSnap`: |
There was a problem hiding this comment.
This proposed fix only hooks persistCreatedSnapshot, but the HLC-lease-only case being fixed is produced by normal SnapshotCount snapshots, which currently go through maybePersistLocalSnapshot → persistLocalSnapshot → persistLocalSnapshotPayload and call persist.SaveSnap directly in internal/raftengine/etcd/wal_store.go:525, bypassing persistCreatedSnapshot entirely. With that path unchanged, periodic local snapshots can still persist at index X while metaAppliedIndex remains at the last data write Y < X, so the permanent full-restore fallback described above is not closed for the steady-state snapshots this optimization depends on.
Useful? React with 👍 / 👎.
Addresses claude[bot] round-4 blocking finding + codex round-4 P2. claude[bot] BLOCKING (carry-forward from round-3 / round-4): §5 pseudocode `ceiling, cutover, err := kv.ReadSnapshotHeader(br)` from `internal/raftengine/etcd/wal_store.go` would create an import cycle because `kv/snapshot.go` and `kv/fsm.go` both already import `github.com/bootjp/elastickv/internal/raftengine`. Round-5 fix: restructure the SnapshotHeaderApplier seam so the file open AND the kv.ReadSnapshotHeader call live inside `kvFSM.ApplySnapshotHeaderFromFile(snapPath string) error`. The wal_store.go applyHeaderStateOnSkip helper only ever sees the interface and never names the kv package. This matches the existing template used by ApplyIndexAware (which delivers entry.Index to kv without importing kv from the engine). codex round-4 P2 at :455: the round-4 snapshot-persist hook only covers `persistCreatedSnapshot` (engine.go:2683), which fires only for config snapshots (membership changes). The steady-state SnapshotCount-triggered hot path is: maybePersistLocalSnapshot (engine.go:2070) -> e.persistLocalSnapshotPayload (engine.go:4032) -> free persistLocalSnapshotPayload (wal_store.go:519) -> persist.SaveSnap (wal_store.go:525) This path bypasses persistCreatedSnapshot entirely, so the round-4 fix didn't actually close the round-3 P2 gap for the snapshots the optimisation depends on. Round-5 fix: add a second SetDurableAppliedIndex hook in the engine wrapper `e.persistLocalSnapshotPayload` (engine.go:4032), placed under `e.snapshotMu.Lock()` and BEFORE the free-function call. The hook lives in the engine wrapper, not in the free function, so the free function stays signature-stable and reusable from tests that bypass the engine. The B2 row of the Implementation Plan now enumerates both hook sites explicitly. Minor: idle-cluster test spec annotated with the ELASTICKV_RAFT_SNAPSHOT_COUNT=10 override (default 10000 + hlcRenewalInterval=1s would require >= 20 000 seconds, not runnable). Cosmetic: line-number annotations corrected per claude's verification against origin/main HEAD: - isReservedMetaKey -> isPebbleMetaKey at lsm_store.go:534 - applyMutationsWithOpts at lsm_store.go:1292 (was :1130) - applyMutationsWithOpts meta bundle at :1324 (was :1162) - deletePrefixAtWithOpts at lsm_store.go:1358 (was :1196) - deletePrefixAtWithOpts meta bundle at :1393 (was :1231) Header status bumped to Round 5. Round-4 retraction section appended with grep-verified diagnosis of both round-4 issues + meta-lesson on verifying import direction before pseudocoding a cross-package call site. Doc-only PR; no code change.
…samples Round-5 corrected the body-prose line numbers (:1162 -> :1324, :1231 -> :1393) but missed the comments inside the §2 code samples that colocate the new metaAppliedIndex Set with the existing metaLastCommitTSBytes Set. Update those comments too so implementers reading the §2 pseudocode see consistent line numbers.
|
round-5 push: Addresses claude[bot] round-4 BLOCKING + codex round-4 P2 at 🔴 § 5 import cycle (claude carry-forward, BLOCKING)Verified on Round-5 fix: push the file open AND the // wal_store.go — never imports kv
func applyHeaderStateOnSkip(fsm StateMachine, snapPath string) error {
setter, ok := fsm.(SnapshotHeaderApplier)
if !ok { return nil }
return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath))
}
// kv/fsm.go — kv.ReadSnapshotHeader stays inside kv
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string) error { ... }This matches the existing 🟡 codex round-4 P2 at
|
| Symbol | Round-3 / round-4 cited | Actual (round-5) |
|---|---|---|
isReservedMetaKey |
lsm_store.go:454 |
isPebbleMetaKey at :534 |
applyMutationsWithOpts |
:1130 |
:1292 |
| applyMutationsWithOpts meta bundle | :1162 |
:1324 |
deletePrefixAtWithOpts |
:1196 |
:1358 |
| deletePrefixAtWithOpts meta bundle | :1231 |
:1393 |
Round-4 retraction section appended documenting both round-4 issues with grep-verified diagnoses and the meta-lesson: when a §X seam touches a package boundary, verify import direction explicitly before pseudocoding the call site (round-5's fix structurally mirrors the existing ApplyIndexAware template that round-3 / round-4 overlooked).
@claude review
|
Claude finished @bootjp's task in 3m 31s —— View job Round-5 ReviewTodo List:
All round-4 blocking and significant items have been resolved. Verified against ✅ Round-4 Critical 1 — Import cycle: Resolved
Import direction confirmed:
✅ Round-4 Critical 2 — Local snapshot path bypass: ResolvedTwo persist sites are now correctly identified and both get the
The free function ✅ B3 Idle Test Tractability: ResolvedThe test spec now explicitly sets ✅ Cosmetic Line Number Corrections: All Verified
🟢 Concurrent write safety between apply path and snapshot persist pathOne thing not explicitly addressed in the doc but worth confirming:
No change needed — just confirming soundness. 🟢 Minor:
|
| Area | Status |
|---|---|
§5 import cycle — ApplySnapshotHeaderFromFile via SnapshotHeaderApplier |
✅ Resolved |
§6 local snapshot path — e.persistLocalSnapshotPayload hook |
✅ Resolved |
B3 idle test tractability — ELASTICKV_RAFT_SNAPSHOT_COUNT=10 |
✅ Resolved |
| Cosmetic line number annotations | ✅ All corrected |
Concurrent write safety (apply path vs snapshot path on metaAppliedIndex) |
✅ Sound |
persistCreatedSnapshot line annotation (2683 vs 2679) |
🟢 Cosmetic |
wal_store.go:525 in retraction section (actual :524) |
🟢 Cosmetic |
The design is ready for implementation. All blocking and significant issues from rounds 1–4 have been addressed. The two remaining items are single-line annotation imprecisions that can be fixed opportunistically but are not blockers. Branch 2 can proceed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fea230b40a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Round-5: bump metaAppliedIndex BEFORE the free-function | ||
| // persistLocalSnapshotPayload (which calls persist.SaveSnap at | ||
| // wal_store.go:525). Lives in the engine wrapper so the free | ||
| // function stays signature-stable and is reusable from tests | ||
| // that bypass the engine. Skipped silently when the FSM does | ||
| // not implement AppliedIndexWriter (legacy fakes / test shims). | ||
| if w, ok := e.fsm.(AppliedIndexWriter); ok { | ||
| if err := w.SetDurableAppliedIndex(index); err != nil { |
There was a problem hiding this comment.
Move the applied-index bump into the snapshot payload
In the current local-snapshot workflow, persistLocalSnapshot writes the FSM snapshot file/token before it calls this wrapper, so this hook updates only the live Pebble DB and not the snapshot body that was just serialized. If the next cold start takes any conservative full-restore path (missing/corrupt meta key, read error, or restoring this snapshot elsewhere), that restore reinstalls the stale metaAppliedIndex=Y from the snapshot file while opening the WAL at snapshot index X, where the lease-only entries up to X are already compacted; the node then falls back on every restart again. The bump needs to be part of the state captured by the FSM snapshot (or the restore path must synthesize it from snapshot.Metadata.Index), not just written after the snapshot file has already been produced.
Useful? React with 👍 / 👎.
claude[bot] round-5 verdict flagged two 1-line annotation imprecisions
as 'cosmetic, not blockers':
- persistCreatedSnapshot at engine.go:2683 -> actual function
definition at :2679; :2683 is the SaveSnap call inside it
- persist.SaveSnap at wal_store.go:525 -> actual call site at :524
Both are off-by-one references to internal implementation lines that
the implementer will look up anyway. Fixing them to spare the
implementer the verification step. No structural change.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18043f7c4c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !ok { | ||
| return nil // FSM has no header state; skip is harmless. | ||
| } | ||
| return errors.WithStack(setter.ApplySnapshotHeaderFromFile(snapPath)) |
There was a problem hiding this comment.
Verify snapshot CRC before applying skipped header
When fsmAlreadyAtIndex is true, this skip path reads and applies HLC/cutover header state from snapPath without checking the token CRC. I checked the current openAndRestoreFSMSnapshot path in internal/raftengine/etcd/fsm_snapshot_file.go, and it explicitly compares the file footer to tok.CRC32C before calling fsm.Restore to avoid mutating FSM state from corrupt snapshot bytes. With the proposed ApplySnapshotHeaderFromFile(snapPath) shape, a corrupted/truncated .fsm file or mismatched token can silently install a bogus ceiling/cutover (or treat a corrupt short header as headerless) precisely in the skip case; thread tok.CRC32C through and verify the snapshot file before applying any header side effects.
Useful? React with 👍 / 👎.
Codex round-5 P1: the round-5 ApplySnapshotHeaderFromFile signature
took no tokenCRC and skipped all three of openAndRestoreFSMSnapshot's
fail-closed verifications:
1. fsmMinFileSize -> ErrFSMSnapshotTooSmall
2. footer-vs-tokenCRC -> ErrFSMSnapshotTokenCRC
3. full-body CRC vs footer -> ErrFSMSnapshotFileCRC
A corrupt / wrong-token / truncated snapshot could silently install
bogus HLC ceiling / Stage 8a cutover -- or be parsed by
ReadSnapshotHeader as headerless legacy and silently apply
ceiling=0, cutover=0.
Round-6 fix:
- applyHeaderStateOnSkip(fsm, snapPath, tokenCRC) -- threads
tok.CRC32C through from restoreSnapshotState.
- SnapshotHeaderApplier.ApplySnapshotHeaderFromFile(snapPath,
tokenCRC) -- new signature.
- kvFSM.ApplySnapshotHeaderFromFile mirrors
openAndRestoreFSMSnapshot's three-step verification:
1. fsmMinFileSize check
2. readFSMFooter + compare to tokenCRC
3. crc32 TeeReader over the whole payload; parse the header
inside the tee'd reader so the header bytes are included
in the computed CRC; drain remaining payload; compare to
footer
Side-effects (SetPhysicalCeiling + restoredCutover assignment)
only run after all three checks pass.
- §4 caller signature updated; §5 pseudocode rewritten; B3 test
list extended with three CRC-corruption tests (one per failure
mode); Implementation Plan B3 row revised; Round-5 retraction
section appended documenting the cost note (full-file read on
skip path still cheaper than restore which also reads + writes
pebble) and the meta-lesson (mirror existing fail-closed
checks when relocating side-effects).
- Open Questions gains an entry for a future follow-up that
persists HLC ceiling + cutover as durable meta keys, eliding
the file read entirely. Out of scope here.
Doc-only PR; no code change.
|
round-6 push: You're right: I pre-declared the loop done at round-5 / round-5.1 but missed that the new Round-6 fixThread // wal_store.go
return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
// statemachine.go
type SnapshotHeaderApplier interface {
ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error
}
// kv/fsm.go — mirrors openAndRestoreFSMSnapshot's three-step verification
func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error {
// 1. size check -> ErrFSMSnapshotTooSmall
// 2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC
// 3. crc32 TeeReader over the WHOLE payload (header parsed inside the
// tee'd reader so its bytes are included) vs footer
// -> ErrFSMSnapshotFileCRC
// ... only after all three pass: SetPhysicalCeiling + restoredCutover
}Cost noteStep 3 reads the whole snapshot file on the skip path (~6 s for a 6 GiB FSM at 1 GiB/s SSD read), but is still strictly cheaper than the restore path it replaces — the same full-file read happens inside B3 tests addedThree CRC-corruption tests, one per failure mode (truncate / wrong-token / one-byte body flip) asserting the corresponding typed error AND that Open Questions updateAdded a future follow-up: persist HLC ceiling + cutover as durable meta keys (analogous to Round-5 retraction appendedRound-5 was the right structural fix for the import cycle but lost sight of the CRC verification it was implicitly inheriting from @claude review |
|
Claude finished @bootjp's task in 7m 11s —— View job Round-6 ReviewTodo List:
The round-6 CRC verification fix is mechanically correct and closes the final blocking issue from round-5. The retraction is well-documented. One implementation-planning gap in the §5 pseudocode needs resolution before Branch 3 opens; everything else is sound. ✅ Round-6 core — CRC verification: CorrectVerified against The existing
The proposed
The order is correct. The seek-back after reading the footer mirrors Cost: Step 3 reads the full payload once. The doc's claim that this is strictly cheaper than the restore path (which also reads the payload AND writes a temp Pebble database) is correct — the restore-path's wall-clock is dominated by 🟠 Implementation planning:
|
| Symbol | Actual location | Exported? |
|---|---|---|
statFSMFileError |
etcd/fsm_snapshot_file.go:438 |
❌ no |
fsmMinFileSize |
etcd/fsm_snapshot_file.go:31 |
❌ no (= 4) |
readFSMFooter |
etcd/fsm_snapshot_file.go:311 |
❌ no |
fsmFooterSize |
etcd/fsm_snapshot_file.go:27 |
❌ no (= 4) |
fsmRestoreReadAhead |
etcd/fsm_snapshot_file.go:37 |
❌ no (= 1 MiB) |
crc32cTable |
etcd/fsm_snapshot_file.go:51 |
❌ no |
ErrFSMSnapshotTooSmall |
etcd/fsm_snapshot_file.go:68 |
✅ yes |
ErrFSMSnapshotTokenCRC |
etcd/fsm_snapshot_file.go:61 |
✅ yes |
ErrFSMSnapshotFileCRC |
etcd/fsm_snapshot_file.go:57 |
✅ yes |
Import cycle check: internal/raftengine/etcd does NOT import kv (confirmed: engine.go imports only internal/raftengine, internal/encryption, internal/monoclock; wal_store.go imports no bootjp packages beyond cockroachdb/errors). Therefore kv can safely import internal/raftengine/etcd — no cycle. The exported error sentinels (ErrFSMSnapshotTooSmall etc.) are then accessible from kv/fsm.go.
For the unexported constants and helpers, Branch 3 has three options:
- Export the needed symbols from
etcd(FsmFooterSize = 4, etc.) — minor naming churn. - Move them to
internal/raftengine(parent package) — cleanest architecture since bothkvandetcdalready import it. - Reimplement locally in
kv/fsm.go—const kFsmFooterSize = 4,var kCRC32CTable = crc32.MakeTable(crc32.Castagnoli), inlinebinary.Readfor footer reading. Trivial duplication of ~10 lines.
The design doc should specify the choice. Option 3 is the least risky (no exported symbol churn, no parent-package change) and is straightforward given the values are all trivial constants or one-liner functions. B3's test assertion on ErrFSMSnapshotFileCRC etc. can use the exported sentinels via the kv → etcd import.
✅ Codex P2 (snapshot body) — SetDurableAppliedIndex is write-to-live-DB: sound under the design's goals
The codex comment at 2026-06-03T04:24:12Z (line 614) raises: SetDurableAppliedIndex writes to the live Pebble DB (not the .fsm file), so a conservative full restore that calls swapInTempDB would overwrite the live DB with the .fsm file content, losing the bumped metaAppliedIndex = X and potentially re-degrading the skip gate.
Verified on current source:
SetDurableAppliedIndex(X)runs beforepersist.SaveSnap(snap)→ live DB getsX.- Live DB persists across normal restarts (no restore). On normal restarts,
LastAppliedIndex()readsX→ skip fires correctly. - After a conservative restore (
swapInTempDB), live DB is replaced by .fsm file content →metaAppliedIndexregresses toY(last data Apply, pre-bump). - The NEXT snapshot taken after restore calls
SetDurableAppliedIndex(X+K)→ live DB getsX+K→ subsequent normal restarts skip correctly. - Edge case: crash between restore and next snapshot, with only lease entries in
(X, X+K]→ live DB hasY < X+K→ full restore on next cold start.
This is a one-restart-per-conservative-restore degradation (same as pre-optimization behavior) in a narrow scenario (post-restore crash before next snapshot). Not a correctness issue. Not a regression versus the pre-optimization baseline (which always does full restore). The design is sound for its stated goal; the future optimization (durable meta keys for ceiling/cutover) would close this residual gap entirely. The doc's current treatment of this as "out-of-scope future work" is appropriate.
✅ All other round-5/6 verified items remain sound
| Item | Status |
|---|---|
v2 header parity via ReadSnapshotHeader |
✅ |
file.Seek(0, io.SeekStart) before TeeReader pass |
✅ correct, matches openAndRestoreFSMSnapshot:294 |
| Headerless legacy (ceil=0, cutover=0) no-op path | ✅ ceiling > 0 guard prevents HLC mutation |
ErrSnapshotHeaderUnknownMagic fail-closed |
✅ propagated before any side-effect |
restoredCutover set alongside HLC ceiling |
✅ mirrors kvFSM.Restore:523 exactly |
| Both snapshot persist hooks (Sites 1 + 2) | ✅ both enumerated in B2 row |
B3 idle-cluster test with ELASTICKV_RAFT_SNAPSHOT_COUNT=10 |
✅ tractable |
fsmAlreadyAtIndex bool-only return (conservative fallback) |
✅ |
| B3 three CRC-corruption test cases | ✅ cover all three typed errors |
| Open Questions: future durable meta key optimization | ✅ well-scoped as non-blocking |
isPebbleMetaKey name + line number corrections |
✅ (carried from R3) |
Summary
| Area | Status |
|---|---|
| Round-6 CRC verification (three-step) | ✅ Correct — mirrors openAndRestoreFSMSnapshot exactly |
| TeeReader + seek pattern for full-body CRC pass | ✅ Sound |
Unexported etcd symbols in §5 pseudocode |
🟠 Needs implementation note — trivially fixable (local constants or export) |
Codex P2 (SetDurableAppliedIndex on live DB vs snapshot body) |
✅ Sound for design goals; residual edge case is pre-optimization baseline |
Import cycle safety (kv → internal/raftengine/etcd) |
✅ Cycle-free — etcd does not import kv |
| B3 test spec (three CRC failure modes + idle cluster) | ✅ Adequate |
The one item needed before Branch 3 opens: add a note in §5 identifying the unexported symbols and specifying the resolution (local re-implementation or export). Everything else is ready for implementation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5a1bd270e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := setPebbleUint64InBatch(b, metaAppliedIndexBytes, idx); err != nil { | ||
| return errors.WithStack(err) | ||
| } | ||
| return errors.WithStack(b.Commit(s.raftApplyWriteOpts())) |
There was a problem hiding this comment.
Force the snapshot checkpoint write durable
When ELASTICKV_FSM_SYNC_MODE=nosync, raftApplyWriteOpts() is pebble.NoSync, so this checkpoint batch can return before the Pebble WAL record is fsynced. The immediately following SaveSnap can persist the raft snapshot pointer at X; a crash before Pebble flush leaves the pointer durable while metaAppliedIndex rolls back to Y < X, and because WAL replay starts at X, the compacted HLC leases cannot rebuild the key, so the steady-state skip can fall back forever again. The checkpoint that orders before SaveSnap needs pebble.Sync or must be captured in the snapshot payload, independent of the raft-apply sync mode.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| // kv/fsm.go (new method on kvFSM) -- kv.ReadSnapshotHeader stays inside kv | ||
| func (f *kvFSM) ApplySnapshotHeaderFromFile(snapPath string, tokenCRC uint32) error { |
There was a problem hiding this comment.
Keep CRC verification in an importable seam
This method is proposed in kv/fsm.go to avoid importing kv from wal_store.go, but the verifier symbols it uses below (fsmMinFileSize, readFSMFooter, ErrFSMSnapshotTokenCRC, crc32cTable, fsmRestoreReadAhead, statFSMFileError) are currently unexported in internal/raftengine/etcd/fsm_snapshot_file.go. Since kv already imports internal/raftengine, it cannot import the etcd subpackage without a cycle and cannot access those helpers as written; implementing this design either duplicates the snapshot verifier in kv or loses the round-6 CRC parity. The verification needs to live behind an exported/shared engine helper or a callback-based seam that kv can actually use.
Useful? React with 👍 / 👎.
Address codex round-6 P2 x 2:
P2 line 410 -- seam implementability: round 6 placed CRC verification
in kvFSM.ApplySnapshotHeaderFromFile using fsmMinFileSize,
readFSMFooter, crc32cTable, fsmRestoreReadAhead, statFSMFileError,
ErrFSMSnapshotTokenCRC, ErrFSMSnapshotFileCRC -- all unexported in
internal/raftengine/etcd/fsm_snapshot_file.go. Production code has
neither direction of import between kv and internal/raftengine/etcd
on origin/main (both are test-only). Round-6 pseudocode would not
compile.
Round-7 splits the SnapshotHeaderApplier interface into a parse phase
and an apply phase:
type SnapshotHeaderApplier interface {
ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error)
ApplySnapshotHeader(ceiling, cutover uint64)
}
wal_store.go.applyHeaderStateOnSkip orchestrates size + footer-vs-
tokenCRC + crc32 TeeReader using its own (unexported) helpers, hands
the tee'd reader to setter.ParseSnapshotHeader for the v1/v2 header
read + drain (kv.ReadSnapshotHeader still lives entirely inside kv),
verifies the computed CRC against the footer, and only then calls
setter.ApplySnapshotHeader. Side-effects are pure assignment after
the engine confirms all three checks passed.
No package imports change. kv keeps the header parser; etcd keeps
the CRC verifier. The two-phase split costs nothing in the skip path
because we are not trying to do a single-pass restore here.
P2 line 660 -- nosync durability: round 4 / 5 / 6 all wrote the
SetDurableAppliedIndex checkpoint with s.raftApplyWriteOpts(), which
returns pebble.NoSync under ELASTICKV_FSM_SYNC_MODE=nosync.
Crash sequence:
1. SetDurableAppliedIndex(X) returns (Pebble WAL buffered).
2. e.persist.SaveSnap(snap) returns durably (etcd snapshotter fsyncs).
3. Crash before Pebble flush.
Post-restart: metaAppliedIndex=Y<X (last fsynced data Apply), snapshot
pointer=X, WAL compaction starts at X. The lost applies cannot
rebuild the meta key; fsmAlreadyAtIndex(X)=false forever. The codex
round-3 P2 scenario recurs permanently.
Round-7 pins pebbleStore.SetDurableAppliedIndex to pebble.Sync
unconditionally. The checkpoint is the only durable carrier of
metaAppliedIndex at this point; it must be at least as durable as
the snapshot pointer that immediately follows via SaveSnap. Cost: +1
extra fsync per snapshot persist (default SnapshotCount=10000;
negligible).
Also:
- §6 'ELASTICKV_FSM_SYNC_MODE=nosync mode' subsection split into
(a) per-entry data Apply (existing reasoning) and (b) snapshot
checkpoint (new pebble.Sync rationale).
- B2 Implementation Plan row updated: 'pebble.Sync unconditionally'.
- B3 Implementation Plan row updated to reflect the two-phase
SnapshotHeaderApplier shape.
- Round-6 retraction section appended documenting both bugs +
meta-lessons.
Status header bumped to Round 7. Doc-only PR; no code change.
|
round-7 push: 🟡 P2 line 410 — seam implementability (Resolved)You're right: Round-7 fix: split type SnapshotHeaderApplier interface {
ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error)
ApplySnapshotHeader(ceiling, cutover uint64)
}
No package imports change. 🟡 P2 line 660 — nosync durability (Resolved)Round 4 / 5 / 6 all wrote
Post-restart: Round-7 fix: pin §6 nosync subsection now splits into (a) per-entry data Apply (existing raft-WAL-as-source-of-truth reasoning) and (b) snapshot checkpoint (new pebble.Sync rationale). Lessons (Round-6 retraction section appended)
@claude review |
|
Claude finished @bootjp's task in 5m 11s —— View job Round-7 ReviewTodo List:
Both round-6 P2 fixes are correct. No new blocking issues. One minor documentation carry-forward needs fixing before B3 opens. ✅ P2 line 410 — two-phase
|
| Step | openAndRestoreFSMSnapshot |
applyHeaderStateOnSkip (round-7) |
|---|---|---|
| Size check | info.Size() < fsmMinFileSize |
Same |
| Footer vs token | readFSMFooter + footer != tokenCRC |
Same |
| Full-body CRC | restoreAndComputeCRC TeeReader over payload |
TeeReader via ParseSnapshotHeader + drain |
| Header side-effects | fsm.Restore → ReadSnapshotHeader + f.hlc.SetPhysicalCeiling + f.restoredCutover = |
ParseSnapshotHeader + ApplySnapshotHeader |
| State mutation on any failure | ❌ (error aborts before fsm.Restore writes anything durable) |
❌ (ApplySnapshotHeader only after all checks pass) |
Shape is structurally identical to the existing ApplyIndexAware seam precedent: engine orchestrates, FSM implements the kv-package half. ✅
✅ P2 line 660 — pebble.Sync unconditionally for SetDurableAppliedIndex: Sound
The durability analysis in §6(b) is correct and complete. Verified:
raftApplyWriteOpts()(lsm_store.go:1257) returnss.fsmApplyWriteOpts, which ispebble.NoSyncunderELASTICKV_FSM_SYNC_MODE=nosync. ✅ (round-7 correctly avoids this helper)directApplyWriteOpts()(lsm_store.go:1251) returnspebble.Syncalways, with the same rationale documented as forSetDurableAppliedIndex: "direct callers do not have raft-log replay as a durability backstop." The snapshot checkpoint is in the same category. ✅persistLocalSnapshotPayload(wal_store.go:519-524) is a free function; the engine wrappere.persistLocalSnapshotPayloadholdse.snapshotMubefore callingSetDurableAppliedIndex.SetDurableAppliedIndexusess.dbMu.RLock()only — nosnapshotMuinside, no lock ordering conflict. ✅
Crash-ordering proof confirmed valid under nosync with pebble.Sync on checkpoint:
With SetDurableAppliedIndex → pebble.Sync, the bump is fsynced before SaveSnap returns. The previously identified crash scenario (codex round-6 P2 line 660) is closed: there is no longer a state where snapshot pointer = X is durable while metaAppliedIndex < X. The permanent-fallback regression is impossible. ✅
Nice side-effect under nosync mode: A pebble.Sync commit flushes the shared Pebble WAL, making any preceding pebble.NoSync data Apply batches durable as a side-effect. The snapshot checkpoint therefore also promotes any un-fsynced raft-apply entries into the fsync boundary — a conservative bonus, not relied upon by correctness.
🟢 Minor: Stale ApplySnapshotHeaderFromFile reference in B3 test spec
The B3 test row (around line 892) still says:
"assert the specific typed error surfaces from
ApplySnapshotHeaderFromFileWITHOUT mutatingf.hlcorf.restoredCutover"
ApplySnapshotHeaderFromFile was the round-6 single-method name. Round-7 replaced it with ParseSnapshotHeader + ApplySnapshotHeader. The test scenario is still correct — inject corruption, verify typed error from applyHeaderStateOnSkip, verify FSM state unchanged — but the test surface function name should be updated to reference applyHeaderStateOnSkip (or the full skip path through restoreSnapshotState) rather than the no-longer-existing ApplySnapshotHeaderFromFile. Purely cosmetic; won't block B3 opening.
✅ All prior blocking/significant items confirmed resolved
| Area | Status |
|---|---|
§5 import cycle (kv.ReadSnapshotHeader from wal_store.go) |
✅ Resolved — two-phase split keeps both sides in their home package |
| §5 v2 header parity (v1-only probe was round-5 bug) | ✅ ReadSnapshotHeader handles v1/v2/headerless/unknown-magic |
§5 restoredCutover set alongside HLC ceiling on skip |
✅ ApplySnapshotHeader(ceiling, cutover) sets both |
§5 seam implementability (unexported etcd symbols) |
✅ CRC verifier stays in etcd; parser stays in kv; no new exports needed |
§6 SetDurableAppliedIndex under nosync (round-6 P2 line 660) |
✅ pebble.Sync unconditionally |
| §6 both snapshot persist sites (round-5 fix) | ✅ Site 1 persistCreatedSnapshot + Site 2 e.persistLocalSnapshotPayload |
fsmAlreadyAtIndex conservative bool-only return |
✅ Any error/missing/no-interface → false → full restore |
| OCC two-case idempotency argument (§7) | ✅ Raw overwrite + OCC ErrWriteConflict non-HaltApply |
B3 idle-cluster test (ELASTICKV_RAFT_SNAPSHOT_COUNT=10) |
✅ Tractable (20 s window) |
| B3 three CRC-corruption test cases | ✅ One per typed error, state unchanged assertion |
isPebbleMetaKey / line number annotations |
✅ All corrected in round-5 |
Summary
The design is ready for implementation. All blocking issues from rounds 1–6 have been addressed. The one remaining item (B3 test spec method name) is cosmetic and can be fixed opportunistically. Branch 2 can proceed.
claude[bot] round-7 verdict flagged one cosmetic carry-forward: the B3 test row at line 893 still cites the round-6 single-method ApplySnapshotHeaderFromFile. Round-7 split that into ParseSnapshotHeader + ApplySnapshotHeader; the test exercise surface from the implementer's viewpoint is applyHeaderStateOnSkip (or restoreSnapshotState with fsmAlreadyAtIndex returning true). Rename to reflect the round-7 shape; the test scenarios (inject corruption, assert typed error, assert FSM state unchanged) are unchanged.
…oth snapshot persist sites (#915) ## Summary Implements **Branch 2** of the cold-start snapshot-restore skip optimisation designed in PR #910. After this lands the `metaAppliedIndex` Pebble meta key is durably written on every raft-Apply data mutation AND at every snapshot persist — but the skip gate itself (Branch 3) is NOT yet wired, so behaviour is observationally identical to `main` except for the new meta key in fsm.db. Branch 2 is meant to soak in production for at least one release before Branch 3 enables the skip; this PR is intentionally a no-op-from-the-outside change with comprehensive plumbing. ## Reading order (6 commits, designed to review one-at-a-time) | # | commit | scope | |---|---|---| | 1 | `2339a6f2` | raftengine: opt-in interfaces (`AppliedIndexReader` / `AppliedIndexWriter`) | | 2 | `525fc152` | pebbleStore: `metaAppliedIndex` const + `LastAppliedIndex` + `SetDurableAppliedIndex` (with `pebble.Sync` UNCONDITIONALLY) | | 3 | `aa9b8acc` | `MVCCStore` interface extension: `ApplyMutationsRaftAt` / `DeletePrefixAtRaftAt` overloads, threading appliedIndex through `applyMutationsWithOpts` + `deletePrefixAtWithOpts` | | 4 | `7cd72bda` | kvFSM seam wiring: `AppliedIndexReader()` / `SetDurableAppliedIndex()` accessors + all 7 data-Apply leaves switched to `*RaftAt` with `f.pendingApplyIdx` | | 5 | `f1e8748c` | engine hooks at BOTH snapshot persist sites: `persistCreatedSnapshot` + `e.persistLocalSnapshotPayload` call `SetDurableAppliedIndex` BEFORE `persist.SaveSnap` | | 6 | `2c42f7d6` | tests (10 new tests across store + engine) | ## Design constraints honoured All from `docs/design/2026_06_02_idempotent_snapshot_restore.md`: - **§2 "Why both leaves"**: meta key bundle in BOTH `applyMutationsWithOpts` AND `deletePrefixAtWithOpts` so DEL_PREFIX entries don't silently leave `LastAppliedIndex` behind. Tested by `TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex`. - **§3 `dbMu.RLock()`**: both `LastAppliedIndex` and `SetDurableAppliedIndex` acquire the read-lock, matching the lock-ordering discipline at `lsm_store.go:153 / :553 / :675`. - **§4 fallback policy**: `AppliedIndexReader()` returns nil when the store doesn't implement the seam; `LastAppliedIndex` returns `(0, false, nil)` for missing OR truncated meta key. Branch 3 will then fall back to full restore conservatively. - **§6 `ELASTICKV_FSM_SYNC_MODE=nosync` mode**: `SetDurableAppliedIndex` is **pinned to `pebble.Sync` unconditionally**. Rationale documented at length in the method's doc-comment — once `persist.SaveSnap` returns, WAL compaction discards every log entry ≤ `snap.Metadata.Index`, so there's no source to replay the meta key bump from. +1 fsync per snapshot persist (rare; default `SnapshotCount=10000`). Tested by `TestSetDurableAppliedIndex_UsesPebbleSync`. - **§6 "HLC lease entries — checkpoint at snapshot persist"**: BOTH `persistCreatedSnapshot` (config snapshots) AND `e.persistLocalSnapshotPayload` (steady-state `SnapshotCount`-triggered hot path) call the hook. Both crash-ordering tested by `TestPersistCreatedSnapshot_*`. - **§8 compatibility**: `StateMachine.Apply`'s public signature is unchanged. New interfaces are opt-in. Old call sites (`ApplyMutationsRaft` without `*At`) still work, just pass `appliedIndex=0` to opt out of the meta key bump. ## Test results ``` go vet ./... → 0 issues go test ./store/ -short → ok 29.4s go test ./kv/ -short → ok 10.4s go test ./internal/raftengine/... -short → ok 32.8s go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s ``` 10 new tests added (see commit `2c42f7d6` for the full inventory). ## What this does NOT do - **Does NOT enable the skip gate.** `restoreSnapshotState` still always restores. Branch 3 wires the `fsmAlreadyAtIndex` check + `applyHeaderStateOnSkip` + the two-phase `SnapshotHeaderApplier` seam. - **Does NOT change `HEALTH_TIMEOUT_SECONDS=300`.** Branch 4 lowers it once Branch 3 has soaked. - **Does NOT touch the snapshot-install hot path** (`Engine.applySnapshot`) per Non-Goals in the design. ## Soak plan Branch 2 should run in production for at least one release before Branch 3 opens. Operators can verify the meta key is being written via: ```bash # Inspect a pebble fsm.db (read-only) ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex # Expected: 8 little-endian bytes equal to the current applied index ``` ## Refs - PR #910 (design) — round 1..7 design history + retraction sections explaining the design constraints this PR honours - PR #909 — `HEALTH_TIMEOUT_SECONDS` band-aid that this series eventually obviates <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Durable tracking of Raft-applied indexes to ensure consistent snapshot/save ordering. * **Bug Fixes** * Improved snapshot persistence reliability by pinning durable applied index before snapshot writes. * Stronger durability for writes bundled with Raft entry indices, reducing restore/recovery surprises. * **Tests** * Added comprehensive tests covering applied-index ordering, failure handling, and persistence behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
PR #909 (
HEALTH_TIMEOUT_SECONDS60s → 300s) の 根本原因側 の設計提案。コードは無変更、ドキュメント追加のみ。何を解決したいか
internal/raftengine/etcd/wal_store.go:117 loadWalStateが、WAL の persisted snapshot pointer が空でない限りコールド起動の度にrestoreSnapshotState(fsm, snapshot, fsmSnapDir)を呼びます。EKV/token-format snapshot の経路では最終的にpebbleStore.Restore→restorePebbleNativeAtomic→swapInTempDB(store/lsm_store.go:1816, :2002) が走り、fsm.db の全内容を sibling tempdir に書き戻してから rename で差し替える = 起動コストが O(snapshot size)。
5 ノード 192.168.0.x クラスタで実測 ~46 s (Found 2 WALs
012720/012734のオープン → Found 0 WALs の tempdir 作成 → Found 4 WALs001309..001315の rename 後再オープン)。これが PR #909 で観測された "gRPC port did not come up" の主因です。
問題は この復元の大半が冗長 だという点:
fsm.Applyが変更内容を Pebble の WriteBatch で durable に書き込み済みapplied = Y > snapshot.Metadata.Index = Xまで進んだ後の fsm.db は state ≥ X を保持済み提案アーキテクチャ
(1)
StateMachine.Apply(index, data)化Engine の唯一の呼び出しサイト (
engine.go:1769 applyNormalEntry) でentry.Indexを渡すだけ。in-tree 実装は機械的変換、外部実装は無し。
(2)
pebbleStoreの meta key に applied-index を atomic 同梱kvFSM.Apply(index, data)が leaf MVCC mutation で取得するpebble.Batchにkey=metaAppliedIndex, value=BE64(index)を追加。Pebble の batch は atomic なので torn-write window なし。
(3)
restoreSnapshotState条件分岐AppliedIndexReaderinterface 未実装の FSM、または meta key 不在 (アップグレード直後の最初の再起動) は 現状の full restore にフォールバック → strictly additive。期待される効果
pebble.Open+ WAL replay + raft follower 化 + gRPC bind の <5 秒 (FSM サイズに依存しない)HEALTH_TIMEOUT_SECONDSを引き下げ可能 (Branch 5 で実施)実装シーケンス (別 PR 4 本に分割)
StateMachine.Apply(index, data)interface 変更 + 全実装更新 (index は受け取って捨てるだけ)kvFSMで index を leaf mutation に流し、pebbleStoreで meta key を WriteBatch 同梱 +LastAppliedIndex()公開 + 単体テストrestoreSnapshotStateにfsmAlreadyAtIndexガート + metrics + INFO logHEALTH_TIMEOUT_SECONDSを tighter な値に引き下げ各 branch は対応する invariant を unit test でガード。
Crash-safety / Compatibility
詳細は
docs/design/2026_06_02_idempotent_snapshot_restore.mdの §3 §5 参照。要約:
pebble.Batchで commit されるため、クラッシュ後の Pebble WAL replay は両者を一緒に restore するか両者を一緒に discard する。中間状態はあり得ない。観測対象
restoreSnapshotState skipped (FSM already at index N >= snapshot index X)レビュー観点
kvFSM.Applyが現状 idempotent と読めているか)設計が固まり次第 Branch 2 (interface change) から着手します。
Summary by CodeRabbit