Skip to content

fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s#909

Merged
bootjp merged 2 commits into
mainfrom
fix/rolling-update-grpc-timeout-default
Jun 3, 2026
Merged

fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s#909
bootjp merged 2 commits into
mainfrom
fix/rolling-update-grpc-timeout-default

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 2, 2026

Summary

scripts/rolling-update.shHEALTH_TIMEOUT_SECONDS デフォルトを 60s → 300s に引き上げます。

なぜ 60s では足りないか

2026-06-02 の 5 ノードクラスタ (192.168.0.x) ローリング再起動で、.212 ノードが起動中に wait_for_grpc がタイムアウトし、スクリプトが gRPC port did not come up on 192.168.0.212:50051 で exit。docker logs --tail 200 を辿ると以下の wall-clock 内訳:

12:56:05  pebble.Open("/var/lib/elastickv/n3/fsm.db") — 既存 DB を開いて WAL 012720/012734 を replay
12:56:06  pebble.Open(<sibling tempdir>) — restorePebbleNativeAtomic が新規 Pebble を作成 (Found 0 WALs)
   ↓ +45s  restoreBatchLoopInto が FSM スナップショットを batch-loop で書き戻す
12:56:51  pebble.Open(fsm.db) — swapInTempDB が old dir を RemoveAll + temp を Rename → 再オープン (Found 4 WALs: 001309..001315)
   ↓       WAL replay (commit=57365699 / applied=57361461 → 4238 entries)
   ↓       raft follower-ization
   ↓       grpc.Serve bind

つまり多くの場合 restorePebbleNativeAtomic + swapInTempDB だけで ~46 秒 消費しており、その後にまだ WAL replay + raft follower 化 + gRPC bind が残るため、60s ceiling では確実にタイムアウトします。

タイムアウトはさらに二次被害を生みます: 再起動コンテナが応答不能のまま docker restart policy で再投入される → 残りクォーラムから見て当該ノードが選挙に応答しない → raft term が増え続ける(我々のクラスタでは term 665 まで上昇)。

変更内容

  • HEALTH_TIMEOUT_SECONDS のデフォルトを 60300
  • 上に コールドスタートの内訳と why 300s を明記したコメントブロック (将来オペレーターが値を縮めたくなった時の根拠)

動作確認

  • bash -n scripts/rolling-update.sh → syntax OK
  • HEALTH_TIMEOUT_SECONDS=120 bash scripts/rolling-update.sh ... のようなオペレーター側オーバーライドは引き続き有効

設計フォローアップ (別 PR)

これは 症状の応急処置 で、根本原因(restoreSnapshotState がコールドスタート時に毎回 FSM 全復元する)は別 PR で扱います:

  • pebbleStoreLastAppliedIndex() を生やし、毎 Apply の WriteBatch にメタキーとしてアトミックに同梱
  • restoreSnapshotStateLastAppliedIndex() >= snapshot.Metadata.Index ならスキップ
  • 設計ドキュメント: docs/design/2026_06_02_idempotent_snapshot_restore.md (Coming up next)

これが入れば 46s の payload は完全に消え、HEALTH_TIMEOUT_SECONDS は将来引き下げ可能になります。

The wait_for_grpc budget must cover the full cold-start path:
  pebble.Open(fsm.db)
   -> loadWalState
     -> restoreSnapshotState
       -> pebbleStore.Restore
         -> restorePebbleNativeAtomic (writes a sibling temp DB)
         -> swapInTempDB (RemoveAll old dir + Rename temp into place + reopen)
   -> WAL entry replay (commit - applied entries)
   -> raft follower-ization
   -> gRPC.Serve bind

For multi-GiB FSMs the restorePebbleNativeAtomic + swapInTempDB pair
dominates wall-clock cost (observed ~46s on a 5M-key node, scales
linearly with FSM size and disk throughput).

A 60s ceiling caused the restarted container to be declared unhealthy
mid-restore, which the rolling-update.sh exit path surfaces as
'gRPC port did not come up on $NODE_HOST:$RAFT_PORT'. The repeated
timeout-induced restart loop also drives raft term inflation on the
remaining quorum because the restarting node is unresponsive across
elections.

300s comfortably absorbs the FSM restore for multi-GiB working sets
while still failing fast on genuine crash-loop bugs (those never reach
step 3). Operators with smaller FSMs can shrink it via env override.

Follow-up (separate PR / design doc): make restoreSnapshotState
idempotent so cold-start restore can be skipped when the on-disk FSM
is already at >= snapshot.Metadata.Index, eliminating the 46s pay-on-
restart cost entirely.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 21 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 18f01b92-4dbe-45a6-b19f-be03b6c80422

📥 Commits

Reviewing files that changed from the base of the PR and between 94579fc and fd3ff1a.

📒 Files selected for processing (1)
  • scripts/rolling-update.sh
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rolling-update-grpc-timeout-default

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request increases the default HEALTH_TIMEOUT_SECONDS from 60 to 300 seconds in scripts/rolling-update.sh to accommodate longer cold-start times for nodes with large FSMs, adding detailed documentation for this decision. The reviewer pointed out that this longer timeout could cause the script to block for up to 5 minutes if a container crashes immediately, and suggested updating wait_for_grpc to check if the container is still running and exit early in such cases.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread scripts/rolling-update.sh
# for the multi-GiB working sets we operate against today and still
# fails fast on genuine crash-loop bugs (which never finish (1) or (2)).
# Operators with smaller FSMs can shrink HEALTH_TIMEOUT_SECONDS via env.
HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Increasing the default HEALTH_TIMEOUT_SECONDS to 300 seconds (5 minutes) introduces a potential operational bottleneck when a container fails to start or crashes immediately. Currently, wait_for_grpc (line 672) only checks grpc_healthy and sleeps in a loop. If the container exits or enters a crash loop, the script will block and wait for the full 5 minutes before failing. To prevent this, consider updating wait_for_grpc to check if the container is still running (e.g., via docker inspect) and exit early if it has stopped.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

PR は scripts/rolling-update.sh の HEALTH_TIMEOUT_SECONDS デフォルトを 60s → 300s に引き上げる 1 行差分 + 出所コメントです。

レビュー観点:

  • 300s が FSM サイズ多 GiB 帯のコールド起動 (pebble.Open + restorePebbleNativeAtomic + swapInTempDB + WAL replay + raft follower-ization + grpc bind) を吸収するのに妥当か
  • 環境変数オーバーライドが破壊されていないか (HEALTH_TIMEOUT_SECONDS=120 bash ... 等)
  • 根本原因の構造的修正は PR docs(design): idempotent FSM snapshot restore on cold start #910 (design doc) で別途扱う旨が読み取れるか

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 1m 23s —— View job


Code Review

  • Read changed file
  • Review: 300s adequacy for cold-start
  • Review: env-var override integrity
  • Review: root-cause follow-up legibility
  • Review: crash-loop concern (Gemini finding)

LGTM with two notes. The 60→300 change is justified and the env-var override mechanism is intact. Two things worth addressing before merge.


1. 300s adequacy — OK

The breakdown in the comment is accurate and matches the log excerpt in the PR description. The dominant cost (restorePebbleNativeAtomic + swapInTempDB) is O(FSM size / disk throughput) and the observed 46s for a ~5M-key FSM on SSD maps to roughly 1 GiB of payload. At linear scaling:

FSM size Est. step-2 time Budget used
1 GiB ~46 s ~50 s total
3 GiB ~138 s ~145 s total
6 GiB ~276 s ~285 s total

300s covers up to ~6 GiB with comfortable headroom, and the :- override is the safety valve beyond that. The value is reasonable.


2. Env-var override — OK

HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}" correctly uses the shell :- default: if the caller exports the variable, the existing value wins. HEALTH_TIMEOUT_SECONDS=120 bash rolling-update.sh ... continues to work as before. The multi-node loop re-passes the value at line 624 (HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \), so sub-invocations also honour the caller-supplied value. No regression here.


3. Root-cause follow-up — minor doc nit

The PR description clearly points to PR #910 as the structural fix, but the code comment itself has no cross-reference. A future maintainer reading only the shell script has no pointer to the planned fix. Suggest adding one line at the end of the comment block:

# Root-cause fix (skip restoreSnapshotState when already applied): see
# docs/design/2026_06_02_idempotent_snapshot_restore.md / issue #910.

Not blocking, but will prevent the "why is this 300?" question from being re-opened six months from now after PR #910 ships and someone tries to lower the value.


4. Crash-loop concern (Gemini finding) — agree, but out of scope here

Gemini flagged that wait_for_grpc (line 672–681) has no container-liveness early exit: if the process panics after pebble.Open but before gRPC bind, the loop spins for the full 300s before reporting failure.

The code comment's claim that crash-loop bugs "never finish (1) or (2)" is an overstatement — an OOM or panic mid-restorePebbleNativeAtomic would still exhaust the timeout. The correct fix is an early-exit check in the loop:

wait_for_grpc() {
  local i
  for ((i = 0; i < HEALTH_TIMEOUT_SECONDS; i++)); do
    if grpc_healthy; then
      return 0
    fi
    # Exit early if the container has already stopped.
    local status
    status="$(docker inspect --format '{{.State.Status}}' "$CONTAINER_NAME" 2>/dev/null || echo missing)"
    if [[ "$status" != "running" ]]; then
      return 1
    fi
    sleep 1
  done
  return 1
}

However, this limitation predates the PR and is orthogonal to the default-value change. Fix this → — I'd recommend filing a follow-up issue and handling it separately from this emergency fix.


Summary: the 1-line default change is correct and safe to merge. The only actionable item before merge is the cross-reference comment (§3); the wait_for_grpc liveness improvement (§4) can be a follow-up.

Address claude[bot] §3 doc nit on PR #909: future maintainers reading
only the shell script have no pointer to the planned structural fix.
Add an inline pointer to PR #910 / docs/design/.../snapshot-restore.md
so the 'why is this 300?' question doesn't get re-opened after #910
ships.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

round-2 push: fd3ff1a3 adds the cross-reference comment per claude[bot] §3 doc nit. The wait_for_grpc liveness early-exit (§4) is intentionally deferred to a separate follow-up issue per its out-of-scope flagging.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 58s —— View job


Code Review (Round 2)

  • Read changed file and round-2 diff
  • Verify §3 cross-reference comment was added correctly
  • Confirm env-var override remains intact
  • Post final review

LGTM — round-2 addresses the only actionable item from round-1.

§3 cross-reference — ✅ Done and well done

fd3ff1a adds four lines at the end of the comment block that directly answer the "why is this 300?" question a future maintainer would ask after PR #910 ships:

# Root-cause fix (skip restoreSnapshotState when the on-disk FSM is
# already at >= snapshot.Metadata.Index): see PR #910 and
# docs/design/2026_06_02_idempotent_snapshot_restore.md. Once that
# lands and steady-state skip rates are observed in production, this
# default can be tightened back down.

This is better than the minimal one-liner I suggested — it names the optimization (LastAppliedIndex >= snapshot.Metadata.Index skip), links both the PR and the design doc, and gives an explicit signal for when the value can be lowered. No further changes needed here.

§4 wait_for_grpc early-exit — explicitly deferred, ✅ Accepted

Pre-existing limitation, orthogonal to this emergency fix. Appropriate to track separately.

Remaining invariants — still hold

Check Result
HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}" (line 219) Caller-supplied override wins
HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \ (line 629) Sub-invocations inherit the value
for ((i = 0; i < HEALTH_TIMEOUT_SECONDS; i++)) (line 679) Loop uses the resolved value
bash -n syntax No issues

Ready to merge.


@bootjp bootjp merged commit 37af29e into main Jun 3, 2026
9 checks passed
@bootjp bootjp deleted the fix/rolling-update-grpc-timeout-default branch June 3, 2026 06:29
bootjp added a commit that referenced this pull request Jun 3, 2026
## Summary

PR #909 (`HEALTH_TIMEOUT_SECONDS` 60s → 300s) の **根本原因側**
の設計提案。コードは無変更、ドキュメント追加のみ。

## 何を解決したいか

`internal/raftengine/etcd/wal_store.go:117 loadWalState` が、WAL の
persisted snapshot pointer が空でない限りコールド起動の度に
`restoreSnapshotState(fsm, snapshot, fsmSnapDir)` を呼びます。EKV/token-format
snapshot の経路では最終的に
`pebbleStore.Restore` → `restorePebbleNativeAtomic` → `swapInTempDB`
(`store/lsm_store.go:1816, :2002`) が走り、
**fsm.db の全内容を sibling tempdir に書き戻してから rename で差し替える** = 起動コストが
**O(snapshot size)**。

5 ノード 192.168.0.x クラスタで実測 **~46 s** (Found 2 WALs `012720/012734` のオープン
→ Found 0 WALs の tempdir 作成 → Found 4 WALs `001309..001315` の rename
後再オープン)。
これが PR #909 で観測された "gRPC port did not come up" の主因です。

問題は **この復元の大半が冗長** だという点:
- 毎回成功した `fsm.Apply` が変更内容を Pebble の WriteBatch で durable に書き込み済み
- FSM が `applied = Y > snapshot.Metadata.Index = X` まで進んだ後の fsm.db は
state ≥ X を保持済み
- にもかかわらず次のコールド起動で **その状態を捨てて古い snapshot から再構築** → raft が同じエントリを再 replay

## 提案アーキテクチャ

### (1) `StateMachine.Apply(index, data)` 化

```go
// before
Apply(data []byte) any
// after
Apply(index uint64, data []byte) any
```

Engine の唯一の呼び出しサイト (`engine.go:1769 applyNormalEntry`) で `entry.Index`
を渡すだけ。
in-tree 実装は機械的変換、外部実装は無し。

### (2) `pebbleStore` の meta key に applied-index を atomic 同梱

`kvFSM.Apply(index, data)` が leaf MVCC mutation で取得する `pebble.Batch` に
`key=metaAppliedIndex, value=BE64(index)` を追加。
Pebble の batch は atomic なので **torn-write window なし**。

### (3) `restoreSnapshotState` 条件分岐

```go
if skip, err := fsmAlreadyAtIndex(fsm, tok.Index); err != nil {
    return err
} else if skip {
    // FSM data on disk is at least as fresh as the snapshot.
    return nil
}
return openAndRestoreFSMSnapshot(...)
```

`AppliedIndexReader` interface 未実装の FSM、または meta key 不在
(アップグレード直後の最初の再起動) は **現状の full restore にフォールバック** → strictly additive。

## 期待される効果

- コールド起動の steady-state コスト: `pebble.Open` + WAL replay + raft follower 化
+ gRPC bind の **<5 秒** (FSM サイズに依存しない)
- raft term 暴発 (timeout-induced restart loop による) の解消
- 将来 `HEALTH_TIMEOUT_SECONDS` を引き下げ可能 (Branch 5 で実施)

## 実装シーケンス (別 PR 4 本に分割)

| Branch | 内容 | 動作変化 |
|---|---|---|
| **B2** | `StateMachine.Apply(index, data)` interface 変更 + 全実装更新 (index
は受け取って捨てるだけ) | なし |
| **B3** | `kvFSM` で index を leaf mutation に流し、`pebbleStore` で meta key
を WriteBatch 同梱 + `LastAppliedIndex()` 公開 + 単体テスト | meta key が書き込まれ始める
(まだ skip は無効) |
| **B4** | `restoreSnapshotState` に `fsmAlreadyAtIndex` ガート + metrics +
INFO log | **ユーザー可視な perf 改善** ここで発火 |
| **B5** | `HEALTH_TIMEOUT_SECONDS` を tighter な値に引き下げ | docker restart
loop 抑止 |

各 branch は対応する invariant を unit test でガード。

## Crash-safety / Compatibility

詳細は `docs/design/2026_06_02_idempotent_snapshot_restore.md` の §3 §5 参照。
要約:
- meta key とデータ mutation は同一 `pebble.Batch` で commit されるため、クラッシュ後の
Pebble WAL replay は両者を一緒に restore するか両者を一緒に discard する。中間状態はあり得ない。
- 8-byte length check 失敗時は読み取り側が conservative に "missing" として扱い、full
restore にフォールバック。
- 旧 fsm.db に meta key が無い場合は最初の再起動は現状動作。データ移行不要。
- ロールバックは wal_store.go の skip 分岐を revert するだけ。meta key は dead data
として残るが無害。

## 観測対象

```
fsm_cold_start_restore_total{outcome="executed|skipped|fallback"}
fsm_cold_start_applied_index_gap (executed 時の snapshot.Index - LastAppliedIndex)
```

+ skip 時の INFO log: `restoreSnapshotState skipped (FSM already at index
N >= snapshot index X)`

## レビュー観点

- §3 の Pebble batch 原子性に基づく crash-safety 証明が説得力あるか
- §4 の "skip 後の replay 冪等性" 議論で見落としは無いか (`kvFSM.Apply` が現状 idempotent
と読めているか)
- Branch 分割 (4 本) の粒度が妥当か / もっと束ねるべきか
- HLC lease entry の index 監視に関する Open Question (§Open Questions 1 番目)
の解像度

設計が固まり次第 Branch 2 (interface change) から着手します。


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Revised design doc detailing idempotent snapshot restore: safe skip of
heavy snapshot bodies when on-disk state is proven current, header CRC
verification for the skip path, and a durable applied-index write before
snapshot persistence. Clarifies crash-safety, observability,
compatibility, and implementation plan.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/910?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Jun 4, 2026
…oth snapshot persist sites (#915)

## Summary

Implements **Branch 2** of the cold-start snapshot-restore skip
optimisation designed in PR #910. After this lands the
`metaAppliedIndex` Pebble meta key is durably written on every
raft-Apply data mutation AND at every snapshot persist — but the skip
gate itself (Branch 3) is NOT yet wired, so behaviour is observationally
identical to `main` except for the new meta key in fsm.db. Branch 2 is
meant to soak in production for at least one release before Branch 3
enables the skip; this PR is intentionally a no-op-from-the-outside
change with comprehensive plumbing.

## Reading order (6 commits, designed to review one-at-a-time)

| # | commit | scope |
|---|---|---|
| 1 | `2339a6f2` | raftengine: opt-in interfaces (`AppliedIndexReader` /
`AppliedIndexWriter`) |
| 2 | `525fc152` | pebbleStore: `metaAppliedIndex` const +
`LastAppliedIndex` + `SetDurableAppliedIndex` (with `pebble.Sync`
UNCONDITIONALLY) |
| 3 | `aa9b8acc` | `MVCCStore` interface extension:
`ApplyMutationsRaftAt` / `DeletePrefixAtRaftAt` overloads, threading
appliedIndex through `applyMutationsWithOpts` + `deletePrefixAtWithOpts`
|
| 4 | `7cd72bda` | kvFSM seam wiring: `AppliedIndexReader()` /
`SetDurableAppliedIndex()` accessors + all 7 data-Apply leaves switched
to `*RaftAt` with `f.pendingApplyIdx` |
| 5 | `f1e8748c` | engine hooks at BOTH snapshot persist sites:
`persistCreatedSnapshot` + `e.persistLocalSnapshotPayload` call
`SetDurableAppliedIndex` BEFORE `persist.SaveSnap` |
| 6 | `2c42f7d6` | tests (10 new tests across store + engine) |

## Design constraints honoured

All from `docs/design/2026_06_02_idempotent_snapshot_restore.md`:

- **§2 "Why both leaves"**: meta key bundle in BOTH
`applyMutationsWithOpts` AND `deletePrefixAtWithOpts` so DEL_PREFIX
entries don't silently leave `LastAppliedIndex` behind. Tested by
`TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex`.
- **§3 `dbMu.RLock()`**: both `LastAppliedIndex` and
`SetDurableAppliedIndex` acquire the read-lock, matching the
lock-ordering discipline at `lsm_store.go:153 / :553 / :675`.
- **§4 fallback policy**: `AppliedIndexReader()` returns nil when the
store doesn't implement the seam; `LastAppliedIndex` returns `(0, false,
nil)` for missing OR truncated meta key. Branch 3 will then fall back to
full restore conservatively.
- **§6 `ELASTICKV_FSM_SYNC_MODE=nosync` mode**: `SetDurableAppliedIndex`
is **pinned to `pebble.Sync` unconditionally**. Rationale documented at
length in the method's doc-comment — once `persist.SaveSnap` returns,
WAL compaction discards every log entry ≤ `snap.Metadata.Index`, so
there's no source to replay the meta key bump from. +1 fsync per
snapshot persist (rare; default `SnapshotCount=10000`). Tested by
`TestSetDurableAppliedIndex_UsesPebbleSync`.
- **§6 "HLC lease entries — checkpoint at snapshot persist"**: BOTH
`persistCreatedSnapshot` (config snapshots) AND
`e.persistLocalSnapshotPayload` (steady-state `SnapshotCount`-triggered
hot path) call the hook. Both crash-ordering tested by
`TestPersistCreatedSnapshot_*`.
- **§8 compatibility**: `StateMachine.Apply`'s public signature is
unchanged. New interfaces are opt-in. Old call sites
(`ApplyMutationsRaft` without `*At`) still work, just pass
`appliedIndex=0` to opt out of the meta key bump.

## Test results

```
go vet ./...                                                    → 0 issues
go test ./store/ -short                                         → ok 29.4s
go test ./kv/    -short                                         → ok 10.4s
go test ./internal/raftengine/... -short                        → ok 32.8s
go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s
go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s
```

10 new tests added (see commit `2c42f7d6` for the full inventory).

## What this does NOT do

- **Does NOT enable the skip gate.** `restoreSnapshotState` still always
restores. Branch 3 wires the `fsmAlreadyAtIndex` check +
`applyHeaderStateOnSkip` + the two-phase `SnapshotHeaderApplier` seam.
- **Does NOT change `HEALTH_TIMEOUT_SECONDS=300`.** Branch 4 lowers it
once Branch 3 has soaked.
- **Does NOT touch the snapshot-install hot path**
(`Engine.applySnapshot`) per Non-Goals in the design.

## Soak plan

Branch 2 should run in production for at least one release before Branch
3 opens. Operators can verify the meta key is being written via:

```bash
# Inspect a pebble fsm.db (read-only)
ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex
# Expected: 8 little-endian bytes equal to the current applied index
```

## Refs

- PR #910 (design) — round 1..7 design history + retraction sections
explaining the design constraints this PR honours
- PR #909 — `HEALTH_TIMEOUT_SECONDS` band-aid that this series
eventually obviates


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Durable tracking of Raft-applied indexes to ensure consistent
snapshot/save ordering.

* **Bug Fixes**
* Improved snapshot persistence reliability by pinning durable applied
index before snapshot writes.
* Stronger durability for writes bundled with Raft entry indices,
reducing restore/recovery surprises.

* **Tests**
* Added comprehensive tests covering applied-index ordering, failure
handling, and persistence behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant