fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s#909
Conversation
The wait_for_grpc budget must cover the full cold-start path:
pebble.Open(fsm.db)
-> loadWalState
-> restoreSnapshotState
-> pebbleStore.Restore
-> restorePebbleNativeAtomic (writes a sibling temp DB)
-> swapInTempDB (RemoveAll old dir + Rename temp into place + reopen)
-> WAL entry replay (commit - applied entries)
-> raft follower-ization
-> gRPC.Serve bind
For multi-GiB FSMs the restorePebbleNativeAtomic + swapInTempDB pair
dominates wall-clock cost (observed ~46s on a 5M-key node, scales
linearly with FSM size and disk throughput).
A 60s ceiling caused the restarted container to be declared unhealthy
mid-restore, which the rolling-update.sh exit path surfaces as
'gRPC port did not come up on $NODE_HOST:$RAFT_PORT'. The repeated
timeout-induced restart loop also drives raft term inflation on the
remaining quorum because the restarting node is unresponsive across
elections.
300s comfortably absorbs the FSM restore for multi-GiB working sets
while still failing fast on genuine crash-loop bugs (those never reach
step 3). Operators with smaller FSMs can shrink it via env override.
Follow-up (separate PR / design doc): make restoreSnapshotState
idempotent so cold-start restore can be skipped when the on-disk FSM
is already at >= snapshot.Metadata.Index, eliminating the 46s pay-on-
restart cost entirely.
|
Warning Review limit reached
More reviews will be available in 59 minutes and 21 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request increases the default HEALTH_TIMEOUT_SECONDS from 60 to 300 seconds in scripts/rolling-update.sh to accommodate longer cold-start times for nodes with large FSMs, adding detailed documentation for this decision. The reviewer pointed out that this longer timeout could cause the script to block for up to 5 minutes if a container crashes immediately, and suggested updating wait_for_grpc to check if the container is still running and exit early in such cases.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # for the multi-GiB working sets we operate against today and still | ||
| # fails fast on genuine crash-loop bugs (which never finish (1) or (2)). | ||
| # Operators with smaller FSMs can shrink HEALTH_TIMEOUT_SECONDS via env. | ||
| HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}" |
There was a problem hiding this comment.
Increasing the default HEALTH_TIMEOUT_SECONDS to 300 seconds (5 minutes) introduces a potential operational bottleneck when a container fails to start or crashes immediately. Currently, wait_for_grpc (line 672) only checks grpc_healthy and sleeps in a loop. If the container exits or enters a crash loop, the script will block and wait for the full 5 minutes before failing. To prevent this, consider updating wait_for_grpc to check if the container is still running (e.g., via docker inspect) and exit early if it has stopped.
|
@claude review PR は scripts/rolling-update.sh の HEALTH_TIMEOUT_SECONDS デフォルトを 60s → 300s に引き上げる 1 行差分 + 出所コメントです。 レビュー観点:
|
|
Claude finished @bootjp's task in 1m 23s —— View job Code Review
LGTM with two notes. The 60→300 change is justified and the env-var override mechanism is intact. Two things worth addressing before merge. 1. 300s adequacy — OKThe breakdown in the comment is accurate and matches the log excerpt in the PR description. The dominant cost (
300s covers up to ~6 GiB with comfortable headroom, and the 2. Env-var override — OK
3. Root-cause follow-up — minor doc nitThe PR description clearly points to PR #910 as the structural fix, but the code comment itself has no cross-reference. A future maintainer reading only the shell script has no pointer to the planned fix. Suggest adding one line at the end of the comment block: # Root-cause fix (skip restoreSnapshotState when already applied): see
# docs/design/2026_06_02_idempotent_snapshot_restore.md / issue #910.Not blocking, but will prevent the "why is this 300?" question from being re-opened six months from now after PR #910 ships and someone tries to lower the value. 4. Crash-loop concern (Gemini finding) — agree, but out of scope hereGemini flagged that The code comment's claim that crash-loop bugs "never finish (1) or (2)" is an overstatement — an OOM or panic mid- wait_for_grpc() {
local i
for ((i = 0; i < HEALTH_TIMEOUT_SECONDS; i++)); do
if grpc_healthy; then
return 0
fi
# Exit early if the container has already stopped.
local status
status="$(docker inspect --format '{{.State.Status}}' "$CONTAINER_NAME" 2>/dev/null || echo missing)"
if [[ "$status" != "running" ]]; then
return 1
fi
sleep 1
done
return 1
}However, this limitation predates the PR and is orthogonal to the default-value change. Fix this → — I'd recommend filing a follow-up issue and handling it separately from this emergency fix. Summary: the 1-line default change is correct and safe to merge. The only actionable item before merge is the cross-reference comment (§3); the |
|
round-2 push: @claude review |
|
Claude finished @bootjp's task in 58s —— View job Code Review (Round 2)
LGTM — round-2 addresses the only actionable item from round-1. §3 cross-reference — ✅ Done and well done
# Root-cause fix (skip restoreSnapshotState when the on-disk FSM is
# already at >= snapshot.Metadata.Index): see PR #910 and
# docs/design/2026_06_02_idempotent_snapshot_restore.md. Once that
# lands and steady-state skip rates are observed in production, this
# default can be tightened back down.This is better than the minimal one-liner I suggested — it names the optimization ( §4 wait_for_grpc early-exit — explicitly deferred, ✅ AcceptedPre-existing limitation, orthogonal to this emergency fix. Appropriate to track separately. Remaining invariants — still hold
Ready to merge. |
## Summary PR #909 (`HEALTH_TIMEOUT_SECONDS` 60s → 300s) の **根本原因側** の設計提案。コードは無変更、ドキュメント追加のみ。 ## 何を解決したいか `internal/raftengine/etcd/wal_store.go:117 loadWalState` が、WAL の persisted snapshot pointer が空でない限りコールド起動の度に `restoreSnapshotState(fsm, snapshot, fsmSnapDir)` を呼びます。EKV/token-format snapshot の経路では最終的に `pebbleStore.Restore` → `restorePebbleNativeAtomic` → `swapInTempDB` (`store/lsm_store.go:1816, :2002`) が走り、 **fsm.db の全内容を sibling tempdir に書き戻してから rename で差し替える** = 起動コストが **O(snapshot size)**。 5 ノード 192.168.0.x クラスタで実測 **~46 s** (Found 2 WALs `012720/012734` のオープン → Found 0 WALs の tempdir 作成 → Found 4 WALs `001309..001315` の rename 後再オープン)。 これが PR #909 で観測された "gRPC port did not come up" の主因です。 問題は **この復元の大半が冗長** だという点: - 毎回成功した `fsm.Apply` が変更内容を Pebble の WriteBatch で durable に書き込み済み - FSM が `applied = Y > snapshot.Metadata.Index = X` まで進んだ後の fsm.db は state ≥ X を保持済み - にもかかわらず次のコールド起動で **その状態を捨てて古い snapshot から再構築** → raft が同じエントリを再 replay ## 提案アーキテクチャ ### (1) `StateMachine.Apply(index, data)` 化 ```go // before Apply(data []byte) any // after Apply(index uint64, data []byte) any ``` Engine の唯一の呼び出しサイト (`engine.go:1769 applyNormalEntry`) で `entry.Index` を渡すだけ。 in-tree 実装は機械的変換、外部実装は無し。 ### (2) `pebbleStore` の meta key に applied-index を atomic 同梱 `kvFSM.Apply(index, data)` が leaf MVCC mutation で取得する `pebble.Batch` に `key=metaAppliedIndex, value=BE64(index)` を追加。 Pebble の batch は atomic なので **torn-write window なし**。 ### (3) `restoreSnapshotState` 条件分岐 ```go if skip, err := fsmAlreadyAtIndex(fsm, tok.Index); err != nil { return err } else if skip { // FSM data on disk is at least as fresh as the snapshot. return nil } return openAndRestoreFSMSnapshot(...) ``` `AppliedIndexReader` interface 未実装の FSM、または meta key 不在 (アップグレード直後の最初の再起動) は **現状の full restore にフォールバック** → strictly additive。 ## 期待される効果 - コールド起動の steady-state コスト: `pebble.Open` + WAL replay + raft follower 化 + gRPC bind の **<5 秒** (FSM サイズに依存しない) - raft term 暴発 (timeout-induced restart loop による) の解消 - 将来 `HEALTH_TIMEOUT_SECONDS` を引き下げ可能 (Branch 5 で実施) ## 実装シーケンス (別 PR 4 本に分割) | Branch | 内容 | 動作変化 | |---|---|---| | **B2** | `StateMachine.Apply(index, data)` interface 変更 + 全実装更新 (index は受け取って捨てるだけ) | なし | | **B3** | `kvFSM` で index を leaf mutation に流し、`pebbleStore` で meta key を WriteBatch 同梱 + `LastAppliedIndex()` 公開 + 単体テスト | meta key が書き込まれ始める (まだ skip は無効) | | **B4** | `restoreSnapshotState` に `fsmAlreadyAtIndex` ガート + metrics + INFO log | **ユーザー可視な perf 改善** ここで発火 | | **B5** | `HEALTH_TIMEOUT_SECONDS` を tighter な値に引き下げ | docker restart loop 抑止 | 各 branch は対応する invariant を unit test でガード。 ## Crash-safety / Compatibility 詳細は `docs/design/2026_06_02_idempotent_snapshot_restore.md` の §3 §5 参照。 要約: - meta key とデータ mutation は同一 `pebble.Batch` で commit されるため、クラッシュ後の Pebble WAL replay は両者を一緒に restore するか両者を一緒に discard する。中間状態はあり得ない。 - 8-byte length check 失敗時は読み取り側が conservative に "missing" として扱い、full restore にフォールバック。 - 旧 fsm.db に meta key が無い場合は最初の再起動は現状動作。データ移行不要。 - ロールバックは wal_store.go の skip 分岐を revert するだけ。meta key は dead data として残るが無害。 ## 観測対象 ``` fsm_cold_start_restore_total{outcome="executed|skipped|fallback"} fsm_cold_start_applied_index_gap (executed 時の snapshot.Index - LastAppliedIndex) ``` + skip 時の INFO log: `restoreSnapshotState skipped (FSM already at index N >= snapshot index X)` ## レビュー観点 - §3 の Pebble batch 原子性に基づく crash-safety 証明が説得力あるか - §4 の "skip 後の replay 冪等性" 議論で見落としは無いか (`kvFSM.Apply` が現状 idempotent と読めているか) - Branch 分割 (4 本) の粒度が妥当か / もっと束ねるべきか - HLC lease entry の index 監視に関する Open Question (§Open Questions 1 番目) の解像度 設計が固まり次第 Branch 2 (interface change) から着手します。 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Revised design doc detailing idempotent snapshot restore: safe skip of heavy snapshot bodies when on-disk state is proven current, header CRC verification for the skip path, and a durable applied-index write before snapshot persistence. Clarifies crash-safety, observability, compatibility, and implementation plan. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/910?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…oth snapshot persist sites (#915) ## Summary Implements **Branch 2** of the cold-start snapshot-restore skip optimisation designed in PR #910. After this lands the `metaAppliedIndex` Pebble meta key is durably written on every raft-Apply data mutation AND at every snapshot persist — but the skip gate itself (Branch 3) is NOT yet wired, so behaviour is observationally identical to `main` except for the new meta key in fsm.db. Branch 2 is meant to soak in production for at least one release before Branch 3 enables the skip; this PR is intentionally a no-op-from-the-outside change with comprehensive plumbing. ## Reading order (6 commits, designed to review one-at-a-time) | # | commit | scope | |---|---|---| | 1 | `2339a6f2` | raftengine: opt-in interfaces (`AppliedIndexReader` / `AppliedIndexWriter`) | | 2 | `525fc152` | pebbleStore: `metaAppliedIndex` const + `LastAppliedIndex` + `SetDurableAppliedIndex` (with `pebble.Sync` UNCONDITIONALLY) | | 3 | `aa9b8acc` | `MVCCStore` interface extension: `ApplyMutationsRaftAt` / `DeletePrefixAtRaftAt` overloads, threading appliedIndex through `applyMutationsWithOpts` + `deletePrefixAtWithOpts` | | 4 | `7cd72bda` | kvFSM seam wiring: `AppliedIndexReader()` / `SetDurableAppliedIndex()` accessors + all 7 data-Apply leaves switched to `*RaftAt` with `f.pendingApplyIdx` | | 5 | `f1e8748c` | engine hooks at BOTH snapshot persist sites: `persistCreatedSnapshot` + `e.persistLocalSnapshotPayload` call `SetDurableAppliedIndex` BEFORE `persist.SaveSnap` | | 6 | `2c42f7d6` | tests (10 new tests across store + engine) | ## Design constraints honoured All from `docs/design/2026_06_02_idempotent_snapshot_restore.md`: - **§2 "Why both leaves"**: meta key bundle in BOTH `applyMutationsWithOpts` AND `deletePrefixAtWithOpts` so DEL_PREFIX entries don't silently leave `LastAppliedIndex` behind. Tested by `TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex`. - **§3 `dbMu.RLock()`**: both `LastAppliedIndex` and `SetDurableAppliedIndex` acquire the read-lock, matching the lock-ordering discipline at `lsm_store.go:153 / :553 / :675`. - **§4 fallback policy**: `AppliedIndexReader()` returns nil when the store doesn't implement the seam; `LastAppliedIndex` returns `(0, false, nil)` for missing OR truncated meta key. Branch 3 will then fall back to full restore conservatively. - **§6 `ELASTICKV_FSM_SYNC_MODE=nosync` mode**: `SetDurableAppliedIndex` is **pinned to `pebble.Sync` unconditionally**. Rationale documented at length in the method's doc-comment — once `persist.SaveSnap` returns, WAL compaction discards every log entry ≤ `snap.Metadata.Index`, so there's no source to replay the meta key bump from. +1 fsync per snapshot persist (rare; default `SnapshotCount=10000`). Tested by `TestSetDurableAppliedIndex_UsesPebbleSync`. - **§6 "HLC lease entries — checkpoint at snapshot persist"**: BOTH `persistCreatedSnapshot` (config snapshots) AND `e.persistLocalSnapshotPayload` (steady-state `SnapshotCount`-triggered hot path) call the hook. Both crash-ordering tested by `TestPersistCreatedSnapshot_*`. - **§8 compatibility**: `StateMachine.Apply`'s public signature is unchanged. New interfaces are opt-in. Old call sites (`ApplyMutationsRaft` without `*At`) still work, just pass `appliedIndex=0` to opt out of the meta key bump. ## Test results ``` go vet ./... → 0 issues go test ./store/ -short → ok 29.4s go test ./kv/ -short → ok 10.4s go test ./internal/raftengine/... -short → ok 32.8s go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s ``` 10 new tests added (see commit `2c42f7d6` for the full inventory). ## What this does NOT do - **Does NOT enable the skip gate.** `restoreSnapshotState` still always restores. Branch 3 wires the `fsmAlreadyAtIndex` check + `applyHeaderStateOnSkip` + the two-phase `SnapshotHeaderApplier` seam. - **Does NOT change `HEALTH_TIMEOUT_SECONDS=300`.** Branch 4 lowers it once Branch 3 has soaked. - **Does NOT touch the snapshot-install hot path** (`Engine.applySnapshot`) per Non-Goals in the design. ## Soak plan Branch 2 should run in production for at least one release before Branch 3 opens. Operators can verify the meta key is being written via: ```bash # Inspect a pebble fsm.db (read-only) ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex # Expected: 8 little-endian bytes equal to the current applied index ``` ## Refs - PR #910 (design) — round 1..7 design history + retraction sections explaining the design constraints this PR honours - PR #909 — `HEALTH_TIMEOUT_SECONDS` band-aid that this series eventually obviates <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Durable tracking of Raft-applied indexes to ensure consistent snapshot/save ordering. * **Bug Fixes** * Improved snapshot persistence reliability by pinning durable applied index before snapshot writes. * Stronger durability for writes bundled with Raft entry indices, reducing restore/recovery surprises. * **Tests** * Added comprehensive tests covering applied-index ordering, failure handling, and persistence behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
scripts/rolling-update.shのHEALTH_TIMEOUT_SECONDSデフォルトを 60s → 300s に引き上げます。なぜ 60s では足りないか
2026-06-02 の 5 ノードクラスタ (192.168.0.x) ローリング再起動で、
.212ノードが起動中にwait_for_grpcがタイムアウトし、スクリプトがgRPC port did not come up on 192.168.0.212:50051で exit。docker logs --tail 200を辿ると以下の wall-clock 内訳:つまり多くの場合
restorePebbleNativeAtomic+swapInTempDBだけで ~46 秒 消費しており、その後にまだ WAL replay + raft follower 化 + gRPC bind が残るため、60s ceiling では確実にタイムアウトします。タイムアウトはさらに二次被害を生みます: 再起動コンテナが応答不能のまま docker restart policy で再投入される → 残りクォーラムから見て当該ノードが選挙に応答しない → raft term が増え続ける(我々のクラスタでは term 665 まで上昇)。
変更内容
HEALTH_TIMEOUT_SECONDSのデフォルトを60→300に動作確認
bash -n scripts/rolling-update.sh→ syntax OKHEALTH_TIMEOUT_SECONDS=120 bash scripts/rolling-update.sh ...のようなオペレーター側オーバーライドは引き続き有効設計フォローアップ (別 PR)
これは 症状の応急処置 で、根本原因(
restoreSnapshotStateがコールドスタート時に毎回 FSM 全復元する)は別 PR で扱います:pebbleStoreにLastAppliedIndex()を生やし、毎 Apply の WriteBatch にメタキーとしてアトミックに同梱restoreSnapshotStateでLastAppliedIndex() >= snapshot.Metadata.Indexならスキップdocs/design/2026_06_02_idempotent_snapshot_restore.md(Coming up next)これが入れば 46s の payload は完全に消え、
HEALTH_TIMEOUT_SECONDSは将来引き下げ可能になります。