fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s by bootjp · Pull Request #909 · bootjp/elastickv

bootjp · 2026-06-02T13:41:09Z

Summary

scripts/rolling-update.sh の HEALTH_TIMEOUT_SECONDS デフォルトを 60s → 300s に引き上げます。

なぜ 60s では足りないか

2026-06-02 の 5 ノードクラスタ (192.168.0.x) ローリング再起動で、.212 ノードが起動中に wait_for_grpc がタイムアウトし、スクリプトが gRPC port did not come up on 192.168.0.212:50051 で exit。docker logs --tail 200 を辿ると以下の wall-clock 内訳:

12:56:05  pebble.Open("/var/lib/elastickv/n3/fsm.db") — 既存 DB を開いて WAL 012720/012734 を replay
12:56:06  pebble.Open(<sibling tempdir>) — restorePebbleNativeAtomic が新規 Pebble を作成 (Found 0 WALs)
   ↓ +45s  restoreBatchLoopInto が FSM スナップショットを batch-loop で書き戻す
12:56:51  pebble.Open(fsm.db) — swapInTempDB が old dir を RemoveAll + temp を Rename → 再オープン (Found 4 WALs: 001309..001315)
   ↓       WAL replay (commit=57365699 / applied=57361461 → 4238 entries)
   ↓       raft follower-ization
   ↓       grpc.Serve bind

つまり多くの場合 restorePebbleNativeAtomic + swapInTempDB だけで ~46 秒 消費しており、その後にまだ WAL replay + raft follower 化 + gRPC bind が残るため、60s ceiling では確実にタイムアウトします。

タイムアウトはさらに二次被害を生みます: 再起動コンテナが応答不能のまま docker restart policy で再投入される → 残りクォーラムから見て当該ノードが選挙に応答しない → raft term が増え続ける(我々のクラスタでは term 665 まで上昇)。

変更内容

HEALTH_TIMEOUT_SECONDS のデフォルトを 60 → 300 に
上に コールドスタートの内訳と why 300s を明記したコメントブロック (将来オペレーターが値を縮めたくなった時の根拠)

動作確認

bash -n scripts/rolling-update.sh → syntax OK
HEALTH_TIMEOUT_SECONDS=120 bash scripts/rolling-update.sh ... のようなオペレーター側オーバーライドは引き続き有効

設計フォローアップ (別 PR)

これは 症状の応急処置 で、根本原因(restoreSnapshotState がコールドスタート時に毎回 FSM 全復元する)は別 PR で扱います:

pebbleStore に LastAppliedIndex() を生やし、毎 Apply の WriteBatch にメタキーとしてアトミックに同梱
restoreSnapshotState で LastAppliedIndex() >= snapshot.Metadata.Index ならスキップ
設計ドキュメント: docs/design/2026_06_02_idempotent_snapshot_restore.md (Coming up next)

これが入れば 46s の payload は完全に消え、HEALTH_TIMEOUT_SECONDS は将来引き下げ可能になります。

The wait_for_grpc budget must cover the full cold-start path: pebble.Open(fsm.db) -> loadWalState -> restoreSnapshotState -> pebbleStore.Restore -> restorePebbleNativeAtomic (writes a sibling temp DB) -> swapInTempDB (RemoveAll old dir + Rename temp into place + reopen) -> WAL entry replay (commit - applied entries) -> raft follower-ization -> gRPC.Serve bind For multi-GiB FSMs the restorePebbleNativeAtomic + swapInTempDB pair dominates wall-clock cost (observed ~46s on a 5M-key node, scales linearly with FSM size and disk throughput). A 60s ceiling caused the restarted container to be declared unhealthy mid-restore, which the rolling-update.sh exit path surfaces as 'gRPC port did not come up on $NODE_HOST:$RAFT_PORT'. The repeated timeout-induced restart loop also drives raft term inflation on the remaining quorum because the restarting node is unresponsive across elections. 300s comfortably absorbs the FSM restore for multi-GiB working sets while still failing fast on genuine crash-loop bugs (those never reach step 3). Operators with smaller FSMs can shrink it via env override. Follow-up (separate PR / design doc): make restoreSnapshotState idempotent so cold-start restore can be skipped when the on-disk FSM is already at >= snapshot.Metadata.Index, eliminating the 46s pay-on- restart cost entirely.

coderabbitai · 2026-06-02T13:41:18Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 21 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 18f01b92-4dbe-45a6-b19f-be03b6c80422

📥 Commits

Reviewing files that changed from the base of the PR and between 94579fc and fd3ff1a.

📒 Files selected for processing (1)

scripts/rolling-update.sh

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/rolling-update-grpc-timeout-default

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request increases the default HEALTH_TIMEOUT_SECONDS from 60 to 300 seconds in scripts/rolling-update.sh to accommodate longer cold-start times for nodes with large FSMs, adding detailed documentation for this decision. The reviewer pointed out that this longer timeout could cause the script to block for up to 5 minutes if a container crashes immediately, and suggested updating wait_for_grpc to check if the container is still running and exit early in such cases.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-02T13:42:23Z

+# for the multi-GiB working sets we operate against today and still
+# fails fast on genuine crash-loop bugs (which never finish (1) or (2)).
+# Operators with smaller FSMs can shrink HEALTH_TIMEOUT_SECONDS via env.
+HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}"


Increasing the default HEALTH_TIMEOUT_SECONDS to 300 seconds (5 minutes) introduces a potential operational bottleneck when a container fails to start or crashes immediately. Currently, wait_for_grpc (line 672) only checks grpc_healthy and sleeps in a loop. If the container exits or enters a crash loop, the script will block and wait for the full 5 minutes before failing. To prevent this, consider updating wait_for_grpc to check if the container is still running (e.g., via docker inspect) and exit early if it has stopped.

bootjp · 2026-06-02T13:45:22Z

@claude review

PR は scripts/rolling-update.sh の HEALTH_TIMEOUT_SECONDS デフォルトを 60s → 300s に引き上げる 1 行差分 + 出所コメントです。

レビュー観点:

300s が FSM サイズ多 GiB 帯のコールド起動 (pebble.Open + restorePebbleNativeAtomic + swapInTempDB + WAL replay + raft follower-ization + grpc bind) を吸収するのに妥当か
環境変数オーバーライドが破壊されていないか (HEALTH_TIMEOUT_SECONDS=120 bash ... 等)
根本原因の構造的修正は PR docs(design): idempotent FSM snapshot restore on cold start #910 (design doc) で別途扱う旨が読み取れるか

claude · 2026-06-02T13:45:44Z

Claude finished @bootjp's task in 1m 23s —— View job

Code Review

Read changed file
Review: 300s adequacy for cold-start
Review: env-var override integrity
Review: root-cause follow-up legibility
Review: crash-loop concern (Gemini finding)

LGTM with two notes. The 60→300 change is justified and the env-var override mechanism is intact. Two things worth addressing before merge.

1. 300s adequacy — OK

The breakdown in the comment is accurate and matches the log excerpt in the PR description. The dominant cost (restorePebbleNativeAtomic + swapInTempDB) is O(FSM size / disk throughput) and the observed 46s for a ~5M-key FSM on SSD maps to roughly 1 GiB of payload. At linear scaling:

FSM size	Est. step-2 time	Budget used
1 GiB	~46 s	~50 s total
3 GiB	~138 s	~145 s total
6 GiB	~276 s	~285 s total

300s covers up to ~6 GiB with comfortable headroom, and the :- override is the safety valve beyond that. The value is reasonable.

2. Env-var override — OK

HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}" correctly uses the shell :- default: if the caller exports the variable, the existing value wins. HEALTH_TIMEOUT_SECONDS=120 bash rolling-update.sh ... continues to work as before. The multi-node loop re-passes the value at line 624 (HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \), so sub-invocations also honour the caller-supplied value. No regression here.

3. Root-cause follow-up — minor doc nit

The PR description clearly points to PR #910 as the structural fix, but the code comment itself has no cross-reference. A future maintainer reading only the shell script has no pointer to the planned fix. Suggest adding one line at the end of the comment block:

# Root-cause fix (skip restoreSnapshotState when already applied): see
# docs/design/2026_06_02_idempotent_snapshot_restore.md / issue #910.

Not blocking, but will prevent the "why is this 300?" question from being re-opened six months from now after PR #910 ships and someone tries to lower the value.

4. Crash-loop concern (Gemini finding) — agree, but out of scope here

Gemini flagged that wait_for_grpc (line 672–681) has no container-liveness early exit: if the process panics after pebble.Open but before gRPC bind, the loop spins for the full 300s before reporting failure.

The code comment's claim that crash-loop bugs "never finish (1) or (2)" is an overstatement — an OOM or panic mid-restorePebbleNativeAtomic would still exhaust the timeout. The correct fix is an early-exit check in the loop:

wait_for_grpc() {
  local i
  for ((i = 0; i < HEALTH_TIMEOUT_SECONDS; i++)); do
    if grpc_healthy; then
      return 0
    fi
    # Exit early if the container has already stopped.
    local status
    status="$(docker inspect --format '{{.State.Status}}' "$CONTAINER_NAME" 2>/dev/null || echo missing)"
    if [[ "$status" != "running" ]]; then
      return 1
    fi
    sleep 1
  done
  return 1
}

However, this limitation predates the PR and is orthogonal to the default-value change. Fix this → — I'd recommend filing a follow-up issue and handling it separately from this emergency fix.

Summary: the 1-line default change is correct and safe to merge. The only actionable item before merge is the cross-reference comment (§3); the wait_for_grpc liveness improvement (§4) can be a follow-up.

Address claude[bot] §3 doc nit on PR #909: future maintainers reading only the shell script have no pointer to the planned structural fix. Add an inline pointer to PR #910 / docs/design/.../snapshot-restore.md so the 'why is this 300?' question doesn't get re-opened after #910 ships.

bootjp · 2026-06-03T03:34:48Z

round-2 push: fd3ff1a3 adds the cross-reference comment per claude[bot] §3 doc nit. The wait_for_grpc liveness early-exit (§4) is intentionally deferred to a separate follow-up issue per its out-of-scope flagging.

@claude review

claude · 2026-06-03T03:34:59Z

Claude finished @bootjp's task in 58s —— View job

Code Review (Round 2)

Read changed file and round-2 diff
Verify §3 cross-reference comment was added correctly
Confirm env-var override remains intact
Post final review

LGTM — round-2 addresses the only actionable item from round-1.

§3 cross-reference — ✅ Done and well done

fd3ff1a adds four lines at the end of the comment block that directly answer the "why is this 300?" question a future maintainer would ask after PR #910 ships:

# Root-cause fix (skip restoreSnapshotState when the on-disk FSM is
# already at >= snapshot.Metadata.Index): see PR #910 and
# docs/design/2026_06_02_idempotent_snapshot_restore.md. Once that
# lands and steady-state skip rates are observed in production, this
# default can be tightened back down.

This is better than the minimal one-liner I suggested — it names the optimization (LastAppliedIndex >= snapshot.Metadata.Index skip), links both the PR and the design doc, and gives an explicit signal for when the value can be lowered. No further changes needed here.

§4 wait_for_grpc early-exit — explicitly deferred, ✅ Accepted

Pre-existing limitation, orthogonal to this emergency fix. Appropriate to track separately.

Remaining invariants — still hold

Check	Result
`HEALTH_TIMEOUT_SECONDS="${HEALTH_TIMEOUT_SECONDS:-300}"` (line 219)	Caller-supplied override wins
`HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \` (line 629)	Sub-invocations inherit the value
`for ((i = 0; i < HEALTH_TIMEOUT_SECONDS; i++))` (line 679)	Loop uses the resolved value
`bash -n` syntax	No issues

Ready to merge.

## Summary PR #909 (`HEALTH_TIMEOUT_SECONDS` 60s → 300s) の **根本原因側** の設計提案。コードは無変更、ドキュメント追加のみ。 ## 何を解決したいか `internal/raftengine/etcd/wal_store.go:117 loadWalState` が、WAL の persisted snapshot pointer が空でない限りコールド起動の度に `restoreSnapshotState(fsm, snapshot, fsmSnapDir)` を呼びます。EKV/token-format snapshot の経路では最終的に `pebbleStore.Restore` → `restorePebbleNativeAtomic` → `swapInTempDB` (`store/lsm_store.go:1816, :2002`) が走り、 **fsm.db の全内容を sibling tempdir に書き戻してから rename で差し替える** = 起動コストが **O(snapshot size)**。 5 ノード 192.168.0.x クラスタで実測 **~46 s** (Found 2 WALs `012720/012734` のオープン → Found 0 WALs の tempdir 作成 → Found 4 WALs `001309..001315` の rename 後再オープン)。これが PR #909 で観測された "gRPC port did not come up" の主因です。問題は **この復元の大半が冗長** だという点: - 毎回成功した `fsm.Apply` が変更内容を Pebble の WriteBatch で durable に書き込み済み - FSM が `applied = Y > snapshot.Metadata.Index = X` まで進んだ後の fsm.db は state ≥ X を保持済み - にもかかわらず次のコールド起動で **その状態を捨てて古い snapshot から再構築** → raft が同じエントリを再 replay ## 提案アーキテクチャ ### (1) `StateMachine.Apply(index, data)` 化 ```go // before Apply(data []byte) any // after Apply(index uint64, data []byte) any ``` Engine の唯一の呼び出しサイト (`engine.go:1769 applyNormalEntry`) で `entry.Index` を渡すだけ。 in-tree 実装は機械的変換、外部実装は無し。 ### (2) `pebbleStore` の meta key に applied-index を atomic 同梱 `kvFSM.Apply(index, data)` が leaf MVCC mutation で取得する `pebble.Batch` に `key=metaAppliedIndex, value=BE64(index)` を追加。 Pebble の batch は atomic なので **torn-write window なし**。 ### (3) `restoreSnapshotState` 条件分岐 ```go if skip, err := fsmAlreadyAtIndex(fsm, tok.Index); err != nil { return err } else if skip { // FSM data on disk is at least as fresh as the snapshot. return nil } return openAndRestoreFSMSnapshot(...) ``` `AppliedIndexReader` interface 未実装の FSM、または meta key 不在 (アップグレード直後の最初の再起動) は **現状の full restore にフォールバック** → strictly additive。 ## 期待される効果 - コールド起動の steady-state コスト: `pebble.Open` + WAL replay + raft follower 化 + gRPC bind の **<5 秒** (FSM サイズに依存しない) - raft term 暴発 (timeout-induced restart loop による) の解消 - 将来 `HEALTH_TIMEOUT_SECONDS` を引き下げ可能 (Branch 5 で実施) ## 実装シーケンス (別 PR 4 本に分割) | Branch | 内容 | 動作変化 | |---|---|---| | **B2** | `StateMachine.Apply(index, data)` interface 変更 + 全実装更新 (index は受け取って捨てるだけ) | なし | | **B3** | `kvFSM` で index を leaf mutation に流し、`pebbleStore` で meta key を WriteBatch 同梱 + `LastAppliedIndex()` 公開 + 単体テスト | meta key が書き込まれ始める (まだ skip は無効) | | **B4** | `restoreSnapshotState` に `fsmAlreadyAtIndex` ガート + metrics + INFO log | **ユーザー可視な perf 改善** ここで発火 | | **B5** | `HEALTH_TIMEOUT_SECONDS` を tighter な値に引き下げ | docker restart loop 抑止 | 各 branch は対応する invariant を unit test でガード。 ## Crash-safety / Compatibility 詳細は `docs/design/2026_06_02_idempotent_snapshot_restore.md` の §3 §5 参照。要約: - meta key とデータ mutation は同一 `pebble.Batch` で commit されるため、クラッシュ後の Pebble WAL replay は両者を一緒に restore するか両者を一緒に discard する。中間状態はあり得ない。 - 8-byte length check 失敗時は読み取り側が conservative に "missing" として扱い、full restore にフォールバック。 - 旧 fsm.db に meta key が無い場合は最初の再起動は現状動作。データ移行不要。 - ロールバックは wal_store.go の skip 分岐を revert するだけ。meta key は dead data として残るが無害。 ## 観測対象 ``` fsm_cold_start_restore_total{outcome="executed|skipped|fallback"} fsm_cold_start_applied_index_gap (executed 時の snapshot.Index - LastAppliedIndex) ``` + skip 時の INFO log: `restoreSnapshotState skipped (FSM already at index N >= snapshot index X)` ## レビュー観点 - §3 の Pebble batch 原子性に基づく crash-safety 証明が説得力あるか - §4 の "skip 後の replay 冪等性" 議論で見落としは無いか (`kvFSM.Apply` が現状 idempotent と読めているか) - Branch 分割 (4 本) の粒度が妥当か / もっと束ねるべきか - HLC lease entry の index 監視に関する Open Question (§Open Questions 1 番目) の解像度設計が固まり次第 Branch 2 (interface change) から着手します。  ## Summary by CodeRabbit * **Documentation** * Revised design doc detailing idempotent snapshot restore: safe skip of heavy snapshot bodies when on-disk state is proven current, header CRC verification for the skip path, and a durable applied-index write before snapshot persistence. Clarifies crash-safety, observability, compatibility, and implementation plan.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/910?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

…oth snapshot persist sites (#915) ## Summary Implements **Branch 2** of the cold-start snapshot-restore skip optimisation designed in PR #910. After this lands the `metaAppliedIndex` Pebble meta key is durably written on every raft-Apply data mutation AND at every snapshot persist — but the skip gate itself (Branch 3) is NOT yet wired, so behaviour is observationally identical to `main` except for the new meta key in fsm.db. Branch 2 is meant to soak in production for at least one release before Branch 3 enables the skip; this PR is intentionally a no-op-from-the-outside change with comprehensive plumbing. ## Reading order (6 commits, designed to review one-at-a-time) | # | commit | scope | |---|---|---| | 1 | `2339a6f2` | raftengine: opt-in interfaces (`AppliedIndexReader` / `AppliedIndexWriter`) | | 2 | `525fc152` | pebbleStore: `metaAppliedIndex` const + `LastAppliedIndex` + `SetDurableAppliedIndex` (with `pebble.Sync` UNCONDITIONALLY) | | 3 | `aa9b8acc` | `MVCCStore` interface extension: `ApplyMutationsRaftAt` / `DeletePrefixAtRaftAt` overloads, threading appliedIndex through `applyMutationsWithOpts` + `deletePrefixAtWithOpts` | | 4 | `7cd72bda` | kvFSM seam wiring: `AppliedIndexReader()` / `SetDurableAppliedIndex()` accessors + all 7 data-Apply leaves switched to `*RaftAt` with `f.pendingApplyIdx` | | 5 | `f1e8748c` | engine hooks at BOTH snapshot persist sites: `persistCreatedSnapshot` + `e.persistLocalSnapshotPayload` call `SetDurableAppliedIndex` BEFORE `persist.SaveSnap` | | 6 | `2c42f7d6` | tests (10 new tests across store + engine) | ## Design constraints honoured All from `docs/design/2026_06_02_idempotent_snapshot_restore.md`: - **§2 "Why both leaves"**: meta key bundle in BOTH `applyMutationsWithOpts` AND `deletePrefixAtWithOpts` so DEL_PREFIX entries don't silently leave `LastAppliedIndex` behind. Tested by `TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex`. - **§3 `dbMu.RLock()`**: both `LastAppliedIndex` and `SetDurableAppliedIndex` acquire the read-lock, matching the lock-ordering discipline at `lsm_store.go:153 / :553 / :675`. - **§4 fallback policy**: `AppliedIndexReader()` returns nil when the store doesn't implement the seam; `LastAppliedIndex` returns `(0, false, nil)` for missing OR truncated meta key. Branch 3 will then fall back to full restore conservatively. - **§6 `ELASTICKV_FSM_SYNC_MODE=nosync` mode**: `SetDurableAppliedIndex` is **pinned to `pebble.Sync` unconditionally**. Rationale documented at length in the method's doc-comment — once `persist.SaveSnap` returns, WAL compaction discards every log entry ≤ `snap.Metadata.Index`, so there's no source to replay the meta key bump from. +1 fsync per snapshot persist (rare; default `SnapshotCount=10000`). Tested by `TestSetDurableAppliedIndex_UsesPebbleSync`. - **§6 "HLC lease entries — checkpoint at snapshot persist"**: BOTH `persistCreatedSnapshot` (config snapshots) AND `e.persistLocalSnapshotPayload` (steady-state `SnapshotCount`-triggered hot path) call the hook. Both crash-ordering tested by `TestPersistCreatedSnapshot_*`. - **§8 compatibility**: `StateMachine.Apply`'s public signature is unchanged. New interfaces are opt-in. Old call sites (`ApplyMutationsRaft` without `*At`) still work, just pass `appliedIndex=0` to opt out of the meta key bump. ## Test results ``` go vet ./... → 0 issues go test ./store/ -short → ok 29.4s go test ./kv/ -short → ok 10.4s go test ./internal/raftengine/... -short → ok 32.8s go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s ``` 10 new tests added (see commit `2c42f7d6` for the full inventory). ## What this does NOT do - **Does NOT enable the skip gate.** `restoreSnapshotState` still always restores. Branch 3 wires the `fsmAlreadyAtIndex` check + `applyHeaderStateOnSkip` + the two-phase `SnapshotHeaderApplier` seam. - **Does NOT change `HEALTH_TIMEOUT_SECONDS=300`.** Branch 4 lowers it once Branch 3 has soaked. - **Does NOT touch the snapshot-install hot path** (`Engine.applySnapshot`) per Non-Goals in the design. ## Soak plan Branch 2 should run in production for at least one release before Branch 3 opens. Operators can verify the meta key is being written via: ```bash # Inspect a pebble fsm.db (read-only) ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex # Expected: 8 little-endian bytes equal to the current applied index ``` ## Refs - PR #910 (design) — round 1..7 design history + retraction sections explaining the design constraints this PR honours - PR #909 — `HEALTH_TIMEOUT_SECONDS` band-aid that this series eventually obviates  ## Summary by CodeRabbit * **New Features** * Durable tracking of Raft-applied indexes to ensure consistent snapshot/save ordering. * **Bug Fixes** * Improved snapshot persistence reliability by pinning durable applied index before snapshot writes. * Stronger durability for writes bundled with Raft entry indices, reducing restore/recovery surprises. * **Tests** * Added comprehensive tests covering applied-index ordering, failure handling, and persistence behavior.

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

bootjp mentioned this pull request Jun 2, 2026

docs(design): idempotent FSM snapshot restore on cold start #910

Merged

bootjp merged commit 37af29e into main Jun 3, 2026
9 checks passed

bootjp deleted the fix/rolling-update-grpc-timeout-default branch June 3, 2026 06:29

bootjp mentioned this pull request Jun 3, 2026

feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s#909

fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s#909
bootjp merged 2 commits into
mainfrom
fix/rolling-update-grpc-timeout-default

bootjp commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

bootjp commented Jun 2, 2026

Uh oh!

claude Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

bootjp commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bootjp commented Jun 2, 2026

Summary

なぜ 60s では足りないか

変更内容

動作確認

設計フォローアップ (別 PR)

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 2, 2026

Uh oh!

claude Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

1. 300s adequacy — OK

2. Env-var override — OK

3. Root-cause follow-up — minor doc nit

4. Crash-loop concern (Gemini finding) — agree, but out of scope here

Uh oh!

bootjp commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review (Round 2)

§3 cross-reference — ✅ Done and well done

§4 wait_for_grpc early-exit — explicitly deferred, ✅ Accepted

Remaining invariants — still hold

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

claude Bot commented Jun 2, 2026 •

edited

Loading

claude Bot commented Jun 3, 2026 •

edited

Loading