perf(kv): serve ShardStore.GetAt reads via lease, not read-index#575
Conversation
Root cause for the post-#573 raft saturation (1500/s dispatch_errors, 324k heartbeat drops in 10 min, 282/401 goroutines parked in `Engine.submitRead`): every `ShardStore.GetAt` called `isLinearizableRaftLeader` -> `linearizableReadEngineCtx` -> `engine.LinearizableRead` -> `submitRead`, i.e. each in-script `redis.call` submitted its OWN read-index through the single dispatch worker. The `LeaseReadThrough` done once at `luaScriptContext` startup only verified leadership and set `startTS`; it did NOT short-circuit subsequent `GetAt` reads.

Fix: introduce `leaseReadEngineCtx`, a lease-aware sibling of `linearizableReadEngineCtx`. When the engine exposes `LeaseProvider` and the engine-driven `LastQuorumAck` is within `LeaseDuration` of a single `time.Now()` sample, return the current `AppliedIndex` without dispatching a new read-index. A lease miss still falls through to the full `LinearizableRead` round-trip, preserving correctness when the lease is cold (e.g. immediately after a leadership transfer).

Switch the two read-path callers in `ShardStore` to use the new helper:

- `isLinearizableRaftLeader` (on the `GetAt` / `ScanAt` hot path)
- the leader-liveness guard inside `LatestCommitTS`

Safety matches `engineLeaseAckValid`'s existing contract (state == Leader, ack non-zero, ack <= now, now - ack < LeaseDuration). No correctness change: the lease already guarantees no concurrent leader within the window, so `AppliedIndex` is safe to serve.

Expected prod impact: `submitRead` call rate drops by ~N× where N is the number of storage-layer reads per script (5+ in BullMQ workloads post-#573). Heartbeat drops and dispatch_errors should follow. Tests cover the fast path, stale-ack fallback, non-leader fallback, engines without `LeaseProvider`, and the nil-engine guard.
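The decision the helper makes can be sketched as follows. This is an illustrative stand-in, not the repo's actual API: `readIndexForGet` and its parameters are invented names, with `leaseValid` standing in for the `engineLeaseAckValid` check.

```go
package main

import "fmt"

// readIndexForGet sketches the control flow of the lease-aware read path.
// leaseValid stands in for the engineLeaseAckValid check, appliedIndex for
// the engine's locally applied index, and linearizableRead for the full
// read-index round-trip through the single dispatch worker.
func readIndexForGet(leaseValid bool, appliedIndex uint64,
	linearizableRead func() (uint64, error)) (uint64, error) {
	if leaseValid {
		// Fast path: the lease guarantees no concurrent leader within the
		// window, so the locally applied index is safe to serve.
		return appliedIndex, nil
	}
	// Cold lease (e.g. right after a leadership transfer): fall back to the
	// dispatcher-mediated read-index round-trip.
	return linearizableRead()
}

func main() {
	idx, _ := readIndexForGet(true, 42, nil)
	fmt.Println(idx) // fast path serves the applied index without dispatching
}
```

The point of the shape: only the fallback branch touches the dispatcher, so per-script read amplification stops multiplying read-index traffic while the lease is warm.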
Code Review
This pull request introduces leaseReadEngineCtx, a lease-aware alternative to linearizableReadEngineCtx, to optimize read operations in ShardStore. By utilizing the engine's LeaseProvider and LastQuorumAck, the system can now skip the LinearizableRead round-trip when a valid lease is held, significantly reducing Raft dispatcher saturation during high read loads. The changes include the implementation of the helper function, its integration into isLinearizableRaftLeader and LatestCommitTS, and comprehensive unit tests covering various edge cases. I have no feedback to provide.
…tbeats (#577)

## Summary

- Adds an **opt-in** 4-lane etcd raft dispatcher (heartbeat / replication / snapshot / other) gated on `ELASTICKV_RAFT_DISPATCHER_LANES=1`. Default behavior is unchanged.
- Isolates `MsgSnap` on its own goroutine so a multi-MiB snapshot transfer can no longer stall subsequent `MsgApp`s queued to the same peer; addresses the residual failure mode from the 324k-heartbeat-drop incident that PR #575 partially mitigated.
- Keeps every existing correctness invariant: per-peer within-type ordering, `dispatchDropCount` accounting, and `postDispatchReport` firing after every dispatch attempt (so `ReportUnreachable` / `ReportSnapshot` still land in etcd/raft).

## What changed

`peerQueues` already had two lanes (`heartbeat` + `normal`) from PR #522. With the flag on, it now carries four:

| Lane | Message types | Buffer |
|---|---|---|
| `heartbeat` | `MsgHeartbeat` / `MsgHeartbeatResp`, Vote / PreVote (+Resp), ReadIndex (+Resp), `MsgTimeoutNow` | 512 |
| `replication` | `MsgApp` / `MsgAppResp` | `MaxInflightMsg` (1024) |
| `snapshot` | `MsgSnap` | 4 |
| `other` | anything unclassified | 16 |

Each lane gets its own goroutine and is drained independently. Within-type ordering to a given peer is preserved because all of a peer's messages of one type share exactly one channel and one worker. See `selectDispatchLane`.

When the flag is off we construct the old `heartbeat + normal` pair exactly as today: no behavior change, no goroutine-count change, no allocation change.

## Why behind a flag

The raft dispatch path is high blast radius: a bug here can drop heartbeats cluster-wide and trigger mass elections. Rather than swap the hot path in one shot, this lands the structural change plumbed but dormant, so the default production footprint is unchanged.

## Rollout plan

1. Merge default-off (this PR).
2. Flip `ELASTICKV_RAFT_DISPATCHER_LANES=1` in staging; watch `elastickv_raft_dispatch_drop_total`, `elastickv_raft_dispatch_errors_total`, and heartbeat-loss metrics for 24h under synthetic write load plus a forced snapshot transfer.
3. Enable on one production replica, bake 24h, then fleetwide.
4. Once soaked, delete the flag and the 2-lane branch in `startPeerDispatcher` / `selectDispatchLane` in a follow-up PR.

## Test plan

- [x] `go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag off)
- [x] `ELASTICKV_RAFT_DISPATCHER_LANES=1 go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag on)
- [x] New targeted tests:
  - `TestSelectDispatchLane_LegacyTwoLane` — pins today's routing table
  - `TestSelectDispatchLane_FourLane` — pins the new routing table
  - `TestFourLaneDispatcher_SnapshotDoesNotBlockReplication` — exercises the HOL-blocking invariant the flag is meant to fix
  - `TestFourLaneDispatcher_RemovePeerClosesAllLanes` — ensures no goroutine leak on peer removal in 4-lane mode
  - `TestDispatcherLanesEnabledFromEnv` — env parsing
- [ ] Staging soak with flag enabled under write-heavy load (blocker for removing the flag in a follow-up)
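The routing table above can be pinned in a small selector. This standalone sketch keys on message-type names as strings rather than the real `raftpb.MessageType` values, and `selectLane` is an illustrative stand-in for the repo's `selectDispatchLane`:

```go
package main

import "fmt"

// selectLane mirrors the 4-lane routing table from the PR description.
func selectLane(msgType string) string {
	switch msgType {
	case "MsgHeartbeat", "MsgHeartbeatResp",
		"MsgVote", "MsgVoteResp", "MsgPreVote", "MsgPreVoteResp",
		"MsgReadIndex", "MsgReadIndexResp", "MsgTimeoutNow":
		return "heartbeat" // latency-critical control traffic, buffer 512
	case "MsgApp", "MsgAppResp":
		return "replication" // log replication, buffer MaxInflightMsg
	case "MsgSnap":
		return "snapshot" // isolated so multi-MiB transfers can't stall MsgApp
	default:
		return "other" // anything unclassified, buffer 16
	}
}

func main() {
	fmt.Println(selectLane("MsgSnap"), selectLane("MsgApp"), selectLane("MsgHeartbeat"))
}
```

Because each lane feeds exactly one channel and one worker per peer, all of a peer's messages of a given type stay ordered while different lanes drain independently.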
## Summary

Adds an `EXTRA_ENV` pass-through to `scripts/rolling-update.sh` so operators can set container environment variables from `deploy.env` without editing the script each time.

## Motivation

Immediate trigger: enabling `ELASTICKV_RAFT_DISPATCHER_LANES=1` (feature flag added in PR #577). We want to flip it on via the deploy envelope, not by rebuilding the image or hand-editing the script. Post-#575 raft metrics are already healthy, so the 4-lane dispatcher is being enabled as a defensive measure: it keeps heartbeats from being starved by MsgApp / MsgSnap bursts under extreme write load, rather than fixing anything actively broken.

## Change

- `run_container`: if `EXTRA_ENV` is set, split on whitespace and forward each pair as a single `docker run -e KEY=VALUE` flag.
- A comment documents the whitespace-split semantics (pairs must not contain whitespace; values may contain characters bash would otherwise interpret).

## Usage

```bash
# deploy.env
EXTRA_ENV="ELASTICKV_RAFT_DISPATCHER_LANES=1 ELASTICKV_PEBBLE_CACHE_MB=512"
```

## Test plan

- [x] `bash -n scripts/rolling-update.sh` passes
- [x] `shellcheck` / `make lint` — 0 issues
- [ ] Rolling deploy with `EXTRA_ENV` set; verify via `docker inspect elastickv --format '{{.Config.Env}}'` on each node
…592)

## Root cause

Every committed raft entry triggered a `pebble.Sync` inside `store.ApplyMutations` / `store.DeletePrefixAt` (see `store/lsm_store.go:1056, 1108`), on top of the raft WAL fsync that `etcd/raft` already performs per `Ready` batch in `persistReadyToWAL` (`internal/raftengine/etcd/wal_store.go:376`). The raft Ready loop in `drainReady` (`internal/raftengine/etcd/engine.go:1389`) already batches multiple entries per `Ready`, so the raft WAL fsync is fine. The hot fsync is the FSM-side `b.Commit(pebble.Sync)` that the apply loop (`applyCommitted`) hits once per entry. A prior cleanup (`docs/review_todo.md` section 3.4) intentionally kept `ApplyMutations` on `pebble.Sync`; this CL makes that choice tunable.

Microbenchmark (Apple M1 Max, APFS tempdir, `BenchmarkApplyMutations_SyncMode`):

| mode | ns/op | allocs/op |
|--------|---------:|----------:|
| sync | 16292899 | 7 |
| nosync | 16293 | 8 |

~1000x on this platform. Real hardware fsync latency varies, but the sync/nosync ratio is consistently large on any WAL that enforces platform durability.

## Durability argument

Pebble's FSM-commit fsync is redundant with the raft WAL under this codebase's crash-recovery model:

1. The raft WAL (`etcd/raft`) fsyncs every committed entry via `persist.Save` before `Advance`.
2. On restart, `newMemoryStorage` (`internal/raftengine/etcd/persistence.go:352`) reloads the snapshot plus all WAL entries. `newRawNode` does not set `Config.Applied`, so `etcdraft` defaults it to `snapshot.Metadata.Index`.
3. The engine sets `e.applied = maxAppliedIndex(LocalSnap)` and every committed entry past the snapshot is re-delivered through `CommittedEntries` on the first `Ready`.
4. `kv/fsm.applyCommitWithIdempotencyFallback` treats an already-committed key (`LatestCommitTS >= commitTS`) as an idempotent retry, so replaying an entry whose effect survived the crash is safe.
5. FSM snapshots are fsynced (`writeFSMSnapshotFile` then `f.Sync()` in `fsm_snapshot_file.go`).
Therefore a crash that loses the unfsynced tail of Pebble's own WAL is recoverable: raft replays from the last fsynced FSM snapshot onwards, and the idempotent apply path re-materialises the lost state. Pebble on the FSM commit path effectively becomes a volatile cache of applied state whose durability boundary is the raft WAL.

Other `pebble.Sync` call sites (snapshot-batch commit, metadata-restore writes, compaction `commitSnapshotBatch`) are untouched: those are orthogonal durability boundaries (e.g. the restore-directory swap) and are not per-proposal cost.

## Env var + default

* `ELASTICKV_FSM_SYNC_MODE=sync` (default) - current behaviour.
* `ELASTICKV_FSM_SYNC_MODE=nosync` - `b.Commit(pebble.NoSync)` on the FSM hot path. The raft WAL remains the durability boundary.

Unknown values fall back to `sync` (fail-safe toward durability). Parsing is case-insensitive and whitespace-tolerant. A Prometheus gauge `elastickv_fsm_apply_sync_mode{mode="sync"|"nosync"}` is set at `NewRegistry` time via `store.FSMApplySyncModeLabel()`, so dashboards can alert if a rolling deploy accidentally flips the durability posture.

## Test plan

- [x] `go test ./store/... ./monitoring/... ./kv/... ./internal/raftengine/... -count=1`
- [x] Env var parsing: sync / nosync / mixed case / whitespace / unknown value falling back to the sync default
- [x] Functional equivalence of sync vs nosync on a Pebble store
- [x] Clean-shutdown reopen visibility (NoSync + Close + reopen preserves writes)
- [x] Prometheus gauge mutual exclusivity across successive `SetFSMApplySyncMode` calls
- [ ] (Follow-up) Jepsen-style OS-level crash test for unfsynced-tail recovery - tracked in `JEPSEN_TODO.md`

## Benchmark

```
go test ./store -run='^$' -bench='BenchmarkApplyMutations_SyncMode' -benchtime=2s -benchmem
BenchmarkApplyMutations_SyncMode/sync-10         141   16292899 ns/op   368 B/op   7 allocs/op
BenchmarkApplyMutations_SyncMode/nosync-10    129262      16293 ns/op   284 B/op   8 allocs/op
```

## Related

- Previous lever documented in `docs/review_todo.md` section 3.4 (`ApplyMutations` retained `pebble.Sync`).
- Pebble block cache default 256 MiB (#588)
- WAL retention / purge (#589)
- Lease read via AppliedIndex (#575)
- Raft dispatcher lanes (#577)
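The parsing rules described above (case-insensitive, whitespace-tolerant, unknown values failing safe toward durability) can be sketched like this; `parseFSMSyncMode` is an illustrative name, not necessarily the repo's:

```go
package main

import (
	"fmt"
	"strings"
)

// parseFSMSyncMode normalizes an ELASTICKV_FSM_SYNC_MODE value: trim
// whitespace, lowercase, and treat anything other than "nosync" as "sync"
// so an unknown or empty value fails safe toward durability.
func parseFSMSyncMode(raw string) string {
	if strings.ToLower(strings.TrimSpace(raw)) == "nosync" {
		return "nosync"
	}
	return "sync"
}

func main() {
	fmt.Println(parseFSMSyncMode(" NoSync "), parseFSMSyncMode(""), parseFSMSyncMode("fast"))
	// prints: nosync sync sync
}
```

Defaulting the `switch`-less branch to `"sync"` is the fail-safe choice: a typo in a deploy envelope degrades performance, never durability.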
Summary
Serve `ShardStore.GetAt` / `ScanAt` / `LatestCommitTS` from the leader-local lease when the engine-driven `LastQuorumAck` is fresh, instead of dispatching a fresh read-index round-trip per call.

Root cause (from PR #573's post-deploy telemetry)
- `elastickv_raft_dispatch_errors_total` → 1500/s
- `MsgHeartbeat drop_count` → 324,864 in 10 min
- 282/401 goroutines parked in `Engine.submitRead`
- `elastickv_lease_read_total{outcome="hit"}` = 99.7% — the Coordinator-level LeaseRead at script start IS hitting

The asymmetry:
`luaScriptContext` runs `LeaseReadThrough` once at start and uses `startTS` afterwards, but each in-script `redis.call` → `ShardStore.GetAt(ctx, key, startTS)` called `isLinearizableRaftLeader` → `linearizableReadEngineCtx` → `engine.LinearizableRead` → `submitRead`. Every storage-layer read submitted its own read-index through the single dispatch worker. At 7 scripts/s × 5-7 redis.call/script × 1-7 GetAts/call, the dispatcher was swamped and heartbeats couldn't get through.

Change
New `kv.leaseReadEngineCtx(ctx, engine)` — lease-aware sibling of `linearizableReadEngineCtx`:

- If the engine exposes `LeaseProvider` AND `engineLeaseAckValid(state, LastQuorumAck, time.Now(), LeaseDuration)` holds, return `AppliedIndex()` directly.
- On a lease miss, fall through to the full `LinearizableRead` round-trip.

Two call sites switched:
- `isLinearizableRaftLeader` (the fence inside `ShardStore.GetAt` / `ScanAt`)
- the leader-liveness guard inside `LatestCommitTS`
Coordinator.LeaseRead—engineLeaseAckValidalready enforcesstate == Leader && !ack.IsZero() && !ack.After(now) && now.Sub(ack) < LeaseDuration. The lease guarantees no concurrent leader within the window, so localAppliedIndexis safe.Expected prod impact
- `Engine.submitRead` call rate: ~N× lower, where N = number of storage reads per script (5+ in BullMQ workloads post-#573, "perf(redis-lua): short-circuit missing-key ZRANGEBYSCORE in fast path").
- `elastickv_raft_dispatch_errors_total` should stop climbing.
- `MsgHeartbeat drop_count` should plateau.

Test plan
- `go test -race ./kv/...` (3.6s) green
- `go test -race -short ./adapter/...` (56s) green
- `TestLeaseReadEngineCtx_*` cover: fast path, stale-ack fallback, non-leader fallback, missing LeaseProvider, nil engine
- Post-deploy: watch `rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore",outcome="hit"}[1m])` for throughput, `elastickv_raft_dispatch_errors_total`, and the leader's goroutine count via pprof