perf(kv): serve ShardStore.GetAt reads via lease, not read-index#575
Conversation
Root cause for the post-#573 raft saturation (1500/s dispatch_errors, 324k heartbeat drops in 10 min, 282/401 goroutines parked in `Engine.submitRead`): every `ShardStore.GetAt` called `isLinearizableRaftLeader` -> `linearizableReadEngineCtx` -> `engine.LinearizableRead` -> `submitRead`, i.e. each in-script `redis.call` submitted its OWN read-index through the single dispatch worker. The `LeaseReadThrough` done once at `luaScriptContext` startup only verified leadership and set `startTS`; it did NOT short-circuit subsequent `GetAt` reads.

Fix: introduce `leaseReadEngineCtx`, a lease-aware sibling of `linearizableReadEngineCtx`. When the engine exposes `LeaseProvider` and the engine-driven `LastQuorumAck` is within `LeaseDuration` of a single `time.Now()` sample, return the current `AppliedIndex` without dispatching a new read-index. A lease miss still falls through to the full `LinearizableRead` round-trip, preserving correctness when the lease is cold (e.g. immediately after a leadership transfer).

Switch the two read-path callers in `ShardStore` to use the new helper:

- `isLinearizableRaftLeader` (on the `GetAt` / `ScanAt` hot path)
- the leader-liveness guard inside `LatestCommitTS`

Safety matches `engineLeaseAckValid`'s existing contract (state == Leader, ack non-zero, ack <= now, now - ack < LeaseDuration). No correctness change: the lease already guarantees no concurrent leader within the window, so `AppliedIndex` is safe to serve.

Expected prod impact: `submitRead` call rate drops by ~N× where N is the number of storage-layer reads per script (5+ in BullMQ workloads post-#573). Heartbeat drops and dispatch_errors should follow. Tests cover the fast path, stale-ack fallback, non-leader fallback, engines without `LeaseProvider`, and the nil-engine guard.
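The decision the helper makes can be sketched as follows. This is an illustrative stand-in, not the repo's actual API: `readIndexForGet` and its parameters are invented names, with `leaseValid` standing in for the `engineLeaseAckValid` check.

```go
package main

import "fmt"

// readIndexForGet sketches the control flow of the lease-aware read path.
// leaseValid stands in for the engineLeaseAckValid check, appliedIndex for
// the engine's locally applied index, and linearizableRead for the full
// read-index round-trip through the single dispatch worker.
func readIndexForGet(leaseValid bool, appliedIndex uint64,
	linearizableRead func() (uint64, error)) (uint64, error) {
	if leaseValid {
		// Fast path: the lease guarantees no concurrent leader within the
		// window, so the locally applied index is safe to serve.
		return appliedIndex, nil
	}
	// Cold lease (e.g. right after a leadership transfer): fall back to the
	// dispatcher-mediated read-index round-trip.
	return linearizableRead()
}

func main() {
	idx, _ := readIndexForGet(true, 42, nil)
	fmt.Println(idx) // fast path serves the applied index without dispatching
}
```

The point of the shape: only the fallback branch touches the dispatcher, so per-script read amplification stops multiplying read-index traffic while the lease is warm.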
Code Review
This pull request introduces leaseReadEngineCtx, a lease-aware alternative to linearizableReadEngineCtx, to optimize read operations in ShardStore. By utilizing the engine's LeaseProvider and LastQuorumAck, the system can now skip the LinearizableRead round-trip when a valid lease is held, significantly reducing Raft dispatcher saturation during high read loads. The changes include the implementation of the helper function, its integration into isLinearizableRaftLeader and LatestCommitTS, and comprehensive unit tests covering various edge cases. I have no feedback to provide.
…tbeats (#577)

## Summary

- Adds an **opt-in** 4-lane etcd raft dispatcher (heartbeat / replication / snapshot / other) gated on `ELASTICKV_RAFT_DISPATCHER_LANES=1`. Default behavior is unchanged.
- Isolates `MsgSnap` on its own goroutine so a multi-MiB snapshot transfer can no longer stall subsequent `MsgApp`s queued to the same peer; addresses the residual failure mode from the 324k-heartbeat-drop incident that PR #575 partially mitigated.
- Keeps every existing correctness invariant: per-peer within-type ordering, `dispatchDropCount` accounting, and `postDispatchReport` firing after every dispatch attempt (so `ReportUnreachable` / `ReportSnapshot` still land in etcd/raft).

## What changed

`peerQueues` already had two lanes (`heartbeat` + `normal`) from PR #522. With the flag on, it now carries four:

| Lane | Message types | Buffer |
|---|---|---|
| `heartbeat` | `MsgHeartbeat` / `MsgHeartbeatResp`, Vote / PreVote (+Resp), ReadIndex (+Resp), `MsgTimeoutNow` | 512 |
| `replication` | `MsgApp` / `MsgAppResp` | `MaxInflightMsg` (1024) |
| `snapshot` | `MsgSnap` | 4 |
| `other` | anything unclassified | 16 |

Each lane gets its own goroutine and is drained independently. Within-type ordering to a given peer is preserved because all of a peer's messages of one type share exactly one channel and one worker. See `selectDispatchLane`.

When the flag is off we construct the old `heartbeat + normal` pair exactly as today: no behavior change, no goroutine-count change, no allocation change.

## Why behind a flag

The raft dispatch path is high blast radius: a bug here can drop heartbeats cluster-wide and trigger mass elections. Rather than swap the hot path in one shot, this lands the structural change plumbed but dormant, so the default production footprint is unchanged.

## Rollout plan

1. Merge default-off (this PR).
2. Flip `ELASTICKV_RAFT_DISPATCHER_LANES=1` in staging; watch `elastickv_raft_dispatch_drop_total`, `elastickv_raft_dispatch_errors_total`, and heartbeat-loss metrics for 24h under synthetic write load plus a forced snapshot transfer.
3. Enable on one production replica, bake 24h, then fleetwide.
4. Once soaked, delete the flag and the 2-lane branch in `startPeerDispatcher` / `selectDispatchLane` in a follow-up PR.

## Test plan

- [x] `go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag off)
- [x] `ELASTICKV_RAFT_DISPATCHER_LANES=1 go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag on)
- [x] New targeted tests:
  - `TestSelectDispatchLane_LegacyTwoLane` — pins today's routing table
  - `TestSelectDispatchLane_FourLane` — pins the new routing table
  - `TestFourLaneDispatcher_SnapshotDoesNotBlockReplication` — exercises the HOL-blocking invariant the flag is meant to fix
  - `TestFourLaneDispatcher_RemovePeerClosesAllLanes` — ensures no goroutine leak on peer removal in 4-lane mode
  - `TestDispatcherLanesEnabledFromEnv` — env parsing
- [ ] Staging soak with flag enabled under write-heavy load (blocker for removing the flag in a follow-up)
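The routing table above can be pinned in a small selector. This standalone sketch keys on message-type names as strings rather than the real `raftpb.MessageType` values, and `selectLane` is an illustrative stand-in for the repo's `selectDispatchLane`:

```go
package main

import "fmt"

// selectLane mirrors the 4-lane routing table from the PR description.
func selectLane(msgType string) string {
	switch msgType {
	case "MsgHeartbeat", "MsgHeartbeatResp",
		"MsgVote", "MsgVoteResp", "MsgPreVote", "MsgPreVoteResp",
		"MsgReadIndex", "MsgReadIndexResp", "MsgTimeoutNow":
		return "heartbeat" // latency-critical control traffic, buffer 512
	case "MsgApp", "MsgAppResp":
		return "replication" // log replication, buffer MaxInflightMsg
	case "MsgSnap":
		return "snapshot" // isolated so multi-MiB transfers can't stall MsgApp
	default:
		return "other" // anything unclassified, buffer 16
	}
}

func main() {
	fmt.Println(selectLane("MsgSnap"), selectLane("MsgApp"), selectLane("MsgHeartbeat"))
}
```

Because each lane feeds exactly one channel and one worker per peer, all of a peer's messages of a given type stay ordered while different lanes drain independently.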
## Summary

Adds an `EXTRA_ENV` pass-through to `scripts/rolling-update.sh` so operators can set container environment variables from `deploy.env` without editing the script each time.

## Motivation

Immediate trigger: enabling `ELASTICKV_RAFT_DISPATCHER_LANES=1` (feature flag added in PR #577). We want to flip it on via the deploy envelope, not by rebuilding the image or hand-editing the script. Post-#575 raft metrics are already healthy, so the 4-lane dispatcher is being enabled as a defensive measure: it keeps heartbeats from being starved by MsgApp / MsgSnap bursts under extreme write load, rather than fixing anything actively broken.

## Change

- `run_container`: if `EXTRA_ENV` is set, split on whitespace and forward each pair as a single `docker run -e KEY=VALUE` flag.
- A comment documents the whitespace-split semantics (pairs must not contain whitespace; values may contain characters bash would otherwise interpret).

## Usage

```bash
# deploy.env
EXTRA_ENV="ELASTICKV_RAFT_DISPATCHER_LANES=1 ELASTICKV_PEBBLE_CACHE_MB=512"
```

## Test plan

- [x] `bash -n scripts/rolling-update.sh` passes
- [x] `shellcheck` / `make lint` — 0 issues
- [ ] Rolling deploy with `EXTRA_ENV` set; verify via `docker inspect elastickv --format '{{.Config.Env}}'` on each node
…592)

## Root cause

Every committed raft entry triggered a `pebble.Sync` inside `store.ApplyMutations` / `store.DeletePrefixAt` (see `store/lsm_store.go:1056, 1108`), on top of the raft WAL fsync that `etcd/raft` already performs per `Ready` batch in `persistReadyToWAL` (`internal/raftengine/etcd/wal_store.go:376`). The raft Ready loop in `drainReady` (`internal/raftengine/etcd/engine.go:1389`) already batches multiple entries per `Ready`, so the raft WAL fsync is fine. The hot fsync is the FSM-side `b.Commit(pebble.Sync)` that the apply loop (`applyCommitted`) hits once per entry. A prior cleanup (`docs/review_todo.md` section 3.4) intentionally kept `ApplyMutations` on `pebble.Sync`; this CL makes that choice tunable.

Microbenchmark (Apple M1 Max, APFS tempdir, `BenchmarkApplyMutations_SyncMode`):

| mode | ns/op | allocs/op |
|--------|---------:|----------:|
| sync | 16292899 | 7 |
| nosync | 16293 | 8 |

~1000x on this platform. Real hardware fsync latency varies, but the sync/nosync ratio is consistently large on any WAL that enforces platform durability.

## Durability argument

Pebble's FSM-commit fsync is redundant with the raft WAL under this codebase's crash-recovery model:

1. The raft WAL (`etcd/raft`) fsyncs every committed entry via `persist.Save` before `Advance`.
2. On restart, `newMemoryStorage` (`internal/raftengine/etcd/persistence.go:352`) reloads the snapshot plus all WAL entries. `newRawNode` does not set `Config.Applied`, so `etcdraft` defaults it to `snapshot.Metadata.Index`.
3. The engine sets `e.applied = maxAppliedIndex(LocalSnap)` and every committed entry past the snapshot is re-delivered through `CommittedEntries` on the first `Ready`.
4. `kv/fsm.applyCommitWithIdempotencyFallback` treats an already-committed key (`LatestCommitTS >= commitTS`) as an idempotent retry, so replaying an entry whose effect survived the crash is safe.
5. FSM snapshots are fsynced (`writeFSMSnapshotFile` then `f.Sync()` in `fsm_snapshot_file.go`).
Therefore a crash that loses the unfsynced tail of Pebble's own WAL is recoverable: raft replays from the last fsynced FSM snapshot onwards, and the idempotent apply path re-materialises the lost state. Pebble on the FSM commit path effectively becomes a volatile cache of applied state whose durability boundary is the raft WAL.

Other `pebble.Sync` call sites (snapshot-batch commit, metadata-restore writes, compaction `commitSnapshotBatch`) are untouched: those are orthogonal durability boundaries (e.g. the restore-directory swap) and are not per-proposal cost.

## Env var + default

* `ELASTICKV_FSM_SYNC_MODE=sync` (default) - current behaviour.
* `ELASTICKV_FSM_SYNC_MODE=nosync` - `b.Commit(pebble.NoSync)` on the FSM hot path. The raft WAL remains the durability boundary.

Unknown values fall back to `sync` (fail-safe toward durability). Parsing is case-insensitive and whitespace-tolerant. A Prometheus gauge `elastickv_fsm_apply_sync_mode{mode="sync"|"nosync"}` is set at `NewRegistry` time via `store.FSMApplySyncModeLabel()`, so dashboards can alert if a rolling deploy accidentally flips the durability posture.

## Test plan

- [x] `go test ./store/... ./monitoring/... ./kv/... ./internal/raftengine/... -count=1`
- [x] Env var parsing: sync / nosync / mixed case / whitespace / unknown value falling back to the sync default
- [x] Functional equivalence of sync vs nosync on a Pebble store
- [x] Clean-shutdown reopen visibility (NoSync + Close + reopen preserves writes)
- [x] Prometheus gauge mutual exclusivity across successive `SetFSMApplySyncMode` calls
- [ ] (Follow-up) Jepsen-style OS-level crash test for unfsynced-tail recovery - tracked in `JEPSEN_TODO.md`

## Benchmark

```
go test ./store -run='^$' -bench='BenchmarkApplyMutations_SyncMode' -benchtime=2s -benchmem
BenchmarkApplyMutations_SyncMode/sync-10         141   16292899 ns/op   368 B/op   7 allocs/op
BenchmarkApplyMutations_SyncMode/nosync-10    129262      16293 ns/op   284 B/op   8 allocs/op
```

## Related

- Previous lever documented in `docs/review_todo.md` section 3.4 (`ApplyMutations` retained `pebble.Sync`).
- Pebble block cache default 256 MiB (#588)
- WAL retention / purge (#589)
- Lease read via AppliedIndex (#575)
- Raft dispatcher lanes (#577)
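The parsing rules described above (case-insensitive, whitespace-tolerant, unknown values failing safe toward durability) can be sketched like this; `parseFSMSyncMode` is an illustrative name, not necessarily the repo's:

```go
package main

import (
	"fmt"
	"strings"
)

// parseFSMSyncMode normalizes an ELASTICKV_FSM_SYNC_MODE value: trim
// whitespace, lowercase, and treat anything other than "nosync" as "sync"
// so an unknown or empty value fails safe toward durability.
func parseFSMSyncMode(raw string) string {
	if strings.ToLower(strings.TrimSpace(raw)) == "nosync" {
		return "nosync"
	}
	return "sync"
}

func main() {
	fmt.Println(parseFSMSyncMode(" NoSync "), parseFSMSyncMode(""), parseFSMSyncMode("fast"))
	// prints: nosync sync sync
}
```

Defaulting the `switch`-less branch to `"sync"` is the fail-safe choice: a typo in a deploy envelope degrades performance, never durability.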
Summary
Serve `ShardStore.GetAt` / `ScanAt` / `LatestCommitTS` from the leader-local lease when the engine-driven `LastQuorumAck` is fresh, instead of dispatching a fresh read-index round-trip per call.

Root cause (from PR #573's post-deploy telemetry)
- `elastickv_raft_dispatch_errors_total` → 1500/s
- `MsgHeartbeat drop_count` → 324,864 in 10 min
- 282/401 goroutines parked in `Engine.submitRead`
- `elastickv_lease_read_total{outcome="hit"}` = 99.7% — the Coordinator-level LeaseRead at script start IS hitting

The asymmetry:
`luaScriptContext` runs `LeaseReadThrough` once at start and uses `startTS` afterwards, but each in-script `redis.call` → `ShardStore.GetAt(ctx, key, startTS)` called `isLinearizableRaftLeader` → `linearizableReadEngineCtx` → `engine.LinearizableRead` → `submitRead`. Every storage-layer read submitted its own read-index through the single dispatch worker. At 7 scripts/s × 5-7 redis.call/script × 1-7 GetAts/call, the dispatcher was swamped and heartbeats couldn't get through.

Change
New `kv.leaseReadEngineCtx(ctx, engine)` — lease-aware sibling of `linearizableReadEngineCtx`:

- If the engine exposes `LeaseProvider` AND `engineLeaseAckValid(state, LastQuorumAck, time.Now(), LeaseDuration)` holds, return `AppliedIndex()` directly.
- On a lease miss, fall through to the full `LinearizableRead` round-trip.

Two call sites switched:
- `isLinearizableRaftLeader` (the fence inside `ShardStore.GetAt` / `ScanAt`)
- the leader-liveness guard inside `LatestCommitTS`
Coordinator.LeaseRead—engineLeaseAckValidalready enforcesstate == Leader && !ack.IsZero() && !ack.After(now) && now.Sub(ack) < LeaseDuration. The lease guarantees no concurrent leader within the window, so localAppliedIndexis safe.Expected prod impact
- `Engine.submitRead` call rate: ~N× lower, where N = number of storage reads per script (5+ in BullMQ workloads post-#573, "perf(redis-lua): short-circuit missing-key ZRANGEBYSCORE in fast path").
- `elastickv_raft_dispatch_errors_total` should stop climbing.
- `MsgHeartbeat drop_count` should plateau.

Test plan
- `go test -race ./kv/...` (3.6s) green
- `go test -race -short ./adapter/...` (56s) green
- `TestLeaseReadEngineCtx_*` cover: fast path, stale-ack fallback, non-leader fallback, missing LeaseProvider, nil engine
- Post-deploy: watch `rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore",outcome="hit"}[1m])` for throughput, `elastickv_raft_dispatch_errors_total`, and the leader's goroutine count via pprof