
perf(kv): serve ShardStore.GetAt reads via lease, not read-index#575

Merged
bootjp merged 1 commit into main from perf/shard-store-lease-read on Apr 21, 2026

Conversation


@bootjp bootjp commented Apr 21, 2026

Summary

Serve ShardStore.GetAt / ScanAt / LatestCommitTS from the leader-local lease when the engine-driven LastQuorumAck is fresh, instead of dispatching a fresh read-index round-trip per call.

Root cause (from PR #573's post-deploy telemetry)

  • elastickv_raft_dispatch_errors_total → 1500/s
  • MsgHeartbeat drop_count → 324,864 in 10 min
  • Leader pprof: 282 of 401 goroutines parked in Engine.submitRead
  • elastickv_lease_read_total{outcome="hit"}=99.7% — the Coordinator-level LeaseRead at script start IS hitting

The asymmetry: luaScriptContext runs LeaseReadThrough once at start and uses startTS afterwards, but each in-script redis.call reaches ShardStore.GetAt(ctx, key, startTS), which goes isLinearizableRaftLeader -> linearizableReadEngineCtx -> engine.LinearizableRead -> submitRead. Every storage-layer read submitted its own read-index through the single dispatch worker. At 7 scripts/s × 5-7 redis.call/script × 1-7 GetAts/call, the dispatcher was swamped and heartbeats couldn't get through.

Change

New kv.leaseReadEngineCtx(ctx, engine) — lease-aware sibling of linearizableReadEngineCtx:

  • If engine exposes LeaseProvider AND engineLeaseAckValid(state, LastQuorumAck, time.Now(), LeaseDuration) holds, return AppliedIndex() directly.
  • Otherwise fall through to the existing LinearizableRead round-trip.

Two call sites switched:

  • isLinearizableRaftLeader (the fence inside ShardStore.GetAt / ScanAt)
  • the leader-liveness guard inside LatestCommitTS

Safety is identical to Coordinator.LeaseRead: engineLeaseAckValid already enforces state == Leader && !ack.IsZero() && !ack.After(now) && now.Sub(ack) < LeaseDuration. The lease guarantees no concurrent leader within the window, so local AppliedIndex is safe.

Expected prod impact

  • Engine.submitRead call rate: ~N× lower, where N = number of storage reads per script (5+ in BullMQ workloads after #573, perf(redis-lua): short-circuit missing-key ZRANGEBYSCORE in fast path).
  • elastickv_raft_dispatch_errors_total should stop climbing.
  • Heartbeat drop_count should plateau.
  • Raft Commit p99 should drop back toward the pre-incident 5s baseline once reads stop hogging the dispatcher.

Test plan

  • go test -race ./kv/... (3.6s) green
  • go test -race -short ./adapter/... (56s) green
  • New TestLeaseReadEngineCtx_* cover: fast path, stale-ack fallback, non-leader fallback, missing LeaseProvider, nil-engine
  • Deploy; watch rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore",outcome="hit"}[1m]) for throughput, elastickv_raft_dispatch_errors_total, and the leader's goroutine count via pprof

Root cause for the post-#573 raft saturation (1500/s dispatch_errors,
324k heartbeat drops in 10 min, 282/401 goroutines parked in
Engine.submitRead): every ShardStore.GetAt called
isLinearizableRaftLeader -> linearizableReadEngineCtx ->
engine.LinearizableRead -> submitRead, i.e. each in-script
redis.call submitted its OWN read-index through the single dispatch
worker. The LeaseReadThrough done once at luaScriptContext startup
only verified leadership and set startTS; it did NOT short-circuit
subsequent GetAt reads.

Fix: introduce leaseReadEngineCtx (lease-aware sibling of
linearizableReadEngineCtx). When the engine exposes LeaseProvider
and the engine-driven LastQuorumAck is within LeaseDuration of a
single time.Now() sample, return the current AppliedIndex without
dispatching a new read-index. Lease miss still falls through to the
full LinearizableRead round-trip, preserving correctness when the
lease is cold (e.g. immediately after a leadership transfer).

Switch the two read-path callers in ShardStore to use the new
helper:
- isLinearizableRaftLeader (on the GetAt / ScanAt hot path)
- the leader-liveness guard inside LatestCommitTS

Safety matches engineLeaseAckValid's existing contract (state ==
Leader, ack non-zero, ack <= now, now - ack < LeaseDuration). No
correctness change — the lease already guarantees no concurrent
leader within the window, so AppliedIndex is safe to serve.

Expected prod impact: submitRead call rate drops by ~N× where N is
the number of storage-layer reads per script (5+ in BullMQ workloads
post-#573). Heartbeat drops and dispatch_errors should follow.

Tests cover fast path, stale ack fallback, non-leader fallback,
engines without LeaseProvider, and nil-engine guard.

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 24 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c7abeed4-f8a3-421b-9f33-f01ecbdbbe3d

📥 Commits

Reviewing files that changed from the base of the PR and between a0a2185 and 4a585f5.

📒 Files selected for processing (3)
  • kv/lease_read_test.go
  • kv/raft_engine.go
  • kv/shard_store.go

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces leaseReadEngineCtx, a lease-aware alternative to linearizableReadEngineCtx, to optimize read operations in ShardStore. By utilizing the engine's LeaseProvider and LastQuorumAck, the system can now skip the LinearizableRead round-trip when a valid lease is held, significantly reducing Raft dispatcher saturation during high read loads. The changes include the implementation of the helper function, its integration into isLinearizableRaftLeader and LatestCommitTS, and comprehensive unit tests covering various edge cases. I have no feedback to provide.

@bootjp bootjp merged commit 82bbaa3 into main Apr 21, 2026
8 checks passed
@bootjp bootjp deleted the perf/shard-store-lease-read branch April 21, 2026 16:26
bootjp added a commit that referenced this pull request Apr 22, 2026
…tbeats (#577)

## Summary
- Adds an **opt-in** 4-lane etcd raft dispatcher (heartbeat /
replication / snapshot / other) gated on
`ELASTICKV_RAFT_DISPATCHER_LANES=1`. Default behavior is unchanged.
- Isolates `MsgSnap` on its own goroutine so a multi-MiB snapshot
transfer can no longer stall subsequent `MsgApp`s queued to the same
peer; addresses the residual failure mode from the 324k-heartbeat-drop
incident that PR #575 partially mitigated.
- Keeps every existing correctness invariant: per-peer within-type
ordering, `dispatchDropCount` accounting, and `postDispatchReport`
firing after every dispatch attempt (so `ReportUnreachable` /
`ReportSnapshot` still land in etcd/raft).

## What changed
`peerQueues` already had two lanes (`heartbeat` + `normal`) from PR
#522. With the flag on, it now carries four:

| Lane | Message types | Buffer |
|---|---|---|
| `heartbeat` | `MsgHeartbeat` / `MsgHeartbeatResp`, Vote / PreVote (+Resp), ReadIndex (+Resp), `MsgTimeoutNow` | 512 |
| `replication` | `MsgApp` / `MsgAppResp` | `MaxInflightMsg` (1024) |
| `snapshot` | `MsgSnap` | 4 |
| `other` | anything unclassified | 16 |

Each lane gets its own goroutine, drained independently. Within-type
ordering to a given peer is preserved because all of a peer's messages
of one type share exactly one channel and one worker. See
`selectDispatchLane`.
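The routing described by the table can be sketched as a pure classification function. This is an illustrative reconstruction, not the real `selectDispatchLane` — in particular, message types are stand-in strings here rather than the actual `raftpb.MessageType` values:

```go
package main

import "fmt"

// heartbeatLike holds the control-plane messages the table routes to
// the heartbeat lane: heartbeats, votes/prevotes, read-index, timeout-now.
var heartbeatLike = map[string]bool{
	"MsgHeartbeat": true, "MsgHeartbeatResp": true,
	"MsgVote": true, "MsgVoteResp": true,
	"MsgPreVote": true, "MsgPreVoteResp": true,
	"MsgReadIndex": true, "MsgReadIndexResp": true,
	"MsgTimeoutNow": true,
}

// selectDispatchLane sketch: with the flag off, fall back to the legacy
// two-lane routing (heartbeat + normal); with it on, split replication
// and snapshot traffic into their own lanes.
func selectDispatchLane(msgType string, fourLane bool) string {
	if !fourLane {
		if heartbeatLike[msgType] {
			return "heartbeat"
		}
		return "normal"
	}
	switch {
	case heartbeatLike[msgType]:
		return "heartbeat"
	case msgType == "MsgApp" || msgType == "MsgAppResp":
		return "replication"
	case msgType == "MsgSnap":
		return "snapshot"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(selectDispatchLane("MsgSnap", true))  // snapshot
	fmt.Println(selectDispatchLane("MsgSnap", false)) // normal
}
```

The second call illustrates the head-of-line hazard the flag addresses: in legacy mode a `MsgSnap` shares the `normal` lane (and its worker) with `MsgApp`, so a multi-MiB snapshot transfer can stall replication to that peer.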

When the flag is off we construct the old `heartbeat + normal` pair
exactly as today — no behavior change, no goroutine count change, no
allocation change.

## Why behind a flag
The raft dispatch path is high blast radius: a bug here can drop
heartbeats cluster-wide and trigger mass elections. Rather than swap the
hot path in one shot, this lands the structural change plumbed but
dormant so the default production footprint is unchanged.

## Rollout plan
1. Merge default-off (this PR).
2. Flip `ELASTICKV_RAFT_DISPATCHER_LANES=1` in staging, watch
`elastickv_raft_dispatch_drop_total`,
`elastickv_raft_dispatch_errors_total`, and heartbeat loss metrics for
24h under synthetic write load + forced snapshot transfer.
3. Enable on one production replica, bake 24h, then fleetwide.
4. Once soaked, delete the flag and the 2-lane branch in
`startPeerDispatcher` / `selectDispatchLane` in a follow-up PR.

## Test plan
- [x] `go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag off)
- [x] `ELASTICKV_RAFT_DISPATCHER_LANES=1 go test -race -count=1 -timeout 300s ./internal/raftengine/etcd/...` — passes (flag on)
- [x] New targeted tests:
  - `TestSelectDispatchLane_LegacyTwoLane` — pins today's routing table
  - `TestSelectDispatchLane_FourLane` — pins new routing table
  - `TestFourLaneDispatcher_SnapshotDoesNotBlockReplication` — exercises the HOL-blocking invariant the flag is meant to fix
  - `TestFourLaneDispatcher_RemovePeerClosesAllLanes` — ensures no goroutine leak on peer removal in 4-lane mode
  - `TestDispatcherLanesEnabledFromEnv` — env parsing
- [ ] Staging soak with flag enabled under write-heavy load (blocker for removing the flag in a follow-up)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Optional 4-lane dispatch mode (toggled via environment flag) to route
replication, snapshot, priority, and other traffic separately for
reduced head-of-line blocking and improved responsiveness.
* Safer peer removal behavior ensuring in-flight lane workers exit and
post-removal messages are dropped cleanly.

* **Tests**
* Added unit and concurrency tests covering legacy vs. 4-lane routing,
lifecycle, and env-var parsing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 23, 2026
## Summary

Adds an `EXTRA_ENV` pass-through to `scripts/rolling-update.sh` so
operators can set container environment variables from `deploy.env`
without editing this script each time.

## Motivation

Immediate trigger: enabling `ELASTICKV_RAFT_DISPATCHER_LANES=1` (feature
flag added in PR #577). We want to flip it on via the deploy envelope,
not by rebuilding the image or hand-editing this script.

Post-#575 raft metrics are already healthy, so the 4-lane dispatcher is
being enabled as a defensive measure — it keeps heartbeats from being
starved by MsgApp / MsgSnap bursts under extreme write load, rather than
as a fix for anything actively broken.

## Change

- `run_container`: if `EXTRA_ENV` is set, split on whitespace and
forward each pair as a single `docker run -e KEY=VALUE` flag.
- Comment documents the whitespace-split semantics (pairs must not
contain whitespace; values may contain characters bash would otherwise
interpret).

## Usage

```bash
# deploy.env
EXTRA_ENV="ELASTICKV_RAFT_DISPATCHER_LANES=1 ELASTICKV_PEBBLE_CACHE_MB=512"
```

## Test plan

- [x] `bash -n scripts/rolling-update.sh` passes
- [x] `shellcheck` / `make lint` — 0 issues
- [ ] Rolling deploy with EXTRA_ENV set; verify via `docker inspect
elastickv --format '{{.Config.Env}}'` on each node


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added the ability to pass arbitrary additional environment variables
to containers via a new input; these are injected into container
runtime.
* **Improvements**
* Extra env entries are validated and safely normalized/escaped for
remote transport.
* Core runtime variables are now sent in escaped/quoted form to the
remote execution context for safer handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 23, 2026
…592)

## Root cause

Every committed raft entry triggered a `pebble.Sync` inside
`store.ApplyMutations` / `store.DeletePrefixAt` (see
`store/lsm_store.go:1056, 1108`), on top of the raft WAL fsync that
`etcd/raft` already performs per `Ready` batch in `persistReadyToWAL`
(`internal/raftengine/etcd/wal_store.go:376`).

The raft Ready loop in `drainReady`
(`internal/raftengine/etcd/engine.go:1389`) already batches multiple
entries per `Ready`, so the raft WAL fsync is fine. The hot fsync is the
FSM-side `b.Commit(pebble.Sync)` that the apply loop (`applyCommitted`)
hits once per entry. A prior cleanup (`docs/review_todo.md` section 3.4)
intentionally kept `ApplyMutations` on `pebble.Sync`; this CL makes that
choice tunable.

Microbenchmark (Apple M1 Max, APFS tempdir,
`BenchmarkApplyMutations_SyncMode`):

| mode   | ns/op    | allocs/op |
|--------|---------:|----------:|
| sync   | 16292899 | 7         |
| nosync |    16293 | 8         |

~1000x on this platform. Real hardware fsync latency varies, but the
sync/nosync ratio is consistently large on any WAL that enforces
platform durability.

## Durability argument

Pebble's FSM-commit fsync is redundant with the raft WAL under this
codebase's crash-recovery model:

1. Raft WAL (`etcd/raft`) fsyncs every committed entry via
`persist.Save` before `Advance`.
2. On restart, `newMemoryStorage`
(`internal/raftengine/etcd/persistence.go:352`) reloads the snapshot +
all WAL entries. `newRawNode` does not set `Config.Applied`, so
`etcdraft` defaults it to `snapshot.Metadata.Index`.
3. The engine sets `e.applied = maxAppliedIndex(LocalSnap)` and every
committed entry past the snapshot is re-delivered through
`CommittedEntries` on the first `Ready`.
4. `kv/fsm.applyCommitWithIdempotencyFallback` treats an
already-committed key (`LatestCommitTS >= commitTS`) as an idempotent
retry, so replaying an entry whose effect survived the crash is safe.
5. FSM snapshots are fsynced (`writeFSMSnapshotFile` then `f.Sync()` in
`fsm_snapshot_file.go`).

Therefore a crash that loses the unfsynced tail of Pebble's own WAL is
recoverable: raft replays from the last fsynced FSM snapshot onwards,
and the idempotent apply path re-materialises the lost state. Pebble on
the FSM commit path effectively becomes a volatile cache of applied
state whose durability boundary is the raft WAL.
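The idempotent-retry rule in step 4 is what makes this replay safe. A minimal sketch of that guard, assuming a per-key latest-commit-timestamp map (the real logic lives in `kv/fsm`'s `applyCommitWithIdempotencyFallback`; the struct and method names here are hypothetical):

```go
package main

import "fmt"

// fsm is a toy stand-in: key -> value, plus key -> latest commit
// timestamp, mirroring the LatestCommitTS check described above.
type fsm struct {
	latestCommitTS map[string]uint64
	data           map[string]string
}

// applyCommit applies a replicated write unless an equal-or-newer
// commitTS already landed for the key, in which case the replayed
// entry is treated as an idempotent retry and skipped.
func (f *fsm) applyCommit(key, value string, commitTS uint64) bool {
	if prev, ok := f.latestCommitTS[key]; ok && prev >= commitTS {
		return false // effect already durable; safe to drop the replay
	}
	f.data[key] = value
	f.latestCommitTS[key] = commitTS
	return true
}

func main() {
	f := &fsm{latestCommitTS: map[string]uint64{}, data: map[string]string{}}
	fmt.Println(f.applyCommit("k", "v1", 10)) // true: first apply
	fmt.Println(f.applyCommit("k", "v1", 10)) // false: raft WAL replay after a crash
}
```

Under this guard, replaying raft entries whose effects did survive in Pebble is a no-op, while entries in the lost unfsynced tail are re-materialised — which is exactly why the FSM-side fsync is redundant with the raft WAL.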

Other `pebble.Sync` call sites (snapshot-batch commit, metadata-restore
writes, compaction `commitSnapshotBatch`) are untouched: those are
orthogonal durability boundaries (e.g. restore-directory swap) and are
not per-proposal cost.

## Env var + default

* `ELASTICKV_FSM_SYNC_MODE=sync` (default) - current behaviour.
* `ELASTICKV_FSM_SYNC_MODE=nosync` - `b.Commit(pebble.NoSync)` on the
FSM hot path. Raft WAL remains the durability boundary.

Unknown values fall back to `sync` (fail-safe toward durability).
Parsing is case-insensitive and whitespace-tolerant.

A Prometheus gauge `elastickv_fsm_apply_sync_mode{mode="sync"|"nosync"}`
is set at `NewRegistry` time via `store.FSMApplySyncModeLabel()`, so
dashboards can alert if a rolling deploy accidentally flips the
durability posture.

## Test plan

- [x] `go test ./store/... ./monitoring/... ./kv/...
./internal/raftengine/... -count=1`
- [x] Env var parsing: sync/nosync/mixed-case/whitespace/unknown then
sync default
- [x] Functional equivalence of sync vs nosync on a Pebble store
- [x] Clean-shutdown reopen visibility (NoSync + Close + reopen
preserves writes)
- [x] Prometheus gauge mutual exclusivity across successive
SetFSMApplySyncMode calls
- [ ] (Follow-up) Jepsen-style OS-level crash test for unfsynced-tail
recovery - tracked in `JEPSEN_TODO.md`

## Benchmark

```
go test ./store -run='^$' -bench='BenchmarkApplyMutations_SyncMode' -benchtime=2s -benchmem
BenchmarkApplyMutations_SyncMode/sync-10      141   16292899 ns/op   368 B/op   7 allocs/op
BenchmarkApplyMutations_SyncMode/nosync-10    129262     16293 ns/op   284 B/op   8 allocs/op
```

## Related

- Previous lever documented in `docs/review_todo.md` section 3.4
(ApplyMutations retained `pebble.Sync`).
- Pebble block cache default 256 MiB (#588)
- WAL retention / purge (#589)
- Lease read via AppliedIndex (#575)
- Raft dispatcher lanes (#577)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Configuration option for FSM commit durability mode with adjustable
sync and no-sync settings
* New Prometheus metric exposing the current durability mode on the
`/metrics` endpoint

* **Tests**
  * Unit tests validating FSM durability mode functionality
* Integration tests ensuring functional correctness across different
durability modes
* Performance benchmarks evaluating FSM commit operation characteristics

<!-- end of auto-generated comment: release notes by coderabbit.ai -->