fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline #745
Conversation
verifyLeaderEngine() called engine.VerifyLeader with context.Background(), so any caller without an upstream context blocked indefinitely on a ReadIndex round-trip. A single transient stall accumulated callers permanently because they never timed out and never returned.

Production hit this on 2026-05-08: follower 192.168.0.214 lost its network route (no route to host, ARP INCOMPLETE), the leader's ReadIndex completion stalled intermittently, and verify-callers piled up at roughly 9/sec without bound. After ~37 minutes the leader (192.168.0.212) held 20,560 goroutines (20,478 in submitRead select, oldest 39 minutes), CPU pinned at 1870% (Engine.run Ready loop walks pendingReads O(N) per tick, so the queue feeds back on itself), and host MemAvailable trended toward 0 until OOM. Each new leader after failover re-entered the same death spiral.

Affected callers (all use the no-context variant):
- LeaderProxy.Commit / .Abort -- every Redis write hits this
- Coordinate.VerifyLeader / ShardedCoordinator.VerifyLeader[ForKey]
- adapter S3/SQS /healthz/leader handlers (Caddy probes)
- main_admin.go LeaderProbe (admin dashboard /admin/healthz/leader)
- adapter/sqs.go isVerifiedSQSLeader, adapter/s3.go isVerifiedS3Leader

Fix: cap the no-context path at 5s (matching leaderForwardTimeout). On timeout, callers see context.DeadlineExceeded -- LeaderProxy falls back to forwardWithRetry as it already does for any verify failure, healthz handlers report not-leader, and the lock resolver skips this tick.

Self-review (5 lenses):
1. Data loss -- none. The fix only shortens a never-returning wait. verifyLeaderEngine is a freshness check, not a write path.
2. Concurrency -- the new ctx is local to each call (defer cancel), no shared state, no lock changes. Engine-side blocking semantics unchanged; we just stop waiting forever.
3. Performance -- positive. Removes the unbounded goroutine pile-up and the O(N) pendingReads walk it caused. No new allocations on the success path beyond the WithTimeout context.
4. Data consistency -- ReadIndex still completes when quorum heartbeats land within 5s. A timeout means the caller could not confirm leadership freshness, which the existing "fall through to forward" path already treats as a soft failure.
5. Test coverage -- kv/raft_engine_test.go pins the regression: a blockingLeaderView that holds VerifyLeader on its ctx must surface DeadlineExceeded within 2x verifyLeaderTimeout.

Test: go test -race -count=1 ./kv -- 9.3s, all green.

Future work (separate PRs): plumb real request contexts through LeaderProxy.Commit/Abort and the healthz handlers so a client-side deadline cascades naturally instead of relying on this fixed bound.
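For orientation, a minimal sketch of the shape the fix takes, assuming the no-arg wrapper delegates to an existing ctx-taking variant. The free-function form, the leaderVerifier interface, and the parameter passing are illustrative, not the repository's literal code:

```go
package kv

import (
	"context"
	"time"
)

// leaderVerifier is a stand-in for the engine-side interface; the real
// repository's interface name and shape may differ.
type leaderVerifier interface {
	VerifyLeader(ctx context.Context) error
}

// verifyLeaderTimeout matches leaderForwardTimeout (5s): a verify that takes
// longer than a single forward RPC is useless as a freshness check.
const verifyLeaderTimeout = 5 * time.Second

// verifyLeaderEngine is the no-context variant used by callers without an
// upstream deadline. Instead of blocking forever on context.Background(),
// it bounds the ReadIndex round-trip at verifyLeaderTimeout.
func verifyLeaderEngine(engine leaderVerifier) error {
	ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
	defer cancel() // ctx is local to this call; no shared state, no lock changes.
	return verifyLeaderEngineCtx(ctx, engine)
}

// verifyLeaderEngineCtx performs the actual freshness check; a ctx deadline
// surfaces to the caller as context.DeadlineExceeded.
func verifyLeaderEngineCtx(ctx context.Context, engine leaderVerifier) error {
	return engine.VerifyLeader(ctx)
}
```

A timed-out verify is treated as "could not confirm leadership", which the existing fallback paths already handle as a soft failure.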
@claude review
Claude finished @bootjp's task in 2m 44s — Code Review

The fix is correct and minimal. The root cause analysis is solid. Three observations, ordered by severity:

1. Incident narrative in source code (style)

The rationale is exactly right and should stay ("why 5s", "why this constant exists", the O(N) feedback loop); the incident specifics (IPs, goroutine counts, timestamps) read better in the commit message and PR body than in the doc comment. Suggested form:

```go
// 5s matches leaderForwardTimeout: a verify that takes longer than a
// single forward RPC is useless as a freshness check, and the proxy's
// verify-then-forward path stays within its 5s retry budget.
//
// See incident 2026-05-08 (PR #745) for the goroutine-pile production failure
// this prevents.
const verifyLeaderTimeout = 5 * time.Second
```

2. Test uses stdlib errors

The regression test imports stdlib errors (aliased stderrors); the project convention (CLAUDE.md) mandates github.com/cockroachdb/errors at boundaries, and the test is a boundary.

3. Missing -short skip

go test -short ./kv/... pays the full verifyLeaderTimeout (5s) wait; an `if testing.Short() { t.Skip(...) }` guard keeps the short run fast while the default make test still exercises the regression.
Code Review
This pull request introduces a 5-second timeout to the verifyLeaderEngine function to prevent indefinite blocking and potential resource exhaustion during ReadIndex stalls, addressing a production issue that previously led to goroutine leaks and OOM errors. A regression test was added to verify that the function now returns a deadline exceeded error within a reasonable timeframe. The review feedback suggests enhancing the test by asserting a minimum elapsed time to ensure the timeout logic is correctly exercised and not returning early for other reasons.
```go
if elapsed > 2*verifyLeaderTimeout {
	t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
}
```
To ensure that the test is actually exercising the timeout logic and not returning early for some other reason, consider asserting a minimum elapsed time. This confirms that the blockingLeaderView mock correctly held the call until the context deadline was reached.
```go
if elapsed < verifyLeaderTimeout {
	t.Fatalf("verifyLeaderEngine(blocking) returned too early after %s; want >= %s", elapsed, verifyLeaderTimeout)
}
if elapsed > 2*verifyLeaderTimeout {
	t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
}
```

Round-1 review on commit ad924ad:

- gemini medium: assert minimum elapsed time, not just upper bound. Without a lower bound a future regression returning DeadlineExceeded before actually waiting would silently pass. Added a lower-bound check with 200ms slack so a slow CI scheduler that trips ctx.Done() a hair before the wall clock catches up does not flake.
- claude bot (style): trim incident narrative from the verifyLeaderTimeout doc comment. Specific IPs, goroutine counts, and timestamps belong in the commit message and PR body, not the source — they go stale and add no design value at the call site. Kept the rationale (why 5s, the O(N) pendingReads feedback loop, match leaderForwardTimeout) and added a "see PR #745" reference for readers who want the full incident detail.
- claude bot: switch the regression test from stdlib `stderrors "errors"` to `github.com/cockroachdb/errors` for consistency with the project convention (CLAUDE.md mandates cockroachdb/errors at boundaries; the test is a boundary). Also drops the import alias which made gci complain about ordering.
- claude bot: add `if testing.Short() { t.Skip(...) }` guard so `go test -short ./kv/...` does not pay the verifyLeaderTimeout (5s) wait. Default `make test` still runs it.
- reviewdog/golangci (gci): import order fixed by reordering third-party (cockroachdb) above local (bootjp) per the project's default gci sections (standard, default, prefix(...)). No behavior change.

Test still passes:
- go test -race -count=1 -run TestVerifyLeaderEngine ./kv -- 6.06s
- go test -race -count=1 -short ./kv -- 5.55s (test skipped under -short)

No semantic change to the production fix; this commit is test + comment polish only, so no caller audit needed. The resulting test shape is sketched below.
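Building on the wrapper sketch above, here is a minimal sketch of what the regression test looks like after this round. The blockingLeaderView stub form, the 200ms slack constant, and the exact failure messages are illustrative assumptions, not the repository's literal test:

```go
package kv

import (
	"context"
	"testing"
	"time"

	"github.com/cockroachdb/errors"
)

// blockingLeaderView holds VerifyLeader until the caller's ctx is cancelled,
// simulating a ReadIndex round-trip that never completes.
type blockingLeaderView struct{}

func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
	if testing.Short() {
		t.Skip("waits out verifyLeaderTimeout (5s); skipped under -short")
	}

	start := time.Now()
	err := verifyLeaderEngine(blockingLeaderView{})
	elapsed := time.Since(start)

	if !errors.Is(err, context.DeadlineExceeded) {
		t.Fatalf("verifyLeaderEngine(blocking) = %v; want context.DeadlineExceeded", err)
	}
	// Lower bound: confirm the engine actually held the call until the
	// deadline fired (200ms slack for slow CI schedulers).
	if elapsed < verifyLeaderTimeout-200*time.Millisecond {
		t.Fatalf("returned too early after %s; want >= %s", elapsed, verifyLeaderTimeout)
	}
	if elapsed > 2*verifyLeaderTimeout {
		t.Fatalf("returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
	}
}
```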
@claude review
Claude finished @bootjp's task in 1m 43s — Round-2 Code Review

All three round-1 issues are resolved, and the Gemini lower-bound suggestion has been incorporated. One item remains open.

Remaining issue: gci import ordering in test file
```go
import (
	"context"
	"testing"
	"time"

	"github.com/cockroachdb/errors" // ← its own group

	"github.com/bootjp/elastickv/internal/raftengine" // ← separate group
)
```

The project convention (see …) expects:

```go
import (
	"context"
	"testing"
	"time"

	"github.com/bootjp/elastickv/internal/raftengine"
	"github.com/cockroachdb/errors"
)
```
Minor nit: comment references review tooling

```go
// Lower bound: confirm the engine actually held the call until the
// deadline fired, not that some other error path returned immediately.
```

The PR description already credits the source; the source file only needs the rationale.

Everything else looks correct

The gci fix is one import-block tweak; everything else is shippable.
Round-2 review on commit 0be9fc7:

- reviewdog/golangci (gci): test file still failed gci because the project convention puts ALL non-stdlib imports in a single group (no blank line between third-party and local). Round-2's "cockroachdb above bootjp with a blank line between" guess was wrong; merging into one alphabetical group (bootjp before cockroachdb) matches kv/coordinator.go and resolves the warning.
- claude bot (nit): drop "Pulled in from gemini's PR #745 round-1 review" from the lower-bound rationale comment. Process metadata rots once the PR is merged; the design reason is what the source needs.

No behavior change. Test still passes, 6.05s.
@claude review
Claude finished @bootjp's task in 1m 5s — Round-3 Code Review

Both round-2 issues are resolved. This is clean.

Round-2 issues — verified closed

gci import ordering (…):

```go
import (
	"context"
	"testing"
	"time"

	"github.com/bootjp/elastickv/internal/raftengine"
	"github.com/cockroachdb/errors"
)
```

Comment referencing review tooling (…)

Everything checks out

No remaining issues. This PR is ready to merge.
Follow-up to #745 (verifyLeaderEngine 5s deadline incident hotfix). PR #745 capped the no-context verify path at 5s as defense-in-depth; this PR plumbs the caller's context.Context end-to-end through:

Dispatch path:
ShardedCoordinator.Dispatch(ctx) → dispatchTxn(ctx) → dispatchSingleShardTxn(ctx) / commitPrimaryTxn(ctx) / commitSecondaryTxns(ctx) / abortPreparedTxn(ctx) / broadcastToAllGroups(ctx) / dispatchDelPrefixBroadcast(ctx) → Transactional.Commit/Abort(ctx, reqs) → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter → applyRequests(ctx) → proposer.Propose(ctx) → verifyLeaderEngineCtx(ctx)

Lock-resolver / shard-store path:
LockResolver.resolveExpiredLock(ctx) → applyTxnResolution(ctx) / ShardStore.applyScanLockResolutions(ctx) / ShardStore.tryAbortExpiredPrimary(ctx) → g.Txn.Commit(ctx)

Leader-probe / healthz path:
Coordinator.VerifyLeader(ctx) / Coordinator.VerifyLeaderForKey(ctx, key)
LeaderProbe.IsVerifiedLeader(ctx)
adapter S3/SQS/DynamoDB /healthz/leader handlers feed r.Context()

Adapter healthz helpers:
isVerifiedSQSLeader(ctx, coordinator)
isVerifiedDynamoLeader(ctx, coordinator)
S3Server.isVerifiedS3Leader(ctx)
DistributionServer.verifyCatalogLeader(ctx)

Admin LeaderProbe wiring:
main_admin.go newAdminLeaderProbe → coordinate.VerifyLeader(ctx)

After: a Redis BLPOP timeout=2s whose dispatch lands on a slow ReadIndex now fails after 2s (its own deadline), not after the 5s verifyLeaderTimeout safety bound. A Caddy active health probe with a 1s budget likewise fails after 1s. The 5s bound stays in place as defense-in-depth for the no-arg verifyLeaderEngine() — still hit by LockResolver background loops, HLC lease ticks, and any future internal caller that genuinely cannot inherit a deadline.

Also fixes proposer.Propose at kv/transaction.go:152 which used context.Background() inline; same shape as the original verifyLeaderEngine bug, just on the propose path. Now plumbed.

Batched commit caveat: TransactionManager.commitRaw merges many callers into a single propose via a separate goroutine, so no single ctx can bound the underlying applyRequests. The wait site in commitRaw selects between item.done and ctx.Done so per-caller cancellation still works (the propose continues, the abandoned caller exits early); the goroutine's applyRequests call uses context.Background by design and is documented as such (the wait-site pattern is sketched below).

Test stub updates (3 ripples through the Transactional interface, 6 ripples through Coordinator, 2 through LeaderProbe):
- kv/coordinator_txn_test.go::stubTransactional
- kv/coordinator_retry_test.go::scriptedTransactional
- kv/sharded_coordinator_txn_test.go::recordingTransactional
- kv/sharded_coordinator_abort_test.go::failingTransactional
- kv/sharded_lease_test.go::fixedTransactional
- kv/shard_router_test.go::fakeTM
- kv/shard_router_partition_test.go::fakeTxn
- kv/leader_routed_store_test.go::stubLeaderCoordinator
- adapter/redis_retry_test.go::retryOnceCoordinator
- adapter/redis_info_test.go::infoTestCoordinator
- adapter/redis_hello_test.go::helloTestCoordinator
- adapter/redis_keys_pattern_test.go::stubAdapterCoordinator
- adapter/distribution_server_test.go::distributionCoordinatorStub
- adapter/dynamodb_test.go::testCoordinatorWrapper
- adapter/s3_test.go::routeAwareS3Coordinator + followerS3Coordinator
- internal/admin/router_test.go::LeaderProbeFunc literal forms

Self-review (5 lenses):
1. Data loss — none. ctx is value-passed; no persisted state changes. verifyLeaderEngineCtx already existed and is what we now route to; only the caller of the no-arg variant changes (background loops only).
2. Concurrency — ctx flows through the existing locking unchanged. commitRaw goroutine intentionally uses Background to avoid mixing per-caller deadlines into a shared batched propose; documented.
3. Performance — same call count, no extra allocations beyond the ctx value passed by reference. context.WithTimeout wrapping in the no-ctx fallback is unchanged.
4. Data consistency — verify is a freshness check, not a write path. Shorter caller deadlines surface ErrLeaderNotFound earlier instead of waiting out the 5s safety bound.
5. Test coverage — every Transactional / Coordinator / LeaderProbe stub gains the ctx parameter (currently `_ context.Context`, ready for future tests asserting cancel propagation).

Test: go test -race -count=1 -short ./kv ./adapter ./internal/admin -- 9.6m total, all green (adapter is the long pole on Jepsen-shaped tests; kv 6.6s, admin 1.4s).

Design doc: docs/design/2026_05_10_proposed_kv_ctx_plumbing.md.
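A minimal sketch of the batched-commit wait-site pattern described in the caveat above. commitItem, the queue channel, and the result plumbing are illustrative assumptions, not TransactionManager.commitRaw's literal code:

```go
package kv

import "context"

// commitItem is an illustrative per-caller unit handed to the batching goroutine.
type commitItem struct {
	done chan error // receives the batch goroutine's result when the propose finishes
}

// commitRaw enqueues the caller's work for the shared batch goroutine, then
// waits for either the batch result or the caller's own cancellation.
// The propose itself is NOT bounded by this ctx: other callers in the same
// batch still need it to complete, so the goroutine runs with context.Background().
func commitRaw(ctx context.Context, queue chan<- *commitItem) error {
	item := &commitItem{done: make(chan error, 1)}

	// Guard the enqueue as well so a cancelled caller never blocks on a full queue.
	select {
	case queue <- item:
	case <-ctx.Done():
		return ctx.Err()
	}

	select {
	case err := <-item.done:
		// Batch propose finished; return its result to this caller.
		return err
	case <-ctx.Done():
		// Per-caller cancellation: this caller exits early, but the batched
		// propose continues and other waiters get their results normally.
		return ctx.Err()
	}
}
```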
#749

## Summary

Follow-up to #745. PR #745 capped `verifyLeaderEngine`'s no-context call site at 5s as an incident hotfix; this PR plumbs the caller's `context.Context` end-to-end through the kv write path and the verify-leader path so callers that hold a request deadline see THEIR deadline honoured, not the 5s safety bound. The 5s bound stays in place as defense-in-depth for callers without an upstream context (LockResolver, HLC lease tick).

## What changed

**Interface signatures (kv-internal):**

```go
type Transactional interface {
	Commit(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
	Abort(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
}

type Coordinator interface {
	// …
	VerifyLeader(ctx context.Context) error
	VerifyLeaderForKey(ctx context.Context, key []byte) error
}

type LeaderProbe interface {
	IsVerifiedLeader(ctx context.Context) bool
}
```

**Plumbing chains** (every step now carries ctx):

```
Dispatch path:
  ShardedCoordinator.Dispatch(ctx)
    → dispatchTxn(ctx) / dispatchSingleShardTxn(ctx) / commitPrimaryTxn(ctx) /
      commitSecondaryTxns(ctx) / abortPreparedTxn(ctx) / broadcastToAllGroups(ctx) /
      dispatchDelPrefixBroadcast(ctx)
    → Transactional.Commit/Abort(ctx)
    → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter
    → applyRequests(ctx)
    → proposer.Propose(ctx)
    → verifyLeaderEngineCtx(ctx)

Lock-resolver / shard-store path:
  LockResolver.resolveExpiredLock(ctx)
    → applyTxnResolution(ctx) / applyScanLockResolutions(ctx) / tryAbortExpiredPrimary(ctx)
    → g.Txn.Commit(ctx)

Healthz:
  S3 / SQS / DynamoDB /healthz/leader handlers
    → isVerifiedSQSLeader(ctx, …) / isVerifiedDynamoLeader(ctx, …) /
      S3Server.isVerifiedS3Leader(ctx) / DistributionServer.verifyCatalogLeader(ctx)
    → coordinator.VerifyLeader(ctx) / VerifyLeaderForKey(ctx, key)

Admin LeaderProbe:
  /admin/healthz/leader handler
    → LeaderProbe.IsVerifiedLeader(r.Context())
    → coordinate.VerifyLeader(ctx)
```

Also fixes the inline `context.Background()` at `kv/transaction.go:152` (`proposer.Propose`) — same shape as the original `verifyLeaderEngine` bug, on the propose path.

## Behaviour

For callers with upstream deadlines:

| caller | before | after |
|---|---|---|
| Redis BLPOP `timeout=2s` on slow ReadIndex | fails after 5s (verifyLeaderTimeout) | fails after 2s (caller's deadline) |
| Caddy active probe `timeout=1s` | fails after 5s | fails after 1s |
| gRPC `Dispatch(ctx)` with explicit deadline | dropped to 5s | honours ctx deadline |

For background callers without upstream ctx:

- LockResolver tick, HLC lease renewer, etc. continue to hit `verifyLeaderEngine()` (no-arg variant) which still wraps with `context.WithTimeout(context.Background(), verifyLeaderTimeout)`. PR #745's 5s bound is unchanged for them.

## Batched-commit caveat

`TransactionManager.commitRaw` merges many callers into a single propose via a separate goroutine, so no single ctx can bound the underlying `applyRequests`. The fix:

- Wait site in `commitRaw` selects between `item.done` and `ctx.Done()` so per-caller cancellation works (caller exits early; the propose still completes; other waiters in the same batch get their results normally).
- Goroutine's `applyRequests` call uses `context.Background` by design, documented in the code.

## Self-review (5 lenses)

1. **Data loss** — none. ctx is value-passed; no persisted state changes. `verifyLeaderEngineCtx` is the existing path; only the caller of the no-arg variant changes.
2. **Concurrency** — ctx flows through the existing locking unchanged. `commitRaw` goroutine intentionally uses Background to avoid mixing per-caller deadlines into a shared batched propose; documented.
3. **Performance** — same call count, no extra allocations beyond the ctx value passed by reference. `context.WithTimeout` wrapping in the no-ctx fallback is unchanged.
4. **Data consistency** — verify is a freshness check, not a write path. Shorter caller deadlines surface `ErrLeaderNotFound` earlier instead of waiting out the 5s safety bound.
5. **Test coverage** — every Transactional / Coordinator / LeaderProbe stub gains the ctx parameter (currently `_ context.Context`, ready for future tests asserting cancel propagation). 16 stubs / signatures updated.

## Test plan

- [x] `go test -race -count=1 -short ./kv ./adapter ./internal/admin` — all green
  - kv: 6.6s
  - adapter: 565s (long Jepsen-shaped tests)
  - internal/admin: 1.4s
- [x] No semantic change beyond the ctx propagation; all existing tests pass without modification of their assertions.

## Follow-up scope (not in this PR)

The audit also surfaced ~210 `context.Background()` usages in tree, mostly concentrated in:

| file | count | nature |
|---|---|---|
| adapter/redis_compat_commands.go | 66 | dispatch helpers, internal commands |
| adapter/redis_lua_context.go | 25 | Lua script context loaders |
| adapter/redis.go | 25 | handler context fallback paths |
| proxy/* | 28 | proxy package, separate concern |

These are "easy win" candidates where a `ctx` is in scope but `Background()` was used instead. Not bundled here to keep the PR reviewable; a follow-up "audit pass" PR targeting Redis adapter helpers is the natural next step.

The single `context.TODO()` (kv/fsm.go) is structural — raft `Apply()` does not provide a ctx — and is left as is.

## Design doc

`docs/design/2026_05_10_proposed_kv_ctx_plumbing.md` — included in this PR.

## Summary by CodeRabbit

* **Refactor**
  * Updated internal APIs to propagate request context through leader verification and transaction operations, enabling request deadlines to be respected instead of using fixed internal timeouts.
* **Documentation**
  * Added design document outlining context plumbing through KV write paths and leader-probe verification across adapters.
Summary
verifyLeaderEngine() called engine.VerifyLeader with context.Background(), so callers without an upstream context blocked indefinitely on a ReadIndex round-trip. A single transient stall accumulated callers permanently. This caps the no-context path at 5s.

Production incident — 2026-05-08
Follower 192.168.0.214 lost its network route (no route to host, ARP INCOMPLETE). The leader's ReadIndex completion stalled intermittently and verify-callers piled up at ~9/sec without bound.

After ~37 minutes the leader (192.168.0.212) showed:

- 20,560 goroutines, 20,478 of them in etcd.(*Engine).submitRead [select, 35-39 minutes]
- CPU pinned at 1870% (Engine.run Ready loop walks pendingReads O(N) per tick → queue feeds back on itself)

Mitigation:
docker restart elastickv on 212 dropped it to 74% CPU / 163 MiB. 214 was hardware-rebooted and is REACHABLE again. This PR prevents the next leader from re-entering the spiral.

Affected callers
All use the no-context verifyLeaderEngine variant:

- kv/leader_proxy.go — LeaderProxy.Commit / .Abort (every Redis write)
- kv/coordinator.go — Coordinate.VerifyLeader
- kv/sharded_coordinator.go — ShardedCoordinator.VerifyLeader / VerifyLeaderForKey
- adapter/s3.go — isVerifiedS3Leader / inline VerifyLeader at line 2291 (healthz)
- adapter/sqs.go — isVerifiedSQSLeader (healthz)
- main_admin.go — LeaderProbe callback for /admin/healthz/leader

Failure mode on timeout
context.DeadlineExceeded surfaces to the caller. LeaderProxy falls back to forwardWithRetry (the existing path for any verify failure). Healthz handlers report 503 not-leader. Background loops (lock resolver, HLC lease) skip this tick.

No new infinite loop: even when this node is the leader, a verify-failure → forward path already exists in LeaderProxy.Commit; that path is bounded by leaderProxyRetryBudget = 5s and maxForwardRetries = 3.
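A minimal sketch of the verify-then-forward fallback described above. Only the names verifyLeaderEngine, forwardWithRetry, leaderProxyRetryBudget, and maxForwardRetries come from the PR; the struct layout, function-field wiring, and retry loop are illustrative assumptions:

```go
package kv

import (
	"context"
	"errors"
	"time"
)

// Illustrative constants and types; the real repository's shapes differ.
const (
	leaderProxyRetryBudget = 5 * time.Second
	maxForwardRetries      = 3
)

type request struct{ key, value []byte } // stand-in for *pb.Request

type leaderProxy struct {
	verify  func() error                              // e.g. the bounded verifyLeaderEngine
	local   func([]*request) error                    // apply on this node as leader
	forward func(context.Context, []*request) error   // one forward attempt to the current leader
}

// Commit shows the verify-then-forward fallback: a failed or timed-out
// freshness check is treated as "maybe not leader" and the request is
// forwarded instead of erroring out.
func (p *leaderProxy) Commit(reqs []*request) error {
	if err := p.verify(); err != nil {
		// context.DeadlineExceeded from the 5s bound lands here.
		return p.forwardWithRetry(reqs)
	}
	return p.local(reqs)
}

// forwardWithRetry bounds the fallback: at most maxForwardRetries attempts
// inside a leaderProxyRetryBudget deadline, so it cannot loop forever.
func (p *leaderProxy) forwardWithRetry(reqs []*request) error {
	ctx, cancel := context.WithTimeout(context.Background(), leaderProxyRetryBudget)
	defer cancel()

	var lastErr error
	for attempt := 0; attempt < maxForwardRetries; attempt++ {
		err := p.forward(ctx, reqs)
		if err == nil {
			return nil
		}
		lastErr = err
		if ctx.Err() != nil {
			break // retry budget exhausted
		}
	}
	return errors.Join(errors.New("forward failed"), lastErr)
}
```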
Self-review (5 lenses)

1. Data loss — none. The fix only shortens a never-returning wait; verifyLeaderEngine is a freshness check, not a write path. Already-committed proposals are unaffected.
2. Concurrency — the new ctx is local to each call (defer cancel), no shared state, no lock changes. Engine-side blocking semantics unchanged; we just stop waiting forever.
3. Performance — positive. Removes the unbounded goroutine pile-up and the O(N) pendingReads walk it caused. No new allocations on the success path beyond the WithTimeout context.
4. Data consistency — ReadIndex still completes when quorum heartbeats land within 5s. A timeout means the caller could not confirm leadership freshness, which the existing fall-through-to-forward path already treats as a soft failure.
5. Test coverage — kv/raft_engine_test.go::TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: a blockingLeaderView that holds VerifyLeader on its ctx must surface DeadlineExceeded within 2 * verifyLeaderTimeout.

Test plan
- go test -race -count=1 ./kv — 9.3s, all green
- TestVerifyLeaderEngine_BoundsBlockingReadIndex covers the blocking case

Future work (separate PRs)
Plumb real request contexts through LeaderProxy.Commit/Abort and the healthz handlers so client-side deadlines cascade naturally instead of relying on this fixed bound. Today the Redis adapter's per-command deadline doesn't reach LeaderProxy; the proxy interface takes []*pb.Request only.