refactor(kv): plumb caller context through write + verify-leader paths #749
Conversation
Follow-up to #745 (the `verifyLeaderEngine` 5s deadline incident hotfix). PR #745 capped the no-context verify path at 5s as defense-in-depth; this PR plumbs the caller's `context.Context` end-to-end through:

Dispatch path: `ShardedCoordinator.Dispatch(ctx)` → `dispatchTxn(ctx)` → `dispatchSingleShardTxn(ctx)` / `commitPrimaryTxn(ctx)` / `commitSecondaryTxns(ctx)` / `abortPreparedTxn(ctx)` / `broadcastToAllGroups(ctx)` / `dispatchDelPrefixBroadcast(ctx)` → `Transactional.Commit/Abort(ctx, reqs)` → `leaseRefreshingTxn` / `LeaderProxy` / `TransactionManager` / `ShardRouter` → `applyRequests(ctx)` → `proposer.Propose(ctx)` → `verifyLeaderEngineCtx(ctx)`

Lock-resolver / shard-store path: `LockResolver.resolveExpiredLock(ctx)` → `applyTxnResolution(ctx)` / `ShardStore.applyScanLockResolutions(ctx)` / `ShardStore.tryAbortExpiredPrimary(ctx)` → `g.Txn.Commit(ctx)`

Leader-probe / healthz path: `Coordinator.VerifyLeader(ctx)` / `Coordinator.VerifyLeaderForKey(ctx, key)`; the `LeaderProbe.IsVerifiedLeader(ctx)` adapter; S3/SQS/DynamoDB `/healthz/leader` handlers feed `r.Context()`.

Adapter healthz helpers: `isVerifiedSQSLeader(ctx, coordinator)`, `isVerifiedDynamoLeader(ctx, coordinator)`, `S3Server.isVerifiedS3Leader(ctx)`, `DistributionServer.verifyCatalogLeader(ctx)`.

Admin LeaderProbe wiring: `main_admin.go` `newAdminLeaderProbe` → `coordinate.VerifyLeader(ctx)`.

After: a Redis BLPOP `timeout=2s` whose dispatch lands on a slow ReadIndex now fails after 2s (its own deadline), not after the 5s `verifyLeaderTimeout` safety bound. A Caddy active health probe with a 1s budget likewise fails after 1s. The 5s bound stays in place as defense-in-depth for the no-arg `verifyLeaderEngine()` — still hit by LockResolver background loops, HLC lease ticks, and any future internal caller that genuinely cannot inherit a deadline.

Also fixes `proposer.Propose` at kv/transaction.go:152, which used `context.Background()` inline; same shape as the original `verifyLeaderEngine` bug, just on the propose path. Now plumbed.

Batched-commit caveat: `TransactionManager.commitRaw` merges many callers into a single propose via a separate goroutine, so no single ctx can bound the underlying `applyRequests`. The wait site in `commitRaw` selects between `item.done` and `ctx.Done` so per-caller cancellation still works (the propose continues; the abandoned caller exits early); the goroutine's `applyRequests` call uses `context.Background` by design and is documented as such.

Test stub updates (3 ripples through the `Transactional` interface, 6 through `Coordinator`, 2 through `LeaderProbe`):

- kv/coordinator_txn_test.go::stubTransactional
- kv/coordinator_retry_test.go::scriptedTransactional
- kv/sharded_coordinator_txn_test.go::recordingTransactional
- kv/sharded_coordinator_abort_test.go::failingTransactional
- kv/sharded_lease_test.go::fixedTransactional
- kv/shard_router_test.go::fakeTM
- kv/shard_router_partition_test.go::fakeTxn
- kv/leader_routed_store_test.go::stubLeaderCoordinator
- adapter/redis_retry_test.go::retryOnceCoordinator
- adapter/redis_info_test.go::infoTestCoordinator
- adapter/redis_hello_test.go::helloTestCoordinator
- adapter/redis_keys_pattern_test.go::stubAdapterCoordinator
- adapter/distribution_server_test.go::distributionCoordinatorStub
- adapter/dynamodb_test.go::testCoordinatorWrapper
- adapter/s3_test.go::routeAwareS3Coordinator + followerS3Coordinator
- internal/admin/router_test.go::LeaderProbeFunc literal forms

Self-review (5 lenses):

1. Data loss — none. ctx is value-passed; no persisted state changes. `verifyLeaderEngineCtx` already existed and is what we now route to; only the caller of the no-arg variant changes (background loops only).
2. Concurrency — ctx flows through the existing locking unchanged. The `commitRaw` goroutine intentionally uses Background to avoid mixing per-caller deadlines into a shared batched propose; documented.
3. Performance — same call count; no extra allocations beyond passing the ctx value. The `context.WithTimeout` wrapping in the no-ctx fallback is unchanged.
4. Data consistency — verify is a freshness check, not a write path. Shorter caller deadlines surface `ErrLeaderNotFound` earlier instead of waiting out the 5s safety bound.
5. Test coverage — every `Transactional` / `Coordinator` / `LeaderProbe` stub gains the ctx parameter (currently `_ context.Context`, ready for future tests asserting cancel propagation).

Test: `go test -race -count=1 -short ./kv ./adapter ./internal/admin` — 9.6m total, all green (adapter is the long pole on Jepsen-shaped tests; kv 6.6s, admin 1.4s).

Design doc: docs/design/2026_05_10_proposed_kv_ctx_plumbing.md.
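For orientation, a minimal Go sketch of the post-PR method shapes described above. The `Request`/`Response` types and the `bool` return on the probe are stand-ins for illustration, not the repo's real definitions:

```go
package kv

import "context"

// Request and Response are hypothetical stand-ins for the repo's real types.
type Request struct{}
type Response struct{}

// Post-PR shapes as described in this PR; pre-PR, none of these took a ctx.
type Transactional interface {
	Commit(ctx context.Context, reqs []*Request) (*Response, error)
	Abort(ctx context.Context, reqs []*Request) (*Response, error)
}

type Coordinator interface {
	VerifyLeader(ctx context.Context) error
	VerifyLeaderForKey(ctx context.Context, key []byte) error
}

// The adapter-side probe; the bool return type is an assumption.
type LeaderProbe interface {
	IsVerifiedLeader(ctx context.Context) bool
}
```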
📝 Walkthrough: Context plumbing for KV write and leader verification — this PR threads the caller's context through the write and verify-leader paths.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request plumbs context.Context through the KV write and leadership verification paths, allowing request-specific deadlines to govern operations and preventing resource exhaustion during stalls. The changes update the Transactional, Coordinator, and LeaderProbe interfaces to accept context parameters across the system's adapters and internal managers. Feedback focuses on ensuring that transaction cleanup (Abort) operations use detached contexts to prevent orphaned locks when a caller's context has already expired. Additionally, it is recommended to enforce a default safety timeout within the VerifyLeader methods when the provided context lacks a deadline and to explicitly document the blocking nature of these interface methods.
```diff
 // abort cleanup on failure.
 if needsTxnCleanup(reqs) {
-	_, _err := t.Abort(reqs)
+	_, _err := t.Abort(ctx, reqs)
```
The Abort call here uses the caller's context, which may already be cancelled or expired (causing the previous applyRequests to fail). If the context is cancelled, the Abort proposal will fail immediately, leaving transaction locks held until the lock resolver eventually cleans them up. Consider using a detached context with a short timeout for cleanup operations to ensure they have a chance to complete, aligning with the practice of using context.Context for managing deadlines.
References
- When designing interfaces, use context.Context for managing deadlines and cancellation.
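A minimal sketch of the detached-cleanup pattern this comment suggests, reusing the stand-in types from the interface sketch earlier in this thread. `cleanupTimeout` is an assumed bound, and `context.WithoutCancel` requires Go 1.21+:

```go
package kv

import (
	"context"
	"time"
)

const cleanupTimeout = 5 * time.Second // assumed bound, mirroring the 5s guard

// abortDetached runs Abort on a context that keeps the caller's values but
// not its cancellation, so cleanup still proceeds when the caller's ctx has
// already expired, while the timeout keeps the cleanup itself bounded.
func abortDetached(ctx context.Context, t Transactional, reqs []*Request) error {
	cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), cleanupTimeout)
	defer cancel()
	_, err := t.Abort(cleanupCtx, reqs)
	return err
}
```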
```diff
 primaryGid, maxIndex, err := c.commitPrimaryTxn(ctx, startTS, primaryKey, grouped, commitTS)
 if err != nil {
-	c.abortPreparedTxn(startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
+	c.abortPreparedTxn(ctx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
```
Similar to the sequential commit path, abortPreparedTxn here uses the potentially cancelled caller context. This may prevent the abort from being successfully proposed to the shards, leading to prolonged lock contention. Using a detached context for cleanup is recommended to ensure the operation has a managed deadline via the context.
References
- When designing interfaces, use context.Context for managing deadlines and cancellation.
```diff
-func (c *Coordinate) VerifyLeader() error {
-	return verifyLeaderEngine(c.engine)
+func (c *Coordinate) VerifyLeader(ctx context.Context) error {
+	return verifyLeaderEngineCtx(ctx, c.engine)
```
To maintain the 'defense-in-depth' safety bound mentioned in the PR description, VerifyLeader should ideally enforce a default timeout (e.g., 5s) if the provided context does not already have a deadline. This prevents background callers that might pass context.Background() from blocking indefinitely if a ReadIndex round-trip stalls. Additionally, ensure this blocking semantic is explicitly stated in the documentation to avoid ambiguity.
References
- When designing interfaces, use context.Context for managing deadlines and cancellation.
- When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation.
```diff
 	return errors.WithStack(ErrLeaderNotFound)
 }
-return verifyLeaderEngine(engineForGroup(g))
+return verifyLeaderEngineCtx(ctx, engineForGroup(g))
```
As with the single-shard coordinator, VerifyLeader should enforce a default timeout when the input context lacks a deadline to ensure the 5s safety bound is preserved for all callers. Furthermore, the blocking nature of this method should be explicitly documented in the interface description.
References
- When designing interfaces, use context.Context for managing deadlines and cancellation.
- When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation.
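A sketch of the guard both of these comments ask for: apply the safety bound only when the inbound ctx lacks a deadline. The function name, the `verifyLeaderTimeout` value, and the `readIndex` callback are assumptions mirroring names used in this thread:

```go
package kv

import (
	"context"
	"time"
)

const verifyLeaderTimeout = 5 * time.Second // assumed value of the safety bound

// verifyLeaderWithDefaultDeadline applies the safety bound only when the
// caller did not bring a deadline; bounded callers keep their own, since
// context.WithTimeout never extends an existing earlier deadline.
func verifyLeaderWithDefaultDeadline(ctx context.Context, readIndex func(context.Context) error) error {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, verifyLeaderTimeout)
		defer cancel()
	}
	return readIndex(ctx) // stand-in for the ReadIndex round-trip
}
```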
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d11258a668
```diff
 if r.coordinator.IsLeader() {
-	if err := r.coordinator.VerifyLeader(); err != nil {
+	if err := r.coordinator.VerifyLeader(r.handlerContext()); err != nil {
```
Reinstate bounded context for Redis KEYS leader checks
This call switched from VerifyLeader() (which used the 5s verifyLeaderEngine guard) to VerifyLeader(r.handlerContext()), but handlerContext() is rooted in context.WithCancel(context.Background()) and typically has no deadline. When ReadIndex/VerifyLeader stalls, KEYS can now block indefinitely instead of failing fast, tying up the command handler until client disconnect or server shutdown and reintroducing the unbounded wait behavior the 5s guard was meant to prevent.
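One way to restore a bound here is to cap the long-lived handler context with the per-command dispatch budget before the leader check. A sketch, reusing the `Coordinator` stand-in from earlier; the parent context plays the role of `r.handlerContext()` and the budget plays the role of `redisDispatchTimeout`:

```go
package kv

import (
	"context"
	"time"
)

// boundedVerify caps a long-lived parent context with a per-call budget so a
// stalled ReadIndex fails the command instead of hanging the handler.
// Called as, e.g., boundedVerify(r.handlerContext(), r.coordinator, redisDispatchTimeout).
func boundedVerify(parent context.Context, c Coordinator, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	return c.VerifyLeader(ctx)
}
```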
```diff
 if err := r.retryRedisWrite(ctx, func() error {
-	if err := r.coordinator.VerifyLeader(); err != nil {
+	if err := r.coordinator.VerifyLeader(r.handlerContext()); err != nil {
```
Pass FLUSHDB timeout context into VerifyLeader
flushDatabase creates a redisDispatchTimeout-bounded context and uses it for retries, but leader verification now uses r.handlerContext() instead of that ctx. If VerifyLeader blocks, the command can run past the intended dispatch timeout because cancellation from ctx no longer reaches the verify call. Using the same ctx here keeps FLUSHDB bounded as designed.
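The fix shape here is to thread the one dispatch-bounded ctx through both the retry wrapper and the leader check, so no step of FLUSHDB outlives the command budget. A hedged sketch with the wrapper and dispatch passed in as callbacks, since the adapter's internals are not shown here:

```go
package kv

import (
	"context"
	"time"
)

// flushBounded threads a single budget-bounded ctx through the retry wrapper
// and the leader check; retry and dispatch stand in for retryRedisWrite and
// the actual flush dispatch.
func flushBounded(
	parent context.Context, c Coordinator, budget time.Duration,
	retry func(context.Context, func() error) error,
	dispatch func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	return retry(ctx, func() error {
		if err := c.VerifyLeader(ctx); err != nil {
			return err
		}
		return dispatch(ctx)
	})
}
```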
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
kv/transaction.go (1)
237-246: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Use a non-canceled context for transactional cleanup on commit failure.

At line 243, `Abort` reuses the caller ctx. If the original failure is `context deadline exceeded`/canceled, cleanup can be skipped, leaving intents behind and prolonging lock contention.

💡 Proposed fix:

```diff
 if err != nil {
 	// Only attempt transactional cleanup for transactional batches. Raw request
 	// batches may partially succeed across shards by design. One-phase
 	// transactional requests do not leave intents behind, so they do not need
 	// abort cleanup on failure.
 	if needsTxnCleanup(reqs) {
-		_, _err := t.Abort(ctx, reqs)
+		cleanupCtx := ctx
+		var cancel context.CancelFunc
+		if cleanupCtx == nil || cleanupCtx.Err() != nil {
+			cleanupCtx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
+			defer cancel()
+		}
+		_, _err := t.Abort(cleanupCtx, reqs)
 		if _err != nil {
 			return nil, errors.WithStack(errors.CombineErrors(err, _err))
 		}
 	}
 	return nil, errors.WithStack(err)
 }
```

As per coding guidelines, kv/transaction.go: Ensure cross-shard transaction atomicity is maintained and OCC commit-ts ordering is correct.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/transaction.go` around lines 237 - 246, The Abort call in the transactional cleanup currently reuses the caller ctx which may be canceled or timed out, causing cleanup to be skipped; change the cleanup path inside the needsTxnCleanup handling to detect if ctx.Err() != nil and, if so, run t.Abort using a fresh non-canceled context (e.g., context.Background() or a context.WithTimeout(context.Background(), cleanupTimeout) and cancel it after use) when invoking t.Abort(ctxForCleanup, reqs) so cleanup runs even if the original ctx was canceled.
🧹 Nitpick comments (1)
kv/transaction_batch_test.go (1)
84-154: ⚡ Quick win — Add a regression test for canceled-context commit behavior.
The API is now context-aware, but this suite doesn’t yet pin cancellation/deadline behavior (especially the batched raw-commit wait semantics). Adding one focused case would prevent silent contract drift.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/transaction_batch_test.go` around lines 84 - 154, Add a new test (e.g., TestTransactionManagerBatchesConcurrentRawCommits_CancelContext) that verifies Commit respects context cancellation while waiting in the raw batch window: set rawBatchWindow to a non-trivial duration, create two concurrent Commit calls via NewTransactionWithProposer(r) where one Commit is passed a context that you cancel before the batch window elapses and the other uses a background context; assert the canceled Commit returns a context.Canceled (or context.DeadlineExceeded) error and does not write its key (use st.GetAt to verify), while the non-canceled Commit succeeds and its key is present; use the same setup symbols from the existing test (rawBatchWindow, newSingleRaft, NewKvFSMWithHLC, NewTransactionWithProposer, tm.Commit, store.GetAt) and clean up/reset rawBatchWindow in t.Cleanup.
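A hedged sketch of the simplest such test: a Commit whose ctx is already canceled should return early with `context.Canceled`. The helpers (`newSingleRaft`, `NewTransactionWithProposer`) are repo-internal names quoted in the prompt above, taken on faith; `putReq` is hypothetical:

```go
package kv

import (
	"context"
	"errors"
	"testing"
)

func TestCommitReturnsEarlyOnCanceledContext(t *testing.T) {
	r := newSingleRaft(t)
	tm := NewTransactionWithProposer(r)

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the caller's deadline is already gone before Commit runs

	_, err := tm.Commit(ctx, []*Request{putReq("k", "v")})
	if !errors.Is(err, context.Canceled) {
		t.Fatalf("want context.Canceled, got %v", err)
	}
}
```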
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@adapter/dynamodb_test.go`:
- Around line 1834-1836: The testCoordinatorWrapper.VerifyLeader method
incorrectly discards the caller ctx and uses context.Background(); update
testCoordinatorWrapper.VerifyLeader to forward the passed ctx to
w.inner.VerifyLeader(ctx) (matching how VerifyLeaderForKey forwards context) so
caller deadlines and cancellation propagate through the wrapper to the inner
coordinator.
In `@docs/design/2026_05_10_proposed_kv_ctx_plumbing.md`:
- Around line 90-98: The doc uses outdated method names VerifyLeaderCtx and
VerifyLeaderForKeyCtx; update all references and function signatures to the
merged API names VerifyLeader(ctx context.Context) error and
VerifyLeaderForKey(ctx context.Context, key []byte) error (and note that any
no-arg wrappers remain the no-ctx entry points that call these with
context.Background()); update the occurrences around the current function
listings and the later mention (previously lines referencing VerifyLeaderCtx /
VerifyLeaderForKeyCtx) to match the new symbols so tests and follow-up work are
not misdirected.
---
Outside diff comments:
In `@kv/transaction.go`:
- Around line 237-246: The Abort call in the transactional cleanup currently
reuses the caller ctx which may be canceled or timed out, causing cleanup to be
skipped; change the cleanup path inside the needsTxnCleanup handling to detect
if ctx.Err() != nil and, if so, run t.Abort using a fresh non-canceled context
(e.g., context.Background() or a context.WithTimeout(context.Background(),
cleanupTimeout) and cancel it after use) when invoking t.Abort(ctxForCleanup,
reqs) so cleanup runs even if the original ctx was canceled.
---
Nitpick comments:
In `@kv/transaction_batch_test.go`:
- Around line 84-154: Add a new test (e.g.,
TestTransactionManagerBatchesConcurrentRawCommits_CancelContext) that verifies
Commit respects context cancellation while waiting in the raw batch window: set
rawBatchWindow to a non-trivial duration, create two concurrent Commit calls via
NewTransactionWithProposer(r) where one Commit is passed a context that you
cancel before the batch window elapses and the other uses a background context;
assert the canceled Commit returns a context.Canceled (or
context.DeadlineExceeded) error and does not write its key (use st.GetAt to
verify), while the non-canceled Commit succeeds and its key is present; use the
same setup symbols from the existing test (rawBatchWindow, newSingleRaft,
NewKvFSMWithHLC, NewTransactionWithProposer, tm.Commit, store.GetAt) and clean
up/reset rawBatchWindow in t.Cleanup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d54ecd57-dc97-4a30-a539-2ad77b01ed1b
📒 Files selected for processing (45)
adapter/distribution_server.go, adapter/distribution_server_test.go, adapter/dynamodb.go, adapter/dynamodb_admin.go, adapter/dynamodb_test.go, adapter/internal.go, adapter/redis.go, adapter/redis_compat_commands.go, adapter/redis_compat_helpers.go, adapter/redis_hello_test.go, adapter/redis_info_test.go, adapter/redis_keys_pattern_test.go, adapter/redis_retry_test.go, adapter/s3.go, adapter/s3_admin.go, adapter/s3_test.go, adapter/sqs.go, adapter/sqs_admin.go, docs/design/2026_05_10_proposed_kv_ctx_plumbing.md, internal/admin/router.go, internal/admin/router_test.go, kv/coordinator.go, kv/coordinator_leader_test.go, kv/coordinator_retry_test.go, kv/coordinator_txn_test.go, kv/leader_proxy.go, kv/leader_proxy_test.go, kv/leader_routed_store.go, kv/leader_routed_store_test.go, kv/lock_resolver.go, kv/lock_resolver_test.go, kv/shard_router.go, kv/shard_router_partition_test.go, kv/shard_router_test.go, kv/shard_store.go, kv/shard_store_txn_lock_test.go, kv/sharded_coordinator.go, kv/sharded_coordinator_abort_test.go, kv/sharded_coordinator_leader_test.go, kv/sharded_coordinator_sampler_test.go, kv/sharded_coordinator_txn_test.go, kv/sharded_lease_test.go, kv/transaction.go, kv/transaction_batch_test.go, main_admin.go
… bounded redis (PR #749 r1)

Round-1 review on commit d11258a:

P1 (Codex): adapter/redis.go::keys() switched from `VerifyLeader()` (the 5s `verifyLeaderEngine` guard) to `VerifyLeader(r.handlerContext())`, but handlerContext is the server's long-lived baseCtx with no deadline. A stalled ReadIndex on KEYS could now hang the command handler indefinitely. Fixed at TWO layers:

1. `verifyLeaderEngineCtx` (kv/raft_engine.go) now applies a default `verifyLeaderTimeout` when the inbound ctx has no deadline. This covers handlerContext, future Background-passing callers, and anything the audit missed. Callers with a tighter deadline keep theirs because `context.WithTimeout` returns the earlier of the two expirations.
2. `keys()` now wraps `r.handlerContext()` with `redisDispatchTimeout` (matching every other Redis dispatch path) so its overall command budget is the per-call timeout, not 5s.

P2 (Codex): adapter/redis_compat_commands.go::flushDatabase had a `redisDispatchTimeout`-bounded ctx for retries but called `VerifyLeader(r.handlerContext())` — leader verification escaped the flush command's deadline. Now uses the per-call ctx so the entire FLUSHDB path is bounded as designed.

Major (CodeRabbit): adapter/dynamodb_test.go::testCoordinatorWrapper swallowed the caller ctx and substituted `context.Background()`, silently undoing this PR's deadline propagation for any test routing through the wrapper. Now forwards ctx unchanged.

Medium (gemini): commitSequential's Abort cleanup and dispatchTxn's abortPreparedTxn cleanup both used the (likely already-cancelled) caller ctx, risking a no-op cleanup that leaves intents until LockResolver picks them up at a future tick. Both paths now use `context.WithTimeout(context.WithoutCancel(ctx), verifyLeaderTimeout)` so cleanup MUST go through, bounded at 5s. Same pattern monitoring/* uses for shutdown ctx.

Medium (gemini): VerifyLeader should enforce a default timeout when the input ctx has no deadline — covered by the `verifyLeaderEngineCtx` guard described above. The Coordinator.VerifyLeader / VerifyLeaderForKey methods themselves stay clean (they just forward ctx); the deadline discipline is centralized in the helper.

Minor (CodeRabbit): docs/design/2026_05_10_proposed_kv_ctx_plumbing.md described pre-merge VerifyLeaderCtx / VerifyLeaderForKeyCtx names. Updated to match the merged surface (VerifyLeader(ctx) / VerifyLeaderForKey(ctx, key)) and the new internal-guard mechanism.

reviewdog gci: adapter/redis_hello_test.go and redis_info_test.go import groups merged into the project's single non-stdlib group via golangci-lint fmt.

Caller audit (semantic change in verifyLeaderEngineCtx — adding the default-deadline wrap):

- All in-tree callers: kv/coordinator.go::VerifyLeader, kv/sharded_coordinator.go::VerifyLeader/ForKey, kv/leader_proxy.go::Commit/Abort.
- Behavior for callers that already pass a bounded ctx: unchanged.
- Behavior for callers passing Background / long-lived ctx (handlerContext, future audits): now bounded at 5s — a strict improvement, matching the pre-PR-749 default behaviour callers expected from the no-arg verifyLeaderEngine wrapper.
- No caller depends on "wait forever" semantics; the Redis KEYS call site that triggered Codex P1 is the canonical example.

Test: `go test -race -count=1 -short ./kv ./internal/admin` — 7.7s, all green.
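The "earlier of the two expirations" claim is the standard library contract for nested deadlines and can be checked in isolation:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// A caller with a tight 1s budget...
	parent, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	// ...wrapped by a looser 5s guard keeps its own, earlier deadline.
	child, cancel2 := context.WithTimeout(parent, 5*time.Second)
	defer cancel2()

	d, _ := child.Deadline()
	fmt.Printf("effective budget: ~%v\n", time.Until(d).Round(time.Second)) // ~1s, not 5s
}
```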
@claude review
Claude finished @bootjp's task in 4m 1s

PR Review: refactor(kv): plumb caller context through write + verify-leader paths

Summary: The core goal of this PR — threading the caller's context through the write and verify-leader paths — is sound. Two issues need addressing before merge: one CI blocker and one correctness gap.

Issues:

1. CI blocker — gci lint failures in the adapter test files; golangci-lint must pass.
2. Correctness gap — prewriteTxn's inline abort paths still pass the caller's ctx to abortPreparedTxn (quoted in full in the follow-up below).
Round-2 review on commit fe8162e from claude bot:

> Correctness gap — prewriteTxn inline abort paths still use caller ctx
>
> The PR correctly applies context.WithoutCancel at both cleanup sites in dispatchTxn (for commitPrimaryTxn failure) and commitSequential (for t.Abort). However, the two inline abort calls inside prewriteTxn still pass the caller's ctx to abortPreparedTxn:
>
> * sharded_coordinator.go:494 (prepare-loop Commit failure)
> * sharded_coordinator.go:504 (validateReadOnlyShards failure)
>
> When ctx expires (e.g., Redis client disconnects mid-prewrite), both the shard commit and the subsequent abortPreparedTxn fail immediately with context.Canceled, leaving prepared intents behind. The defense-in-depth in verifyLeaderEngineCtx does not help: it only applies the 5 s fallback when there is no deadline, but a cancelled context's cancellation propagates immediately regardless.

Fixed both prewriteTxn cleanup sites the same way round-1 patched dispatchTxn / commitSequential:

```go
cleanupCtx, cancel := context.WithTimeout(
    context.WithoutCancel(ctx), verifyLeaderTimeout)
c.abortPreparedTxn(cleanupCtx, ...)
cancel()
```

Now any prewrite intent we already persisted on prior shards is guaranteed an Abort attempt within 5 s, regardless of whether the upstream cancellation cascade comes from the client (ctx expired) or from a slow shard (Commit returned ctx.Err).

The other Claude bot finding (CI gci) was already addressed in round-2 (fe8162e) — golangci-lint passes on the current commit. The Jepsen test failure on the round-2 CI run is a Maven dependency-resolution flake (com.cognitect.aws:dynamodb / com.cognitect.aws:sqs / org.slf4j:slf4j-* checksum failures during dependency download), unrelated to the diff.

Non-blocking observation acknowledged: rangeList / readValueAt in adapter/redis.go still pass r.handlerContext() to VerifyLeaderForKey rather than a per-call ctx. Those calls are correctly bounded at 5 s by verifyLeaderEngineCtx's default-deadline guard added in r1, so they are not a regression vs the pre-PR-749 behaviour. Tightening them to the per-call dispatch budget (consistent with the keys() fix) belongs to the follow-up Redis-adapter audit pass already noted in the design doc — not in scope for this PR.

Caller audit (semantic change in prewriteTxn cleanup):

- Only call sites of abortPreparedTxn that previously used a potentially-cancelled ctx now derive a detached cleanup ctx.
- abortPreparedTxn itself is unchanged; it takes the ctx it is given and forwards it to g.Txn.Commit. The caller-side detachment is the fix.
- No other prewriteTxn paths reach abortPreparedTxn.

Test: `go test -race -count=1 -short ./kv` — 6.2s, all green.
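The cancellation reasoning above is standard `context` behaviour: a deadline derived from an already-cancelled parent is dead on arrival, which is why only `context.WithoutCancel` helps. A self-contained check:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // e.g. the client disconnected mid-prewrite

	// Deriving a timeout from a cancelled parent yields an already-dead child;
	// only context.WithoutCancel(parent) breaks the cancellation chain.
	child, cancel2 := context.WithTimeout(parent, 5*time.Second)
	defer cancel2()

	fmt.Println(child.Err()) // context.Canceled, immediately
}
```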
@claude review
Claude finished @bootjp's task in 4m 40s

PR Review: refactor(kv): plumb caller context through write + verify-leader paths

Summary: This is a re-review after two follow-up commits addressing the round-1 and round-2 findings.

Issues addressed since round 1:

- ✅ gci lint failures (adapter/redis_hello_test.go, adapter/redis_info_test.go import grouping) — golangci-lint passes on the current commit.
- ✅ prewriteTxn inline abort paths now derive a detached, 5s-bounded cleanup ctx.
Summary
Follow-up to #745. PR #745 capped `verifyLeaderEngine`'s no-context call site at 5s as an incident hotfix; this PR plumbs the caller's `context.Context` end-to-end through the kv write path and the verify-leader path so callers that hold a request deadline see THEIR deadline honoured, not the 5s safety bound.

The 5s bound stays in place as defense-in-depth for callers without an upstream context (LockResolver, HLC lease tick).
What changed
Interface signatures (kv-internal):

- `Transactional.Commit(ctx, reqs)` / `Transactional.Abort(ctx, reqs)`
- `Coordinator.VerifyLeader(ctx)` / `Coordinator.VerifyLeaderForKey(ctx, key)`
- `LeaderProbe.IsVerifiedLeader(ctx)`
Plumbing chains (every step now carries ctx):

- Dispatch: `ShardedCoordinator.Dispatch(ctx)` → `dispatchTxn(ctx)` → shard commit/abort helpers → `Transactional.Commit/Abort(ctx, reqs)` → `applyRequests(ctx)` → `proposer.Propose(ctx)` → `verifyLeaderEngineCtx(ctx)`
- Lock resolver / shard store: `LockResolver.resolveExpiredLock(ctx)` → `applyTxnResolution(ctx)` / `ShardStore.applyScanLockResolutions(ctx)` / `ShardStore.tryAbortExpiredPrimary(ctx)` → `g.Txn.Commit(ctx)`
- Leader probe / healthz: adapter `/healthz/leader` handlers feed `r.Context()` into `VerifyLeader` / `IsVerifiedLeader`
Also fixes the inline `context.Background()` at kv/transaction.go:152 (`proposer.Propose`) — same shape as the original `verifyLeaderEngine` bug, on the propose path.

Behaviour
For callers with upstream deadlines:
- Redis BLPOP `timeout=2s` whose dispatch lands on a slow ReadIndex → fails after 2s, its own deadline
- Caddy active health probe with `timeout=1s` → fails after 1s
- Any `Dispatch(ctx)` with an explicit deadline → that deadline governs

For background callers without upstream ctx:
They go through `verifyLeaderEngine()` (the no-arg variant), which still wraps with `context.WithTimeout(context.Background(), verifyLeaderTimeout)`. The 5s bound from PR #745 (fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline) is unchanged for them.

Batched-commit caveat
`TransactionManager.commitRaw` merges many callers into a single propose via a separate goroutine, so no single ctx can bound the underlying `applyRequests`. The fix: `commitRaw` selects between `item.done` and `ctx.Done()` so per-caller cancellation works (the caller exits early; the propose still completes; other waiters in the same batch get their results normally). The goroutine's `applyRequests` call uses `context.Background` by design, documented in the code.
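A minimal sketch of that wait-site shape, reusing the `Request`/`Response` stand-ins from earlier; all field and type names here are hypothetical, and only the select-on-done-or-ctx structure reflects the description above:

```go
package kv

import "context"

type batchItem struct {
	reqs []*Request
	resp *Response
	err  error
	done chan struct{} // closed by the batching goroutine after the propose
}

type TransactionManager struct {
	batch chan *batchItem // drained by one goroutine that merges items and
	// proposes with context.Background() by design
}

func (tm *TransactionManager) commitRawSketch(ctx context.Context, reqs []*Request) (*Response, error) {
	item := &batchItem{reqs: reqs, done: make(chan struct{})}
	tm.batch <- item

	select {
	case <-item.done:
		return item.resp, item.err
	case <-ctx.Done():
		// The abandoned caller exits early; the shared propose continues and
		// other waiters in the same batch still get their results.
		return nil, ctx.Err()
	}
}
```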
Self-review (5 lenses)

1. Data loss — none; `verifyLeaderEngineCtx` is the existing path; only the caller of the no-arg variant changes.
2. Concurrency — the `commitRaw` goroutine intentionally uses Background to avoid mixing per-caller deadlines into a shared batched propose; documented.
3. Performance — the `context.WithTimeout` wrapping in the no-ctx fallback is unchanged.
4. Data consistency — shorter caller deadlines surface `ErrLeaderNotFound` earlier instead of waiting out the 5s safety bound.
5. Test coverage — every stub gains the ctx parameter (currently `_ context.Context`, ready for future tests asserting cancel propagation). 16 stubs / signatures updated.
Test plan

`go test -race -count=1 -short ./kv ./adapter ./internal/admin` — all green

Follow-up scope (not in this PR)
The audit also surfaced ~210 `context.Background()` usages in tree, mostly concentrated in the adapter layer. These are "easy win" candidates where a `ctx` is in scope but `Background()` was used instead. Not bundled here to keep the PR reviewable; a follow-up "audit pass" PR targeting Redis adapter helpers is the natural next step.

The single `context.TODO()` (kv/fsm.go) is structural — `raftApply()` does not provide a ctx — and is left as is.

Design doc

docs/design/2026_05_10_proposed_kv_ctx_plumbing.md — included in this PR.

Summary by CodeRabbit

- Refactor: `Transactional`, `Coordinator`, and `LeaderProbe` now take a `context.Context`, so caller deadlines and cancellation propagate through the write and verify-leader paths.
- Documentation: adds the ctx-plumbing design doc.