refactor(kv): plumb caller context through write + verify-leader paths#749

Merged
bootjp merged 3 commits into main from fix/verify-leader-context-plumbing on May 10, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented May 9, 2026

Summary

Follow-up to #745. PR #745 capped verifyLeaderEngine's no-context call site at 5s as an incident hotfix; this PR plumbs the caller's context.Context end-to-end through the kv write path and the verify-leader path, so callers that hold a request deadline see their own deadline honoured, not the 5s safety bound.

The 5s bound stays in place as defense-in-depth for callers without an upstream context (LockResolver, HLC lease tick).
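For illustration, the no-ctx fallback keeps roughly this shape (a minimal sketch assuming the names used in this PR; the engine parameter type is illustrative):

func verifyLeaderEngine(e raftEngine) error {
    // Background callers get the 5s safety bound; callers that hold a
    // deadline go through verifyLeaderEngineCtx directly.
    ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
    defer cancel()
    return verifyLeaderEngineCtx(ctx, e)
}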

What changed

Interface signatures (kv-internal):

type Transactional interface {
    Commit(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
    Abort(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
}

type Coordinator interface {
    // …
    VerifyLeader(ctx context.Context) error
    VerifyLeaderForKey(ctx context.Context, key []byte) error
}

type LeaderProbe interface {
    IsVerifiedLeader(ctx context.Context) bool
}
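As a caller-side sketch (illustrative handler, not code from this PR):

func handleWrite(ctx context.Context, c Coordinator, key []byte) error {
    // A caller-held 2s budget now reaches the ReadIndex round-trip instead
    // of being replaced by the 5s safety bound.
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    if err := c.VerifyLeaderForKey(ctx, key); err != nil {
        return err
    }
    // ... dispatch the write under the same ctx ...
    return nil
}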

Plumbing chains (every step now carries ctx):

Dispatch path:
  ShardedCoordinator.Dispatch(ctx)
    → dispatchTxn(ctx) / dispatchSingleShardTxn(ctx) /
      commitPrimaryTxn(ctx) / commitSecondaryTxns(ctx) /
      abortPreparedTxn(ctx) / broadcastToAllGroups(ctx) /
      dispatchDelPrefixBroadcast(ctx)
    → Transactional.Commit/Abort(ctx)
    → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter
    → applyRequests(ctx) → proposer.Propose(ctx)
    → verifyLeaderEngineCtx(ctx)

Lock-resolver / shard-store path:
  LockResolver.resolveExpiredLock(ctx)
    → applyTxnResolution(ctx) / applyScanLockResolutions(ctx) /
      tryAbortExpiredPrimary(ctx)
    → g.Txn.Commit(ctx)

Healthz:
  S3 / SQS / DynamoDB /healthz/leader handlers
    → isVerifiedSQSLeader(ctx, …) / isVerifiedDynamoLeader(ctx, …) /
      S3Server.isVerifiedS3Leader(ctx) /
      DistributionServer.verifyCatalogLeader(ctx)
    → coordinator.VerifyLeader(ctx) / VerifyLeaderForKey(ctx, key)

Admin LeaderProbe:
  /admin/healthz/leader handler → LeaderProbe.IsVerifiedLeader(r.Context())
    → coordinate.VerifyLeader(ctx)
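A minimal sketch of that wiring (the route and the LeaderProbe interface are from this PR; the handler body is illustrative):

func leaderHealthz(probe LeaderProbe) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // r.Context() ends when the client disconnects or times out, so the
        // probe no longer waits out the 5s safety bound on the client's behalf.
        if !probe.IsVerifiedLeader(r.Context()) {
            http.Error(w, "not leader", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}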

Also fixes the inline context.Background() at kv/transaction.go:152 (proposer.Propose) — same shape as the original verifyLeaderEngine bug, on the propose path.

Behaviour

For callers with upstream deadlines:

| caller | before | after |
| --- | --- | --- |
| Redis BLPOP timeout=2s on slow ReadIndex | fails after 5s (verifyLeaderTimeout) | fails after 2s (caller's deadline) |
| Caddy active probe timeout=1s | fails after 5s | fails after 1s |
| gRPC Dispatch(ctx) with explicit deadline | dropped to 5s | honours ctx deadline |

For background callers without an upstream ctx (LockResolver loops, HLC lease ticks), behaviour is unchanged: the no-arg path still applies the 5s verifyLeaderTimeout safety bound.

Batched-commit caveat

TransactionManager.commitRaw merges many callers into a single propose via a separate goroutine, so no single ctx can bound the underlying applyRequests. The fix:

  • Wait site in commitRaw selects between item.done and ctx.Done() so per-caller cancellation works (caller exits early; the propose still completes; other waiters in the same batch get their results normally); see the sketch after this list.
  • Goroutine's applyRequests call uses context.Background by design, documented in the code.
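A minimal sketch of that wait-site shape (the item fields and error wrapping are assumptions, not the verbatim commitRaw code):

select {
case <-item.done:
    // The batch goroutine finished the shared propose; read this caller's result.
    return item.resp, item.err
case <-ctx.Done():
    // This caller stops waiting; the shared propose keeps running and the
    // other waiters in the batch still receive their results.
    return nil, errors.WithStack(ctx.Err())
}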

Self-review (5 lenses)

  1. Data loss — none. ctx is value-passed; no persisted state changes. verifyLeaderEngineCtx is the existing path; only the caller of the no-arg variant changes.
  2. Concurrency — ctx flows through the existing locking unchanged. commitRaw goroutine intentionally uses Background to avoid mixing per-caller deadlines into a shared batched propose; documented.
  3. Performance — same call count, no extra allocations beyond the ctx value passed by reference. context.WithTimeout wrapping in the no-ctx fallback is unchanged.
  4. Data consistency — verify is a freshness check, not a write path. Shorter caller deadlines surface ErrLeaderNotFound earlier instead of waiting out the 5s safety bound.
  5. Test coverage — every Transactional / Coordinator / LeaderProbe stub gains the ctx parameter (currently _ context.Context, ready for future tests asserting cancel propagation). 16 stubs / signatures updated; an illustrative stub shape is sketched below.
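For example, a stub now has roughly this shape (illustrative; the real stubs listed in the commit message vary):

type stubTransactional struct {
    resp *TransactionResponse
}

// ctx is accepted but deliberately unused for now, leaving room for
// future tests that assert cancel propagation.
func (s *stubTransactional) Commit(_ context.Context, _ []*pb.Request) (*TransactionResponse, error) {
    return s.resp, nil
}

func (s *stubTransactional) Abort(_ context.Context, _ []*pb.Request) (*TransactionResponse, error) {
    return s.resp, nil
}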

Test plan

  • go test -race -count=1 -short ./kv ./adapter ./internal/admin — all green
    • kv: 6.6s
    • adapter: 565s (long Jepsen-shaped tests)
    • internal/admin: 1.4s
  • No semantic change beyond the ctx propagation; all existing tests pass without modification of their assertions.

Follow-up scope (not in this PR)

The audit also surfaced ~210 context.Background() usages in tree, mostly concentrated in:

| file | count | nature |
| --- | --- | --- |
| adapter/redis_compat_commands.go | 66 | dispatch helpers, internal commands |
| adapter/redis_lua_context.go | 25 | Lua script context loaders |
| adapter/redis.go | 25 | handler context fallback paths |
| proxy/* | 28 | proxy package, separate concern |

These are "easy win" candidates where a ctx is in scope but Background() was used instead. Not bundled here to keep the PR reviewable; a follow-up "audit pass" PR targeting Redis adapter helpers is the natural next step.
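Illustrative shape of one such easy win (hypothetical call site; the real helpers vary):

// Before: a per-command ctx is in scope but a fresh Background ctx is used.
resp, err := r.coordinator.Dispatch(context.Background(), reqs)

// After: the in-scope ctx carries the command's deadline downstream.
resp, err := r.coordinator.Dispatch(ctx, reqs)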

The single context.TODO() (kv/fsm.go) is structural — raft Apply() does not provide a ctx — and is left as is.

Design doc

docs/design/2026_05_10_proposed_kv_ctx_plumbing.md — included in this PR.

Summary by CodeRabbit

  • Refactor

    • Updated internal APIs to propagate request context through leader verification and transaction operations, enabling request deadlines to be respected instead of using fixed internal timeouts.
  • Documentation

    • Added design document outlining context plumbing through KV write paths and leader-probe verification across adapters.

Review Change Stack

Follow-up to #745 (verifyLeaderEngine 5s deadline incident hotfix).
PR #745 capped the no-context verify path at 5s as defense-in-depth;
this PR plumbs the caller's context.Context end-to-end through:

  Dispatch path:
    ShardedCoordinator.Dispatch(ctx) → dispatchTxn(ctx) →
      dispatchSingleShardTxn(ctx) /
      commitPrimaryTxn(ctx) /
      commitSecondaryTxns(ctx) /
      abortPreparedTxn(ctx) /
      broadcastToAllGroups(ctx) /
      dispatchDelPrefixBroadcast(ctx)
    → Transactional.Commit/Abort(ctx, reqs)
    → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter
    → applyRequests(ctx) → proposer.Propose(ctx)
    → verifyLeaderEngineCtx(ctx)

  Lock-resolver / shard-store path:
    LockResolver.resolveExpiredLock(ctx) → applyTxnResolution(ctx) /
      ShardStore.applyScanLockResolutions(ctx) /
      ShardStore.tryAbortExpiredPrimary(ctx)
    → g.Txn.Commit(ctx)

  Leader-probe / healthz path:
    Coordinator.VerifyLeader(ctx) /
    Coordinator.VerifyLeaderForKey(ctx, key)
    LeaderProbe.IsVerifiedLeader(ctx)
    adapter S3/SQS/DynamoDB /healthz/leader handlers feed r.Context()

  Adapter healthz helpers:
    isVerifiedSQSLeader(ctx, coordinator)
    isVerifiedDynamoLeader(ctx, coordinator)
    S3Server.isVerifiedS3Leader(ctx)
    DistributionServer.verifyCatalogLeader(ctx)

  Admin LeaderProbe wiring:
    main_admin.go newAdminLeaderProbe → coordinate.VerifyLeader(ctx)

After: a Redis BLPOP timeout=2s whose dispatch lands on a slow
ReadIndex now fails after 2s (its own deadline), not after the 5s
verifyLeaderTimeout safety bound. A Caddy active health probe with a
1s budget likewise fails after 1s. The 5s bound stays in place as
defense-in-depth for the no-arg verifyLeaderEngine() — still hit by
LockResolver background loops, HLC lease ticks, and any future
internal caller that genuinely cannot inherit a deadline.

Also fixes proposer.Propose at kv/transaction.go:152 which used
context.Background() inline; same shape as the original
verifyLeaderEngine bug, just on the propose path. Now plumbed.

Batched commit caveat: TransactionManager.commitRaw merges many
callers into a single propose via a separate goroutine, so no
single ctx can bound the underlying applyRequests. The wait site
in commitRaw selects between item.done and ctx.Done so per-caller
cancellation still works (the propose continues, the abandoned
caller exits early); the goroutine's applyRequests call uses
context.Background by design and is documented as such.

Test stub updates (3 ripples through the Transactional interface,
6 ripples through Coordinator, 2 through LeaderProbe):
  kv/coordinator_txn_test.go::stubTransactional
  kv/coordinator_retry_test.go::scriptedTransactional
  kv/sharded_coordinator_txn_test.go::recordingTransactional
  kv/sharded_coordinator_abort_test.go::failingTransactional
  kv/sharded_lease_test.go::fixedTransactional
  kv/shard_router_test.go::fakeTM
  kv/shard_router_partition_test.go::fakeTxn
  kv/leader_routed_store_test.go::stubLeaderCoordinator
  adapter/redis_retry_test.go::retryOnceCoordinator
  adapter/redis_info_test.go::infoTestCoordinator
  adapter/redis_hello_test.go::helloTestCoordinator
  adapter/redis_keys_pattern_test.go::stubAdapterCoordinator
  adapter/distribution_server_test.go::distributionCoordinatorStub
  adapter/dynamodb_test.go::testCoordinatorWrapper
  adapter/s3_test.go::routeAwareS3Coordinator + followerS3Coordinator
  internal/admin/router_test.go::LeaderProbeFunc literal forms

Self-review (5 lenses):
1. Data loss — none. ctx is value-passed; no persisted state changes.
   verifyLeaderEngineCtx already existed and is what we now route to;
   only the caller of the no-arg variant changes (background loops
   only).
2. Concurrency — ctx flows through the existing locking unchanged.
   commitRaw goroutine intentionally uses Background to avoid mixing
   per-caller deadlines into a shared batched propose; documented.
3. Performance — same call count, no extra allocations beyond the
   ctx value passed by reference. context.WithTimeout wrapping in the
   no-ctx fallback is unchanged.
4. Data consistency — verify is a freshness check, not a write path.
   Shorter caller deadlines surface ErrLeaderNotFound earlier instead
   of waiting out the 5s safety bound.
5. Test coverage — every Transactional / Coordinator / LeaderProbe
   stub gains the ctx parameter (currently `_ context.Context`,
   ready for future tests asserting cancel propagation).

Test:
  go test -race -count=1 -short ./kv ./adapter ./internal/admin
  -- 9.6m total, all green (adapter is the long pole on Jepsen-shaped
  tests; kv 6.6s, admin 1.4s).

Design doc: docs/design/2026_05_10_proposed_kv_ctx_plumbing.md.
@coderabbitai

coderabbitai Bot commented May 9, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a40f421-3ce9-4c3f-98ba-b6f6d3f24f01

📥 Commits

Reviewing files that changed from the base of the PR and between d11258a and 13b8707.

📒 Files selected for processing (9)
  • adapter/dynamodb_test.go
  • adapter/redis.go
  • adapter/redis_compat_commands.go
  • adapter/redis_hello_test.go
  • adapter/redis_info_test.go
  • docs/design/2026_05_10_proposed_kv_ctx_plumbing.md
  • kv/raft_engine.go
  • kv/sharded_coordinator.go
  • kv/transaction.go
📝 Walkthrough

This PR threads context.Context through KV transaction commit/abort operations, leader verification checks, and admin leader-probe interfaces, ensuring request deadlines are respected instead of blocking on internal timeouts. Adapters, test stubs, and resolution paths are updated to propagate the context consistently.

Changes

Context Plumbing for KV Write and Leader Verification

| Layer | File(s) | Summary |
| --- | --- | --- |
| Interface Contracts | kv/transaction.go, kv/coordinator.go, internal/admin/router.go | Transactional.Commit/Abort, Coordinator.VerifyLeader/VerifyLeaderForKey, and LeaderProbe.IsVerifiedLeader now accept context.Context as the first parameter. |
| Transaction Manager | kv/transaction.go | TransactionManager.Commit and Abort propagate context through applyRequests to proposer.Propose; callers can abandon waiting for batched results via context cancellation while the batched propose continues in background. |
| Coordinator Dispatch | kv/coordinator.go, kv/sharded_coordinator.go | dispatchOnce and dispatch paths route context through dispatchTxn/dispatchRaw to transactionManager.Commit; VerifyLeader/VerifyLeaderForKey accept context; multi-shard PREPARE/COMMIT, primary/secondary commit, and abort paths all propagate context. |
| Leader Proxy & Routing | kv/leader_proxy.go, kv/shard_router.go, kv/leader_routed_store.go | LeaderProxy.Commit/Abort and ShardRouter.Commit/Abort accept context; leader-proxy retry budget derives its parent context from the caller context instead of context.Background(); leader-routed store passes context to VerifyLeaderForKey. |
| Admin Leader Probe | internal/admin/router.go, main_admin.go | LeaderProbe.IsVerifiedLeader becomes context-aware; /admin/healthz/leader handler passes r.Context() to the probe; newAdminLeaderProbe wires coordinator verification with request context. |
| Adapter Leader Verification | adapter/redis.go, adapter/redis_compat_commands.go, adapter/redis_compat_helpers.go, adapter/s3.go, adapter/s3_admin.go, adapter/sqs.go, adapter/sqs_admin.go, adapter/dynamodb.go, adapter/dynamodb_admin.go, adapter/distribution_server.go | Redis handler context and S3/SQS/DynamoDB/Distribution request contexts are passed into leader verification calls (VerifyLeader(ctx) or helper functions like isVerifiedS3Leader(ctx)). |
| Shard Transaction Resolution | kv/shard_store.go, kv/lock_resolver.go | Transaction and lock resolution helpers (applyTxnResolution, resolveScanLocks, resolveExpiredLock) updated to accept and propagate context through transaction commits. |
| Test Infrastructure | adapter/*_test.go, kv/*_test.go, internal/admin/router_test.go | All test stubs (distributionCoordinatorStub, testCoordinatorWrapper, stubLeaderCoordinator, fakeTxn, fakeTM, recordingTransactional, fixedTransactional, and S3 test doubles) updated to accept context; test callsites pass context.Background() where context was previously unused. |
| Admin Router Tests | internal/admin/router_test.go | All leader-probe callbacks updated to accept a context.Context parameter; test wiring for the /admin/healthz/leader endpoint and HTTP method handling updated accordingly. |
| Design Document | docs/design/2026_05_10_proposed_kv_ctx_plumbing.md | Describes the problem (blocking on internal timeouts instead of respecting request deadlines), goals (thread context through KV and leader paths while preserving the 5s no-ctx bound), implementation surface, and behavioral effects. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • bootjp/elastickv#618: Modifies coordinator dispatch and leader-proxy paths, overlapping with context propagation changes in this PR.
  • bootjp/elastickv#745: Bounds the no-context verifyLeaderEngine path with a 5s timeout, complementing this PR's context-aware leader verification.
  • bootjp/elastickv#539: Modifies S3 leader health check and leader verification logic that this PR updates to be context-aware.

Poem

🐰 Through the KV paths we hop and bound,
With context trails that flow around,
No more timeouts blocking the way,
Request deadlines lead the day!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.44%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: plumbing caller context through write and verify-leader paths in the KV module, which is the core objective of this refactoring PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request plumbs context.Context through the KV write and leadership verification paths, allowing request-specific deadlines to govern operations and preventing resource exhaustion during stalls. The changes update the Transactional, Coordinator, and LeaderProbe interfaces to accept context parameters across the system's adapters and internal managers. Feedback focuses on ensuring that transaction cleanup (Abort) operations use detached contexts to prevent orphaned locks when a caller's context has already expired. Additionally, it is recommended to enforce a default safety timeout within the VerifyLeader methods when the provided context lacks a deadline and to explicitly document the blocking nature of these interface methods.

Comment thread kv/transaction.go Outdated
 // abort cleanup on failure.
 if needsTxnCleanup(reqs) {
-	_, _err := t.Abort(reqs)
+	_, _err := t.Abort(ctx, reqs)

medium

The Abort call here uses the caller's context, which may already be cancelled or expired (causing the previous applyRequests to fail). If the context is cancelled, the Abort proposal will fail immediately, leaving transaction locks held until the lock resolver eventually cleans them up. Consider using a detached context with a short timeout for cleanup operations to ensure they have a chance to complete, aligning with the practice of using context.Context for managing deadlines.

References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation.

Comment thread kv/sharded_coordinator.go Outdated
 primaryGid, maxIndex, err := c.commitPrimaryTxn(ctx, startTS, primaryKey, grouped, commitTS)
 if err != nil {
-	c.abortPreparedTxn(startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
+	c.abortPreparedTxn(ctx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))

medium

Similar to the sequential commit path, abortPreparedTxn here uses the potentially cancelled caller context. This may prevent the abort from being successfully proposed to the shards, leading to prolonged lock contention. Using a detached context for cleanup is recommended to ensure the operation has a managed deadline via the context.

References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation.

Comment thread kv/coordinator.go
-func (c *Coordinate) VerifyLeader() error {
-	return verifyLeaderEngine(c.engine)
+func (c *Coordinate) VerifyLeader(ctx context.Context) error {
+	return verifyLeaderEngineCtx(ctx, c.engine)

medium

To maintain the 'defense-in-depth' safety bound mentioned in the PR description, VerifyLeader should ideally enforce a default timeout (e.g., 5s) if the provided context does not already have a deadline. This prevents background callers that might pass context.Background() from blocking indefinitely if a ReadIndex round-trip stalls. Additionally, ensure this blocking semantic is explicitly stated in the documentation to avoid ambiguity.

References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation.
  2. When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation.

Comment thread kv/sharded_coordinator.go
 	return errors.WithStack(ErrLeaderNotFound)
 }
-	return verifyLeaderEngine(engineForGroup(g))
+	return verifyLeaderEngineCtx(ctx, engineForGroup(g))

medium

As with the single-shard coordinator, VerifyLeader should enforce a default timeout when the input context lacks a deadline to ensure the 5s safety bound is preserved for all callers. Furthermore, the blocking nature of this method should be explicitly documented in the interface description.

References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation.
  2. When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d11258a668


Comment thread adapter/redis.go Outdated

 if r.coordinator.IsLeader() {
-	if err := r.coordinator.VerifyLeader(); err != nil {
+	if err := r.coordinator.VerifyLeader(r.handlerContext()); err != nil {

P1: Reinstate bounded context for Redis KEYS leader checks

This call switched from VerifyLeader() (which used the 5s verifyLeaderEngine guard) to VerifyLeader(r.handlerContext()), but handlerContext() is rooted in context.WithCancel(context.Background()) and typically has no deadline. When ReadIndex/VerifyLeader stalls, KEYS can now block indefinitely instead of failing fast, tying up the command handler until client disconnect or server shutdown and reintroducing the unbounded wait behavior the 5s guard was meant to prevent.


Comment thread adapter/redis_compat_commands.go Outdated

 if err := r.retryRedisWrite(ctx, func() error {
-	if err := r.coordinator.VerifyLeader(); err != nil {
+	if err := r.coordinator.VerifyLeader(r.handlerContext()); err != nil {

P2: Pass FLUSHDB timeout context into VerifyLeader

flushDatabase creates a redisDispatchTimeout-bounded context and uses it for retries, but leader verification now uses r.handlerContext() instead of that ctx. If VerifyLeader blocks, the command can run past the intended dispatch timeout because cancellation from ctx no longer reaches the verify call. Using the same ctx here keeps FLUSHDB bounded as designed.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv/transaction.go (1)

237-246: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a non-canceled context for transactional cleanup on commit failure.

At Line 243, Abort reuses the caller ctx. If the original failure is context deadline exceeded/canceled, cleanup can be skipped, leaving intents behind and prolonging lock contention.

💡 Proposed fix
 	if err != nil {
 		// Only attempt transactional cleanup for transactional batches. Raw request
 		// batches may partially succeed across shards by design. One-phase
 		// transactional requests do not leave intents behind, so they do not need
 		// abort cleanup on failure.
 		if needsTxnCleanup(reqs) {
-			_, _err := t.Abort(ctx, reqs)
+			cleanupCtx := ctx
+			var cancel context.CancelFunc
+			if cleanupCtx == nil || cleanupCtx.Err() != nil {
+				cleanupCtx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
+				defer cancel()
+			}
+			_, _err := t.Abort(cleanupCtx, reqs)
 			if _err != nil {
 				return nil, errors.WithStack(errors.CombineErrors(err, _err))
 			}
 		}
 		return nil, errors.WithStack(err)
 	}

As per coding guidelines, kv/transaction.go: Ensure cross-shard transaction atomicity is maintained and OCC commit-ts ordering is correct.

🧹 Nitpick comments (1)
kv/transaction_batch_test.go (1)

84-154: ⚡ Quick win

Add a regression test for canceled-context commit behavior.

The API is now context-aware, but this suite doesn’t yet pin cancellation/deadline behavior (especially the batched raw-commit wait semantics). Adding one focused case would prevent silent contract drift.

Inline comments (threads collapsed below):

In adapter/dynamodb_test.go (lines 1834-1836): testCoordinatorWrapper.VerifyLeader discards the caller ctx and uses context.Background(); forward the passed ctx to w.inner.VerifyLeader(ctx) (matching how VerifyLeaderForKey forwards context) so caller deadlines and cancellation propagate through the wrapper to the inner coordinator.

In docs/design/2026_05_10_proposed_kv_ctx_plumbing.md (lines 90-98): the doc uses the pre-merge method names VerifyLeaderCtx and VerifyLeaderForKeyCtx; update all references to the merged API names VerifyLeader(ctx context.Context) error and VerifyLeaderForKey(ctx context.Context, key []byte) error (any no-arg wrappers remain the no-ctx entry points that call these with context.Background()) so tests and follow-up work are not misdirected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d54ecd57-dc97-4a30-a539-2ad77b01ed1b

📥 Commits

Reviewing files that changed from the base of the PR and between 07ff07d and d11258a.

📒 Files selected for processing (45)
  • adapter/distribution_server.go
  • adapter/distribution_server_test.go
  • adapter/dynamodb.go
  • adapter/dynamodb_admin.go
  • adapter/dynamodb_test.go
  • adapter/internal.go
  • adapter/redis.go
  • adapter/redis_compat_commands.go
  • adapter/redis_compat_helpers.go
  • adapter/redis_hello_test.go
  • adapter/redis_info_test.go
  • adapter/redis_keys_pattern_test.go
  • adapter/redis_retry_test.go
  • adapter/s3.go
  • adapter/s3_admin.go
  • adapter/s3_test.go
  • adapter/sqs.go
  • adapter/sqs_admin.go
  • docs/design/2026_05_10_proposed_kv_ctx_plumbing.md
  • internal/admin/router.go
  • internal/admin/router_test.go
  • kv/coordinator.go
  • kv/coordinator_leader_test.go
  • kv/coordinator_retry_test.go
  • kv/coordinator_txn_test.go
  • kv/leader_proxy.go
  • kv/leader_proxy_test.go
  • kv/leader_routed_store.go
  • kv/leader_routed_store_test.go
  • kv/lock_resolver.go
  • kv/lock_resolver_test.go
  • kv/shard_router.go
  • kv/shard_router_partition_test.go
  • kv/shard_router_test.go
  • kv/shard_store.go
  • kv/shard_store_txn_lock_test.go
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_abort_test.go
  • kv/sharded_coordinator_leader_test.go
  • kv/sharded_coordinator_sampler_test.go
  • kv/sharded_coordinator_txn_test.go
  • kv/sharded_lease_test.go
  • kv/transaction.go
  • kv/transaction_batch_test.go
  • main_admin.go

Comment thread adapter/dynamodb_test.go Outdated
Comment thread docs/design/2026_05_10_proposed_kv_ctx_plumbing.md Outdated
… bounded redis (PR #749 r1)

Round-1 review on commit d11258a:

P1 (Codex): adapter/redis.go::keys() switched from VerifyLeader()
(5 s verifyLeaderEngine guard) to VerifyLeader(r.handlerContext()), but
handlerContext is the server's long-lived baseCtx with no deadline.
A stalled ReadIndex on KEYS could now hang the command handler
indefinitely. Fixed at TWO layers:

  1. verifyLeaderEngineCtx (kv/raft_engine.go) now applies a default
     verifyLeaderTimeout when the inbound ctx has no deadline. This
     covers handlerContext, future Background-passing callers, and
     anything the audit missed. Callers with a tighter deadline keep
     theirs because context.WithTimeout returns the earlier of the
     two expirations (see the sketch after this list).

  2. keys() now wraps r.handlerContext() with redisDispatchTimeout
     (matching every other Redis dispatch path) so its overall
     command budget is the per-call timeout, not 5 s.
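
  A minimal sketch of that guard's shape (assumed, not the verbatim
  kv/raft_engine.go code; the engine parameter and the readIndex helper
  are illustrative):

    func verifyLeaderEngineCtx(ctx context.Context, e raftEngine) error {
        if _, ok := ctx.Deadline(); !ok {
            // Deadline-free callers (handlerContext, context.Background)
            // fall back to the 5s verifyLeaderTimeout bound; callers that
            // already carry a deadline keep their own budget.
            var cancel context.CancelFunc
            ctx, cancel = context.WithTimeout(ctx, verifyLeaderTimeout)
            defer cancel()
        }
        return readIndexUnderCtx(ctx, e) // ReadIndex round-trip under ctx
    }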

P2 (Codex): adapter/redis_compat_commands.go::flushDatabase had a
redisDispatchTimeout-bounded ctx for retries but called
VerifyLeader(r.handlerContext()) — leader verification escaped the
flush command's deadline. Now uses the per-call ctx so the entire
FLUSHDB path is bounded as designed.

Major (CodeRabbit): adapter/dynamodb_test.go::testCoordinatorWrapper
swallowed the caller ctx and substituted context.Background(),
silently undoing this PR's deadline propagation for any test
routing through the wrapper. Now forwards ctx unchanged.

Medium (gemini): commitSequential's Abort cleanup and dispatchTxn's
abortPreparedTxn cleanup both used the (likely already-cancelled)
caller ctx, risking a no-op cleanup that leaves intents until
LockResolver picks them up at a future tick. Both paths now use
context.WithTimeout(context.WithoutCancel(ctx), verifyLeaderTimeout)
so cleanup MUST go through, bounded at 5 s. Same pattern monitoring/*
uses for shutdown ctx.

Medium (gemini): VerifyLeader should enforce a default timeout when
the input ctx has no deadline — covered by the verifyLeaderEngineCtx
guard described above. The Coordinator.VerifyLeader / VerifyLeaderForKey
methods themselves stay clean (just forward ctx); the deadline
discipline is centralized in the helper.

Minor (CodeRabbit): docs/design/2026_05_10_proposed_kv_ctx_plumbing.md
described pre-merge VerifyLeaderCtx / VerifyLeaderForKeyCtx names.
Updated to match the merged surface (VerifyLeader(ctx) /
VerifyLeaderForKey(ctx, key)) and the new internal-guard mechanism.

reviewdog gci: adapter/redis_hello_test.go and redis_info_test.go
import groups merged into the project's single non-stdlib group via
golangci-lint fmt.

Caller audit (semantic change in verifyLeaderEngineCtx — adding
default-deadline wrap):
- All in-tree callers: kv/coordinator.go::VerifyLeader,
  kv/sharded_coordinator.go::VerifyLeader/ForKey,
  kv/leader_proxy.go::Commit/Abort.
- Behavior for callers that already pass a bounded ctx: unchanged.
- Behavior for callers passing Background / long-lived ctx
  (handlerContext, future audits): now bounded at 5 s — strict
  improvement, matches the pre-PR-749 default behaviour callers
  expected from the no-arg verifyLeaderEngine wrapper.
- No caller depends on "wait forever" semantics; the Redis KEYS
  call site that triggered Codex P1 is the canonical example.

Test:
  go test -race -count=1 -short ./kv ./internal/admin
  -- 7.7s, all green.
@bootjp
Owner Author

bootjp commented May 9, 2026

@claude review

@claude

claude Bot commented May 9, 2026

Claude finished @bootjp's task in 4m 1s


PR Review: refactor(kv): plumb caller context through write + verify-leader paths

Tasks:

  • Read PR context and understand the changes
  • Analyze key changed files
  • Check reviewer comments for validity
  • Review context propagation correctness
  • Check for missed edge cases
  • Post final review

Summary

The core goal of this PR — threading the caller's context.Context end-to-end through the kv write and leader-verify paths so request deadlines are honoured — is sound, well-scoped, and the batched-commit design (Background ctx in the batch goroutine, per-caller wait via select) is a correct solution to the inherent tension between batching and per-caller deadline propagation.

Two issues need addressing before merge: one CI blocker and one correctness gap.


Issues

1. CI Blocker — gci lint failures in two test files

The reviewdog bot flagged both adapter/redis_hello_test.go and adapter/redis_info_test.go for gci import formatting. These will block CI until fixed. Run:

make lint  # or: gci write adapter/redis_hello_test.go adapter/redis_info_test.go

The diff shows the files changed from aligned single-line methods to padded alignment for the wider VerifyLeader(context.Context) error signatures, which likely touched the import block during auto-formatting. Easiest fix is to re-run gci write on those two files.

2. Correctness gap — prewriteTxn inline abort paths still use caller ctx

The PR correctly applies context.WithoutCancel at both cleanup sites in dispatchTxn (for commitPrimaryTxn failure) and commitSequential (for t.Abort). However, the two inline abort calls inside prewriteTxn (kv/sharded_coordinator.go:494–495 and kv/sharded_coordinator.go:504–505) still pass the caller's ctx to abortPreparedTxn:

// sharded_coordinator.go:494
if _, err := g.Txn.Commit(ctx, []*pb.Request{req}); err != nil {
    c.abortPreparedTxn(ctx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
    return nil, errors.WithStack(err)
}

// sharded_coordinator.go:504
if err := c.validateReadOnlyShards(ctx, groupedReadKeys, gids, startTS); err != nil {
    c.abortPreparedTxn(ctx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
    return nil, err
}

When ctx expires (e.g., Redis client disconnects mid-prewrite), both the shard commit and the subsequent abortPreparedTxn will fail immediately with context.Canceled, leaving all already-prepared intents behind. These then linger until LockResolver picks them up. The defense-in-depth in verifyLeaderEngineCtx does not help here: it only applies the 5 s fallback when there is no deadline, but a cancelled context's cancellation propagates immediately regardless.

The same pattern as the dispatchTxn fix should be applied:

// Suggested fix — same shape as dispatchTxn's cleanup
if _, err := g.Txn.Commit(ctx, []*pb.Request{req}); err != nil {
    cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), verifyLeaderTimeout)
    c.abortPreparedTxn(cleanupCtx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
    cancel()
    return nil, errors.WithStack(err)
}

// and:
if err := c.validateReadOnlyShards(ctx, groupedReadKeys, gids, startTS); err != nil {
    cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), verifyLeaderTimeout)
    c.abortPreparedTxn(cleanupCtx, startTS, primaryKey, prepared, abortTSFrom(startTS, commitTS))
    cancel()
    return nil, err
}

Observations (non-blocking)

rangeList and readValueAt use handlerContext() without a deadline bound (adapter/redis.go:3264, adapter/redis.go:3520). These are protected by the verifyLeaderEngineCtx defense-in-depth (the 5 s fallback applies when no deadline is set), but they're inconsistent with the keys fix (which explicitly bounds with redisDispatchTimeout). Low severity, could be cleaned up in the follow-up Redis adapter audit pass.

Reviewer comments addressed in the latest commit:

  • CodeRabbit's testCoordinatorWrapper.VerifyLeader drop-ctx bug (adapter/dynamodb_test.go:1834) — ✅ ctx is now correctly forwarded in the current diff
  • Design doc method names (VerifyLeaderCtx vs VerifyLeader) — ✅ doc already uses VerifyLeader(ctx context.Context) in the current state
  • Gemini's "enforce default timeout inside VerifyLeader" — ✅ addressed by the verifyLeaderEngineCtx defense-in-depth (applies 5 s when ctx has no deadline)
  • Codex P1 (KEYS unbounded block) — ✅ fixed with the explicit redisDispatchTimeout-bounded ctx in keys()
  • Codex P2 (FLUSHDB using handlerContext() instead of dispatch ctx) — ✅ fixed to use the retryRedisWrite ctx

The commitRaw batched-commit design is correct. Using context.Background() in the batch goroutine while exiting early via select { case <-ctx.Done() } at the wait site is the right model. Other waiters in the same batch are unaffected by one caller's cancellation.

forwardWithRetry parentCtx change (from context.Background() to callerCtx) is correct: context.WithDeadline(callerCtx, deadline) correctly picks the earlier of the two expirations, so a Redis client deadline shorter than leaderProxyRetryBudget now exits early.
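A small standard-library illustration of why that holds (generic example, not repo code):

parent, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
// WithDeadline never extends the parent: the child expires at the earlier
// of the parent's deadline and the requested one.
child, cancel2 := context.WithDeadline(parent, time.Now().Add(10*time.Second))
defer cancel2()
deadline, _ := child.Deadline() // roughly now+2s: the tighter caller deadline wins
_ = deadline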


Bottom line

Two required fixes:

  1. Run gci write on the two flagged test files to clear the CI lint blocker
  2. Apply context.WithoutCancel to the two abortPreparedTxn calls inside prewriteTxn so cancelled-ctx cases don't leave stale intents

Everything else is clean. The architecture of the change is correct and the PR description accurately describes what shipped.

Round-2 review on commit fe8162e from claude bot:

> Correctness gap — prewriteTxn inline abort paths still use
> caller ctx
>
> The PR correctly applies context.WithoutCancel at both cleanup
> sites in dispatchTxn (for commitPrimaryTxn failure) and
> commitSequential (for t.Abort). However, the two inline abort
> calls inside prewriteTxn still pass the caller's ctx to
> abortPreparedTxn:
>
>   * sharded_coordinator.go:494 (prepare-loop Commit failure)
>   * sharded_coordinator.go:504 (validateReadOnlyShards failure)
>
> When ctx expires (e.g., Redis client disconnects mid-prewrite),
> both the shard commit and the subsequent abortPreparedTxn fail
> immediately with context.Canceled, leaving prepared intents
> behind. The defense-in-depth in verifyLeaderEngineCtx does not
> help: it only applies the 5 s fallback when there is no
> deadline, but a cancelled context's cancellation propagates
> immediately regardless.

Fixed both prewriteTxn cleanup sites the same way as round-1
patched dispatchTxn / commitSequential:

  cleanupCtx, cancel := context.WithTimeout(
      context.WithoutCancel(ctx), verifyLeaderTimeout)
  c.abortPreparedTxn(cleanupCtx, ...)
  cancel()

Now any prewrite intent we already persisted on prior shards is
guaranteed an Abort attempt within 5 s, regardless of whether
the upstream cancellation cascade comes from the client (ctx
expired) or from a slow shard (Commit returned ctx.Err).

The other Claude bot finding (CI gci) was already addressed in
round 1 (fe8162e) — golangci-lint passes on the current commit.
The Jepsen test failure on the round-2 CI run is a Maven
dependency-resolution flake (com.cognitect.aws:dynamodb /
com.cognitect.aws:sqs / org.slf4j:slf4j-* checksum failures
during dependency download), unrelated to the diff.

Non-blocking observation acknowledged: rangeList / readValueAt
in adapter/redis.go still pass r.handlerContext() to
VerifyLeaderForKey rather than a per-call ctx. Those calls are
correctly bounded at 5 s by verifyLeaderEngineCtx's default-
deadline guard added in r1, so they are not a regression vs the
pre-PR-749 behaviour. Tightening them to the per-call dispatch
budget (consistent with the keys() fix) belongs to the
follow-up Redis-adapter audit pass already noted in the design
doc — not in scope for this PR.

Caller audit (semantic change in prewriteTxn cleanup):
- Only call sites of abortPreparedTxn that previously used a
  potentially-cancelled ctx now derive a detached cleanup ctx.
- abortPreparedTxn itself is unchanged; it just takes the ctx it
  is given and forwards to g.Txn.Commit. The caller-side
  detachment is the fix.
- No other prewriteTxn paths reach abortPreparedTxn.

Test:
  go test -race -count=1 -short ./kv -- 6.2s, all green.
@bootjp
Owner Author

bootjp commented May 9, 2026

@claude review

@claude

claude Bot commented May 9, 2026

Claude finished @bootjp's task in 4m 40s


PR Review: refactor(kv): plumb caller context through write + verify-leader paths

Tasks:

  • Read PR context and understand the changes
  • Review current state of key files (post r1 + r2 commits)
  • Check if previous review issues were addressed
  • Analyze remaining concerns and new code
  • Post final review

Summary

This is a re-review after two follow-up commits (fe8162e r1 and 13b8707 r2). All issues from the prior round have been addressed. The PR is ready to merge.


Issues addressed since round 1

✅ gci lint failures (redis_hello_test.go, redis_info_test.go) — fixed in r1

Both test files were reformatted. CI now passes golangci-lint.

✅ prewriteTxn abort cleanup with cancelled ctx — fixed in r2 (13b8707)

Both abortPreparedTxn call sites inside prewriteTxn (sharded_coordinator.go:501-503 and :517-519) now correctly use context.WithTimeout(context.WithoutCancel(ctx), verifyLeaderTimeout), matching the pattern already applied in dispatchTxn and commitSequential. This closes the stale-intent risk when a client disconnects mid-prewrite.

✅ KEYS unbounded ReadIndex block (Codex P1) — fixed in r1

keys() now wraps r.handlerContext() with redisDispatchTimeout before passing to VerifyLeader.

✅ FLUSHDB using handlerContext() for VerifyLeader (Codex P2) — fixed in r1

flushDatabase now passes the per-call ctx (already bounded by redisDispatchTimeout) to VerifyLeader.

✅ verifyLeaderEngineCtx defense-in-depth (Gemini) — fixed in r1

verifyLeaderEngineCtx (kv/raft_engine.go:67-71) now applies verifyLeaderTimeout as a fallback when the incoming ctx has no deadline, covering handlerContext, background loops, and any future caller that passes context.Background() directly.

✅ commitSequential Abort with cancelled ctx (CodeRabbit/Gemini) — fixed in r1

transaction.go:251-253 correctly uses context.WithoutCancel(ctx) + verifyLeaderTimeout for the transactional Abort cleanup path.

✅ testCoordinatorWrapper.VerifyLeader dropped ctx (CodeRabbit) — fixed in initial commit

adapter/dynamodb_test.go:1834-1835 now forwards the caller ctx to w.inner.VerifyLeader(ctx).

✅ Design doc stale method names (CodeRabbit) — fixed in r1

docs/design/2026_05_10_proposed_kv_ctx_plumbing.md uses the correct VerifyLeader(ctx context.Context) / VerifyLeaderForKey(ctx context.Context, key []byte) names throughout.


Observations (non-blocking)

rangeList and readValueAt use r.handlerContext() without per-call deadline (adapter/redis.go:3264 and :3520). These remain correctly bounded at 5 s by the verifyLeaderEngineCtx defense-in-depth added in r1 — no regression versus pre-PR-749 behavior. Tightening them to redisDispatchTimeout (consistent with keys()) belongs to the follow-up Redis adapter audit pass already scoped in the design doc.

commitSecondaryWithRetry uses the caller's ctx — this is correct by design. Secondary commits are best-effort; if ctx expires, lock resolution handles the remaining secondaries. No WithoutCancel pattern needed here.

applyRawBatch uses context.Background() (transaction.go:411) — intentional and correctly documented. The batch goroutine cannot inherit any single caller's ctx; per-caller cancellation is honoured at the wait site in commitRaw via select { case <-item.done / case <-ctx.Done() }.


Bottom line

All three mandatory fixes (gci lint blocker, prewriteTxn abort cleanup, KEYS/FLUSHDB deadline binding) are in. The architecture of the change is correct, the WithoutCancel pattern is applied consistently to every abort cleanup site, and the verifyLeaderEngineCtx defense-in-depth correctly covers the handlerContext gap. Ready to merge.

@bootjp bootjp merged commit 79e3545 into main May 10, 2026
9 of 10 checks passed
@bootjp bootjp deleted the fix/verify-leader-context-plumbing branch May 10, 2026 08:39