Skip to content

feat(sqs): partition resolver for HT-FIFO routing (Phase 3.D PR 4-B-2)#715

Merged
bootjp merged 8 commits intomainfrom
feat/sqs-htfifo-routing-resolver
Apr 30, 2026
Merged

feat(sqs): partition resolver for HT-FIFO routing (Phase 3.D PR 4-B-2)#715
bootjp merged 8 commits intomainfrom
feat/sqs-htfifo-routing-resolver

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 30, 2026

Summary

Routing-layer half of PR 4-B. Adds a PartitionResolver that ShardRouter consults BEFORE falling through to the byte-range engine. SQS HT-FIFO needs partition-aware dispatch, but the engine's non-overlapping-cover model can't express overlay routes — the resolver-first dispatch sidesteps this cleanly.

Stacks on top of #708 (PR 4-B-1, capability JSON). Next is PR 4-B-3 (leadership-refusal + catalog polling + flip htfifoCapabilityAdvertised to true).

What's added

  • kv.PartitionResolver interface — ResolveGroup([]byte) (uint64, bool).
  • kv.ShardRouter.WithPartitionResolver(...) — fluent option, nil-safe.
  • kv.ShardRouter.resolveGroup(...) — unified dispatch path: resolver first, engine fallback. Both groupRequests (Commit/Abort) and Get route through it.
  • kv.ShardedCoordinator.WithPartitionResolver(...) — delegates to the router so main.go can install via the existing fluent-construction style.
  • adapter.SQSPartitionResolver — parses (queue, partition) from the partitioned key shape, looks up the operator-chosen group. Defensive copy at construction, nil-safe ResolveGroup, returns (0, false) for legacy / non-matching keys.
  • main.go — builds the resolver from runtimeConfig.sqsFifoPartitionMap and installs it. Resolver is nil on a non-partitioned cluster — hot path stays engine-only.

What's NOT added (deferred to PR 4-B-3)

  • §8 leadership-refusal hook in kv (refuses leadership for an SQS Raft group hosting a partitioned queue when the binary lacks htfifo).
  • Catalog-polling helper for the CreateQueue capability gate (PR 5 starts using it).
  • Flipping htfifoCapabilityAdvertised from false to true.

The design's "advertise htfifo only when both routing AND leadership-refusal are in place" rule keeps the constant false in this PR — PR 4-B-3 flips it.

Test plan

  • adapter/sqs_partition_resolver_test.go — 9 top-level tests: nil-on-empty, defensive-copy, partition dispatch across all 5 families, queue-name prefix isolation ("queue" vs "queue1"), legacy fall-through (8 sub-cases), unknown queue, out-of-range partition, nil receiver, prefix alignment with sqs_keys.go constants.
  • kv/shard_router_partition_test.go — 4 tests: resolver wins over engine, engine fallthrough on resolver-miss, nil resolver no-op, Get path also routes through the resolver.
  • go test -race ./kv/... pass.
  • go test -race ./adapter/... pass.
  • golangci-lint ./kv/... ./adapter/... . clean.

Self-review (per CLAUDE.md)

  1. Data loss — routing layer only; no FSM/Pebble/retention path. No issue.
  2. ConcurrencypartitionResolver is set once at startup before any request. ResolveGroup reads a constructor-time defensive copy, so a future hot-reload of --sqsFifoPartitionMap cannot perturb in-flight requests. No issue.
  3. Performance — one map lookup + 4-byte BigEndian decode per resolver hit (only on partitioned-prefix matches). Engine-only path adds a single if s.partitionResolver != nil branch. No issue.
  4. Data consistency — resolver output strictly OVERRIDES the engine for partitioned keys; legacy keys flow through unchanged. "queue not found" / "partition out of range" branches return (0, false) so the router surfaces an explicit error rather than silently mis-routing. No issue.
  5. Test coverage — 13 tests across two new files; existing TestShardRouter* tests unchanged. Both override and fall-through paths pinned, plus the queue-name prefix-isolation invariant from PR feat(sqs): HT-FIFO partitioned-keyspace constructors (Phase 3.D PR 3) #703.

Summary by CodeRabbit

  • New Features
    • Added partition-based routing for distributed coordination. Requests are now directed to specific nodes based on configured partition mappings.
    • Enhanced SQS queue handling with optional partition-aware routing. When partition information is unavailable or unconfigured, the system automatically falls back to the existing routing mechanism.

Adds the routing-layer half of PR 4-B: a PartitionResolver that
ShardRouter consults BEFORE falling through to the byte-range
engine. SQS HT-FIFO needs partition-aware dispatch — partition K
of queue Q lives on a different Raft group than partition K+1 —
but the engine's non-overlapping-cover model cannot express
overlay routes without breaking legacy keys (a partition route
would leave a lexicographic gap that legacy keys fall into).

The resolver-first dispatch sidesteps this: the resolver answers
only for partitioned-keyspace keys, returns (0, false) otherwise,
and the engine handles everything else exactly as today.

What changes

- kv.PartitionResolver interface: ResolveGroup([]byte) (uint64, bool).
- kv.ShardRouter.WithPartitionResolver: option, nil-safe, idempotent.
- kv.ShardRouter.resolveGroup: tries the resolver first, falls
  through to engine.GetRoute. groupRequests and Get both call this
  unified path so reads and writes share the same dispatch logic.
- kv.ShardedCoordinator.WithPartitionResolver: delegates to the
  router so main.go can install the resolver via the existing
  fluent-construction style.
- adapter.SQSPartitionResolver: parses (queue, partition) from the
  partitioned key shape, looks up the operator-chosen group from
  the runtime config map. Defensive copy at construction; nil-safe
  ResolveGroup; returns (0, false) for legacy / non-matching keys.
- main.go: builds the resolver from runtimeConfig.sqsFifoPartitionMap
  (canonicalized to numeric uint64 group IDs by parseSQSFifoGroupList)
  and installs it via WithPartitionResolver. The resolver is nil for
  a non-partitioned cluster — request hot path stays engine-only.

What does NOT change yet

- htfifoCapabilityAdvertised stays false. PR 4-B-3 wires the §8
  leadership-refusal hook + catalog-polling helper for the CreateQueue
  capability gate, then flips the constant to true.
- Send / Receive partition fanout is still PR 5. PR 5 lifts the
  PartitionCount > 1 dormancy gate from PR 2 in the same commit
  that wires the data-plane fanout.

Self-review (per CLAUDE.md)

1. Data loss — routing layer only; no FSM/Pebble/retention path.
   No issue.
2. Concurrency — partitionResolver field is set once at startup
   before any request lands. ResolveGroup reads from a constructor-
   time defensive copy, so a hot-reload of --sqsFifoPartitionMap (a
   future capability) cannot perturb in-flight requests. No issue.
3. Performance — one map lookup + a 4-byte BigEndian decode per
   resolver hit, which only fires on partitioned-prefix matches.
   Engine-only path (the common case for non-partitioned clusters)
   adds a single `if s.partitionResolver != nil` branch — negligible.
   No issue.
4. Data consistency — the resolver's output strictly OVERRIDES the
   engine. A partitioned-FIFO key always reaches the operator-chosen
   partition group; legacy keys still flow through the engine. The
   resolver's "queue not found" / "partition out of range" branches
   return (0, false) so the router surfaces an error rather than
   silently mis-routing. No issue.
5. Test coverage — adapter/sqs_partition_resolver_test.go: 9
   top-level tests (nil-on-empty, defensive-copy, partition dispatch
   across all 5 families, queue-name prefix isolation, legacy fall-
   through, unknown queue, out-of-range partition, nil receiver,
   prefix alignment). kv/shard_router_partition_test.go: 4 tests
   (resolver wins, engine fallthrough for unknown keys, nil resolver
   no-op, Get path also uses resolver). Existing TestShardRouter*
   tests unchanged.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Phase 3.D PR 4-B-2: routing-layer resolver for HT-FIFO. htfifoCapabilityAdvertised stays false; PR 4-B-3 will add the leadership-refusal hook + catalog-polling helper and flip the flag. PR description has the full self-review and test plan.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 43 minutes and 38 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ba40841-4916-4391-9089-428d8bf24da9

📥 Commits

Reviewing files that changed from the base of the PR and between 1f48eeb and eedc17c.

📒 Files selected for processing (9)
  • adapter/sqs_partition_resolver.go
  • adapter/sqs_partition_resolver_test.go
  • kv/shard_router.go
  • kv/shard_router_partition_test.go
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_partition_test.go
  • kv/sharded_coordinator_txn_test.go
  • main.go
  • main_sqs_resolver_test.go
📝 Walkthrough

Walkthrough

The changes introduce partitioned SQS routing by adding a new SQSPartitionResolver that maps queue names and partition values to Raft group IDs. A PartitionResolver interface is integrated into ShardRouter to enable dispatch-time partition resolution before fallback to byte-range routing. Configuration plumbing in main.go wires the resolver into the coordinator based on operator-provided partition maps.

Changes

Cohort / File(s) Summary
SQS Partition Resolver
adapter/sqs_partition_resolver.go
New resolver type that parses partitioned SQS keys, decodes 4-byte big-endian partition values, and routes to configured Raft group IDs. Handles defensive map copying and returns (0, false) for malformed or unrecognized keys to enable fallback routing.
Partition Resolver Tests
adapter/sqs_partition_resolver_test.go
Comprehensive test suite covering nil/empty constructor returns, defensive map copying, correct routing for all key families (data, vis, dedup, group, byage), queue-name prefix isolation, legacy key fallback, unknown queues/partitions, and edge cases.
Router Partition Integration
kv/shard_router.go
Adds PartitionResolver interface and installs it in ShardRouter via WithPartitionResolver. Updated groupRequests and Get to consult resolver first, falling back to byte-range engine dispatch for unresolved keys.
Router Partition Tests
kv/shard_router_partition_test.go
Tests partition-aware dispatch in Commit and Get scenarios, verifying resolver-first routing, fallback behavior when resolver returns (0, false), and engine-only dispatch when resolver is nil.
Coordinator & Configuration
kv/sharded_coordinator.go, main.go
Adds WithPartitionResolver fluent option to ShardedCoordinator and wires it via buildSQSPartitionResolver helper that converts operator-supplied queue partition maps into group ID routes.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ShardRouter
    participant PartitionResolver
    participant ByteRangeEngine
    participant RaftGroup

    Client->>ShardRouter: Commit(key, value)
    ShardRouter->>ShardRouter: resolveGroup(key)
    
    alt Partition Resolver Installed
        ShardRouter->>PartitionResolver: ResolveGroup(key)
        alt Partitioned SQS Key
            PartitionResolver->>PartitionResolver: Parse queue & partition
            PartitionResolver-->>ShardRouter: (groupID, true)
        else Unrecognized/Legacy Key
            PartitionResolver-->>ShardRouter: (0, false)
        end
    end
    
    alt Resolver Returned Valid Group
        ShardRouter->>RaftGroup: Dispatch to resolved group
    else Resolver Returned (0, false) or Not Installed
        ShardRouter->>ByteRangeEngine: GetRoute(key)
        ByteRangeEngine-->>ShardRouter: groupID from byte range
        ShardRouter->>RaftGroup: Dispatch to engine group
    end
    
    RaftGroup-->>Client: Response
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 With queues now partitioned, our hops grow quite neat,
Each SQS key finds its Raft group so sweet,
The resolver maps partitions, no fallback neglect,
Through bytes we traverse with partition-aware tech!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding a partition resolver for SQS HT-FIFO routing, which is the primary objective of all changes across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sqs-htfifo-routing-resolver

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 43 minutes and 38 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 6m 32s —— View job


Code Review: PR #715feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Review adapter/sqs_partition_resolver.go
  • Review kv/shard_router.go
  • Review kv/sharded_coordinator.go
  • Review main.go
  • Review test files
  • Post final review

Summary

The routing architecture is sound and cleanly implemented. One medium bug in main.go should be fixed before merge (typed-nil interface), plus two design gaps worth documenting. The test suite is thorough and the defensive-copy + nil-receiver patterns are correct.


Bug: Typed-nil interface in buildSQSPartitionResolver (medium)

File: main.go:495

buildSQSPartitionResolver has return type *adapter.SQSPartitionResolver. On a non-partitioned cluster it returns nil — but that nil is a typed nil. When passed to WithPartitionResolver(r PartitionResolver), Go wraps it into a non-nil interface with nil data pointer, so s.partitionResolver != nil evaluates to true on every request.

// current: typed nil wraps to non-nil PartitionResolver interface
func buildSQSPartitionResolver(partitionMap map[string]sqsFifoQueueRouting) *adapter.SQSPartitionResolver {
    if len(partitionMap) == 0 {
        return nil  // typed *SQSPartitionResolver nil → non-nil interface
    }
    ...
}

This means on a non-partitioned cluster every request calls s.partitionResolver.ResolveGroup(key) (which safely returns (0, false) via the nil-receiver guard), rather than short-circuiting at s.partitionResolver != nil. The PR description's claim "Resolver is nil on a non-partitioned cluster — hot path stays engine-only" and the comment in WithPartitionResolver ("A nil resolver clears any previously-installed resolver") are both incorrect for this call path.

Fix: change the return type to kv.PartitionResolver so the untyped nil becomes a true nil interface:

func buildSQSPartitionResolver(partitionMap map[string]sqsFifoQueueRouting) kv.PartitionResolver {
    if len(partitionMap) == 0 {
        return nil  // untyped nil → nil interface, short-circuit works correctly
    }
    flat := make(map[string][]uint64, len(partitionMap))
    for queue, routing := range partitionMap {
        ids := make([]uint64, 0, len(routing.groups))
        for _, groupRef := range routing.groups {
            id, err := strconv.ParseUint(groupRef, 10, 64)
            if err != nil {
                panic(errors.Wrapf(err, "queue %q: bypassed group-ref canonicalisation", queue))
            }
            ids = append(ids, id)
        }
        flat[queue] = ids
    }
    r := adapter.NewSQSPartitionResolver(flat)
    if r == nil {
        return nil  // guard against empty flat after canonicalisation
    }
    return r
}

Fix this →


Design gap: groupMutations() bypasses the resolver (minor, document or fix in 4-B-3)

File: kv/sharded_coordinator.go:1055

ShardedCoordinator.groupMutations() calls c.engine.GetRoute() directly — it does not consult the partition resolver:

route, ok := c.engine.GetRoute(routeKey(mut.Key))  // bypasses resolver

groupMutations is used by Dispatch()rawLogs() (non-txn) and dispatchTxn() (transaction). For a Dispatch() call where all mutations belong to one SQS partition, the final c.router.Commit(logs) re-routes correctly via groupRequests()resolveGroup(). This is safe today because SQS SendMessage writes are single-partition.

But a Dispatch() call with mutations for two different SQS partitions in the same batch would fail silently: groupMutations() bundles all mutations under the same engine shard → rawLogs() produces one pb.Requestrouter.Commit() routes the whole request to one partition based on Mutations[0].Key, dropping the other partition's mutations into the wrong group.

If the SQS adapter ever issues multi-partition batch writes through Dispatch(), this will mis-route. Worth a comment or tracked as a follow-up for 4-B-3.


Design gap: coordinator key-lookup APIs bypass the resolver (minor, follow-up for 4-B-3)

File: kv/sharded_coordinator.go:847,862,874

groupForKey(), routeAndGroupForKey(), and engineGroupIDForKey() all call c.engine.GetRoute() directly. These drive:

  • IsLeaderForKey / VerifyLeaderForKey / RaftLeaderForKey
  • LinearizableReadForKey / LeaseReadForKey
  • commitPrimaryTxn primary-key resolution
  • allReadKeysInShard / groupReadKeysByShardID for cross-shard transaction routing

For SQS HT-FIFO reads or the §8 leadership-refusal hook (PR 4-B-3), calling these with a partitioned key will return the engine-shard result, not the partition-assigned shard. Since htfifoCapabilityAdvertised = false today this is latent rather than active, but PR 4-B-3 will need to either route through resolveGroup() or add resolver-aware variants.


Nits

sqs_partition_resolver.go:8-11 — The "duck typing" mention in the type doc is slightly unusual. Since the compile-time interface check happens implicitly at the main.go call site (buildSQSPartitionResolver return → WithPartitionResolver(r PartitionResolver)), it's fine — just worth noting that adding var _ kv.PartitionResolver = (*SQSPartitionResolver)(nil) would make the contract explicit without requiring adapter to import kv (the check can live in main.go or in a test file).

main.go:505-513 — The panic() is reasonable for a "programmer bypassed the validator" invariant at startup. No objection.


Positive observations

  • Defensive copy in NewSQSPartitionResolver (both slice and map) is correct and directly pinned by TestNewSQSPartitionResolver_DefensiveCopy.
  • pipeIdx <= 0 correctly rejects both "not found" (−1) and "empty queue name" (0) in one guard.
  • uint64(partition) >= uint64(len(groups)) avoids the int/uint comparison pitfall cleanly.
  • TestSQSPartitionResolver_PrefixesAlign is an excellent regression anchor — any new partitioned family added to sqs_keys.go without updating sqsResolverFamilyPrefixes will immediately break this test.
  • fakePartitionResolver in kv/shard_router_partition_test.go correctly avoids importing adapter, keeping the kv layer isolated.
  • Nil receiver guard on (*SQSPartitionResolver).ResolveGroup is correct and tested; it makes the typed-nil case (above) functionally safe, even if not as efficient as documented.
  • The four kv/shard_router_partition_test.go tests directly pin the resolver-wins, fall-through, nil-no-op, and Get paths — full coverage of the dispatch contract.

Verdict: Fix the typed-nil return type in buildSQSPartitionResolver before merge. The design gaps are acceptable at this phase given htfifoCapabilityAdvertised = false — recommend a comment or tracking issue for the groupMutations() bypass and coordinator key-lookup APIs before 4-B-3 enables live HT-FIFO traffic.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f48eeb15d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/shard_router.go Outdated
Comment on lines +178 to +181
if len(key) == 0 {
return nil, ErrInvalidRequest
}
route, ok := s.engine.GetRoute(key)
gid, ok := s.resolveGroup(key)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Resolve partition routes before normalizing SQS keys

groupRequests passes routeKey(r.Mutations[0].Key) into resolveGroup, but routeKey collapses every !sqs|... key to !sqs|route|global (kv/shard_key.go), so SQSPartitionResolver never sees the partitioned prefixes it parses (!sqs|msg|...|p|...). In practice this makes the new partition resolver a no-op for Commit/Abort (and similarly in Get), causing partitioned HT-FIFO traffic to keep following the engine fallback instead of the configured per-partition group map.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a PartitionResolver interface and a specific implementation for SQS partitioned queues, allowing the ShardRouter to override default byte-range routing for specific keyspaces. The review feedback identifies a critical issue where several methods in ShardedCoordinator still consult the engine directly, which could lead to misrouting for transactions and certain read paths. Additionally, there are suggestions to optimize the SQS resolver by pre-converting family prefixes to byte slices and to improve documentation regarding the thread-safety limitations of installing the resolver at runtime.

Comment thread kv/sharded_coordinator.go
Comment on lines +184 to +187
func (c *ShardedCoordinator) WithPartitionResolver(r PartitionResolver) *ShardedCoordinator {
c.router.WithPartitionResolver(r)
return c
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While WithPartitionResolver correctly wires the resolver into the ShardRouter, the ShardedCoordinator itself contains several critical methods that bypass the router and consult c.engine directly for routing decisions. Specifically, groupMutations, engineGroupIDForKey, groupForKey, and routeAndGroupForKey all use c.engine.GetRoute without considering the PartitionResolver.

This will cause transactions (2PC) and certain read paths for partitioned SQS queues to be misrouted to the engine's default group instead of the resolver-chosen group. These methods should be refactored to use c.router.resolveGroup to ensure consistent routing across the entire coordinator.

Comment thread adapter/sqs_partition_resolver.go Outdated
Comment on lines +54 to +60
var sqsResolverFamilyPrefixes = []string{
SqsPartitionedMsgDataPrefix,
SqsPartitionedMsgVisPrefix,
SqsPartitionedMsgDedupPrefix,
SqsPartitionedMsgGroupPrefix,
SqsPartitionedMsgByAgePrefix,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defining sqsResolverFamilyPrefixes as a slice of strings leads to repeated string-to-byte conversions in stripPartitionedFamilyPrefix on the request hot path. Pre-calculating these as a slice of byte slices would improve performance and avoid unnecessary allocations.

Suggested change
var sqsResolverFamilyPrefixes = []string{
SqsPartitionedMsgDataPrefix,
SqsPartitionedMsgVisPrefix,
SqsPartitionedMsgDedupPrefix,
SqsPartitionedMsgGroupPrefix,
SqsPartitionedMsgByAgePrefix,
}
var sqsResolverFamilyPrefixes = [][]byte{
[]byte(SqsPartitionedMsgDataPrefix),
[]byte(SqsPartitionedMsgVisPrefix),
[]byte(SqsPartitionedMsgDedupPrefix),
[]byte(SqsPartitionedMsgGroupPrefix),
[]byte(SqsPartitionedMsgByAgePrefix),
}

Comment on lines +127 to +133
func stripPartitionedFamilyPrefix(key []byte) ([]byte, bool) {
for _, prefix := range sqsResolverFamilyPrefixes {
if bytes.HasPrefix(key, []byte(prefix)) {
return key[len(prefix):], true
}
}
return nil, false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With sqsResolverFamilyPrefixes updated to [][]byte, the loop here can avoid the []byte(prefix) conversion.

Suggested change
func stripPartitionedFamilyPrefix(key []byte) ([]byte, bool) {
for _, prefix := range sqsResolverFamilyPrefixes {
if bytes.HasPrefix(key, []byte(prefix)) {
return key[len(prefix):], true
}
}
return nil, false
func stripPartitionedFamilyPrefix(key []byte) ([]byte, bool) {
for _, prefix := range sqsResolverFamilyPrefixes {
if bytes.HasPrefix(key, prefix) {
return key[len(prefix):], true
}
}
return nil, false
}

Comment thread kv/shard_router.go
Comment on lines +57 to +71
// WithPartitionResolver installs a partition-keyspace resolver that
// is consulted before the byte-range engine on every dispatch. A
// nil resolver clears any previously-installed resolver. Returns
// the receiver so callers can chain.
//
// Setting the resolver is idempotent — re-installing the same value
// is a no-op. Concurrent reads against ResolveGroup remain safe
// because both the read in resolveGroup and the assignment here
// happen against the same field; routine startup wires the resolver
// once before any request lands, so the rare write does not need a
// lock.
func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
s.partitionResolver = r
return s
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment regarding idempotency and thread safety is misleading. Interface assignment in Go is not atomic; if this method is called while requests are being processed, concurrent readers in resolveGroup could observe a torn interface. As per repository guidelines for limited-scope implementations, the documentation should explicitly reflect that this is intended for startup use to prevent misuse in a general context, or the implementation should use atomic operations if concurrent updates are intended to be supported safely.

// WithPartitionResolver installs a partition-keyspace resolver that
// is consulted before the byte-range engine on every dispatch. A
// nil resolver clears any previously-installed resolver. Returns
// the receiver so callers can chain.
//
// Note: This method is intended for use during startup. Interface
// assignment is not atomic in Go; if the resolver is updated while
// requests are in flight, concurrent readers in resolveGroup might
// observe a partially updated interface value.
func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
	s.partitionResolver = r
	return s
}
References
  1. For prototype or limited-scope implementations, explicitly document the limitations (e.g., single-node only, blocking startup) and add runtime validation to prevent misuse in a general context.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
adapter/sqs_partition_resolver.go (1)

127-131: ⚡ Quick win

Avoid repeated string→byte conversion in prefix matching hot path.

bytes.HasPrefix(key, []byte(prefix)) inside the loop can add avoidable overhead on every resolve. Precompute byte prefixes once.

Proposed refactor
 var sqsResolverFamilyPrefixes = []string{
 	SqsPartitionedMsgDataPrefix,
 	SqsPartitionedMsgVisPrefix,
 	SqsPartitionedMsgDedupPrefix,
 	SqsPartitionedMsgGroupPrefix,
 	SqsPartitionedMsgByAgePrefix,
 }
+
+var sqsResolverFamilyPrefixBytes = [][]byte{
+	[]byte(SqsPartitionedMsgDataPrefix),
+	[]byte(SqsPartitionedMsgVisPrefix),
+	[]byte(SqsPartitionedMsgDedupPrefix),
+	[]byte(SqsPartitionedMsgGroupPrefix),
+	[]byte(SqsPartitionedMsgByAgePrefix),
+}
@@
 func stripPartitionedFamilyPrefix(key []byte) ([]byte, bool) {
-	for _, prefix := range sqsResolverFamilyPrefixes {
-		if bytes.HasPrefix(key, []byte(prefix)) {
+	for _, prefix := range sqsResolverFamilyPrefixBytes {
+		if bytes.HasPrefix(key, prefix) {
 			return key[len(prefix):], true
 		}
 	}
 	return nil, false
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_partition_resolver.go` around lines 127 - 131, The hot path in
stripPartitionedFamilyPrefix repeatedly converts strings to []byte by calling
bytes.HasPrefix(key, []byte(prefix)) for each prefix in
sqsResolverFamilyPrefixes; precompute a parallel slice of byte-prefixes (e.g.
sqsResolverFamilyPrefixBytes [][]byte) at package init time and replace the loop
in stripPartitionedFamilyPrefix to iterate over those byte slices and call
bytes.HasPrefix(key, bytePrefix) instead, keeping the function signature and
return semantics the same and ensuring the new precomputed variable is populated
once (init or var block) so no per-call allocations occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@kv/shard_router.go`:
- Around line 68-80: Protect s.partitionResolver with the shard router mutex
when writing and reading: in WithPartitionResolver(lock using the existing
mutex) set s.partitionResolver under the lock, and in resolveGroup(lock the
mutex while loading the resolver). Also guard against a typed-nil resolver by
checking the interface value for nilness via reflect (e.g.,
reflect.ValueOf(r).IsNil()) before storing or before using the loaded interface;
if the provided resolver is a typed nil treat it as nil (do not store) so
resolveGroup’s nil-check behaves correctly.

---

Nitpick comments:
In `@adapter/sqs_partition_resolver.go`:
- Around line 127-131: The hot path in stripPartitionedFamilyPrefix repeatedly
converts strings to []byte by calling bytes.HasPrefix(key, []byte(prefix)) for
each prefix in sqsResolverFamilyPrefixes; precompute a parallel slice of
byte-prefixes (e.g. sqsResolverFamilyPrefixBytes [][]byte) at package init time
and replace the loop in stripPartitionedFamilyPrefix to iterate over those byte
slices and call bytes.HasPrefix(key, bytePrefix) instead, keeping the function
signature and return semantics the same and ensuring the new precomputed
variable is populated once (init or var block) so no per-call allocations occur.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c2dd747-e994-411b-b0c5-bcbb6824fef7

📥 Commits

Reviewing files that changed from the base of the PR and between 65e725a and 1f48eeb.

📒 Files selected for processing (6)
  • adapter/sqs_partition_resolver.go
  • adapter/sqs_partition_resolver_test.go
  • kv/shard_router.go
  • kv/shard_router_partition_test.go
  • kv/sharded_coordinator.go
  • main.go

Comment thread kv/shard_router.go Outdated
Comment on lines +68 to +80
func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
s.partitionResolver = r
return s
}

// resolveGroup tries the partition resolver first (when installed),
// then falls through to the byte-range engine. Returns the resolved
// Raft group ID and a found flag; (0, false) means no route in either
// the resolver or the engine — the caller surfaces this as an
// "unknown group" error.
func (s *ShardRouter) resolveGroup(key []byte) (uint64, bool) {
if s.partitionResolver != nil {
if gid, ok := s.partitionResolver.ResolveGroup(key); ok {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the shard_router.go file
find . -type f -name "shard_router.go" | head -5

Repository: bootjp/elastickv

Length of output: 81


🏁 Script executed:

# Check the structure of ShardRouter
cat -n kv/shard_router.go | head -100

Repository: bootjp/elastickv

Length of output: 4246


🏁 Script executed:

# Look for the mutex definition and its usage
rg -n "type ShardRouter struct" -A 15 kv/shard_router.go

Repository: bootjp/elastickv

Length of output: 562


🏁 Script executed:

# Find PartitionResolver interface definition
rg -n "type PartitionResolver interface" -A 10

Repository: bootjp/elastickv

Length of output: 697


🏁 Script executed:

# Check for concurrent access patterns - search for mutex usage in ShardRouter methods
rg -n "s\.mu\." kv/shard_router.go

Repository: bootjp/elastickv

Length of output: 144


🏁 Script executed:

# Look at how WithPartitionResolver is used in tests or elsewhere
rg -n "WithPartitionResolver" --type go -B 2 -A 2

Repository: bootjp/elastickv

Length of output: 4854


🏁 Script executed:

# Search for all calls to WithPartitionResolver across the codebase
rg -n "WithPartitionResolver" --type go -B 3 -A 3 | grep -E "(WithPartitionResolver|func|if|after|hot|runtime|reload|dynamic|concurrent)"

Repository: bootjp/elastickv

Length of output: 1859


🏁 Script executed:

# Check initialization order in main.go specifically
rg -n "WithPartitionResolver|Start\|Run\|Listen" main.go | head -20

Repository: bootjp/elastickv

Length of output: 211


🏁 Script executed:

# Verify if resolveGroup is called during request processing (proving it's hot path)
rg -n "resolveGroup" kv/shard_router.go -B 2 -A 2

Repository: bootjp/elastickv

Length of output: 1359


🏁 Script executed:

# Check if there's any dynamic resolver updates after initialization
rg -n "WithPartitionResolver\|partitionResolver\s*=" --type go | grep -v "test" | head -20

Repository: bootjp/elastickv

Length of output: 42


Add synchronization for partition resolver access; prevent typed-nil bypass.

WithPartitionResolver directly mutates s.partitionResolver while resolveGroup reads it on the request hot path without any synchronization. Although the code documents an assumption that the resolver is set only during startup before requests arrive, this architectural constraint is not enforced and Go's interface semantics do not guarantee safety: interface assignments involve multiple words and are not atomic. If resolver updates ever occur after startup, a data race becomes possible. Additionally, a typed-nil resolver (e.g., var r *MyResolver; r == nil is true, but assigning r to an interface field makes the interface check fail) can bypass the nil check in resolveGroup, allowing stale resolver behavior.

Protect the field with the existing mutex during both read and write:

Proposed fix
 func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
-	s.partitionResolver = r
+	s.mu.Lock()
+	defer s.mu.Unlock()
+	s.partitionResolver = r
 	return s
 }
 
 func (s *ShardRouter) resolveGroup(key []byte) (uint64, bool) {
-	if s.partitionResolver != nil {
-		if gid, ok := s.partitionResolver.ResolveGroup(key); ok {
+	s.mu.RLock()
+	resolver := s.partitionResolver
+	s.mu.RUnlock()
+	if resolver != nil {
+		if gid, ok := resolver.ResolveGroup(key); ok {
 			return gid, true
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
s.partitionResolver = r
return s
}
// resolveGroup tries the partition resolver first (when installed),
// then falls through to the byte-range engine. Returns the resolved
// Raft group ID and a found flag; (0, false) means no route in either
// the resolver or the engine — the caller surfaces this as an
// "unknown group" error.
func (s *ShardRouter) resolveGroup(key []byte) (uint64, bool) {
if s.partitionResolver != nil {
if gid, ok := s.partitionResolver.ResolveGroup(key); ok {
func (s *ShardRouter) WithPartitionResolver(r PartitionResolver) *ShardRouter {
s.mu.Lock()
defer s.mu.Unlock()
s.partitionResolver = r
return s
}
// resolveGroup tries the partition resolver first (when installed),
// then falls through to the byte-range engine. Returns the resolved
// Raft group ID and a found flag; (0, false) means no route in either
// the resolver or the engine — the caller surfaces this as an
// "unknown group" error.
func (s *ShardRouter) resolveGroup(key []byte) (uint64, bool) {
s.mu.RLock()
resolver := s.partitionResolver
s.mu.RUnlock()
if resolver != nil {
if gid, ok := resolver.ResolveGroup(key); ok {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/shard_router.go` around lines 68 - 80, Protect s.partitionResolver with
the shard router mutex when writing and reading: in WithPartitionResolver(lock
using the existing mutex) set s.partitionResolver under the lock, and in
resolveGroup(lock the mutex while loading the resolver). Also guard against a
typed-nil resolver by checking the interface value for nilness via reflect
(e.g., reflect.ValueOf(r).IsNil()) before storing or before using the loaded
interface; if the provided resolver is a typed nil treat it as nil (do not
store) so resolveGroup’s nil-check behaves correctly.

…lver

PR #715 round 1 review caught two correctness bugs and two minor
items. This commit addresses all four.

1) P1 (Codex): routeKey collapsed every !sqs|... key to
   !sqs|route|global before resolveGroup ran, so the resolver
   never saw a partitioned-prefix key — partitioned-FIFO traffic
   silently fell through to the engine's default group.

   ResolveGroup now runs on the RAW key. The fallback to the
   engine still uses routeKey-normalised input (so the engine's
   per-shard layout is unchanged for legacy traffic). New
   regression test TestShardRouter_ResolverSeesRawKeyNotNormalized
   uses a recording resolver to assert the raw key reaches it.

2) HIGH (Gemini): ShardedCoordinator's per-key helpers
   (groupForKey, routeAndGroupForKey, engineGroupIDForKey,
   groupMutations) called c.engine.GetRoute directly, bypassing
   the resolver. This left 2PC + read paths with unrouted
   partitioned traffic.

   Each helper now consults c.router.ResolveGroup for the gid.
   The engine is queried separately for the catalog RouteID
   (keyviz observation) — the resolver is opaque to keyviz today;
   partition-aware heatmap is a Phase 3.D follow-up.

3) MEDIUM (Gemini, perf): sqsResolverFamilyPrefixes was []string,
   forcing []byte(prefix) per check on the request hot path.
   Pre-converted to [][]byte; stripPartitionedFamilyPrefix loops
   over byte slices directly.

4) MEDIUM (Gemini, doc): WithPartitionResolver claimed concurrent
   reads were safe, which is incorrect — interface assignment is
   not atomic in Go. Comment now reflects the startup-only intent.

ResolveGroup is now exported so ShardedCoordinator's helpers can
share the resolver-first dispatch path with Commit / Abort / Get.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 2 — addresses all four review findings:

  1. P1 (Codex) — resolver vs routeKey collapse: routeKey collapses every !sqs|... key to !sqs|route|global, so the resolver never saw a partitioned-prefix key and the partition path was a no-op for Commit/Abort/Get. Fix: ResolveGroup now runs on the RAW key; the engine fallback keeps using routeKey-normalized input so legacy routing is unchanged. New TestShardRouter_ResolverSeesRawKeyNotNormalized uses a recording resolver to assert the raw key reaches it.

  2. HIGH (Gemini) — coordinator helpers bypass resolver: groupMutations, engineGroupIDForKey, groupForKey, routeAndGroupForKey were calling c.engine.GetRoute directly — 2PC + read paths missed the resolver. Each now consults c.router.ResolveGroup. RouteID for keyviz observation is read from the engine separately (partition-aware heatmap is a follow-up).

  3. MEDIUM (Gemini, perf)sqsResolverFamilyPrefixes converted from []string to [][]byte so the hot-path bytes.HasPrefix skips the per-call []byte(prefix) allocation.

  4. MEDIUM (Gemini, doc)WithPartitionResolver comment corrected: interface assignment is not atomic in Go; the method is intended for startup wiring only.

ResolveGroup is exported so coordinator helpers and router internals share one dispatch path.

go test -race ./kv/... and golangci-lint clean.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 48s —— View job


Code Review: PR #715 Round 2 — feat(sqs): partition resolver for HT-FIFO routing

Tasks


All four Round 2 fixes are correctly applied. Two issues carry forward.


Fix #1 verified — resolver sees raw key ✓

groupRequests now passes r.Mutations[0].Key directly to s.ResolveGroup(rawKey), with the engine fallback calling routeKey(rawKey) separately (shard_router.go:105). Get does the same via s.ResolveGroup(key) (shard_router.go:213). TestShardRouter_ResolverSeesRawKeyNotNormalized correctly pins this with a recordingResolver that captures the key bytes before and after dispatch — the invariant is solid.


Fix #2 verified — coordinator helpers consult resolver ✓

groupForKey, routeAndGroupForKey, engineGroupIDForKey, and groupMutations all now call c.router.ResolveGroup(key) instead of c.engine.GetRoute(routeKey(key)). This also covers the transitive callers: allReadKeysInShard (line 433), groupReadKeysByShardID (line 900), and commitPrimaryTxn (line 499).

The double engine lookup in routeAndGroupForKey and groupMutations — resolve for group, then separately call c.engine.GetRoute(routeKey(key)) for the RouteID — is the right approach for now. The comment naming partition-aware keyviz as a Phase 3.D follow-up is correct.

Missing regression test (per CLAUDE.md convention). The Gemini HIGH defect was a coordinator-level bypass — the fix is at the coordinator level — but no ShardedCoordinator-level test pins the regression. The router tests (TestShardRouter_*) cover the dispatch logic, but they don't exercise the path groupMutationsc.router.ResolveGroup. CLAUDE.md: "When code review surfaces a defect, first add a failing test that reproduces the issue, then make it pass with the fix." A coordinator-level integration test that installs a resolver and verifies Dispatch / GetAt routes to the resolver's group would close this gap.


Fix #3 verified — [][]byte prefix precomputation ✓

sqsResolverFamilyPrefixes is [][]byte{[]byte(SqsPartitionedMsgDataPrefix), …} at package init — the hot-path bytes.HasPrefix(key, prefix) call no longer converts on each invocation. TestSQSPartitionResolver_PrefixesAlign still pins the five-family alignment.


Fix #4 verified — WithPartitionResolver atomicity note ✓

The comment now correctly documents that interface assignment is not atomic and that the method is for startup wiring only (shard_router.go:57–71). Good.


Typed-nil still present — buildSQSPartitionResolver (medium, carry-forward)

File: main.gobuildSQSPartitionResolver

The return type is still *adapter.SQSPartitionResolver. On an empty partitionMap it returns a typed nil (*adapter.SQSPartitionResolver(nil)). When that is passed to WithPartitionResolver(r PartitionResolver), Go wraps it in a non-nil interface, so s.partitionResolver != nil is true on every request even on a non-partitioned cluster.

This is functionally safe because the nil-receiver guard in SQSPartitionResolver.ResolveGroup returns (0, false) immediately. But two comments remain incorrect:

  • buildSQSPartitionResolver doc: "kv.ShardRouter.WithPartitionResolver(nil) is a documented no-op, so the request hot path keeps the existing engine-only dispatch." → wrong; the resolver is called on every request.
  • WithPartitionResolver comment: "A nil resolver clears any previously-installed resolver." → wrong when the nil is typed.

The fix is one line — change the return type:

// Before
func buildSQSPartitionResolver(partitionMap map[string]sqsFifoQueueRouting) *adapter.SQSPartitionResolver {

// After
func buildSQSPartitionResolver(partitionMap map[string]sqsFifoQueueRouting) kv.PartitionResolver {

With kv.PartitionResolver as the return type, return nil produces a true nil interface, WithPartitionResolver(nil) clears the field correctly, and s.partitionResolver != nil is false on non-partitioned clusters. The adapter.NewSQSPartitionResolver(flat) call already returns a *SQSPartitionResolver, which satisfies the interface. No other changes needed.

Fix this →


Nit — engineGroupIDForKey naming

The method now routes through the resolver, not just the engine (sharded_coordinator.go:886). The name no longer matches its behavior. A follow-up rename to resolveGroupIDForKey or similar would prevent future readers from trusting the engine-only assumption. Low priority, can defer.


Positive observations

  • recordingResolver copies each key via append([]byte(nil), key...) before storing — race-safe and correct.
  • TestShardRouter_ResolverSeesRawKeyNotNormalized is a strong regression anchor: it would catch any future regression where routeKey is accidentally called before the resolver.
  • routeAndGroupForKey double-lookup (resolve for gid, then engine for RouteID) is the right pattern — avoids introducing a keyviz regression for legacy SQS traffic while deferring partition-aware heatmap to Phase 3.D.
  • The nil-receiver guard on (*SQSPartitionResolver).ResolveGroup makes the typed-nil situation safe-but-suboptimal rather than a panic risk.

Verdict: Fix the typed-nil return type in buildSQSPartitionResolver (one-line change) and add a coordinator-level regression test for the Gemini HIGH fix. Everything else is clean. The P1 raw-key fix and the coordinator helper fix are both correctly implemented and the new tests provide good coverage.

Round 1 Claude review on PR #715 caught a typed-nil interface bug:
buildSQSPartitionResolver had return type *adapter.SQSPartitionResolver,
so on a non-partitioned cluster it returned a typed-nil pointer.
When that pointer was passed into kv.ShardRouter.WithPartitionResolver
(parameter type kv.PartitionResolver), Go wrapped it into a NON-NIL
interface — the resolver-first short-circuit
`s.partitionResolver != nil` would always pass on every request,
defeating the "non-partitioned cluster keeps engine-only hot path"
contract from the PR description.

The (*SQSPartitionResolver).ResolveGroup nil-receiver guard kept this
functionally safe (correct routing) but not free (extra map lookup
per request).

Fix

- Return type changed from *adapter.SQSPartitionResolver to
  kv.PartitionResolver. Untyped `nil` returns now propagate as a
  true nil interface, the short-circuit fires correctly, and the
  hot path stays engine-only.
- Defensive nil guard after NewSQSPartitionResolver in case
  canonicalisation collapses every entry — the typed pointer from
  the constructor would otherwise wrap to a non-nil interface
  even when its underlying pointer is nil.
- New main_sqs_resolver_test.go regression test:
  TestBuildSQSPartitionResolver_NilOnEmpty uses requireNilInterface
  to force the kv.PartitionResolver conversion at the call boundary
  (a plain require.Nil on the concrete pointer would pass even with
  the bug present, since the pointer itself IS nil — only the
  interface wrap exposes the failure mode).
- TestBuildSQSPartitionResolver_NonEmptyReturnsResolver pins the
  happy path so a future "always return nil" regression is caught.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 3 — addresses Round 1 Claude's typed-nil interface bug:

MEDIUM: buildSQSPartitionResolver returned the concrete *adapter.SQSPartitionResolver. On a non-partitioned cluster the typed-nil pointer wrapped into a non-nil kv.PartitionResolver interface, so s.partitionResolver != nil was always true and every request paid an extra ResolveGroup call (safe via the nil-receiver guard but not free, contradicting the PR description's "hot path stays engine-only" claim).

Fix:

  • Return type → kv.PartitionResolver interface so untyped nil propagates correctly.
  • Defensive nil guard after NewSQSPartitionResolver for the empty-after-canonicalisation case.
  • New TestBuildSQSPartitionResolver_NilOnEmpty uses requireNilInterface (a function-parameter forcing the interface conversion at the call boundary) to catch the regression — a plain require.Nil on the concrete pointer would pass even with the bug present, only the interface wrap exposes the failure.

go test -race and golangci-lint clean. Round 2's correctness fixes (raw-key resolver, coordinator helper plumbing, [][]byte perf, doc) all stand from the previous push.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 6m 49s —— View job


Code Review: PR #715 Round 3 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify typed-nil fix in buildSQSPartitionResolver
  • Verify TestBuildSQSPartitionResolver_NilOnEmpty regression test
  • Check requireNilInterface helper correctness
  • Verify all Round 2 fixes still intact
  • Post final review

Fix verified — buildSQSPartitionResolver return type ✓

main.go — The return type is now kv.PartitionResolver (the interface). For an empty partition map return nil produces a true nil interface value; s.partitionResolver != nil in resolveGroup correctly short-circuits. The defensive nil guard after adapter.NewSQSPartitionResolver(flat) is belt-and-suspenders but harmless. The doc comment explaining the typed-nil gotcha is clear and accurate. Fix is correct.


Regression test helper is subtly wrong (minor)

main_sqs_resolver_test.go:68-71

func requireNilInterface(t *testing.T, r kv.PartitionResolver, msg string) {
    t.Helper()
    require.Nil(t, r, msg)  // ← uses reflect-based nil detection
}

The comment above this function says:

"The function-parameter conversion forces a typed-nil pointer to be wrapped into a non-nil interface, which is exactly the failure mode the regression test guards against."

This is incorrect. Testify's require.Nil uses reflect.ValueOf(object) and calls value.IsNil() on the underlying kind. When a kv.PartitionResolver holding a (*SQSPartitionResolver)(nil) is passed to require.Nil(t, object interface{}), Go unwraps the intermediate interface and the interface{} parameter sees concrete type *SQSPartitionResolver, nil pointer. Testify's IsNil() check returns true for a nil pointer, so require.Nil passes regardless of whether the interface itself is nil or is a typed-nil wrapper.

Concretely: if someone reverted the return type of buildSQSPartitionResolver back to *adapter.SQSPartitionResolver (the bug), TestBuildSQSPartitionResolver_NilOnEmpty would still passrequireNilInterface does not catch the regression it claims to catch.

The correct implementation uses Go's == operator, which checks both the type AND value of an interface:

func requireNilInterface(t *testing.T, r kv.PartitionResolver, msg string) {
    t.Helper()
    // require.Nil uses reflect-based IsNil and considers a nil pointer inside a
    // non-nil interface as "nil". Use require.True with r == nil to get Go's
    // interface nil check (both type AND value must be nil).
    require.True(t, r == nil, msg)
}

With this change:

  • Fixed code (true nil interface): r == nil is true → passes ✓
  • Buggy code (typed nil pointer boxed at call site): the interface has type=*SQSPartitionResolver, value=nil → r == nil is false → fails, catching the regression ✓

Fix this →


Round 2 fixes all intact ✓

  • Raw-key resolver path (groupRequests / Get): ✓ confirmed in diff
  • Coordinator helpers (groupForKey, routeAndGroupForKey, engineGroupIDForKey, groupMutations): ✓ all route through c.router.ResolveGroup(key)
  • [][]byte prefix precomputation: ✓ confirmed
  • WithPartitionResolver atomicity doc: ✓ confirmed

Coordinator-level regression test still missing (carry-forward from Round 2)

The Gemini HIGH defect was that groupMutations / groupForKey / engineGroupIDForKey bypassed the resolver. CLAUDE.md: "when code review surfaces a defect, first add a failing test that reproduces the issue, then make it pass with the fix." The router-layer tests (TestShardRouter_*) cover groupRequests and Get, but no ShardedCoordinator-level test pins groupMutations or groupForKey going through the resolver. A coordinator integration test — installs a resolver and calls Dispatch/GetAt, asserts the request lands on the resolver's group — would close this gap.

This has been noted in both Round 1 and Round 2 without being addressed. Given htfifoCapabilityAdvertised = false, the risk remains latent; flagging for completeness per CLAUDE.md convention.


Positive observations

  • The return type change from *adapter.SQSPartitionResolver to kv.PartitionResolver is exactly right — one line, addresses the root cause cleanly.
  • The doc block in buildSQSPartitionResolver explaining the typed-nil gotcha with concrete example language is excellent — worth keeping even after the test is corrected.
  • TestBuildSQSPartitionResolver_NonEmptyReturnsResolver pins the happy path correctly with require.NotNil.
  • All Round 2 fixes (raw-key resolver, coordinator helper plumbing, [][]byte perf) remain intact.
  • The existing TestShardRouter_* tests continue to cover the P1 raw-key invariant and all dispatch paths.

Verdict: Fix requireNilInterface to use require.True(t, r == nil, msg) — the current require.Nil doesn't distinguish a true nil interface from a typed-nil pointer in an interface, so the regression test doesn't catch the regression it claims to. Everything else in Round 3 is correct. The coordinator-level regression test remains outstanding from Round 2.

Round 2 Claude review on PR #715 flagged that the Gemini-HIGH fix
(ShardedCoordinator's groupMutations / groupForKey / etc. now route
through c.router.ResolveGroup) had no coordinator-level regression
test — the existing TestShardRouter_* tests pin the dispatch logic
at the router layer but don't exercise the path Dispatch →
groupMutations → c.router.ResolveGroup.

Per CLAUDE.md ("when code review surfaces a defect, first add a
failing test that reproduces the issue, then make it pass with the
fix"), this commit lands the missing test.

Two new tests in kv/sharded_coordinator_partition_test.go:

- TestShardedCoordinator_DispatchHonoursPartitionResolver pins the
  Gemini HIGH fix: with the engine routing everything to group 1
  but the resolver claiming a specific key for group 42, Dispatch
  on that key MUST hit group 42's recordingTransactional. Before
  the round-2 fix the request would have landed on group 1 because
  groupMutations called c.engine.GetRoute directly. Also asserts
  the resolver received the RAW partitioned key — pins the
  codex-P1 fix at the coordinator-call boundary.

- TestShardedCoordinator_DispatchFallsThroughForUnclaimedKeys pins
  the inverse: keys NOT claimed by the resolver continue to route
  via the byte-range engine. Without this, the resolver-first
  short-circuit could mask engine routing decisions.

stubResolver is a kv-internal PartitionResolver double so the
tests don't pull in the adapter package. Each call records the
raw key bytes (defensive copy) so concurrent reads stay race-safe
under -race.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f3a643406

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/shard_router.go
Comment on lines +101 to +105
// Engine routes against the user-key view of the byte-range
// space; routeKey may rewrite SQS / DynamoDB / Redis-internal
// keys to a stable per-table or per-namespace route key so the
// engine sees one route per logical entity.
route, ok := s.engine.GetRoute(routeKey(rawKey))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block engine fallback on unresolved partitioned SQS keys

When a partition resolver is installed, unresolved partitioned keys currently fall through to engine.GetRoute(routeKey(rawKey)). For !sqs|... keys, routeKey normalizes to the global SQS route, so cases like unknown queue or out-of-range partition (which SQSPartitionResolver returns as ok=false) get silently routed to the engine default shard instead of failing closed. This can misroute HT-FIFO traffic during partition-map drift or partial rollout, causing writes/reads to land on the wrong group rather than returning a routing error.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 4 — addresses Round 2 Claude's missing-regression-test note:

CLAUDE.md gap: the Gemini HIGH fix (ShardedCoordinator's helpers route through c.router.ResolveGroup) had no coordinator-level regression. Existing TestShardRouter_* tests pin the router-layer dispatch but don't exercise Dispatch → groupMutations → c.router.ResolveGroup.

New tests in kv/sharded_coordinator_partition_test.go:

  • TestShardedCoordinator_DispatchHonoursPartitionResolver: engine routes everything to group 1, resolver claims a specific key for group 42; Dispatch on that key MUST hit group 42 (proves groupMutations no longer bypasses the resolver). Also asserts the resolver receives the RAW partitioned key — pins the codex-P1 fix at the coordinator boundary.
  • TestShardedCoordinator_DispatchFallsThroughForUnclaimedKeys: keys NOT claimed by the resolver continue to route via the byte-range engine.

stubResolver is kv-internal so the tests don't pull in adapter. Race-safe defensive copy of each call key.

go test -race ./kv/... and golangci-lint clean.

Round 3's typed-nil fix (return type kv.PartitionResolver) and earlier rounds' fixes all stand from previous pushes.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 4m 39s —— View job


Code Review: PR #715 Round 4 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify coordinator-level regression tests (kv/sharded_coordinator_partition_test.go)
  • Verify typed-nil fix still intact (main.go / main_sqs_resolver_test.go)
  • Check Codex Round 2 P1 (engine fallback for unresolved partitioned keys)
  • Review overall PR state — any remaining issues?
  • Post final review

All previous-round fixes are intact. Round 4 adds two coordinator tests. One precision note on the test's coverage of the groupMutations path.


All previous fixes verified ✓

  • Typed-nil (main.go:506) — buildSQSPartitionResolver returns kv.PartitionResolver interface; TestBuildSQSPartitionResolver_NilOnEmpty / requireNilInterface correctly pins this.
  • Raw-key resolver (shard_router.go:95) — ResolveGroup runs on the raw key; engine fallback uses routeKey(rawKey) separately.
  • Coordinator helpers (sharded_coordinator.go:847,870,886,1067) — all four methods call c.router.ResolveGroup(key).
  • [][]byte prefix precomputation (sqs_partition_resolver.go:56) — hot-path bytes.HasPrefix no longer allocates per call.
  • Startup-only doc (shard_router.go:62) — WithPartitionResolver comment correctly documents non-atomic assignment and startup-only intent.

Test gap: TestShardedCoordinator_DispatchHonoursPartitionResolver doesn't actually regress the groupMutations fix (minor)

The test's comment says it "proves groupMutations no longer bypasses the resolver," but it doesn't. Here's why:

For a single-mutation batch, the test would pass even if groupMutations still called c.engine.GetRoute(routeKey(key)) directly:

  1. Bug path: groupMutations → engine → gid=1, grouped={1:[mut]}, rawLogs creates one request with key "!sqs|msg|data|p|partitioned-key".
  2. Then router.Commit([req])groupRequestss.ResolveGroup("!sqs|msg|data|p|partitioned-key") → resolver returns (42, true) → request goes to g42.
  3. Result: g1Count=0, g42Count=1 → test passes (same as the fixed path).

router.Commit's own groupRequests re-routes by raw key, rescuing the situation for single-mutation dispatches regardless of what groupMutations did. The test is effectively pinning the router.Commit path, not groupMutations.

To truly regress the groupMutations bug, the test would need two mutations for different partitions (e.g., partition 0 → group 42, partition 1 → group 43):

// Before fix: groupMutations lumps both mutations into engine-default gid=1
// → rawLogs creates ONE request with both mutations
// → router.Commit routes based on Mutations[0].Key only
// → only one group receives the request, the other is mis-routed
// → FAILS: g43Count==0, but g42Count==1 with both mutations

// After fix: groupMutations splits into {42:[mut0], 43:[mut1]}
// → rawLogs creates TWO requests
// → router.Commit routes each correctly
// → PASSES: both g42Count==1 and g43Count==1

The 2PC path (dispatchTxnprewriteTxn / commitPrimaryTxn) uses grouped from groupMutations directly without any router.Commit re-routing, so that path is also not covered.

The existing test is valuable — it proves coordinator-level wiring, resolver consultation with the raw key, and the WithPartitionResolver fluent interface — but doesn't satisfy the CLAUDE.md requirement that a reviewer-identified defect regression "must reproduce the issue" before the fix.

Recommendation: Either add a second sub-test with two differently-partitioned mutations (which would fail with the old groupMutations and pass with the fix), or update the test comment to accurately describe what it pins (WithPartitionResolver wiring and dispatch routing — not groupMutations specifically).

Fix this →


Codex Round 2 P1 (carry-forward, acceptable at this phase)

Unresolved-partitioned keys still fall through to the engine and normalize to !sqs|route|global. The ShardRouter.ResolveGroup doc comment ("a partitioned-prefix key whose queue is missing from the resolver map fails closed") is inaccurate — the engine route for !sqs|route|global exists, so it doesn't fail closed. Given htfifoCapabilityAdvertised = false, this is latent rather than active, but the doc comment should be corrected or the claim removed before 4-B-3 flips the flag.


Positive observations

  • stubResolver.calls records defensive copies per call (same race-safe pattern as recordingResolver in the router tests). ✓
  • TestShardedCoordinator_DispatchFallsThroughForUnclaimedKeys is a clean inversion test — the resolver doesn't claim "x", engine routes to group 2, assertion is tight.
  • ShardedCoordinator.WithPartitionResolver delegation to c.router.WithPartitionResolver means there's a single field to update.
  • The coordinator test stays in package kv and uses stubResolver (not adapter.SQSPartitionResolver), correctly preserving the kvadapter isolation boundary.

Verdict: Fix the test comment (or add the two-partition sub-test) and correct the ShardRouter.ResolveGroup doc comment about fail-closed behavior. Everything else is clean. The PR is otherwise ready.

Round 2 Codex P1 on PR #715: when a partition resolver is
installed, an unresolved partitioned-shape key (unknown queue /
out-of-range partition) silently falls through to
engine.GetRoute(routeKey(rawKey)). For !sqs|... keys, routeKey
collapses to !sqs|route|global, so the engine routes the
misconfiguration to the SQS catalog default group instead of
returning a routing error. During partition-map drift / partial
rollout this would silently mis-route HT-FIFO traffic.

Fix

- kv.PartitionResolver gains RecognisesPartitionedKey(key) — a
  shape-only predicate. Implementations answer purely on prefix /
  structural inspection so the router can use it independently
  of the in-memory mapping.
- ShardRouter.ResolveGroup: after ResolveGroup returns ok=false,
  consult RecognisesPartitionedKey. If the resolver recognises the
  shape, return (0, false) WITHOUT engine fallback — the caller
  surfaces a routing error. If the shape is not recognised, fall
  through to the engine as before.
- adapter.SQSPartitionResolver: implements RecognisesPartitionedKey
  via parsePartitionedSQSKey (the same parser ResolveGroup uses).
  Nil-receiver and empty-key return false.

Tests

- kv/shard_router_partition_test.go: new
  TestShardRouter_FailClosedOnRecognisedButUnresolved — pins that
  a recognised-but-unresolved partitioned key returns (0, false)
  AND that a non-recognised key still falls through to the engine.
- adapter/sqs_partition_resolver_test.go: new
  TestSQSPartitionResolver_RecognisesPartitionedKey covers 8
  shape cases (data/vis/byage families with known + unknown
  queues, OOR partition, legacy SQS, queue meta, non-SQS, empty,
  nil). New TestSQSPartitionResolver_RecognisesPartitionedKey_NilReceiver
  pins the typed-nil-safe branch.
- Renamed and expanded
  TestSQSPartitionResolver_UnknownQueueRecognisedButUnresolved /
  TestSQSPartitionResolver_OutOfRangePartitionRecognisedButUnresolved
  to assert RecognisesPartitionedKey == true, pinning the router-
  side fail-closed contract.
- Existing fakePartitionResolver / stubResolver / recordingResolver
  test doubles get the new method (the kv-internal tests don't
  depend on the adapter resolver).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 5 — addresses Codex round 2 P1:

P1 (Codex): when the resolver returns (0, false) for a partitioned-shaped key (unknown queue / OOR partition), the engine fall-through silently routed via routeKey's !sqs|route|global collapse to the SQS catalog default group. During partition-map drift or partial rollout this would mis-route HT-FIFO traffic instead of failing closed.

Fix:

  • kv.PartitionResolver gains RecognisesPartitionedKey(key) bool — a shape-only predicate.
  • ShardRouter.ResolveGroup: after ResolveGroup returns ok=false, consult RecognisesPartitionedKey. If the shape is recognised, fail closed; otherwise fall through to the engine as before.
  • adapter.SQSPartitionResolver implements the predicate via parsePartitionedSQSKey (same parser as ResolveGroup).

Tests:

  • TestShardRouter_FailClosedOnRecognisedButUnresolved — pins both branches.
  • TestSQSPartitionResolver_RecognisesPartitionedKey — 8 shape cases (data/vis/byage families × known/unknown queues × OOR partition × legacy SQS × queue meta × non-SQS × empty × nil).
  • Existing unknown-queue / OOR-partition tests renamed and now also assert RecognisesPartitionedKey == true to pin the router-side fail-closed contract.

go test -race ./kv/... and golangci-lint clean. All previous rounds' fixes (raw-key resolver, coordinator helper plumbing, [][]byte perf, doc, typed-nil interface, coordinator regression test) stand from earlier pushes.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 4m 47s —— View job


Code Review: PR #715 Round 5 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify RecognisesPartitionedKey interface extension
  • Verify ShardRouter.ResolveGroup fail-closed logic
  • Verify SQSPartitionResolver.RecognisesPartitionedKey implementation
  • Verify new tests and renamed tests
  • requireNilInterface fix (carry-forward Round 3) — committed locally, could not push (remote sync issue — see below)
  • Post final review

All Round 5 changes are correctly implemented. One previously-unfixed carry-forward from Round 3 has been corrected in the local commit 049ee72 (diff below) — the push was blocked because git pull/fetch requires approval; please cherry-pick or apply manually.


Round 5 fixes verified ✓

PartitionResolver interface extension (kv/shard_router.go:36–46)RecognisesPartitionedKey(key []byte) bool is correctly defined as a shape-only predicate. The doc comment explains why it must not depend on the in-memory map (otherwise the fail-closed guarantee breaks under partial rollouts) — this is precise and important.

ShardRouter.ResolveGroup fail-closed logic (shard_router.go:123–146) — The three-way dispatch is correct:

  1. Resolver resolves → return its group ✓
  2. Resolver misses but recognises shape → fail closed (return (0, false)) ✓
  3. Resolver absent or shape not recognised → engine fallback ✓

The resolveGroupResolveGroup export is correct: the coordinator helpers (groupForKey, routeAndGroupForKey, engineGroupIDForKey, groupMutations) call c.router.ResolveGroup and that now goes through the same fail-closed logic.

SQSPartitionResolver.RecognisesPartitionedKey (sqs_partition_resolver.go:113–119) — Delegates to parsePartitionedSQSKey, which checks: partitioned prefix present → | terminator at a positive offset → ≥4 bytes after the terminator → valid base64 queue segment. This is a structural check independent of the routes map. The nil-receiver guard returns false, which is correct — a typed-nil resolver claiming to recognise a shape would wrongly trigger fail-closed for every partitioned-prefix key.

fakePartitionResolver.RecognisesPartitionedKey (shard_router_partition_test.go:37–43) — Uses a recognisedPrefix field for test control. recordingResolver.RecognisesPartitionedKey returns false unconditionally — correct, those tests don't exercise the fail-closed path.

TestShardRouter_FailClosedOnRecognisedButUnresolved (shard_router_partition_test.go:202–232) — Pins both branches: recognised + unresolved → (0, false); non-recognised + unresolved → engine default. This is the direct regression for the Codex P1 fix. ✓

Renamed tests (sqs_partition_resolver_test.go:157–192)TestSQSPartitionResolver_UnknownQueueRecognisedButUnresolved and TestSQSPartitionResolver_OutOfRangePartitionRecognisedButUnresolved now assert RecognisesPartitionedKey(key) == true in addition to the existing ResolveGroup assertions. This pins the router-side contract at the adapter boundary — any future refactor that accidentally breaks the recognised-but-unresolved pairing will fail these. ✓

TestSQSPartitionResolver_RecognisesPartitionedKey (8 cases) — Covers the three true cases (data/vis/byage × known/unknown/OOR) and five false cases (legacy SQS, non-SQS, queue meta, empty, nil). ✓

TestSQSPartitionResolver_RecognisesPartitionedKey_NilReceiver — Guards the typed-nil case. ✓


Carry-forward fix: requireNilInterface must use r == nil not require.Nil (Round 3)

main_sqs_resolver_test.go:68–71 — Still uses require.Nil(t, r, msg). This has been flagged in Round 3 and Round 4 without being addressed.

Why it's broken: Testify's require.Nil resolves through reflect.ValueOf(object) and calls IsNil() on the underlying pointer kind. When a *SQSPartitionResolver(nil) is boxed inside a kv.PartitionResolver interface and then passed to require.Nil(t, object interface{}), Go unwraps the intermediate interface so the interface{} parameter receives concrete type *SQSPartitionResolver with nil value. Reflect's IsNil() returns true for a nil pointer — so the test passes even with the bug (return type reverted to *adapter.SQSPartitionResolver).

Go's == operator on interface values checks both the dynamic type and the value; r == nil is false for a non-nil interface wrapping a nil pointer, which is exactly the regression we want to catch.

Committed locally as 049ee72 (could not push — please apply):

// requireNilInterface accepts a kv.PartitionResolver and asserts
// the interface value (NOT just the underlying pointer) is nil.
// The function-parameter conversion forces a typed-nil pointer to
// be wrapped into a non-nil interface, which is exactly the
// failure mode the regression test guards against.
+//
+// require.Nil(t, r) is NOT used here because testify's IsNil check
+// uses reflect.ValueOf and calls IsNil() on the underlying pointer
+// kind — it treats a typed-nil *SQSPartitionResolver boxed inside a
+// non-nil kv.PartitionResolver interface as "nil", so the regression
+// would pass undetected. require.True(t, r == nil) uses Go's own
+// interface equality: both the dynamic type AND the value must be
+// nil for the expression to be true, which is the invariant we need.
 func requireNilInterface(t *testing.T, r kv.PartitionResolver, msg string) {
     t.Helper()
-    require.Nil(t, r, msg)
+    require.True(t, r == nil, msg)
 }

Fix this →


Carry-forward: TestShardedCoordinator_DispatchHonoursPartitionResolver comment accuracy (Round 4)

The comment still says the test "proves groupMutations no longer bypasses the resolver." As noted in Round 4, for a single-mutation batch router.Commit's own groupRequests re-routes via ResolveGroup and rescues the dispatch regardless of what groupMutations did. The comment is inaccurate; the test pins the WithPartitionResolver wiring and coordinator dispatch at the call boundary, not groupMutations specifically. A two-partition sub-test would be the true regression anchor. Given htfifoCapabilityAdvertised = false, this is low-risk — but the misleading comment should be corrected before 4-B-3 enables live traffic.


Nit: RecognisesPartitionedKey falls through on malformed queue encoding

parsePartitionedSQSKey returns ok=false when decodeSQSSegment fails (invalid base64), so RecognisesPartitionedKey returns false for keys with a valid partitioned prefix and valid structure but a corrupt queue segment. The router then falls through to the engine, which routes via routeKey!sqs|route|global → catalog default group. This is the same mis-route pattern the fail-closed is designed to prevent, just for a garbled key rather than an unknown/OOR one. Raising as a nit since corrupt keys are unlikely in practice and htfifoCapabilityAdvertised = false keeps this latent. Worth revisiting in 4-B-3 when designing the final shape-recognition contract.


All previous-round fixes intact ✓

  • Typed-nil return type (main.go:506) — kv.PartitionResolver interface, untyped nil propagates correctly ✓
  • Raw-key resolver (shard_router.go:125, 238, 249) — ResolveGroup runs before any routeKey normalisation ✓
  • Coordinator helpers (sharded_coordinator.go:848,871,887,1067) — all four call c.router.ResolveGroup(key)
  • [][]byte prefix precomputation (sqs_partition_resolver.go:56) — no per-call allocation in hot path ✓
  • Startup-only doc (shard_router.go:82–87) — interface-assignment non-atomicity documented ✓

Verdict: Round 5 changes are correct. Apply the requireNilInterface one-liner fix (committed locally as 049ee72, could not push due to remote sync issue). The coordinator test comment inaccuracy is a follow-up for 4-B-3. The malformed-queue-encoding nit is low-risk while the capability flag is false. Once the requireNilInterface fix lands, all identified issues from Rounds 1–5 will be resolved.
| Branch

…test

Round 3 + Round 4 review on PR #715 caught two test gaps. Both
were real — the regressions they claim to catch were not actually
caught by the tests as written.

Round 3: requireNilInterface used testify's require.Nil

testify's require.Nil reflects through to the underlying pointer
and considers a nil pointer wrapped in a non-nil interface as
"nil". So if buildSQSPartitionResolver's return type were reverted
to *adapter.SQSPartitionResolver (the typed-nil bug), the
regression test would still pass — require.Nil on the typed nil
returns true.

Switch to require.True(t, r == nil, msg). Go's `==` operator on
the interface checks BOTH the type tag AND the value tag — it
only returns true for a true nil interface, which is the exact
invariant the typed-nil fix produces.

Round 4: TestShardedCoordinator_DispatchHonoursPartitionResolver
did not actually regress the groupMutations bypass

For a single-mutation batch, the test passes even if
groupMutations bypasses the resolver, because rawLogs produces
one pb.Request and router.Commit's groupRequests re-routes by
the raw key — the router rescues the mis-routing the coordinator
would have introduced.

To genuinely regress the bypass, the test must dispatch TWO
mutations belonging to TWO different partition groups. With the
buggy groupMutations both end up under the engine-default group,
rawLogs produces one request, and router.Commit puts both
mutations on whichever group claims Mutations[0].Key — the second
group receives nothing.

Added TestShardedCoordinator_DispatchSplitsMutationsByResolverGroup:
- Engine routes everything to group 1.
- Resolver claims keyP0 → group 42, keyP1 → group 43.
- Dispatch with [Put keyP0, Put keyP1].
- Asserts BOTH g42 and g43 each receive exactly one request.
- Pre-fix: g43 receives ZERO (bypass put both under group 1's
  rawLog, then router put them under g42 because of Mutations[0]).
- Post-fix: groupMutations splits via c.router.ResolveGroup, two
  separate requests, each group gets its own.

Updated TestShardedCoordinator_DispatchHonoursPartitionResolver
comment to reflect what it actually pins (WithPartitionResolver
wiring + raw-key dispatch, not groupMutations).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 6 — addresses Round 3 + Round 4 test gaps:

Round 3 catch (real bug): requireNilInterface used require.Nil which testify reflects through to the underlying pointer — a nil pointer wrapped in a non-nil interface still passes. So if the typed-nil fix were reverted, the regression test would NOT catch it. Switched to require.True(t, r == nil, msg) which uses Go's native interface comparison (both type AND value must be nil tagged) — that is the exact invariant the typed-nil fix produces.

Round 4 catch (real bug): TestShardedCoordinator_DispatchHonoursPartitionResolver does NOT regress the groupMutations bypass for single-mutation batches — router.Commit's groupRequests re-routes by raw key and rescues the mis-routing.

Added TestShardedCoordinator_DispatchSplitsMutationsByResolverGroup: dispatches TWO mutations claimed by two different resolver groups (42, 43). Pre-fix, groupMutations bundles both under the engine default; rawLogs produces one request; router.Commit routes by Mutations[0].Key → group 43 receives zero. Post-fix: groupMutations splits via c.router.ResolveGroup, two separate requests, both groups receive theirs. Asserts both groups got exactly one request — only the post-fix path passes.

Updated the original test's comment to reflect what it actually pins (WithPartitionResolver wiring + raw-key dispatch, not groupMutations).

go test -race ./kv/... ./adapter/... and golangci-lint clean.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 50s —— View job


Code Review: PR #715 Round 6 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify requireNilInterface fix (require.True(t, r == nil, msg))
  • Verify TestShardedCoordinator_DispatchSplitsMutationsByResolverGroup correctly regresses groupMutations bypass
  • Verify original test comment updated
  • Check all previous round fixes still intact
  • Post final review

Both Round 3 and Round 4 test gaps are correctly addressed. All previous fixes remain intact. This PR is ready to merge.


Fix verified — requireNilInterface uses r == nil

main_sqs_resolver_test.go:79

func requireNilInterface(t *testing.T, r kv.PartitionResolver, msg string) {
    t.Helper()
    require.True(t, r == nil, msg)
}

The fix is correct. The function parameter forces the interface conversion at the call boundary. require.True(t, r == nil) uses Go's native interface comparison — both type tag AND value must be nil — so a typed-nil *SQSPartitionResolver boxed in a kv.PartitionResolver interface evaluates to false and the test fails. The comment at lines 63–76 accurately explains the distinction between this and require.Nil (which reflects through to the underlying pointer and considers any nil pointer — typed or not — as "nil"). The explanation is clear and correct; it should prevent future maintainers from regressing this back to require.Nil.


Fix verified — TestShardedCoordinator_DispatchSplitsMutationsByResolverGroup is a genuine regression ✓

kv/sharded_coordinator_partition_test.go:137

The two-mutation setup (keyP0 → g42, keyP1 → g43) correctly exercises the groupMutations bypass:

Pre-fix path (groupMutations calls c.engine.GetRoute directly):

  1. Both keys normalize to !sqs|route|global via routeKey → engine routes both to gid=1.
  2. rawLogs creates ONE pb.Request with both mutations bundled under gid=1.
  3. router.CommitgroupRequests routes by Mutations[0].Key (keyP0) → g42 receives the combined request; g43 receives nothing.
  4. Assertion g43Count == 1 fails — correctly catching the bug.

Post-fix path (groupMutations calls c.router.ResolveGroup):

  1. keyP0 → resolver → gid=42; keyP1 → resolver → gid=43; grouped as {42:[mut0], 43:[mut1]}.
  2. rawLogs creates TWO requests, one per group.
  3. router.Commit routes each request to its group: g42=1, g43=1.
  4. All assertions pass ✓.

This test would have failed against the pre-fix groupMutations — it is a true regression anchor for the Gemini HIGH fix.


Updated test comment verified ✓

kv/sharded_coordinator_partition_test.go:54–65

The comment on TestShardedCoordinator_DispatchHonoursPartitionResolver now accurately documents that it pins "WithPartitionResolver fluent wiring + raw-key dispatch, NOT the groupMutations bypass regression" and directs readers to the 2-mutation test for the actual bypass regression.


All previous rounds' fixes intact ✓

  • Typed-nilbuildSQSPartitionResolver returns kv.PartitionResolver; empty/nil map returns true nil interface ✓
  • Raw-key resolverShardRouter.ResolveGroup runs the resolver on the raw key before routeKey normalization ✓
  • Coordinator helpersgroupForKey, routeAndGroupForKey, engineGroupIDForKey, groupMutations all call c.router.ResolveGroup(key)
  • [][]byte prefixessqsResolverFamilyPrefixes is [][]byte; hot-path bytes.HasPrefix allocates nothing per call ✓
  • Fail-closedRecognisesPartitionedKey + ResolveGroup short-circuit prevents engine fall-through for partitioned keys with unknown queue or out-of-range partition ✓
  • Startup-only docWithPartitionResolver comment correctly documents non-atomic assignment and startup-only intent ✓

Verdict: All six rounds of review findings are resolved. The PR is clean and ready to merge.

Round 5 review on PR #715 raised a nit: RecognisesPartitionedKey
delegated to parsePartitionedSQSKey, which fails when the queue
segment is malformed base64 or the partition segment is truncated.
For such corrupt-shape keys the predicate returned false, so the
router fell through to the engine and silently routed via routeKey's
!sqs|route|global collapse to the SQS catalog default group —
exactly the failure mode the round 5 fail-closed change was
introduced to prevent.

The fix is the structural-only intent the kv.PartitionResolver
contract already documents: "Implementations answer purely on
prefix / structural inspection". Make RecognisesPartitionedKey
match the intent — accept ANY key that starts with one of the
partitioned family prefixes, regardless of subsequent corruption.
ResolveGroup still returns (0, false) for malformed keys, and the
router pairs that with Recognised=true to fail closed.

Tests

- TestSQSPartitionResolver_RecognisesMalformedPartitionedKey:
  three sub-cases pin the new contract — prefix-only, prefix +
  invalid base64 queue segment, prefix + valid queue + '|' but
  truncated partition bytes. All assert Recognised=true and
  ResolveGroup ok=false, which is exactly the fail-closed pairing
  the router consumes.
- Existing TestSQSPartitionResolver_RecognisesPartitionedKey
  cases stay valid (the well-formed shapes still match).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 7 — addresses Round 5's malformed-queue-encoding nit:

Round 5 nit: RecognisesPartitionedKey delegated to parsePartitionedSQSKey, which fails when the queue segment is malformed base64 or the partition segment is truncated. For such corrupt-shape keys the predicate returned false → engine fall-through → silent mis-route via routeKey's !sqs|route|global collapse to the SQS catalog default group. That's exactly the failure mode the round 5 fail-closed change was introduced to prevent.

Fix: RecognisesPartitionedKey now does a prefix-only check (matches stripPartitionedFamilyPrefix), aligning with the kv.PartitionResolver contract's documented "purely on prefix / structural inspection" intent. ResolveGroup still returns (0, false) for malformed keys; the router pairs that with Recognised=true to fail closed.

New TestSQSPartitionResolver_RecognisesMalformedPartitionedKey (3 sub-cases): prefix-only, prefix + invalid base64, prefix + valid queue + truncated partition. All assert Recognised=true AND ResolveGroup ok=false.

go test -race ./kv/... ./adapter/... . and golangci-lint clean. All previous-round fixes from rounds 1–6 stand from earlier pushes.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 28s —— View job


Code Review: PR #715 Round 7 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify RecognisesPartitionedKey prefix-only fix
  • Verify ResolveGroup still returns (0, false) for malformed keys
  • Verify new malformed-key test coverage
  • Check all previous round fixes still intact
  • Post final review

Round 7 fix is correct. All seven rounds of findings are resolved. This PR is ready to merge.


Fix verified — RecognisesPartitionedKey is now prefix-only ✓

adapter/sqs_partition_resolver.go:114–119

The one-line change from parsePartitionedSQSKey to stripPartitionedFamilyPrefix is exactly right:

-_, _, ok := parsePartitionedSQSKey(key)
+_, ok := stripPartitionedFamilyPrefix(key)
 return ok

parsePartitionedSQSKey performed a full structural parse — it failed on a missing | terminator, an invalid base64 queue segment, or fewer than 4 partition bytes. Each of those failures returned false, allowing the router to fall through to the engine and silently route via routeKey's !sqs|route|global collapse. stripPartitionedFamilyPrefix does a pure prefix check — any key starting with one of the five partitioned family prefixes answers true, regardless of what follows. The router then pairs Recognised=true with ResolveGroup=(0,false) to fail closed.

The implementation now matches the PartitionResolver interface contract exactly: "Implementations answer based on prefix / structural inspection only — the answer must NOT depend on any in-memory mapping."


ResolveGroup still returns (0, false) for all three malformed cases ✓

  • Prefix-only (SqsPartitionedMsgDataPrefix with nothing after): parsePartitionedSQSKeypipeIdx <= 0return "", 0, false
  • Invalid base64 (!!!|): parsePartitionedSQSKeydecodeSQSSegment fails → return "", 0, false
  • Truncated partition (valid queue + | but no partition bytes): parsePartitionedSQSKeylen(rest) < 4return "", 0, false

New test is a genuine regression anchor ✓

adapter/sqs_partition_resolver_test.go:283–332

TestSQSPartitionResolver_RecognisesMalformedPartitionedKey correctly pins all three malformed-key shapes. Each sub-test asserts both:

  1. RecognisesPartitionedKey(key) == true (catches a regression back to parsePartitionedSQSKey)
  2. ResolveGroup(key) == (0, false) (confirms the router's fail-closed pairing is still intact)

The use of encodeSQSSegment in the third sub-case is correct — it's a package-internal function in adapter/sqs_keys.go:232, accessible from package adapter test files without any import.


All previous rounds' fixes intact ✓

  • Typed-nil (main.go:506) — buildSQSPartitionResolver returns kv.PartitionResolver; requireNilInterface uses require.True(t, r == nil, msg)
  • Raw-key resolver (shard_router.go:123) — ResolveGroup runs the resolver on the raw key before routeKey normalization ✓
  • Coordinator helpers (sharded_coordinator.go) — groupForKey, routeAndGroupForKey, engineGroupIDForKey, groupMutations all route through c.router.ResolveGroup(key)
  • Fail-closed (shard_router.go:128–135) — RecognisesPartitionedKey + ResolveGroup short-circuit prevents engine fall-through for partitioned keys with unknown queue, out-of-range partition, or now also malformed structure ✓
  • [][]byte prefixessqsResolverFamilyPrefixes is [][]byte; no per-call allocation ✓
  • requireNilInterface — uses require.True(t, r == nil, msg) with accurate comment explaining why require.Nil is insufficient ✓
  • Two-mutation coordinator test (TestShardedCoordinator_DispatchSplitsMutationsByResolverGroup) — genuine regression for groupMutations bypass: asserts both g42 and g43 receive exactly one request ✓
  • Startup-only doc (shard_router.go:82–87) — interface-assignment non-atomicity documented ✓

Verdict: All identified issues across Rounds 1–7 are resolved. The PR is clean and ready to merge.
| Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87561fd9f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go
Comment on lines 886 to 889
func (c *ShardedCoordinator) engineGroupIDForKey(key []byte) uint64 {
route, ok := c.engine.GetRoute(routeKey(key))
gid, ok := c.router.ResolveGroup(key)
if !ok {
return 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject unresolved read keys from partition resolver

engineGroupIDForKey now maps resolver misses to 0, but the txn read-key paths (allReadKeysInShard and groupReadKeysByShardID) treat gid==0 as non-fatal and simply skip those keys. With the new fail-closed resolver behavior, a partitioned key that is recognized but unresolved (e.g., queue missing from sqsFifoPartitionMap during drift/rollout) can be dropped from read-key validation instead of failing the transaction, so OCC checks no longer cover that read set and the write can commit with an incomplete conflict check.

Useful? React with 👍 / 👎.

Round 5 introduced fail-closed semantics in ShardRouter.ResolveGroup
(recognised-but-unresolved partitioned keys return ok=false instead
of falling through to the engine), but the read-key path inside
ShardedCoordinator was not audited for the new contract.

engineGroupIDForKey discards the resolver's ok flag and returns 0
for any failure. groupReadKeysByShardID then loops `if gid == 0
{ continue }`, silently dropping unrouted read keys from the
prewrite payload. With the new fail-closed resolver, a partitioned
read key whose queue has drifted out of --sqsFifoPartitionMap
(partial rollout / config drift) gets dropped from OCC validation.
The FSM never sees that key in ReadKeys, a concurrent write to the
same key commits without conflict, and SSI is broken.

Codex round-2 P1 on PR #715 caught this — addressed here.

Fix

groupReadKeysByShardID returns (map, error). Any read key that
fails to route surfaces as ErrInvalidRequest, dispatchTxn
propagates the error, and the transaction aborts before prewrite.
Calls c.router.ResolveGroup directly (rather than via
engineGroupIDForKey) so the (gid, ok) signal is preserved through
the boundary.

Tests

- TestGroupReadKeysByShardID_FailsClosedOnUnroutable replaces
  TestGroupReadKeysByShardID_SkipsUnroutableKeys (which had been
  pinning the buggy skip-silently behaviour). Asserts the new
  fail-closed contract: unroutable keys → error, no partial map.
- TestShardedCoordinator_TxnFailsClosedForUnresolvedReadKey is
  the coordinator-level regression: a transaction with a
  recognised-but-unresolved partitioned read key MUST abort
  before any prewrite. Asserts no group received an RPC.
- The three existing TestGroupReadKeysByShardID_* tests are
  updated for the new (map, error) signature.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

Round 8 — addresses Codex P1 missed in earlier rounds (見落としすみません):

P1 (Codex): Round 5 introduced fail-closed semantics in ShardRouter.ResolveGroup but the read-key path in ShardedCoordinator wasn't audited for the new contract. engineGroupIDForKey returns 0 for any failure (recognised-but-unresolved or unrecognised), and groupReadKeysByShardID had if gid == 0 { continue } which silently dropped unrouted read keys. Result: a partitioned read key whose queue has drifted out of --sqsFifoPartitionMap gets dropped from OCC validation → FSM never sees it in ReadKeys → concurrent write to the same key commits without conflict → SSI broken.

Fix: groupReadKeysByShardID returns (map, error). Any unroutable key surfaces as ErrInvalidRequest; dispatchTxn propagates and the transaction aborts before prewrite. Calls c.router.ResolveGroup directly to preserve the (gid, ok) signal through the boundary.

Tests:

  • TestGroupReadKeysByShardID_FailsClosedOnUnroutable replaces the old TestGroupReadKeysByShardID_SkipsUnroutableKeys (which was pinning the buggy skip-silently behaviour).
  • TestShardedCoordinator_TxnFailsClosedForUnresolvedReadKey (new): coordinator-level regression — a txn with a recognised-but-unresolved partitioned read key MUST abort before any prewrite. Asserts neither group received an RPC.

go test -race ./kv/... ./adapter/... and golangci-lint clean. All previous-round fixes from rounds 1–7 stand from earlier pushes.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 55s —— View job


Code Review: PR #715 Round 8 — feat(sqs): partition resolver for HT-FIFO routing

Tasks

  • Read changed files
  • Verify groupReadKeysByShardID signature change + error propagation
  • Verify dispatchTxn propagation path
  • Verify engineGroupIDForKey returns correct error signal
  • Verify new tests are genuine regressions
  • Check all previous round fixes still intact
  • Post final review

Round 8 fix is correctly implemented. The two new tests are genuine regressions. All previous-round fixes remain intact. This PR is ready to merge.


Fix verified — groupReadKeysByShardID returns (map, error)

kv/sharded_coordinator.go:916–931

The signature change from (map) to (map, error) is correct. The key improvement is calling c.router.ResolveGroup(key) directly — not through engineGroupIDForKey — so the (gid, ok) signal is preserved at the boundary:

gid, ok := c.router.ResolveGroup(key)
if !ok || gid == 0 {
    return nil, errors.Wrapf(ErrInvalidRequest, "no route for txn read key %q ...", key)
}

Previously engineGroupIDForKey collapsed the signal to a bare uint64 (0 on failure), losing the ok flag, then groupReadKeysByShardID silently continued. Now the full (gid, ok) is preserved and any failure surfaces immediately. The gid == 0 guard is belt-and-suspenders for the case where ok=true but gid=0 — harmless and correct.


Fix verified — dispatchTxn propagates before prewrite ✓

kv/sharded_coordinator.go:401–403

groupedReadKeys, err := c.groupReadKeysByShardID(readKeys)
if err != nil {
    return nil, err
}
prepared, err := c.prewriteTxn(ctx, ...)

The error check is exactly before prewriteTxn. A routing failure aborts the transaction before any Raft entry is proposed. The assertion in TestShardedCoordinator_TxnFailsClosedForUnresolvedReadKey that neither group received a request is correctly pinned.


TestGroupReadKeysByShardID_FailsClosedOnUnroutable is a genuine regression ✓

Pre-fix path:

  • gid := c.engineGroupIDForKey("zzz") → 0 (engine has no route for keys outside [a, m))
  • if gid == 0 { continue } → silently skips "zzz"
  • Returns {1: [b]}, no error → old test passed pinning wrong behavior

Post-fix path:

  • gid, ok := c.router.ResolveGroup("zzz")(0, false) (engine miss)
  • !ok || gid == 0 → returns error
  • New test asserts require.Error(t, err) and require.ErrorIs(t, err, ErrInvalidRequest)passes only with the fix

The rename + rewrite correctly replaces a test that was locking in the bug.


TestShardedCoordinator_TxnFailsClosedForUnresolvedReadKey is a genuine regression ✓

Setup: writeKey → g42 (resolver claims), readKey → recognised + unresolved (recognisedPrefix fires, but not in claim map).

Pre-fix trace:

  1. groupMutations(elems) → gids=[42] ✓
  2. allReadKeysInShard([readKey], 42)engineGroupIDForKey(readKey)ResolveGroup → fail-closed → 0 → 0 ≠ 42 → false → multi-shard path
  3. groupReadKeysByShardID([readKey]):
    • gid := c.engineGroupIDForKey(readKey) → 0
    • if gid == 0 { continue } → readKey silently dropped
    • Returns {}, no error
  4. prewriteTxn with empty ReadKeys → g42 commits with an incomplete OCC read-set
  5. Test FAILS (expects error, gets nil) → SSI violation undetected

Post-fix trace:
1–2. Same
3. groupReadKeysByShardID([readKey]):

  • gid, ok := c.router.ResolveGroup(readKey) → fail-closed → (0, false)
  • !ok || gid == 0 → returns error
  1. dispatchTxn returns error → transaction aborts before any prewrite
  2. Test PASSES ✓ + neither group received an RPC (asserted by the test)

Note: allReadKeysInShard relies on the gid≠0 invariant (no action needed)

kv/sharded_coordinator.go:434–444

allReadKeysInShard calls engineGroupIDForKey(rk) (which returns 0 on failure) and compares it to gid. An unresolvable key returns 0, and 0 ≠ valid_gid is always true (since gids are always non-zero: groupMutations errors before producing gid=0, and txnGroupForID validates the group exists). So allReadKeysInShard correctly returns false for any unresolvable read key, forcing the multi-shard path → groupReadKeysByShardID → error.

This is correct-by-invariant. The implicit dependency is that no valid Raft group has ID 0, which is enforced by the coordinator/engine layer. A comment noting this invariant would future-proof the method, but it's not a bug.


All previous rounds' fixes intact ✓

  • Typed-nilbuildSQSPartitionResolver returns kv.PartitionResolver; requireNilInterface uses require.True(t, r == nil, msg)
  • Raw-key resolverResolveGroup runs before routeKey normalization ✓
  • Coordinator helpersgroupForKey, routeAndGroupForKey, engineGroupIDForKey, groupMutations all call c.router.ResolveGroup(key)
  • [][]byte prefix precomputation — no per-call allocation in hot path ✓
  • Fail-closedRecognisesPartitionedKey prefix-only, ShardRouter.ResolveGroup fails closed for recognised-but-unresolved keys ✓
  • Two-mutation coordinator test — genuine regression for groupMutations bypass ✓
  • Startup-only docWithPartitionResolver comment correctly documents non-atomic assignment ✓

Verdict: All eight rounds of findings are resolved. The PR is clean and ready to merge.
| Branch

@bootjp bootjp merged commit bce448f into main Apr 30, 2026
9 checks passed
@bootjp bootjp deleted the feat/sqs-htfifo-routing-resolver branch April 30, 2026 10:56
bootjp added a commit that referenced this pull request Apr 30, 2026
## Summary

Phase 3.D PR 4-B-3a — adds the stateless `htfifo` capability poller that
PR 5's CreateQueue gate consumes. Stacks on the now-merged #715 (PR
4-B-2, partition resolver). This PR is purely additive: new helper file,
new test file, no existing code touched. Next is PR 4-B-3b
(leadership-refusal hook + flag flip).

## What's added

- `adapter/sqs_capability_poller.go`:
- `HTFIFOCapabilityReport{AllAdvertise, Peers}` — binary go/no-go signal
+ per-peer detail for operator triage.
- `HTFIFOCapabilityPeerStatus{Address, HasHTFIFO, Capabilities, Error}`
— one peer's polling result.
- `PollSQSHTFIFOCapability(ctx, client, peers)` — concurrent
goroutine-per-peer poll, indexed-channel result aggregation (race-free).
- Per-peer timeout `defaultSQSCapabilityPollTimeout = 3s` so a single
hung peer can't stall the cluster-wide poll.
- Body capped at 1 KiB via `io.LimitReader` so a misconfigured peer
can't drain memory.
- Bare `host:port` and full `http://…` / `https://…` URLs both accepted.
- Fail-closed on every failure mode: timeout, transport error, non-200,
malformed JSON, missing capability. Empty peer list → vacuously
`AllAdvertise=true` (caller validates list completeness).

## What's NOT added (deferred)

- `htfifoCapabilityAdvertised` stays `false`. PR 4-B-3b adds the §8
leadership-refusal hook + per-acquisition observer in
`kv/raftengine/etcd` and flips the flag.
- `CreateQueue` does NOT yet call this helper. PR 5 lifts the
`PartitionCount > 1` dormancy gate AND wires the capability check in the
same commit (per the §11 rollout plan's "gate-and-lift atomically"
rule).

## Test plan

9 top-level tests covering the contract surface:
- [x] `TestPollSQSHTFIFOCapability_AllAdvertise` — happy path, multiple
peers.
- [x] `TestPollSQSHTFIFOCapability_OneMissingFailsClosed` — old-binary
peer with empty capabilities drops `AllAdvertise`.
- [x] `TestPollSQSHTFIFOCapability_HTTPErrorFailsClosed` — HTTP 500,
connection refused, malformed JSON all surface as `Error`.
- [x] `TestPollSQSHTFIFOCapability_TimeoutFailsClosed` — hung peer
respects per-peer timeout, full poll bounded.
- [x] `TestPollSQSHTFIFOCapability_EmptyPeersIsVacuouslyTrue` — empty
peer list contract.
- [x] `TestPollSQSHTFIFOCapability_EmptyPeerAddressFailsClosed` — `""`
entry in peers slice surfaces explicit Error.
- [x] `TestPollSQSHTFIFOCapability_FullURLPeer` — `http://` and
`https://` URLs accepted alongside bare `host:port`.
- [x] `TestPollSQSHTFIFOCapability_ConcurrentPolling` — 5×200ms peers
finish in well under 1s.
- [x] `TestPollSQSHTFIFOCapability_RespectsBodyLimit` — 10 KiB response
truncated mid-string surfaces as JSON parse error, not garbage decode.
- [x] `TestBuildSQSHealthURL` — URL construction edge cases.
- [x] `go test -race ./adapter/...` pass.
- [x] `golangci-lint ./adapter/...` clean.

## Self-review (per CLAUDE.md)

1. **Data loss** — read-only HTTP poll; no FSM/Pebble/retention path. No
issue.
2. **Concurrency / distributed failures** — peer polls run in
independent goroutines; results land via an indexed channel so slice
writes are obviously race-free. Per-peer timeout enforced via
`context.WithTimeout` so a slow peer can't stall the rest. Body capped
via `io.LimitReader`. No issue.
3. **Performance** — N peers polled concurrently, not serially; the test
pins this. Per-peer cost is one HTTP round-trip + a JSON parse of a tiny
body. No hot-path impact (CreateQueue is a control-plane operation, not
request hot path). No issue.
4. **Data consistency** — fail-closed on every failure mode preserves
the §8.5 "any peer that doesn't respond is treated as not-yet-upgraded"
rule. The vacuously-true empty-peer-list case is documented and the
caller's responsibility. No issue.
5. **Test coverage** — every documented failure path (HTTP error,
transport error, JSON parse, timeout, missing capability, empty peer,
body-size cap) is pinned. Concurrent polling is pinned (would have
caught a regression to serial). URL construction edges pinned.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant