Skip to content

distribution: add ListRoutes and SplitRange admin APIs backed by catalog#318

Merged
bootjp merged 18 commits intomainfrom
feature/hotspot-re-allocation-3
Feb 21, 2026
Merged

distribution: add ListRoutes and SplitRange admin APIs backed by catalog#318
bootjp merged 18 commits intomainfrom
feature/hotspot-re-allocation-3

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Feb 20, 2026

No description provided.

Copilot AI review requested due to automatic review settings February 20, 2026 14:41
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @bootjp, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the distribution management capabilities by introducing new administrative gRPC endpoints. These endpoints allow for the inspection of all configured routes and the dynamic splitting of existing routes into smaller segments. The changes involve integrating a durable catalog store to maintain route state and ensure consistency across the system, thereby providing more granular control over data distribution.

Highlights

  • New Admin APIs: Introduced ListRoutes and SplitRange gRPC administrative APIs to manage the distribution catalog.
  • Catalog Integration: The DistributionServer now integrates with a distribution.CatalogStore to persist and retrieve route configurations.
  • Route Splitting Logic: Implemented logic for SplitRange to divide an existing route into two child routes, updating the catalog and the distribution engine.
  • Protobuf Definitions: Added new protobuf messages and an enum (RouteState, RouteDescriptor, ListRoutesRequest/Response, SplitRangeRequest/Response) to support the new APIs.
  • Error Handling and Validation: Incorporated robust error handling and validation for split operations, including checks for catalog version conflicts, unknown routes, and invalid split keys.
Changelog
  • adapter/distribution_server.go
    • Added catalog field to DistributionServer and updated its constructor to accept a distribution.CatalogStore.
    • Defined constants and error variables for distribution operations.
    • Implemented ListRoutes gRPC method to retrieve all routes from the catalog.
    • Implemented SplitRange gRPC method to split a route, persist changes to the catalog, and apply them to the engine.
    • Added helper functions for catalog snapshot management, engine snapshot application, and validation of split keys and catalog versions.
  • adapter/distribution_server_test.go
    • Updated NewDistributionServer calls in existing tests to include a nil catalog for backward compatibility.
    • Added comprehensive unit tests for ListRoutes covering successful retrieval and error conditions when the catalog is not configured.
    • Added extensive unit tests for SplitRange covering successful splits, unknown routes, invalid split keys, split keys at boundaries, and catalog version conflicts.
    • Introduced a seededDistributionServer helper function to simplify test setup.
  • cmd/server/demo.go
    • Initialized a distribution.CatalogStore using the existing store (st) and passed it to adapter.NewDistributionServer.
  • main.go
    • Initialized a distribution.CatalogStore by calling distributionCatalogStoreForGroup and passed it to adapter.NewDistributionServer.
    • Added distributionCatalogStoreForGroup helper function to locate the appropriate catalog store for a given raft group ID.
  • proto/distribution.pb.go
    • Updated protoc version to v5.29.3.
    • Generated new Go types for RouteState enum, RouteDescriptor message, ListRoutesRequest, ListRoutesResponse, SplitRangeRequest, and SplitRangeResponse.
  • proto/distribution.proto
    • Added ListRoutes and SplitRange RPC methods to the Distribution service definition.
    • Defined RouteState enum to represent various states of a route.
    • Defined RouteDescriptor message to encapsulate route details including ID, start/end keys, group ID, state, and parent ID.
    • Defined ListRoutesRequest, ListRoutesResponse, SplitRangeRequest, and SplitRangeResponse messages for the new APIs.
  • proto/distribution_grpc.pb.go
    • Updated protoc version to v5.29.3.
    • Added full method names for ListRoutes and SplitRange.
    • Extended the DistributionClient and DistributionServer interfaces with the new ListRoutes and SplitRange methods.
    • Implemented client and server stubs for the newly defined RPC methods.
Activity
  • New gRPC APIs, ListRoutes and SplitRange, have been introduced to the Distribution service.
  • The DistributionServer now requires a CatalogStore for its operations, enabling durable route management.
  • Protobuf definitions for route states, descriptors, and API requests/responses have been added and regenerated.
  • Extensive unit tests have been implemented to validate the functionality and error handling of the new distribution APIs.
  • The main application entry points have been updated to correctly provision the DistributionServer with the necessary CatalogStore.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new administrative APIs, ListRoutes and SplitRange, for managing routes backed by a durable catalog. The implementation is well-structured and includes comprehensive tests for various scenarios. However, I've identified a critical race condition in the SplitRange method that could lead to an inconsistent state between the durable catalog and the in-memory engine, which violates our guidelines on atomic operations. My review provides a detailed explanation of this issue and a recommended solution to ensure atomicity. Apart from this critical issue, the changes look solid.

Comment thread adapter/distribution_server.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds new distribution admin APIs to inspect and mutate the durable route catalog via gRPC, and wires the server to a distribution.CatalogStore so the APIs operate against persisted state.

Changes:

  • Extend Distribution gRPC service with ListRoutes and SplitRange, plus proto definitions for RouteDescriptor and RouteState.
  • Implement ListRoutes (reads catalog snapshot) and SplitRange (splits a catalog route, saves with OCC, applies snapshot to the in-memory engine) in adapter.DistributionServer.
  • Update server wiring (main + demo) and add unit tests for the new APIs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
proto/distribution.proto Adds RPCs and message/enum types to support listing and splitting routes.
proto/distribution.pb.go Regenerated protobuf Go types for the new messages/enum.
proto/distribution_grpc.pb.go Regenerated gRPC stubs to include ListRoutes and SplitRange.
adapter/distribution_server.go Implements the new catalog-backed admin APIs and supporting helpers/error mapping.
adapter/distribution_server_test.go Adds coverage for ListRoutes/SplitRange and related error cases.
main.go Wires a CatalogStore into the distribution server in the multi-raft runtime.
cmd/server/demo.go Wires a CatalogStore into the distribution server in the demo server.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread adapter/distribution_server_test.go Outdated
@bootjp bootjp requested a review from Copilot February 20, 2026 14:53
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces new administrative APIs for listing routes and splitting ranges, backed by a catalog store. This is a significant step towards dynamic distribution management. The changes involve updating the DistributionServer to include a CatalogStore, adding new gRPC endpoints, and implementing the logic for ListRoutes and SplitRange. The associated protobuf definitions and generated Go code have also been updated. Test cases for the new functionalities are included, covering success scenarios and various error conditions. Overall, the changes are well-structured and address the stated objective of adding these administrative APIs.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread adapter/distribution_server.go
Comment thread main.go
Comment thread cmd/server/demo.go Outdated
@bootjp bootjp requested a review from Copilot February 20, 2026 15:45
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality by adding ListRoutes and SplitRange admin APIs, backed by a durable and versioned route catalog. This is a great step towards more robust range management. The use of optimistic concurrency control for splits is well-implemented. My review has identified one critical issue concerning the atomicity of catalog updates when a coordinator is used, which could lead to data inconsistency. This aligns with the principle that in an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view. I've also noted a couple of medium-severity issues related to a context-handling anti-pattern and an opportunity to improve efficiency and consistency in how catalog updates are generated. Overall, the changes are well-structured, but addressing the atomicity issue is crucial before merging.

Comment thread adapter/distribution_server.go
Comment thread adapter/distribution_server.go Outdated
Comment thread distribution/bootstrap.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bootjp bootjp requested a review from Copilot February 20, 2026 16:02
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces new administrative APIs for ListRoutes and SplitRange backed by a catalog, enhancing the distribution server's capabilities. It also includes necessary protobuf updates, new error definitions, and a bootstrapping mechanism for the catalog. The changes are well-structured and include appropriate tests for the new functionalities. I've identified a few areas for improvement regarding error handling consistency and potential for nil pointer dereferences in specific scenarios.

Comment thread adapter/distribution_server.go
Comment thread adapter/distribution_server.go
Comment thread adapter/distribution_server.go
Comment thread adapter/distribution_server.go Outdated
Comment thread adapter/distribution_server.go Outdated
Comment thread adapter/distribution_server.go Outdated
Comment thread adapter/distribution_server.go Outdated
Comment thread adapter/distribution_server.go Outdated
Comment thread cmd/server/demo.go
Comment thread main.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread proto/Makefile
@bootjp bootjp requested a review from Copilot February 21, 2026 07:44
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 21, 2026

/gemini review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the ListRoutes and SplitRange gRPC APIs to the distribution service, backed by a durable route catalog. The implementation correctly handles atomic catalog mutations using the Raft-backed coordinator and ensures consistency between the durable catalog and the in-memory distribution engine. The addition of next_route_id tracking in the catalog provides robust route ID allocation. The MVCC rule regarding shared start timestamps within transactions is followed. One potential race condition was identified in the SplitRange reload logic that could lead to transient errors, which aligns with best practices for preventing stale data in atomic operations.

Comment thread adapter/distribution_server.go Outdated
@bootjp bootjp requested a review from Copilot February 21, 2026 07:54
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 21, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality by adding ListRoutes and SplitRange administrative APIs, backed by a durable route catalog. The changes are extensive, touching everything from the protobuf definitions to the core distribution logic, server adapters, and startup configuration. The implementation of the SplitRange operation, including its transactional nature, leader verification, and read-your-writes consistency through polling, is well-handled. The introduction of a durable catalog with bootstrap logic is also a major improvement for the system's robustness. I have a couple of suggestions to enhance configurability and error handling.

Comment thread adapter/distribution_server.go Outdated
Comment on lines +41 to +42
catalogReloadRetryAttempts = 20
catalogReloadRetryInterval = 10 * time.Millisecond
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The retry policy for waiting for a catalog snapshot to become visible is hardcoded. While the current values (20 attempts, 10ms interval) provide a 200ms total wait time, this may not be sufficient in environments with higher replication latency. Making this policy configurable would improve the server's robustness and adaptability to different deployment scenarios.

const (
	childRouteCount            = 2
	splitMutationOpCount       = childRouteCount + 3
)

var (
	defaultCatalogReloadRetryAttempts = 20
	defaultCatalogReloadRetryInterval = 10 * time.Millisecond
)

Comment thread distribution/catalog.go Outdated
Comment on lines +159 to +160
if err != nil {
return 0, errors.WithStack(ErrCatalogInvalidNextRouteID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In case of a decoding error from DecodeCatalogVersion, the original error is discarded and replaced with ErrCatalogInvalidNextRouteID. This loses valuable context about the failure, such as whether it was due to an invalid record size or an unsupported codec version. To improve debuggability, it would be better to wrap the original error.

Suggested change
if err != nil {
return 0, errors.WithStack(ErrCatalogInvalidNextRouteID)
if err != nil {
return 0, errors.Wrap(err, "invalid next route id record")
}

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bootjp bootjp requested a review from Copilot February 21, 2026 08:01
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Feb 21, 2026

/gemini review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces ListRoutes and SplitRange administrative APIs, backed by a new durable route catalog. The changes are extensive, including updates to the distribution server, catalog persistence, and bootstrapping logic. The implementation is robust and accompanied by a comprehensive test suite. I have a suggestion for improvement regarding concurrency in the SplitRange implementation.

Comment thread adapter/distribution_server.go
@bootjp bootjp requested a review from Copilot February 21, 2026 08:06
@bootjp bootjp enabled auto-merge February 21, 2026 08:07
@bootjp bootjp merged commit 1ad2994 into main Feb 21, 2026
7 checks passed
@bootjp bootjp deleted the feature/hotspot-re-allocation-3 branch February 21, 2026 08:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread main.go
Comment on lines +76 to +86
distCatalog := distributionCatalogStoreForGroup(runtimes, cfg.defaultGroup)
if distCatalog != nil {
if _, err := distribution.EnsureCatalogSnapshot(ctx, distCatalog, cfg.engine); err != nil {
return errors.Wrapf(err, "initialize distribution catalog")
}
}
distServer := adapter.NewDistributionServer(
cfg.engine,
distCatalog,
adapter.WithDistributionCoordinator(coordinate),
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distributionCatalogStoreForGroup always selects the catalog store from cfg.defaultGroup, but the sharded coordinator routes mutations based on the key’s shard route. If the shard ranges map !dist|... keys to a different group than defaultGroup (or don’t cover that prefix at all), SplitRange will either fail leader checks or attempt to commit catalog mutations to the wrong raft group. Consider deriving the catalog group from cfg.engine.GetRoute(distribution.CatalogVersionKey()) (and failing fast if no route), then opening the CatalogStore for that group so storage and coordinator routing stay consistent.

Copilot uses AI. Check for mistakes.
bootjp added a commit that referenced this pull request Apr 30, 2026
Three correctness bugs and several quality issues, all valid.

CRITICAL: path traversal (Codex P1 #425).
S3 object keys are user-controlled. A key like "../../../etc/passwd"
would, under the prior filepath.Join-and-write code, escape the
bucket directory and overwrite host files. Added safeJoinUnderRoot
which Cleans the joined path and asserts it stays rooted under the
bucket dir; ".." traversal is rejected with ErrS3MalformedKey.
Absolute paths (leading "/") are normalised under the bucket dir by
filepath.Join, which is the safest outcome. Tests:
TestS3_PathTraversalAttemptRejected,
TestS3_AbsolutePathObjectKeyConfinedUnderBucket.

CRITICAL: stale upload-id chunks merged into body (Codex P1 #500,
Gemini HIGH #106/#476/#504).
A snapshot mid-delete-and-recreate or mid-retry can carry blob
chunks for multiple upload attempts under the same (bucket, gen,
object). The prior assembleObjectBody concatenated every chunk
regardless of upload_id, producing corrupted bytes. Now:
- s3ObjectState gains uploadID; HandleObjectManifest sets it.
- New filterChunksForManifest takes the chunkPaths map and the
  manifest's uploadID, returns only matching chunks sorted by
  (partNo, partVersion, chunkNo). Stale-uploadID chunks never
  enter the assembled body.
Test: TestS3_StaleUploadIDChunksFilteredFromAssembledBody.

CRITICAL: incomplete-uploads file truncated per record (Codex P2
#318, Gemini HIGH+MEDIUM #318).
HandleIncompleteUpload re-opened records.jsonl on every call;
openJSONL uses os.Create which truncates. Only the last record
survived per bucket. Now:
- s3BucketState carries an incompleteUploadsJL *jsonlFile lazily
  opened on the first record and cached.
- flushBucket closes it and surfaces the error (was silently
  ignored).
Test: TestS3_IncompleteUploadsAppendsAcrossCalls (3 records, 3
lines on disk).

QUALITY:
- Gemini MEDIUM #285 (MkdirAll per blob): s3ObjectState gains a
  scratchDirCreated bool; HandleBlob runs MkdirAll once.
- Gemini MEDIUM #318 (closeJSONL error ignored): keymap and
  incomplete-uploads writers now surface close errors via
  closeBucketKeymap / explicit closeJSONL return-check.
- Gemini MEDIUM #386 (includeOrphans flag ignored): orphan chunks
  for objects without manifests now write to
  <bucket>/_orphans/<encoded-object>/<chunk>.bin under
  WithIncludeOrphans(true). Test:
  TestS3_OrphanChunksWrittenWhenIncludeOrphans.

Also removed the now-unused attachManifestParts placeholder and
sortChunkKeys helper; their logic moved into
filterChunksForManifest which combines uploadID filtering with the
sort.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants