distribution: add ListRoutes and SplitRange admin APIs backed by catalog#318
distribution: add ListRoutes and SplitRange admin APIs backed by catalog#318
Conversation
Summary of ChangesHello @bootjp, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the distribution management capabilities by introducing new administrative gRPC endpoints. These endpoints allow for the inspection of all configured routes and the dynamic splitting of existing routes into smaller segments. The changes involve integrating a durable catalog store to maintain route state and ensure consistency across the system, thereby providing more granular control over data distribution. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces new administrative APIs, ListRoutes and SplitRange, for managing routes backed by a durable catalog. The implementation is well-structured and includes comprehensive tests for various scenarios. However, I've identified a critical race condition in the SplitRange method that could lead to an inconsistent state between the durable catalog and the in-memory engine, which violates our guidelines on atomic operations. My review provides a detailed explanation of this issue and a recommended solution to ensure atomicity. Apart from this critical issue, the changes look solid.
There was a problem hiding this comment.
Pull request overview
Adds new distribution admin APIs to inspect and mutate the durable route catalog via gRPC, and wires the server to a distribution.CatalogStore so the APIs operate against persisted state.
Changes:
- Extend
DistributiongRPC service withListRoutesandSplitRange, plus proto definitions forRouteDescriptorandRouteState. - Implement
ListRoutes(reads catalog snapshot) andSplitRange(splits a catalog route, saves with OCC, applies snapshot to the in-memory engine) inadapter.DistributionServer. - Update server wiring (main + demo) and add unit tests for the new APIs.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| proto/distribution.proto | Adds RPCs and message/enum types to support listing and splitting routes. |
| proto/distribution.pb.go | Regenerated protobuf Go types for the new messages/enum. |
| proto/distribution_grpc.pb.go | Regenerated gRPC stubs to include ListRoutes and SplitRange. |
| adapter/distribution_server.go | Implements the new catalog-backed admin APIs and supporting helpers/error mapping. |
| adapter/distribution_server_test.go | Adds coverage for ListRoutes/SplitRange and related error cases. |
| main.go | Wires a CatalogStore into the distribution server in the multi-raft runtime. |
| cmd/server/demo.go | Wires a CatalogStore into the distribution server in the demo server. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request introduces new administrative APIs for listing routes and splitting ranges, backed by a catalog store. This is a significant step towards dynamic distribution management. The changes involve updating the DistributionServer to include a CatalogStore, adding new gRPC endpoints, and implementing the logic for ListRoutes and SplitRange. The associated protobuf definitions and generated Go code have also been updated. Test cases for the new functionalities are included, covering success scenarios and various error conditions. Overall, the changes are well-structured and address the stated objective of adding these administrative APIs.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant new functionality by adding ListRoutes and SplitRange admin APIs, backed by a durable and versioned route catalog. This is a great step towards more robust range management. The use of optimistic concurrency control for splits is well-implemented. My review has identified one critical issue concerning the atomicity of catalog updates when a coordinator is used, which could lead to data inconsistency. This aligns with the principle that in an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view. I've also noted a couple of medium-severity issues related to a context-handling anti-pattern and an opportunity to improve efficiency and consistency in how catalog updates are generated. Overall, the changes are well-structured, but addressing the atomicity issue is crucial before merging.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request introduces new administrative APIs for ListRoutes and SplitRange backed by a catalog, enhancing the distribution server's capabilities. It also includes necessary protobuf updates, new error definitions, and a bootstrapping mechanism for the catalog. The changes are well-structured and include appropriate tests for the new functionalities. I've identified a few areas for improvement regarding error handling consistency and potential for nil pointer dereferences in specific scenarios.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request introduces the ListRoutes and SplitRange gRPC APIs to the distribution service, backed by a durable route catalog. The implementation correctly handles atomic catalog mutations using the Raft-backed coordinator and ensures consistency between the durable catalog and the in-memory distribution engine. The addition of next_route_id tracking in the catalog provides robust route ID allocation. The MVCC rule regarding shared start timestamps within transactions is followed. One potential race condition was identified in the SplitRange reload logic that could lead to transient errors, which aligns with best practices for preventing stale data in atomic operations.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant new functionality by adding ListRoutes and SplitRange administrative APIs, backed by a durable route catalog. The changes are extensive, touching everything from the protobuf definitions to the core distribution logic, server adapters, and startup configuration. The implementation of the SplitRange operation, including its transactional nature, leader verification, and read-your-writes consistency through polling, is well-handled. The introduction of a durable catalog with bootstrap logic is also a major improvement for the system's robustness. I have a couple of suggestions to enhance configurability and error handling.
| catalogReloadRetryAttempts = 20 | ||
| catalogReloadRetryInterval = 10 * time.Millisecond |
There was a problem hiding this comment.
The retry policy for waiting for a catalog snapshot to become visible is hardcoded. While the current values (20 attempts, 10ms interval) provide a 200ms total wait time, this may not be sufficient in environments with higher replication latency. Making this policy configurable would improve the server's robustness and adaptability to different deployment scenarios.
const (
childRouteCount = 2
splitMutationOpCount = childRouteCount + 3
)
var (
defaultCatalogReloadRetryAttempts = 20
defaultCatalogReloadRetryInterval = 10 * time.Millisecond
)| if err != nil { | ||
| return 0, errors.WithStack(ErrCatalogInvalidNextRouteID) |
There was a problem hiding this comment.
In case of a decoding error from DecodeCatalogVersion, the original error is discarded and replaced with ErrCatalogInvalidNextRouteID. This loses valuable context about the failure, such as whether it was due to an invalid record size or an unsupported codec version. To improve debuggability, it would be better to wrap the original error.
| if err != nil { | |
| return 0, errors.WithStack(ErrCatalogInvalidNextRouteID) | |
| if err != nil { | |
| return 0, errors.Wrap(err, "invalid next route id record") | |
| } |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request introduces ListRoutes and SplitRange administrative APIs, backed by a new durable route catalog. The changes are extensive, including updates to the distribution server, catalog persistence, and bootstrapping logic. The implementation is robust and accompanied by a comprehensive test suite. I have a suggestion for improvement regarding concurrency in the SplitRange implementation.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| distCatalog := distributionCatalogStoreForGroup(runtimes, cfg.defaultGroup) | ||
| if distCatalog != nil { | ||
| if _, err := distribution.EnsureCatalogSnapshot(ctx, distCatalog, cfg.engine); err != nil { | ||
| return errors.Wrapf(err, "initialize distribution catalog") | ||
| } | ||
| } | ||
| distServer := adapter.NewDistributionServer( | ||
| cfg.engine, | ||
| distCatalog, | ||
| adapter.WithDistributionCoordinator(coordinate), | ||
| ) |
There was a problem hiding this comment.
distributionCatalogStoreForGroup always selects the catalog store from cfg.defaultGroup, but the sharded coordinator routes mutations based on the key’s shard route. If the shard ranges map !dist|... keys to a different group than defaultGroup (or don’t cover that prefix at all), SplitRange will either fail leader checks or attempt to commit catalog mutations to the wrong raft group. Consider deriving the catalog group from cfg.engine.GetRoute(distribution.CatalogVersionKey()) (and failing fast if no route), then opening the CatalogStore for that group so storage and coordinator routing stay consistent.
Three correctness bugs and several quality issues, all valid. CRITICAL: path traversal (Codex P1 #425). S3 object keys are user-controlled. A key like "../../../etc/passwd" would, under the prior filepath.Join-and-write code, escape the bucket directory and overwrite host files. Added safeJoinUnderRoot which Cleans the joined path and asserts it stays rooted under the bucket dir; ".." traversal is rejected with ErrS3MalformedKey. Absolute paths (leading "/") are normalised under the bucket dir by filepath.Join, which is the safest outcome. Tests: TestS3_PathTraversalAttemptRejected, TestS3_AbsolutePathObjectKeyConfinedUnderBucket. CRITICAL: stale upload-id chunks merged into body (Codex P1 #500, Gemini HIGH #106/#476/#504). A snapshot mid-delete-and-recreate or mid-retry can carry blob chunks for multiple upload attempts under the same (bucket, gen, object). The prior assembleObjectBody concatenated every chunk regardless of upload_id, producing corrupted bytes. Now: - s3ObjectState gains uploadID; HandleObjectManifest sets it. - New filterChunksForManifest takes the chunkPaths map and the manifest's uploadID, returns only matching chunks sorted by (partNo, partVersion, chunkNo). Stale-uploadID chunks never enter the assembled body. Test: TestS3_StaleUploadIDChunksFilteredFromAssembledBody. CRITICAL: incomplete-uploads file truncated per record (Codex P2 #318, Gemini HIGH+MEDIUM #318). HandleIncompleteUpload re-opened records.jsonl on every call; openJSONL uses os.Create which truncates. Only the last record survived per bucket. Now: - s3BucketState carries an incompleteUploadsJL *jsonlFile lazily opened on the first record and cached. - flushBucket closes it and surfaces the error (was silently ignored). Test: TestS3_IncompleteUploadsAppendsAcrossCalls (3 records, 3 lines on disk). QUALITY: - Gemini MEDIUM #285 (MkdirAll per blob): s3ObjectState gains a scratchDirCreated bool; HandleBlob runs MkdirAll once. - Gemini MEDIUM #318 (closeJSONL error ignored): keymap and incomplete-uploads writers now surface close errors via closeBucketKeymap / explicit closeJSONL return-check. - Gemini MEDIUM #386 (includeOrphans flag ignored): orphan chunks for objects without manifests now write to <bucket>/_orphans/<encoded-object>/<chunk>.bin under WithIncludeOrphans(true). Test: TestS3_OrphanChunksWrittenWhenIncludeOrphans. Also removed the now-unused attachManifestParts placeholder and sortChunkKeys helper; their logic moved into filterChunksForManifest which combines uploadID filtering with the sort.
No description provided.