Add api for server query route enable/disable#18190
Open
hongkunxu wants to merge 3 commits intoapache:masterfrom
Open
Add api for server query route enable/disable#18190hongkunxu wants to merge 3 commits intoapache:masterfrom
hongkunxu wants to merge 3 commits intoapache:masterfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18190 +/- ##
=========================================
Coverage 63.44% 63.44%
Complexity 1627 1627
=========================================
Files 3244 3244
Lines 197250 197263 +13
Branches 30514 30517 +3
=========================================
+ Hits 125136 125153 +17
- Misses 62082 62091 +9
+ Partials 10032 10019 -13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
xiangfu0
requested changes
Apr 15, 2026
Contributor
xiangfu0
left a comment
There was a problem hiding this comment.
Finding 1
- Severity: CRITICAL
- Rule: Distributed state mutations must be field-scoped or version-safe
- Where:
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.javainsetQueriesRoute(); interacts with server lifecycle writes inpinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java - Issue:
setQueriesRoute()fetches the wholeInstanceConfig, mutates one simple field, and writes the entire record back with_helixDataAccessor.setProperty(...). This API is explicitly meant to run during rolling upgrades, which is exactly when the server is also togglingshutdownInProgressthrough Helix config writes. A stale full-record write here can clobber that lifecycle bit (or other concurrent instance-config updates) depending on write ordering. - Risk: Broker routing uses both
shutdownInProgressandqueriesDisabledto decide whether a server is routable. If these two flags can overwrite each other during the upgrade path, we can route to a server that is shutting down, or keep a healthy server out of routing after it comes back. That is a real query-path safety issue for the exact workflow this PR is trying to support. - Suggested fix: Update only
queriesDisabledwith a field-scoped Helix write (HelixAdmin.setConfig(...)/updateProperty(...)with aDataUpdater) instead of replacing the wholeInstanceConfig, and add a regression test that interleaves this API with server startup/shutdown flag transitions.
Finding 2
- Severity: MAJOR
- Rule: Unsupported operational states must fail fast, not succeed as a no-op
- Where:
pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotInstanceRestletResource.javaintoggleQueriesRoute()andpinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.javainsetQueriesRoute() - Issue: The API is documented as a server-routing control, but it accepts any Helix instance id and returns success after writing
queriesDisabledinto the instance config. Broker routing only consultsqueriesDisabledfor server instances, so calling this on a broker/controller/minion is a silent no-op from the query-routing perspective. - Risk: This is meant for operational upgrade automation. A successful response on the wrong instance type can make automation think traffic was drained when nothing in broker routing actually changed, and it also pollutes shared instance-config state with a server-only flag.
- Suggested fix: Reject non-server instances up front (for example with
InstanceTypeUtils.isServer(instanceName)), and add a negative test that verifies broker/minion/controller ids are rejected.
Finding 3
- Severity: MAJOR
- Rule: Query-routing changes need coverage on the actual user-facing pipeline
- Where:
pinot-controller/src/test/java/org/apache/pinot/controller/api/PinotInstanceRestletResourceTest.javaintestToggleQueriesRoute() - Issue: The new test only verifies that the controller GET response echoes the
queriesDisabledbit and that bad input is rejected. It never verifies that a broker actually removes and re-adds the server from routing, which is the behavior promised in the PR description. - Risk: A regression in the broker instance-config listener, or a state-clobbering bug like the one above, would still pass this test. That leaves the zero-downtime-upgrade contract untested.
- Suggested fix: Add broker-routing-manager or integration coverage that flips the flag and asserts the server leaves and re-enters routing, ideally alongside the lifecycle interleaving case above.
Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
ec6720e to
a633cf4
Compare
Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR Description
This change introduces an HTTP API to enable or disable query routing for a specific server.
The primary use case is to support safe and seamless rolling upgrades of a Pinot cluster. In some of our data centers, Pinot clusters do not have backup clusters, so it is critical to ensure zero impact during upgrades.
Although the existing shutdownInProgress flag provides some level of control, it does not handle certain edge cases well. For example, when a server (or pod) is abruptly terminated, Pinot cannot immediately remove it from the query routing, which may lead to transient query failures.
With this new API, we can explicitly disable query routing to a server before upgrading the pod, and re-enable it after the upgrade is complete. This approach avoids routing queries to unavailable servers and enables a truly zero-downtime upgrade process.
Usage