Add api for server query route enable/disable #18190

Open
hongkunxu wants to merge 3 commits into apache:master from hongkunxu:fix/instance_config_sync

@hongkunxu hongkunxu commented Apr 14, 2026

PR Description

This change introduces an HTTP API to enable or disable query routing for a specific server.

The primary use case is to support safe and seamless rolling upgrades of a Pinot cluster. In some of our data centers, Pinot clusters do not have backup clusters, so it is critical to ensure zero impact during upgrades.

Although the existing shutdownInProgress flag provides some control, it does not cover every edge case. For example, when a server (or pod) is terminated abruptly, Pinot cannot immediately remove it from query routing, which can cause transient query failures.

With this new API, we can explicitly disable query routing to a server before upgrading the pod, and re-enable it after the upgrade is complete. This approach avoids routing queries to unavailable servers and enables a truly zero-downtime upgrade process.

Usage

# Disable query route for server
curl -X PUT "http://<controller>:9000/instances/Server_host_20000/queriesRoute?state=QUERIES_DISABLE"

# Enable query route for server
curl -X PUT "http://<controller>:9000/instances/Server_host_20000/queriesRoute?state=QUERIES_ENABLE"
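
For upgrade automation written in Java, the same calls could be built roughly as follows. This is a minimal sketch, not code from the PR: the class name `QueriesRouteRequests`, the helper `toggle`, and the controller address `http://localhost:9000` are assumptions for illustration. Only the path and the two `state` values come from the curl examples above. The helper builds the request without sending it, so it can be inspected or logged first.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of Pinot: builds the PUT requests shown in the curl examples.
final class QueriesRouteRequests {
  // State values copied from the Usage section of this PR.
  static final String DISABLE = "QUERIES_DISABLE";
  static final String ENABLE = "QUERIES_ENABLE";

  /** Builds (but does not send) the PUT request that toggles query routing for one server. */
  static HttpRequest toggle(String controllerBaseUrl, String instanceId, boolean enable) {
    String state = enable ? ENABLE : DISABLE;
    URI uri = URI.create(
        controllerBaseUrl + "/instances/" + instanceId + "/queriesRoute?state=" + state);
    return HttpRequest.newBuilder(uri)
        .PUT(HttpRequest.BodyPublishers.noBody())
        .build();
  }

  public static void main(String[] args) {
    // "http://localhost:9000" is a placeholder controller address.
    HttpRequest drain = toggle("http://localhost:9000", "Server_host_20000", false);
    System.out.println(drain.method() + " " + drain.uri());
  }
}
```

To actually send a request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and check the status code before proceeding with the pod upgrade.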

@hongkunxu hongkunxu changed the title from "enhance: add api for server query route enable/disable" to "Add api for server query route enable/disable" Apr 14, 2026
codecov-commenter commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.44%. Comparing base (e4a18a0) to head (d92a5ac).
⚠️ Report is 3 commits behind head on master.

Files with missing lines                               Patch %   Lines
...er/api/resources/PinotInstanceRestletResource.java  83.33%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18190   +/-   ##
=========================================
  Coverage     63.44%   63.44%           
  Complexity     1627     1627           
=========================================
  Files          3244     3244           
  Lines        197250   197263   +13     
  Branches      30514    30517    +3     
=========================================
+ Hits         125136   125153   +17     
- Misses        62082    62091    +9     
+ Partials      10032    10019   -13     
Flag                 Coverage          Δ
custom-integration1  100.00% <ø>       (ø)
integration          100.00% <ø>       (ø)
integration1         100.00% <ø>       (ø)
integration2         0.00% <ø>         (ø)
java-11              63.41% <92.30%>   (+0.01%) ⬆️
java-21              63.41% <92.30%>   (-0.01%) ⬇️
temurin              63.44% <92.30%>   (+<0.01%) ⬆️
unittests            63.44% <92.30%>   (+<0.01%) ⬆️
unittests1           55.39% <ø>        (-0.01%) ⬇️
unittests2           34.98% <92.30%>   (+<0.01%) ⬆️

@xiangfu0 xiangfu0 left a comment

Finding 1

  • Severity: CRITICAL
  • Rule: Distributed state mutations must be field-scoped or version-safe
  • Where: pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java in setQueriesRoute(); interacts with server lifecycle writes in pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java
  • Issue: setQueriesRoute() fetches the whole InstanceConfig, mutates one simple field, and writes the entire record back with _helixDataAccessor.setProperty(...). This API is explicitly meant to run during rolling upgrades, which is exactly when the server is also toggling shutdownInProgress through Helix config writes. A stale full-record write here can clobber that lifecycle bit (or other concurrent instance-config updates) depending on write ordering.
  • Risk: Broker routing uses both shutdownInProgress and queriesDisabled to decide whether a server is routable. If these two flags can overwrite each other during the upgrade path, we can route to a server that is shutting down, or keep a healthy server out of routing after it comes back. That is a real query-path safety issue for the exact workflow this PR is trying to support.
  • Suggested fix: Update only queriesDisabled with a field-scoped Helix write (HelixAdmin.setConfig(...) / updateProperty(...) with a DataUpdater) instead of replacing the whole InstanceConfig, and add a regression test that interleaves this API with server startup/shutdown flag transitions.
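
The interleaving described in Finding 1 can be replayed with a small in-memory model. This is a sketch under stated assumptions, not Pinot or Helix code: `InstanceConfigWrites` and its methods are hypothetical, and an `AtomicReference` stands in for the stored record. It only illustrates why a stale whole-record write loses the concurrent `shutdownInProgress` update while a field-scoped merge keeps it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model: the real instance config lives in Helix/ZooKeeper, not in this class.
final class InstanceConfigWrites {
  private final AtomicReference<Map<String, String>> _store =
      new AtomicReference<>(Map.<String, String>of());

  Map<String, String> read() {
    return _store.get();
  }

  /** Whole-record write: replaces everything with the caller's (possibly stale) copy. */
  void writeWhole(Map<String, String> record) {
    _store.set(Map.copyOf(record));
  }

  /** Field-scoped write: merges one field into whatever is currently stored. */
  void writeField(String key, String value) {
    _store.updateAndGet(current -> {
      Map<String, String> next = new HashMap<>(current);
      next.put(key, value);
      return Map.copyOf(next);
    });
  }

  /** Replays the interleaving from Finding 1 and returns the final record. */
  static Map<String, String> replay(boolean fieldScoped) {
    InstanceConfigWrites store = new InstanceConfigWrites();
    store.writeField("shutdownInProgress", "false");
    // The controller reads the record before the server flips its lifecycle flag.
    Map<String, String> stale = new HashMap<>(store.read());
    // The server starts shutting down.
    store.writeField("shutdownInProgress", "true");
    if (fieldScoped) {
      store.writeField("queriesDisabled", "true");
    } else {
      // The stale copy still carries shutdownInProgress=false and overwrites the server's update.
      stale.put("queriesDisabled", "true");
      store.writeWhole(stale);
    }
    return store.read();
  }

  public static void main(String[] args) {
    System.out.println("whole-record: " + replay(false));
    System.out.println("field-scoped: " + replay(true));
  }
}
```

In the whole-record replay, `shutdownInProgress` ends up back at `false` even though the server set it to `true`. In the field-scoped replay both flags survive. The review's suggested Helix calls (`HelixAdmin.setConfig(...)` or `updateProperty(...)` with a `DataUpdater`) play the role of `writeField` here.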

Finding 2

  • Severity: MAJOR
  • Rule: Unsupported operational states must fail fast, not succeed as a no-op
  • Where: pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotInstanceRestletResource.java in toggleQueriesRoute() and pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java in setQueriesRoute()
  • Issue: The API is documented as a server-routing control, but it accepts any Helix instance id and returns success after writing queriesDisabled into the instance config. Broker routing only consults queriesDisabled for server instances, so calling this on a broker/controller/minion is a silent no-op from the query-routing perspective.
  • Risk: This is meant for operational upgrade automation. A successful response on the wrong instance type can make automation think traffic was drained when nothing in broker routing actually changed, and it also pollutes shared instance-config state with a server-only flag.
  • Suggested fix: Reject non-server instances up front (for example with InstanceTypeUtils.isServer(instanceName)), and add a negative test that verifies broker/minion/controller ids are rejected.
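
The fail-fast check from Finding 2 might look like the sketch below. It is an illustration only: `ServerOnlyGuard` and its prefix test are hypothetical stand-ins for the `InstanceTypeUtils.isServer(instanceName)` call the review suggests, and the `Server_` prefix is assumed from the instance id in the Usage example. Pinot's own instance-type detection may apply different rules.

```java
// Hypothetical guard, not Pinot code: rejects non-server ids before any config write.
final class ServerOnlyGuard {
  // Prefix assumed from the Usage example ("Server_host_20000").
  private static final String SERVER_PREFIX = "Server_";

  static boolean looksLikeServer(String instanceId) {
    return instanceId != null && instanceId.startsWith(SERVER_PREFIX);
  }

  /** Throws when the target is not a server, so callers never see a silent no-op. */
  static void requireServer(String instanceId) {
    if (!looksLikeServer(instanceId)) {
      throw new IllegalArgumentException(
          "queriesRoute is only supported for server instances, got: " + instanceId);
    }
  }

  public static void main(String[] args) {
    requireServer("Server_host_20000"); // accepted
    try {
      requireServer("Broker_host_8099"); // rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In a REST resource, the thrown exception would typically be mapped to a 4xx response so that automation sees the failure instead of a misleading success.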

Finding 3

  • Severity: MAJOR
  • Rule: Query-routing changes need coverage on the actual user-facing pipeline
  • Where: pinot-controller/src/test/java/org/apache/pinot/controller/api/PinotInstanceRestletResourceTest.java in testToggleQueriesRoute()
  • Issue: The new test only verifies that the controller GET response echoes the queriesDisabled bit and that bad input is rejected. It never verifies that a broker actually removes and re-adds the server from routing, which is the behavior promised in the PR description.
  • Risk: A regression in the broker instance-config listener, or a state-clobbering bug like the one above, would still pass this test. That leaves the zero-downtime-upgrade contract untested.
  • Suggested fix: Add broker-routing-manager or integration coverage that flips the flag and asserts the server leaves and re-enters routing, ideally alongside the lifecycle interleaving case above.
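
The shape of the assertion that Finding 3 asks for can be shown with a toy model. This is a sketch and not broker code: `RoutingModel` is hypothetical, and its routability rule (neither `shutdownInProgress` nor `queriesDisabled` set) is taken from the description in Finding 1. A real test would drive the broker routing manager or an integration cluster instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of per-server flags; not the broker's routing manager.
final class RoutingModel {
  static final class Flags {
    boolean shutdownInProgress;
    boolean queriesDisabled;
  }

  private final Map<String, Flags> _servers = new LinkedHashMap<>();

  void addServer(String instanceId) {
    _servers.put(instanceId, new Flags());
  }

  void setQueriesDisabled(String instanceId, boolean disabled) {
    _servers.get(instanceId).queriesDisabled = disabled;
  }

  void setShutdownInProgress(String instanceId, boolean shuttingDown) {
    _servers.get(instanceId).shutdownInProgress = shuttingDown;
  }

  /** A server is routable only when neither flag is set. */
  Set<String> routable() {
    Set<String> result = new TreeSet<>();
    for (Map.Entry<String, Flags> entry : _servers.entrySet()) {
      Flags flags = entry.getValue();
      if (!flags.shutdownInProgress && !flags.queriesDisabled) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    RoutingModel model = new RoutingModel();
    model.addServer("Server_a_20000");
    model.addServer("Server_b_20000");
    model.setQueriesDisabled("Server_a_20000", true);
    System.out.println(model.routable()); // prints [Server_b_20000]
    model.setQueriesDisabled("Server_a_20000", false);
    System.out.println(model.routable()); // prints [Server_a_20000, Server_b_20000]
  }
}
```

The test worth adding follows the same pattern: flip the flag, assert the server leaves routing, flip it back, and assert the server re-enters, ideally with a `shutdownInProgress` transition interleaved between the two steps.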

Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
@hongkunxu hongkunxu force-pushed the fix/instance_config_sync branch from ec6720e to a633cf4 Compare April 17, 2026 10:54
Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
@hongkunxu hongkunxu requested a review from xiangfu0 April 17, 2026 13:19
3 participants