
Fix StaleInstancesCleanupTask to properly detect server instances in use #17969

Merged
xiangfu0 merged 2 commits into apache:master from jineshparakh:fix-empty-server-instances-in-use
Mar 25, 2026

@jineshparakh

Problem

StaleInstancesCleanupTask.getServerInstancesInUse() called IdealState.getInstanceSet(tableName). For Pinot table resources, the IdealState partitions are segment names, not the table resource ID, so the lookup always returned an empty set and every server was treated as unused.

This caused no actual data loss: instanceDropSafetyCheck() correctly iterates getPartitionSet() and calls getInstanceSet(partition) for each partition, so it finds that the instance is still referenced in an IdealState and refuses the drop.
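
The mismatch can be reproduced without Helix. The sketch below is only an illustration under stated assumptions: it models an IdealState's mapFields as a plain nested map, and `IdealStateLookupSketch`, `instancesFor`, and `instancesAcrossPartitions` are hypothetical names, not the Helix API. It shows why a lookup keyed by the table name comes back empty while iterating the partitions finds the servers.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative stand-in for a table IdealState's mapFields:
// partition (segment) name -> instance name -> state.
// This is NOT the Helix IdealState class; all names here are hypothetical.
final class IdealStateLookupSketch {

  // Per-key lookup: instances listed under one mapFields key,
  // or an empty set if the key is absent.
  static Set<String> instancesFor(Map<String, Map<String, String>> mapFields, String key) {
    Map<String, String> instanceStates = mapFields.get(key);
    if (instanceStates == null) {
      return Collections.emptySet();
    }
    return instanceStates.keySet();
  }

  // Union of instances over every partition, the pattern the description
  // above attributes to instanceDropSafetyCheck().
  static Set<String> instancesAcrossPartitions(Map<String, Map<String, String>> mapFields) {
    Set<String> instances = new TreeSet<>();
    for (String partition : mapFields.keySet()) {
      instances.addAll(instancesFor(mapFields, partition));
    }
    return instances;
  }

  // mapFields copied from the upsertMeetupRsvp_REALTIME sample in this PR.
  static Map<String, Map<String, String>> sampleTableMapFields() {
    Map<String, Map<String, String>> mapFields = new TreeMap<>();
    mapFields.put("upsertMeetupRsvp__0__0__20260325T0911Z",
        Map.of("Server_100.112.214.70_7052", "CONSUMING"));
    mapFields.put("upsertMeetupRsvp__1__0__20260325T0911Z",
        Map.of("Server_100.112.214.70_7053", "CONSUMING"));
    return mapFields;
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> mapFields = sampleTableMapFields();
    // Keyed by table name: no partition has that name, so the set is empty.
    System.out.println(instancesFor(mapFields, "upsertMeetupRsvp_REALTIME"));
    // Keyed by each segment in turn: both servers are found.
    System.out.println(instancesAcrossPartitions(mapFields));
  }
}
```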

Sample Table Ideal State from UpsertQuickStart

{
  "id": "upsertMeetupRsvp_REALTIME",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "INSTANCE_GROUP_TAG": "upsertMeetupRsvp_REALTIME",
    "MAX_PARTITIONS_PER_INSTANCE": "1",
    "NUM_PARTITIONS": "0",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "upsertMeetupRsvp__0__0__20260325T0911Z": {
      "Server_100.112.214.70_7052": "CONSUMING"
    },
    "upsertMeetupRsvp__1__0__20260325T0911Z": {
      "Server_100.112.214.70_7053": "CONSUMING"
    }
  },
  "listFields": {}
}

Sample Broker Ideal State from UpsertQuickStart

{
  "id": "brokerResource",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "NUM_PARTITIONS": "13",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "0",
    "STATE_MODEL_DEF_REF": "BrokerResourceOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "airlineStats_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "dailySales_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "fineFoodReviews-federated": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "fineFoodReviews_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "fineFoodReviews_part_0_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "fineFoodReviews_part_1_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "githubEvents_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "meetupRsvpComplexType_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "meetupRsvpJson_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "meetupRsvp_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "upsertJsonMeetupRsvp_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "upsertMeetupRsvp_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    },
    "upsertPartialMeetupRsvp_REALTIME": {
      "Broker_100.112.214.70_8000": "ONLINE"
    }
  },
  "listFields": {}
}

Important: in brokerResource, the mapFields keys are table names, whereas in a table IdealState the mapFields keys are segment names.
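
To make the contrast concrete, here is a small sketch that runs the same key lookup against both shapes. The maps are trimmed copies of the two samples above, and `MapFieldsKeySketch` and `instancesUnderKey` are hypothetical helpers for illustration, not part of Helix or Pinot.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: the same "look up by table name" call behaves
// differently depending on what the mapFields keys represent.
final class MapFieldsKeySketch {

  // brokerResource: mapFields key is the table name.
  static final Map<String, Map<String, String>> BROKER_RESOURCE = Map.of(
      "upsertMeetupRsvp_REALTIME", Map.of("Broker_100.112.214.70_8000", "ONLINE"));

  // Table IdealState: mapFields key is the segment name.
  static final Map<String, Map<String, String>> TABLE_IDEAL_STATE = Map.of(
      "upsertMeetupRsvp__0__0__20260325T0911Z", Map.of("Server_100.112.214.70_7052", "CONSUMING"),
      "upsertMeetupRsvp__1__0__20260325T0911Z", Map.of("Server_100.112.214.70_7053", "CONSUMING"));

  // Instances listed under a single mapFields key; empty when the key is absent.
  static Set<String> instancesUnderKey(Map<String, Map<String, String>> mapFields, String key) {
    return mapFields.getOrDefault(key, Map.of()).keySet();
  }

  public static void main(String[] args) {
    String tableName = "upsertMeetupRsvp_REALTIME";
    // Finds the broker, because brokerResource is keyed by table name.
    System.out.println(instancesUnderKey(BROKER_RESOURCE, tableName));
    // Finds nothing, because the table IdealState is keyed by segment name.
    System.out.println(instancesUnderKey(TABLE_IDEAL_STATE, tableName));
  }
}
```

A table-name lookup is therefore a reasonable pattern for brokers, but it presumably cannot carry over to servers unchanged.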

Change

  • Build the set of server instances from table IdealState segment assignments
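
A minimal sketch of that idea, assuming each table's IdealState can be viewed as a segment-to-instance-state map. The class and method names are hypothetical and the actual change in StaleInstancesCleanupTask may be structured differently. A table whose IdealState is missing is modeled as a null entry and skipped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the merged Pinot code: collect every server that
// appears in any segment assignment of any table.
final class ServerInstancesInUseSketch {

  // idealStateByTable: table name -> (segment name -> (instance name -> state)).
  // A null value stands for a table whose IdealState could not be read.
  static Set<String> serverInstancesInUse(
      Map<String, Map<String, Map<String, String>>> idealStateByTable) {
    Set<String> inUse = new TreeSet<>();
    for (Map<String, Map<String, String>> segmentAssignment : idealStateByTable.values()) {
      if (segmentAssignment == null) {
        continue; // no IdealState for this table, so it contributes nothing
      }
      for (Map<String, String> instanceStates : segmentAssignment.values()) {
        inUse.addAll(instanceStates.keySet());
      }
    }
    return inUse;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Map<String, String>>> tables = new HashMap<>();
    tables.put("upsertMeetupRsvp_REALTIME", Map.of(
        "upsertMeetupRsvp__0__0__20260325T0911Z", Map.of("Server_100.112.214.70_7052", "CONSUMING"),
        "upsertMeetupRsvp__1__0__20260325T0911Z", Map.of("Server_100.112.214.70_7053", "CONSUMING")));
    tables.put("tableWithoutIdealState_OFFLINE", null); // hypothetical table name
    System.out.println(serverInstancesInUse(tables));
  }
}
```

Any server absent from the returned set would then remain a cleanup candidate under the existing rules.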

Impact

  • Fewer unnecessary dropInstance calls and ZK reads on the safety check path.
  • Fewer misleading "Dropping server instance" log lines for servers that still host segments.

Testing

  • Unit and integration tests covering two cases: a server assigned in an IdealState is not dropped, and an unassigned server is still eligible for cleanup under the existing rules.

Signed-off-by: Jinesh Parakh <jineshparakh@hotmail.com>
@xiangfu0 added the "bug" label (Something is not working as expected) on Mar 25, 2026

Copilot AI left a comment

Pull request overview

Fixes StaleInstancesCleanupTask so it correctly detects server instances that are still referenced by table IdealState segment assignments (instead of incorrectly querying by table name), preventing misleading “drop server” behavior for in-use servers.

Changes:

  • Update server in-use detection to derive server instances from table IdealState segment assignments.
  • Add new unit tests covering in-use vs not-in-use server behavior across multiple tables and null/empty IdealStates.
  • Add a stateless integration test ensuring a server still present in IdealState is not dropped.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File: Description
  • pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTask.java: Fixes server instance in-use detection by scanning IdealState assignments.
  • pinot-controller/src/test/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTaskTest.java: Adds focused unit test coverage for the corrected server in-use detection logic.
  • pinot-controller/src/test/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTaskStatelessTest.java: Adds stateless regression test to ensure servers referenced in IdealState aren’t dropped.

@codecov-commenter

codecov-commenter commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.30%. Comparing base (fc424c4) to head (8157df2).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17969      +/-   ##
============================================
+ Coverage     63.25%   63.30%   +0.05%     
  Complexity     1542     1542              
============================================
  Files          3196     3196              
  Lines        193742   193744       +2     
  Branches      29806    29807       +1     
============================================
+ Hits         122547   122654     +107     
+ Misses        61585    61454     -131     
- Partials       9610     9636      +26     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.28% <100.00%> (+0.07%) ⬆️
java-21 63.18% <100.00%> (-0.04%) ⬇️
temurin 63.30% <100.00%> (+0.05%) ⬆️
unittests 63.30% <100.00%> (+0.05%) ⬆️
unittests1 55.56% <ø> (-0.02%) ⬇️
unittests2 34.23% <100.00%> (+0.06%) ⬆️

Signed-off-by: Jinesh Parakh <jineshparakh@hotmail.com>
@jineshparakh jineshparakh requested a review from xiangfu0 March 25, 2026 11:35
@jineshparakh

@xiangfu0 addressed Copilot review comment. Please re-review.

@xiangfu0 xiangfu0 merged commit a80c9ee into apache:master Mar 25, 2026
16 checks passed