Fix StaleInstancesCleanupTask to properly detect server instances in use #17969
Conversation
Signed-off-by: Jinesh Parakh <jineshparakh@hotmail.com>
There was a problem hiding this comment.
Pull request overview
Fixes StaleInstancesCleanupTask so it correctly detects server instances that are still referenced by table IdealState segment assignments (instead of incorrectly querying by table name), preventing misleading “drop server” behavior for in-use servers.
Changes:
- Update server in-use detection to derive server instances from table IdealState segment assignments.
- Add new unit tests covering in-use vs not-in-use server behavior across multiple tables and null/empty IdealStates.
- Add a stateless integration test ensuring a server still present in IdealState is not dropped.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTask.java |
Fixes server instance in-use detection by scanning IdealState assignments. |
pinot-controller/src/test/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTaskTest.java |
Adds focused unit test coverage for the corrected server in-use detection logic. |
pinot-controller/src/test/java/org/apache/pinot/controller/helix/core/cleanup/StaleInstancesCleanupTaskStatelessTest.java |
Adds stateless regression test to ensure servers referenced in IdealState aren’t dropped. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #17969 +/- ##
============================================
+ Coverage 63.25% 63.30% +0.05%
Complexity 1542 1542
============================================
Files 3196 3196
Lines 193742 193744 +2
Branches 29806 29807 +1
============================================
+ Hits 122547 122654 +107
+ Misses 61585 61454 -131
- Partials 9610 9636 +26
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Jinesh Parakh <jineshparakh@hotmail.com>
|
@xiangfu0 addressed Copilot review comment. Please re-review. |
Problem
StaleInstancesCleanupTask.getServerInstancesInUse()calledIdealState.getInstanceSet(tableName). For Pinot table resources, IdealState partitions are segment names, not the table resource id, so the lookup returned empty and every server was treated as unused.There is no actual data loss.
instanceDropSafetyCheck()correctly iteratesgetPartitionSet()thengetInstanceSet(partition), so it will find that the instance IS referenced in an IdealState and refuse the drop.Sample Table Ideal State from
UpsertQuickStartSample Broker Ideal State from
UpsertQuickStartIMP: brokerResource
mapFieldskey is the table name while table ISmapFieldskey is the segment nameChange
Impact
dropInstance/ ZK reads from the safety check path.Dropping server instancelogging for servers that still host segments.Testing