[apache/helix] -- Issue during onboarding resources without instances #2782

himanshukandwal · 2024-03-18T23:33:11Z

Issues

My PR addresses the following Helix issues and references them in the PR description:
Fixes NPE Issues when we add a WAGED resource without instances against the resource tag #2781

Description

Here are some details about my PR, including screenshots of any UI changes:
When adding a new WAGED resource with a tag but without any instances carrying that tag, we observe an NPE coming from the system. This in turn fails the complete WAGED cluster rebalance pipeline. This happens in particular when numPartitions = 0.

To solve this issue, we are adding a check in the ResourceComputationStage so that such resources are excluded from the pipeline computation and are only considered once there are actual resource partitions (>0) to be assigned to the instances.
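The idea behind the check can be sketched as follows. This is a minimal, self-contained illustration and not the actual change to ResourceComputationStage: `ResourceSketch`, `resourcesToRebalance`, and the resource and tag names are hypothetical stand-ins, and the real stage works on Helix's own resource and cluster data types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmptyResourceFilterSketch {

  /** Hypothetical stand-in for a resource; NOT the real Helix Resource class. */
  public static final class ResourceSketch {
    public final String name;
    public final String instanceGroupTag;
    public final List<String> partitions;

    public ResourceSketch(String name, String instanceGroupTag, List<String> partitions) {
      this.name = name;
      this.instanceGroupTag = instanceGroupTag;
      this.partitions = partitions;
    }
  }

  /**
   * Returns only the resources that have at least one partition to place.
   * Resources with zero partitions are left out, so later stages never see them.
   */
  public static Map<String, ResourceSketch> resourcesToRebalance(Map<String, ResourceSketch> all) {
    Map<String, ResourceSketch> result = new LinkedHashMap<>();
    for (Map.Entry<String, ResourceSketch> entry : all.entrySet()) {
      if (!entry.getValue().partitions.isEmpty()) {
        result.put(entry.getKey(), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, ResourceSketch> all = new LinkedHashMap<>();
    all.put("db_with_partitions",
        new ResourceSketch("db_with_partitions", "tagA", Arrays.asList("p0", "p1")));
    // Assumed scenario: a tagged resource that currently has no partitions to assign.
    all.put("db_without_partitions",
        new ResourceSketch("db_without_partitions", "tagB", Collections.<String>emptyList()));

    // Only the resource with partitions remains in the filtered map.
    System.out.println(resourcesToRebalance(all).keySet());
  }
}
```

Filtering early, before the assignment is computed, means the downstream rebalance code does not need to handle the empty case itself; a skipped resource is picked up again on a later pipeline run once it has partitions.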

org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:501) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) [classes/:?]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
	at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
Caused by: java.lang.NullPointerException
	at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
	... 10 more
94184 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to compute for delayed rebalance overwrites in cluster CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances
94188 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to calculate the new assignments.
org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:514) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
	at org.apache.helix.util.RebalanceUtil.runStage(RebalanceUtil.java:235) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.calcBestPossState(BestPossibleExternalViewVerifier.java:444) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.verifyState(BestPossibleExternalViewVerifier.java:293) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:209) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:229) [classes/:?]
	at org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances.testExpandClusterWithResourceWithoutInstances(TestWagedClusterExpansionWithAddingResourcesBeforeInstances.java:151) [test-classes/:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
Caused by: java.lang.NullPointerException
	at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
	... 37 more

Tests

The following tests are written for this issue:

(List the names of added unit/integration tests)

TestWagedClusterExpansionWithAddingResourcesBeforeInstances.testExpandClusterWithResourceWithoutInstances

The following is the result of the "mvn test" command on the appropriate module:

➜  helix_os_hk git:(hkandwal/waged-adding-resources-without-capacity) mvn clean install -Dmaven.test.skip.exec=true && mvn test -o -Dtest=TestWagedClusterExpansionWithAddingResourcesBeforeInstances -pl=helix-core

[INFO] --- surefire:3.0.0-M3:test (default-test) @ helix-core ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
Start zookeeper at localhost:2183 in thread main
AfterClass: TestWagedClusterExpansionWithAddingResourcesBeforeInstances called.
Shut down zookeeper at port 2183 in thread main
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.354 s - in org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/hkandwal/Documents/workspaces/projects/helix_os_hk/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 950 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  54.225 s
[INFO] Finished at: 2024-03-27T16:29:31-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

junkaixue · 2024-03-21T20:26:37Z

Is this just for adding a test?

himanshukandwal · 2024-03-27T23:30:58Z

> Is this just for adding a test?

Hey @junkaixue, yes. I first reproduced it using the test and have now added the fix. Would you please review it when you get a chance?

zpinto

LGTM! Thanks for investigating this issue.

zpinto

Please remove WIP from title

himanshukandwal · 2024-03-28T06:12:43Z

> Please remove WIP from title

Done

himanshukandwal · 2024-03-28T06:13:44Z

This PR has been reviewed and approved by @zpinto.

Final Commit Message: When adding a new WAGED resource with a tag and without any instances against that tag, we are observing NPE coming from the system. To solve this issue we are adding a check in the ResourceComputationStage to have such resources excluded from the pipeline computation and only be considered when there are actual resource partitions (>0) to be assigned to the instances.

junkaixue · 2024-03-31T22:02:02Z

lgtm. Let's wait for the tests.

himanshukandwal · 2024-04-01T18:04:37Z

Hey @junkaixue, the PR CI was successful.

I also ran the full CI suite in my repo, and that was successful as well:
https://github.com/himanshukandwal/helix/actions/runs/8471701884/job/23212239715

New Release Snapshot with several fixes: [apache/helix] -- Issue during onboarding resources without instances apache#2782 [apache/helix] -- Provide JDK 1.8 (backward) compatibility of helix-core apache#2775 Do not start the server if user uses the default SECRET_TOKEN env value apache#2783 Delete expected version apache#2759 [apache/helix] -- Fix PreferenceList Ordering Changes during Maintenance Mode apache#2778

[apache/helix] -- Issue during onboarding resources without instances.

05038d3

[apache/helix] -- Issue during onboarding resources without instances.

fb1468f

zpinto reviewed Mar 27, 2024


zpinto approved these changes Mar 27, 2024


himanshukandwal changed the title ~~[WIP][apache/helix] -- Issue during onboarding resources without instances~~ [apache/helix] -- Issue during onboarding resources without instances Mar 28, 2024

github-actions bot added the CheckedAndApproved label Mar 28, 2024

junkaixue merged commit c480eac into apache:master Apr 1, 2024
5 checks passed

zpinto mentioned this pull request Apr 30, 2024

New Release Snapshot linkedin/helix#2

Merged
