[apache/helix] -- Issue during onboarding resources without instances #2782

@himanshukandwal himanshukandwal commented Mar 18, 2024

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    When a new WAGED resource is added with a tag and no instances exist for that tag, the controller throws an NPE. This in turn fails the entire WAGED cluster rebalance pipeline. It happens in particular when numPartitions = 0.

To solve this issue, we add a check in ResourceComputationStage that excludes such resources from the pipeline computation; they are considered only once there are actual resource partitions (> 0) to be assigned to instances.
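
As a rough illustration of the kind of guard described above, here is a minimal, self-contained sketch. It does not use the real Helix classes: `ResourceSketch` and `excludeEmptyResources` are hypothetical names invented for this example, and the actual check in ResourceComputationStage may look quite different. The sketch only shows the idea of dropping resources with zero partitions before they reach later stages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ZeroPartitionFilterSketch {

  /** Hypothetical stand-in for a resource seen by the pipeline; not a Helix class. */
  static final class ResourceSketch {
    final String name;
    final int numPartitions;

    ResourceSketch(String name, int numPartitions) {
      this.name = name;
      this.numPartitions = numPartitions;
    }
  }

  /**
   * Returns a new map containing only resources that have at least one
   * partition. Resources with numPartitions == 0 are skipped for this
   * pipeline run and would be picked up again once they have partitions.
   */
  static Map<String, ResourceSketch> excludeEmptyResources(
      Map<String, ResourceSketch> allResources) {
    Map<String, ResourceSketch> result = new LinkedHashMap<>();
    for (Map.Entry<String, ResourceSketch> entry : allResources.entrySet()) {
      if (entry.getValue().numPartitions > 0) {
        result.put(entry.getKey(), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, ResourceSketch> all = new LinkedHashMap<>();
    all.put("dbWithPartitions", new ResourceSketch("dbWithPartitions", 8));
    all.put("dbPendingInstances", new ResourceSketch("dbPendingInstances", 0));

    // Only the resource with partitions is passed on to later stages.
    System.out.println(excludeEmptyResources(all).keySet()); // prints [dbWithPartitions]
  }
}
```

Filtering early, rather than adding null checks deeper in the rebalancer, keeps the downstream stages unchanged; whether the merged fix takes exactly this shape is an assumption of this sketch.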

org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:501) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) [classes/:?]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
	at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
Caused by: java.lang.NullPointerException
	at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
	... 10 more
94184 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to compute for delayed rebalance overwrites in cluster CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances
94188 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to calculate the new assignments.
org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:514) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
	at org.apache.helix.util.RebalanceUtil.runStage(RebalanceUtil.java:235) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.calcBestPossState(BestPossibleExternalViewVerifier.java:444) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.verifyState(BestPossibleExternalViewVerifier.java:293) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:209) [classes/:?]
	at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:229) [classes/:?]
	at org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances.testExpandClusterWithResourceWithoutInstances(TestWagedClusterExpansionWithAddingResourcesBeforeInstances.java:151) [test-classes/:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
Caused by: java.lang.NullPointerException
	at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
	... 37 more

Tests

  • The following tests are written for this issue:

(List the names of added unit/integration tests)

  • TestWagedClusterExpansionWithAddingResourcesBeforeInstances.testExpandClusterWithResourceWithoutInstances
  • The following is the result of the "mvn test" command on the appropriate module:
➜  helix_os_hk git:(hkandwal/waged-adding-resources-without-capacity) mvn clean install -Dmaven.test.skip.exec=true && mvn test -o -Dtest=TestWagedClusterExpansionWithAddingResourcesBeforeInstances -pl=helix-core

[INFO] --- surefire:3.0.0-M3:test (default-test) @ helix-core ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
Start zookeeper at localhost:2183 in thread main
AfterClass: TestWagedClusterExpansionWithAddingResourcesBeforeInstances called.
Shut down zookeeper at port 2183 in thread main
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.354 s - in org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/hkandwal/Documents/workspaces/projects/helix_os_hk/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 950 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  54.225 s
[INFO] Finished at: 2024-03-27T16:29:31-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@junkaixue

Is this just for adding a test?

@himanshukandwal

Is this just for adding a test?

Hey @junkaixue, yes. I first reproduced the issue with the test and have now added the fix. Would you please review it when you get a chance?

@zpinto zpinto left a comment

LGTM! thanks for investigating this issue.

@zpinto zpinto left a comment

Please remove WIP from title

@himanshukandwal himanshukandwal changed the title [WIP][apache/helix] -- Issue during onboarding resources without instances [apache/helix] -- Issue during onboarding resources without instances Mar 28, 2024
@himanshukandwal

Please remove WIP from title

Done

@himanshukandwal

This PR has been reviewed and approved by @zpinto.

Final Commit Message: When adding a new WAGED resource with a tag and without any instances against that tag, we are observing NPE coming from the system. To solve this issue we are adding a check in the ResourceComputationStage to have such resources excluded from the pipeline computation and only be considered when there are actual resource partitions (>0) to be assigned to the instances.

@junkaixue

lgtm. Let's wait for the tests.

@himanshukandwal

Hey @junkaixue, the PR CI was successful.

I also ran the full CI suite in my repo, and that was successful as well:
https://github.com/himanshukandwal/helix/actions/runs/8471701884/job/23212239715

@junkaixue junkaixue merged commit c480eac into apache:master Apr 1, 2024
5 checks passed
zpinto added a commit to linkedin/helix that referenced this pull request Apr 30, 2024
New Release Snapshot with several fixes:

[apache/helix] -- Issue during onboarding resources without instances apache#2782
[apache/helix] -- Provide JDK 1.8 (backward) compatibility of helix-core apache#2775
Do not start the server if user uses the default SECRET_TOKEN env value apache#2783
Delete expected version apache#2759
[apache/helix] -- Fix PreferenceList Ordering Changes during Maintenance Mode apache#2778
Development

Successfully merging this pull request may close these issues.

NPE Issues when we add a WAGED resource without instances against the resource tag
3 participants