Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

zpinto · 2024-01-17T03:44:36Z

Issues

PartitionAssignmentAPI returns incorrect finalMapping when changes are applied and real pipeline computes

Description

It was reported that the helix-rest partitionAssignmentAPI was returning an incorrect finalMapping. When the changes simulated were applied, the finalMapping generated by the controller was different. In theory, the finalMapping can be simulated correctly if all the inputs to the algorithm exactly match those passed in the controller.

After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

I have added integration tests to test that the pipeline produces the same finalMapping as the result generated by the partitionAssignmentAPI.

Tests

testComputePartitionAssignmentAddInstance
testComputePartitionAssignmentReplaceInstance

Changes that Break Backward Compatibility (Optional)

NA

Documentation (Optional)

NA

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

…rsion is generated by a pipeline run. Previously, it would only persist when there was an instance would become inactive and cause an emergency rebalance. The most recently generated best possible state is necessary for the partitionAssignmentAPI to generate the correct finalMapping. Without this it will generate the finalMapping based off of a stale BEST_POSSIBLE_STATE. Also, fix offlineTimeMap and WagedCapacityProvider by modifying the cache logic for disabled instances.

desaikomal · 2024-01-17T04:27:56Z

First and foremost, excellent debugging. QQ: After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

I understand the second part, very well. You removed '_assignableDisabledInstanceSet' - how does it relate to the analysis?
for first part, the problem was happening only when we have certain hosts crossing over the delayed window? so essentially global pipeline gets updated, but partial pipeline is not persisted, so we get results are different?

desaikomal · 2024-01-17T04:24:21Z

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java

@@ -552,7 +551,6 @@ protected synchronized Set<HelixConstants.ChangeType> doRefresh(HelixDataAccesso

    updateIdealRuleMap(getClusterConfig());
    updateDisabledInstances(getInstanceConfigMap().values(),


this is the cache which required update?

Yes, in node swap cache changes we introduced _assignableDisabledInstanceSet. We were not clearing it on cache refresh. This would lead WagedCapacityProvider to not charge pending messages against the instances capacity for any instance that was offline at some point.

After revising, this cache is not necessary as we can use _disabledInstanceSet to determine enabled instances and enabled assignable instances since it will be a superset of _assignableDisabledInstanceSet.

desaikomal · 2024-01-17T04:29:16Z

...core/src/main/java/org/apache/helix/controller/rebalancer/waged/AssignmentMetadataStore.java

@@ -76,6 +77,10 @@ public Map<String, ResourceAssignment> getBaseline() {
    return _globalBaseline;
  }

+  public boolean hasPersistedLatestBestPossibleAssignment() {


can you add detail as to why we need this variable?

Added a docstring and more comments in code

himanshukandwal

Thanks Zach for the amazing investigation and finding the complex bugs.

himanshukandwal · 2024-01-17T20:24:07Z

...core/src/main/java/org/apache/helix/controller/rebalancer/waged/AssignmentMetadataStore.java

  protected volatile int _bestPossibleVersion = 0;
+  protected volatile int _lastPersistedBestPossibleVersion = 0;


Can we initialize these two variables with the ZK state i.e. path:
/ASSIGNMENT_METADATA/BEST_POSSIBLE/LAST_SUCCESSFUL_WRITE

In order to do this, we would have to add a public method to BucketDataAccessor interface. This will cause a backwards incompatible change.

Let's hold off on this for now, as the actual version does not matter in the context of the AssignmentMetadataStore object. The use of version here is local to the controller thread processing the pipeline for the cluster. We only need to maintain version here to make sure we flush the cache to ZK eventually.

zpinto · 2024-01-23T18:14:35Z

This PR is ready to be merged.

Commit Message:
offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

zpinto force-pushed the fix/partition_assignment_api branch from 8abc054 to c725ee1 Compare January 17, 2024 04:01

Change test name and add trailing newline.

46d537f

desaikomal reviewed Jan 17, 2024

View reviewed changes

himanshukandwal approved these changes Jan 17, 2024

View reviewed changes

Add comments.

501b0e7

github-actions bot added the CheckedAndApproved label Jan 17, 2024

Fix teardown for tests.

74b9b31

xyuanlu approved these changes Jan 23, 2024

View reviewed changes

xyuanlu merged commit 3cd22a3 into apache:master Jan 23, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

zpinto commented Jan 17, 2024

desaikomal commented Jan 17, 2024

desaikomal Jan 17, 2024

zpinto Jan 17, 2024

desaikomal Jan 17, 2024

zpinto Jan 17, 2024

himanshukandwal left a comment

himanshukandwal Jan 17, 2024

zpinto Jan 17, 2024

zpinto commented Jan 23, 2024

		@@ -552,7 +551,6 @@ protected synchronized Set<HelixConstants.ChangeType> doRefresh(HelixDataAccesso

		updateIdealRuleMap(getClusterConfig());
		updateDisabledInstances(getInstanceConfigMap().values(),

		protected volatile int _bestPossibleVersion = 0;
		protected volatile int _lastPersistedBestPossibleVersion = 0;

Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

Conversation

zpinto commented Jan 17, 2024

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

desaikomal commented Jan 17, 2024

desaikomal Jan 17, 2024

Choose a reason for hiding this comment

zpinto Jan 17, 2024

Choose a reason for hiding this comment

desaikomal Jan 17, 2024

Choose a reason for hiding this comment

zpinto Jan 17, 2024

Choose a reason for hiding this comment

himanshukandwal left a comment

Choose a reason for hiding this comment

himanshukandwal Jan 17, 2024

Choose a reason for hiding this comment

zpinto Jan 17, 2024

Choose a reason for hiding this comment

zpinto commented Jan 23, 2024