Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739

Merged
merged 4 commits into from
Jan 23, 2024

Conversation

zpinto
Copy link
Contributor

@zpinto zpinto commented Jan 17, 2024

Issues

  • PartitionAssignmentAPI returns incorrect finalMapping when changes are applied and real pipeline computes

Description

It was reported that the helix-rest partitionAssignmentAPI was returning an incorrect finalMapping. When the changes simulated were applied, the finalMapping generated by the controller was different. In theory, the finalMapping can be simulated correctly if all the inputs to the algorithm exactly match those passed in the controller.

After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

I have added integration tests to test that the pipeline produces the same finalMapping as the result generated by the partitionAssignmentAPI.

Tests

  • testComputePartitionAssignmentAddInstance
  • testComputePartitionAssignmentReplaceInstance

Changes that Break Backward Compatibility (Optional)

NA

Documentation (Optional)

NA

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

…rsion is generated by a pipeline run. Previously, it would only persist when there was an instance would become inactive and cause an emergency rebalance. The most recently generated best possible state is necessary for the partitionAssignmentAPI to generate the correct finalMapping. Without this it will generate the finalMapping based off of a stale BEST_POSSIBLE_STATE. Also, fix offlineTimeMap and WagedCapacityProvider by modifying the cache logic for disabled instances.
@zpinto zpinto force-pushed the fix/partition_assignment_api branch from 8abc054 to c725ee1 Compare January 17, 2024 04:01
@desaikomal
Copy link
Contributor

First and foremost, excellent debugging. QQ: After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

I understand the second part, very well. You removed '_assignableDisabledInstanceSet' - how does it relate to the analysis?
for first part, the problem was happening only when we have certain hosts crossing over the delayed window? so essentially global pipeline gets updated, but partial pipeline is not persisted, so we get results are different?

@@ -552,7 +551,6 @@ protected synchronized Set<HelixConstants.ChangeType> doRefresh(HelixDataAccesso

updateIdealRuleMap(getClusterConfig());
updateDisabledInstances(getInstanceConfigMap().values(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the cache which required update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in node swap cache changes we introduced _assignableDisabledInstanceSet. We were not clearing it on cache refresh. This would lead WagedCapacityProvider to not charge pending messages against the instances capacity for any instance that was offline at some point.

After revising, this cache is not necessary as we can use _disabledInstanceSet to determine enabled instances and enabled assignable instances since it will be a superset of _assignableDisabledInstanceSet.

@@ -76,6 +77,10 @@ public Map<String, ResourceAssignment> getBaseline() {
return _globalBaseline;
}

public boolean hasPersistedLatestBestPossibleAssignment() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add detail as to why we need this variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a docstring and more comments in code

Copy link
Contributor

@himanshukandwal himanshukandwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Zach for the amazing investigation and finding the complex bugs.

Comment on lines 54 to +55
protected volatile int _bestPossibleVersion = 0;
protected volatile int _lastPersistedBestPossibleVersion = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we initialize these two variables with the ZK state i.e. path:
/ASSIGNMENT_METADATA/BEST_POSSIBLE/LAST_SUCCESSFUL_WRITE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to do this, we would have to add a public method to BucketDataAccessor interface. This will cause a backwards incompatible change.

Let's hold off on this for now, as the actual version does not matter in the context of the AssignmentMetadataStore object. The use of version here is local to the controller thread processing the pipeline for the cluster. We only need to maintain version here to make sure we flush the cache to ZK eventually.

@zpinto
Copy link
Contributor Author

zpinto commented Jan 23, 2024

This PR is ready to be merged.

Commit Message:
offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.

@xyuanlu xyuanlu merged commit 3cd22a3 into apache:master Jan 23, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants