-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix /partitionAssignmentAPI and WAGED rebalancer finalMapping results matching #2739
Conversation
…rsion is generated by a pipeline run. Previously, it would only persist when there was an instance would become inactive and cause an emergency rebalance. The most recently generated best possible state is necessary for the partitionAssignmentAPI to generate the correct finalMapping. Without this it will generate the finalMapping based off of a stale BEST_POSSIBLE_STATE. Also, fix offlineTimeMap and WagedCapacityProvider by modifying the cache logic for disabled instances.
8abc054
to
c725ee1
Compare
First and foremost, excellent debugging. QQ: After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping. I understand the second part, very well. You removed '_assignableDisabledInstanceSet' - how does it relate to the analysis? |
@@ -552,7 +551,6 @@ protected synchronized Set<HelixConstants.ChangeType> doRefresh(HelixDataAccesso | |||
|
|||
updateIdealRuleMap(getClusterConfig()); | |||
updateDisabledInstances(getInstanceConfigMap().values(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the cache which required update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in node swap cache changes we introduced _assignableDisabledInstanceSet
. We were not clearing it on cache refresh. This would lead WagedCapacityProvider to not charge pending messages against the instances capacity for any instance that was offline at some point.
After revising, this cache is not necessary as we can use _disabledInstanceSet
to determine enabled instances and enabled assignable instances since it will be a superset of _assignableDisabledInstanceSet
.
@@ -76,6 +77,10 @@ public Map<String, ResourceAssignment> getBaseline() { | |||
return _globalBaseline; | |||
} | |||
|
|||
public boolean hasPersistedLatestBestPossibleAssignment() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add detail as to why we need this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a docstring and more comments in code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Zach for the amazing investigation and finding the complex bugs.
protected volatile int _bestPossibleVersion = 0; | ||
protected volatile int _lastPersistedBestPossibleVersion = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we initialize these two variables with the ZK state i.e. path:
/ASSIGNMENT_METADATA/BEST_POSSIBLE/LAST_SUCCESSFUL_WRITE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to do this, we would have to add a public method to BucketDataAccessor interface. This will cause a backwards incompatible change.
Let's hold off on this for now, as the actual version does not matter in the context of the AssignmentMetadataStore object. The use of version here is local to the controller thread processing the pipeline for the cluster. We only need to maintain version here to make sure we flush the cache to ZK eventually.
This PR is ready to be merged. Commit Message: |
Issues
Description
It was reported that the helix-rest partitionAssignmentAPI was returning an incorrect finalMapping. When the changes simulated were applied, the finalMapping generated by the controller was different. In theory, the finalMapping can be simulated correctly if all the inputs to the algorithm exactly match those passed in the controller.
After investigation, it was found that the offlineTimeMap was not being cleared by the cache and the BEST_POSSIBLE_STATE was not being persisted to ZK each time a new version was being computed by partial rebalance. Fixing these two issues ensures that the pipeline produces the correct finalMapping and the partitionAssignmentAPI is not using a stale BEST_POSSIBLE_STATE to compute the simulated finalMapping.
I have added integration tests to test that the pipeline produces the same finalMapping as the result generated by the partitionAssignmentAPI.
Tests
Changes that Break Backward Compatibility (Optional)
NA
Documentation (Optional)
NA
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)