Add ExcessiveTopStateResolver to gracefully fix the double-masters situation. #1037
…tuation. Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data. With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
...c/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
  }

  @Override
  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
Would computeCorrectedAssignment be a clearer name?
Not really. This is not the corrected mapping; it is the next step toward fixing it.
    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
      return false;
I wonder if this would add too much overhead to each pipeline run.
Do you think it would be better to cache the currentState mappings and compare diffs (going from an O(n) to an O(1) check by storing results across pipelines)?
For heavy users, this O(n) computation might become a significant bottleneck if it runs in every pipeline.
Could you please clarify why comparing diff will bring the complexity from O(n) to O(1)?
@jiajunwang In CurrentStateOutput, could we add a top state counter map that caches the top state count, like below? Then we could avoid the stream filter computation. The tradeoff is a bit more memory for the cache, but most of the entries are just references.
public void setCurrentState(String resourceName, Partition partition, String instanceName,
    String state) {
  (...... current code ......)
  // Count the number of top state replicas for a single top state model.
  if (state.equals(stateModelDef.getTopState())) {
    Map<String, Integer> counterMap =
        _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
            .computeIfAbsent(partition, k -> new HashMap<>());
    counterMap.put(state, counterMap.getOrDefault(state, 0) + 1);
  }
}
I'm not sure we need to optimize this; maybe you could test it. For this part, the time complexity does drop from O(n) to O(1), but I am not sure what the actual time saving is across the whole pipeline. If the whole pipeline is O(N^2) and this optimization brings it down to O(N), that may help. If the whole pipeline is O(2 * N), it is still O(N) with this optimization.
I see. In that case, we should add this to the cache instead of CurrentStateOutput. The cache is "protected" by the selective update, so it would help avoid some recalculation.
That is a valid idea, but it requires more changes. For this specific usage, changing the fundamental cache class does not seem worthwhile.
Moreover, if the resolver is not enabled, we don't do the calculation at all.
Let me add a TODO there. If we find more uses for this count, we can do it then.
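To make the tradeoff discussed in this thread concrete, here is a minimal, self-contained sketch that contrasts the stream-based scan with an incrementally maintained counter. It is not Helix code: the class `TopStateCountSketch`, its fields, and its method names are hypothetical, and it uses plain strings instead of `Partition` and `StateModelDefinition`. Unlike the snippet in the comment above, it also decrements the counter when a replica leaves the top state, which a real cache would likely need.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for per-partition state bookkeeping; not Helix's CurrentStateOutput. */
final class TopStateCountSketch {
  private final String topState;
  // partition -> (instance -> state)
  private final Map<String, Map<String, String>> currentStates = new HashMap<>();
  // partition -> number of replicas currently in the top state
  private final Map<String, Integer> topStateCounter = new HashMap<>();

  TopStateCountSketch(String topState) {
    this.topState = topState;
  }

  void setCurrentState(String partition, String instance, String state) {
    String previous =
        currentStates.computeIfAbsent(partition, k -> new HashMap<>()).put(instance, state);
    // Keep the counter consistent when a replica leaves or enters the top state.
    if (topState.equals(previous)) {
      topStateCounter.merge(partition, -1, Integer::sum);
    }
    if (topState.equals(state)) {
      topStateCounter.merge(partition, 1, Integer::sum);
    }
  }

  /** O(n) in the number of replicas: the same kind of scan as the check in the diff. */
  boolean hasExcessiveTopStateByScan(String partition) {
    Map<String, String> states =
        currentStates.getOrDefault(partition, Collections.<String, String>emptyMap());
    return states.values().stream().filter(topState::equals).count() > 1;
  }

  /** O(1): reads the cached counter instead of scanning the replicas. */
  boolean hasExcessiveTopStateByCounter(String partition) {
    return topStateCounter.getOrDefault(partition, 0) > 1;
  }

  public static void main(String[] args) {
    TopStateCountSketch sketch = new TopStateCountSketch("MASTER");
    sketch.setCurrentState("p0", "host1", "MASTER");
    sketch.setCurrentState("p0", "host2", "MASTER");
    System.out.println(sketch.hasExcessiveTopStateByScan("p0"));
    System.out.println(sketch.hasExcessiveTopStateByCounter("p0"));
  }
}
```

Both checks return the same answer; the counter only moves the cost from read time to write time, so whether it pays off depends on how often states are written versus checked, which is the open question in the thread.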
This PR is ready to be merged, approved by @dasahcc
Issues
#1028
Description
Although the rebalancer eventually fixes the additional master, its default operations are arbitrary and may let an older master survive. This can cause serious application logic issues, since many applications require the master to have the latest data.
With this state resolver, the rebalancer changes its default behavior and resets all the master replicas, ensuring that the remaining one is the youngest. The double-masters situation is then resolved gracefully.
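The reset step described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: the class name `ExcessiveTopStateSketch`, the plain `Map<String, String>` input, and the `recoveryState` parameter (for example SLAVE) are all hypothetical, and the description does not spell out which state the masters are reset to. The result is the next step of the fix rather than the final mapping; the regular rebalancing would then be expected to bring up a single fresh master.

```java
import java.util.HashMap;
import java.util.Map;

final class ExcessiveTopStateSketch {
  /**
   * Returns the next-step assignment for one partition (instance -> state).
   * If more than one replica holds the top state, every top-state replica is
   * moved to recoveryState (an assumed target state); otherwise the current
   * states are returned unchanged.
   */
  static Map<String, String> computeRecoveryAssignment(
      Map<String, String> currentStateMap, String topState, String recoveryState) {
    Map<String, String> result = new HashMap<>(currentStateMap);
    long topCount = currentStateMap.values().stream().filter(topState::equals).count();
    if (topCount > 1) {
      // Reset all top-state replicas, not just the extra ones, so no stale master survives.
      result.replaceAll((instance, state) -> topState.equals(state) ? recoveryState : state);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> current = new HashMap<>();
    current.put("host1", "MASTER");
    current.put("host2", "MASTER");
    current.put("host3", "SLAVE");
    System.out.println(computeRecoveryAssignment(current, "MASTER", "SLAVE"));
  }
}
```

Resetting every master instead of picking one to keep avoids having to decide which replica is newer, at the cost of a short window with no master for the partition.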
Tests
TestAbnormalStatesResolver.testExcessiveTopStateResolver()
N/A, the newly added logic is only used in the new test case. The other tests won't touch the new logic.
Will run the whole test suite before merging the branch to master.