Skip to content

Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.#1037

Merged
jiajunwang merged 4 commits intoapache:abnormalResolverfrom
jiajunwang:resolver
Jun 4, 2020
Merged

Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.#1037
jiajunwang merged 4 commits intoapache:abnormalResolverfrom
jiajunwang:resolver

Conversation

@jiajunwang
Copy link
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#1028

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.

Tests

  • The following tests are written for this issue:

TestAbnormalStatesResolver.testExcessiveTopStateResolver()

  • The following is the result of the "mvn test" command on the appropriate module:

N/A, the newly added logic is only used in the new test case. The other tests won't touch the new logic.
Will run the whole test before merging the branch to master.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

…tuation.

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
}

@Override
public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeCorrectedAssignment is clearer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, this is not the corrected mapping, this is the next step of fixing.

Comment on lines +56 to +58
if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
.filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
return false;
Copy link
Contributor

@narendly narendly May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would be too much of an overhead for each pipeline run?

Do you think it would be better to try to come up with a way to cache currentState mappings and compare diffs (going from O(n) -> O(1) check by storing results across pipelines).

For heavy users, this O(n) computation might become a significant bottleneck if done every pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify why comparing diff will bring the complexity from O(n) to O(1)?

Copy link
Contributor

@huizhilu huizhilu Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references.

  public void setCurrentState(String resourceName, Partition partition, String instanceName,
      String state) {
    (...... current code ......)
    // Counter number of top state replicas for a single top state model. 
    if (state.equals(stateModelDef.getTopState())) {
      Map<String, Integer> counterMap =
          _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
              .computeIfAbsent(partition, k -> new HashMap<>());
      counterMap.put(state, counterMap.getOrDefault(state, 0) + 1);
    }
  }

Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case, we should add this to the cache instead of CurrentStateOutput. The cache is "protected" by the selective update, so it will help to reduce some calculations.

That is a valid idea. But that requires more changes. For this specific usage, changing the fundamental cache class seems to be not worthy.
Moreover, if the resolver is not enabled, then we don't do the calculation at all.

Let me add a TODO there, if we have more usage of this count, then we shall do it.

Copy link
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines +56 to +58
if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
.filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
return false;
Copy link
Contributor

@huizhilu huizhilu Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references.

  public void setCurrentState(String resourceName, Partition partition, String instanceName,
      String state) {
    (...... current code ......)
    // Counter number of top state replicas for a single top state model. 
    if (state.equals(stateModelDef.getTopState())) {
      Map<String, Integer> counterMap =
          _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
              .computeIfAbsent(partition, k -> new HashMap<>());
      counterMap.put(state, counterMap.getOrDefault(state, 0) + 1);
    }
  }

Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).

@jiajunwang
Copy link
Contributor Author

This PR is ready to be merged, approved by @dasahcc

@jiajunwang jiajunwang merged commit f8bc5fd into apache:abnormalResolver Jun 4, 2020
@jiajunwang jiajunwang deleted the resolver branch June 4, 2020 18:52
jiajunwang added a commit that referenced this pull request Jun 9, 2020
…tuation. (#1037)

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
jiajunwang added a commit that referenced this pull request Jun 15, 2020
…tuation. (#1037)

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
jiajunwang added a commit that referenced this pull request Jun 15, 2020
…tuation. (#1037)

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
…tuation. (apache#1037)

Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants