[Merge/Reviewed] Merge Abnormal resolver changes#1094

Merged
jiajunwang merged 3 commits into master from abnormalResolver
Jun 15, 2020

Conversation

@jiajunwang
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#1027

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR is for merging only. All the comments have been reviewed and cleanly rebased to the master branch.
Review on this PR is optional.

Tests

  • The following tests are written for this issue:

NA

  • The following is the result of the "mvn test" command on the appropriate module:

[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestWorkflowTimeout.testWorkflowTimeoutWhenWorkflowCompleted:116 expected: but was:
[INFO]
[ERROR] Tests run: 1144, Failures: 1, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:18 h
[INFO] Finished at: 2020-06-15T13:50:15-07:00
[INFO] ------------------------------------------------------------------------

Re-run test

[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 18.361 s - in org.apache.helix.integration.task.TestWorkflowTimeout
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24.719 s
[INFO] Finished at: 2020-06-15T14:48:34-07:00
[INFO] ------------------------------------------------------------------------

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

The Abnormal States Resolver defines a generic interface for detecting and recovering from abnormal current states of a partition. For example,
- double masters
- application data out of sync
The interface shall be implemented according to the application's requirements.

The resolver is applied during the rebalance process according to the corresponding cluster config item. For example,
"ABNORMAL_STATES_RESOLVER_MAP" : {
 "MASTERSLAVE" : "org.apache.helix.api.rebalancer.constraint.MasterSlaveAbnormalStateReslovler"
}
Without any configuration, the default behavior is to do no recovery work.
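
A minimal sketch of how such a per-state-model resolver lookup could behave, including the no-op default. The `Resolver` interface, its two method names, and `ResolverRegistrySketch` are hypothetical simplifications for illustration; they are not the actual Helix interface or controller code.

```java
import java.util.HashMap;
import java.util.Map;

final class ResolverRegistrySketch {
  /** Hypothetical, simplified resolver contract; NOT the actual Helix interface. */
  interface Resolver {
    /** True if the replica states (instance -> state) of one partition are normal. */
    boolean isCurrentStateValid(Map<String, String> instanceToState);

    /** Proposed states for recovery; only consulted when the check above fails. */
    Map<String, String> computeRecoveryAssignment(Map<String, String> instanceToState);
  }

  /** Mirrors the documented default: no recovery work without configuration. */
  static final Resolver NO_OP = new Resolver() {
    @Override
    public boolean isCurrentStateValid(Map<String, String> instanceToState) {
      return true; // never reports an abnormal state
    }

    @Override
    public Map<String, String> computeRecoveryAssignment(Map<String, String> instanceToState) {
      throw new UnsupportedOperationException("The no-op resolver never recovers");
    }
  };

  /** State model definition name -> resolver, as a config map like the one above would imply. */
  private final Map<String, Resolver> resolvers = new HashMap<>();

  void register(String stateModelDefName, Resolver resolver) {
    resolvers.put(stateModelDefName, resolver);
  }

  /** Falls back to the no-op resolver when nothing is configured for the state model. */
  Resolver resolverFor(String stateModelDefName) {
    return resolvers.getOrDefault(stateModelDefName, NO_OP);
  }
}
```

With nothing registered, `resolverFor("MASTERSLAVE")` returns `NO_OP`, so every partition is treated as valid and no recovery is computed.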
…tuation. (#1037)

Although the rebalancer eventually removes the additional master, its default operations are arbitrary, so an older master may survive. This can cause serious application logic issues, since many applications require the master to have the latest data.
With this state resolver, the rebalancer instead resets all the master replicas to ensure that the remaining one is the youngest one. The double-masters situation is then resolved gracefully.
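
The reset described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes that "reset" means demoting every MASTER replica to SLAVE so that a later rebalance promotes a single fresh master. The actual resolver may choose the target state differently.

```java
import java.util.HashMap;
import java.util.Map;

final class DoubleMasterSketch {
  static final String MASTER = "MASTER";
  static final String SLAVE = "SLAVE";

  /** Detects the abnormal case: more than one replica of the partition is MASTER. */
  static boolean hasDoubleMasters(Map<String, String> instanceToState) {
    long masters = instanceToState.values().stream().filter(MASTER::equals).count();
    return masters > 1;
  }

  /**
   * Returns a recovery assignment in which every MASTER is demoted to SLAVE
   * (an assumption about what "reset" means). The input map is not modified.
   */
  static Map<String, String> resetMasters(Map<String, String> instanceToState) {
    Map<String, String> recovery = new HashMap<>(instanceToState);
    recovery.replaceAll((instance, state) -> MASTER.equals(state) ? SLAVE : state);
    return recovery;
  }
}
```

Because no master survives the reset, whichever replica is promoted afterwards is by construction newer than any of the previous masters.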
Example ObjectName of the new monitor MBean: Rebalancer:ClusterName=<clusterName>, EntityName=AbnormalStates.<StateModelDefName>
Attributes:
1. AbnormalStatePartitionCounter: records the total number of times a partition has been found in an abnormal state. Note that if one partition is found to be abnormal twice, it is counted twice in this counter.
2. RecoveryAttemptCounter: records the total number of successful recovery computations performed by the resolver.
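
The counting semantics of the two attributes can be sketched as below. The class and its `record...` methods are hypothetical; the JMX registration that the real monitor MBean would need is omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

final class AbnormalStatesCountersSketch {
  private final AtomicLong abnormalStatePartitionCounter = new AtomicLong();
  private final AtomicLong recoveryAttemptCounter = new AtomicLong();

  /** Called on every detection; the same partition found twice is counted twice. */
  void recordAbnormalState() {
    abnormalStatePartitionCounter.incrementAndGet();
  }

  /** Called only after a recovery assignment has been computed successfully. */
  void recordRecoveryAttempt() {
    recoveryAttemptCounter.incrementAndGet();
  }

  long getAbnormalStatePartitionCounter() {
    return abnormalStatePartitionCounter.get();
  }

  long getRecoveryAttemptCounter() {
    return recoveryAttemptCounter.get();
  }
}
```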
@jiajunwang jiajunwang changed the title [MERGE/Reviewed] Merge Abnormal resolver changes [Merge/Reviewed] Merge Abnormal resolver changes Jun 15, 2020
@jiajunwang jiajunwang merged commit f85cbd1 into master Jun 15, 2020
@jiajunwang jiajunwang deleted the abnormalResolver branch June 15, 2020 23:41