HDDS-6743. Specify leader node for OM failover #3409
|
@adoroszlai @ChenSammi Could you help to review this PR? |
|
Thanks @symious for the work! @hanishakoneru left a comment explaining why this should not be done: #2765 (comment). I suggest we reach agreement on that issue first and then go ahead. |
|
@JacksonYao287 Sure, thanks for the review. In #2765 (comment), I think the concern is that a client-side misconfiguration might trigger dead loops, so adding an address was preferred over adding only the OMNodeId. In the latest commit of this PR, the
An example of this exception message would be |
|
@symious is this PR still active? If not we can close it. |
|
Just saw this PR. I've also recently been looking into some issues related to the out-of-sync mapping between client and server. Just marking myself here to follow the latest changes to this PR. Thanks all! |
|
@kerneltime Still active I think, could you help to review the PR? |
|
thanks @symious will get this reviewed |
...p-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java
...ntegration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithFailover.java
|
Changes look good to me! |
    this.leaderAddress = suggestedLeaderAddress;
  }

  public OMNotLeaderException(String message) {
We should not support this. The caller has to specify either the peer ID or the leader ID.
Updated.
|
The overall change looks good and would help with debugging as well. Some minor nits need addressing. |
|
@kerneltime The review comments have been addressed. I guess this can be merged. |
|
@swamirishi thanks for the update. I will take a look. |
...op-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java
...op-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/GrpcOMFailoverProxyProvider.java
|
@neils-dev Updated the PR, please have a look. |
...ne/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java
|
Thanks @symious. I tested failover on the docker HA dev cluster and noticed that when the leader node goes down, the followers respond in one of two ways: either i.) initially suggesting a stale OM leader, or ii.) suggesting a null address. i.) (stale leader suggestion) ii.) The null address in the suggestion seems to be effective: it causes the failover provider to choose the next OM to try from its OM node map, which resolves the failover. Q. If the node gives a stale OM leader ID, is it possible to hit a loop condition where we keep trying the suggestion, fail, and then ask the same OM for the suggestion that we already tried? |
|
@neils-dev Thank you for the review. I updated the patch, could you take a look? I think the problem you mentioned happens while a leader has not yet been elected; once the leader is confirmed, the client should be forwarded to the correct OM. |
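The loop concern in this exchange can be illustrated with a small sketch. It is not the code from this PR: `FailoverLoopGuardSketch`, `markAttempted`, and `next` are hypothetical names, and the sketch assumes the client remembers which OMs it has already tried during the current failover round. A suggestion is followed only if it names a configured node that has not been tried yet; a null, unknown, or stale suggestion falls back to walking the configured node list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: pick the next OM to try without looping on a stale hint. */
final class FailoverLoopGuardSketch {
  private final List<String> omNodeIds;
  private final Set<String> attempted = new LinkedHashSet<>();
  private int cursor = 0;

  FailoverLoopGuardSketch(List<String> omNodeIds) {
    if (omNodeIds.isEmpty()) {
      throw new IllegalArgumentException("at least one OM node is required");
    }
    this.omNodeIds = new ArrayList<>(omNodeIds);
  }

  /** Record that a request to this node failed in the current round. */
  void markAttempted(String nodeId) {
    attempted.add(nodeId);
  }

  /** Choose the next node, given the (possibly null or stale) leader suggestion. */
  String next(String suggestedLeaderId) {
    // Follow the hint only if it is a known node we have not already tried.
    if (suggestedLeaderId != null
        && omNodeIds.contains(suggestedLeaderId)
        && !attempted.contains(suggestedLeaderId)) {
      return suggestedLeaderId;
    }
    // Null, unknown, or stale hint: walk the configured list, skipping tried nodes.
    for (int i = 0; i < omNodeIds.size(); i++) {
      String candidate = advance();
      if (!attempted.contains(candidate)) {
        return candidate;
      }
    }
    // Every node has been tried: start a new round.
    attempted.clear();
    return advance();
  }

  private String advance() {
    String candidate = omNodeIds.get(cursor);
    cursor = (cursor + 1) % omNodeIds.size();
    return candidate;
  }

  public static void main(String[] args) {
    FailoverLoopGuardSketch guard =
        new FailoverLoopGuardSketch(List.of("om1", "om2", "om3"));
    guard.markAttempted("om1"); // old leader, now down
    guard.markAttempted("om2"); // follower that still suggests om1
    System.out.println(guard.next("om1"));
  }
}
```

With the made-up node IDs in `main`, the stale suggestion `om1` is skipped because it was already tried, so the client moves on to a node it has not contacted yet instead of bouncing between `om1` and `om2`.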
neils-dev left a comment:
Thanks @symious for the updates. LGTM.
@kerneltime do you mind taking a look? If no further comments we should look to merge.
|
Thanks @JacksonYao287, @kerneltime, @DaveTeng0, @swamirishi for your review and comments. Thanks @symious for this. Merged to master. |
What changes were proposed in this pull request?
Currently, if a client first connects to a follower OM, the response shows that the OM is not the leader but does not specify the real leader node.
This ticket makes the reply contain the leader OM so that clients can connect to the leader node more conveniently.
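The idea can be sketched in a few lines of Java. This is an illustration only, not the code from the patch: `NotLeaderSketch`, its nested `NotLeaderException`, and `nextTarget` are hypothetical names, the message format is made up, and the node IDs and address are placeholders. The sketch shows a "not leader" reply that optionally carries the leader's node ID and address, and a client that prefers that hint over its own fallback choice.

```java
import java.util.Optional;

final class NotLeaderSketch {

  /** Hypothetical "not leader" reply that may carry a hint about the current leader. */
  static final class NotLeaderException extends Exception {
    private final String suggestedLeaderId;      // null when no leader is known
    private final String suggestedLeaderAddress; // null when no leader is known

    NotLeaderException(String currentPeerId, String suggestedLeaderId,
        String suggestedLeaderAddress) {
      super(buildMessage(currentPeerId, suggestedLeaderId, suggestedLeaderAddress));
      this.suggestedLeaderId = suggestedLeaderId;
      this.suggestedLeaderAddress = suggestedLeaderAddress;
    }

    // Illustrative message format; the real exception text may differ.
    private static String buildMessage(String peerId, String leaderId, String address) {
      String base = "OM:" + peerId + " is not the leader.";
      if (leaderId == null) {
        return base + " Could not determine the leader node.";
      }
      return base + " Suggested leader is OM:" + leaderId + "[" + address + "].";
    }

    Optional<String> getSuggestedLeaderId() {
      return Optional.ofNullable(suggestedLeaderId);
    }

    Optional<String> getSuggestedLeaderAddress() {
      return Optional.ofNullable(suggestedLeaderAddress);
    }
  }

  /** Client side: use the hinted leader if present, otherwise the caller's fallback. */
  static String nextTarget(NotLeaderException e, String fallbackNodeId) {
    return e.getSuggestedLeaderId().orElse(fallbackNodeId);
  }

  public static void main(String[] args) {
    NotLeaderException withHint =
        new NotLeaderException("om1", "om2", "om2.example.com:9862");
    NotLeaderException withoutHint = new NotLeaderException("om1", null, null);
    System.out.println(withHint.getMessage());
    System.out.println(nextTarget(withHint, "om3"));
    System.out.println(nextTarget(withoutHint, "om3"));
  }
}
```

When the follower knows the leader, the client can go straight to it; when it does not, the client behaves as before and picks its next candidate itself.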
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6743
How was this patch tested?
Unit test.