New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-3465. OM Failover retry happens too quickly when new leader suggested and retrying on same OM. #859
Conversation
…ested and retrying on same OM
…ested and retrying on same OM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @umamaheswararao for working on this. I would like to suggest minor improvement in the unit test.
...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java
Outdated
Show resolved
Hide resolved
...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java
Outdated
Show resolved
Hide resolved
…ested and retrying on same OM.: Added few more tests and fixed comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @umamaheswararao for improving the unit test.
public void testWaitTimeWithSameNodeFailover() { | ||
// Failover attempt 1 to same OM, waitTime should increase. | ||
failoverToSameNode(4); | ||
// 2 same node failovers, waitTime should be 0. | ||
failoverToNextNode(2, 0); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this is basically the same test case as testWaitTimeResetWhenNextNodeFailoverAfterSameNode
. If you agree, I think we can drop this one.
@Before | ||
public void init() throws Exception { | ||
OzoneConfiguration config = new OzoneConfiguration(); | ||
long numNodes = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of test cases depend on this value by attempting 2 failovers to next node. I think it should be a member variable and tests should use failoverToNextNode(numNodes - 1, 0)
instead of failoverToNextNode(2, 0)
.
Note that failoverToSameNode(2)
should be kept as is, since in that case the failover number is arbitrary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hay Attila, Thanks for reviews. It make sense to make it member variable. I remember I extracted it for the purpose and lost that. BTW, that variable should be int, not long. Just updated that too.
Take a look.
And the pr test failures seems unrelated, I had a quick look.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProvider.java
Show resolved
Hide resolved
Thanks Uma for working on this. |
What changes were proposed in this pull request?
When leader was suggested, we are updating lastAttemptedOMid correctly, so that waitTime can be calculated by incrementing if failover to same OM.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3465
How was this patch tested?
Added 2 tests which reproduce the case and after fix, they both passed.