HDDS-3465. OM Failover retry happens too quickly when new leader suggested and retrying on same OM. #859

umamaheswararao · 2020-04-23T01:42:01Z

What changes were proposed in this pull request?

When a new leader is suggested, we now update lastAttemptedOMid correctly, so that waitTime is incremented when the failover goes to the same OM.
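For illustration, the intended behavior can be sketched roughly as below. This is a simplified, hypothetical model and not the actual OMFailoverProxyProvider code: the class name `FailoverWaitSketch`, the method `failoverTo`, the field names, and the 2000 ms base delay are all assumptions made for the example. It only shows the idea that repeated failovers to the same OM should back off progressively, while a failover to a different OM resets the wait.

```java
// Simplified, hypothetical sketch; not the actual OMFailoverProxyProvider code.
class FailoverWaitSketch {
  private final long waitBetweenRetriesMs; // assumed base delay between retries
  private String lastAttemptedOmId;        // OM targeted by the previous failover
  private int sameNodeAttempts;            // consecutive failovers to that same OM

  FailoverWaitSketch(long waitBetweenRetriesMs) {
    this.waitBetweenRetriesMs = waitBetweenRetriesMs;
  }

  /** Records a failover to targetOmId and returns the wait before retrying. */
  long failoverTo(String targetOmId) {
    if (targetOmId.equals(lastAttemptedOmId)) {
      // Same OM again: back off a little longer on each attempt.
      sameNodeAttempts++;
      return sameNodeAttempts * waitBetweenRetriesMs;
    }
    // Different OM: remember it and retry without waiting.
    lastAttemptedOmId = targetOmId;
    sameNodeAttempts = 0;
    return 0L;
  }

  public static void main(String[] args) {
    FailoverWaitSketch sketch = new FailoverWaitSketch(2000L);
    System.out.println(sketch.failoverTo("om1")); // 0
    System.out.println(sketch.failoverTo("om1")); // 2000
    System.out.println(sketch.failoverTo("om1")); // 4000
    System.out.println(sketch.failoverTo("om2")); // 0
  }
}
```

If the last attempted OM were not recorded when a leader is suggested, the first branch would never be taken and every retry would go out with a wait of 0, which matches the symptom in the title.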

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3465

How was this patch tested?

Added 2 tests that reproduce the issue; both pass after the fix.

adoroszlai

Thanks @umamaheswararao for working on this. I would like to suggest a minor improvement in the unit test.

...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java

adoroszlai

Thanks @umamaheswararao for improving the unit test.

adoroszlai · 2020-04-24T06:28:50Z

...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java

+  public void testWaitTimeWithSameNodeFailover() {
+    // Failover attempt 1 to same OM, waitTime should increase.
+    failoverToSameNode(4);
+    // 2 same node failovers, waitTime should be 0.
+    failoverToNextNode(2, 0);
+  }


It seems this is basically the same test case as testWaitTimeResetWhenNextNodeFailoverAfterSameNode. If you agree, I think we can drop this one.

adoroszlai · 2020-04-24T06:34:43Z

...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java

+  @Before
+  public void init() throws Exception {
+    OzoneConfiguration config = new OzoneConfiguration();
+    long numNodes = 3;


A couple of test cases depend on this value by attempting 2 failovers to next node. I think it should be a member variable and tests should use failoverToNextNode(numNodes - 1, 0) instead of failoverToNextNode(2, 0).

Note that failoverToSameNode(2) should be kept as is, since in that case the failover number is arbitrary.
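To make the suggestion concrete, here is a rough sketch of what the tests might look like with `numNodes` as a member. Only the helper names `failoverToNextNode` and `failoverToSameNode` and the value 3 come from the patch; the class name, the helper bodies, the 2000 ms base delay, and the in-class stand-in for the provider are assumptions so that the example runs on its own without JUnit or Ozone on the classpath.

```java
// Hypothetical stand-in for the test class; helper bodies are assumptions.
class NumNodesMemberSketch {
  private static final long WAIT_BETWEEN_RETRIES_MS = 2000L; // assumed base delay

  // Member variable (int, not long) shared by setup and by every test.
  private final int numNodes = 3;

  private int currentNode;       // index of the node currently in use
  private int lastAttemptedNode; // index targeted by the previous failover
  private int sameNodeAttempts;  // consecutive failovers to the same node

  // Simplified stand-in for the provider's wait-time calculation.
  private long failoverTo(int node) {
    currentNode = node;
    if (node == lastAttemptedNode) {
      sameNodeAttempts++;
      return sameNodeAttempts * WAIT_BETWEEN_RETRIES_MS;
    }
    lastAttemptedNode = node;
    sameNodeAttempts = 0;
    return 0L;
  }

  void failoverToNextNode(int numFailovers, long expectedWaitTime) {
    for (int i = 0; i < numFailovers; i++) {
      long waitTime = failoverTo((currentNode + 1) % numNodes);
      check(waitTime == expectedWaitTime, "unexpected wait time " + waitTime);
    }
  }

  void failoverToSameNode(int numFailovers) {
    for (int i = 1; i <= numFailovers; i++) {
      long waitTime = failoverTo(currentNode);
      check(waitTime == i * WAIT_BETWEEN_RETRIES_MS, "unexpected wait time " + waitTime);
    }
  }

  // The same-node count is arbitrary; the next-node count derives from numNodes.
  void waitTimeResetsAfterSameNodeFailover() {
    failoverToSameNode(2);
    failoverToNextNode(numNodes - 1, 0);
  }

  private static void check(boolean ok, String message) {
    if (!ok) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    new NumNodesMemberSketch().waitTimeResetsAfterSameNodeFailover();
    System.out.println("ok");
  }
}
```

With this shape, changing the cluster size in one place keeps the next-node tests consistent, while the hard-coded 2 in `failoverToSameNode(2)` stays as is.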

Hey Attila, thanks for the reviews. It makes sense to make it a member variable. I remember extracting it for that purpose and then losing it. BTW, that variable should be int, not long; I just updated that too.
Take a look.
The PR test failures seem unrelated; I had a quick look.

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProvider.java

hanishakoneru · 2020-04-25T00:02:58Z

Thanks Uma for working on this.
LGTM. +1.

umamaheswararao added 2 commits April 22, 2020 18:34

HDDS-3465. OM Failover retry happens too quickly when new leader sugg…

2e8aefe

…ested and retrying on same OM

HDDS-3465. OM Failover retry happens too quickly when new leader sugg…

0f37a8c

…ested and retrying on same OM

adoroszlai reviewed Apr 23, 2020

...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java

...op-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProvider.java

HDDS-3465. OM Failover retry happens too quickly when new leader sugg…

510f511

…ested and retrying on same OM.: Added few more tests and fixed comments.

adoroszlai reviewed Apr 24, 2020

umamaheswararao and others added 3 commits April 24, 2020 01:21

Few test nits

cb61f55

trigger new CI check

098094e

trigger new CI check

49c370a

hanishakoneru reviewed Apr 24, 2020

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProvider.java

umamaheswararao added 2 commits April 24, 2020 14:34

Fixed a comment and added a test case to capture the case.

e615502

Merge branch 'HDDS-3465' of https://github.com/umamaheswararao/hadoop…

3567044

…-ozone into HDDS-3465

umamaheswararao merged commit 4b1fa10 into apache:master Apr 25, 2020
