Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3465. OM Failover retry happens too quickly when new leader suggested and retrying on same OM. #859

Merged
merged 8 commits into from Apr 25, 2020

Conversation

umamaheswararao
Copy link
Contributor

What changes were proposed in this pull request?

When leader was suggested, we are updating lastAttemptedOMid correctly, so that waitTime can be calculated by incrementing if failover to same OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3465

How was this patch tested?

Added 2 tests which reproduce the case and after fix, they both passed.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @umamaheswararao for working on this. I would like to suggest minor improvement in the unit test.

…ested and retrying on same OM.: Added few more tests and fixed comments.
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @umamaheswararao for improving the unit test.

Comment on lines 88 to 93
public void testWaitTimeWithSameNodeFailover() {
// Failover attempt 1 to same OM, waitTime should increase.
failoverToSameNode(4);
// 2 same node failovers, waitTime should be 0.
failoverToNextNode(2, 0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is basically the same test case as testWaitTimeResetWhenNextNodeFailoverAfterSameNode. If you agree, I think we can drop this one.

@Before
public void init() throws Exception {
OzoneConfiguration config = new OzoneConfiguration();
long numNodes = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of test cases depend on this value by attempting 2 failovers to next node. I think it should be a member variable and tests should use failoverToNextNode(numNodes - 1, 0) instead of failoverToNextNode(2, 0).

Note that failoverToSameNode(2) should be kept as is, since in that case the failover number is arbitrary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hay Attila, Thanks for reviews. It make sense to make it member variable. I remember I extracted it for the purpose and lost that. BTW, that variable should be int, not long. Just updated that too.
Take a look.
And the pr test failures seems unrelated, I had a quick look.

@hanishakoneru
Copy link
Contributor

Thanks Uma for working on this.
LGTM. +1.

@umamaheswararao umamaheswararao merged commit 4b1fa10 into apache:master Apr 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants