HDDS-10582. Intermittent timeout during waitForReplicaCount in TestReconAndAdminContainerCLI #6585
Thanks @raju-balpande for testing this.
GenericTestUtils.waitFor(() -> TestHelper.countReplicas(containerIdR3, cluster) == 4,
    200, 30000);
"Observed the condition was true for a very short span"
The same problem may affect other tests (currently there is only one other usage). I think the new, reduced check interval can be set in waitForReplicaCount.
That is why I cross-checked the uses of waitForReplicaCount, which I found in TestContainerReplication, and checked it for flakiness:
- 100% success for 10x10: https://github.com/raju-balpande/apache_ozone/actions/runs/8710014704
- 99.75% success for 20x20: https://github.com/raju-balpande/apache_ozone/actions/runs/8720977974
So it seems no change is required for this class.
Please let me know if I should make this change in TestHelper.waitForReplicaCount instead.
"no change is required for this class"
If it's not getting worse due to the change, please make it in waitForReplicaCount.
I agree with your point. Even though TestContainerReplication is not failing now, the following two lines indicate that the condition holds only for a short span, and missing it would fail the test method, so I decided to extend the change to this class as well. I am therefore making the change only in TestHelper.waitForReplicaCount.
I will observe the results for both flaky classes at https://github.com/raju-balpande/apache_ozone/actions/runs/8831034232
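A minimal sketch of what moving the interval into a shared helper could look like, so every caller gets the shorter check interval. This is not the actual TestHelper code: the class name ReplicaWaitSketch, the constants, and the IntSupplier parameter are hypothetical stand-ins for the real cluster lookup, and the 200 ms and 30000 ms values are taken from the snippet quoted earlier in this thread.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/** Hypothetical sketch; not the Ozone TestHelper implementation. */
final class ReplicaWaitSketch {

  // Assumed values, copied from the waitFor call quoted in the review.
  static final long CHECK_INTERVAL_MS = 200;
  static final long TIMEOUT_MS = 30_000;

  /**
   * Polls replicaCounter until it reports the expected count.
   * The interval lives here, so callers cannot pick a slower one by accident.
   */
  static void waitForReplicaCount(IntSupplier replicaCounter, int expected)
      throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + TIMEOUT_MS;
    while (replicaCounter.getAsInt() != expected) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
            "replica count did not reach " + expected + " within " + TIMEOUT_MS + " ms");
      }
      Thread.sleep(CHECK_INTERVAL_MS);
    }
  }
}
```

A caller would pass its own counting logic, for example `ReplicaWaitSketch.waitForReplicaCount(() -> countSomehow(), 4)`, where `countSomehow` is again a placeholder.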
…n TestContainerReplication
Thanks @raju-balpande for updating the patch, LGTM.
…conAndAdminContainerCLI (apache#6585) (cherry picked from commit dd86223)
What changes were proposed in this pull request?
Intermittent timeout during waitForReplicaCount in TestReconAndAdminContainerCLI
I observed two wait conditions with similar checks that were failing intermittently. The condition was true only for a very short span, so I increased the check frequency. The earlier check interval was 1000ms; I tested 500ms and 200ms. 500ms still had a few failures, while 200ms ran with no failures.
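To illustrate why a shorter interval helps, here is a small idealized timing model, assuming checks fire at exactly t = 0, interval, 2 * interval, and so on. The class name PollWindowSketch and the 1100 ms to 1400 ms window are hypothetical, chosen only for illustration; the real window in the test is not known and varies from run to run.

```java
/** Hypothetical timing model; not part of the patch. */
final class PollWindowSketch {

  /**
   * Returns true if any check at t = 0, intervalMs, 2 * intervalMs, ...
   * (up to and including timeoutMs) lands inside [startMs, endMs).
   * intervalMs must be positive.
   */
  static boolean pollHitsWindow(long intervalMs, long startMs, long endMs, long timeoutMs) {
    for (long t = 0; t <= timeoutMs; t += intervalMs) {
      if (t >= startMs && t < endMs) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Assumed window: the condition holds only from 1100 ms to 1400 ms.
    for (long interval : new long[] {1000, 500, 200}) {
      System.out.println(interval + " ms -> " + pollHitsWindow(interval, 1100, 1400, 30000));
    }
    // Prints:
    // 1000 ms -> false
    // 500 ms -> false
    // 200 ms -> true
  }
}
```

In this model a check is guaranteed to land in the window only when the interval is no longer than the window itself, which is one plausible reading of why 500ms still failed occasionally.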
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10582
How was this patch tested?
Tested in CI by running the flaky-test check: https://github.com/raju-balpande/apache_ozone/actions/runs/8817120192