HDDS-10582. Intermittent timeout during waitForReplicaCount in TestReconAndAdminContainerCLI #6585

Merged (10 commits) on Apr 25, 2024

raju-balpande
Contributor

What changes were proposed in this pull request?

Intermittent timeout during waitForReplicaCount in TestReconAndAdminContainerCLI

I observed two wait conditions with similar checks that were failing intermittently. The condition was true only for a very short span, so I increased the check frequency. The previous check interval was 1000ms; I tested 500ms and 200ms. 500ms still had a few failures, while 200ms ran with no failures.
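
To illustrate why a shorter check interval helps, here is a minimal simulation, not code from the patch. It assumes a poller that evaluates the condition at fixed multiples of the interval and ignores the time each check takes. The class name `PollingWindowSketch`, the method `pollSees`, and the 250ms window (1250ms to 1500ms) are hypothetical values chosen for illustration; the PR does not state how long the condition actually holds.

```java
import java.util.function.LongPredicate;

public class PollingWindowSketch {

  /**
   * Simulates a poller that evaluates the condition at t = 0, interval,
   * 2 * interval, ... up to and including timeoutMs. Returns true if any
   * evaluation sees the condition hold.
   */
  static boolean pollSees(LongPredicate conditionAt, long intervalMs, long timeoutMs) {
    for (long t = 0; t <= timeoutMs; t += intervalMs) {
      if (conditionAt.test(t)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical window: the condition holds only for 1250 ms <= t < 1500 ms.
    LongPredicate window = t -> t >= 1250 && t < 1500;
    for (long interval : new long[] {1000, 500, 200}) {
      System.out.println(interval + " ms -> " + pollSees(window, interval, 30000));
    }
  }
}
```

With this assumed window, the 1000ms and 500ms pollers never land inside it, while the 200ms poller does (at t = 1400). In general, a poller is only guaranteed to observe a transient condition when the check interval is no longer than the time the condition stays true.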

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10582

How was this patch tested?

Tested in CI by running the flaky-test check: https://github.com/raju-balpande/apache_ozone/actions/runs/8817120192

Contributor

@adoroszlai adoroszlai left a comment

Thanks @raju-balpande for testing this.

Comment on lines 264 to 265
GenericTestUtils.waitFor(() -> TestHelper.countReplicas(containerIdR3, cluster) == 4,
200, 30000);
Contributor

> Observed the condition was true for a very smaller span

The same problem may affect other tests (currently there is only one other usage). I think the new, reduced check interval can be set in waitForReplicaCount.
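
A rough sketch of what centralizing the interval in a helper could look like. This is not the actual `TestHelper` code: the class `ReplicaWaitSketch`, the local `waitFor`, and the `IntSupplier` parameter are stand-ins assumed here so the example runs without a cluster. Only the 200ms interval and 30000ms timeout come from the snippet under review.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

public class ReplicaWaitSketch {
  // Values quoted in the PR; defined once so every caller shares them.
  static final long CHECK_INTERVAL_MS = 200;
  static final long TIMEOUT_MS = 30_000;

  /** Hypothetical stand-in for a generic polling wait; not the real utility. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (check.getAsBoolean()) {
        return;
      }
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  /** Assumed helper shape: callers pass only what they wait for, not how often to check. */
  static void waitForReplicaCount(IntSupplier replicaCounter, int expected)
      throws TimeoutException, InterruptedException {
    waitFor(() -> replicaCounter.getAsInt() == expected, CHECK_INTERVAL_MS, TIMEOUT_MS);
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger polls = new AtomicInteger();
    // Fake counter: reports 3 replicas on the first two polls, then 4.
    IntSupplier fakeCounter = () -> polls.incrementAndGet() < 3 ? 3 : 4;
    waitForReplicaCount(fakeCounter, 4);
    System.out.println("polls: " + polls.get());
  }
}
```

Keeping the interval inside the helper means any test that waits on a replica count picks up the shorter interval without repeating the numbers at each call site.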

Contributor Author

@raju-balpande raju-balpande Apr 25, 2024

That is why I cross-checked the usages of waitForReplicaCount, which I found in TestContainerReplication, and checked that class for flakiness.

Contributor

> no change require for this class

If it's not getting worse due to the change, please make it in waitForReplicaCount.

Contributor Author

@raju-balpande raju-balpande Apr 25, 2024

I agree with your point. Even though TestContainerReplication is not failing now, the following two lines indicate that the condition is true only for a short span; if that window is missed, the test method will fail. That is why I considered extending the change to this class as well. I am therefore making the change only in TestHelper.waitForReplicaCount.
(screenshot attached in the original comment)

I will watch the results for both flaky classes in https://github.com/raju-balpande/apache_ozone/actions/runs/8831034232

Contributor

@adoroszlai adoroszlai left a comment

Thanks @raju-balpande for updating the patch, LGTM.

@adoroszlai adoroszlai merged commit dd86223 into apache:master Apr 25, 2024
28 checks passed
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024