HDDS-7326. Intermittent timeout in TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing #3932
Conversation
Thank you @captainzmc for working on this.
Sorry, but for intermittent tests a single run is not enough to ensure the intermittency is gone. Can you please trigger repeated runs? @kaijchen can provide an example of such a run if you need help. (Also, please search Jira for an existing issue before filing a new one.)
CC @swamirishi for review
Example of repeating a test in CI: https://github.com/kaijchen/ozone/blob/repeat/.github/workflows/post-commit.yml#L333
Sure, let me trigger 10x10 iterations to test this.
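A repeated run like the 10x10 iterations mentioned above can also be scripted locally. A minimal sketch; the `repeat_test` helper, the Maven module path, and the test selector in the commented example are assumptions, not the project's actual tooling:

```shell
# Hypothetical helper: run a command N times and report how often it fails.
# Useful for estimating how flaky an intermittent test really is.
repeat_test() {
  local runs="$1"; shift
  local failures=0
  for _ in $(seq 1 "$runs"); do
    # Discard output; we only care about the exit status of each run.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
  done
  echo "failures=$failures/$runs"
  # Succeed only if every iteration passed.
  [ "$failures" -eq 0 ]
}

# Example invocation (module path and selector are assumptions, adjust
# to your checkout):
# repeat_test 10 mvn -pl hadoop-ozone/integration-test test \
#   -Dtest='TestECContainerRecovery#testContainerRecoveryOverReplicationProcessing'
```

A test that passes once but fails under `repeat_test 10 ...` is still flaky, which is why a single green CI run is not sufficient evidence here.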
This might not resolve the intermittency completely. As far as I understand, a lot of threads are in the waiting state. The particular test times out after 100000 millis. This change would reduce the frequency of the test thread, which would give the other threads more time to perform their operations. We should probably look into increasing the priority of the background threads we expect to run instead, to make the test more robust; in this case, that would be the overReplicationHandler threads.
BTW, after looking at the test case I found that the issue is not the timeout but the Replication Manager interval, which is driven by the config hdds.scm.replication.thread.interval and defaults to 300s. I have fixed this issue as part of this PR. @captainzmc @adoroszlai @kaijchen, can you confirm whether this is the root cause of the failure?
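For local experimentation, that interval can be shrunk by overriding the property in an ozone-site.xml picked up by the test cluster. A minimal sketch; only the property name and its 300s default come from the comment above, while the file location and the 5s value are assumptions:

```shell
# Sketch: write an ozone-site.xml overriding the replication manager
# interval (default 300s) so background replication runs frequently
# enough for a test to observe it before timing out.
cat > ozone-site.xml <<'EOF'
<configuration>
  <property>
    <name>hdds.scm.replication.thread.interval</name>
    <value>5s</value>
  </property>
</configuration>
EOF
```

In the actual test code the same effect would come from setting this key on the cluster's configuration object before startup; the XML form is just the most portable way to show the override.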
@swamirishi |
Thanks to @swamirishi @adoroszlai for the reply. Yes, I ran 10x10 iterations and found that increasing the time only reduces the probability of the problem; it cannot avoid the failure entirely. If #3939 solves this problem, I'll close this PR.
Hi @captainzmc, I have just merged #3941. |
Thanks @swamirishi @kaijchen @adoroszlai, let's close this one.
What changes were proposed in this pull request?
Sometimes TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing times out, resulting in CI failure. We need to increase the inspection time to avoid this problem.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7326
How was this patch tested?
Fixed the unit test; verified with repeated (10x10) CI runs as discussed above.