
Conversation

@captainzmc
Member

@captainzmc captainzmc commented Nov 4, 2022

What changes were proposed in this pull request?

Sometimes TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing times out, causing CI failures. We need to increase the wait time to avoid this problem.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7326

How was this patch tested?

Fixed the unit test.

@captainzmc captainzmc requested a review from kaijchen November 4, 2022 08:47
@adoroszlai
Contributor

Thank you @captainzmc for working on this.

How was this patch tested?

fix ut.

Sorry, but for intermittent tests a single run is not enough to ensure the intermittency is gone. Can you please trigger repeated runs? @kaijchen can provide an example of such a run if you need help.

(Also, please search Jira for an existing issue before filing a new one.)

@adoroszlai adoroszlai changed the title HDDS-7459. Fix unstable TestECContainerRecovery HDDS-7326. Intermittent timeout in TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing Nov 4, 2022
@adoroszlai
Contributor

CC @swamirishi for review

@kaijchen
Member

kaijchen commented Nov 4, 2022

@captainzmc
Member Author

captainzmc commented Nov 4, 2022

Sure, let me trigger 10x10 iterations to test this.

@swamirishi
Contributor

swamirishi commented Nov 5, 2022

This might not resolve the intermittency completely. As far as I can tell, a lot of threads are in a waiting state, and this particular test times out after 100000 millis. This change would reduce the frequency of the test thread's checks, giving the other threads more time to perform their operations. Instead, to make the test more robust, we should probably look into increasing the priority of the background threads we expect to run, which in this case would be the overReplicationHandler threads.
@adoroszlai @kaijchen What do you think?
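For context, the timeout-vs-check-interval interplay being discussed can be sketched as a generic poll-with-timeout loop. This is not the actual test code, just a minimal self-contained illustration (in the spirit of Hadoop's `GenericTestUtils.waitFor`) of why raising the check interval only reduces contention between the test thread and background threads; if the condition never becomes true, the test still times out:

```java
import java.util.function.BooleanSupplier;

public class WaitFor {

    // Polls `check` every checkIntervalMs until it returns true,
    // or throws once timeoutMs has elapsed.
    static void waitFor(BooleanSupplier check, long checkIntervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("Timed out waiting for condition");
            }
            // A larger interval means fewer wakeups of the test thread,
            // leaving more CPU time for background threads.
            Thread.sleep(checkIntervalMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after roughly 200 ms.
        waitFor(() -> System.currentTimeMillis() - start > 200, 50, 5000);
        System.out.println("condition met");
    }
}
```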

@swamirishi
Contributor

swamirishi commented Nov 7, 2022

BTW, after looking at the test case, I found that the issue is not the timeout but the replication manager interval, which is driven by the config hdds.scm.replication.thread.interval and defaults to 300s. I have fixed this issue as part of this PR. @captainzmc @adoroszlai @kaijchen Can you confirm whether this is the root cause of the failure?
#3939
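For reference, a setting like this can be overridden per test via the cluster configuration. A minimal ozone-site.xml-style sketch, assuming the 300s default is the culprit (the 5s value here is illustrative only and is not taken from #3939):

```xml
<property>
  <name>hdds.scm.replication.thread.interval</name>
  <!-- Default is 300s; a short interval lets the replication
       manager run often enough for the test to observe it
       within its timeout. Illustrative value, not from #3939. -->
  <value>5s</value>
</property>
```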

@adoroszlai
Contributor

@swamirishi hdds.scm.replication.thread.interval sounds plausible as the root cause. I suggest extracting that change into a separate commit and running that single test case or class repeatedly in your fork (see example). If it fixes the issue, then please create a separate PR with only the fix.

@captainzmc
Member Author

Thanks to @swamirishi @adoroszlai for the reply. Yes, I ran 10x10 iterations and found that increasing the time only reduces the probability of failure; it cannot eliminate it. If #3939 solves this problem, I'll close this PR.

@kaijchen
Member

kaijchen commented Nov 8, 2022

Hi @captainzmc, I have just merged #3941.

@captainzmc
Member Author

Thanks @swamirishi @kaijchen @adoroszlai, let's close this one.

@captainzmc captainzmc closed this Nov 9, 2022