
Conversation

@captainzmc
Member

@captainzmc captainzmc commented Nov 4, 2022

What changes were proposed in this pull request?

Sometimes TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing times out, causing CI failures. We need to increase the wait time to avoid this problem.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7326

How was this patch tested?

Fixed the unit test.

@captainzmc captainzmc requested a review from kaijchen November 4, 2022 08:47
@adoroszlai
Contributor

Thank you @captainzmc for working on this.

How was this patch tested?

fix ut.

Sorry, but for intermittent tests a single run is not enough to ensure the intermittency is gone. Can you please trigger repeated runs? @kaijchen can provide an example of such a run if you need help.

(Also, please search Jira for an existing issue before filing a new one.)

@adoroszlai adoroszlai changed the title HDDS-7459. Fix unstable TestECContainerRecovery HDDS-7326. Intermittent timeout in TestECContainerRecovery.testContainerRecoveryOverReplicationProcessing Nov 4, 2022
@adoroszlai
Contributor

CC @swamirishi for review

@kaijchen
Member

kaijchen commented Nov 4, 2022

@captainzmc
Member Author

captainzmc commented Nov 4, 2022

Sure, let me trigger 10x10 iterations to test this.

@swamirishi
Contributor

swamirishi commented Nov 5, 2022

This might not resolve the intermittency completely. As far as I can tell, a lot of threads are in a waiting state, and this particular test times out after 100000 millis. This change would reduce the frequency of the test thread's checks, giving the other threads more time to perform their operations. Instead, to make the test more robust, we should probably look into increasing the priority of the background threads we expect to run, which in this case would be the overReplicationHandler threads.
@adoroszlai @kaijchen What do you think?
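For context, the timeout-vs-check-interval interplay being discussed can be sketched as a generic poll-with-timeout loop. This is not the actual test code, just a minimal self-contained illustration (in the spirit of Hadoop's `GenericTestUtils.waitFor`) of why raising the check interval only reduces contention between the test thread and background threads; if the condition never becomes true, the test still times out:

```java
import java.util.function.BooleanSupplier;

public class WaitFor {

    // Polls `check` every checkIntervalMs until it returns true,
    // or throws once timeoutMs has elapsed.
    static void waitFor(BooleanSupplier check, long checkIntervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("Timed out waiting for condition");
            }
            // A larger interval means fewer wakeups of the test thread,
            // leaving more CPU time for background threads.
            Thread.sleep(checkIntervalMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after roughly 200 ms.
        waitFor(() -> System.currentTimeMillis() - start > 200, 50, 5000);
        System.out.println("condition met");
    }
}
```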

@swamirishi
Contributor

swamirishi commented Nov 7, 2022

BTW, after looking at the test case, I found that the issue is not the timeout but the replication manager interval, which is driven by the config hdds.scm.replication.thread.interval and defaults to 300s. I have fixed this issue as part of this PR. @captainzmc @adoroszlai @kaijchen Can you confirm whether this is the root cause of the failure?
#3939
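For reference, a setting like this can be overridden per test via the cluster configuration. A minimal ozone-site.xml-style sketch, assuming the 300s default is the culprit (the 5s value here is illustrative only and is not taken from #3939):

```xml
<property>
  <name>hdds.scm.replication.thread.interval</name>
  <!-- Default is 300s; a short interval lets the replication
       manager run often enough for the test to observe it
       within its timeout. Illustrative value, not from #3939. -->
  <value>5s</value>
</property>
```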

@adoroszlai
Contributor

@swamirishi hdds.scm.replication.thread.interval sounds plausible as the root cause. I suggest extracting that change into a separate commit and running that single test case or class repeatedly in your fork (see example). If it fixes the issue, then please create a separate PR with only the fix.

@captainzmc
Member Author

Thanks to @swamirishi @adoroszlai for the reply. Yes, I ran 10x10 iterations and found that increasing the time only reduces the probability of failure; it cannot eliminate it. If #3939 solves this problem, I'll close this PR.

@kaijchen
Member

kaijchen commented Nov 8, 2022

Hi @captainzmc, I have just merged #3941.

@captainzmc
Member Author

Thanks @swamirishi @kaijchen @adoroszlai, let's close this one.

@captainzmc captainzmc closed this Nov 9, 2022