HDDS-6113. Intermittent timeout in TestBlockOutputStreamWithFailures#testWatchForCommitDatanodeFailure #6733
raju-balpande wants to merge 6 commits into apache:master
adoroszlai left a comment:
Thanks @raju-balpande for working on this.
> in rare scenarios it takes more than 300s. I re-tested it with 600s and 900s, but it still failed. When I ran it with 1200s it succeeded, hence I committed the change with the timeout set to 1200s.
If the test usually passes in less than 2 minutes, but sometimes takes more than 15 minutes, we should see more variation in the execution time of successful splits when using a 20-minute timeout. But all splits passed in around 1 hour, which indicates the "longer than 15 minutes" case was not hit in any of the runs.
So I doubt a 20-minute timeout is the solution. We'll need to check what is causing the delay.
An unrelated "fork timeout" problem (HDDS-10750) makes it more difficult to check results, so I suggest waiting until we can upgrade to Ratis 3.1.0, which should fix that.
ivandika3 left a comment:
@raju-balpande Thanks for the patch.
Could you check from the failed repeated tests whether the failures are due to the assertion error mentioned in HDDS-6113, or whether they are mostly due to fork timeouts?
As @adoroszlai said, HDDS-10750 might be the reason for the timeout.
@ivandika3 I have checked some of the logs, and there are both kinds of timeouts:
test timeout: https://github.com/raju-balpande/apache_ozone/actions/runs/9226024466/job/25385130012#step:7:31517
fork timeout: https://github.com/raju-balpande/apache_ozone/actions/runs/9234289550/job/25461443504#step:7:32145
@adoroszlai Thanks for checking the timeouts.
Agreed. Increasing the timeout significantly would mask the underlying problem.
@ivandika3 however, even the test timeout seems to be due to the same underlying problem as HDDS-10750: So I think we can mark HDDS-6113 as duplicate.
@adoroszlai Sure, we can mark it as duplicate first.
@adoroszlai Noted, it's been a while. We can reopen the ticket if the assertion error happens again.
Thanks for the input on this request.
What changes were proposed in this pull request?
Intermittent timeout in TestBlockOutputStreamWithFailuresFlushDelay#testDatanodeFailureWithSingleNodeRatis
TestBlockOutputStreamWithFailuresFlushDelay has been changed to TestBlockOutputStreamWithFailures, and the flakiness was observed in the method testWatchForCommitDatanodeFailure.
I looked into the logic of the test method. It mostly completes in 80s, but in rare scenarios it takes more than 300s. I re-tested it with 600s and 900s, but it still failed. When I ran it with 1200s it succeeded, hence I committed the change with the timeout set to 1200s.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6113
How was this patch tested?
I tested it with 1000 runs (50 splits of 20 iterations each) at https://github.com/raju-balpande/apache_ozone/actions/runs/9201973732
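The review argument about split durations can be checked with rough arithmetic. The sketch below is only an illustration: it takes the typical run time (80s) and the 20 iterations per split from this description, assumes a slow run lasts at least 900s, and ignores build and cluster setup time. The class and method names are hypothetical and not part of the patch.

```java
public class SplitDurationSketch {
    // Figures from the PR description: a typical run takes about 80 s,
    // and each split repeats the test 20 times.
    static final int ITERATIONS_PER_SPLIT = 20;
    static final int TYPICAL_RUN_SECONDS = 80;

    /**
     * Lower bound on the time one split spends in the test method when
     * slowRuns of its iterations each take at least slowRunSeconds.
     * Setup overhead is deliberately left out (assumption).
     */
    static int splitSeconds(int slowRuns, int slowRunSeconds) {
        int normalRuns = ITERATIONS_PER_SPLIT - slowRuns;
        return normalRuns * TYPICAL_RUN_SECONDS + slowRuns * slowRunSeconds;
    }

    public static void main(String[] args) {
        int baseline = splitSeconds(0, 0);   // every iteration is typical
        int oneSlow = splitSeconds(1, 900);  // one iteration exceeds 900 s
        System.out.println("all runs typical: " + baseline + " s");
        System.out.println("one slow run: at least " + oneSlow + " s");
        System.out.println("difference: at least " + (oneSlow - baseline) + " s");
    }
}
```

Under these assumptions, a single slow iteration adds at least 820s (roughly 14 minutes) to a split, so splits that hit the slow case should stand out from those that did not.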