HDDS-6113. Intermittent timeout in TestBlockOutputStreamWithFailures#testWatchForCommitDatanodeFailure#6733

Closed
raju-balpande wants to merge 6 commits into apache:master from raju-balpande:raju-b-hdds-6113

Conversation

@raju-balpande
Contributor

What changes were proposed in this pull request?

Intermittent timeout in TestBlockOutputStreamWithFailuresFlushDelay#testDatanodeFailureWithSingleNodeRatis

TestBlockOutputStreamWithFailuresFlushDelay was renamed to TestBlockOutputStreamWithFailures, and the flakiness was observed in the method testWatchForCommitDatanodeFailure.

I looked into the logic in the test method. It mostly completes in 80s, but in rare scenarios it takes more than 300s. I re-tested with timeouts of 600s and 900s, but it still failed; with 1200s it succeeded, hence I committed the change with the timeout set to 1200s.
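For illustration, the timeout bump described above might look like the following JUnit 5 sketch. The annotation placement, class skeleton, and comments are assumptions based on this description, not the actual diff in the PR:

```java
import java.util.concurrent.TimeUnit;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.Timeout;

class TestBlockOutputStreamWithFailures {

  // Raised from the previous limit: the test usually finishes in ~80s,
  // but rare slow runs exceeded 300s, 600s, and even 900s
  // (sketch based on the PR description, not the actual diff).
  @Test
  @Timeout(value = 1200, unit = TimeUnit.SECONDS)
  void testWatchForCommitDatanodeFailure() throws Exception {
    // ... original test logic exercising watchForCommit with a
    // datanode failure ...
  }
}
```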

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-6113

How was this patch tested?

I tested it for 1000 runs with 50 splits and 20 iterations at https://github.com/raju-balpande/apache_ozone/actions/runs/9201973732

@adoroszlai (Contributor) left a comment

Thanks @raju-balpande for working on this.

in rare scenarios it takes more than 300s. I re-tested with timeouts of 600s and 900s, but it still failed; with 1200s it succeeded, hence I committed the change with the timeout set to 1200s.

If the test usually passes in less than 2 minutes but sometimes takes more than 15 minutes, we should see more variation in the execution time of successful splits with a 20-minute timeout. But all splits passed in around 1 hour, which indicates the "longer than 15 minutes" case was not hit in any of the runs.

So I doubt a 20-minute timeout is the solution. We'll need to check what is causing the delay.

An unrelated "fork timeout" problem (HDDS-10750) makes it more difficult to check results, so I suggest waiting until we can upgrade to Ratis 3.1.0, which should fix that.

@adoroszlai adoroszlai changed the title HDDS-6113. Intermittent timeout in TestBlockOutputStreamWithFailuresFlushDelay#testDatanodeFailureWithSingleNodeRatis HDDS-6113. Intermittent timeout in TestBlockOutputStreamWithFailures#testWatchForCommitDatanodeFailure May 28, 2024
@adoroszlai adoroszlai marked this pull request as draft May 28, 2024 06:09
@ivandika3 (Contributor) left a comment

@raju-balpande Thanks for the patch.

Could you check from the failed repeated tests whether the failures are due to the assertion error mentioned in HDDS-6113, or whether they are mostly due to the fork timeout?

As @adoroszlai said, HDDS-10750 might be the reason for the timeout.

@adoroszlai (Contributor) commented May 28, 2024

@ivandika3 (Contributor) commented May 28, 2024

@adoroszlai Thanks for checking the timeouts.

We'll need to check what is causing the delay.

Agreed. Increasing the timeout significantly would mask the underlying problem.

@adoroszlai (Contributor)

@ivandika3 however, even the test timeout seems to be due to the same underlying problem as HDDS-10750:

        at org.apache.ratis.server.impl.RaftServerImpl.lambda$close$3(RaftServerImpl.java:543)
        at org.apache.ratis.server.impl.RaftServerImpl$$Lambda$1923/1198603729.run(Unknown Source)
        at org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$7(LifeCycle.java:306)
        at org.apache.ratis.util.LifeCycle$$Lambda$1298/779888560.get(Unknown Source)
        at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:326)
        at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:304)
        at org.apache.ratis.server.impl.RaftServerImpl.close(RaftServerImpl.java:525)

So I think we can mark HDDS-6113 as a duplicate.

@ivandika3 (Contributor)

@adoroszlai Sure, we can mark it as a duplicate first.

@ivandika3 (Contributor)

@adoroszlai Noted, it's been a while. We can reopen the ticket if the assertion error happens again.

@raju-balpande (Contributor, Author)

Thanks for the input on this request.

@adoroszlai adoroszlai closed this Oct 13, 2024