HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure (#5917)
adoroszlai merged 1 commit into apache:master
Conversation
Running the entire class 10x10, 2 times. All runs passing.
The CI failure is unrelated. It's from https://github.com/apache/ozone/actions/runs/7407840040/job/20155408569?pr=5917#step:6:113 and it passed on my fork: https://github.com/xBis7/ozone/actions/runs/7401541919 I'll retrigger it.
@xBis7 thanks a lot for working on this.
I've noticed such failures recently in CI, so I can confirm it's not caused by this change. Filed https://issues.apache.org/jira/browse/HDDS-10059, please take a look if you have time.
@adoroszlai Thanks, I'll probably pick it up in the next day or two.
adoroszlai left a comment:
Thanks @xBis7 for the fix, LGTM.
hemantk-12 left a comment:
Thanks for the patch @xBis7.
Change looks good to me.
Thanks @xBis7 for the patch, @hemantk-12 for the review.
What changes were proposed in this pull request?
The test goes through these steps:
The metrics part was commented out by #5673.
The metrics depend on downloading the 3rd snapshot. During the failures, the corruption in the candidate dir isn't picked up, the Ratis GrpcLogAppender actually returns SUCCESS, and the snapshot installation isn't repeated.
Making the test consistently fail
The issue is with the code deleting the sst files. The steps are as follows:
The initial list that we get is always like this
During the failures, this is what the sst file list could potentially look like after deletes
In both cases, 5-6 sst files in consecutive order were left untouched. Due to the randomness, sometimes we end up leaving too many consecutive files untouched and, as a result, the candidate dir isn't considered corrupted. During the corruption we expect to get a SNAPSHOT_UNAVAILABLE result. If we remove the shuffle and just delete the 3rd, the 4th and the last element of the list, we end up with 5 consecutive ssts untouched. Check this commit. With that change, the test fails every time, 100/100 runs.
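To make the flakiness concrete, here is a minimal standalone sketch (not the actual test code; all names are hypothetical) that simulates the old shuffle-based deletion and measures the longest run of consecutive files left untouched. Depending on the random seed, the run length varies a lot, which is exactly why the corruption was sometimes not detected:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical reconstruction of the flaky pattern: shuffle the list of
// SST file indices, "delete" the first few, and measure the longest run
// of consecutive files that survive untouched.
public class ShuffledDeletion {

  // Returns the longest run of consecutive indices in [0, total) that
  // survive after deleting 'deletions' randomly chosen entries.
  static int longestUntouchedRun(int total, int deletions, long seed) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < total; i++) {
      indices.add(i);
    }
    Collections.shuffle(indices, new Random(seed));
    boolean[] deleted = new boolean[total];
    for (int i = 0; i < deletions; i++) {
      deleted[indices.get(i)] = true;  // simulate deleting this sst file
    }
    int longest = 0;
    int run = 0;
    for (int i = 0; i < total; i++) {
      run = deleted[i] ? 0 : run + 1;
      longest = Math.max(longest, run);
    }
    return longest;
  }

  public static void main(String[] args) {
    // With 16 files and 3 deletions, different seeds leave untouched runs
    // of very different lengths. When the run is long enough, the
    // candidate dir still looks intact and the install is not retried.
    for (long seed = 0; seed < 5; seed++) {
      System.out.println("seed " + seed + " -> longest untouched run: "
          + longestUntouchedRun(16, 3, seed));
    }
  }
}
```

The file count and deletion count here are made up for illustration; the point is only that a shuffle gives no upper bound on how many consecutive files survive.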
https://github.com/xBis7/ozone/actions/runs/7400858782/job/20136301493
Fix
If we remove the randomness and delete every other file while going over the list, the test always passes.
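The deterministic variant can be sketched like this (again illustrative, not the actual patch; the method name is made up). Deleting every other entry guarantees that no two surviving files are consecutive, so the candidate dir is always seen as corrupted:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the deterministic fix: walk the sst file list and
// delete every other entry, bounding the longest untouched run at 1.
public class EveryOtherDeletion {

  // Returns the files that survive when every even-indexed entry is
  // "deleted" while iterating over the list.
  static List<String> deleteEveryOther(List<String> sstFiles) {
    List<String> remaining = new ArrayList<>();
    for (int i = 0; i < sstFiles.size(); i++) {
      if (i % 2 != 0) {  // even indices are deleted, odd ones survive
        remaining.add(sstFiles.get(i));
      }
    }
    return remaining;
  }

  public static void main(String[] args) {
    List<String> files = List.of(
        "000001.sst", "000002.sst", "000003.sst",
        "000004.sst", "000005.sst");
    System.out.println("remaining: " + deleteEveryOther(files));
  }
}
```

Because the pattern no longer depends on a random shuffle, every run exercises the same corruption scenario, which is why the test becomes stable.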
I've been running it on repeat using the flaky-test-check CI, 10x10.
https://github.com/xBis7/ozone/actions/runs/7401653879
https://github.com/xBis7/ozone/actions/runs/7401950911 (1 failure)
https://github.com/xBis7/ozone/actions/runs/7402888697
https://github.com/xBis7/ozone/actions/runs/7403418317
https://github.com/xBis7/ozone/actions/runs/7403423825 (1 failure)
These 2 failures both had to do with a timeout while writing the keys, which appears so rarely that it was probably a memory issue; it could also be related to the fact that I was running just the method and not the entire class.
The metrics issue is no longer there.
Check this file for the entire thread dump:
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10004
How was this patch tested?
This patch fixes a flaky test and was tested using the new flaky-test-check CI.