HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure by xBis7 · Pull Request #5917 · apache/ozone

xBis7 · 2024-01-04T09:37:53Z

What changes were proposed in this pull request?

The test goes through these steps

Start with an inactive follower
Write keys
Start the inactive follower, which downloads a ratis snapshot
Write more keys
The follower now downloads an incremental snapshot
Delete some sst files from the follower's candidate dir
The follower tries to install an incremental snapshot, which fails
The follower installs a new snapshot
Check the metrics to verify the above steps

The metrics part was commented out by #5673.

The metrics depend on downloading the 3rd snapshot. During the failures, the corruption in the candidate dir isn’t picked up and the Ratis GrpcLogAppender actually returns SUCCESS and the snapshot installation isn’t repeated.

Making the test consistently fail

The issue is with the code deleting the sst files. The steps go as following

Get a list of sst files
Shuffle the list
Get the first 3 sst files
Delete them

The initial list that we get is always like this

[000054.sst, 000057.sst, 000062.sst, 000063.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst]

During the failures, this is what the sst file list could potentially look like after deletes

[000054.sst, 000057.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst]

[000054.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst

In both cases 5-6 sst files in consecutive order, were left untouched. Due to the randomness, some times we end up leaving too many consecutive files untouched and as a result, the candidate dir isn't considered corrupted. During the corruption we expect to get a result SNAPSHOT_UNAVAILABLE.

If we remove the shuffle and just delete the 3rd, the 4th and the last element of the list, we end up with 5 consecutive ssts untouched. Check this commit.. With that change, the test fails every time, 100/100 runs.

https://github.com/xBis7/ozone/actions/runs/7400858782/job/20136301493

Fix

If we remove the randomness and delete every other file while going over the list, the test always passes.

I've been running it on repeat using the flaky-test-check ci, 10x10.

https://github.com/xBis7/ozone/actions/runs/7401653879

https://github.com/xBis7/ozone/actions/runs/7401950911 (1 failure)

https://github.com/xBis7/ozone/actions/runs/7402888697

https://github.com/xBis7/ozone/actions/runs/7403418317

https://github.com/xBis7/ozone/actions/runs/7403423825 (1 failure)

These 2 failures, both had to do with a timeout in writing the keys, which is appearing so rarely that it was probably a memory issue and it could have something to do with the fact that I was running just the method and not the entire class.

The metrics issue is no longer there.

at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:330)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:831)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:788)
at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:350)
at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:580)
at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:225)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeysToIncreaseLogIndex(TestOMRatisSnapshots.java:1066)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallIncrementalSnapshotWithFailure(TestOMRatisSnapshots.java:630)

Check this file for the entire thread dump:
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10004

How was this patch tested?

This patch is fixing a flaky test and it was tested using the new flaky-test-check ci.

…SnapshotWithFailure

xBis7 · 2024-01-04T10:49:41Z

Running the entire class 10x10, 2 times. All runs passing.

https://github.com/xBis7/ozone/actions/runs/7407357922

https://github.com/xBis7/ozone/actions/runs/7407863800

xBis7 · 2024-01-04T10:56:45Z

The CI failure is unrelated. It's from testInstallSnapshot.

Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 302.1 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.TestOMRatisSnapshots
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 83.38 s <<< ERROR!

https://github.com/apache/ozone/actions/runs/7407840040/job/20155408569?pr=5917#step:6:113

It passed on my fork.

https://github.com/xBis7/ozone/actions/runs/7401541919

I'll retrigger it.

adoroszlai · 2024-01-04T11:15:21Z

@xBis7 thanks a lot for working on this.

The CI failure is unrelated. It's from testInstallSnapshot.

I've noticed such failure recently in CI, so I can confirm it's not caused by the change.

Filed https://issues.apache.org/jira/browse/HDDS-10059, please take a look if you have time.

xBis7 · 2024-01-04T12:06:30Z

Filed https://issues.apache.org/jira/browse/HDDS-10059, please take a look if you have time.

@adoroszlai Thanks, I'll probably pick it up in the next day.

adoroszlai

Thanks @xBis7 for the fix, LGTM.

hemantk-12

Thanks for the patch @xBis7.

Change looks good to me.

adoroszlai · 2024-01-05T06:08:13Z

Thanks @xBis7 for the patch, @hemantk-12 for the review.

Fix the commented part in TestOMRatisSnapshots#testInstallIncremental…

d72b97f

…SnapshotWithFailure

xBis7 added the test label Jan 4, 2024

xBis7 requested a review from adoroszlai January 4, 2024 09:37

adoroszlai changed the title ~~HDDS-10004. Fix the commented part in TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure~~ HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure Jan 4, 2024

adoroszlai reviewed Jan 4, 2024

View reviewed changes

adoroszlai requested review from hemantk-12 and smengcl January 4, 2024 14:05

hemantk-12 approved these changes Jan 4, 2024

View reviewed changes

adoroszlai merged commit 741b113 into apache:master Jan 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure#5917

HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure#5917
adoroszlai merged 1 commit intoapache:masterfrom
xBis7:HDDS-10004

xBis7 commented Jan 4, 2024 •

edited

Loading

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

adoroszlai commented Jan 4, 2024

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

adoroszlai left a comment

Uh oh!

hemantk-12 left a comment

Uh oh!

adoroszlai commented Jan 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xBis7 commented Jan 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Making the test consistently fail

Fix

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

adoroszlai commented Jan 4, 2024

Uh oh!

xBis7 commented Jan 4, 2024

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

hemantk-12 left a comment

Choose a reason for hiding this comment

Uh oh!

adoroszlai commented Jan 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xBis7 commented Jan 4, 2024 •

edited

Loading