Skip to content

HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure#5917

Merged
adoroszlai merged 1 commit intoapache:masterfrom
xBis7:HDDS-10004
Jan 5, 2024
Merged

HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure#5917
adoroszlai merged 1 commit intoapache:masterfrom
xBis7:HDDS-10004

Conversation

@xBis7
Copy link
Contributor

@xBis7 xBis7 commented Jan 4, 2024

What changes were proposed in this pull request?

The test goes through these steps

  • Start with an inactive follower
  • Write keys
  • Start the inactive follower, which downloads a ratis snapshot
  • Write more keys
  • The follower now downloads an incremental snapshot
  • Delete some sst files from the follower's candidate dir
  • The follower tries to install an incremental snapshot, which fails
  • The follower installs a new snapshot
  • Check the metrics to verify the above steps

The metrics part was commented out by #5673.

The metrics depend on downloading the 3rd snapshot. During the failures, the corruption in the candidate dir isn’t picked up and the Ratis GrpcLogAppender actually returns SUCCESS and the snapshot installation isn’t repeated.

Making the test consistently fail

The issue is with the code deleting the sst files. The steps go as following

  • Get a list of sst files
  • Shuffle the list
  • Get the first 3 sst files
  • Delete them

The initial list that we get is always like this

[000054.sst, 000057.sst, 000062.sst, 000063.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst]

During the failures, this is what the sst file list could potentially look like after deletes

[000054.sst, 000057.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst]
[000054.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst

In both cases 5-6 sst files in consecutive order, were left untouched. Due to the randomness, some times we end up leaving too many consecutive files untouched and as a result, the candidate dir isn't considered corrupted. During the corruption we expect to get a result SNAPSHOT_UNAVAILABLE.

If we remove the shuffle and just delete the 3rd, the 4th and the last element of the list, we end up with 5 consecutive ssts untouched. Check this commit.. With that change, the test fails every time, 100/100 runs.

https://github.com/xBis7/ozone/actions/runs/7400858782/job/20136301493

Fix

If we remove the randomness and delete every other file while going over the list, the test always passes.

I've been running it on repeat using the flaky-test-check ci, 10x10.

https://github.com/xBis7/ozone/actions/runs/7401653879

https://github.com/xBis7/ozone/actions/runs/7401950911 (1 failure)

https://github.com/xBis7/ozone/actions/runs/7402888697

https://github.com/xBis7/ozone/actions/runs/7403418317

https://github.com/xBis7/ozone/actions/runs/7403423825 (1 failure)

These 2 failures, both had to do with a timeout in writing the keys, which is appearing so rarely that it was probably a memory issue and it could have something to do with the fact that I was running just the method and not the entire class.

The metrics issue is no longer there.

at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:330)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:831)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:788)
at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:350)
at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:580)
at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:225)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeysToIncreaseLogIndex(TestOMRatisSnapshots.java:1066)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallIncrementalSnapshotWithFailure(TestOMRatisSnapshots.java:630)

Check this file for the entire thread dump:
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10004

How was this patch tested?

This patch is fixing a flaky test and it was tested using the new flaky-test-check ci.

@xBis7 xBis7 added the test label Jan 4, 2024
@xBis7 xBis7 requested a review from adoroszlai January 4, 2024 09:37
@xBis7
Copy link
Contributor Author

xBis7 commented Jan 4, 2024

Running the entire class 10x10, 2 times. All runs passing.

https://github.com/xBis7/ozone/actions/runs/7407357922

https://github.com/xBis7/ozone/actions/runs/7407863800

@xBis7
Copy link
Contributor Author

xBis7 commented Jan 4, 2024

The CI failure is unrelated. It's from testInstallSnapshot.

Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 302.1 s <<< FAILURE! -- in org.apache.hadoop.ozone.om.TestOMRatisSnapshots
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(int, Path)[1] -- Time elapsed: 83.38 s <<< ERROR!

https://github.com/apache/ozone/actions/runs/7407840040/job/20155408569?pr=5917#step:6:113

It passed on my fork.

https://github.com/xBis7/ozone/actions/runs/7401541919

I'll retrigger it.

@adoroszlai adoroszlai changed the title HDDS-10004. Fix the commented part in TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure HDDS-10004. Fix TestOMRatisSnapshots#testInstallIncrementalSnapshotWithFailure Jan 4, 2024
@adoroszlai
Copy link
Contributor

@xBis7 thanks a lot for working on this.

The CI failure is unrelated. It's from testInstallSnapshot.

I've noticed such failure recently in CI, so I can confirm it's not caused by the change.

Filed https://issues.apache.org/jira/browse/HDDS-10059, please take a look if you have time.

@xBis7
Copy link
Contributor Author

xBis7 commented Jan 4, 2024

Filed https://issues.apache.org/jira/browse/HDDS-10059, please take a look if you have time.

@adoroszlai Thanks, I'll probably pick it up in the next day.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xBis7 for the fix, LGTM.

Copy link
Contributor

@hemantk-12 hemantk-12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @xBis7.

Change looks good to me.

@adoroszlai adoroszlai merged commit 741b113 into apache:master Jan 5, 2024
@adoroszlai
Copy link
Contributor

Thanks @xBis7 for the patch, @hemantk-12 for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants