Skip to content

HDDS-6470. Fix TestOzoneManagerHAWithData#testOMRestart()#3213

Merged
adoroszlai merged 2 commits intoapache:masterfrom
kaijchen:HDDS-6470
Mar 19, 2022
Merged

HDDS-6470. Fix TestOzoneManagerHAWithData#testOMRestart()#3213
adoroszlai merged 2 commits intoapache:masterfrom
kaijchen:HDDS-6470

Conversation

@kaijchen
Copy link
Member

@kaijchen kaijchen commented Mar 19, 2022

What changes were proposed in this pull request?

This assertion is wrong in TestOzoneManagerHAWithData#testOMRestart().
Because the lagging follower OM may catch up asynchronously.

    // Restart the stopped OM.
    followerOM1.restart();
 
    // Get the latest snapshotIndex from the leader OM.
    long leaderOMSnaphsotIndex = leaderOM.getRatisSnapshotIndex();
 
    // The recently started OM should be lagging behind the leader OM.
    long followerOMLastAppliedIndex =
        followerOM1.getOmRatisServer().getLastAppliedTermIndex().getIndex();
    Assert.assertTrue(
        followerOMLastAppliedIndex < leaderOMSnaphsotIndex);

Example of CI failure on master: https://github.com/apache/ozone/runs/5593803014
Result of running this test 100x: https://github.com/kaijchen/ozone/actions/runs/2007487998

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6470

How was this patch tested?

https://github.com/kaijchen/ozone/actions/runs/2008065589

@kaijchen kaijchen marked this pull request as ready for review March 19, 2022 06:03
@kaijchen
Copy link
Member Author

Seems this is duplicated with #3214. @adoroszlai

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kaijchen for the patch.

Before reporting a problem, please search Jira to see if it is already known and/or being worked on. In this case HDDS-6470 duplicates HDDS-6469, reported yesterday. I have ran 100x iterations of the fix overnight.

I think it is better to verify the fix before opening a PR. One may need to tweak the fix based on results. Waiting with the PR saves both reviewer's time and GitHub resources.

Running 10x10 iterations instead of 1x100 to get results faster is a good idea.

Please note that some other test cases in TestOzoneManagerHAWithData are also flaky (reported in separate Jira issues). Including these in the 100x repetitions can make it harder to verify results: instead of 100 clean runs we may get other failures. Test can be limited to the specific case by -Dtest=TestOzoneManagerHAWithData#testOMRestart.

I also see failure in testOMRestart with the patch:
https://github.com/kaijchen/ozone/runs/5609898194#step:4:3547

All in all, I like your patch better than mine. I'll close my PR and merge this one once we get 100x clean runs.


objectStore.createVolume(volumeName, createVolumeArgs);
OzoneVolume retVolumeinfo = objectStore.getVolume(volumeName);
OzoneVolume ozoneVolume = objectStore.getVolume(volumeName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new name is indeed better, but this refactoring is not essential for the fix. There are several other improvements that we could make in this test, but should not mix it with bugfix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree.

Assert.assertTrue(
followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
// The stopped OM should be lagging behind the leader OM.
Assert.assertTrue(followerOM1LastAppliedIndex < leaderOMSnaphsotIndex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is better to make this assertion before restarting the follower, instead of removing it like I did in #3214.

@kaijchen
Copy link
Member Author

kaijchen commented Mar 19, 2022

Thanks @kaijchen for the patch.

Thanks @adoroszlai for the review.

Before reporting a problem, please search Jira to see if it is already known and/or being worked on. In this case HDDS-6470 duplicates HDDS-6469, reported yesterday. I have ran 100x iterations of the fix overnight.

I think it is better to verify the fix before opening a PR. One may need to tweak the fix based on results. Waiting with the PR saves both reviewer's time and GitHub resources.

Sure, sorry about that.

Running 10x10 iterations instead of 1x100 to get results faster is a good idea.

Please note that some other test cases in TestOzoneManagerHAWithData are also flaky (reported in separate Jira issues). Including these in the 100x repetitions can make it harder to verify results: instead of 100 clean runs we may get other failures. Test can be limited to the specific case by -Dtest=TestOzoneManagerHAWithData#testOMRestart.

Thanks for the advise.

I also see failure in testOMRestart with the patch: https://github.com/kaijchen/ozone/runs/5609898194#step:4:3547

I don't think this failure is related to the change made here. I will rerun the tests.

All in all, I like your patch better than mine. I'll close my PR and merge this one once we get 100x clean runs.

The only difference in code logic between these 2 patches is the assertion.
I'm good if you wish to merge yours with the assertion included.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adoroszlai adoroszlai merged commit be2021a into apache:master Mar 19, 2022
@kaijchen
Copy link
Member Author

Thanks @adoroszlai for the review.

@kaijchen kaijchen deleted the HDDS-6470 branch March 19, 2022 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants