HDDS-6470. Fix TestOzoneManagerHAWithData#testOMRestart() by kaijchen · Pull Request #3213 · apache/ozone

kaijchen · 2022-03-19T05:51:38Z

What changes were proposed in this pull request?

The following assertion in TestOzoneManagerHAWithData#testOMRestart() is wrong, because the lagging follower OM may catch up asynchronously as soon as it is restarted.

    // Restart the stopped OM.
    followerOM1.restart();
 
    // Get the latest snapshotIndex from the leader OM.
    long leaderOMSnaphsotIndex = leaderOM.getRatisSnapshotIndex();
 
    // The recently started OM should be lagging behind the leader OM.
    long followerOMLastAppliedIndex =
        followerOM1.getOmRatisServer().getLastAppliedTermIndex().getIndex();
    Assert.assertTrue(
        followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
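
To illustrate why the check above is racy, here is a minimal, self-contained sketch. It does not use the Ozone or Ratis APIs: `ToyFollower`, `restart(long)` and `awaitCaughtUp()` are hypothetical names, and the assumption is only that a restarted follower replays the leader's log on a background thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncCatchUpDemo {

    /** Hypothetical stand-in for a follower OM; not an Ozone class. */
    public static final class ToyFollower {
        private final AtomicLong lastAppliedIndex;
        private final CountDownLatch caughtUp = new CountDownLatch(1);

        public ToyFollower(long initialIndex) {
            this.lastAppliedIndex = new AtomicLong(initialIndex);
        }

        public long getLastAppliedIndex() {
            return lastAppliedIndex.get();
        }

        /** Returns immediately; the catch-up runs on a background thread. */
        public void restart(long leaderIndex) {
            Thread replay = new Thread(() -> {
                lastAppliedIndex.set(leaderIndex);
                caughtUp.countDown();
            });
            replay.setDaemon(true);
            replay.start();
        }

        /** Blocks until the background catch-up has finished. */
        public void awaitCaughtUp() {
            try {
                caughtUp.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long leaderSnapshotIndex = 100L;
        ToyFollower follower = new ToyFollower(10L);

        follower.restart(leaderSnapshotIndex);

        // Racy read: depending on thread scheduling this is 10 or 100,
        // so asserting "follower < leader" here can fail intermittently.
        long racyRead = follower.getLastAppliedIndex();
        System.out.println("read right after restart: " + racyRead);

        follower.awaitCaughtUp();
        // Once catch-up is done, the follower is no longer behind the leader.
        System.out.println("lagging after catch-up: "
            + (follower.getLastAppliedIndex() < leaderSnapshotIndex));
    }
}
```

The first printed value can differ between runs, which is the same kind of nondeterminism that would make an assertion placed right after the restart fail only occasionally.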

Example of CI failure on master: https://github.com/apache/ozone/runs/5593803014
Result of running this test 100x: https://github.com/kaijchen/ozone/actions/runs/2007487998

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6470

How was this patch tested?

https://github.com/kaijchen/ozone/actions/runs/2008065589

kaijchen · 2022-03-19T06:09:10Z

It seems this duplicates #3214. @adoroszlai

adoroszlai

Thanks @kaijchen for the patch.

Before reporting a problem, please search Jira to see if it is already known and/or being worked on. In this case HDDS-6470 duplicates HDDS-6469, reported yesterday. I have run 100x iterations of the fix overnight.

I think it is better to verify the fix before opening a PR. One may need to tweak the fix based on results. Waiting with the PR saves both reviewers' time and GitHub resources.

Running 10x10 iterations instead of 1x100 to get results faster is a good idea.

Please note that some other test cases in TestOzoneManagerHAWithData are also flaky (reported in separate Jira issues). Including these in the 100x repetitions can make it harder to verify results: instead of 100 clean runs we may get other failures. The test can be limited to the specific case with -Dtest=TestOzoneManagerHAWithData#testOMRestart.

I also see a failure in testOMRestart with the patch:
https://github.com/kaijchen/ozone/runs/5609898194#step:4:3547

All in all, I like your patch better than mine. I'll close my PR and merge this one once we get 100x clean runs.

adoroszlai · 2022-03-19T06:43:44Z

...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java


    objectStore.createVolume(volumeName, createVolumeArgs);
-    OzoneVolume retVolumeinfo = objectStore.getVolume(volumeName);
+    OzoneVolume ozoneVolume = objectStore.getVolume(volumeName);


The new name is indeed better, but this refactoring is not essential for the fix. There are several other improvements that we could make in this test, but we should not mix them with the bugfix.

Yes, I agree.

adoroszlai · 2022-03-19T06:44:37Z

...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java

-    Assert.assertTrue(
-        followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
+    // The stopped OM should be lagging behind the leader OM.
+    Assert.assertTrue(followerOM1LastAppliedIndex < leaderOMSnaphsotIndex);


I agree that it is better to make this assertion before restarting the follower, instead of removing it like I did in #3214.
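
The reordering discussed here can be sketched as follows. This is only an illustration under stated assumptions, not the actual test code: the follower is modelled by a plain `AtomicLong`, and `waitFor` and `runScenario` are hypothetical helpers written for this example. The lag check runs while the follower is still stopped, so its index cannot change underneath the assertion; after the restart the code only polls until the follower has caught up.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

public class ReorderedRestartCheck {

    /** Hypothetical helper: polls until the condition holds or the timeout elapses. */
    public static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new IllegalStateException("timed out waiting for condition");
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    /** Runs the reordered scenario on a toy follower and returns its final index. */
    public static long runScenario(long followerIndexWhileStopped, long leaderSnapshotIndex) {
        AtomicLong followerLastApplied = new AtomicLong(followerIndexWhileStopped);

        // 1. Follower is still stopped: nothing can advance its index,
        //    so the lag check is deterministic at this point.
        if (!(followerLastApplied.get() < leaderSnapshotIndex)) {
            throw new AssertionError("stopped follower should lag behind the leader");
        }

        // 2. "Restart": in this toy model, catch-up happens on a background thread.
        Thread catchUp = new Thread(() -> followerLastApplied.set(leaderSnapshotIndex));
        catchUp.setDaemon(true);
        catchUp.start();

        // 3. After the restart, only wait for catch-up; do not assert on the lag.
        waitFor(() -> followerLastApplied.get() >= leaderSnapshotIndex, 10L, 5_000L);
        return followerLastApplied.get();
    }

    public static void main(String[] args) {
        System.out.println("follower caught up to index " + runScenario(10L, 100L));
    }
}
```

Keeping the assertion, rather than dropping it, still verifies that the test setup really produced a lagging follower before the restart.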

kaijchen · 2022-03-19T07:46:15Z

> Thanks @kaijchen for the patch.

Thanks @adoroszlai for the review.

> Before reporting a problem, please search Jira to see if it is already known and/or being worked on. In this case HDDS-6470 duplicates HDDS-6469, reported yesterday. I have run 100x iterations of the fix overnight.
>
> I think it is better to verify the fix before opening a PR. One may need to tweak the fix based on results. Waiting with the PR saves both reviewers' time and GitHub resources.

Sure, sorry about that.

> Running 10x10 iterations instead of 1x100 to get results faster is a good idea.
>
> Please note that some other test cases in TestOzoneManagerHAWithData are also flaky (reported in separate Jira issues). Including these in the 100x repetitions can make it harder to verify results: instead of 100 clean runs we may get other failures. The test can be limited to the specific case with -Dtest=TestOzoneManagerHAWithData#testOMRestart.

Thanks for the advice.

> I also see a failure in testOMRestart with the patch: https://github.com/kaijchen/ozone/runs/5609898194#step:4:3547

I don't think this failure is related to the change made here. I will rerun the tests.

> All in all, I like your patch better than mine. I'll close my PR and merge this one once we get 100x clean runs.

The only difference in code logic between these two patches is the assertion.
I'm fine with it if you wish to merge yours with the assertion included.

adoroszlai

Thanks @kaijchen for the fix. 5x20 run verified: https://github.com/kaijchen/ozone/actions/runs/2008065589

kaijchen · 2022-03-19T09:18:33Z

Thanks @adoroszlai for the review.

Fix TestOzoneManagerHAWithData#testOMRestart()

837359f

kaijchen marked this pull request as ready for review March 19, 2022 06:03

Revert the gap and the timeout

2c8e195

adoroszlai reviewed Mar 19, 2022

kaijchen mentioned this pull request Mar 19, 2022

HDDS-6469. Intermittent failure in TestOzoneManagerHAWithData#testOMRestart #3214

Closed

adoroszlai approved these changes Mar 19, 2022

adoroszlai merged commit be2021a into apache:master Mar 19, 2022

kaijchen deleted the HDDS-6470 branch March 19, 2022 12:45

adoroszlai merged 2 commits into apache:master from kaijchen:HDDS-6470
