HDDS-6470. Fix TestOzoneManagerHAWithData#testOMRestart()#3213
adoroszlai merged 2 commits into apache:master
This seems to duplicate #3214. @adoroszlai
Thanks @kaijchen for the patch.
Before reporting a problem, please search Jira to see if it is already known and/or being worked on. In this case HDDS-6470 duplicates HDDS-6469, reported yesterday. I have run 100x iterations of the fix overnight.
I think it is better to verify the fix before opening a PR. One may need to tweak the fix based on the results. Waiting with the PR saves both reviewers' time and GitHub resources.
Running 10x10 iterations instead of 1x100 to get results faster is a good idea.
Please note that some other test cases in TestOzoneManagerHAWithData are also flaky (reported in separate Jira issues). Including these in the 100x repetitions can make it harder to verify results: instead of 100 clean runs we may get other failures. Test can be limited to the specific case by -Dtest=TestOzoneManagerHAWithData#testOMRestart.
I also see a failure in testOMRestart with the patch:
https://github.com/kaijchen/ozone/runs/5609898194#step:4:3547
All in all, I like your patch better than mine. I'll close my PR and merge this one once we get 100x clean runs.
    objectStore.createVolume(volumeName, createVolumeArgs);
    OzoneVolume retVolumeinfo = objectStore.getVolume(volumeName);
    OzoneVolume ozoneVolume = objectStore.getVolume(volumeName);
The new name is indeed better, but this refactoring is not essential for the fix. There are several other improvements that we could make in this test, but we should not mix them with the bugfix.
    Assert.assertTrue(
        followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
    // The stopped OM should be lagging behind the leader OM.
    Assert.assertTrue(followerOM1LastAppliedIndex < leaderOMSnaphsotIndex);
I agree that it is better to make this assertion before restarting the follower, instead of removing it like I did in #3214.
Thanks @adoroszlai for the review.
Sure, sorry about that.
Thanks for the advice.
I don't think this failure is related to the change made here. I will rerun the tests.
The only difference in code logic between these 2 patches is the assertion.
adoroszlai left a comment
Thanks @kaijchen for the fix. 5x20 run verified: https://github.com/kaijchen/ozone/actions/runs/2008065589
Thanks @adoroszlai for the review.
What changes were proposed in this pull request?
The assertion in TestOzoneManagerHAWithData#testOMRestart() that the follower OM lags behind the leader is wrong as written, because the lagging follower OM may catch up asynchronously after it is restarted.
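To illustrate the race, here is a minimal, self-contained sketch. It does not use the Ozone test classes: `FakeFollower`, `restartAndCatchUp`, and `lagsBehind` are hypothetical names made up for this example, and the index values are arbitrary. It only shows why the "follower is behind the leader" check is stable while the follower is stopped and can stop holding once a restarted follower catches up in the background.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FollowerLagSketch {

    /** Hypothetical stand-in for a follower OM; not an Ozone class. */
    public static final class FakeFollower {
        private final AtomicLong lastAppliedIndex;

        public FakeFollower(long initialIndex) {
            this.lastAppliedIndex = new AtomicLong(initialIndex);
        }

        public long getLastAppliedIndex() {
            return lastAppliedIndex.get();
        }

        /**
         * Simulates a restart: the follower catches up to the leader
         * on a background thread, so the caller cannot know when the
         * index changes unless it waits for the returned thread.
         */
        public Thread restartAndCatchUp(long leaderIndex) {
            Thread t = new Thread(() -> lastAppliedIndex.set(leaderIndex));
            t.start();
            return t;
        }
    }

    /** The condition the test asserts: follower index is below the leader's. */
    public static boolean lagsBehind(long followerIndex, long leaderSnapshotIndex) {
        return followerIndex < leaderSnapshotIndex;
    }

    public static void main(String[] args) throws InterruptedException {
        long leaderSnapshotIndex = 100L;          // arbitrary value
        FakeFollower follower = new FakeFollower(40L);

        // While the follower is stopped its index cannot move,
        // so checking the lag here is deterministic.
        if (!lagsBehind(follower.getLastAppliedIndex(), leaderSnapshotIndex)) {
            throw new AssertionError("stopped follower should lag behind");
        }

        // After restart the catch-up runs concurrently. Checking the lag
        // at this point would race with the background thread.
        Thread catchUp = follower.restartAndCatchUp(leaderSnapshotIndex);
        catchUp.join();

        // Once catch-up has finished, the lag condition no longer holds.
        System.out.println("still lagging: "
            + lagsBehind(follower.getLastAppliedIndex(), leaderSnapshotIndex));
    }
}
```

In this sketch the check made before `restartAndCatchUp` always passes, while the same check made afterwards depends on thread timing, which matches the intermittent failure seen in CI.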
Example of CI failure on master: https://github.com/apache/ozone/runs/5593803014
Result of running this test 100x: https://github.com/kaijchen/ozone/actions/runs/2007487998
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6470
How was this patch tested?
https://github.com/kaijchen/ozone/actions/runs/2008065589