HDDS-5319 Intermittent failure in TestSCMUpdateServiceGrpcServer #2558
Merged
adoroszlai merged 5 commits into apache:master on Dec 28, 2021
adoroszlai
reviewed
Sep 16, 2021
Contributor
adoroszlai
left a comment
Thanks @jwminton for working on this. The changes make sense. However, I ran the tests 150 times and still got 4 failures, so further tweaks may be needed.
This one 3 times:
[ERROR] testClientUpdateWithRestart Time elapsed: 10.373 s <<< FAILURE!
java.lang.AssertionError: expected:<5> but was:<6>
...
at org.apache.hadoop.hdds.scm.update.server.TestSCMUpdateServiceGrpcServer.testClientUpdateWithRestart(TestSCMUpdateServiceGrpcServer.java:268)
Another one only once:
[ERROR] testClientUpdateWithDelayedRevoke Time elapsed: 5.263 s <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
...
at org.apache.hadoop.hdds.scm.update.server.TestSCMUpdateServiceGrpcServer.testClientUpdateWithDelayedRevoke(TestSCMUpdateServiceGrpcServer.java:190)
https://github.com/adoroszlai/hadoop-ozone/commits/HDDS-5319-repeat
Contributor
@jwminton Thanks again for the patch. I would like to propose merging your improvements, but keeping the test disabled for now. If you agree, can you please push another commit to restore @Ignore, or let me know if I should make that change?
Contributor
Author
Yes, the shortest path will be for you to push that change, if it's not too much trouble.
Thanks.
…On Mon, Dec 27, 2021 at 3:57 AM Doroszlai, Attila ***@***.***> wrote:
@jwminton <https://github.com/jwminton> Thanks again for the patch. I
would like to propose merging your improvements, but keeping the test
disabled for now. If you agree, can you please push another commit to
restore @ignore, or let me know if I should make that change?
What changes were proposed in this pull request?
Three failures are reported in this issue: two in testClientUpdateWithDelayedRevoke and one in testClientUpdateWithRestart.
For the two failures in testClientUpdateWithDelayedRevoke(), the problem is that after cert 5 is revoked (after the waitFor), the assertions that follow check the pre-removal state while the revocation is still pending. Nothing guarantees that these pre-revoke assertions run before the timeout expires and the revoke takes effect, hence the two different failures: the revoke can affect two of the tested values, and which assertion fails depends on where the test is when the revoke happens. Unless there is a way to hold the pre-revoke state without the risk of the revoke firing, it seems best to let these two assertions accept both normal execution and the case where the removal has already happened. That is why the two asserts now check range relations rather than exact values.
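The range-relation idea can be illustrated with a small standalone sketch. Everything here is hypothetical: `assertInRange` and the "removed cert count" value are stand-ins made up for illustration, not the helpers or fields used in TestSCMUpdateServiceGrpcServer. It only shows how an assertion can tolerate a value that the pending revoke may or may not have changed yet.

```java
/** Sketch only: all names are hypothetical, not taken from the actual test class. */
final class DelayedRevokeAssertSketch {

  /** Throws AssertionError unless min <= actual <= max. */
  static void assertInRange(String what, long min, long max, long actual) {
    if (actual < min || actual > max) {
      throw new AssertionError(
          what + ": expected a value in [" + min + ", " + max + "] but was " + actual);
    }
  }

  public static void main(String[] args) {
    // Hypothetical stand-in for a value the pending revoke may change from 0 to 1.
    // An exact check against 0 would fail whenever the revoke fires first;
    // the range check accepts both orderings.
    for (long observed : new long[] {0, 1}) {
      assertInRange("removed cert count", 0, 1, observed);
      System.out.println("accepted removed cert count = " + observed);
    }
  }
}
```

The trade-off is that a range check is weaker than an exact one: it no longer proves the pre-revoke state was observed, only that the value is consistent with either side of the race.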
For the testClientUpdateWithRestart failure, the expected updateCount is not tracked correctly mid-test, so it's possible for test execution to get ahead of the state the client and server should be in. updateCount is checked against 4 twice, missing an increment, which guarantees that the last two waitFor calls never wait: their predicates are already satisfied by the time they execute.
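The effect of a stale expected count can be reproduced in isolation. The sketch below uses a hand-rolled `waitFor` helper as a stand-in for whatever wait utility the test uses; the helper, its signature, and the `updateCount` counter are assumptions for illustration, not the actual test code. It returns how many times it had to sleep, which makes the "never waits" case visible.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Sketch only: a hypothetical polling helper, not the utility used by the real test. */
final class WaitForSketch {

  /** Polls until check passes or the timeout expires; returns the number of sleeps. */
  static int waitFor(BooleanSupplier check, long intervalMillis, long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    int sleeps = 0;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new IllegalStateException("timed out waiting for condition");
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
      sleeps++;
    }
    return sleeps;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicLong updateCount = new AtomicLong(4);

    // Stale expectation: the count is already 4, so this returns without waiting
    // and the test would race ahead of the server.
    int sleptStale = waitFor(() -> updateCount.get() >= 4, 10, 1000);
    System.out.println("stale predicate slept " + sleptStale + " time(s)");

    // Simulated server update that arrives a little later.
    Thread updater = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      updateCount.incrementAndGet();
    });
    updater.start();

    // Correct expectation: blocks until the update has actually been observed.
    int sleptCorrect = waitFor(() -> updateCount.get() >= 5, 10, 1000);
    updater.join();
    System.out.println("correct predicate slept " + sleptCorrect + " time(s)");
  }
}
```

With the expected value bumped to match each update, every wait becomes a real synchronization point instead of a no-op.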
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5319
How was this patch tested?
Manually, via GitHub CI, in my IDE, and repeatedly in loops in bash.