HDDS-5319 Intermittent failure in TestSCMUpdateServiceGrpcServer #2558

Merged
adoroszlai merged 5 commits into apache:master from jwminton:HDDS-5319
Dec 28, 2021

@jwminton

What changes were proposed in this pull request?

Three failures are reported in this issue: two in testClientUpdateWithDelayedRevoke and one in testClientUpdateWithRestart.

For the two failures in testClientUpdateWithDelayedRevoke(), the problem is that after cert 5 is revoked and the waitFor returns, the next assertions check the pre-removal state while the revocation is still pending. Nothing guarantees that these pre-revoke checks run before the timeout expires and the revoke takes effect, hence the two different failures: the revoke can affect two of the tested values, and which assertion fails depends on which operation the test is on when the revoke happens. Unless there is a way to guarantee the pre-revoke state without the threat of the revoke firing, it may be best to let these two checks accommodate both normal execution and the case where the removal has already happened. That is why the two asserts now check range relations rather than exact values.
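To illustrate the range-style check described above, here is a minimal sketch rather than the code from the patch. The class name, the `assertInRange` helper, and the `revokedCount` counter are all hypothetical stand-ins; the actual test uses its own fields and JUnit assertions.

```java
import java.util.concurrent.atomic.AtomicLong;

public final class RangeAssertSketch {

    /** Hypothetical helper: fails unless min <= actual <= max. */
    public static void assertInRange(String what, long min, long actual, long max) {
        if (actual < min || actual > max) {
            throw new AssertionError(
                what + ": expected a value in [" + min + ", " + max + "] but was " + actual);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a value the delayed revoke can change.
        AtomicLong revokedCount = new AtomicLong(0);

        // Ordering 1: the check runs before the delayed revoke fires.
        assertInRange("revoked certs", 0, revokedCount.get(), 1);

        // Ordering 2: the delayed revoke fired first, then the check runs.
        revokedCount.incrementAndGet();
        assertInRange("revoked certs", 0, revokedCount.get(), 1);

        System.out.println("both orderings accepted");
    }
}
```

An exact-value assertion would pass only for the first ordering; the range check accepts both while still rejecting values that neither ordering can produce.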

For the testClientUpdateWithRestart failure, the updateCount value isn't tracked correctly mid-test, so it's possible for test execution to get ahead of the state the client and server should be in. updateCount is tested against 4 twice, missing an increment, which guarantees that the last two waitFor calls never wait: their predicates are already satisfied by the time they execute.
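The effect of a stale threshold can be shown with a small polling helper. This is a sketch under assumptions: `waitFor` below is a hypothetical stand-in written for this example, not the utility the test actually calls, and the `>=` predicates are assumed for illustration since the exact predicate form is not quoted here.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

public final class WaitForSketch {

    /**
     * Hypothetical polling helper: re-checks the condition every intervalMs
     * until it holds or timeoutMs elapses. Returns how many times it slept.
     */
    public static int waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        int sleeps = 0;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                throw new TimeoutException("condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
            sleeps++;
        }
        return sleeps;
    }

    public static void main(String[] args) throws Exception {
        AtomicLong updateCount = new AtomicLong(4);

        // Stale threshold: the count is already 4, so this returns without sleeping
        // and provides no synchronization with the next update.
        int stale = waitFor(() -> updateCount.get() >= 4, 10, 1000);

        // Simulated update arriving a little later on another thread.
        Thread updater = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            updateCount.incrementAndGet();
        });
        updater.start();

        // Corrected threshold: blocks until the fifth update has been observed.
        int fresh = waitFor(() -> updateCount.get() >= 5, 10, 2000);
        updater.join();

        System.out.println("stale sleeps=" + stale + ", fresh sleeps=" + fresh);
    }
}
```

With the threshold left at 4, any assertion that follows can run before the fifth update arrives, which is the kind of race the description attributes to the missed increment.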

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5319

How was this patch tested?

Manually: in GitHub CI, in my IDE, and repeatedly in loops in bash.


@adoroszlai adoroszlai left a comment


Thanks @jwminton for working on this. The changes make sense. However, I ran the tests 150 times and still got 4 failures, so further tweaks may be needed.

This one 3 times:

[ERROR] testClientUpdateWithRestart  Time elapsed: 10.373 s  <<< FAILURE!
java.lang.AssertionError: expected:<5> but was:<6>
...
  at org.apache.hadoop.hdds.scm.update.server.TestSCMUpdateServiceGrpcServer.testClientUpdateWithRestart(TestSCMUpdateServiceGrpcServer.java:268)

Another one only once:

[ERROR] testClientUpdateWithDelayedRevoke  Time elapsed: 5.263 s  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
...
  at org.apache.hadoop.hdds.scm.update.server.TestSCMUpdateServiceGrpcServer.testClientUpdateWithDelayedRevoke(TestSCMUpdateServiceGrpcServer.java:190)

https://github.com/adoroszlai/hadoop-ozone/commits/HDDS-5319-repeat

@adoroszlai

@jwminton Thanks again for the patch. I would like to propose merging your improvements, but keeping the test disabled for now. If you agree, can you please push another commit to restore @Ignore, or let me know if I should make that change?

@jwminton

jwminton commented Dec 28, 2021 via email

@adoroszlai adoroszlai merged commit a5cc886 into apache:master Dec 28, 2021