[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase #6395
What is the purpose of the change
This PR makes a few modifications to the ZooKeeperHighAvailabilityITCase to reduce the chances of intermittent test failures and timeouts.
Changes:
1)
The test was moving files out of the HA storage directory with a simple loop using File#renameTo. The test enforced that the moving is successful; however, since old checkpoints may be deleted asynchronously, this may not always be the case. We now use a FileVisitor and only log IOExceptions that occur while moving. If no checkpoint file could be moved, the test will still fail.
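For illustration, the tolerant move could look roughly like the sketch below. This is not the code from the PR: the class name CheckpointMover, the method moveFiles, and the logging to System.err are hypothetical, and the sketch assumes the target directory lies outside the source tree. It only shows the general pattern of walking the directory with a FileVisitor, logging any IOException, and letting the caller fail when nothing was moved.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; names are illustrative and not taken from the PR. */
public class CheckpointMover {

    /**
     * Moves every regular file below sourceDir into targetDir (flattened by file name)
     * and returns how many files were moved. Failures for individual files are logged
     * and skipped, since a file may be deleted concurrently.
     */
    public static int moveFiles(Path sourceDir, Path targetDir) throws IOException {
        AtomicInteger moved = new AtomicInteger();

        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    // Assumption: file names are unique; duplicates would be overwritten.
                    Files.move(
                        file,
                        targetDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                    moved.incrementAndGet();
                } catch (IOException e) {
                    // Log only; the file may have been removed asynchronously.
                    System.err.println("Could not move " + file + ": " + e);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // The file vanished or became unreadable before it was visited.
                System.err.println("Could not visit " + file + ": " + exc);
                return FileVisitResult.CONTINUE;
            }
        });

        return moved.get();
    }
}
```

A caller would then assert that the returned count is greater than zero, which keeps the "fail if nothing could be moved" guarantee while tolerating individual files that disappear mid-walk.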
2)
After the checkpoint files were moved out of the HA storage directory, the job is thrown into a restart loop. To verify the restart behavior, the test was polling the job state and checked for the RESTARTING and FAILING states. Because the job is small, it is in these states only for a short time, which effectively adds a race condition. Thus this loop may run for longer than anticipated; the largest outlier I got locally was 50 seconds, which isn't that far off from the 2 minute timeout. I suspect this to be the failure cause raised in the JIRA, but I can't guarantee it.
Instead we now access the fullRestarts metric using a custom reporter to check how many restarts have occurred. The actual state transitions should be irrelevant to the test.
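The difference between polling short-lived states and polling a restart count can be sketched in plain Java. The snippet below is a stand-in, not the PR's implementation: RestartWaiter, waitForRestarts, and the LongSupplier that plays the role of the fullRestarts gauge are all hypothetical, and no Flink API is used. The point is that a monotonically increasing counter cannot be "missed" between two polls the way a brief RESTARTING or FAILING state can.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

/** Hypothetical helper; the supplier stands in for a gauge exposed by a test reporter. */
public class RestartWaiter {

    /**
     * Polls the restart counter until it reaches at least expectedRestarts,
     * or throws a TimeoutException once the timeout has elapsed.
     */
    public static long waitForRestarts(
            LongSupplier fullRestarts, long expectedRestarts, Duration timeout)
            throws InterruptedException, TimeoutException {

        long deadlineNanos = System.nanoTime() + timeout.toNanos();

        while (true) {
            long current = fullRestarts.getAsLong();
            if (current >= expectedRestarts) {
                // The counter only grows, so a restart between polls is still observed.
                return current;
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException(
                    "Saw " + current + " restarts, expected " + expectedRestarts);
            }
            Thread.sleep(50); // polling interval chosen arbitrarily for this sketch
        }
    }
}
```

In an actual test, the supplier would read the value the custom reporter captured for the fullRestarts metric; how that value is exposed is left open here.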