[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase #6395

Merged
merged 2 commits into from
Jul 30, 2018


@zentol zentol commented Jul 23, 2018

What is the purpose of the change

This PR makes a few modifications to the ZooKeeperHighAvailabilityITCase to reduce the chances for intermittent test failures and timeouts.

Changes:

1)

The test moved files out of the HA storage directory with a simple loop using File#renameTo and enforced that every move succeeded. However, since old checkpoints may be deleted asynchronously, a move can legitimately fail.
We now use a FileVisitor and only log IOExceptions that occur while moving.
If no checkpoint file could be moved, the test still fails.
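A minimal sketch of this approach, not the code from this PR: it walks a directory with a FileVisitor, logs any IOException instead of failing, and returns the number of files actually moved so the caller can still fail when that count is zero. The class name `CheckpointMover` and the method `moveFiles` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, for illustration only.
public class CheckpointMover {

    /** Moves all regular files below sourceDir into targetDir and returns how many were moved. */
    public static int moveFiles(Path sourceDir, Path targetDir) throws IOException {
        AtomicInteger moved = new AtomicInteger();

        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    // Preserve the relative layout below the target directory.
                    Path target = targetDir.resolve(sourceDir.relativize(file));
                    Files.createDirectories(target.getParent());
                    Files.move(file, target);
                    moved.incrementAndGet();
                } catch (IOException e) {
                    // The file may have been deleted asynchronously; log and carry on.
                    System.err.println("Could not move " + file + ": " + e);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // A file that vanished between listing and visiting is not an error here.
                System.err.println("Could not visit " + file + ": " + exc);
                return FileVisitResult.CONTINUE;
            }
        });

        return moved.get();
    }
}
```

A test using such a helper would assert that the returned count is greater than zero rather than asserting that each individual move succeeded.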

2)

After the checkpoint files are moved out of the HA storage directory, the job is thrown into a restart loop. To verify the restart behavior, the test polled the job state and checked for the RESTARTING and FAILING states.
Because the job is small, it stays in these states only briefly, which effectively introduces a race condition. Thus this loop may run for longer than anticipated; the largest outlier I got locally was 50 seconds, which isn't that far off from the 2 minute timeout. I suspect this to be the failure cause raised in the JIRA, but I can't guarantee it.
Instead, we now access the fullRestarts metric using a custom reporter to check how many restarts have occurred. The actual state transitions should be irrelevant to the test.
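The idea can be sketched without any Flink dependency. The following is a simplified stand-in, not Flink's MetricReporter interface and not the RestartReporter from this PR: a reporter-like class captures the gauge registered under the name fullRestarts, and the test waits until that monotonically increasing count reaches a threshold instead of trying to observe short-lived job states. The class `RestartWatcher`, its methods, and the use of `LongSupplier` as the gauge type are all assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a metric reporter; it does NOT implement Flink's API.
public class RestartWatcher {

    // Static so that a test can read a value captured by a framework-created instance.
    private static volatile LongSupplier fullRestarts;

    /** Called for every registered metric; only the restart counter is kept. */
    public static void register(String metricName, LongSupplier gauge) {
        if ("fullRestarts".equals(metricName)) {
            fullRestarts = gauge;
        }
    }

    /** Blocks until at least the expected number of restarts was reported, or times out. */
    public static void awaitRestarts(long expected, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            LongSupplier gauge = fullRestarts;
            // The count only grows, so a slow poll cannot miss a restart.
            if (gauge != null && gauge.getAsLong() >= expected) {
                return;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Fewer than " + expected + " restarts observed");
            }
            Thread.sleep(10);
        }
    }
}
```

Polling a counter is robust where polling a state is not: a transient RESTARTING state can come and go between two polls, whereas a restart count that has already increased stays increased.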

@@ -107,6 +119,8 @@ public static void setup() throws Exception {
config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, zkServer.getConnectString());
config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");

config.setString(ConfigConstants.METRICS_REPORTER_PREFIX + "restarts." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, RestartReporter.class.getName());
Break this line into two.

@StefanRRichter StefanRRichter left a comment

LGTM 👍
