[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase #6395

zentol · 2018-07-23T14:38:48Z

What is the purpose of the change

This PR makes a few modifications to the ZooKeeperHighAvailabilityITCase to reduce the chances for intermittent test failures and timeouts.

Changes:

1)

The test moved files out of the HA storage directory with a simple loop using File#renameTo and asserted that every move succeeded. However, since old checkpoints may be deleted asynchronously, a move can fail for a file that has already been removed.
We now use a FileVisitor and only log IOExceptions that occur while moving.
If no checkpoint file could be moved, the test will still fail.
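
For illustration, the tolerant move described above could look roughly like the following. This is a minimal sketch and not the code from this PR: the class name `CheckpointFileMover`, the method `moveAll`, and logging to `System.err` are placeholders chosen here, and it assumes the target directory lies outside the source directory.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Flink or of this PR.
final class CheckpointFileMover {

    /**
     * Moves every file found under sourceDir into targetDir.
     * Failures for individual files are logged and skipped.
     * Returns the number of files that were moved successfully.
     */
    static int moveAll(Path sourceDir, Path targetDir) throws IOException {
        AtomicInteger moved = new AtomicInteger();
        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    Files.move(file, targetDir.resolve(file.getFileName()));
                    moved.incrementAndGet();
                } catch (IOException e) {
                    // The file may have been deleted asynchronously; log and carry on.
                    System.err.println("Could not move " + file + ": " + e);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // A file that vanished between listing and visiting is not an error here.
                System.err.println("Could not visit " + file + ": " + exc);
                return FileVisitResult.CONTINUE;
            }
        });
        return moved.get();
    }
}
```

A caller would then fail the test only when `moveAll` returns 0, which matches the rule that at least one checkpoint file must have been moved.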

2)

After the checkpoint files were moved out of the HA storage directory, the job is thrown into a restart loop. To verify the restart behavior, the test polled the job state and checked for the RESTARTING and FAILING states.
Because the job is small, it is in these states only for a short time, which effectively adds a race condition. Thus this loop may run for longer than anticipated; the largest outlier I got locally was 50 seconds, which isn't that far off from the 2 minute timeout. I suspect this to be the failure cause raised in the JIRA, but I can't guarantee it.
Instead we now access the fullRestarts metric using a custom reporter to check how many restarts have occurred. The actual state transitions should be irrelevant to the test.
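
As a rough sketch of why a counter is easier to wait on than a transient state: a restart count only ever grows, so a poll cannot miss it the way it can miss a brief RESTARTING state. The example below is not the PR's implementation. `RestartWaiter` and `awaitRestarts` are hypothetical names, and the `LongSupplier` stands in for whatever the custom reporter exposes for the fullRestarts metric.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Flink or of this PR.
final class RestartWaiter {

    /**
     * Polls the supplied restart count until it reaches expectedRestarts
     * or the timeout elapses. Returns true if the count was reached in time.
     */
    static boolean awaitRestarts(LongSupplier fullRestarts, long expectedRestarts, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (fullRestarts.getAsLong() < expectedRestarts) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before enough restarts were observed
            }
            try {
                Thread.sleep(50); // polling interval, chosen arbitrarily for this sketch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

The check succeeds whenever the count has reached the threshold, no matter how quickly the job passed through its intermediate states.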

StefanRRichter · 2018-07-27T09:43:48Z

...tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java

@@ -107,6 +119,8 @@ public static void setup() throws Exception {
 		config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, zkServer.getConnectString());
 		config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");

+		config.setString(ConfigConstants.METRICS_REPORTER_PREFIX + "restarts." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, RestartReporter.class.getName());


Break down line in two.

StefanRRichter

LGTM 👍

[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase

b8827dc

StefanRRichter reviewed Jul 27, 2018

StefanRRichter approved these changes Jul 27, 2018

split line

8a6e5d9

zentol merged commit a278d59 into apache:master Jul 30, 2018

zentol deleted the 9900 branch July 30, 2018 08:49

rmetzger added component=Runtime/Coordination component=Tests labels Mar 18, 2019
