[FLINK-12006][tests] Wait for Curator background operation finished #8046

tisonkun · 2019-03-25T10:37:51Z

What is the purpose of the change

Since the failing log shows no exception, we must have executed

client.delete().deletingChildrenIfNeeded().forPath("/");
zNodeDeleted = true;

but the znode "/" was not deleted. Whatever the reason, we use client.checkExists().forPath("/") to verify its deletion. Also add #guaranteed to the #delete call so that we make a best-effort attempt to delete the znode even if we fail with a retryable exception.

Verifying this change

This change is already covered by existing tests, such as ZooKeeperHaServicesTest

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

cc @tillrohrmann @GJL @aljoscha

flinkbot · 2019-03-25T10:38:08Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

azagrebin

Thanks for looking into it @tisonkun.
It seems the original reason for the failure is not clear yet.
Did you try to loop this test and reproduce the error?
Maybe add some log statements closer to deletingChildrenIfNeeded to make sure that it was really called and successfully passed.

...e/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java

tisonkun · 2019-03-27T10:42:11Z

@azagrebin Thanks for your review. The reported instability is hard to reproduce locally. I added some log statements and confirmed that execution proceeds as described when the test passes. When it fails, since we see no exception, execution should follow the same path.

azagrebin · 2019-03-27T12:54:43Z

@tisonkun thanks for addressing the comment.

I think it would still be useful to try to reproduce it on Travis.
Do you have Travis activated for your GitHub account?
Could you try it in your Travis? I would loop the whole test suite ZooKeeperHaServicesTest.

You could have a look at how an unstable Kafka unit test is looped in this commit on my branch in my Travis: azagrebin@6961315

You could add the log statements around deletingChildrenIfNeeded to make sure that it passes successfully, also in case of a test failure if it is reproducible.

tillrohrmann · 2019-03-27T15:29:54Z

I think I might have an idea where the problem comes from: The underlying problem is that we have an ongoing background operation originating from a NodeCache which makes sure that all parent nodes are created. I think the following can happen: The NodeCache inserts the background task into the CuratorFrameworkImpl's Executor but it is not executed. Next the NodeCache is stopped (as part of stopping the owning LeaderRetrievalService). Then ZooKeeperHaServices#closeAndCleanupAllData is called. This call will remove all created zNodes. Right after removing all zNodes as part of the #deleteOwnedZNode call, the background task is started. The background task will then recreate the parent nodes which will lead to the test failure.

I think Tison's fix won't fully solve the problem, because we actually deleted all zNodes at some point in time.
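
The interleaving described above can be modelled without ZooKeeper. The following is a minimal sketch under stated assumptions, not the Flink or Curator code: an in-memory set stands in for the znode tree, a single-threaded executor stands in for Curator's background executor, a latch forces the problematic ordering, and the class name and paths are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative model only: the set stands in for ZooKeeper and the executor
// for Curator's background executor. All names and paths are hypothetical.
class BackgroundRecreateRace {

    /** Returns the paths that remain after cleanup; an empty set would be the desired result. */
    static Set<String> run() throws Exception {
        Set<String> znodes = new ConcurrentSkipListSet<>();
        ExecutorService background = Executors.newSingleThreadExecutor();
        CountDownLatch cleanupDone = new CountDownLatch(1);
        try {
            znodes.add("/flink/leader/resource_manager_lock");

            // 1. The cache hands "create parent nodes" to the background executor,
            //    but the task does not take effect yet (modelled by the latch).
            Future<?> pending = background.submit(() -> {
                cleanupDone.await();
                znodes.add("/flink");
                znodes.add("/flink/leader");
                return null;
            });

            // 2. The cache is stopped; in this model the submitted task stays pending.
            // 3. Cleanup deletes every znode.
            znodes.clear();
            cleanupDone.countDown();

            // 4. The late background task now runs and recreates the parent nodes.
            pending.get(5, TimeUnit.SECONDS);
            return znodes;
        } finally {
            background.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("znodes left after cleanup: " + run());
    }
}
```

In this model the cleanup itself succeeds, which matches the observation that no exception appears in the failing log; the leftover nodes are created only afterwards.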

tillrohrmann

See my other comment and Andrey's comment to check whether we can reproduce the problem on Travis.

tisonkun · 2019-03-27T16:18:03Z

@tillrohrmann Thanks for your analysis. It looks reasonable. For this test we can extract the TestingListener and ensure that the leader is elected. Thus operation 2 happens before operation 1.

 import org.apache.flink.runtime.util.ZooKeeperUtils;
 import org.apache.flink.runtime.zookeeper.ZooKeeperResource;
@@ -191,11 +192,15 @@ public class ZooKeeperHaServicesTest extends TestLogger {
                        final LeaderElectionService resourceManagerLeaderElectionService = zooKeeperHaServices.getResourceManagerLeaderElectionService();
                        final RunningJobsRegistry runningJobsRegistry = zooKeeperHaServices.getRunningJobsRegistry();
 
-                       resourceManagerLeaderRetriever.start(new TestingListener());
+                       final TestingListener listener = new TestingListener();
+
+                       resourceManagerLeaderRetriever.start(listener);
                        resourceManagerLeaderElectionService.start(new TestingContender("foobar", resourceManagerLeaderElectionService));
                        final JobID jobId = new JobID();
                        runningJobsRegistry.setJobRunning(jobId);
 
+                       listener.waitForNewLeader(2000L);
+
                        resourceManagerLeaderRetriever.stop();
                        resourceManagerLeaderElectionService.stop();
                        runningJobsRegistry.clearJob(jobId);

However, production code is still buggy under this execution order. But if we bump the ZK version to one that supports CreateMode#CONTAINER, the remaining znodes (paths) should all be containers, and thus we can say that they will eventually be removed.

To reproduce the problem, we might force the same order with a hook on the background operation. But I have no idea how to ensure that the background creatingParentContainersIfNeeded runs after the children are deleted.
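
The ordering that the test change relies on can be sketched in the same spirit. This is a simplified model under stated assumptions: `LatchListener` is a hypothetical stand-in and not Flink's TestingListener, the in-memory set stands in for the znode tree, and the background task is assumed to create the parent nodes before it notifies the listener.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative model only; all names and paths are hypothetical.
class WaitBeforeCleanup {

    /** Hypothetical, simplified listener that lets a test block until it is notified. */
    static final class LatchListener {
        private final CountDownLatch notified = new CountDownLatch(1);
        private volatile String leaderAddress;

        void notifyLeaderAddress(String address) {
            leaderAddress = address;
            notified.countDown();
        }

        String waitForNewLeader(long timeoutMillis) throws Exception {
            if (!notified.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("no leader within " + timeoutMillis + " ms");
            }
            return leaderAddress;
        }
    }

    /** Returns the paths that remain after cleanup. */
    static Set<String> run() throws Exception {
        Set<String> znodes = new ConcurrentSkipListSet<>();
        ExecutorService background = Executors.newSingleThreadExecutor();
        LatchListener listener = new LatchListener();
        try {
            // Background work: create the parent nodes, then notify the listener.
            background.execute(() -> {
                znodes.add("/flink");
                znodes.add("/flink/leader");
                listener.notifyLeaderAddress("foobar");
            });

            // Block until the notification arrives, so the background work has finished.
            listener.waitForNewLeader(2000L);

            // Cleanup now runs strictly after the parent nodes were created.
            znodes.clear();
            return znodes;
        } finally {
            background.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("znodes left after cleanup: " + run());
    }
}
```

The sketch only covers the test: it orders cleanup after the background work by waiting, and does nothing for a production shutdown that does not wait.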

tisonkun · 2019-03-27T16:49:05Z

Further discussion is happening on JIRA. As discussed there, I will first repeat the test to try to reliably reproduce the problem on Travis.

tisonkun · 2019-03-28T00:53:13Z

Reproduced the issue. Added a fix to see whether it goes away.

tisonkun · 2019-03-28T03:20:29Z

@tillrohrmann Done as we decided on JIRA. You might trigger extra Travis builds to confirm. In theory, we can reproduce the issue with high probability, and with the fix we should never run into it.

tillrohrmann · 2019-03-29T09:47:19Z

Great @tisonkun. Just to make sure, you've been able to reproduce the problem without the fix and it is gone with the fix, right?

tillrohrmann

Changes look good. Waiting for your confirmation @tisonkun that you could reproduce the problem w/o the fix.

...c/test/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServicesTest.java

tisonkun · 2019-03-29T10:04:39Z

@tillrohrmann

Reproduced the problem without the fix: https://api.travis-ci.org/v3/job/512259209/log.txt https://travis-ci.org/TisonKun/flink/builds/512259207
Gone with the fix: https://api.travis-ci.org/v3/job/512297425/log.txt https://travis-ci.org/TisonKun/flink/builds/512297423

tillrohrmann · 2019-03-29T10:17:42Z

Perfect, thanks @tisonkun!

In order to wait for the NodeCache's background operation which generates the parent zNodes for the ZooKeeperLeaderRetrievalService, we wait for a new leader in the ZooKeeperHaServicesTest. This closes #8046.

rmetzger added review=description? component=Runtime/Coordination labels Mar 25, 2019

tisonkun force-pushed the FLINK-12006 branch from b11e964 to 9fcc46e Compare March 25, 2019 13:04

aljoscha requested a review from tillrohrmann March 27, 2019 08:04

azagrebin reviewed Mar 27, 2019

tillrohrmann self-assigned this Mar 27, 2019

tillrohrmann requested changes Mar 27, 2019

tisonkun force-pushed the FLINK-12006 branch from b0d8398 to 034b5fb Compare March 27, 2019 16:47

tisonkun changed the title ~~[FLINK-12006][coordination] Ensure owned znode deleted on ZooKeeperHaServices#deleteOwnedZNode~~ [FLINK-12006][tests] Wait for Curator background operation finished Mar 27, 2019

[FLINK-12006][tests] Wait for Curator background operation finished

98e99c2

tisonkun force-pushed the FLINK-12006 branch from 034b5fb to 98e99c2 Compare March 27, 2019 22:31

try a fix

56f4d6a

tillrohrmann approved these changes Mar 29, 2019

Remove repetition

5760f5c

asfgit closed this in 72a6f3f Mar 29, 2019

tisonkun deleted the FLINK-12006 branch March 29, 2019 10:35
