[FLINK-12006][tests] Wait for Curator background operation finished #8046

Closed
wants to merge 3 commits into from

tisonkun
Member

What is the purpose of the change

Since the failing log shows no exception, the test must have executed the following two lines without error:

client.delete().deletingChildrenIfNeeded().forPath("/");
zNodeDeleted = true;

but the znode "/" was not actually deleted. To be safe, we now call client.checkExists().forPath("/") to verify its deletion. We also add #guaranteed to the #delete call so that the znode is deleted on a best-effort basis even if the first attempt fails with a retryable exception.

Verifying this change

This change is already covered by existing tests, such as ZooKeeperHaServicesTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers:(no)
  • The runtime per-record code paths (performance sensitive):(no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

cc @tillrohrmann @GJL @aljoscha

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@azagrebin azagrebin left a comment


Thanks for looking into it @tisonkun.
It seems the original reason for the failure is not clear yet.
Did you try to loop this test and reproduce the error?
Maybe add some log statements closer to deletingChildrenIfNeeded to make sure that it was really called and successfully passed.

@tisonkun
Member Author

@azagrebin Thanks for your review. The reported instability is hard to reproduce locally. I added some log statements and confirmed that, when the test passes, execution proceeds as described. When it fails, we see no exception either, so the execution path should be the same.

@azagrebin
Contributor

azagrebin commented Mar 27, 2019

@tisonkun thanks for addressing the comment.

I think it would still be useful to try to reproduce it on Travis.
Do you have Travis activated for your GitHub account?
Could you try it in your Travis? I would loop the whole test suite ZooKeeperHaServicesTest.

You could have a look at how an unstable Kafka unit test is being looped in this commit on my branch in my Travis: azagrebin@6961315

You could add the log statements around deletingChildrenIfNeeded to make sure that it passes successfully, also in the case of a test failure if it is reproducible.

@tillrohrmann
Contributor

I think I might have an idea where the problem comes from: The underlying problem is that we have an ongoing background operation originating from a NodeCache which makes sure that all parent nodes are created. I think the following can happen: The NodeCache inserts the background task into the CuratorFrameworkImpl's Executor but it is not executed. Next the NodeCache is stopped (as part of stopping the owning LeaderRetrievalService). Then ZooKeeperHaServices#closeAndCleanupAllData is called. This call will remove all created zNodes. Right after removing all zNodes as part of the #deleteOwnedZNode call, the background task is started. The background task will then recreate the parent nodes which will lead to the test failure.

I think Tison's fix won't fully solve the problem, because we actually deleted all zNodes at some point in time.
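The interleaving described above can be pictured with a small, self-contained simulation. This is only a sketch under stated assumptions: it uses no Curator or ZooKeeper code, the class ZNodeRaceSketch and its methods are hypothetical, a plain in-memory set stands in for the znode tree, and a latch artificially holds the background task back so that the "queued but not yet executed" state is reproduced deterministically.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the race; not Flink or Curator code.
final class ZNodeRaceSketch {

    /** Runs the scenario and returns the simulated znodes left after cleanup. */
    static Set<String> run(boolean waitForBackgroundTask) {
        Set<String> znodes = new ConcurrentSkipListSet<>();
        znodes.add("/flink");

        // Single-threaded executor standing in for the client's background executor.
        ExecutorService background = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1); // keeps the task "queued"
        CountDownLatch created = new CountDownLatch(1); // signals that parents were created
        try {
            // Background task that (re)creates the parent nodes once it finally runs.
            background.execute(() -> {
                await(release);
                znodes.add("/flink");
                znodes.add("/flink/leader");
                created.countDown();
            });

            if (waitForBackgroundTask) {
                // Fixed order: let the background task finish, then clean up.
                release.countDown();
                await(created);
                znodes.clear();
            } else {
                // Racy order: clean up first, then the queued task runs.
                znodes.clear();
                release.countDown();
                await(created);
            }
            return znodes;
        } finally {
            background.shutdown();
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            if (!latch.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for latch");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without wait: " + run(false)); // prints [/flink, /flink/leader]
        System.out.println("with wait:    " + run(true));  // prints []
    }
}
```

In the racy order the cleanup itself succeeds and throws nothing, yet nodes remain afterwards, which matches the symptom of a failing assertion without any exception in the log.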

@tillrohrmann tillrohrmann self-assigned this Mar 27, 2019
Contributor

@tillrohrmann tillrohrmann left a comment


See my other comment and Andrey's comment to check whether we can reproduce the problem on Travis.

@tisonkun
Member Author

tisonkun commented Mar 27, 2019

@tillrohrmann Thanks for your analysis. It looks reasonable. For this test we can extract the TestingListener and ensure that the leader is elected. Thus operation 2 happens before operation 1.

 import org.apache.flink.runtime.util.ZooKeeperUtils;
 import org.apache.flink.runtime.zookeeper.ZooKeeperResource;
@@ -191,11 +192,15 @@ public class ZooKeeperHaServicesTest extends TestLogger {
                        final LeaderElectionService resourceManagerLeaderElectionService = zooKeeperHaServices.getResourceManagerLeaderElectionService();
                        final RunningJobsRegistry runningJobsRegistry = zooKeeperHaServices.getRunningJobsRegistry();
 
-                       resourceManagerLeaderRetriever.start(new TestingListener());
+                       final TestingListener listener = new TestingListener();
+
+                       resourceManagerLeaderRetriever.start(listener);
                        resourceManagerLeaderElectionService.start(new TestingContender("foobar", resourceManagerLeaderElectionService));
                        final JobID jobId = new JobID();
                        runningJobsRegistry.setJobRunning(jobId);
 
+                       listener.waitForNewLeader(2000L);
+
                        resourceManagerLeaderRetriever.stop();
                        resourceManagerLeaderElectionService.stop();
                        runningJobsRegistry.clearJob(jobId);

However, production code is still vulnerable to this execution order. If we bump the ZooKeeper version to one that supports CreateMode#CONTAINER, the remaining znodes (paths) would all be containers, so we could rely on them eventually being removed.

To reproduce the problem, we could force the same order with a hook on the background operation. However, I have no idea how to ensure that the background creatingParentContainersIfNeeded runs after the children are deleted.
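The waitForNewLeader(2000L) call in the diff blocks the test until the listener has been notified or the timeout expires. A minimal sketch of that pattern follows; WaitingListenerSketch is a hypothetical stand-in written for illustration, not Flink's TestingListener, and its method bodies and exception type are assumptions.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical listener illustrating "block until notified, or time out".
final class WaitingListenerSketch {

    private final Object lock = new Object();
    private String leaderAddress; // guarded by lock

    /** Called by the retrieval side once a leader is known. */
    void notifyLeaderAddress(String address) {
        synchronized (lock) {
            leaderAddress = address;
            lock.notifyAll();
        }
    }

    /** Blocks until a leader was announced; throws IllegalStateException on timeout. */
    String waitForNewLeader(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            // Loop to guard against spurious wake-ups.
            while (leaderAddress == null) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    throw new IllegalStateException("no leader within " + timeoutMillis + " ms");
                }
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
            return leaderAddress;
        }
    }

    public static void main(String[] args) {
        WaitingListenerSketch listener = new WaitingListenerSketch();
        // Simulates the election result arriving from another thread.
        new Thread(() -> listener.notifyLeaderAddress("foobar")).start();
        System.out.println("leader: " + listener.waitForNewLeader(2000L));
    }
}
```

Because the notification can only arrive after the retrieval side has finished its setup, returning from the wait implies that the background work is done before stop() and the cleanup run.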

@tisonkun tisonkun changed the title from "[FLINK-12006][coordination] Ensure owned znode deleted on ZooKeeperHaServices#deleteOwnedZNode" to "[FLINK-12006][tests] Wait for Curator background operation finished" Mar 27, 2019
@tisonkun
Member Author

Further discussion is happening on JIRA. As discussed there, I will first repeat the test to try to reliably reproduce the problem on Travis.

@tisonkun
Member Author

Reproduced the issue. Added a fix to see whether the failure goes away.

@tisonkun
Member Author

tisonkun commented Mar 28, 2019

@tillrohrmann Done as we decided on JIRA. You might trigger extra Travis builds to confirm. In theory, we can reproduce the issue with high probability without the fix, and with the fix we should never run into it.

@tillrohrmann
Contributor

Great @tisonkun. Just to make sure, you've been able to reproduce the problem without the fix and it is gone with the fix, right?

Contributor

@tillrohrmann tillrohrmann left a comment


Changes look good. Waiting for your confirmation @tisonkun that you could reproduce the problem w/o the fix.

@tillrohrmann
Contributor

Perfect, thanks @tisonkun!

asfgit pushed a commit that referenced this pull request Mar 29, 2019
In order to wait for the NodeCache's background operation which generates the
parent zNodes for the ZooKeeperLeaderRetrievalService, we wait for a new leader
in the ZooKeeperHaServicesTest.

This closes #8046.
@asfgit asfgit closed this in 72a6f3f Mar 29, 2019
@tisonkun tisonkun deleted the FLINK-12006 branch March 29, 2019 10:35
HuangZhenQiu pushed a commit to HuangZhenQiu/flink that referenced this pull request Apr 22, 2019
sunhaibotb pushed a commit to sunhaibotb/flink that referenced this pull request May 8, 2019