-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-12006][tests] Wait for Curator background operation finished #8046
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into it @tisonkun.
It seems the original reason for the failure is not clear yet.
Did you try to loop this test and reproduce the error?
Maybe add some log statements closer to deletingChildrenIfNeeded
to make sure that it was really called and successfully passed.
...e/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java
Outdated
Show resolved
Hide resolved
@azagrebin Thanks for your review. The reported unstable is hard to reproduce locally. I add some log statements and be sure that the execution is as described if test passed. And if not, since we see no exception there, it should be the same execution. |
@tisonkun thanks for addressing the comment. I think it would be still useful to try to reproduce it on Travis. You could have a look how unstable Kafka unit test is being looped in this commit on my branch in my Travis: azagrebin@6961315 You could add the log statements around |
I think I might have an idea where the problem comes from: The underlying problem is that we have an ongoing background operation originating from a I think Tison's fix won't fully solve the problem, because we actually deleted all zNodes at some point in time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment and Andrey's comment to check whether we can reproduce the problem on Travis.
@tillrohrmann Thanks for your analysis. It looks reasonable. For this test we can extract the TestingListener and ensure that the leader is elected. Thus operation 2 happens before operation 1.
However, in production code it is still buggy on this execute order. But if we bump ZK version to support CreateMode#CONTAINER, then the remain znode(path) should be all containers and thus we can say that they will finally get removed. For reproduce the problem, we might force same order with hook of background. But I have no idea how to ensure the background |
Further discussion happens on JIRA. As discussion there, I firstly repeat the test to try to reliably reproduce the problem on Travis. |
Reproduce the issue. Add a fix to see if it goes absent. |
@tillrohrmann done as we decided on JIRA. You might trigger extra travis builds to ensure it is the case. Theoretically we can reproduce the issue with high possibility and with the fix we should never run into that issue. |
Great @tisonkun. Just to make sure, you've been able to reproduce the problem without the fix and it is gone with the fix, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good. Waiting for your confirmation @tisonkun that you could reproduce the problem w/o the fix.
...c/test/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServicesTest.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServicesTest.java
Outdated
Show resolved
Hide resolved
reproduce the problem without the fix https://api.travis-ci.org/v3/job/512259209/log.txt https://travis-ci.org/TisonKun/flink/builds/512259207 |
Perfect, thanks @tisonkun! |
In order to wait for the NodeCache's background operation which generates the parent zNodes for the ZooKeeperLeaderRetrievalService, we wait for a new leader in the ZooKeeperHaServicesTest. This closes #8046.
In order to wait for the NodeCache's background operation which generates the parent zNodes for the ZooKeeperLeaderRetrievalService, we wait for a new leader in the ZooKeeperHaServicesTest. This closes apache#8046.
In order to wait for the NodeCache's background operation which generates the parent zNodes for the ZooKeeperLeaderRetrievalService, we wait for a new leader in the ZooKeeperHaServicesTest. This closes apache#8046.
What is the purpose of the change
Since the failing log doesn't show an exception thrown, the case should be we passed
but the znode "/" wasn't be deleted. For any reason we use
client.checkExists().forPath("/")
to ensure its deletion. Also add#guaranteed
on#deleted
to best effort delete the znode even if we failed by an retryable exception.Verifying this change
This change is already covered by existing tests, such as ZooKeeperHaServiceTest
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation
cc @tillrohrmann @GJL @aljoscha