[FLINK-33053][runtime] Refactoring of workaround to fix thread leakage on the ZooKeeper server side. #23773
Conversation
@@ -69,8 +68,40 @@ public class ZooKeeperLeaderRetrievalDriver implements LeaderRetrievalDriver {

    private final FatalErrorHandler fatalErrorHandler;

    /**
     * Each {@code ZooKeeperLeaderRetrievalDriver} has its own watcher initialized. There is a bug
This change is meant as a draft and should rather serve as a basis for a discussion about whether FLINK-33053 is actually an issue that should be addressed in Flink itself. As far as I understand, the thread leakage only happens in test code when the ZooKeeper test server implementation is set up.
@KarmaGYZ Is it correct that the thread leakage happens in ZooKeeper's TestServer implementation? 🤔
Sorry, I'm not familiar with it. At least I can't find a public API to retrieve the watch info. Maybe you can verify with this guide to see whether we can get the watches of the TestServer and whether there is leakage.
I'm just asking because you mentioned in the description of FLINK-33053 that you observed the issue in stress tests. Did you use the MiniCluster and ZooKeeper in a single JVM or did you have your ZooKeeper deployed separately for the stress tests?
I'd like to understand whether it's on the ZooKeeper side (that's how it sounds to me right now) or on the Flink/Curator side. If I misunderstood the discussion in FLINK-33053 and the related PR #23415 and it's not a ZooKeeper server issue, fixing it on the Flink side is reasonable.
No, I did the stress tests with a real ZK cluster. So I'm afraid we might need an e2e test for it.
The leakage happens on the ZK side. Your fix looks reasonable to me, at least for the current HA mechanism.
hm, ok ... thanks for the clarification. But that brings me to the point where I would be in favor of not fixing it on Flink's side at all. It needs to be addressed in ZooKeeper (as @tisonkun pointed out with his reference to ZOOKEEPER-4625). I'm curious to hear your take on this.
I think the core issue is that a specific component should take responsibility for managing the lifecycle of watches. This component can be ZooKeeper, Curator, or Flink. ZK could provide a "closeWatch" interface to reconcile watch references on the server side. Similarly, Curator and Flink can proactively handle this task and clean up by using "closeAll", just like what you did in this PR.
I personally believe that letting ZK handle this task is more reasonable. However, considering the release cycles of ZK and the need to maintain compatibility with multiple versions of ZK in Flink, we may need to proactively manage the lifecycle of watches within Flink for the foreseeable future.
// Ignore the no watcher exception as it's just a safetynet to fix watcher leak issue.
// For more details, please refer to FLINK-33053.
if (client.getZookeeperClient().isConnected()
        && watcherReferenceCounter.getAndIncrement() == 1) {
getAndDecrement?
yikes, I introduced that one when I did another code change. Good catch 👍 That proves once more that a proper test should be added to cover this code path. 🤔
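The guard being discussed — only the caller that releases the last reference should trigger the watcher cleanup — can be sketched as follows. The class and method names here are hypothetical simplifications for illustration, not the actual Flink code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reference-counting guard: each driver
// increments the shared counter when it registers its watcher; close()
// decrements it, and only the caller that releases the final reference
// (counter was 1 *before* the decrement) should run the removeAll cleanup.
class WatcherRefCount {
    private final AtomicInteger watcherReferenceCounter = new AtomicInteger();

    // Called when a new driver registers its watcher.
    void open() {
        watcherReferenceCounter.incrementAndGet();
    }

    // Returns true iff this call released the final reference, i.e. the
    // caller is responsible for removing the remaining watches.
    boolean close() {
        // getAndDecrement() == 1 means no other driver is still open.
        // Using getAndIncrement() here (the slip caught in this review)
        // would grow the counter on close and never release it.
        return watcherReferenceCounter.getAndDecrement() == 1;
    }
}
```

With two open drivers, the first `close()` returns `false` and the second returns `true`, so the cleanup runs exactly once.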
...ain/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriverFactory.java
Thanks for the fix, @XComp. Great job!
…on ZooKeeperExtension
…e on the ZooKeeper server side.
(force-pushed from e82c867 to e159012)
What is the purpose of the change
This is a follow-up of PR #23415 which tries to come up with a more general solution for working around the thread leakage on the ZooKeeper server side. The motivation was that the approach introduced in #23415 changed some visibility constraints that are not really desired (making the `ZooKeeperLeaderRetrievalDriver` aware of the `ResourceManager`).

This change is meant as a draft and should rather serve as a basis for a discussion about whether FLINK-33053 is actually an issue that should be addressed in Flink itself. As far as I understand, the thread leakage only happens in test code when the ZooKeeper test server implementation is set up.
Brief change log

A reference counter was introduced in the `ZooKeeperLeaderRetrievalFactory` that counts the number of instances that are created by this factory instance. This reference counter is passed into all driver instances and is used to decide whether the `removeAll` call should be performed during `close()`.
.Verifying this change
I tried to come up with a test to verify the behavior. But it's hard to test because the thread leakage actually happens in the ZooKeeper server itself.
Does this pull request potentially affect one of the following parts:

`@Public(Evolving)`: no

Documentation