
[FLINK-33053][runtime] Refactoring of workaround to fix thread leakage on the ZooKeeper server side. #23773

Draft · wants to merge 3 commits into base: master

Conversation

@XComp (Contributor) commented Nov 22, 2023

What is the purpose of the change

This is a follow-up to PR #23415 that tries to come up with a more general workaround for the thread leakage on the ZooKeeper server side. The motivation is that the approach introduced in #23415 changed visibility constraints in an undesirable way (it made the ZooKeeperLeaderRetrievalDriver aware of the ResourceManager).

This change is meant as a draft and should rather serve as a basis for discussing whether FLINK-33053 is actually an issue that should be addressed in Flink itself. As far as I understand, the thread leakage only happens in test code when the ZooKeeper test server implementation is set up.

Brief change log

  • Introduced a reference counter in the ZooKeeperLeaderRetrievalFactory that counts the number of driver instances created by this factory instance. The counter is passed into all driver instances and is used to decide whether the removeAll call should be performed during close() (see the sketch below).
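As a rough illustration of the mechanism, here is a minimal sketch with hypothetical class names (not the actual Flink classes); only the counter handling mirrors the description above:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for the factory and driver described above.
class DriverFactory {
    // Shared by all drivers created by this factory instance.
    private final AtomicInteger watcherReferenceCounter = new AtomicInteger(0);

    Driver createDriver() {
        watcherReferenceCounter.incrementAndGet();
        return new Driver(watcherReferenceCounter);
    }
}

class Driver implements AutoCloseable {
    private final AtomicInteger watcherReferenceCounter;

    Driver(AtomicInteger watcherReferenceCounter) {
        this.watcherReferenceCounter = watcherReferenceCounter;
    }

    @Override
    public void close() {
        // Only the last driver to close performs the removeAll call; earlier
        // drivers leave the watches in place for the remaining instances.
        if (watcherReferenceCounter.decrementAndGet() == 0) {
            // the server-side removeAll would be issued here
            // (see the diff hunks below)
        }
    }
}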

Verifying this change

I tried to come up with a test to verify the behavior, but it is hard to test because the thread leakage actually happens in the ZooKeeper server itself.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@@ -69,8 +68,40 @@ public class ZooKeeperLeaderRetrievalDriver implements LeaderRetrievalDriver {

private final FatalErrorHandler fatalErrorHandler;

/**
* Each {@code ZooKeeperLeaderRetrievalDriver} has its own watcher initialized. There is a bug
@XComp (Contributor, Author):
> This change is meant as a draft and should rather serve as a basis for discussing whether FLINK-33053 is actually an issue that should be addressed in Flink itself. As far as I understand, the thread leakage only happens in test code when the ZooKeeper test server implementation is set up.

@KarmaGYZ Is it correct that the thread leakage happens in ZooKeeper's TestServer implementation? 🤔

@KarmaGYZ (Contributor):
Sorry, I'm not familiar with it. At least I can't find a public API to retrieve the watch info. Maybe you can verify with this guide to see whether we can get the watches of TestServer and whether there is any leakage.

@XComp (Contributor, Author):
I'm just asking because you mentioned in the description of FLINK-33053 that you observed the issue in stress tests. Did you use the MiniCluster and ZooKeeper in a single JVM or did you have your ZooKeeper deployed separately for the stress tests?

I'd like to understand whether it's on the ZooKeeper side (that's how it sounds to me right now) or on the Flink/Curator side. If I misunderstood the discussion in FLINK-33053 and the related PR #23415 and it's not a ZooKeeper server issue, fixing it on the Flink side is reasonable.

@KarmaGYZ (Contributor):
No, I did the stress tests with a real ZK cluster, so I'm afraid we might need an e2e test for it.
The leakage happens on the ZK side. Your fix looks reasonable to me, at least for the current HA mechanism.

@XComp (Contributor, Author) commented Nov 23, 2023:

Hm, ok ... thanks for the clarification. But that brings me to the point where I would be in favor of not fixing it on Flink's side at all; it needs to be addressed in ZooKeeper (as @tisonkun pointed out with his reference to ZOOKEEPER-4625). I'm curious to hear your take on this.

@KarmaGYZ (Contributor):
I think the core issue is that a specific component should take responsibility for managing the lifecycle of watches. This component can be ZooKeeper, Curator, or Flink. ZK could provide a "closeWatch" interface to reconcile watch references on the server side. Similarly, Curator or Flink can proactively handle this task and clean up by using removeAll, just like what you did in this PR.
I personally believe that letting ZK handle this task is more reasonable. However, considering the release cycles of ZK and the need to maintain compatibility with multiple ZK versions in Flink, we may need to proactively manage the lifecycle of watches within Flink for the foreseeable future.
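For illustration, the proactive client-side variant could be sketched as below. This assumes ZooKeeper's removeAllWatches API (available since ZooKeeper 3.5); the class name and path parameter are placeholders, not the actual Flink code:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: ask the server to drop every watch registered on a path,
// regardless of which client-side watcher object registered it.
class WatchCleanup {
    static void removeServerSideWatches(ZooKeeper zk, String path)
            throws InterruptedException, KeeperException {
        try {
            zk.removeAllWatches(path, Watcher.WatcherType.Any, false);
        } catch (KeeperException.NoWatcherException e) {
            // No watch is registered on the path; safe to ignore when this
            // runs purely as a cleanup safety net (cf. FLINK-33053).
        }
    }
}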

@flinkbot (Collaborator) commented Nov 22, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

// Ignore the no watcher exception as it's just a safetynet to fix watcher leak issue.
// For more details, please refer to FLINK-33053.
if (client.getZookeeperClient().isConnected()
&& watcherReferenceCounter.getAndIncrement() == 1) {
Reviewer (Contributor):
getAndDecrement?

@XComp (Contributor, Author):
Yikes, I introduced that one when I did another code change. Good catch 👍 That proves once more that a proper test should be added to cover this code path. 🤔
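For completeness, the corrected guard would presumably read as follows (a sketch applying the reviewer's suggestion: decrement on close, clean up only for the last live reference):

// getAndDecrement() == 1 means the counter was 1 before this call,
// i.e. this driver held the last reference to the shared watcher.
if (client.getZookeeperClient().isConnected()
        && watcherReferenceCounter.getAndDecrement() == 1) {
    // perform the server-side watch removal here
}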


@KarmaGYZ (Contributor):
Thanks for the fix, @XComp. Great job!
