
[BEAM-9474] Improve robustness of BundleFactory and ProcessEnvironment #11084

Merged
merged 1 commit into apache:master from the BEAM-9474 branch on Mar 11, 2020

Conversation

mxm
Contributor

@mxm mxm commented Mar 9, 2020

The cleanup code in DefaultJobBundleFactory and its RemoteEnvironments may leak
resources. This is especially a concern when the execution engine reuses the
same JVM or underlying machines for multiple runs of a pipeline.

Exceptions encountered during cleanup should not lead to aborting the cleanup
procedure. Not all code handles this correctly. We should also ensure that the
cleanup succeeds even if the runner does not properly close the bundle,
e.g. when an exception occurs while closing the bundle.
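
To illustrate the kind of exception-tolerant cleanup this aims for, here is a minimal Java sketch (not the actual DefaultJobBundleFactory code; the helper class and logger names are made up): every resource is closed even if an earlier close throws, and failures are only reported at the end.

import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RobustCleanup {
  private static final Logger LOG = LoggerFactory.getLogger(RobustCleanup.class);

  /** Closes every resource, logging failures instead of aborting the remaining cleanup. */
  static void closeAll(List<AutoCloseable> resources) {
    Exception firstFailure = null;
    for (AutoCloseable resource : resources) {
      try {
        resource.close();
      } catch (Exception e) {
        // Remember the first failure but keep cleaning up the remaining resources.
        if (firstFailure == null) {
          firstFailure = e;
        } else {
          firstFailure.addSuppressed(e);
        }
        LOG.warn("Failed to close {}", resource, e);
      }
    }
    if (firstFailure != null) {
      throw new RuntimeException("Cleanup completed with failures", firstFailure);
    }
  }
}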

Post-Commit Tests Status (on master branch): build-status badges for the Go, Java, Python, and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners.

Pre-Commit Tests Status (on master branch): build-status badges for the Java, Python, Go, and Website pre-commit jobs, non-portable and portable.
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@mxm
Contributor Author

mxm commented Mar 9, 2020

retest this please

@mxm mxm requested a review from tweise March 9, 2020 21:15
} else {
LOG.info("Process for worker {} still running. Killing.", id);
process.destroyForcibly();
long maxTimeToWait = 500;
Contributor

Why the timeout change?

Contributor Author

The old timeout was too long.

Contributor

How did that manifest? 500ms timeout for process termination seems too aggressive. I would prefer a longer timeout to allow for connections to be closed gracefully.

Contributor Author

The shutdown code is synchronous, so I'd prefer a shorter timeout here. It would be a good improvement to make it async.

I've tested this on a cluster with many runs and I've not seen a single instance lingering. Also I did not notice any difference in the logs. The environment will be torn down last, after all connections have been closed. So failures would not be visible anymore.

Contributor Author

@mxm mxm Mar 11, 2020

If you want, I can restore the old timeout, but I would then also change the code to make the stopping async, or at least stop all the processes at once and then wait (instead of tearing down the processes one by one and waiting for each to quit).
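
For illustration, the stop-all-then-wait variant could look roughly like this (a sketch only, not the code in this PR; the map of worker processes and the timeout value are assumptions):

import java.util.Map;
import java.util.concurrent.TimeUnit;

class ParallelShutdownSketch {
  // First signal every worker process, then wait for each one, so the waits overlap
  // instead of accumulating one full timeout per process.
  static void stopAll(Map<String, Process> processes) throws InterruptedException {
    for (Process process : processes.values()) {
      process.destroyForcibly();
    }
    long maxTimeToWaitMillis = 500;
    for (Process process : processes.values()) {
      process.waitFor(maxTimeToWaitMillis, TimeUnit.MILLISECONDS);
    }
  }
}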

Contributor

If you have verified that the graceful shutdown works (in the happy path), then we are good. Maybe add a comment to the code, since all of this isn't very obvious.

Contributor Author

It's not always shutting down gracefully but that's what the change is about: removing processes and ensuring a quick recovery time. It's a trade-off. Ideally we would want to allow more time but if we wait 2 seconds with an SDK parallelism of 16, that's already more than half a minute waiting time. We really want to do the process removal in parallel. I'll look into this.

I'm not sure the ProcessManager is a good place to document the shutdown behavior. If you have any suggestions though, I'll add them here.

Contributor

Won't delay the PR for it!

@mxm
Contributor Author

mxm commented Mar 9, 2020

retest this please

@@ -352,20 +407,18 @@ public RemoteBundle getBundle(
// The blocking queue of caches for serving multiple bundles concurrently.
currentCache = availableCaches.take();
client = currentCache.getUnchecked(executableStage.getEnvironment());
client.ref();
Contributor

client.ref needs to remain here; the lines below rely on that, and it is also more readable.

Contributor Author

Please explain which lines rely on it.

Readability is highly subjective; I find it more readable not to duplicate the same code in two branches, as is currently the case (which is potentially error-prone).

Contributor Author

I checked again and couldn't see how this change alters the behavior. In any case, I don't mind moving it back to unblock this PR.

Contributor

Below, preparedClients.keySet().removeIf(c -> c.bundleRefCount.get() <= 0); removes everything that isn't referenced. These two statements are logically one unit and hence I prefer not to scatter them:

client = currentCache.getUnchecked(executableStage.getEnvironment());
client.ref();

Contributor Author

It's not true that a later ref() introduces a bug for preparedClients.keySet().removeIf(c -> c.bundleRefCount.get() <= 0); because the refcount will be >0; otherwise we wouldn't be able to retrieve the client from the cache.

In any case, I will revert this change.

Contributor

It's not true that a later ref() introduces a bug for preparedClients.keySet().removeIf(c -> c.bundleRefCount.get() <= 0); because the refcount will be >0; otherwise we wouldn't be able to retrieve the client from the cache.

Cache and environment are shared between executable stages, so the refcount can become 0 with concurrent eviction and release. That actually raises the question of whether these two statements should be atomic.

Contributor Author

Yes, that makes sense. I've already reverted the change.

I suppose there is a race condition where we retrieve an environment X and, before we can call ref() on it, we evict environment X, close all its references, and shut it down. This will result in a job restart.
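
One way to close that window, shown here only as a hypothetical sketch (the bundleRefCount field mirrors the discussion, the class name and tryRef method are made up): hand out a reference only if the count is still positive, so a concurrently evicted and released client is never reused.

import java.util.concurrent.atomic.AtomicInteger;

class ClientRefSketch {
  private final AtomicInteger bundleRefCount = new AtomicInteger(1);

  // Increment only if the client has not already been fully released;
  // a false return means the caller has to fetch a fresh client from the cache.
  boolean tryRef() {
    int current;
    do {
      current = bundleRefCount.get();
      if (current <= 0) {
        return false;
      }
    } while (!bundleRefCount.compareAndSet(current, current + 1));
    return true;
  }

  // The last release is the point where the underlying environment may be shut down.
  void unref() {
    if (bundleRefCount.decrementAndGet() == 0) {
      // close the underlying environment here
    }
  }
}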

// Ensure client is referenced for this bundle, unref in close()
client.ref();
// Cleanup list of clients which were active during eviction but now do not hold references
evictedActiveClients.removeIf(c -> c.bundleRefCount.get() <= 0);
Contributor

I'm not sure I like this in the path for every bundle. We could probably move it below line 416 (preparedClients.keySet().removeIf(c -> c.bundleRefCount.get() <= 0);)

Contributor Author

This is a cheap operation. I don't think we can move it because we need to run this cleanup for all branches.

Contributor Author

I've improved this to only remove when environment expiration is used.
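
Roughly, the guarded cleanup could look like this (a sketch under the assumption of an environmentExpirationMillis setting; not the exact code in the PR):

import java.util.Collection;
import java.util.concurrent.atomic.AtomicInteger;

class EvictionCleanupSketch {
  static class ClientRef {
    final AtomicInteger bundleRefCount = new AtomicInteger();
  }

  private final long environmentExpirationMillis;
  private final Collection<ClientRef> evictedActiveClients;

  EvictionCleanupSketch(long environmentExpirationMillis, Collection<ClientRef> evictedActiveClients) {
    this.environmentExpirationMillis = environmentExpirationMillis;
    this.evictedActiveClients = evictedActiveClients;
  }

  // Scan the evicted-but-active list only when environments can actually expire;
  // without expiration the list stays empty and the scan is pointless.
  void cleanupEvictedClients() {
    if (environmentExpirationMillis > 0) {
      evictedActiveClients.removeIf(c -> c.bundleRefCount.get() <= 0);
    }
  }
}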

@mxm
Contributor Author

mxm commented Mar 11, 2020

Thanks for your comments @tweise. Let me know if you have more questions.

@mxm mxm force-pushed the BEAM-9474 branch 2 times, most recently from 3e305c7 to 5c04950 on March 11, 2020 20:27
-    int count = bundleRefCount.decrementAndGet();
-    if (count == 0) {
+    int refCount = bundleRefCount.decrementAndGet();
+    Preconditions.checkState(refCount >= 0, "Reference count must not be negative.");
Contributor Author

FYI, I've added this check instead to check for correct bounds.

@tweise tweise merged commit d62521f into apache:master Mar 11, 2020