
[SPARK-10911] Executors should System.exit on clean shutdown. #9946

Closed
wants to merge 1 commit

Conversation


@zhuoliu zhuoliu commented Nov 24, 2015

Call system.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.
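The JVM behavior at issue can be illustrated with a minimal, self-contained Java sketch (hypothetical code, not Spark's; the sleeping thread stands in for user-spawned work):

```java
public class NonDaemonDemo {
    public static void main(String[] args) {
        // A non-daemon thread spawned by user code; by default the JVM
        // stays alive until every non-daemon thread has finished.
        Thread userThread = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // simulates a thread that never finishes
            } catch (InterruptedException ignored) {
            }
        });
        userThread.start(); // non-daemon unless setDaemon(true) was called

        System.out.println("end of main()");
        // Without this explicit exit, the process would live forever waiting
        // on userThread -- which is what happened when YARN failed to kill
        // the container (HADOOP-12441).
        System.exit(0);
    }
}
```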

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Nov 24, 2015

Test build #46627 has finished for PR 9946 at commit 2869d21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Nov 25, 2015

See the JIRA. I still don't believe this is safe. At the least, it causes more problems than it seems to solve.

@tgravescs
Contributor

@srowen do you know of any actual use cases this will break? We've finished running the user code and are exiting anyway, so things should just be shutting down. At this point it shouldn't be committing anything, and under normal circumstances if it takes too long it's going to get shot anyway.

If we know it will affect things, I'm fine with leaving it out, because under normal circumstances the cluster manager handles removing these. But this particular condition happened when there was an issue with the cluster manager as well. Leaving tasks around is bad, and anything we can do to protect the cluster from user code is, in my opinion, good.

@srowen
Member

srowen commented Dec 1, 2015

But isn't the scenario here that the user app isn't done, because it spawned non-daemon threads that are doing something? I agree it's not good practice, but if apps avoided doing this entirely we'd have no problem to begin with. The question is what to do if such a user thread does exist and runs long.

Killing it immediately could cause problems, and it's not terribly theoretical: imagine persisting data to disk or writing to a socket and failing to write all the bytes. Not-killing it of course opens the possibility that the thread never stops at all. Which is worse? The long-running user thread is the app's "fault" and is pretty easy to debug by looking at a stack dump. On the other hand killing threads straight away could cause problems for a sort of reasonably behaving app. (Also imagine this non-daemon thread could be in library code.)

Maybe some kind of timeout mostly mitigates the issue, but then I think it's just trying to save a fairly clearly misbehaving app from itself at a non-trivial cost.

You're always going to have the possibility of a stuck process (deadlock, infinite loop in a driver, etc) and need to be able to kill that if needed as an admin.

@tgravescs
Contributor

Well, the users shouldn't have been using non-daemon threads in the first place: they spawned them while a task was running and then never cleaned them up. We asked them to change that. I can't think of any scenario where the user should be relying on anything running past the point when the user's task code exits, as it would just be a natural race condition: the task finishes and the driver could kill the container, or the app could finish and YARN could kill the container, etc. Whatever they are doing may or may not finish anyway.

@vanzin
Contributor

vanzin commented Dec 3, 2015

@zhuoliu given the discussion in SPARK-10911, could you write a proper PR description and reference HADOOP-12441, which is why this is needed on the YARN side?

@tgravescs
Contributor

@vanzin So the bug is one thing, but it can actually happen in other ways too. For instance, if someone messes up the node while upgrading the NM during a rolling upgrade. Or it could happen if there is a bug in another resource manager (standalone, Mesos, etc.). I'm fine with updating the description, but I don't think it's limited to HADOOP-12441. It should call System.exit to make sure the user's process really exits.

@vanzin
Contributor

vanzin commented Dec 3, 2015

Sure, it's more for completeness: explaining why you need to explicitly exit, since I'd expect the cluster manager to kill those containers when the app finally finishes, yet there are cases when that doesn't seem to happen.

@andrewor14
Contributor

I agree with @srowen. Just ending the executor can introduce regressions in behavior that are difficult to debug. If the concern is that executor processes never die, then setting a reasonable timeout before calling System.exit(0) should do it. E.g., we probably don't expect anyone to want to leave their executors around for more than 2 hours after they have already stopped contacting the driver.
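The timeout idea could be sketched roughly like this; the code is hypothetical (not what Spark does), and the name ExitWatchdog and the grace period are illustrative:

```java
import java.util.concurrent.TimeUnit;

public class ExitWatchdog {
    /** Arms a daemon thread that forces JVM exit after a grace period. */
    public static Thread arm(long graceMillis) {
        Thread watchdog = new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(graceMillis);
            } catch (InterruptedException ignored) {
                return; // watchdog cancelled; let the JVM exit on its own
            }
            // Grace period elapsed with the JVM still up: force exit,
            // killing any lingering non-daemon user threads.
            System.exit(0);
        });
        // Daemon, so the watchdog itself never keeps the JVM alive.
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }
}
```

The trade-off discussed below applies: the grace period delays resource release for every well-behaved executor, while a misbehaving thread that outlives the grace period still gets killed abruptly.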

@zhuoliu
Author

zhuoliu commented Jan 20, 2016

@vanzin @srowen @andrewor14, I think your concerns make sense. Also, adding a timeout has another issue: resources cannot be released immediately even after tasks have completed. So for now, we may just close this pull request and the issue, and reopen it if the same problem recurs for other users.

@srowen
Member

srowen commented Jan 20, 2016

@zhuoliu OK, though you will have to close it.

@zhuoliu zhuoliu closed this Jan 20, 2016
@zhuoliu
Author

zhuoliu commented Jan 20, 2016

Thanks, closed.

@tgravescs
Contributor

Just to point out: I may re-open this. I would still rather fix this, a known bug that I can reproduce, than worry about a theoretical case where it might break something. Unless, of course, someone can give me a concrete use case that this breaks.

At the point we call System.exit here all user code is done and we are terminating. If there is something you know of that the user should be allowed to do after this then we should create something like a shutdown hook so the user can properly clean it up.

A timeout would probably work, but I'm not fond of it: it is going to delay shutdown for everyone and you don't give the resources back as quickly. You could also end up in the same situation where, if you choose a timeout of, say, 3 seconds, and this theoretical user code didn't finish within those 3 seconds, you would kill it anyway.

Actually, thinking about this more, maybe this is the perfect time, as we can put it in 2.0, where a change in behavior is more reasonable. Thoughts?

@tgravescs
Contributor

Also note that System.exit still goes through the normal JVM shutdown hooks, etc.
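That point can be verified with a small standalone sketch (illustrative, not Spark code): the hook registered below runs whether the JVM exits normally or via System.exit, while a still-running non-daemon thread is simply terminated without cleanup.

```java
public class ShutdownHookDemo {
    public static void main(String[] args) {
        // Shutdown hooks run on any orderly JVM exit, including System.exit
        // (but not on kill -9).
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("shutdown hook ran")));

        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // pretends to be lingering user work
            } catch (InterruptedException ignored) {
            }
        });
        worker.start(); // non-daemon: would block a normal JVM exit

        System.out.println("calling System.exit(0)");
        System.exit(0); // hook runs; worker is killed mid-sleep with no cleanup
    }
}
```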

@vanzin
Contributor

vanzin commented Jan 20, 2016

I'm not against this change; in fact, this is already the current behavior in healthy YARN installs (since YARN will kill your container). I just wanted the change description to be more descriptive of the underlying reason why the exit is being explicitly added.

I also don't buy the existence of any valid use case for a user thread to block container exit, exactly for the reason above.

@tgravescs
Contributor

@andrewor14 @srowen thoughts or concerns?

@srowen
Member

srowen commented Jan 20, 2016

I disagree with "At the point we call System.exit here all user code is done and we are terminating". It should ideally be so, but what happens when it isn't? A user shutdown hook could be the thing still executing (I think?), so I don't know that that's a solution.

It may be a common sin among fairly righteous apps, as I can imagine all kinds of third party libraries taking some short time to shutdown their pools and flush their whatever to disk. Uncommonly, something pathologically runs forever -- bad app. There's no way to distinguish them. Killing the JVM risks some bad end for the app's cleanup; not killing it risks a stuck JVM. A timeout doesn't solve it; the implicit timeout of 0 here is the most extreme choice in one direction. My gut is that it's better to avoid possibly harming several fairly innocent apps with this behavior change, understanding it means more manual work to chase down and kill the occasional errant bad app.

I'm OK being out-voted too, but given the discussion so far that remains my take.

@vanzin
Contributor

vanzin commented Jan 20, 2016

It should ideally be so, but, what happens when it isn't?

What happens is that the user code will die anyway because YARN will kill the container. The explicit exit here is to work around a bug in certain versions of YARN (and some broken cluster configurations, e.g., unhealthy NMs).

@srowen
Member

srowen commented Jan 20, 2016

Actually, what's the difference between letting main complete normally (in which case Java already exits with status 0, right?) and exiting explicitly at the end of main with status 0? You still get shutdown hooks, but System.exit kills the non-daemon threads?

I see you're saying it isn't really a behavior change. In the scenario here, the JVM would still be running because some non-daemon thread is alive, so how does YARN decide to forcibly kill it? Is it because the app has already told YARN it's done but hasn't completed?

There still seems to be a material difference between always immediately killing threads at the end, and maybe getting killed some short time later, since the former will always occur in normal operation. Could we emulate the delay as well as better-than-nothing?

@vanzin
Contributor

vanzin commented Jan 20, 2016

You still get shutdown hooks but System.exit kills the non-daemon threads?

You get shutdown hooks when the JVM exits (no matter how, except kill -9, of course). The explicit System.exit is to kill non-daemon threads.

it's because the app has already told YARN it's done but hasn't completed?

Precisely.

Could we emulate the delay as well as better-than-nothing?

Why? What would that really help with? Any user application relying on "container stays alive after YARN application is deemed finished" is doing something wrong, and cannot assume anything about how long containers will live after the app is finished.

I still haven't seen a single, valid use case for not killing these non-daemon threads.

@srowen
Member

srowen commented Jan 21, 2016

I'm not suggesting that behavior is correct, but equally you could say an app that never terminates is doing something wrong, or a YARN that doesn't stop it is doing something wrong. The lesser-of-two-evils 'use case' is as I say above: some app's cleanup fails, not infrequently leaving an inconsistent state, because it was immediately halted; that has its cost, as does letting an app get stuck. Simply: which is worse?

@vanzin
Contributor

vanzin commented Jan 21, 2016

a YARN that doesn't stop it is doing something wrong

But that's the case. It's a bug in YARN. See https://issues.apache.org/jira/browse/HADOOP-12441.

some app's cleanup fails not infrequently in an inconsistent state since it was immediately halted

If the app fails in that way, it is broken. The cleanup should be done before the driver's main() returns; otherwise, the app has relinquished control back to Spark / the cluster manager, and anything can happen.

@tgravescs
Contributor

So, I haven't seen any use cases this would break.

I would argue that if they are relying on this behavior, it's a bug in the user code. They should be using a shutdown hook or similar, and that still works with System.exit. If they are doing this now, they are getting nondeterministic results, because YARN and other cluster managers could shoot them at any time.

Personally I think inconsistent behavior is worse as it is much harder to debug.

@srowen
Member

srowen commented Jan 21, 2016

I don't think this is quite addressing my point, but at best I'm asserting there's a choice between which "bad" behavior you want to deal with, not that either behavior is "OK". Exiting immediately might harm apps that are unintentionally bad (e.g. third party libs flushing); then again tackling stuck apps is protecting the cluster. I'd just reassert the above, but this is a matter of judgment and being outvoted seems OK to me.

@vanzin
Contributor

vanzin commented Jan 21, 2016

Sorry Sean, I don't see how we're not addressing your comment. Without the change, the behavior you're concerned about already exists, because YARN kills containers when that bug is not present. It's, for all practical effects, exactly the same as calling System.exit explicitly.

@srowen
Member

srowen commented Jan 21, 2016

Sure, but it becomes much more likely to bite if you always kill the threads immediately, while they're still running, rather than have them killed a little bit later by YARN. Are you saying YARN immediately kills the container here? Then in practice I agree. Otherwise this seems like a coherent concern; you may not think it's worth worrying about, but it's not hard to grok.

@vanzin
Contributor

vanzin commented Jan 21, 2016

Are you saying YARN immediately kills the container here?

I'm saying that it's the user's fault if his application depends on that non-predictable behavior. We're talking milliseconds here, that might be affected by everything from YARN's internal code to network delays to kernel scheduling. There's absolutely no argument for someone depending on that for the correct behavior of their application.

@tgravescs
Contributor

@zhuoliu please re-open this.

@zhuoliu
Author

zhuoliu commented Jan 21, 2016

Now reopen.

@zhuoliu zhuoliu reopened this Jan 21, 2016
@srowen
Member

srowen commented Jan 21, 2016

I'm on board with this if it's true that YARN virtually immediately kills a JVM like this if it's not done by the time the NM thinks it's done. Then indeed regardless of what it does to an app that's tardy in cleaning up, it's not a materially different behavior. If it helps handle another bad app behavior, there's no downside to doing that.

@tgravescs
Contributor

+1. I'll give this a bit for others to look at and make sure we are done discussing.

@vanzin
Contributor

vanzin commented Jan 21, 2016

I still would like to see a better commit message (the link to JIRA is implied by the title so is redundant in a commit message).

@zhuoliu
Author

zhuoliu commented Jan 21, 2016

Hi @vanzin , do we want to amend the commit message to something like this?
"Call system.exit explicitly to make sure non-daemon user threads terminate.
Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441."

@vanzin
Contributor

vanzin commented Jan 21, 2016

@zhuoliu that sounds great. Just edit your very first comment (that's the commit message).

@zhuoliu
Author

zhuoliu commented Jan 21, 2016

Thanks @vanzin , commit message updated.

@vanzin
Contributor

vanzin commented Jan 21, 2016

Huh, no. You have to edit the very first comment on this page, not fix the commit message on your github branch.

@zhuoliu
Author

zhuoliu commented Jan 21, 2016

Sorry for that. Just updated the first comment and changed the commit message back to original.

@vanzin
Contributor

vanzin commented Jan 21, 2016

Great, thanks! LGTM.

@SparkQA

SparkQA commented Jan 22, 2016

Test build #49891 has finished for PR 9946 at commit a3c5064.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in ae0309a Jan 26, 2016
@yhuai
Contributor

yhuai commented Jan 26, 2016

Can you make a comment to give a summary of the discussion? I am not sure that the commit message has enough info.

@zhuoliu
Author

zhuoliu commented Jan 26, 2016

Sure.
A brief summary (a few conclusive points picked from above):

"At the point we call System.exit here all user code is done and we are terminating. If there is something you know of that the user should be allowed to do after this then we should create something like a shutdown hook so the user can properly clean it up.
A timeout would probably work, but I'm not fond of it: it is going to delay shutdown for everyone and you don't give the resources back as quickly. You could also end up in the same situation where, if you choose a timeout of, say, 3 seconds, and this theoretical user code didn't finish within those 3 seconds, you would kill it anyway."

"I also don't buy the existence of any valid use case for a user thread to block container exit."

"I'm saying that it's the user's fault if his application depends on that non-predictable behavior. We're talking milliseconds here, that might be affected by everything from YARN's internal code to network delays to kernel scheduling. There's absolutely no argument for someone depending on that for the correct behavior of their application."

"I'm on board with this if it's true that YARN virtually immediately kills a JVM like this if it's not done by the time the NM thinks it's done. Then indeed regardless of what it does to an app that's tardy in cleaning up, it's not a materially different behavior. If it helps handle another bad app behavior, there's no downside to doing that."
