
[SPARK-17340][YARN] cleanup .sparkStaging when app is killed by yarn #14916

Closed

Devian-ua
Contributor

What changes were proposed in this pull request?

Clean up the .sparkStaging directory whenever the Spark application exits with code 15 or 16:
EXIT_EXCEPTION_USER_CLASS = 15
EXIT_EARLY = 16

This happens when you kill spark-submit in the terminal and then run
$ yarn application -kill <app id>

How was this patch tested?

Existing tests (./dev/run-tests).
Also manually verified that the application does clean up the staging directory when it is killed this way.

@srowen
Member

srowen commented Sep 1, 2016

Jenkins test this please

@srowen
Member

srowen commented Sep 1, 2016

@jerryshao what's your view on a change like this?

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64775 has finished for PR 14916 at commit 8a1fe82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-  if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
+  if (finalStatus == FinalApplicationStatus.SUCCEEDED ||
+      exitCode == ApplicationMaster.EXIT_EARLY ||
+      exitCode == ApplicationMaster.EXIT_EXCEPTION_USER_CLASS || isLastAttempt) {
tgravescs
Contributor


You can't do this. There are various reasons these exit codes can occur, and if any of them are retryable by YARN, you are now preventing that retry from happening by unregistering. The kill may cause these exit codes, but other things could too. EXIT_EXCEPTION_USER_CLASS covers any throwable from the user code, and EXIT_EARLY is an unknown cause, so we would want to retry.

I'm fine with adding something if we know it was a kill, but I think that's hard here because YARN doesn't tell us. Ideally we would have a Spark command to kill the application nicely, and then we could do the cleanup ourselves.

The client should try to clean this up if it sees the application was killed, assuming the client is still running.
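For illustration, a minimal sketch of what that client-side cleanup might look like, assuming the client still knows the application id and the staging path; the object and method names here are hypothetical and not part of this PR:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.yarn.api.records.{ApplicationId, FinalApplicationStatus}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Hypothetical client-side helper: ask YARN for the application's final
// status and remove the staging dir if the app was killed, since the AM
// never gets a chance to clean it up in that case.
object StagingCleanup {
  def cleanupIfKilled(appId: ApplicationId, stagingDir: Path): Unit = {
    val conf = new YarnConfiguration()
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(conf)
    yarnClient.start()
    try {
      val report = yarnClient.getApplicationReport(appId)
      if (report.getFinalApplicationStatus == FinalApplicationStatus.KILLED) {
        val fs = stagingDir.getFileSystem(conf)
        fs.delete(stagingDir, true) // recursive delete of .sparkStaging/<appId>
      }
    } finally {
      yarnClient.stop()
    }
  }
}
```

Because the cleanup only happens after YARN reports the application as finished, the AM's retry semantics are left untouched.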

Contributor Author

@Devian-ua Devian-ua Sep 1, 2016


@tgravescs
What about SignalUtils.scala?
log.error("RECEIVED SIGNAL " + sig)
When we kill the app using yarn kill we get this:
ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
Can we use it to trigger the cleanup?
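For concreteness, a rough sketch of what that wiring might look like inside the AM, assuming the Spark 2.x SignalUtils.register(sig)(handler) hook, where a handler returning false falls through to the previous handler; cleanupStagingDir below is a hypothetical placeholder for the AM's existing cleanup logic:

```scala
import org.apache.spark.util.SignalUtils

// Hypothetical placeholder for the AM's existing staging-dir cleanup.
def cleanupStagingDir(): Unit = { /* delete .sparkStaging/<appId> */ }

// Run the cleanup when SIGTERM arrives; returning false lets the previous
// handler run afterwards, so the process still exits as before.
SignalUtils.register("TERM") {
  cleanupStagingDir()
  false
}
```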

tgravescs
Contributor


Unfortunately I don't think you can use that either. YARN has preemption and overcommit handling that can kill the AM, and in those cases it uses SIGTERM or SIGKILL. In those cases we want the AM to rerun.

@Devian-ua
Contributor Author

Can we just clean up previously finished apps from the .sparkStaging folder?

@tgravescs
Contributor

tgravescs commented Sep 1, 2016

We shouldn't really clean up previous apps, because there is a debug option to keep the staging dir around, i.e. you might want some of those directories to stay if you are debugging. You would also have to include some time-based parameter, and I really don't like this approach, as it puts a burden on a different application, startup cost could be affected, etc.

This really needs to be solved in YARN, with something like a cleanup task that runs after the application finishes. There is a JIRA for this in YARN, but no one has worked on it yet.

Short of that, I would say we add a Spark interface to kill the application so we can shut down cleanly and do the cleanup ourselves, versus yarn kill just shooting the app.

I know it's a bit annoying, but it's unfortunately hard to solve, which is why it's still this way. Otherwise you can obviously set up some sort of cron job to clean these directories up separately.
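Along those lines, a minimal sketch of an external periodic cleaner, entirely outside Spark; the staging root path and the age threshold are placeholder assumptions:

```scala
import java.util.concurrent.TimeUnit

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical standalone cleaner: delete staging sub-directories whose
// modification time is older than maxAgeDays. Intended to be scheduled
// externally (e.g. from cron).
object StagingDirCleaner {
  def main(args: Array[String]): Unit = {
    val stagingRoot = new Path("/user/<user>/.sparkStaging") // adjust to your setup
    val maxAgeDays = 7L
    val cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(maxAgeDays)

    val fs = FileSystem.get(new Configuration())
    if (fs.exists(stagingRoot)) {
      fs.listStatus(stagingRoot)
        .filter(status => status.isDirectory && status.getModificationTime < cutoff)
        .foreach(status => fs.delete(status.getPath, true)) // recursive delete
    }
  }
}
```

Keeping the threshold well above the longest expected application lifetime avoids deleting the staging dir of a long-running job that is still active.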

@jerryshao
Contributor

Agree with @tgravescs.

Actually this issue only exists when the local yarn#client process is gone and the application is killed by the yarn command. In this case the staging dir will not be cleaned up by the AM.

If the exit is due to EXIT_EXCEPTION_USER_CLASS, then I think a reattempt will eventually clean up the dir. Also, relying on signals is fragile, since you cannot distinguish why a signal was received.

So maybe writing some scripts to clean up these dirs periodically would be much easier than fixing it in Spark.

@Devian-ua Devian-ua closed this Sep 12, 2016