
[SPARK-17340][YARN] cleanup .sparkStaging when app is killed by yarn #14916

Closed

Devian-ua
Contributor

What changes were proposed in this pull request?

Clean up the .sparkStaging directory whenever the Spark application exits with code 15 or 16:
EXIT_EXCEPTION_USER_CLASS = 15
EXIT_EARLY = 16

This happens when you kill spark-submit in the terminal and then run
$ yarn application -kill <app id>

How was this patch tested?

Existing tests (./dev/run-tests).
Also manually verified that the application does clean up the staging directory when it is killed this way.

@srowen
Member

srowen commented Sep 1, 2016

Jenkins test this please

@srowen
Member

srowen commented Sep 1, 2016

@jerryshao what's your view on a change like this?

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64775 has finished for PR 14916 at commit 8a1fe82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-  if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
+  if (finalStatus == FinalApplicationStatus.SUCCEEDED ||
+      exitCode == ApplicationMaster.EXIT_EARLY ||
+      exitCode == ApplicationMaster.EXIT_EXCEPTION_USER_CLASS || isLastAttempt) {
tgravescs
Contributor


You can't do this. There are various reasons these exit codes can occur, and if any of them are retryable by YARN, you are now preventing that retry from happening by unregistering. The kill may cause these exit codes, but other things could too. EXIT_EXCEPTION_USER_CLASS covers any throwable from the user code, and EXIT_EARLY is an unknown cause, so we would want to retry.

I'm fine with adding something if we know it was a kill, but I think that's hard here because YARN doesn't tell us. Ideally we would have a Spark command to kill the application nicely, and then we could do the cleanup ourselves.

The client should try to clean this up if it sees the application was killed, assuming the client is still running.
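For illustration, a minimal sketch of what that client-side cleanup might look like, assuming the client still knows the application id and the staging path; the object and method names here are hypothetical and not part of this PR:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.yarn.api.records.{ApplicationId, FinalApplicationStatus}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Hypothetical client-side helper: ask YARN for the application's final
// status and remove the staging dir if the app was killed, since the AM
// never gets a chance to clean it up in that case.
object StagingCleanup {
  def cleanupIfKilled(appId: ApplicationId, stagingDir: Path): Unit = {
    val conf = new YarnConfiguration()
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(conf)
    yarnClient.start()
    try {
      val report = yarnClient.getApplicationReport(appId)
      if (report.getFinalApplicationStatus == FinalApplicationStatus.KILLED) {
        val fs = stagingDir.getFileSystem(conf)
        fs.delete(stagingDir, true) // recursive delete of .sparkStaging/<appId>
      }
    } finally {
      yarnClient.stop()
    }
  }
}
```

Because the cleanup only happens after YARN reports the application as finished, the AM's retry semantics are left untouched.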

Contributor Author

@Devian-ua Devian-ua Sep 1, 2016


@tgravescs
What about SignalUtils.scala?
log.error("RECEIVED SIGNAL " + sig)
When we kill the app using yarn kill we get this:
ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
Can we use it to trigger the cleanup?
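For concreteness, a rough sketch of what that wiring might look like inside the AM, assuming the Spark 2.x SignalUtils.register(sig)(handler) hook, where a handler returning false falls through to the previous handler; cleanupStagingDir below is a hypothetical placeholder for the AM's existing cleanup logic:

```scala
import org.apache.spark.util.SignalUtils

// Hypothetical placeholder for the AM's existing staging-dir cleanup.
def cleanupStagingDir(): Unit = { /* delete .sparkStaging/<appId> */ }

// Run the cleanup when SIGTERM arrives; returning false lets the previous
// handler run afterwards, so the process still exits as before.
SignalUtils.register("TERM") {
  cleanupStagingDir()
  false
}
```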

tgravescs
Contributor


Unfortunately I don't think you can use that either. YARN has preemption and overcommit handling that can kill the AM, and in those cases it uses SIGTERM or SIGKILL. In those cases we want the AM to rerun.

@Devian-ua
Contributor Author

Can we just clean up previously finished apps from the .sparkStaging folder?

@tgravescs
Contributor

tgravescs commented Sep 1, 2016

We shouldn't really clean up previous apps, because there is a debug option to keep the staging dir around, i.e. you might want some of those directories to stay if you are debugging. You would also have to include some time-based parameter, and I really don't like this approach, as it puts a burden on a different application, startup cost could be affected, etc.

This really needs to be solved in YARN, with something like a cleanup task that runs after the application finishes. There is a JIRA for this in YARN, but no one has worked on it yet.

Short of that, I would say we add a Spark interface to kill the application so we can shut down cleanly and do the cleanup ourselves, versus yarn kill just shooting the app.

I know it's a bit annoying, but it's unfortunately hard to solve, which is why it's still this way. Otherwise you can obviously set up some sort of cron job to clean these directories up separately.
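Along those lines, a minimal sketch of an external periodic cleaner, entirely outside Spark; the staging root path and the age threshold are placeholder assumptions:

```scala
import java.util.concurrent.TimeUnit

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical standalone cleaner: delete staging sub-directories whose
// modification time is older than maxAgeDays. Intended to be scheduled
// externally (e.g. from cron).
object StagingDirCleaner {
  def main(args: Array[String]): Unit = {
    val stagingRoot = new Path("/user/<user>/.sparkStaging") // adjust to your setup
    val maxAgeDays = 7L
    val cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(maxAgeDays)

    val fs = FileSystem.get(new Configuration())
    if (fs.exists(stagingRoot)) {
      fs.listStatus(stagingRoot)
        .filter(status => status.isDirectory && status.getModificationTime < cutoff)
        .foreach(status => fs.delete(status.getPath, true)) // recursive delete
    }
  }
}
```

Keeping the threshold well above the longest expected application lifetime avoids deleting the staging dir of a long-running job that is still active.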

@jerryshao
Contributor

Agree with @tgravescs.

Actually this issue only exists when the local yarn#client process is gone and the application is killed by the yarn command. In this case the staging dir will not be cleaned up by the AM.

If the exit is due to EXIT_EXCEPTION_USER_CLASS, then I think a reattempt will eventually clean up the dir. Also, relying on signals is fragile, since you cannot distinguish why a signal was received.

So maybe writing some scripts to clean up these dirs periodically would be much easier than fixing it in Spark.

@Devian-ua Devian-ua closed this Sep 12, 2016