[SPARK-17340][YARN] cleanup .sparkStaging when app is killed by yarn #14916
Conversation
Jenkins test this please
@jerryshao what's your view on a change like this?
Test build #64775 has finished for PR 14916 at commit
```diff
-    if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
+    if (finalStatus == FinalApplicationStatus.SUCCEEDED ||
+        exitCode == ApplicationMaster.EXIT_EARLY ||
+        exitCode == ApplicationMaster.EXIT_EXCEPTION_USER_CLASS || isLastAttempt) {
```
You can't do this. There are various reasons these exit codes can occur, and if any of them are retryable by YARN, unregistering here prevents that retry from happening. A kill may produce these codes, but other things can too: EXIT_EXCEPTION_USER_CLASS covers any throwable from the user code, and EXIT_EARLY is an unknown cause, so we would want to retry.
I'm fine with adding something if we know it was a kill, but I think that's hard here because YARN doesn't tell us. Ideally we would have a Spark command to kill the app nicely, and then we could do the cleanup ourselves.
The client should try to clean this up if it sees the app was killed, assuming the client is still running.
@tgravescs
What about SignalUtils.scala?
log.error("RECEIVED SIGNAL " + sig)
When we kill the app using yarn kill we get this:
ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
Can we use it to trigger cleanup?
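For illustration, a minimal sketch of what such a SIGTERM-triggered cleanup could look like. This is a hypothetical standalone example, not Spark's SignalUtils: `TermCleanupHook` and `stagingDir` are invented names, a local filesystem path stands in for the real HDFS staging directory, and `sun.misc.Signal` is an unsupported JDK API used only to demonstrate the mechanism.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator
import scala.jdk.CollectionConverters._
import sun.misc.{Signal, SignalHandler}

object TermCleanupHook {
  @volatile var cleaned = false

  /** Install a SIGTERM handler that best-effort deletes `stagingDir`. */
  def install(stagingDir: Path): Unit = {
    Signal.handle(new Signal("TERM"), new SignalHandler {
      def handle(sig: Signal): Unit = {
        if (Files.exists(stagingDir)) {
          // Walk depth-first so children are deleted before their parent.
          Files.walk(stagingDir).sorted(Comparator.reverseOrder())
            .iterator().asScala.foreach(p => Files.deleteIfExists(p))
        }
        cleaned = true
        // A real handler would then proceed with normal JVM shutdown.
      }
    })
  }
}
```

As the reply below points out, the handler alone cannot tell a user-initiated kill apart from other SIGTERM sources, which is the core problem with this approach.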
Unfortunately I don't think you can use that either. YARN has preemption and overcommit, which can kill the AM, and in those cases it uses SIGTERM or SIGKILL too. In those cases we want the AM to be rerun.
Can we just clean previously finished apps from the sparkStaging folder?
We shouldn't really clean up previous apps' directories, because there is a debug option to keep the staging dir around, i.e. you might want some of those dirs when debugging. You would also have to include some time-based parameter. But I really don't like this approach: it puts the burden on a different application, could affect startup cost, etc. This really needs to be solved in YARN, with something like a cleanup task that runs afterwards; there is a JIRA for this in YARN, but no one has worked on it yet. Short of that, I would say we add a Spark interface to kill the application so we can shut down and clean up gracefully, versus yarn kill just shooting the app. I know it's a bit annoying, but it's unfortunately hard to solve, which is why it's still this way. Otherwise you can obviously set up some sort of cron job to clean these up separately.
Agree with @tgravescs. Actually this issue only exists when [...]. If it is due to [...]. So maybe writing some scripts to clean up these dirs periodically would be much easier compared to fixing it in Spark.
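The periodic cleanup suggested above could be sketched as follows. This is a hypothetical offline janitor, assuming a local filesystem path for illustration: a real job would use the Hadoop FileSystem API against HDFS, and `StagingJanitor`, `stagingRoot`, and `maxAgeDays` are invented names, not Spark configuration.

```scala
import java.nio.file.{Files, Path}
import java.time.Instant
import java.time.temporal.ChronoUnit
import java.util.Comparator
import scala.jdk.CollectionConverters._

object StagingJanitor {
  /** Delete per-application subdirectories of `stagingRoot` last modified
    * more than `maxAgeDays` days ago, returning the dirs that were removed. */
  def cleanup(stagingRoot: Path, maxAgeDays: Long): Seq[Path] = {
    if (!Files.isDirectory(stagingRoot)) return Seq.empty
    val cutoff = Instant.now().minus(maxAgeDays, ChronoUnit.DAYS)
    val stale = Files.list(stagingRoot).iterator().asScala
      .filter(p => Files.isDirectory(p))
      .filter(p => Files.getLastModifiedTime(p).toInstant.isBefore(cutoff))
      .toList
    stale.foreach { dir =>
      // Walk depth-first so children are deleted before their parent.
      Files.walk(dir).sorted(Comparator.reverseOrder())
        .iterator().asScala.foreach(p => Files.deleteIfExists(p))
    }
    stale
  }
}
```

Note the age cutoff exists precisely because of the debug option mentioned above: recently finished apps may have been kept around deliberately, so only sufficiently old directories are removed.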
What changes were proposed in this pull request?
Clean up the .sparkStaging directory whenever the Spark application exits with exitCode 15 or 16:
EXIT_EXCEPTION_USER_CLASS = 15
EXIT_EARLY = 16
This happens when you kill spark-submit in the terminal and then run:
$ yarn application -kill <app id>
How was this patch tested?
Existing tests. (./dev/run-tests)
Also manually verified that the application cleans up the staging directory when it is killed in this way.