[SPARK-15896][SQL] Clean up shuffle files just after jobs finished #14039
Conversation
Test build #61702 has finished for PR 14039 at commit
I don't think we do this in general. The shuffle files are supposed to remain to potentially be reused if the stage needs to be re-executed.
@srowen thanks for the comment. Yea, I noticed that and I'm fixing this to remove only shuffle files generated by |
Actually, they can be reused -- not in Spark as distributed, but it is an open question whether reusing shuffle files within Spark SQL is something that we should be doing and want to support. It can be an effective alternative means of caching. https://issues.apache.org/jira/browse/SPARK-13756 Until that issue is definitively decided, we should not pre-empt the possibility with this PR.
Test build #61715 has finished for PR 14039 at commit
Test build #61717 has finished for PR 14039 at commit
@srowen My understanding is that shuffle data in stages can be shared within a job. However, once the job is finished, the current implementation cannot reuse the shuffle data anymore, so we can safely remove it. Is this incorrect? Can Spark reuse shuffle data across different jobs?
@markhamstra Thanks for the comment. I think the reuse of fragments depends heavily on the user's queries, the Catalyst optimizer, cluster resources... Reusing |
Test build #61738 has finished for PR 14039 at commit
I haven't got anything more concrete to offer at this time than the descriptions in the relevant JIRAs, but I do have this running in production with 1.6, and it does work. Essentially, you build a cache in your application whose keys are a canonicalization of query fragments and whose values are the RDDs associated with those fragments of the logical plan, which produce the shuffle files. For as long as you hold the references to those RDDs in your cache, Spark won't remove the shuffle files.

For as long as you have sufficient memory available to the OS, those shuffle files will be accessed via the OS buffer cache, which is actually pretty quick and doesn't require any Java heap management or garbage collection. That was the original motivation for using shuffle files in this way, before off-heap caching and unified memory management were available. It's less necessary now (at least once I figure out how to do the mapping between logical plan fragments and tables cached off-heap), but it is still a valid alternative caching mechanism.
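The caching pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the production code: `FragmentCache` and `CachedFragment` are hypothetical names, and `CachedFragment` stands in for a real Spark RDD. The key point is that the cache holds strong references keyed by a canonicalized plan fragment, and Spark's ContextCleaner only removes shuffle data once the owning RDD and its shuffle dependency become unreachable.

```scala
import scala.collection.concurrent.TrieMap

// Stand-in for a Spark RDD backed by shuffle files. Holding a reference to
// the real RDD keeps its shuffle files alive, because Spark only cleans up
// shuffle data after the owning RDD/dependency is garbage-collected.
final case class CachedFragment(id: Int, canonicalPlan: String)

// A minimal fragment cache: keys are canonicalized query-fragment strings,
// values are the (stand-in) RDDs that materialized, and thus pin, their
// shuffle files.
final class FragmentCache {
  private val cache = TrieMap.empty[String, CachedFragment]

  // Return the cached fragment if present; otherwise build and pin it.
  def getOrElseCompute(canonicalPlan: String)(build: => CachedFragment): CachedFragment =
    cache.getOrElseUpdate(canonicalPlan, build)

  // Dropping the reference makes the shuffle files eligible for cleanup.
  def evict(canonicalPlan: String): Unit = cache.remove(canonicalPlan)

  def size: Int = cache.size
}
```

A second query whose plan contains the same canonicalized fragment hits the cache and reuses the pinned shuffle output instead of recomputing it, which is why eager deletion of shuffle files after every job would defeat this mechanism.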
What changes were proposed in this pull request?

Since a ShuffleRDD in a SQL query cannot be reused later, this PR removes the shuffle files after a query finishes, to free disk space as soon as possible.

How was this patch tested?
Manually checked that all shuffle files were deleted just after the jobs finished.
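The cleanup idea proposed in this PR can be sketched as a small tracker: record which shuffle IDs each SQL job produced, and release them as soon as the job ends. This is a hypothetical illustration, not Spark's actual implementation; in Spark the hook would live around `SparkListener.onJobEnd` and the shuffle manager, and the names below are illustrative.

```scala
import scala.collection.mutable

// Hypothetical sketch of eager shuffle cleanup: track shuffle IDs per job
// and delete them when the job finishes. `cleanup` stands in for whatever
// actually removes a shuffle's files (e.g. the shuffle manager internals).
final class JobShuffleTracker(cleanup: Int => Unit) {
  private val shufflesByJob = mutable.Map.empty[Int, mutable.Set[Int]]

  // Called whenever a job registers a new shuffle.
  def registerShuffle(jobId: Int, shuffleId: Int): Unit =
    shufflesByJob.getOrElseUpdate(jobId, mutable.Set.empty) += shuffleId

  // Called when a job finishes: under this PR's assumption that finished
  // jobs cannot reuse their shuffle output, delete the files right away.
  def onJobEnd(jobId: Int): Unit =
    shufflesByJob.remove(jobId).foreach(_.foreach(cleanup))
}
```

The discussion above explains why this assumption is contested: if an application pins RDDs to reuse shuffle output across jobs (SPARK-13756), eagerly deleting the files on job end would break that caching strategy.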