[SPARK-15896][SQL] Clean up shuffle files just after jobs finished #14039
Conversation
Test build #61702 has finished for PR 14039 at commit
I don't think we do this in general. The shuffle files are supposed to remain to potentially be reused if the stage needs to be re-executed.
@srowen thanks for the comment. Yea, I noticed that and I'm fixing this to remove only shuffle files generated by |
Actually, they can be reused -- not in Spark as distributed, but it is an open question whether reusing shuffle files within Spark SQL is something that we should be doing and want to support. It can be an effective alternative means of caching. https://issues.apache.org/jira/browse/SPARK-13756 Until that issue is definitively decided, we should not pre-empt the possibility with this PR.
Test build #61715 has finished for PR 14039 at commit
Test build #61717 has finished for PR 14039 at commit
@srowen My understanding is that shuffle data in stages can be shared within a job. However, once the job is finished, the current implementation cannot reuse the shuffle data anymore, so we can safely remove it. Is this incorrect? Can Spark reuse shuffle data across different jobs?
@markhamstra Thanks for the comment. I think the reuse of fragments depends heavily on the user's queries, the Catalyst optimizer, cluster resources... Reusing |
Test build #61738 has finished for PR 14039 at commit
I haven't got anything more concrete to offer at this time than the descriptions in the relevant JIRAs, but I do have this running in production with 1.6, and it does work. Essentially, you build a cache in your application whose keys are a canonicalization of query fragments and whose values are the RDDs associated with those fragments of the logical plan, which produce the shuffle files. For as long as you hold the references to those RDDs in your cache, Spark won't remove the shuffle files.

For as long as you have sufficient memory available to the OS, those shuffle files will be accessed via the OS buffer cache, which is actually pretty quick and doesn't require any Java heap management or garbage collection. That was the original motivation for using shuffle files in this way, before off-heap caching and unified memory management were available. It's less necessary now (at least once I figure out how to do the mapping between logical plan fragments and tables cached off-heap), but it is still a valid alternative caching mechanism.
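The caching pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the production code: `FragmentCache` and `CachedFragment` are hypothetical names, and `CachedFragment` stands in for a real Spark RDD. The key point is that the cache holds strong references keyed by a canonicalized plan fragment, and Spark's ContextCleaner only removes shuffle data once the owning RDD and its shuffle dependency become unreachable.

```scala
import scala.collection.concurrent.TrieMap

// Stand-in for a Spark RDD backed by shuffle files. Holding a reference to
// the real RDD keeps its shuffle files alive, because Spark only cleans up
// shuffle data after the owning RDD/dependency is garbage-collected.
final case class CachedFragment(id: Int, canonicalPlan: String)

// A minimal fragment cache: keys are canonicalized query-fragment strings,
// values are the (stand-in) RDDs that materialized, and thus pin, their
// shuffle files.
final class FragmentCache {
  private val cache = TrieMap.empty[String, CachedFragment]

  // Return the cached fragment if present; otherwise build and pin it.
  def getOrElseCompute(canonicalPlan: String)(build: => CachedFragment): CachedFragment =
    cache.getOrElseUpdate(canonicalPlan, build)

  // Dropping the reference makes the shuffle files eligible for cleanup.
  def evict(canonicalPlan: String): Unit = cache.remove(canonicalPlan)

  def size: Int = cache.size
}
```

A second query whose plan contains the same canonicalized fragment hits the cache and reuses the pinned shuffle output instead of recomputing it, which is why eager deletion of shuffle files after every job would defeat this mechanism.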
What changes were proposed in this pull request?

Since a ShuffleRDD in a SQL query cannot be reused later, this PR removes the shuffle files after a query finishes, to free disk space as soon as possible.

How was this patch tested?
Manually checked that all shuffle files were deleted just after the jobs finished.
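The cleanup idea proposed in this PR can be sketched as a small tracker: record which shuffle IDs each SQL job produced, and release them as soon as the job ends. This is a hypothetical illustration, not Spark's actual implementation; in Spark the hook would live around `SparkListener.onJobEnd` and the shuffle manager, and the names below are illustrative.

```scala
import scala.collection.mutable

// Hypothetical sketch of eager shuffle cleanup: track shuffle IDs per job
// and delete them when the job finishes. `cleanup` stands in for whatever
// actually removes a shuffle's files (e.g. the shuffle manager internals).
final class JobShuffleTracker(cleanup: Int => Unit) {
  private val shufflesByJob = mutable.Map.empty[Int, mutable.Set[Int]]

  // Called whenever a job registers a new shuffle.
  def registerShuffle(jobId: Int, shuffleId: Int): Unit =
    shufflesByJob.getOrElseUpdate(jobId, mutable.Set.empty) += shuffleId

  // Called when a job finishes: under this PR's assumption that finished
  // jobs cannot reuse their shuffle output, delete the files right away.
  def onJobEnd(jobId: Int): Unit =
    shufflesByJob.remove(jobId).foreach(_.foreach(cleanup))
}
```

The discussion above explains why this assumption is contested: if an application pins RDDs to reuse shuffle output across jobs (SPARK-13756), eagerly deleting the files on job end would break that caching strategy.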