
[SPARK-15896][SQL] Clean up shuffle files just after jobs finished #14039

Closed · wants to merge 3 commits

Conversation

@maropu (Member) commented Jul 4, 2016

What changes were proposed in this pull request?

Since a ShuffledRDD in a SQL query cannot be reused later, this PR removes the shuffle files after a query finishes, to free disk space as soon as possible.
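The proposed behavior can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `QueryShuffleTracker` and the `removeShuffleFiles` callback are hypothetical names standing in for Spark's internal shuffle bookkeeping.

```scala
import scala.collection.mutable

// Minimal sketch of the proposed idea; QueryShuffleTracker and the
// removeShuffleFiles callback are illustrative names, not Spark internals.
// While a SQL query runs, each shuffle it produces is registered; once the
// whole query finishes, its shuffle files are removed eagerly instead of
// waiting for them to be garbage-collected.
final class QueryShuffleTracker(removeShuffleFiles: Int => Unit) {
  private val activeShuffles = mutable.LinkedHashSet.empty[Int]

  def register(shuffleId: Int): Unit =
    synchronized { activeShuffles += shuffleId }

  // Invoked after the query (possibly spanning several jobs) completes.
  def cleanupAfterQuery(): Unit = synchronized {
    activeShuffles.foreach(removeShuffleFiles)
    activeShuffles.clear()
  }
}
```

The tension discussed below is exactly about the `cleanupAfterQuery` step: eager removal frees disk early, but forecloses any later reuse of those shuffle files.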

How was this patch tested?

Manually checked that all shuffle files were deleted just after the jobs finished.

@SparkQA commented Jul 4, 2016

Test build #61702 has finished for PR 14039 at commit 4e56d5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 4, 2016

I don't think we do this in general. The shuffle files are supposed to remain to potentially be reused if the stage needs to be re-executed.

@maropu (Member, Author) commented Jul 4, 2016

@srowen Thanks for the comment. Yes, I noticed that, and I'm updating this to remove only the shuffle files generated by ShuffleExchange. I'm also looking into other ways to remove the files.

@markhamstra (Contributor)

Actually, they can be reused -- not in Spark as distributed, but it is an open question whether reusing shuffle files within Spark SQL is something that we should be doing and want to support. It can be an effective alternative means of caching. https://issues.apache.org/jira/browse/SPARK-13756

Until that issue is definitively decided, we should not pre-empt the possibility with this PR.

@SparkQA commented Jul 4, 2016

Test build #61715 has finished for PR 14039 at commit 891a100.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 4, 2016

Test build #61717 has finished for PR 14039 at commit daa859a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Jul 5, 2016

@srowen My understanding is that shuffle data can be shared between stages within a job. However, once the job finishes, the current implementation can no longer reuse the shuffle data, so we can safely remove it. Is this incorrect? Can Spark reuse shuffle data between different jobs?

@maropu (Member, Author) commented Jul 5, 2016

@markhamstra Thanks for the comment. I think the reuse of query fragments depends heavily on users' queries, the Catalyst optimizer, cluster resources, and so on. Reusing ShuffledRowRDD shuffle data within a single job is a good idea, but it seems difficult to keep the data alive across multiple jobs, because Spark cannot know when the data should be garbage-collected, and it could consume a lot of disk space. I think a caching mechanism is a better way to reuse fragments across multiple jobs. Or do you have a concrete idea for reusing the shuffle data?

@SparkQA commented Jul 5, 2016

Test build #61738 has finished for PR 14039 at commit 55c8e03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra (Contributor)

I haven't got anything more concrete to offer at this time than the descriptions in the relevant JIRAs, but I do have this running in production on 1.6, and it does work.

Essentially, you build a cache in your application whose keys are a canonicalization of query fragments and whose values are the RDDs associated with those fragments of the logical plan, which produce the shuffle files. For as long as you hold the references to those RDDs in your cache, Spark won't remove the shuffle files. And for as long as you have sufficient memory available to the OS, those shuffle files will be accessed via the OS buffer cache, which is actually pretty quick and doesn't require any Java heap management or garbage collection.

That was the original motivation behind using shuffle files this way, before off-heap caching and unified memory management were available. It's less necessary now (at least once I figure out how to do the mapping between logical plan fragments and tables cached off-heap), but it is still a valid alternative caching mechanism.
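The caching scheme described above can be sketched like this. All names here are illustrative: in a real application the keys would come from canonicalizing a Spark logical plan and the values would be the actual RDD references; plain strings keep the sketch self-contained.

```scala
import scala.collection.mutable

// Rough sketch of the fragment cache described above; all names are
// hypothetical. Keys are a canonical form of a query fragment; values are
// whatever object pins the shuffle files (in Spark, the RDD reference).
// As long as the cache holds a value, its shuffle files stay on disk.
final class QueryFragmentCache[V <: AnyRef] {
  private val cache = mutable.HashMap.empty[String, V]

  // Stand-in canonicalization; Spark would canonicalize the logical plan.
  private def canonicalize(fragment: String): String =
    fragment.trim.toLowerCase.replaceAll("\\s+", " ")

  // Return the cached value for an equivalent fragment, or compute and
  // cache it. Holding the returned reference keeps the shuffle files alive.
  def getOrElseCompute(fragment: String)(compute: => V): V =
    synchronized { cache.getOrElseUpdate(canonicalize(fragment), compute) }

  // Dropping the entry releases the reference, letting Spark eventually
  // remove the underlying shuffle files.
  def release(fragment: String): Unit =
    synchronized { cache -= canonicalize(fragment) }
}
```

The key design point is that cache eviction, not job completion, decides when shuffle files become removable, which is why eager per-query cleanup would pre-empt this mechanism.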

@maropu closed this Nov 18, 2016
@maropu deleted the SPARK-15896 branch July 5, 2017 11:47