[SPARK-2033] Automatically cleanup checkpoint #855

Closed
witgo wants to merge 2 commits from the cleanup_checkpoint_date branch

Conversation

@witgo (Contributor) commented May 22, 2014

No description provided.

@mridulm (Contributor) commented May 22, 2014

Why would you want to clean up checkpoint data automatically, given that checkpointing is an explicit user action? It can be used to persist computations between Spark invocations.

I can see the value in registering a checkpoint for removal once it can be safely GC'ed. Note, though, that this cleanup may never happen (GC is not guaranteed to run before the app exits), which is the same limitation this PR has.
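
To make the GC caveat concrete, here is a minimal self-contained sketch (illustrative only, not code from this PR) of why a weak-reference cleanup hook of the kind being discussed is best-effort: if the JVM exits before a collection cycle runs, the hook never fires.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

object GcCleanupSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]
    var payload: AnyRef = new AnyRef
    // Keep `ref` in scope so the reference object itself stays reachable.
    val ref = new WeakReference[AnyRef](payload, queue)

    payload = null // drop the only strong reference to the payload
    System.gc()    // requests, but does not guarantee, a collection cycle

    // If nothing is enqueued within the timeout, the cleanup simply never
    // ran; this is the race against application exit described above.
    val collected = Option(queue.remove(1000))
    println(s"cleanup hook fired: ${collected.isDefined}")
  }
}
```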

@witgo witgo changed the title Automatically cleanup checkpoint date [WIP]Automatically cleanup checkpoint date May 23, 2014
@witgo witgo changed the title [WIP]Automatically cleanup checkpoint date Automatically cleanup checkpoint date May 23, 2014
@tdas (Contributor) commented May 23, 2014

Yes, I agree with @mridulm: this should not be done automatically for all checkpoints, as it is necessary to keep them around across Spark invocations. For example, Spark Streaming saves intermediate state datasets as checkpoints and relies on them to recover from driver failures.
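
The workflow both reviewers have in mind is fully manual today: the user chooses the checkpoint directory, so only the user (or a later job that reads it) knows when deletion is safe. A minimal sketch, with an illustrative directory path:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-example"))

// Checkpointing is an explicit user action: the user picks the directory...
sc.setCheckpointDir("/tmp/my-checkpoints") // illustrative path

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint() // materialized by the next action
rdd.count()

// ...and because a later Spark invocation may rely on this data (as in the
// Spark Streaming recovery case), deleting it is also left to the user:
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/tmp/my-checkpoints"), true) // recursive delete
```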

@witgo (Contributor, Author) commented May 23, 2014

@mridulm @tdas
The code has been updated; automatic cleanup of checkpoint data is now optional.

@witgo witgo changed the title Automatically cleanup checkpoint date Automatically cleanup checkpoint data May 29, 2014
@witgo witgo changed the title Automatically cleanup checkpoint data SPARK-2033:Automatically cleanup checkpoint data Jun 5, 2014
@witgo witgo changed the title SPARK-2033:Automatically cleanup checkpoint data SPARK-2033: Automatically cleanup checkpoint data Jun 5, 2014
@witgo witgo changed the title SPARK-2033: Automatically cleanup checkpoint data SPARK-2033: Automatically cleanup checkpoint Jun 5, 2014
@witgo witgo changed the title SPARK-2033: Automatically cleanup checkpoint [SPARK-2033] Automatically cleanup checkpoint Jun 5, 2014
@SparkQA commented Jul 16, 2014

QA tests have started for PR 855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16731/consoleFull

@SparkQA commented Jul 16, 2014

QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16731/consoleFull

@SparkQA commented Aug 1, 2014

QA tests have started for PR 855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17654/consoleFull

@SparkQA commented Aug 1, 2014

QA results for PR 855:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17654/consoleFull

@SparkQA commented Aug 1, 2014

QA tests have started for PR 855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17656/consoleFull

@SparkQA commented Aug 1, 2014

QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17656/consoleFull

@tdas (Contributor) commented Aug 1, 2014

This is definitely better. Can you add documentation for this property to the configuration page?

test("automatically cleanup checkpoint data") {
val conf=new SparkConf().setMaster("local[2]").setAppName("cleanupCheckpointData").
set("spark.cleaner.checkpointData.enabled","true")
sc =new SparkContext(conf)

Inline review comment: space missing
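
For reference, the quoted fragment with the missing spaces added and the builder calls laid out one per line (formatting only, behavior unchanged):

```scala
test("automatically cleanup checkpoint data") {
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("cleanupCheckpointData")
    .set("spark.cleaner.checkpointData.enabled", "true")
  sc = new SparkContext(conf)
```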

@SparkQA commented Aug 1, 2014

QA tests have started for PR 855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17673/consoleFull

@mridulm (Contributor) commented Aug 1, 2014

This is definitely much better. Thanks for the PR!

@SparkQA commented Aug 1, 2014

QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17673/consoleFull

@pwendell (Contributor) commented Sep 8, 2014

@tdas this has gone stale. Can you take a look?

@SparkQA commented Oct 31, 2014

Test build #22601 has finished for PR 855 at commit 9cdbdaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor):

@witgo @tdas This seems like a good feature to have. However, the PR has mostly gone stale at this point. Would you mind rebasing it on master? After you do that, I will take a closer look and hopefully merge it into 1.4.

```
@@ -32,6 +32,7 @@ private sealed trait CleanupTask
private case class CleanRDD(rddId: Int) extends CleanupTask
private case class CleanShuffle(shuffleId: Int) extends CleanupTask
private case class CleanBroadcast(broadcastId: Long) extends CleanupTask
private case class CleanRDDCheckpointData(rddId: Int) extends CleanupTask
```

Inline review comment: I think you can just call this CleanCheckpoint here and in other places.
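
A sketch of the suggested rename in the `CleanupTask` hierarchy (the name follows the review comment; the final merged code may differ):

```scala
private sealed trait CleanupTask
private case class CleanRDD(rddId: Int) extends CleanupTask
private case class CleanShuffle(shuffleId: Int) extends CleanupTask
private case class CleanBroadcast(broadcastId: Long) extends CleanupTask
// Renamed per the review: CleanCheckpoint instead of CleanRDDCheckpointData.
private case class CleanCheckpoint(rddId: Int) extends CleanupTask
```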

@tdas (Contributor) commented Feb 23, 2015

@andrewor14 Yes, this is a good feature to have.

@witgo (Contributor, Author) commented Feb 26, 2015

@andrewor14 @tdas The code has been updated.

@SparkQA commented Feb 26, 2015

Test build #27975 has finished for PR 855 at commit 4c555d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2015

Test build #27976 has finished for PR 855 at commit 46016d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 10, 2015

Test build #28430 has finished for PR 855 at commit 6a630f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -139,6 +140,11 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
registerForCleanup(broadcast, CleanBroadcast(broadcast.id))
}

/** Register a RDDCheckpointData for cleanup when it is garbage collected. */
def registerRDDCheckpointDataForCleanup[T](rdd: RDD[_], parentId: Int) {
```

Inline review comment: please add a Unit return type here and in other places.
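
With the procedure syntax replaced by an explicit `Unit` return type, the method from the diff would look roughly like this (the one-line body is an assumption, following the pattern of the surrounding `register*ForCleanup` methods):

```scala
/** Register a RDDCheckpointData for cleanup when it is garbage collected. */
def registerRDDCheckpointDataForCleanup[T](rdd: RDD[_], parentId: Int): Unit = {
  registerForCleanup(rdd, CleanCheckpoint(parentId))
}
```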

@SparkQA commented Apr 10, 2015

Test build #29998 has finished for PR 855 at commit 1649850.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@andrewor14 (Contributor):

Merging this into master. Thanks @witgo!

@asfgit asfgit closed this in 25998e4 Apr 14, 2015
@witgo witgo deleted the cleanup_checkpoint_date branch April 15, 2015 01:36
wangyum added a commit that referenced this pull request May 26, 2023
[SPARK-37915][SQL] Combine unions if there is a project between them (#855)

### What changes were proposed in this pull request?

This PR makes `CombineUnions` combine unions if there is a project between them. For example:
```scala
spark.range(1).selectExpr("CAST(id AS decimal(18, 1)) AS id").write.saveAsTable("t1")
spark.range(2).selectExpr("CAST(id AS decimal(18, 2)) AS id").write.saveAsTable("t2")
spark.range(3).selectExpr("CAST(id AS decimal(18, 3)) AS id").write.saveAsTable("t3")
spark.range(4).selectExpr("CAST(id AS decimal(18, 4)) AS id").write.saveAsTable("t4")
spark.range(5).selectExpr("CAST(id AS decimal(18, 5)) AS id").write.saveAsTable("t5")

spark.sql("SELECT id FROM t1 UNION SELECT id FROM t2 UNION SELECT id FROM t3 UNION SELECT id FROM t4 UNION SELECT id FROM t5").explain(true)
```

Before this pr:
```
== Optimized Logical Plan ==
Aggregate [id#36], [id#36]
+- Union false, false
   :- Aggregate [id#34], [cast(id#34 as decimal(22,5)) AS id#36]
   :  +- Union false, false
   :     :- Aggregate [id#32], [cast(id#32 as decimal(21,4)) AS id#34]
   :     :  +- Union false, false
   :     :     :- Aggregate [id#30], [cast(id#30 as decimal(20,3)) AS id#32]
   :     :     :  +- Union false, false
   :     :     :     :- Project [cast(id#25 as decimal(19,2)) AS id#30]
   :     :     :     :  +- Relation default.t1[id#25] parquet
   :     :     :     +- Project [cast(id#26 as decimal(19,2)) AS id#31]
   :     :     :        +- Relation default.t2[id#26] parquet
   :     :     +- Project [cast(id#27 as decimal(20,3)) AS id#33]
   :     :        +- Relation default.t3[id#27] parquet
   :     +- Project [cast(id#28 as decimal(21,4)) AS id#35]
   :        +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Relation default.t5[id#29] parquet
```

After this pr:
```
== Optimized Logical Plan ==
Aggregate [id#36], [id#36]
+- Union false, false
   :- Project [cast(id#25 as decimal(22,5)) AS id#36]
   :  +- Relation default.t1[id#25] parquet
   :- Project [cast(id#26 as decimal(22,5)) AS id#46]
   :  +- Relation default.t2[id#26] parquet
   :- Project [cast(id#27 as decimal(22,5)) AS id#45]
   :  +- Relation default.t3[id#27] parquet
   :- Project [cast(id#28 as decimal(22,5)) AS id#44]
   :  +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Relation default.t5[id#29] parquet
```

### Why are the changes needed?

Improves query performance by reducing shuffles.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.
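
For reference, a minimal sketch (illustrative, not the actual test from the patch) of the kind of assertion such a unit test can make: after optimization, the nested UNIONs should collapse into a single `Union` node.

```scala
import org.apache.spark.sql.catalyst.plans.logical.Union

val optimized = spark.sql(
  "SELECT id FROM t1 UNION SELECT id FROM t2 UNION SELECT id FROM t3"
).queryExecution.optimizedPlan

// With the rule in place, exactly one Union node should remain in the plan.
val unions = optimized.collect { case u: Union => u }
assert(unions.size == 1, s"expected one combined Union, found ${unions.size}")
```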

Closes #35214 from wangyum/SPARK-37915.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

(cherry picked from commit ac2b0df)

* [SPARK-37915][SQL] Combine unions if there is a project between them