
[SPARK-2490] Change recursive visiting on RDD dependencies to iterative approach #1418

Closed
viirya wants to merge 4 commits from viirya/remove_recursive_visit

Conversation

@viirya (Member) commented Jul 15, 2014

When performing transformations on an RDD over many iterations, the RDD's chain of dependencies can grow very long. Recursively visiting these dependencies in Spark core can then easily cause a StackOverflowError. For example:

var rdd = sc.makeRDD(Array(1))
for (i <- 1 to 1000) { 
  rdd = rdd.coalesce(1).cache()
  rdd.collect()
}

This PR changes the recursive visiting of an RDD's dependencies to an iterative approach to avoid the StackOverflowError.

Beyond the recursive visiting, the Java serializer has a known bug (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that also causes a StackOverflowError when serializing or deserializing a large graph of objects, so this PR solves only part of the problem. Replacing the Java serializer with KryoSerializer might help. However, since KryoSerializer is not currently supported for `spark.closure.serializer`, I cannot test whether KryoSerializer solves the Java serializer's problem completely.
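
For reference, the change follows this general shape: an explicit stack on the heap instead of recursive calls. A minimal sketch with illustrative names (not the exact DAGScheduler code):

    import scala.collection.mutable.{HashSet, Stack}
    import org.apache.spark.rdd.RDD

    // Walk the whole dependency graph iteratively. The explicit stack
    // lives on the heap, so a long lineage chain cannot overflow the
    // JVM call stack the way deep recursion can.
    def visitDependencies(root: RDD[_])(handle: RDD[_] => Unit): Unit = {
      val visited = new HashSet[RDD[_]]
      val waitingForVisit = new Stack[RDD[_]]
      waitingForVisit.push(root)
      while (!waitingForVisit.isEmpty) {
        val rdd = waitingForVisit.pop()
        if (!visited(rdd)) {
          visited += rdd
          handle(rdd)
          for (dep <- rdd.dependencies) {
            waitingForVisit.push(dep.rdd)
          }
        }
      }
    }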

@viirya (Member, Author) commented Jul 16, 2014

Another example of this problem is the PageRank example bundled with Spark. For now, since the Java serializer problem still exists, avoiding a StackOverflowError after too many iterations requires calling checkpoint() on the RDD.
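
A sketch of that workaround, with an illustrative loop body and checkpoint interval (not the actual PageRank code; `setCheckpointDir` must point at reliable storage):

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

    var ranks = sc.makeRDD(Array(1.0))
    for (i <- 1 to 1000) {
      ranks = ranks.map(_ * 0.85).cache()  // stand-in for one PageRank step
      if (i % 50 == 0) {                   // illustrative interval
        ranks.checkpoint()                 // truncates the lineage chain
        ranks.count()                      // action that forces the checkpoint
      }
    }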

@rxin (Contributor) commented Jul 20, 2014

Thanks for submitting this. I think we can still stack overflow in serialization, but I agree it's better to do this non-recursively.

@rxin (Contributor) commented Jul 20, 2014

Actually it's late. I will review this tomorrow.

@markhamstra (Contributor)

ok to test

@markhamstra (Contributor)

Jenkins, ok to test.

@viirya (Member, Author) commented Jul 26, 2014

Thanks for the comments. How is the review going?

@mateiz (Contributor) commented Jul 26, 2014

Jenkins, test this please

if (i % 50 == 0) {
  ranks.cache()
  ranks.checkpoint()
  ranks.collect()
Contributor

Don't do a collect; if you want to force it to be computed, just do a foreach. Otherwise collect will try to bring all the data back to the driver.
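
For instance (a sketch):

    // Forces evaluation on the executors; nothing is shipped to the driver.
    ranks.foreach(_ => ())

    // By contrast, collect() copies every element into driver memory:
    // val all = ranks.collect()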

@SparkQA commented Jul 26, 2014

QA tests have started for PR 1418. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull

@SparkQA commented Jul 26, 2014

QA results for PR 1418:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull

@mateiz (Contributor) commented Jul 26, 2014

Hey @viirya, instead of modifying the PageRank example, what do you think of leaving it as-is until we have automatic checkpointing of long lineage chains? I think that will be better because the example is meant to be easy to understand and not that many people will run PageRank with hundreds of iterations (this particular algorithm usually converges much faster).

@mateiz (Contributor) commented Jul 26, 2014

BTW, the Jenkins failure is due to a code style issue: an if block without braces.

Jenkins, this is ok to test

}
}
}
}
visit(rdd)
waitingForVisit.push(rdd)
while (!waitingForVisit.isEmpty)
Contributor

This should have braces around the loop body
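
That is, something like the following (a sketch, assuming the loop body simply pops and visits the next RDD):

    waitingForVisit.push(rdd)
    while (!waitingForVisit.isEmpty) {
      visit(waitingForVisit.pop())
    }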

@mateiz (Contributor) commented Jul 26, 2014

BTW, you can run `sbt scalastyle` to check these style issues locally.

@viirya (Member, Author) commented Jul 27, 2014

@mateiz Thanks for the suggestion. I have left the PageRank example as-is, and the braces are added to comply with the code style.

@mateiz (Contributor) commented Jul 27, 2014

Jenkins, test this please

@rxin (Contributor) commented Jul 29, 2014

Jenkins, retest this please.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 1418. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull

@SparkQA commented Jul 29, 2014

QA results for PR 1418:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull

val parents = new Stack[ShuffleDependency[_, _, _]]
val visited = new HashSet[RDD[_]]
val waitingForVisit = new Stack[RDD[_]]
def visit(r: RDD[_]) {
Contributor

Why define a function here? It seems to be used only once; why not just inline it in the while loop?

Member Author

I kept the code as the function getParentShuffleDependencies because it contains multiple levels of indentation, so putting it under the case statement would not be readable. I can inline it if this is an issue.

parents.toList
}

private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
Contributor

This should actually be called getAncestorShuffleDependencies, because it finds not just direct parents but grandparents and so on as well.

Also, please add a comment at the top saying what it finds; in particular, it only finds missing ones.

Finally, would it be possible to merge this with the code that calls it in getShuffleMapStage, e.g. have a method called registerShuffleDependencies? Might be easier to follow.
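
Taken together, those suggestions point to something roughly like this (a sketch, not the merged code; `shuffleToMapStage` is assumed to be the scheduler's existing map of registered shuffle ids):

    /**
     * Finds the ancestor shuffle dependencies of the given RDD (parents,
     * grandparents, and so on) that are not yet registered with the
     * scheduler; only the missing ones are returned.
     */
    private def getAncestorShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
      val parents = new Stack[ShuffleDependency[_, _, _]]
      val visited = new HashSet[RDD[_]]
      val waitingForVisit = new Stack[RDD[_]]
      waitingForVisit.push(rdd)
      while (!waitingForVisit.isEmpty) {
        val r = waitingForVisit.pop()
        if (!visited(r)) {
          visited += r
          for (dep <- r.dependencies) {
            dep match {
              case shufDep: ShuffleDependency[_, _, _] =>
                if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
                  parents.push(shufDep)  // record only unregistered ancestors
                }
              case _ =>
            }
            waitingForVisit.push(dep.rdd)
          }
        }
      }
      parents
    }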

Member Author

I've added a new commit with several changes, including the function name and comments. Please review whether it looks fine.

@mateiz (Contributor) commented Aug 1, 2014

Jenkins, test this please

@SparkQA commented Aug 1, 2014

QA tests have started for PR 1418. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull

@SparkQA commented Aug 1, 2014

QA results for PR 1418:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull

@mateiz (Contributor) commented Aug 1, 2014

Jenkins, test this please

@SparkQA commented Aug 1, 2014

QA tests have started for PR 1418. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17687/consoleFull

@SparkQA commented Aug 1, 2014

QA results for PR 1418:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17687/consoleFull

@mateiz (Contributor) commented Aug 1, 2014

Thanks Liang-Chi; I've merged this in.

@asfgit asfgit closed this in baf9ce1 Aug 1, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
[SPARK-2490] Change recursive visiting on RDD dependencies to iterative approach

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#1418 from viirya/remove_recursive_visit and squashes the following commits:

6b2c615 [Liang-Chi Hsieh] change function name; comply with code style.
5f072a7 [Liang-Chi Hsieh] add comments to explain Stack usage.
8742dbb [Liang-Chi Hsieh] comply with code style.
900538b [Liang-Chi Hsieh] change recursive visiting on rdd's dependencies to iterative approach to avoid stackoverflowerror.
@viirya viirya deleted the remove_recursive_visit branch December 27, 2023 18:16