
[SPARK-4654] Clean up DAGScheduler getMissingParentStages / stageDependsOn methods #3515

Closed

Conversation

JoshRosen
Contributor
DAGScheduler has getMissingParentStages() and stageDependsOn() methods which are suspiciously similar to getParentStages().

Both of these methods traverse the RDD / Stage graph to inspect parent stages. We can remove both of them, though: the set of parent stages is known when a Stage instance is constructed and is stored in Stage.parents, so we can find missing parent stages simply by looking for unavailable stages in Stage.parents. Similarly, we can determine whether one stage depends on another by searching Stage.parents rather than performing a graph traversal from scratch.
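The idea can be sketched with a toy model. The Stage class and isAvailable flag below are simplified stand-ins, not Spark's actual internals; the point is that once parents are recorded at construction time, both checks reduce to lookups on Stage.parents instead of fresh graph traversals.

```scala
// Minimal stand-in for Spark's Stage: parents are recorded at construction
// time, and availability models whether a stage's output is computed.
case class Stage(id: Int, parents: List[Stage], isAvailable: Boolean = false)

// "Missing" parents are simply the unavailable entries in Stage.parents.
def getMissingParentStages(stage: Stage): List[Stage] =
  stage.parents.filter(!_.isAvailable)

// One stage depends on another if the target is reachable through the
// (transitive) parents recorded at construction time.
def stageDependsOn(stage: Stage, target: Stage): Boolean =
  stage.parents.exists(p => p.id == target.id || stageDependsOn(p, target))
```

For example, with stages a ← b ← c where only a is available, getMissingParentStages(c) returns List(b), and stageDependsOn(c, a) is true while stageDependsOn(a, c) is false.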

@SparkQA

SparkQA commented Nov 29, 2014

Test build #23950 has started for PR 3515 at commit 1ab3d6d.

  • This patch merges cleanly.

@@ -401,7 +370,7 @@ class DAGScheduler(
val s = stages.head
s.jobIds += jobId
jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += s.id
val parents: List[Stage] = getParentStages(s.rdd, jobId)
Contributor Author

It might look like this changes the behavior of this method, since getParentStages will create any parent stages that are missing. However, I think this call never actually took the "create a missing stage" branch: a stage's parent stages should already have been created before the stage itself was, since getParentStages(stage.rdd, jobId) should have been called from the newStage method: https://github.com/JoshRosen/spark/blob/dagscheduler-missingparents-cleanup/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L247

@SparkQA

SparkQA commented Nov 29, 2014

Test build #23950 timed out for PR 3515 at commit 1ab3d6d after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23950/

@@ -401,7 +370,7 @@ class DAGScheduler(
       val s = stages.head
       s.jobIds += jobId
       jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += s.id
-      val parents: List[Stage] = getParentStages(s.rdd, jobId)
+      val parents: List[Stage] = stage.parents
       val parentsWithoutThisJobId = parents.filter { ! _.jobIds.contains(jobId) }
Contributor

I wouldn't bother binding a local anymore, so just:

val parentsWithoutThisJobId = stage.parents.filter { ! _.jobIds.contains(jobId) }

@SparkQA

SparkQA commented Nov 30, 2014

Test build #23954 has started for PR 3515 at commit 8aee34d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 30, 2014

Test build #23954 timed out for PR 3515 at commit 8aee34d after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23954/

@JoshRosen
Contributor Author

The tests have failed twice with the same error, so this might be a legitimate bug. That would be surprising, since it would suggest there's some latent complexity in the old code that I overlooked. I'll investigate tomorrow if I have time.

@CodEnFisH

I reproduced the failing test locally and looked at the log.

The failed test case ("awaitTermination with error in task") checks that a task failure is correctly surfaced by the system.

However, it seems that DAGScheduler doesn't fail the job even though its task fails. In my log, I saw "Ignoring failure of Stage 0 because all jobs depending on it are done", which is printed at the end of abortStage() in DAGScheduler. So the job is not aborted and the failure is not surfaced as expected.

The reason is that, with this pull request applied, the DAGScheduler no longer correctly creates the dependency between the failed RDD and the job. I'm still digging into the cause.

@JoshRosen
Contributor Author

I haven't had a chance to dig into this much more, but perhaps it's due to streaming checkpointing: if RDDs' dependencies change after checkpointing, then we may need to re-walk the stage / dependency graph rather than relying on the cached results of the earlier traversal.

@pwendell
Contributor

Looks like this has gone stale, so I'd like to close this issue pending an update from @JoshRosen.
