
[SPARK-4654] Clean up DAGScheduler getMissingParentStages / stageDependsOn methods #3515

Closed

Conversation

JoshRosen
Contributor
DAGScheduler has getMissingParentStages() and stageDependsOn() methods which are suspiciously similar to getParentStages().

Both of these methods traverse the RDD / Stage graph to inspect parent stages. We can remove both of them, though: the set of parent stages is known when a Stage instance is constructed and is stored in Stage.parents, so we can find missing parent stages simply by looking for unavailable stages in Stage.parents. Similarly, we can determine whether one stage depends on another by searching Stage.parents rather than performing a graph traversal from scratch.
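The idea can be sketched with a toy model. The Stage class and isAvailable flag below are simplified stand-ins, not Spark's actual internals; the point is that once parents are recorded at construction time, both checks reduce to lookups on Stage.parents instead of fresh graph traversals.

```scala
// Minimal stand-in for Spark's Stage: parents are recorded at construction
// time, and availability models whether a stage's output is computed.
case class Stage(id: Int, parents: List[Stage], isAvailable: Boolean = false)

// "Missing" parents are simply the unavailable entries in Stage.parents.
def getMissingParentStages(stage: Stage): List[Stage] =
  stage.parents.filter(!_.isAvailable)

// One stage depends on another if the target is reachable through the
// (transitive) parents recorded at construction time.
def stageDependsOn(stage: Stage, target: Stage): Boolean =
  stage.parents.exists(p => p.id == target.id || stageDependsOn(p, target))
```

For example, with stages a ← b ← c where only a is available, getMissingParentStages(c) returns List(b), and stageDependsOn(c, a) is true while stageDependsOn(a, c) is false.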

@SparkQA

SparkQA commented Nov 29, 2014

Test build #23950 has started for PR 3515 at commit 1ab3d6d.

  • This patch merges cleanly.

@@ -401,7 +370,7 @@ class DAGScheduler(
val s = stages.head
s.jobIds += jobId
jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += s.id
val parents: List[Stage] = getParentStages(s.rdd, jobId)
Contributor Author

It might look like this changes the behavior of this method, since getParentStages will create any parent stages that are missing. However, I think this call never actually took the "create a missing stage" branch: a stage's parent stages should already have been created before the stage itself was, since getParentStages(stage.rdd, jobId) should have been called from the newStage method: https://github.com/JoshRosen/spark/blob/dagscheduler-missingparents-cleanup/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L247

@SparkQA

SparkQA commented Nov 29, 2014

Test build #23950 timed out for PR 3515 at commit 1ab3d6d after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23950/

@@ -401,7 +370,7 @@ class DAGScheduler(
       val s = stages.head
       s.jobIds += jobId
       jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += s.id
-      val parents: List[Stage] = getParentStages(s.rdd, jobId)
+      val parents: List[Stage] = stage.parents
       val parentsWithoutThisJobId = parents.filter { ! _.jobIds.contains(jobId) }
Contributor

I wouldn't bother binding a local anymore, so just:

val parentsWithoutThisJobId = stage.parents.filter { ! _.jobIds.contains(jobId) }

@SparkQA

SparkQA commented Nov 30, 2014

Test build #23954 has started for PR 3515 at commit 8aee34d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 30, 2014

Test build #23954 timed out for PR 3515 at commit 8aee34d after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23954/

@JoshRosen
Contributor Author

The tests have failed twice with the same error, so this might be a legitimate bug. That would be surprising, since it would suggest there's some latent complexity in the old code that I overlooked. I'll investigate tomorrow if I have time.

@CodEnFisH

I reproduced the failing test locally and looked at the log.

The failed test case ("awaitTermination with error in task") checks that a task failure is correctly surfaced by the system.

However, it seems that DAGScheduler doesn't fail the job even though its task fails. In my log, I saw "Ignoring failure of Stage 0 because all jobs depending on it are done", which is printed at the end of abortStage() in DAGScheduler. So the job is not aborted and the failure is not surfaced as expected.

The reason is that, with this pull request applied, the DAGScheduler no longer correctly creates the dependency between the failed RDD and the job. I'm still digging into the cause.

@JoshRosen
Contributor Author

I haven't had a chance to dig into this much more, but perhaps it's due to streaming checkpointing: if RDDs' dependencies change after checkpointing, then we may need to re-walk the stage / dependency graph rather than relying on the cached results of the earlier traversal.

@pwendell
Contributor

Looks like this has gone stale, so I'd like to close this issue pending an update from @JoshRosen.
