Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-10003] Improve readability of DAGScheduler #8217

Closed

Conversation

andrewor14
Copy link
Contributor

Note: this is not intended to be in Spark 1.5!

This patch rewrites some code in the DAGScheduler to make it more readable. In particular

  • there were blocks of code that are unnecessary and removed for simplicity
  • there were abstractions that are unnecessary and made the code hard to navigate
  • other minor changes

logInfo("Submitting " + shuffleStage + " (" +
shuffleStage.rdd + "), which is now runnable")
submitMissingTasks(shuffleStage, jobId)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a potentially controversial change. Let me explain:

Context. This block of code happens when a map stage finishes and all shuffle files are present. What happens here is that the newly runnable stages are submitted and moved from waitingStages to runningStages.

Why is it removed? Note that we already call submitWaitingStages() at the end of this method, which does the exact same thing. It seems to be me that this block of code doesn't do anything extra.

Test coverage. Note that this only affects the case where the map shuffle stage has finished successfully, which is already covered in any existing test in DAGSchedulerSuite that runs a successful shuffle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me -- I went on a git blame adventure and this code is super old and it's not clear that it was ever necessary, and it definitely doesn't seem necessary now that we have the submitWaitingStages() call at the bottom.

@SparkQA
Copy link

SparkQA commented Aug 15, 2015

Test build #40932 timed out for PR 8217 at commit 64a9ed2 after a configured wait of 175m.

@SparkQA
Copy link

SparkQA commented Aug 15, 2015

Test build #1621 has finished for PR 8217 at commit 64a9ed2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor Author

@kayousterhout can you take a look?

…eadability

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
changeEpoch = true)

clearCacheLocs()

// Some tasks had failed; let's resubmit this shuffleStage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move this outside the if-statement? It looks like it only holds when the if-statement is true, so less confusing I think to have it inside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes sense

@kayousterhout
Copy link
Contributor

LGTM except for the one question about the moved comment

@SparkQA
Copy link

SparkQA commented Sep 3, 2015

Test build #41983 has finished for PR 8217 at commit 574fb1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in cf42138 Sep 4, 2015
@SparkQA
Copy link

SparkQA commented Sep 4, 2015

Test build #41986 has finished for PR 8217 at commit 57abca3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 andrewor14 deleted the dag-scheduler-readability branch September 4, 2015 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants