[SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled #7484
Conversation
Ping @tdas, who helped come up with the idea of removing this after we noticed some messiness in the error-handling. Also, ping @kayousterhout or @markhamstra to see if they can think of any reasons why we should save this feature.
```scala
} finally {
  taskContext.markTaskCompleted()
  TaskContext.unset()
  // Note: this memory freeing logic is duplicated in Executor.run(); when changing this,
```
Happy to be removing this duplication.
If nobody is going to miss it, I'd be happy to be rid of localExecution. But this PR should really be "Deprecate DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled" since the actual removal can't occur until later.
Fair enough RE: deferring the actual removal to 1.6. I can re-work this to only deprecate the setting and the public methods that expose the
```scala
@@ -98,8 +98,7 @@ class KafkaRDD[
    val res = context.runJob(
      this,
      (tc: TaskContext, it: Iterator[R]) => it.take(parts(tc.partitionId)).toArray,
      parts.keys.toArray,
      allowLocal = true)
```
This is okay.
Test build #37675 has finished for PR 7484 at commit
I think the idea is to run stuff like take() much faster. Why do we want to remove this?
@rxin, it turns out that this optimization for local actions is guarded behind a feature flag which is off by default. Although this path gets tested in DAGSchedulerSuite, I think it's somewhat unlikely that it ends up getting used in most production deployments.
Also, there is a bit of messiness in how
FYI, this used to be on by default (and not flagged) until August last year, when this commit turned it off: http://mail-archives.apache.org/mod_mbox/spark-commits/201408.mbox/%3C9f2f6315e068441787bf791864573776@git.apache.org%3E. It doesn't seem horrible to keep it off forever since it created some problems before. The main place it helped was if you call first(), take() etc on a dataset when working interactively, but maybe sending one task isn't that bad.
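The trade-off being debated above can be sketched in pure Python. This is a hypothetical illustration of the config-gated fast path, not Spark's actual implementation; the names `LOCAL_EXECUTION_ENABLED`, `take`, and `scheduler_submit` are stand-ins:

```python
# Analogue of spark.localExecution.enabled; off by default, as noted above.
LOCAL_EXECUTION_ENABLED = False

def take(partition, n, scheduler_submit):
    """Hypothetical sketch of take(n) over a single partition."""
    if LOCAL_EXECUTION_ENABLED:
        # Fast path: evaluate directly in the driver thread, skipping task
        # launch overhead -- but the whole partition must be driver-local.
        return partition[:n]
    # Default path: wrap the work as a "task" and hand it to the scheduler.
    return scheduler_submit(lambda: partition[:n])
```

With the flag off, every `take` goes through the scheduler, e.g. `take([1, 2, 3, 4], 2, lambda task: task())` still yields the first two elements, just via a (simulated) task launch.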
BTW it would also be nice to test the difference this makes before deciding, though the optimization only helps in somewhat limited cases (e.g. it won't help much if you do a shuffle).
[SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

Conflicts:
  core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
  core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
I think the main time when this can help a lot is when you are connecting to a busy cluster, and in that case, it can take a while to get something scheduled. If the cluster is idle, it takes just a few ms to launch a task, and as a result users won't be able to tell the difference at all.
I submitted #7585 to bring this up to date. We can merge that one.
…alExecution.enabled

Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
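The "deprecate user-facing uses rather than break them" approach in the commit list above can be sketched in pure Python. This is a hypothetical stand-in, not PySpark's actual runJob signature: the flag is accepted for compatibility but ignored, and passing it emits a DeprecationWarning:

```python
import warnings

def run_job(rdd, func, partitions=None, allowLocal=False):
    """Hypothetical sketch: allowLocal is kept for API compatibility but no
    longer does anything; jobs always go through the scheduler path."""
    if allowLocal:
        warnings.warn(
            "allowLocal is deprecated; jobs are always submitted to the "
            "scheduler.",
            DeprecationWarning,
        )
    partitions = range(len(rdd)) if partitions is None else partitions
    # Simulate normal (non-local) execution over the selected partitions.
    return [func(rdd[p]) for p in partitions]
```

Callers that never passed the flag see no change; callers that did get a warning instead of a behavior change, which is what makes the removal deferrable to a later release.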
Closing since this was done in #7585.