[SPARK-1946] Submit tasks after (configured ratio) executors have been registered #900
Conversation
Can one of the admins verify this patch?
Jenkins, test this please. @li-zhihui could you create a JIRA for this and add it to the title of the PR (e.g. SPARK-XXX)?
This should probably be unified with #634
This is very similar to SPARK-1453 and this may solve it generically instead of just yarn.
This is also related to SPARK-1937 (#892) -- is fixing that sufficient or is it necessary to have this sleep as well? It seems like this sleep is only necessary if the sleep time is less than the expected increase in task completion time as a result of running non-locally.
@kayousterhout @mridulm Looking briefly at PR #892, it seems that it handles locality when executors are added later, and I assume some of the locality wait configs come into effect here, but it doesn't just wait for a certain number of executors (or a period of time) before starting, does it? Are the changes good enough that waiting for a good percentage of the executors no longer matters? It seems like, depending on your resource manager and how busy a cluster you run on, it could take a while (minutes) to get a large number of executors. I think this should just be a configuration, and user code should not have to "sleep" or work around this.
This one slipped off my radar, my apologies.
Hit submit by mistake, to continue ...
Thanks @mridulm for the clarification. So it seems like this change would be useful and is similar to what we discussed in SPARK-1453. Are there any general concerns over adding configs to wait? I'm seeing more people running into this and would like to get something implemented so each user doesn't have to put a sleep in their code.
@li-zhihui it looks like the JIRA you created (https://issues.apache.org/jira/browse/SPARK-1946) describes the issue fixed by #892 and described by a redundant issue (https://issues.apache.org/jira/browse/SPARK-1937). As @mridulm explained (thanks!!), the primary set of issues addressed by this pull request center around the fact that Spark-on-YARN has various performance problems when not enough executors have registered yet. Could you update SPARK-1946 accordingly?

@tgravescs I'm a little nervous about adding more scheduler config options, because I think the average user would have a very difficult time figuring out that their performance problems could be fixed by tuning this particular set of options. The scheduler already has quite a few config options and I think we should be very cautious in adding more (cc @pwendell). On the other hand, as you pointed out, it seems like a user typically wants to wait for some number of executors to become available, and those semantics aren't available to the application -- so we're stuck with adding something to the scheduler code.

Is it possible to do this only for the YARN scheduler / do you think it's necessary in standalone too? Doing it only for YARN (and naming the config variable accordingly) could help signal to a naive user when tuning this might help. From @mridulm's description, it sounds like many of the issues here are yarn-specific.
Also, @li-zhihui, can you add a unit test for this? Second, it looks like this only works in standalone mode and not for YARN (since, as I understand the YARN code, YARN uses YarnClientSchedulerBackend and not SparkDeploySchedulerBackend)? Was that the intention?
}
while (!backend.isReady) {
  synchronized {
    this.wait(100)
Is there a reason for not using wait/notify here?
Just to keep the code simple. :)
If someone implements more backend implementations or changes backend.isReady somewhere, they don't need to call notify().
But waiting 100 milliseconds may be too long; is 10 milliseconds OK?
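The tradeoff being discussed here (polling with a timed wait versus a wait/notify handshake) can be sketched in a few lines of Scala. This is a standalone illustration, not Spark's actual code; the `Backend` class and method names are hypothetical:

```scala
// Hypothetical sketch contrasting the polled wait used in the PR with a
// wait/notify handshake. Illustrative names only, not Spark's API.
object ReadinessSketch {
  class Backend {
    @volatile private var ready = false
    private val lock = new Object

    // Called by the registration path once enough executors have arrived.
    def markReady(): Unit = lock.synchronized {
      ready = true
      lock.notifyAll()
    }

    // Polling style (as in the PR): re-check every 100 ms, so no one
    // has to remember to call notify() when the condition changes.
    def awaitReadyPolling(): Unit =
      while (!ready) lock.synchronized { lock.wait(100) }

    // wait/notify style: blocks until markReady() is invoked.
    def awaitReadyNotify(): Unit = lock.synchronized {
      while (!ready) lock.wait()
    }
  }

  def main(args: Array[String]): Unit = {
    val b = new Backend
    val t = new Thread(() => { Thread.sleep(50); b.markReady() })
    t.start()
    b.awaitReadyNotify() // returns once markReady() has run
    t.join()
    println("backend ready")
  }
}
```

The polling version trades a little latency (up to one poll interval) for not having to thread a `notify()` call through every place that can change readiness, which matches the reasoning in the comment above.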
yarn uses CoarseGrainedSchedulerBackend for standalone mode and the YarnClientSchedulerBackend (which is based on CoarseGrainedSchedulerBackend) for client mode. I don't see this PR handling that, since it's not incrementing the totalExecutors. I haven't looked at it in detail yet, but I would think you could do it in RegisterExecutor, and that would handle both, since SparkDeploySchedulerBackend is based on CoarseGrainedSchedulerBackend also.
If people think it's not useful for other deployment modes then I can look into making it YARN-specific. We already have a small sleep to help with this in YarnClusterScheduler.
@tgravescs @kayousterhout The PR only works in standalone mode now, but it provides an abstract method isReady() in SchedulerBackend.scala for all backend implementations. For yarn mode, it seems better to use the registered number as the threshold.
@tgravescs @mridulm @kayousterhout I added a commit which submits stages only after a configured number of executors have been registered. I think it will work well in yarn mode. Submit stages only after the number of successfully registered executors reaches this value; default value is 0: spark.executor.minRegisteredNum = 20
Sorry to jump in late on this, but I think spark.executor.minRegisteredNum sounds like an executor property, when this is a property of the driver.
Thanks @sryza. How about spark.scheduler.minRegisteredExecutors?
ready = true
return true
}
return false
No need to use "return" in the last statement, I think.
Thanks @CrazyJvm. But if I remove "return", the method always returns true. I don't know why. I use Scala 2.10.3.
override def isReady(): Boolean = {
if ((System.currentTimeMillis() - createTime) >= maxRegisteredWaitingTime) {
ready = true
}
ready
}
Thanks @adrian-wang, but I think it's necessary to return true quickly, because ready is true most of the time.
Thanks @CrazyJvm. I made a mistake; the last "return" is not necessary.
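For reference on the return-value confusion above: in Scala a method's result is the value of the last expression in its body, so explicit `return` statements can usually be dropped, as @adrian-wang's rewrite shows. A tiny standalone illustration (hypothetical method, not the PR's code):

```scala
// Illustration of Scala's "last expression is the result" rule, using a
// simplified, hypothetical readiness check.
object LastExpression {
  // No `return` keywords: the if/else is itself an expression, and its
  // value becomes the method's result.
  def isReady(elapsedMs: Long, maxWaitMs: Long, alreadyReady: Boolean): Boolean =
    if (alreadyReady || elapsedMs >= maxWaitMs) true else false

  def main(args: Array[String]): Unit = {
    println(isReady(31000, 30000, alreadyReady = false)) // past the max wait
    println(isReady(1000, 30000, alreadyReady = false))  // still waiting
  }
}
```

The pitfall @li-zhihui hit is mixing styles: with an early `return true` inside the `if` but no trailing expression, the block's value can silently change, so it is safest to use either all-`return` or all-expression style in one method.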
@tgravescs @sryza I added a commit supporting the feature (percentage style) for yarn mode, tested successfully on YARN 2.2.0. About the default value of minRegisteredRatio: yarn mode is 0.9, standalone mode is 0.
@@ -77,6 +77,12 @@ private[spark] class YarnClientSchedulerBackend(
logDebug("ClientArguments called with: " + argsArrayBuf)
val args = new ClientArguments(argsArrayBuf.toArray, conf)
totalExecutors.set(args.numExecutors)
Note that YARN has two modes: yarn-client and yarn-cluster. This changes it for yarn-client mode, but not yarn-cluster mode; yarn-cluster mode currently just uses CoarseGrainedSchedulerBackend directly.
Thanks @tgravescs
@tgravescs I added a commit to support yarn-cluster mode.
@tgravescs @kayousterhout
@tgravescs @kayousterhout How about moving it (waitBackendReady) to postStartHook()?
val conf = scheduler.sc.conf
private val timeout = AkkaUtils.askTimeout(conf)
private val akkaFrameSize = AkkaUtils.maxFrameSizeBytes(conf)
// Submit tasks only after (registered executors / total expected executors)
// is equal to at least this value, which is a double between 0 and 1.
var minRegisteredRatio = conf.getDouble("spark.scheduler.minRegisteredExecutorsRatio", 0)
Please add the new configs to the user docs - see docs/configuration.md
@tgravescs Done.
postStartHook seems like a good place to put it. It will only be called once there, and as you said, YARN cluster mode actually waits for the SparkContext to be initialized before allocating executors. It looks like we aren't handling Mesos; we should at least file a JIRA for this. For the YARN side where you added the TODOs about the sleep, I think we can leave them here as there is another JIRA to remove them.
Thanks @tgravescs |
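As a rough sketch of what a waitBackendReady-style check discussed above might look like, here is a standalone approximation; the `FakeBackend` class, its fields, and the constants are illustrative, not copied from the Spark sources:

```scala
// Illustrative sketch of a ratio-plus-timeout readiness check of the kind
// discussed in this thread. Names approximate, but are not, the real code.
object WaitBackendSketch {
  class FakeBackend(totalExpected: Int, minRatio: Double, maxWaitMs: Long) {
    private val createTime = System.currentTimeMillis()
    @volatile var registered = 0 // incremented as executors register

    // Ready once enough executors have registered, or once the maximum
    // waiting time has elapsed regardless of the ratio.
    def isReady: Boolean = {
      val enough =
        totalExpected > 0 && registered.toDouble / totalExpected >= minRatio
      enough || (System.currentTimeMillis() - createTime) >= maxWaitMs
    }
  }

  // What a postStartHook-style wait could do: poll until the backend is ready.
  def waitBackendReady(backend: FakeBackend): Unit =
    while (!backend.isReady) Thread.sleep(10)

  def main(args: Array[String]): Unit = {
    val backend = new FakeBackend(totalExpected = 10, minRatio = 0.8, maxWaitMs = 5000)
    backend.registered = 8 // 8/10 = 0.8 meets the ratio immediately
    waitBackendReady(backend)
    println(s"ready with ${backend.registered} executors")
  }
}
```

Putting the wait in postStartHook, as suggested, means it runs exactly once per SparkContext, before any stage is submitted.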
<td><code>spark.scheduler.maxRegisteredExecutorsWaitingTime</code></td>
<td>30000</td>
<td>
Whatever (registered executors / total expected executors) is reached
I think we should clarify both of these a bit, because it's really that scheduling starts when either one is hit; so adding a reference to maxRegisteredExecutorsWaitingTime in the description of minRegisteredExecutorsRatio would be good.
How about something like below? Note I'm not a doc writer so I'm fine with changing.
for spark.scheduler.minRegisteredExecutorsRatio:
The minimum ratio of registered executors (registered executors / total expected executors) to wait for before scheduling begins. Specified as a double between 0 and 1. Regardless of whether the minimum ratio of executors has been reached, the maximum amount of time it will wait before scheduling begins is controlled by config spark.scheduler.maxRegisteredExecutorsWaitingTime.
Then for spark.scheduler.maxRegisteredExecutorsWaitingTime:
Maximum amount of time to wait for executors to register before scheduling begins (in milliseconds).
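Assembled from the suggested wording above, the configuration.md rows might read as follows; this is a sketch in the HTML table markup quoted earlier in the thread, not the merged text:

```
<tr>
  <td><code>spark.scheduler.minRegisteredExecutorsRatio</code></td>
  <td>0</td>
  <td>
    The minimum ratio of registered executors (registered executors / total
    expected executors) to wait for before scheduling begins. Specified as a
    double between 0 and 1. Regardless of whether the minimum ratio of
    executors has been reached, the maximum amount of time it will wait
    before scheduling begins is controlled by config
    <code>spark.scheduler.maxRegisteredExecutorsWaitingTime</code>.
  </td>
</tr>
<tr>
  <td><code>spark.scheduler.maxRegisteredExecutorsWaitingTime</code></td>
  <td>30000</td>
  <td>
    Maximum amount of time to wait for executors to register before
    scheduling begins (in milliseconds).
  </td>
</tr>
```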
@tgravescs I added a commit according to the comments.
Looks good. Thanks @li-zhihui
Submit tasks after (configured ratio of) executors have been registered

Because submitting tasks and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality. A simple workaround is to sleep a few seconds in the application so that executors have enough time to register. This PR adds two configuration properties to make the TaskScheduler submit tasks only after enough executors have been registered:

# Submit tasks only after (registered executors / total executors) reaches this ratio; default value is 0
spark.scheduler.minRegisteredExecutorsRatio = 0.8
# Regardless of whether minRegisteredExecutorsRatio has been reached, submit tasks after maxRegisteredWaitingTime (milliseconds); default value is 30000
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000

Author: li-zhihui <zhihui.li@intel.com>

Closes apache#900 from li-zhihui/master and squashes the following commits:
b9f8326 [li-zhihui] Add logs & edit docs
1ac08b1 [li-zhihui] Add new configs to user docs
22ead12 [li-zhihui] Move waitBackendReady to postStartHook
c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
4261454 [li-zhihui] Add docs for new configs & code style
ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
e7b6272 [li-zhihui] support yarn-cluster
37f7dc2 [li-zhihui] support yarn mode(percentage style)
3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
Fix race condition at SchedulerBackend.isReady in standalone mode

In SPARK-1946 (PR #900), the configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set. Because the number of expected executors is uncertain in standalone mode, this PR uses CPU cores (<code>--total-executor-cores</code>) as the expected resources to judge whether the SchedulerBackend is ready.

Author: li-zhihui <zhihui.li@intel.com>
Author: Li Zhihui <zhihui.li@intel.com>

Closes #1525 from li-zhihui/fixre4s and squashes the following commits:
e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
ca54bd9 [li-zhihui] Format log with String interpolation
88c7dc6 [li-zhihui] Few codes and docs refactor
41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode

(cherry picked from commit 28dbae8)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Because submitting tasks and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality.
A simple workaround is to sleep a few seconds in the application so that executors have enough time to register.
This PR adds two configuration properties to make the TaskScheduler submit tasks only after enough executors have been registered.

Submit tasks only after (registered executors / total executors) reaches this ratio; the default value is 0:
spark.scheduler.minRegisteredExecutorsRatio = 0.8

Regardless of whether minRegisteredExecutorsRatio has been reached, submit tasks after maxRegisteredWaitingTime milliseconds; the default value is 30000:
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000
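The two properties above form an either/or policy: scheduling starts when the registered ratio is met or when the waiting time elapses, whichever comes first. A minimal standalone sketch of that decision (names and signature are illustrative, not Spark's):

```scala
// Sketch of the two-knob submission policy described above. Scheduling
// starts when EITHER the registered ratio is met OR the max waiting time
// has elapsed. Names are illustrative, not taken from the Spark sources.
object TimeoutPathSketch {
  def shouldSubmit(registered: Int, expected: Int, minRatio: Double,
                   elapsedMs: Long, maxWaitMs: Long): Boolean = {
    val ratioMet = expected > 0 && registered.toDouble / expected >= minRatio
    ratioMet || elapsedMs >= maxWaitMs
  }

  def main(args: Array[String]): Unit = {
    // Ratio not met (2/10 < 0.8) and still inside the window: keep waiting.
    println(shouldSubmit(2, 10, 0.8, elapsedMs = 1000, maxWaitMs = 5000))
    // Ratio still not met, but the 5000 ms window has elapsed: submit anyway.
    println(shouldSubmit(2, 10, 0.8, elapsedMs = 6000, maxWaitMs = 5000))
  }
}
```

The timeout branch guarantees the job is never blocked indefinitely on a busy cluster where some executors may take minutes to arrive, which was the concern raised earlier in the thread.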