
SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler #634

Closed
wants to merge 4 commits

Conversation

sryza
Contributor

@sryza sryza commented May 4, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14647/

@mridulm
Contributor

mridulm commented May 4, 2014

It would be better to make this configurable (and increase it - not remove it).
Currently, we actually sleep for much longer in our jobs to ensure a minimum number of containers are available.
The sleep was not arbitrarily thrown in there :-)

@pwendell
Contributor

pwendell commented May 4, 2014

@sryza @mridulm what is this sleep actually for? We should really have in-code documentation for things like this.

If it's necessary to sleep like this, could we restructure it as a 1 second busy-wait on a particular condition? That way we wouldn't need to hard code any particular value.

@sryza
Contributor Author

sryza commented May 4, 2014

There's a little more detail on the JIRA, but, to summarize, the sleep is there to wait for executors to connect. What's confusing to me is that Spark standalone doesn't have a similar sleep.

@mridulm is the worry that only some executors would register and that tasks would then unnecessarily get scheduled on nodes where the data isn't local? Have you seen this negatively impact performance? If that's the case, I think the best thing would be to wait until some number of executors have registered.

@pwendell
Contributor

pwendell commented May 4, 2014

Ah I see - sorry I didn't read the JIRA.

So since we don't know a priori how many executors to expect, I don't think we can wait on any particular condition. Even in YARN mode, AFAIK the executor number is just a request to the RM and not a guarantee.

I'm still not 100% sure why this needs to exist. @mridulm is the main concern here just launching non-local tasks? If so, why not just set the spark.locality.wait threshold (that's the whole reason it exists)?

In standalone mode this may not really be necessary because the executors typically launch in under 3 seconds, which is the default locality wait.
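
For reference, the delay-scheduling threshold being discussed is the spark.locality.wait setting; a minimal sketch of raising it (the app name is a placeholder, and in this era of Spark the value is interpreted as milliseconds):

```scala
import org.apache.spark.SparkConf

// Give the scheduler more time to find a data-local slot before it
// falls back to launching the task on a non-local executor.
val conf = new SparkConf()
  .setAppName("locality-wait-example")  // placeholder name
  .set("spark.locality.wait", "10000")  // default is 3000 ms
```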

@pwendell
Contributor

pwendell commented May 4, 2014

@sryza @mridulm if we do decide we want to have a new type of waiting here, maybe we could add a spark.scheduler.initialWait setting or something to give executors time to show up before we try to schedule.

@sryza
Contributor Author

sryza commented May 5, 2014

Keep in mind this sleep is after we've already waited to receive a set of containers from the YARN ResourceManager and sent RPCs to the NodeManagers to start them. So we do know how many executors to expect.

@pwendell
Contributor

pwendell commented May 5, 2014

@sryza Ah I see, in that case, we could wait on the specific condition, but what if e.g. there is a straggler and one container takes a long time to launch (or never launches)?

@sryza
Contributor Author

sryza commented May 5, 2014

Yeah, I think we'd want some (probably configurable) max wait time.
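
A minimal sketch of that approach - poll until the expected number of executors has registered, or give up after a configurable maximum wait. All names here are hypothetical, not the actual scheduler API:

```scala
// Hypothetical helper: block until `registeredExecutors()` reaches the
// expected count, or until `maxWaitMs` has elapsed, whichever comes first.
def waitForExecutors(expected: Int,
                     registeredExecutors: () => Int,
                     maxWaitMs: Long = 30000L,
                     pollIntervalMs: Long = 100L): Unit = {
  val deadline = System.currentTimeMillis() + maxWaitMs
  while (registeredExecutors() < expected &&
         System.currentTimeMillis() < deadline) {
    Thread.sleep(pollIntervalMs)  // cheap poll instead of a fixed 3-second sleep
  }
}
```

This also bounds the straggler case: a container that never launches only costs the max wait, not an indefinite block.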

@pwendell
Contributor

pwendell commented May 5, 2014

Seems reasonable - still curious exactly what case @mridulm is dealing with. I guess it's just the case where the containers take a long time to launch (once allocated), much longer than the delay scheduling threshold?

@mridulm
Contributor

mridulm commented May 5, 2014

The scheduling and data locality come later.
This precedes all that.

To give an example: suppose we need 200 containers to run a job. As soon as we start, we might get only 2 or 5 nodes allocated; it takes some time for YARN to ramp up to a good number.

Now, if we start the actual job before sufficient containers have been allocated, we can get job failures since a) data can't be cached in memory and b) too much load lands on too few containers.

Task locality and scheduling come later, but they are also impacted.
Assuming the job does not fail, all data ends up on this small number of containers while the rest of the cluster is busy pulling data from them, which causes suboptimal performance (I have not seen job failures due to this - yet).

@mridulm
Contributor

mridulm commented May 5, 2014

Btw I don't recall exactly why the actual value of 3 seconds is there; I think earlier Spark used to crash if there were no containers available when the job was about to be launched - thankfully that is not the case anymore!

@tgravescs
Contributor

Note I had filed SPARK-1453 to make the wait configurable; see the discussions there. I'm fine with removing this once the other one is configurable, but I would hesitate to do it before that.
As was already mentioned, there are 2 different timeouts; this is the second one, hardcoded to 3 seconds. I'm not really sure what the difference is between this one and the ALLOCATOR_LOOP_WAIT_COUNT. I was originally thinking of removing this one and just making the other configurable, but I haven't investigated in depth yet.

I don't know the details of spark.locality.wait well enough to know if it works in all cases. I assume spark.locality.wait would work fine as long as you have an input from Hadoop or somewhere else that tells you the locations, but for RDDs without locations it wouldn't help to wait?
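
To illustrate the point about locations: an RDD read from HDFS exposes its block locations through preferredLocations, while a parallelized collection typically reports none, so delay scheduling has nothing to wait for. A small sketch, assuming an accessible HDFS path (the path and master URL are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("locality-demo").setMaster("local[2]"))  // placeholders

// HDFS-backed RDD: partitions carry the hostnames of their blocks.
val fromHdfs = sc.textFile("hdfs:///path/to/input")  // placeholder path
println(fromHdfs.preferredLocations(fromHdfs.partitions(0)))   // e.g. List(node1, node2)

// Parallelized collection: nothing backing it on storage, so no preferred locations.
val fromMemory = sc.parallelize(1 to 1000)
println(fromMemory.preferredLocations(fromMemory.partitions(0)))  // List()
```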

@sryza
Contributor Author

sryza commented May 5, 2014

I like the idea of making the first wait configurable and taking out the second. For the second, we could possibly still wait for the executors we've launched to register.

Regarding spark.locality.wait, yes, it needs the locations, but it doesn't need the preferredLocalityData to be explicitly passed in by the user the way it's needed for requesting where to put executors.

@tgravescs
Contributor

@sryza Feel free to take over that jira to make it configurable.

@tgravescs
Contributor

@sryza can we close this and just do SPARK-1453 and/or SPARK-1946?

@tgravescs
Contributor

I guess we could keep this around in case we do SPARK-1946.

@tgravescs
Contributor

@sryza I merged in SPARK-1946 - would you want to update this to handle both cluster mode and client mode? I think we can actually also remove the waitForInitialAllocations() call.

@SparkQA

SparkQA commented Jul 14, 2014

QA tests have started for PR 634. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16642/consoleFull

@sryza
Contributor Author

sryza commented Jul 14, 2014

Posted a patch that takes out the sleeps and waitForInitialAllocations

@SparkQA

SparkQA commented Jul 14, 2014

QA tests have started for PR 634. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16644/consoleFull

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 634:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16642/consoleFull

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 634:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16644/consoleFull

@@ -416,19 +407,8 @@ object ApplicationMaster extends Logging {
// This is to ensure that we have reasonable number of containers before we start
Contributor

we can remove this whole comment block now.

@tgravescs
Contributor

I just thought of something bad here. Now the default behavior is to wait for 0%, and we have removed the bit of sleep there, so if someone is just using the default they aren't very likely to get many executors before starting.

Originally I was thinking of keeping the default at 0 so behavior doesn't change, but personally I would rather see it higher so the out-of-the-box behavior is better. Thoughts on changing it for YARN? To say 80%?

@sryza
Contributor Author

sryza commented Jul 16, 2014

That's a good point. Changing it for YARN seems like the right thing, and 80% sounds reasonable to me.

Another thing is the wait time. Previously it was 6 seconds, but now spark.scheduler.maxRegisteredExecutorsWaitingTime defaults to 5 times that. 30 seconds seems a little excessive to me in general - at least for jobs without caching, after a couple of seconds the wait outweighs scheduling some non-local tasks. What do you think about decreasing this to 6 seconds in general? Or at least for YARN?
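
For reference, the two knobs being discussed, using the config names from this era of the patch (the exact names changed in later Spark versions, and the values below are only illustrative):

```scala
import org.apache.spark.SparkConf

// Wait until 80% of requested executors have registered before scheduling,
// but never block for more than 6 seconds (illustrative values).
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredExecutorsRatio", "0.8")
  .set("spark.scheduler.maxRegisteredExecutorsWaitingTime", "6000")  // milliseconds
```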

@tgravescs
Contributor

He had increased it to 30 seconds based on some experiments he did. I guess it depends on how well the scheduler recovers from starting with fewer executors. I know they have been doing a bit of work around this area, but I'm not sure of the real effect. I have definitely seen issues with jobs when they start too soon, so I think I would rather leave it at 30 for now unless we have more concrete data saying it's no longer an issue.

@sryza
Contributor Author

sryza commented Jul 20, 2014

Updated patch sets the min registered executors ratio to .8 for YARN

@SparkQA

SparkQA commented Jul 20, 2014

QA tests have started for PR 634. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16877/consoleFull

@SparkQA

SparkQA commented Jul 20, 2014

QA results for PR 634:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16877/consoleFull


override def postStartHook() {
  super.postStartHook()
Contributor

In yarn-client mode, super.postStartHook() should not be deleted, because it calls waitBackendReady when the configured ratio of executors has not yet registered.

Contributor

YarnClientClusterScheduler extends TaskSchedulerImpl, so it will just fall through to TaskSchedulerImpl.postStartHook() which calls waitBackendReady
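
A rough sketch of the hook chain being described, simplified rather than copied from the actual Spark source:

```scala
// Simplified approximation: the base scheduler's post-start hook blocks
// until the backend reports ready (configured ratio registered or max
// wait elapsed); the yarn-client scheduler inherits that behavior.
class TaskSchedulerImpl {
  protected def waitBackendReady(): Unit = {
    // poll the backend until it reports ready, or until the max wait expires
  }
  def postStartHook(): Unit = waitBackendReady()
}

class YarnClientClusterScheduler extends TaskSchedulerImpl
// No postStartHook override here, so yarn-client mode still waits.
```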

@tgravescs
Contributor

Looks good. +1, Thanks @sryza.

@asfgit asfgit closed this in f89cf65 Jul 21, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#634 from sryza/sandy-spark-1707 and squashes the following commits:

2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN
354c630 [Sandy Ryza] Remove outdated comments
c744ef3 [Sandy Ryza] Take out waitForInitialAllocations
2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
As reported in https://spark-project.atlassian.net/browse/SPARK-1055

"The used Spark version in the .../base/Dockerfile is stale on 0.8.1 and should be updated to 0.9.x to match the release."

Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes apache#634 from CodingCat/SPARK-1055 and squashes the following commits:

cb7330e [Nan Zhu] Update Dockerfile
adf8259 [CodingCat] fix the SCALA_VERSION and SPARK_VERSION in docker file

(cherry picked from commit 1aa4f8a)
Signed-off-by: Aaron Davidson <aaron@databricks.com>