
[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval. #13482

Closed

rdblue wants to merge 1 commit into apache:master from rdblue:SPARK-15725-am-sleep-work-around

Conversation

@rdblue (Contributor) commented Jun 2, 2016

What changes were proposed in this pull request?

Update ApplicationMaster to sleep for at least the minimum allocation interval before calling allocateResources. This prevents the YarnAllocator from being overloaded when the allocation thread is triggered each time an executor is killed and its connections die, which can make the app unstable.
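
To illustrate the intended behavior, here is a minimal sketch in Scala. The names and values (`allocatorLock`, `heartbeatInterval`, `minAllocationInterval`) are placeholders for illustration, not the actual `ApplicationMaster` fields:

```scala
// Sketch: a reporter loop that sleeps at least minAllocationInterval between
// calls to allocate(), even if a notifyAll() from another thread wakes it early.
object ReporterLoopSketch {
  private val allocatorLock = new Object
  private val heartbeatInterval = 3000L     // ms, assumed value
  private val minAllocationInterval = 200L  // ms, assumed value

  def run(allocate: () => Unit): Unit = {
    while (true) {
      val sleepStart = System.currentTimeMillis()
      allocatorLock.synchronized {
        // A notifyAll() from another thread can cut this wait short.
        allocatorLock.wait(heartbeatInterval)
      }
      val slept = System.currentTimeMillis() - sleepStart
      if (slept < minAllocationInterval) {
        // Woken early: top up the sleep so allocate() is not called too often.
        Thread.sleep(minAllocationInterval - slept)
      }
      allocate()
    }
  }
}
```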

How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue in the most common case.

@SparkQA commented Jun 2, 2016

Test build #59895 has finished for PR 13482 at commit fde631b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Jun 3, 2016

@yhuai, @rxin, we should consider this work-around for 2.0 if it isn't too late. We see a lot of apps fail because the driver and AM lock up.

@andrewor14 (Contributor) commented:

cc @vanzin @tgravescs

@vanzin (Contributor) commented Jun 6, 2016

@rdblue could you follow the usual convention in the pr title ([SPARK-15725][yarn] Blah)? thx

@vanzin (Contributor) commented Jun 6, 2016

Seems like an ok workaround to me; we really should spend some time looking at removing some of those locks and avoiding askWithRetry (which shouldn't ever be needed with a reliable RPC library like the one Spark uses).

Contributor (inline review comment):

These should all be debug.

Contributor Author (inline review reply):

These are the only signal we have that the allocation loop is getting signalled too much. I think it's worth an info message so we can identify other cases that are causing this behavior. The normal case where the thread already slept for more than the min interval is debug. This doesn't add an unreasonable number of log messages.

@andrewor14 (Contributor) commented Jun 16, 2016

Most users don't care how long the internal YARN allocator loop slept for. If these logs are valuable to you, then I would just set the ApplicationMaster log level to DEBUG in your environment. Otherwise we'll end up cluttering the logs too much. (In the old code we also do logDebug here.)

@andrewor14 (Contributor) commented Jun 6, 2016

@rdblue the reason for the hang is the GetExecutorLossReason right? IIUC we send one to the AM every time an executor dies. What if we just keep a set of executor IDs we're waiting to kill on the driver (this might already exist) and query that before asking the AM? If a particular executor is in that set then we can just return ExecutorKilled without sending a message.
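
A rough sketch of that suggestion, with purely illustrative names (`LossReasonTracker`, `pendingKill`, `askAm`); it is not the actual scheduler backend code:

```scala
import scala.collection.mutable

// Track executors the driver itself asked to kill, and answer the loss reason
// locally instead of sending a GetExecutorLossReason message to the AM.
sealed trait ExecutorLossReason
case object ExecutorKilled extends ExecutorLossReason
case class LossReportedByAm(message: String) extends ExecutorLossReason

class LossReasonTracker(askAm: String => ExecutorLossReason) {
  private val pendingKill = mutable.Set[String]()

  def recordKillRequest(executorId: String): Unit = synchronized {
    pendingKill += executorId
  }

  def lossReason(executorId: String): ExecutorLossReason = synchronized {
    if (pendingKill.remove(executorId)) {
      ExecutorKilled        // we asked for this kill; no message to the AM needed
    } else {
      askAm(executorId)     // only unexpected losses go to the ApplicationMaster
    }
  }
}
```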

@andrewor14 (Contributor) commented:

I think this is important to fix for 2.0 but I personally found the changes in this patch rather confusing. If there's a simpler workaround we could do (such as the solution I suggested, if that works) then I would prefer that.

@tgravescs (Contributor) commented:

So why don't we just take out the notifyAll call when we get a GetExecutorLossReason?
We can add a parameter to resetAllocatorInterval() that resets the interval but doesn't call notify. The most it's going to sleep is the heartbeat interval (a few seconds), which doesn't seem bad just to get the loss reason. If requests keep coming in, the interval gets reset so it only sleeps 200ms (the default) on subsequent sleeps.

I guess you might end up with the same situation on the RequestExecutors, but generally there I expect we request a bunch at a time, and the allocation manager runs on a timer, so those shouldn't happen that frequently.
And we probably don't really need the notify on the requestExecutors call either, but it isn't hurting anything and it could speed up requests by a few seconds.
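
A hypothetical sketch of the first suggestion above (reset the interval without waking the reporter thread); the class and field names are assumptions for illustration, not the actual ApplicationMaster code:

```scala
// Reset the allocation interval, optionally without notifying the reporter
// thread, so a GetExecutorLossReason does not trigger an immediate allocate().
class AllocatorIntervalSketch(initialAllocationInterval: Long,
                              heartbeatInterval: Long) {
  private val allocatorLock = new Object
  @volatile private var nextAllocationInterval = initialAllocationInterval

  def resetAllocatorInterval(wakeUp: Boolean): Unit = {
    nextAllocationInterval = initialAllocationInterval
    if (wakeUp) {
      // Only callers that really need a prompt allocation (e.g. RequestExecutors)
      // would pass wakeUp = true; loss-reason queries would pass false and simply
      // wait for the next heartbeat.
      allocatorLock.synchronized { allocatorLock.notifyAll() }
    }
  }
}
```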

@vanzin (Contributor) commented Jun 8, 2016

So why don't we just take out the notifyAll call when we get a GetExecutorLossReason?

If that helps, it's ok too. It would probably increase a little the time it takes the driver to learn why an executor failed, which can make some tasks take longer to be re-scheduled; but task failures because of executor loss aren't normal to start with, so it should be ok.

@rdblue changed the title from "SPARK-15725: Ensure ApplicationMaster sleeps for the min interval." to "[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval." on Jun 11, 2016
@rdblue (Contributor, Author) commented Jun 11, 2016

@andrewor14, I think we should consider two problems here: first, the fact that the thread will sleep for less than the min interval if something triggers it, and second, whatever is currently triggering it. We should certainly fix the loss reason request that is currently triggering this behavior, but I still think this patch is a good solution to the first problem, in case there are other situations that cause it as well.

There's not a good reason to sleep for less than the min interval if it can cause the application to become unstable. We could look at a more complicated strategy -- like an exponentially increasing min interval up to the current min -- but the important thing right now is to ensure nothing can cause this instability.

To be clear, I don't consider this a complete fix for both of those problems. We should definitely avoid the askWithRetry, only signal the allocator thread when necessary, etc. But as a safety precaution, I think this patch is a good start.
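
For reference, the exponentially increasing interval mentioned above could look roughly like this sketch; the names and bounds are assumptions, not part of this patch:

```scala
// Grow the interval between allocate() calls while wake-ups keep arriving,
// capped at the heartbeat interval; reset when the allocator is idle again.
class BackoffInterval(initial: Long = 200L, max: Long = 3000L) {
  private var current = initial

  def nextSleep(): Long = {
    val sleep = current
    current = math.min(current * 2, max)  // back off further on the next wake-up
    sleep
  }

  def reset(): Unit = { current = initial }
}
```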

@tgravescs (Contributor) commented:

We should probably decouple the task scheduling and the executor lost reason eventually, but that is a separate issue.

The only time I would see removing the notifyAll being a problem is if someone increases the heartbeat timeout to a very large number, but it would have to be close to the RPC timeout, which they just shouldn't do. Otherwise, a couple of extra seconds to reschedule tasks in a failure case that is not the norm shouldn't be a problem, and as soon as one happens, it goes down to the 200ms that this patch is suggesting anyway.

@rdblue does removing the notifyAll call solve your problem as well? That seems like a much cleaner approach than notifying but then sleeping some time again.

@rdblue (Contributor, Author) commented Jun 13, 2016

@tgravescs, removing notifyAll doesn't solve the problem entirely, it just removes one path that's causing the allocate call to be run too many times. (Also, I haven't tested delaying loss reasons in our Spark jobs at scale, other than for the 200ms introduced here.) Ensuring that allocate is not called too often addresses the problem no matter what the immediate cause is. That's why I think it's a good idea to fix the two separately: first ensure that allocate will not run too often and starve other operations on the YarnAllocator and, second, track down the cases that cause this.

Even if we were to fix the YarnAllocator so we don't have resource contention, ensuring a min interval between calls to allocate is a good idea so that Spark doesn't make too many useless calls to the resource manager. And I don't want to track down this same bug in 3 months because a different path triggers it.

@tgravescs (Contributor) commented:

Ok, I'm fine with this as a workaround for now, since you don't really know every trigger and this will ensure the minimum interval, but please clean up the code so that it's clear which sleep is which and add a nice comment stating why we are doing this.

Then I think we should file another JIRA to investigate a more proper fix for this. We shouldn't have to wait for the loss reason to schedule.

@andrewor14 (Contributor) commented:

@rdblue can you address the comments? I would like to get this into the 2.0 rc1 if possible.

@tgravescs (Contributor) commented:

ping @rdblue do you have time to update this?

@rdblue force-pushed the SPARK-15725-am-sleep-work-around branch from fde631b to e022408 on June 23, 2016 at 16:16
@rdblue (Contributor, Author) commented Jun 23, 2016

@tgravescs, I've updated it. Sorry about the delay; for some reason the notifications for this issue didn't make it to my inbox, so I wasn't seeing updates.

@SparkQA commented Jun 23, 2016

Test build #61117 has finished for PR 13482 at commit e022408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

+1

asfgit pushed a commit that referenced this pull request on Jun 23, 2016:

[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval.

## What changes were proposed in this pull request?

Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents the `YarnAllocator` from being overloaded when the allocation thread is triggered each time an executor is killed and its connections die, which can make the app unstable.

## How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue in the most common case.

Author: Ryan Blue <blue@apache.org>

Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.

(cherry picked from commit a410814)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
asfgit closed this in a410814 on Jun 23, 2016
@rdblue (Contributor, Author) commented Jun 23, 2016

@tgravescs, thanks for reviewing! Sorry about the delay!

zzcclp added a commit to zzcclp/spark that referenced this pull request Jun 24, 2016