
[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval. #13482

Closed

rdblue wants to merge 1 commit into apache:master from rdblue:SPARK-15725-am-sleep-work-around

Conversation

@rdblue (Contributor) commented Jun 2, 2016

What changes were proposed in this pull request?

Update ApplicationMaster to sleep for at least the minimum allocation interval before calling allocateResources. This prevents the YarnAllocator from being overloaded when the allocation thread is triggered each time an executor is killed and its connections die, which can make the app unstable.
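
To illustrate the intended behavior, here is a minimal sketch in Scala. The names and values (`allocatorLock`, `heartbeatInterval`, `minAllocationInterval`) are placeholders for illustration, not the actual `ApplicationMaster` fields:

```scala
// Sketch: a reporter loop that sleeps at least minAllocationInterval between
// calls to allocate(), even if a notifyAll() from another thread wakes it early.
object ReporterLoopSketch {
  private val allocatorLock = new Object
  private val heartbeatInterval = 3000L     // ms, assumed value
  private val minAllocationInterval = 200L  // ms, assumed value

  def run(allocate: () => Unit): Unit = {
    while (true) {
      val sleepStart = System.currentTimeMillis()
      allocatorLock.synchronized {
        // A notifyAll() from another thread can cut this wait short.
        allocatorLock.wait(heartbeatInterval)
      }
      val slept = System.currentTimeMillis() - sleepStart
      if (slept < minAllocationInterval) {
        // Woken early: top up the sleep so allocate() is not called too often.
        Thread.sleep(minAllocationInterval - slept)
      }
      allocate()
    }
  }
}
```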

How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue in the most common case.

@SparkQA commented Jun 2, 2016

Test build #59895 has finished for PR 13482 at commit fde631b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Jun 3, 2016

@yhuai, @rxin, we should consider this work-around for 2.0 if it isn't too late. We see a lot of apps fail because the driver and AM lock up.

@andrewor14 (Contributor) commented:

cc @vanzin @tgravescs

@vanzin (Contributor) commented Jun 6, 2016

@rdblue could you follow the usual convention in the pr title ([SPARK-15725][yarn] Blah)? thx

@vanzin (Contributor) commented Jun 6, 2016

Seems like an ok workaround to me; we really should spend some time looking at removing some of those locks and avoiding askWithRetry (which shouldn't ever be needed with a reliable RPC library like the one Spark uses).

Contributor (inline review comment):

These should all be debug.

Contributor Author (inline review reply):

These are the only signal we have that the allocation loop is getting signalled too much. I think it's worth an info message so we can identify other cases that are causing this behavior. The normal case where the thread already slept for more than the min interval is debug. This doesn't add an unreasonable number of log messages.

@andrewor14 (Contributor) commented Jun 16, 2016

Most users don't care how long the internal YARN allocator loop slept for. If these logs are valuable to you, then I would just set the ApplicationMaster log level to DEBUG in your environment. Otherwise we'll end up cluttering the logs too much. (In the old code we also do logDebug here.)

@andrewor14 (Contributor) commented Jun 6, 2016

@rdblue the reason for the hang is the GetExecutorLossReason right? IIUC we send one to the AM every time an executor dies. What if we just keep a set of executor IDs we're waiting to kill on the driver (this might already exist) and query that before asking the AM? If a particular executor is in that set then we can just return ExecutorKilled without sending a message.
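
A rough sketch of that suggestion, with purely illustrative names (`LossReasonTracker`, `pendingKill`, `askAm`); it is not the actual scheduler backend code:

```scala
import scala.collection.mutable

// Track executors the driver itself asked to kill, and answer the loss reason
// locally instead of sending a GetExecutorLossReason message to the AM.
sealed trait ExecutorLossReason
case object ExecutorKilled extends ExecutorLossReason
case class LossReportedByAm(message: String) extends ExecutorLossReason

class LossReasonTracker(askAm: String => ExecutorLossReason) {
  private val pendingKill = mutable.Set[String]()

  def recordKillRequest(executorId: String): Unit = synchronized {
    pendingKill += executorId
  }

  def lossReason(executorId: String): ExecutorLossReason = synchronized {
    if (pendingKill.remove(executorId)) {
      ExecutorKilled        // we asked for this kill; no message to the AM needed
    } else {
      askAm(executorId)     // only unexpected losses go to the ApplicationMaster
    }
  }
}
```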

@andrewor14 (Contributor) commented:

I think this is important to fix for 2.0 but I personally found the changes in this patch rather confusing. If there's a simpler workaround we could do (such as the solution I suggested, if that works) then I would prefer that.

@tgravescs (Contributor) commented:

So why don't we just take out the notifyAll call when we get a GetExecutorLossReason?
We can add a parameter to resetAllocatorInterval() that resets the interval but doesn't call notify. The most it's going to sleep is the heartbeat interval (a few seconds), which doesn't seem bad just to get the loss reason. If requests keep coming in, the interval gets reset so it only sleeps 200ms (the default) on subsequent sleeps.

I guess you might end up with the same situation on the RequestExecutors, but generally there I expect we request a bunch at a time, and the allocation manager runs on a timer, so those shouldn't happen that frequently.
And we probably don't really need the notify on the requestExecutors call either, but it isn't hurting anything and it could speed up requests by a few seconds.
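
A hypothetical sketch of the first suggestion above (reset the interval without waking the reporter thread); the class and field names are assumptions for illustration, not the actual ApplicationMaster code:

```scala
// Reset the allocation interval, optionally without notifying the reporter
// thread, so a GetExecutorLossReason does not trigger an immediate allocate().
class AllocatorIntervalSketch(initialAllocationInterval: Long,
                              heartbeatInterval: Long) {
  private val allocatorLock = new Object
  @volatile private var nextAllocationInterval = initialAllocationInterval

  def resetAllocatorInterval(wakeUp: Boolean): Unit = {
    nextAllocationInterval = initialAllocationInterval
    if (wakeUp) {
      // Only callers that really need a prompt allocation (e.g. RequestExecutors)
      // would pass wakeUp = true; loss-reason queries would pass false and simply
      // wait for the next heartbeat.
      allocatorLock.synchronized { allocatorLock.notifyAll() }
    }
  }
}
```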

@vanzin (Contributor) commented Jun 8, 2016

So why don't we just take out the notifyAll call when we get a GetExecutorLossReason?

If that helps, it's ok too. It would probably increase a little the time it takes the driver to learn why an executor failed, which can make some tasks take longer to be re-scheduled; but task failures because of executor loss aren't normal to start with, so it should be ok.

@rdblue changed the title from "SPARK-15725: Ensure ApplicationMaster sleeps for the min interval." to "[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval." on Jun 11, 2016
@rdblue (Contributor, Author) commented Jun 11, 2016

@andrewor14, I think we should consider two problems here: first, the fact that the thread will sleep for less than the min interval if something triggers it, and second, whatever is currently triggering it. We should certainly fix the loss reason request that is currently triggering this behavior, but I still think this patch is a good solution to the first problem, in case there are other situations that cause it as well.

There's not a good reason to sleep for less than the min interval if it can cause the application to become unstable. We could look at a more complicated strategy -- like an exponentially increasing min interval up to the current min -- but the important thing right now is to ensure nothing can cause this instability.

To be clear, I don't consider this a complete fix for both of those problems. We should definitely avoid the askWithRetry, only signal the allocator thread when necessary, etc. But as a safety precaution, I think this patch is a good start.
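
For reference, the exponentially increasing interval mentioned above could look roughly like this sketch; the names and bounds are assumptions, not part of this patch:

```scala
// Grow the interval between allocate() calls while wake-ups keep arriving,
// capped at the heartbeat interval; reset when the allocator is idle again.
class BackoffInterval(initial: Long = 200L, max: Long = 3000L) {
  private var current = initial

  def nextSleep(): Long = {
    val sleep = current
    current = math.min(current * 2, max)  // back off further on the next wake-up
    sleep
  }

  def reset(): Unit = { current = initial }
}
```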

@tgravescs (Contributor) commented:

We should probably decouple the task scheduling and the executor lost reason eventually, but that is a separate issue.

The only time I would see removing the notifyAll being a problem is if someone increases the heartbeat timeout to a very large number, but it would have to be close to the RPC timeout, which they just shouldn't do. Otherwise, a couple of extra seconds to reschedule tasks in a failure case that is not the norm shouldn't be a problem, and as soon as one happens, it goes down to the 200ms that this patch is suggesting anyway.

@rdblue does removing the notifyAll call solve your problem as well? That seems like a much cleaner approach than notifying but then sleeping some time again.

@rdblue (Contributor, Author) commented Jun 13, 2016

@tgravescs, removing notifyAll doesn't solve the problem entirely, it just removes one path that's causing the allocate call to be run too many times. (Also, I haven't tested delaying loss reasons in our Spark jobs at scale, other than for the 200ms introduced here.) Ensuring that allocate is not called too often addresses the problem no matter what the immediate cause is. That's why I think it's a good idea to fix the two separately: first ensure that allocate will not run too often and starve other operations on the YarnAllocator and, second, track down the cases that cause this.

Even if we were to fix the YarnAllocator so we don't have resource contention, ensuring a min interval between calls to allocate is a good idea so that Spark doesn't make too many useless calls to the resource manager. And I don't want to track down this same bug in 3 months because a different path triggers it.

@tgravescs (Contributor) commented:

Ok, I'm fine with this as a workaround for now, since you don't really know every trigger and this will ensure the minimum interval, but please clean up the code so that it's clear which sleep is which and add a nice comment stating why we are doing this.

Then I think we should file another JIRA to investigate a more proper fix for this. We shouldn't have to wait for the loss reason to schedule.

@andrewor14 (Contributor) commented:

@rdblue can you address the comments? I would like to get this into the 2.0 rc1 if possible.

@tgravescs (Contributor) commented:

ping @rdblue do you have time to update this?

@rdblue force-pushed the SPARK-15725-am-sleep-work-around branch from fde631b to e022408 on June 23, 2016 at 16:16
@rdblue (Contributor, Author) commented Jun 23, 2016

@tgravescs, I've updated it. Sorry about the delay; for some reason the notifications for this issue didn't make it to my inbox, so I wasn't seeing updates.

@SparkQA commented Jun 23, 2016

Test build #61117 has finished for PR 13482 at commit e022408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

+1

asfgit pushed a commit that referenced this pull request on Jun 23, 2016:

[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval.

## What changes were proposed in this pull request?

Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents the `YarnAllocator` from being overloaded when the allocation thread is triggered each time an executor is killed and its connections die, which can make the app unstable.

## How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue in the most common case.

Author: Ryan Blue <blue@apache.org>

Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.

(cherry picked from commit a410814)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
asfgit closed this in a410814 on Jun 23, 2016
@rdblue (Contributor, Author) commented Jun 23, 2016

@tgravescs, thanks for reviewing! Sorry about the delay!

zzcclp added a commit to zzcclp/spark that referenced this pull request Jun 24, 2016