[SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval. #13482
rdblue wants to merge 1 commit into apache:master
|
Test build #59895 has finished for PR 13482 at commit
|
|
@rdblue could you follow the usual convention in the pr title ( |
|
Seems like an ok workaround to me; we really should spend some time looking at removing some of those locks and avoiding |
these should all be debug
These are the only signal we have that the allocation loop is getting signalled too much. I think it's worth an info message so we can identify other cases that are causing this behavior. The normal case where the thread already slept for more than the min interval is debug. This doesn't add an unreasonable number of log messages.
Most users don't care how long the internal YARN allocator loop slept for. If these logs are valuable to you, then I would just set the ApplicationMaster log level to DEBUG in your environment. Otherwise we'll end up cluttering the logs too much. (The old code also uses logDebug here.)
|
@rdblue the reason for the hang is the |
|
I think this is important to fix for 2.0 but I personally found the changes in this patch rather confusing. If there's a simpler workaround we could do (such as the solution I suggested, if that works) then I would prefer that. |
|
So why don't we just take out the notifyAll call when we get a GetExecutorLossReason? I guess you might end up with the same situation on the RequestExecutors, but generally there I expect we request a bunch at a time and the allocation manager runs on a timer, so those shouldn't happen that frequently. |
If that helps, it's ok too. It would probably slightly increase the time it takes for the driver to learn why an executor failed, which can make some tasks take longer to be rescheduled; but task failures because of executor loss aren't normal to start with, so it should be ok. |
|
@andrewor14, I think we should consider two problems here: the fact that the thread will sleep for less than the min interval if something triggers it and whatever is currently triggering it. We should certainly fix the loss reason request that is currently triggering this behavior, but I still think that this patch is a good solution to the first problem in case there are other situations that cause it as well. There's not a good reason to sleep for less than the min interval if it can cause the application to become unstable. We could look at a more complicated strategy -- like an exponentially increasing min interval up to the current min -- but the important thing right now is to ensure nothing can cause this instability. To be clear, I don't consider this a complete fix for both of those problems. We should definitely avoid the |
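For illustration only, the "exponentially increasing min interval" idea mentioned above could look roughly like the Java sketch below. Nothing here is from the patch: the class name `EscalatingMinInterval`, its methods, and the 25 ms starting value are all hypothetical; only the 200 ms cap echoes the figure quoted later in this thread.

```java
/**
 * Hypothetical helper (not part of Spark): a minimum sleep that doubles each
 * time the allocation thread is woken early, capped at maxMs, and that resets
 * once the thread manages a full, uninterrupted sleep.
 */
final class EscalatingMinInterval {
    private final long initialMs;
    private final long maxMs;
    private long currentMs;

    EscalatingMinInterval(long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("require 0 < initialMs <= maxMs");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.currentMs = initialMs;
    }

    /** Call after an early wakeup; returns the floor to enforce for this pass. */
    synchronized long onEarlyWakeup() {
        long floor = currentMs;
        // Double the floor for next time without overflowing or passing the cap.
        currentMs = (currentMs > maxMs / 2) ? maxMs : currentMs * 2;
        return floor;
    }

    /** Call after the thread slept its full interval; relaxes the floor again. */
    synchronized void onFullSleep() {
        currentMs = initialMs;
    }
}
```

With an initial value of 25 ms and a cap of 200 ms, repeated early wakeups would yield floors of 25, 50, 100, and then 200 ms, so an isolated notification costs little while a burst of them is throttled quickly.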
|
We should probably decouple the task scheduling and the executor lost reason eventually, but that is a separate issue. The only time I would see removing the notifyAll as a problem is if they increase the heartbeat timeout to a very large number, but it would have to be close to the RPC timeout, which they just shouldn't do. Otherwise, a couple of extra seconds to reschedule the tasks in this failure case, which is not the norm, shouldn't be a problem, and as soon as one happens, it goes down to the 200ms that this patch is suggesting anyway. @rdblue does removing the notifyAll call solve your problem as well? That seems like a much cleaner approach than notifying but then sleeping some time again. |
|
@tgravescs, removing Even if we were to fix the |
|
Ok, I'm fine with this as a workaround for now since you don't really know and this will ensure it, but please clean up the code so that it's clear which sleep is which, and add a nice comment stating why we are doing this. Then I think we should file another JIRA to investigate a more proper fix for this. We shouldn't have to wait for reason to schedule, |
|
@rdblue can you address the comments? I would like to get this into the 2.0 rc1 if possible. |
|
ping @rdblue do you have time to update this? |
fde631b to e022408
|
@tgravescs, I've updated it. Sorry about the delay; for some reason the notifications for this issue didn't make it to my inbox, so I wasn't seeing updates. |
|
Test build #61117 has finished for PR 13482 at commit
|
|
+1 |
…val.

## What changes were proposed in this pull request?

Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents overloading the `YarnAllocator` that is happening because the thread is triggered when an executor is killed and its connections die. In YARN, this prevents the app from overloading the allocator and becoming unstable.

## How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue for the most common cause.

Author: Ryan Blue <blue@apache.org>

Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.

(cherry picked from commit a410814)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
|
@tgravescs, thanks for reviewing! Sorry about the delay! |
## What changes were proposed in this pull request?

Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents overloading the `YarnAllocator` that is happening because the thread is triggered when an executor is killed and its connections die. In YARN, this prevents the app from overloading the allocator and becoming unstable.

## How was this patch tested?

Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue for the most common cause.
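
The behavior described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the code from the patch: the class `MinIntervalSleep`, its method names, and its parameters are hypothetical, and the real `ApplicationMaster` is written in Scala. The sketch waits on a lock for up to the heartbeat interval, and if a `notifyAll()` wakes the thread early, it sleeps out whatever remains of the minimum interval before the caller would go on to allocate.

```java
/** Hypothetical sketch (not Spark code) of enforcing a minimum pause per loop pass. */
final class MinIntervalSleep {

    /** Milliseconds still owed so that the total pause reaches minIntervalMs. */
    static long remainingSleepMs(long minIntervalMs, long elapsedMs) {
        return Math.max(0L, minIntervalMs - elapsedMs);
    }

    /**
     * Waits on lock for up to heartbeatMs (must be > 0, since wait(0) blocks
     * indefinitely). If the wait ends early, sleeps for the remainder of
     * minIntervalMs. Returns the total pause in milliseconds.
     */
    static long pauseBeforeAllocate(Object lock, long heartbeatMs, long minIntervalMs)
            throws InterruptedException {
        long start = System.nanoTime();
        synchronized (lock) {
            lock.wait(heartbeatMs); // may return early on notifyAll() or spuriously
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        long toSleep = remainingSleepMs(minIntervalMs, elapsedMs);
        if (toSleep > 0) {
            Thread.sleep(toSleep); // top up to the minimum interval
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        // Example values only: a 3 s heartbeat with a 200 ms minimum interval.
        long paused = pauseBeforeAllocate(new Object(), 3000L, 200L);
        System.out.println("paused for " + paused + " ms");
    }
}
```

The point of the design is that a notification can still shorten the wait from the full heartbeat interval, but never below the minimum interval, so a burst of notifications cannot turn the loop into a tight spin on the allocator.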