[SPARK-11334] numRunningTasks can't be less than 0, or it will affect executor allocation #9288
Conversation
Test build #44397 has finished for PR 9288 at commit
IMHO, would it be better to fix this unexpected ordering of events? From my understanding,
```diff
@@ -615,7 +615,11 @@ private[spark] class ExecutorAllocationManager(
       val taskIndex = taskEnd.taskInfo.index
       val stageId = taskEnd.stageId
       allocationManager.synchronized {
-        numRunningTasks -= 1
+        if (numRunningTasks > 0) {
```
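To make the guard concrete, here is a minimal, self-contained sketch (not Spark's actual `ExecutorAllocationManager` code) of a counter that is clamped at zero, so a task-end event arriving after the stage has been torn down cannot drive it negative:

```scala
// Minimal sketch, NOT Spark's real ExecutorAllocationManager:
// a running-task counter that refuses to go below zero when a
// late (out-of-order) task-end event arrives.
object TaskCounterSketch {
  private var numRunningTasks = 0

  def onTaskStart(): Unit = synchronized { numRunningTasks += 1 }

  def onTaskEnd(): Unit = synchronized {
    // Guard: a late SparkListenerTaskEnd must not drive the count negative.
    if (numRunningTasks > 0) {
      numRunningTasks -= 1
    }
  }

  def running: Int = synchronized { numRunningTasks }
}
```

As the reviewers note below, this is a band-aid rather than a root-cause fix: the counter stays consistent, but the late event itself is silently swallowed.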
Hm, this seems like a band-aid, though. Is this because the task-end event happens in the wrong order relative to the job-end event? Can we catch and guard against that case more directly?
Yeah, I know the root cause is the wrong ordering of events.
Hm, is it better to check the status of the task, and only decrement if it wasn't already dead? (I may not know what I'm talking about here.)
Can we do this by adding the pending-to-kill tasks into a list, and only when all the tasks are marked as finished, then call
```diff
@@ -575,7 +575,7 @@ private[spark] class ExecutorAllocationManager(
     if (stageIdToNumTasks.isEmpty) {
       allocationManager.onSchedulerQueueEmpty()
       if (numRunningTasks != 0) {
-        logWarning("No stages are running, but numRunningTasks != 0")
+        logWarning(s"No stages are running, but numRunningTasks = $numRunningTasks")
```
I would say `... but numRunningTasks ($numRunningTasks) != 0`
It's unclear how the wrong ordering resulted, because
@jerryshao, I tried to implement the code for your suggestion, but many unit tests failed. I think it's too difficult for me. If you can help me, I would be very grateful.
@andrewor14, yeah, I tried to resolve the root cause, but as you said the scheduler is quite complicated, and my implementation caused many unit test failures. So if you're OK with doing a fix on the
@XuTingjun, I'm OK with the current way if fixing the root cause is too complicated. We just need to add more comments explaining why we have to do it this way.
@andrewor14, from my understanding this wrong ordering happens when the driver explicitly aborts the running stages because the maximum number of failures is exceeded. So
I see, thanks for the explanation. @XuTingjun A more robust way to do this may be to keep track of
then you can just remove all the tasks associated with the stage when the stage is completed.
Can you address the comments?
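A hypothetical sketch of that suggestion (the names, such as `StageTaskTracker` and `stageIdToTaskIds`, are illustrative, not Spark's actual fields): track the set of running task IDs per stage, and drop the whole set when the stage completes, so a late task-end event for an already-finished stage is simply ignored:

```scala
import scala.collection.mutable

// Illustrative sketch of per-stage task bookkeeping, NOT Spark's real code.
object StageTaskTracker {
  private val stageIdToTaskIds = mutable.Map[Int, mutable.Set[Long]]()

  def onTaskStart(stageId: Int, taskId: Long): Unit = synchronized {
    stageIdToTaskIds.getOrElseUpdate(stageId, mutable.Set()) += taskId
  }

  def onTaskEnd(stageId: Int, taskId: Long): Unit = synchronized {
    // A task-end for an already-completed stage finds no entry and is a no-op.
    stageIdToTaskIds.get(stageId).foreach(_ -= taskId)
  }

  def onStageCompleted(stageId: Int): Unit = synchronized {
    // Remove all tasks associated with the stage at once, so late
    // events cannot corrupt the running-task count.
    stageIdToTaskIds -= stageId
  }

  def numRunningTasks: Int = synchronized {
    stageIdToTaskIds.values.map(_.size).sum
  }
}
```

With this scheme the derived count can never go negative, because the count is computed from the per-stage sets rather than decremented blindly.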
This seems very similar to what I'm seeing in https://issues.apache.org/jira/browse/SPARK-11701. That one deals with speculation, but it also causes the numbers to be off. This PR is only fixing the executor allocation, though. Is the executor page also showing a negative number of active tasks? I found that with speculation it affects multiple things that do accounting based on the taskEnd event arriving after the stage has finished. If we can't fix the root cause of the out-of-order events, we should check the other places this might affect.
ping @XuTingjun are you still working on this?
Let's close this PR for now since it's been inactive for many months. |
Let me take a crack at this issue :). |
With the Dynamic Allocation feature, when a task fails more than `maxFailure` times, all the dependent jobs, stages, and tasks will be killed or aborted. In this process, the `SparkListenerTaskEnd` event can arrive behind the `SparkListenerStageCompleted` and `SparkListenerJobEnd` events, like the event log below. Because of that, `numRunningTasks` in the `ExecutorAllocationManager` class can become less than 0, and it will affect executor allocation.
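The failure mode described above can be reproduced with a toy event stream (a simplified model mimicking Spark's listener event names, not the real `SparkListener` API): a naive counter that decrements on every task-end goes negative when a `TaskEnd` arrives after the stage has already completed.

```scala
// Toy model of the bug. Event names mimic Spark's listener events,
// but this is a simplified simulation, not the real SparkListener API.
sealed trait Event
case object TaskStart extends Event
case object TaskEnd extends Event
case object StageCompleted extends Event

object NaiveCounter {
  // Replays an event stream and returns the final running-task count.
  def runningAfter(events: Seq[Event]): Int = {
    var numRunningTasks = 0
    events.foreach {
      case TaskStart      => numRunningTasks += 1
      case TaskEnd        => numRunningTasks -= 1 // no guard: can go negative
      case StageCompleted => numRunningTasks = 0  // stage teardown resets the count
    }
    numRunningTasks
  }
}
```

Replaying the out-of-order sequence `TaskStart, StageCompleted, TaskEnd` leaves the counter at -1, which is exactly the inconsistent state that then skews the executor-allocation calculation.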