[SPARK-11334][Core] Handle maximum task failure situation in dynamic allocation #11205
Conversation
Test build #51298 has finished for PR 11205 at commit
Just to add a link here to the previous PR #9288.
// Number of tasks currently running on the cluster. Should be 0 when no stages are active.
private var numRunningTasks: Int = _
private val executorIdToStageAndNumTasks =
  new mutable.HashMap[String, mutable.HashMap[Int, Int]]
This is a very complicated data structure. Why do we need to keep track of which stages each executor is running tasks for? Is there a simpler way to fix this?
@jerryshao I took a look at this and it looks overly complicated. It seems that the problem is that we sometimes end up with a negative numRunningTasks.
Hi @andrewor14, thanks a lot for your comments. The reason I introduced another data structure to track each executor's stage and task counts was mentioned before; I'll paste it here again:
According to my tests, TaskEnd events may not be delivered in the expected numbers, which can leave an executor marked as busy forever. So compared to the old implementation, I changed the code to clean up the related task counts when a stage completes. That's why I introduced the more complicated data structure.
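A minimal sketch of the kind of bookkeeping described above (illustrative only, with hypothetical names; this is not Spark's actual ExecutorAllocationManager code): per-executor, per-stage running-task counts are incremented on task start, decremented on task end, and wiped for a stage when that stage completes, so a lost TaskEnd event cannot keep an executor marked busy forever.

```scala
import scala.collection.mutable

// Sketch only: keep a per-executor, per-stage count of running tasks and clear a
// stage's counts when the stage completes, so a missing TaskEnd event cannot pin
// an executor as busy indefinitely.
class PerStageTaskTracker {
  // executorId -> (stageId -> number of running tasks from that stage)
  private val executorIdToStageAndNumTasks =
    new mutable.HashMap[String, mutable.HashMap[Int, Int]]

  def onTaskStart(executorId: String, stageId: Int): Unit = synchronized {
    val stageToTasks = executorIdToStageAndNumTasks
      .getOrElseUpdate(executorId, new mutable.HashMap[Int, Int])
    stageToTasks(stageId) = stageToTasks.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(executorId: String, stageId: Int): Unit = synchronized {
    executorIdToStageAndNumTasks.get(executorId).foreach { stageToTasks =>
      stageToTasks.get(stageId).foreach { n =>
        if (n <= 1) stageToTasks.remove(stageId) else stageToTasks(stageId) = n - 1
      }
      if (stageToTasks.isEmpty) executorIdToStageAndNumTasks.remove(executorId)
    }
  }

  // On stage completion, drop that stage's counts from every executor, even if
  // some TaskEnd events for the stage were never delivered.
  def onStageCompleted(stageId: Int): Unit = synchronized {
    executorIdToStageAndNumTasks.values.foreach(_.remove(stageId))
    executorIdToStageAndNumTasks.retain { case (_, stages) => stages.nonEmpty }
  }

  def isExecutorBusy(executorId: String): Boolean = synchronized {
    executorIdToStageAndNumTasks.get(executorId).exists(_.values.sum > 0)
  }
}
```

The trade-off raised by the reviewer still applies: the nested map is more bookkeeping than a single counter, but it lets the listener forget a stage wholesale on completion instead of relying on exactly one TaskEnd event per task.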
Sorry, somehow I missed this go by; I haven't looked at the code changes in detail yet. The TaskEnd event should be sent all the time now, we fixed that bug a while back. Or is it because it's out of order? Can you describe in more detail the exact issue and how this change fixes it?
I am seeing this issue quite frequently. Not sure what is causing it, but frequently we will get an onTaskEnd event after a stage has ended. This causes numRunningTasks to become negative. If the executor count is then updated, the number of required executors (maxNumExecutorsNeeded) becomes negative as well, which causes problems in executor allocation and deallocation. In the best case you get executors that cannot be deallocated, and over time Spark stops allocating new executors even when there are tasks pending.
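To make that failure mode concrete, here is a deliberately simplified sketch (hypothetical names such as maxNumExecutorsNeeded mirror the discussion; this is not Spark's actual listener code) of how a late onTaskEnd arriving after its stage has completed can push the running-task counter below zero and poison the derived executor target; clamping the count at zero when computing the target is one defensive option.

```scala
// Simplified reproduction of the symptom described above, not Spark's code.
object NegativeCountSketch {
  private var numRunningTasks = 0
  private var numPendingTasks = 0
  private val tasksPerExecutor = 4

  def onTaskStart(): Unit = { numRunningTasks += 1 }

  // The running count is reset once no stages are active ...
  def onStageCompleted(): Unit = { numRunningTasks = 0 }

  // ... so a TaskEnd that arrives *after* the stage completed drives it negative.
  def onTaskEnd(): Unit = { numRunningTasks -= 1 }

  // A negative running count makes the target executor count negative too, which
  // stalls allocation. Clamping at zero when deriving the target is one guard.
  def maxNumExecutorsNeeded(): Int = {
    val running = math.max(numRunningTasks, 0)
    (running + numPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }
}
```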
I can confirm that removing speculation and setting the maximum task failures to 1 eliminates this problem. Will try the patch and confirm.
Gentle ping @rustagi, have you had some time to confirm this patch? It sounds like the only thing we need here is that confirmation.
Gentle ping @rustagi.
What's the status of this PR? @jerryshao
I guess the issue still exists; let me verify it again, and if it does I will bring the PR up to date with the latest code. Thanks!
Sorry, I haven't been able to confirm this patch because I have not seen the issue in production for quite some time.
This PR is pretty old and a lot has changed since then, but it looks like this can be fixed now by just fixing the code to look at Also, doesn't
Branch updated: 966eb89 to 59f9c15.
@vanzin, in the current code Yes
Verified again; it looks like the 2nd bullet is not valid anymore. I cannot reproduce it on the latest master branch, so this might have already been fixed in SPARK-13054. Only the first issue still exists, and I think @sitalkedia's PR is enough to handle it, so I'm going to close this one. @sitalkedia, would you please reopen your PR? Sorry to bring in noise.
Test build #83067 has finished for PR 11205 at commit
@jerryshao I think the 2nd bullet has not been fixed in SPARK-13054. I use Spark 2.1.1, and I still find that finished tasks remain tracked, even though numRunningTasks equals 0, since:
Currently there are two problems in dynamic allocation when the maximum task failure count is reached:
1. The number of running tasks can become negative, which breaks the calculation of the number of executors needed.
2. TaskEnd events may never be delivered, which makes the related executor always appear busy.

This patch tries to fix these two issues. Please review, thanks a lot.
CC @andrewor14 and @tgravescs.