
[SPARK-11334][Core] Handle maximum task failure situation in dynamic allocation #11205

Closed
wants to merge 1 commit into from

Conversation

jerryshao
Contributor

Currently there are two problems in dynamic allocation when the maximum task failure limit is reached:

  1. The number of running tasks can become negative, which throws off the calculation of the number of executors needed.
  2. Executors may never become idle. We currently use the executor-to-tasks mapping to determine an executor's status; when the maximum task failure limit is reached, some TaskEnd events may never be delivered, so the related executor is considered busy forever.

This patch tries to fix these two issues. Please review, thanks a lot.
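To illustrate the first issue, here is a minimal, self-contained sketch (simplified names; not the actual ExecutorAllocationManager code) of how the running-task count can go negative when a TaskEnd arrives after the per-stage counters were reset:

import scala.collection.mutable

// Minimal sketch of the listener bookkeeping, not Spark's real code.
object NegativeRunningTasksSketch {
  private var numRunningTasks: Int = 0
  private val stageIdToNumTasks = new mutable.HashMap[Int, Int]

  def onStageSubmitted(stageId: Int, numTasks: Int): Unit = stageIdToNumTasks(stageId) = numTasks
  def onTaskStart(stageId: Int): Unit = numRunningTasks += 1
  def onTaskEnd(stageId: Int): Unit = numRunningTasks -= 1
  def onStageCompleted(stageId: Int): Unit = {
    stageIdToNumTasks -= stageId
    if (stageIdToNumTasks.isEmpty) numRunningTasks = 0 // reset when no stages are active
  }

  def main(args: Array[String]): Unit = {
    onStageSubmitted(stageId = 0, numTasks = 2)
    onTaskStart(0); onTaskStart(0)
    onTaskEnd(0)
    onStageCompleted(0) // stage aborted after max task failures; counter reset to 0
    onTaskEnd(0)        // straggling TaskEnd delivered afterwards
    println(numRunningTasks) // -1: executor sizing now works from a negative count
  }
}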

CC @andrewor14 and @tgravescs .

@SparkQA

SparkQA commented Feb 15, 2016

Test build #51298 has finished for PR 11205 at commit 966eb89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Just to add a link here to the previous PR #9288

// Number of tasks currently running on the cluster. Should be 0 when no stages are active.
private var numRunningTasks: Int = _

// Executor ID -> (stage ID -> number of tasks from that stage running on that executor).
private val executorIdToStageAndNumTasks =
  new mutable.HashMap[String, mutable.HashMap[Int, Int]]
Contributor

This is a very complicated data structure. Why do we need to keep track of which stages each executor is running tasks for? Is there a simpler way to fix this?

@andrewor14
Contributor

@jerryshao I took a look at this and it looks overly complicated. It seems the problem is that we sometimes end up with a negative totalRunningTasks, which leads to undesirable behavior. Can't we fix this by expressing totalRunningTasks in terms of stageIdToTaskIndices.map(_.values.size).sum?
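Roughly, the idea would look like the following sketch (hypothetical stand-in types; the listener's stageIdToTaskIndices is approximately a HashMap[Int, HashSet[Int]], so the exact expression may differ slightly):

import scala.collection.mutable

// Hedged sketch of the suggestion above, with a stand-in for the listener's
// stage -> started-task-indices map.
object DerivedRunningTasksSketch {
  val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
  stageIdToTaskIndices(0) = mutable.HashSet(0, 1, 2)
  stageIdToTaskIndices(1) = mutable.HashSet(7)

  // Derive the total from the per-stage sets instead of maintaining a separate
  // counter that can drift negative.
  def totalRunningTasks: Int = stageIdToTaskIndices.values.map(_.size).sum // 4 here
}

(Note that, as discussed further down the thread, this set is not trimmed when tasks finish, so it actually counts started tasks rather than running ones.)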

@jerryshao
Contributor Author

Hi @andrewor14, thanks a lot for your comments.

The reason I introduced another data structure to track each executor's stages and task counts was mentioned above; I'll paste it here again:

Executors may never become idle. We currently use the executor-to-tasks mapping to determine an executor's status; when the maximum task failure limit is reached, some TaskEnd events may never be delivered, so the related executor is considered busy forever.

In my testing, fewer TaskEnd events may be delivered than expected, which means the executor is never released. So compared to the old implementation, I changed the code to clean up the related task counts when the stage completes. That's why I introduced a more complicated data structure.
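To make the cleanup idea concrete, here is a rough, self-contained sketch (simplified names and logic; it is not the actual patch):

import scala.collection.mutable

// Track, per executor, how many tasks of each stage are running, and drop a stage's
// counts for every executor when that stage completes, so missed TaskEnd events
// cannot leave an executor marked busy forever.
object StageScopedCleanupSketch {
  private val executorIdToStageAndNumTasks =
    new mutable.HashMap[String, mutable.HashMap[Int, Int]]

  def onTaskStart(execId: String, stageId: Int): Unit = {
    val perStage = executorIdToStageAndNumTasks
      .getOrElseUpdate(execId, new mutable.HashMap[Int, Int])
    perStage(stageId) = perStage.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(execId: String, stageId: Int): Unit =
    executorIdToStageAndNumTasks.get(execId).foreach { perStage =>
      perStage(stageId) = perStage.getOrElse(stageId, 1) - 1
      if (perStage(stageId) <= 0) perStage -= stageId
      if (perStage.isEmpty) executorIdToStageAndNumTasks -= execId
    }

  // Even if some TaskEnd events are never delivered for an aborted stage,
  // completing the stage wipes its counts from every executor.
  def onStageCompleted(stageId: Int): Unit = {
    executorIdToStageAndNumTasks.values.foreach(_ -= stageId)
    val nowIdle = executorIdToStageAndNumTasks.collect {
      case (execId, perStage) if perStage.isEmpty => execId
    }.toList
    nowIdle.foreach(executorIdToStageAndNumTasks -= _)
  }

  def isExecutorIdle(execId: String): Boolean =
    !executorIdToStageAndNumTasks.contains(execId)
}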

@andrewor14
Contributor

Also cc @vanzin @srowen

@tgravescs
Contributor

Sorry, somehow I missed this go by; I haven't looked at the code changes in detail yet. The TaskEnd event should always be sent now; we fixed that bug a while back. Or is it because it's delivered out of order?

Can you describe in more detail the exact issue and how this change fixes it?

@rustagi

rustagi commented Sep 3, 2016

I am seeing this issue quite frequently. I'm not sure what is causing it, but we frequently get an onTaskEnd event after a stage has ended. This causes numRunningTasks to become negative. When the executor count is then updated, the number of required executors (maxNumExecutorsNeeded) becomes negative, which breaks new executor allocation and deallocation (see the sketch at the end of this comment). In the best case you get executors that cannot be deallocated, and over time Spark stops allocating new executors even when tasks are pending.
There is a simple hacky patch in #9288, and this one is an attempt to correct it with more accountability.
I am seeing this issue so frequently that I am not sure it's possible to run Spark with dynamic allocation successfully for a long duration without fixing it. I'll try the hacky patch and confirm.
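For context, the executor-target arithmetic is roughly of this shape (a simplified sketch, not the exact ExecutorAllocationManager code); it shows how a negative running-task count propagates into a negative target:

// Simplified sketch of the executor-target arithmetic (not the exact Spark code).
object MaxExecutorsSketch {
  def maxNumExecutorsNeeded(totalPendingTasks: Int,
                            totalRunningTasks: Int,
                            tasksPerExecutor: Int): Int = {
    val numRunningOrPendingTasks = totalPendingTasks + totalRunningTasks
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }

  def main(args: Array[String]): Unit = {
    // With a running-task count that has drifted negative, the computed target is
    // negative too, which breaks the allocation/deallocation decisions described above.
    println(maxNumExecutorsNeeded(totalPendingTasks = 0, totalRunningTasks = -8, tasksPerExecutor = 4)) // -1
  }
}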

@rustagi

rustagi commented Sep 6, 2016

I can confirm that disabling speculation and setting the max task failures to 1 eliminates this problem. I will try the patch and confirm.

@HyukjinKwon
Member

HyukjinKwon commented Jun 19, 2017

Gentle ping @rustagi, have you maybe had some time to confirm this patch? It sounds like the only thing we need here is that confirmation.

@HyukjinKwon
Member

gentle ping @rustagi

@jiangxb1987
Contributor

What's the status of this PR? @jerryshao

@jerryshao
Contributor Author

I guess the issue still exists. Let me verify it again; if it still does, I will bring the PR up to date with the latest code. Thanks!

@rustagi

rustagi commented Oct 9, 2017

Sorry, I haven't been able to confirm this patch because I have not seen the issue in production for quite some time.
It was much more persistent with 2.0 than with 2.1.
I'm not sure of the cause.

@vanzin
Contributor

vanzin commented Oct 23, 2017

This PR is pretty old and a lot has changed since then, but it looks like this can now be fixed by just changing the code to look at stageIdToTaskIndices instead of keeping numRunningTasks around? (Or maybe use numRunningTasks as a cache for stageIdToTaskIndices.values.sum.)

Also, doesn't isExecutorIdle take care of the second bullet in your description?

@jerryshao
Contributor Author

@vanzin, in the current code stageIdToTaskIndices cannot be used to track the number of running tasks, because that structure doesn't remove a task's index when the task finishes successfully.

Yes, isExecutorIdle is meant to take care of executor idleness, but the way it identifies whether an executor is idle is not robust enough. In this scenario, when a stage is aborted because of max task failures, some TaskEnd events go missing, so using the number of tasks per executor leaves residual data behind and makes the executor appear busy forever.
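To make that concrete, here is a simplified sketch (not the exact listener code) of how a missed TaskEnd leaves an executor marked busy:

import scala.collection.mutable

// Idleness is judged from the executor -> running-task-IDs map,
// and only a TaskEnd ever removes entries from it.
object IdleTrackingSketch {
  private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

  def onTaskStart(execId: String, taskId: Long): Unit =
    executorIdToTaskIds.getOrElseUpdate(execId, new mutable.HashSet[Long]) += taskId

  def onTaskEnd(execId: String, taskId: Long): Unit =
    executorIdToTaskIds.get(execId).foreach { ids =>
      ids -= taskId
      if (ids.isEmpty) executorIdToTaskIds -= execId
    }

  def isExecutorIdle(execId: String): Boolean = !executorIdToTaskIds.contains(execId)

  def main(args: Array[String]): Unit = {
    onTaskStart("exec-1", taskId = 42L)
    // Stage aborted after max task failures: the TaskEnd for task 42 is never
    // delivered, so exec-1 is reported busy indefinitely and never released.
    println(isExecutorIdle("exec-1")) // false, and it stays false
  }
}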

@jerryshao
Contributor Author

I verified again, and it looks like the 2nd bullet is no longer valid; I cannot reproduce it on the latest master branch. It might have already been fixed by SPARK-13054.

So only the first issue still exists, and I think @sitalkedia's PR is enough to handle it. I'm going to close this one. @sitalkedia, would you please reopen your PR? Sorry for the noise.

@jerryshao jerryshao closed this Oct 26, 2017
@SparkQA

SparkQA commented Oct 26, 2017

Test build #83067 has finished for PR 11205 at commit 59f9c15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@da-liii
Contributor

da-liii commented Apr 26, 2018

@jerryshao I think the 2nd bullet has not been fixed by SPARK-13054.

I use Spark 2.1.1, and I still find that finished tasks remain in executorIdToTaskIds (declared as private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]).

But numRunningTasks equals 0, because of this reset:

if (numRunningTasks != 0) {
  logWarning("No stages are running, but numRunningTasks != 0")
  numRunningTasks = 0
}
