
[SPARK-13931] Resolve stage hanging up problem in a particular case #11760

Closed

Conversation

GavinGavinNo1

What changes were proposed in this pull request?

When executorLost is invoked on a TaskSetManager, it is important to first check whether the variable isZombie is set to true.

This pull request fixes the following hang:

  1. Enable speculation in the application.
  2. Run the app and suppose the last task of shuffleMapStage 1 finishes. To be precise: from the DAGScheduler's point of view the stage is finished, and in the TaskSetManager the variable isZombie is set to true, but runningTasksSet is not empty because of speculative copies.
  3. Executor 3 is then lost. On receiving this signal, the TaskScheduler invokes executorLost on every TaskSetManager in the rootPool, and the DAGScheduler removes all of that executor's outputLocs.
  4. The TaskSetManager adds all of that executor's tasks back to pendingTasks and tells the DAGScheduler they will be resubmitted (note: possibly not promptly).
  5. The DAGScheduler starts to submit a new waiting stage, say shuffleMapStage 2, and finds that shuffleMapStage 1 is a missing parent because some outputLocs were removed due to the executor loss. The DAGScheduler therefore submits shuffleMapStage 1 again.
  6. The DAGScheduler keeps receiving 'Resubmitted' task events from the old TaskSetManager and increments the number of pending tasks of shuffleMapStage 1 each time. However, the old TaskSetManager never launches new tasks, because its isZombie flag is set to true (see the sketch after this list).
  7. As a result, shuffleMapStage 1 never finishes in the DAGScheduler, and neither does any stage that depends on it.
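
The core of the problem is in TaskSetManager.executorLost. The following is a simplified paraphrase of the pre-patch logic (not the exact Spark source; bookkeeping and some argument lists are trimmed for illustration) showing where the Resubmitted signal in step 6 comes from:

```scala
// Simplified paraphrase of the pre-patch TaskSetManager.executorLost; this is NOT the
// exact Spark source (bookkeeping and some arguments are trimmed for illustration).
override def executorLost(execId: String, host: String, reason: ExecutorLossReason) {
  // For a shuffle map stage without an external shuffle service, map output that lived on
  // the lost executor is gone, so previously successful tasks are marked pending again.
  if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled) {
    for ((tid, info) <- taskInfos if info.executorId == execId) {
      val index = info.index
      if (successful(index)) {
        successful(index) = false
        copiesRunning(index) -= 1
        tasksSuccessful -= 1
        addPendingTask(index)
        // The DAGScheduler is told the task was Resubmitted, which increases the stage's
        // pending-task count. A zombie task set never launches new tasks, so that count is
        // never drained and the stage hangs (step 6 above).
        sched.dagScheduler.taskEnded(tasks(index), Resubmitted, null, Seq.empty, info)
      }
    }
  }
  // Tasks still running on the lost executor are failed via handleFailedTask, which informs
  // the DAGScheduler / web UI and lets a zombie task set finish once nothing is running.
  for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
    handleFailedTask(tid, TaskState.FAILED,
      ExecutorLostFailure(execId, exitCausedByApp = false, Some(reason.toString)))
  }
}
```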

How was this patch tested?


It's quite difficult to construct test cases.

@GavinGavinNo1 GavinGavinNo1 changed the title Resolve stage hanging up problem in a particular case [SPARK-13931]Resolve stage hanging up problem in a particular case Mar 16, 2016
@GavinGavinNo1 GavinGavinNo1 changed the title [SPARK-13931]Resolve stage hanging up problem in a particular case [SPARK-13931] Resolve stage hanging up problem in a particular case Mar 16, 2016
@srowen
Member

srowen commented Mar 16, 2016

@mccheah @kayousterhout Do you have an opinion on this? You might be familiar with this method.

@GavinGavinNo1
Author

@mccheah @kayousterhout Could you please take a look at this for me? Thank you in advance.

@@ -776,6 +776,7 @@ private[spark] class TaskSetManager(

   /** Called by TaskScheduler when an executor is lost so we can re-enqueue our tasks */
   override def executorLost(execId: String, host: String, reason: ExecutorLossReason) {
+    if (isZombie) return
Contributor

I'd add a logging line here at least.
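
For example, something along these lines (just a sketch of where a log line could go; the message wording here is made up):

```scala
if (isZombie) {
  // Hypothetical log message, for illustration only.
  logInfo(s"Ignoring executorLost($execId) because TaskSet ${taskSet.id} is already a zombie")
  return
}
```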

Contributor

Is it safe to just return here if the task is a zombie? IIRC we need to mark all of the tasks in the TaskSetManager as either completed or failed at some point (otherwise I think this TaskSetManager will never get cleaned up).

Contributor

I looked at this a little more and I think the right solution is to add "!zombie" to the if-condition below, on line 784 (the reasoning being that, if a task set is a zombie and some shuffle map output was lost, this will be handled when the reduce tasks try to fetch the output, so it's too complicated to bring the task set back from the dead here). That way, the loop on line 799 will still run, so the DAGScheduler will still get told that any speculated tasks running on the lost executor have failed (so the web UI can be updated correctly etc.). Does this seem reasonable?
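
Concretely, the suggestion amounts to something like the following (a sketch of the condition change only; the surrounding code is elided and line numbers refer to the version under review):

```scala
// Skip the shuffle-map-output resubmission block for zombie task sets, but keep the later
// loop (around line 799) that fails tasks still running on the lost executor.
if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled
    && !isZombie) {
  // ... re-add tasks whose map output was lost to pendingTasks and notify the DAGScheduler ...
}
```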

Contributor

Looking a bit more closely, I concur with this idea. We basically don't want to resubmit the task, but we also want to mark the task as failed and update metrics. FWIW the if switch in handleFailedTask() should also prevent this task from erroneously counting against the failed task attempt count that would otherwise possibly cause the stage to fail.

Contributor

Hm... it's slightly tricky, though, because I just noticed handleFailedTask makes calls like sched.dagScheduler.taskEnded and addPendingTask: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L721 - are the effects of these calls what was causing the problem in the first place?

Contributor

My understanding is that the problem is from this line: https://github.com/GavinGavinNo1/spark/blob/resolve-stage-blocked/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L794 which makes the DAGScheduler think the task is still outstanding. handleFailedTask does seem to erroneously call addPendingTask, but then it calls maybeFinishTaskSet, which will see that there are no more running tasks (and that it's a zombie) and then mark it as really finished.
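
For reference, maybeFinishTaskSet is roughly the following (paraphrased, not necessarily the exact source):

```scala
// Once a zombie task set has no running tasks left, it is unregistered from the
// TaskScheduler, so the stray addPendingTask in handleFailedTask does not keep it alive.
private def maybeFinishTaskSet() {
  if (isZombie && runningTasks == 0) {
    sched.taskSetFinished(this)
  }
}
```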

Contributor

SGTM - and I suppose the call to dagScheduler.taskEnded is fine in handleFailedTask, as opposed to the call on L794 which marks the task as Resubmitted.

Author

@kayousterhout I agree with you; it seems reasonable to add "!zombie" to the if-condition below, on line 784.
@mccheah Sorry, I'm not sure I've fully followed your point. Would it be fine to just add "!zombie" to the if-condition on line 784, or is there a better way? Thanks.

@mccheah
Contributor

mccheah commented Mar 24, 2016

Can't we cover this with a unit test in TaskSetManagerSuite?

@kayousterhout
Contributor

+1 on @mccheah's request to write a unit test for this in TaskSetManagerSuite.

Also, can you change the PR description to say something like:

This pull request fixes the following hang:

@GavinGavinNo1
Author

@mccheah @kayousterhout OK, I'll add a logging line. But it's quite difficult to reproduce this case in a test. I could add a test for form's sake, but it wouldn't really exercise the scenario. Do you have any suggestions for me?
It's no problem to change the PR description; I'll update it.

@kayousterhout
Contributor

I think you can write a test that's similar to this one: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala#L639

You can make a test with a one-task task set, and like the above task set, you can set things up so the task set's task gets speculatively executed. Then, after finishing one copy of the task (so the task set will be marked as a zombie), you can call executorLost() on the TaskSetManager, and then check that the DAGScheduler's executorLost function gets called in the right way. It looks like you'll need to do some better mocking of the DAGScheduler than what's currently done in that test; it may be easier to create a DAGScheduler mock than to use the FakeDAGScheduler / FakeTaskScheduler that already exist in that test.
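
A rough sketch of the shape such a test could take is below. The helpers (FakeTaskScheduler, FakeTask.createTaskSet, ManualClock) come from the Spark test code, but the exact constructor arguments, the speculation setup, and the DAGScheduler mocking are only outlined here and would need to be checked against TaskSetManagerSuite:

```scala
// Sketch only; see the caveats above. The speculation setup and the assertions are
// outlined in comments rather than implemented, and assume this lives in TaskSetManagerSuite.
test("executorLost on a zombie task set should not resubmit tasks to the DAGScheduler") {
  sc = new SparkContext("local", "test")
  val sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2"))
  val taskSet = FakeTask.createTaskSet(1)
  val clock = new ManualClock
  val manager = new TaskSetManager(sched, taskSet, 4 /* maxTaskFailures */, clock)

  // 1. Offer exec1 so the single task starts there, then arrange for a speculative copy of
  //    the same task to be launched on exec2 (as in the speculation test linked above).
  // 2. Complete the attempt on exec1 via handleSuccessfulTask; the task set becomes a zombie
  //    while the speculative attempt is still in runningTasksSet.
  // 3. Call manager.executorLost("exec2", "host2", SlaveLost("test")) and assert, against a
  //    mocked DAGScheduler, that the speculative attempt is reported as failed rather than
  //    Resubmitted, and that no task is re-added to the pending list.
}
```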

@GavinGavinNo1
Author

Is the unit test ok? Thanks. @mccheah @kayousterhout

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54450 has finished for PR 11760 at commit a1eb0f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GavinGavinNo1
Author

@kayousterhout Sorry to bother you again. What should I do then?

@kayousterhout
Contributor

Sorry for letting this hang for so long. @GavinGavinNo1 do you have time to work on this, and if so, can you bring it up to date with master? Then I can review it again.

@kayousterhout
Contributor

@GavinGavinNo1 if you don't have time to work on this PR, can you close it?

@GavinGavinNo1
Author

@kayousterhout Sorry, I tried earlier, but the internet connection at my company is poor, and then I forgot. I'll work on it at home today or tomorrow. Thank you for following up on this PR.

@kayousterhout
Contributor

Great, thanks!

@GavinGavinNo1
Author

@kayousterhout I ran into some git conflicts, so I created a new branch and a new pull request; please refer to #16855. I'm closing this pull request for the time being. Thanks!
