
[SPARK-26634]Do not allow task of FetchFailureStage commit in OutputCommitCoordinator #23563

Closed

Conversation

@liupc commented Jan 16, 2019

What changes were proposed in this pull request?

What's the problem?

canCommit of OutputCommitCoordinator allows a task from a stage attempt that hit a FetchFailure to commit, which results in TaskCommitDenied for the task of the retry stage that works on the same partition. Because TaskCommitDenied does not count towards task failures, the scheduler keeps rescheduling that task, it keeps getting TaskCommitDenied, and the application hangs forever.
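
To make that concrete, here is a minimal, hypothetical sketch (in Scala, not the actual Spark source) of the first-committer-wins rule the coordinator applies: the first task attempt that asks to commit a partition is authorized, and every later attempt for that partition is denied.

    import scala.collection.mutable

    // Simplified stand-in for Spark's TaskIdentifier.
    case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)

    // Hypothetical, simplified commit coordinator: first committer wins per partition.
    class SimpleCommitCoordinator {
      // partition -> the attempt that is authorized to commit it
      private val authorizedCommitters = mutable.Map.empty[Int, TaskIdentifier]

      def canCommit(stageAttempt: Int, partition: Int, taskAttempt: Int): Boolean = synchronized {
        authorizedCommitters.get(partition) match {
          case None =>
            // No committer yet: authorize this attempt, even if it belongs to an
            // older stage attempt that already hit a FetchFailure (the problematic case).
            authorizedCommitters(partition) = TaskIdentifier(stageAttempt, taskAttempt)
            true
          case Some(_) =>
            // Partition already claimed: every other attempt gets TaskCommitDenied.
            false
        }
      }
    }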

How does it happen?

A detailed explanation of this:
Let's say we have:
stage 0.0 (stage id 0, attempt 0)

  • task 1.0 (task 1, attempt 0)

Stage 0.1 (stage id 0, attempt 1), started due to a fetch failure, for instance:

  • task 1.0 (task 1, attempt 0), equivalent to task 1.0 in stage 0.0
  1. Task 1.0 in stage 0.0 completes successfully after the launch of stage 0.1, so it acquires the commit lock for partition 1 (because no authorized committer exists yet for partition 1 and the attempt has not failed):

    case Some(state) if attemptFailed(state, stageAttempt, partition, attemptNumber) =>
        ...
    val existing = state.authorizedCommitters(partition)
        ...
    state.authorizedCommitters(partition) = TaskIdentifier(stageAttempt, attemptNumber)

  2. Task 1.0 in stage 0.1 completes but cannot get the commit lock (it is already held by task 1.0 in stage 0.0):

    val existing = state.authorizedCommitters(partition)
        ...
    logDebug(s"Commit denied for stage=$stage.$stageAttempt, partition=$partition: " +

  3. Because TaskCommitDenied does not count towards task failures, TaskSetManager.handleFailedTask will not abort the stage despite the consecutive failures of task 1.x for partition 1 in stage 0.1:

    override def countTowardsTaskFailures: Boolean = false

    if (!isZombie && reason.countTowardsTaskFailures) {

  4. Task 1 will be re-added to pendingTasks and the scheduler will schedule task 1.1 later.

  5. Task 1.1 in stage 0.1 completes and also cannot get the commit lock, and so on back and forth (a minimal simulation of this loop is sketched below).
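
Using the simplified coordinator sketched above, the resulting loop looks like this (REPL-style, hypothetical values):

    val coordinator = new SimpleCommitCoordinator

    // Task 1.0 of stage attempt 0 finishes first and grabs the commit lock for partition 1.
    coordinator.canCommit(stageAttempt = 0, partition = 1, taskAttempt = 0)  // true

    // Every retry of task 1 in stage attempt 1 is denied; since TaskCommitDenied does
    // not count towards task failures, the scheduler just launches the next attempt.
    (0 to 3).foreach { taskAttempt =>
      val allowed = coordinator.canCommit(stageAttempt = 1, partition = 1, taskAttempt = taskAttempt)
      println(s"stage 0.1, task 1.$taskAttempt allowed to commit: $allowed")  // always false
    }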

Logs:

2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456, partition 138, PROCESS_LOCAL, 5829 bytes)
2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
 2019-01-09,08:45:56,284 INFO org.apache.spark.scheduler.OutputCommitCoordinator: Denying attemptNumber=1 to commit for stage=5, partition=138; existingCommitter = 0
2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, attemptNumber: 1
2019-01-09,08:45:57,373 INFO org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied committing, stage: 5, partition: 138, attempt number: 0, attempt number(counting failed stage): 1

How does this PR fix it?

This PR forbids tasks from a failed (fetch-failed) stage attempt from committing once a new attempt of that stage has started, and thus solves the problem.
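
A rough, hypothetical sketch of that idea (the name latestStageAttempt follows the diff in this PR, but this is not the exact patch; TaskIdentifier is reused from the sketch above): the coordinator remembers the newest stage attempt it has seen and refuses commits from older attempts.

    import scala.collection.mutable

    class AttemptAwareCommitCoordinator {
      private val authorizedCommitters = mutable.Map.empty[Int, TaskIdentifier]
      private var latestStageAttempt: Int = 0

      // Called when a (re)attempt of the stage starts; only ever moves forward.
      def stageStart(stageAttemptNumber: Int): Unit = synchronized {
        latestStageAttempt = math.max(latestStageAttempt, stageAttemptNumber)
      }

      def canCommit(stageAttempt: Int, partition: Int, taskAttempt: Int): Boolean = synchronized {
        if (stageAttempt < latestStageAttempt) {
          // The task belongs to an older, fetch-failed stage attempt: deny the commit
          // so it cannot lock the partition away from the retry stage.
          false
        } else if (!authorizedCommitters.contains(partition)) {
          authorizedCommitters(partition) = TaskIdentifier(stageAttempt, taskAttempt)
          true
        } else {
          false
        }
      }
    }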

How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@HyukjinKwon
Member

@liupc, this is pretty core logic. It's not required, but I would suggest elaborating on it with examples and pointers to the code so that reviewers can take a look easily. One example can be ... #21815, which has a similar amount of change.

@liupc
Author

liupc commented Jan 17, 2019

@HyukjinKwon Ok, thanks, I will try to explain it with examples in the description.

Contributor

@vanzin left a comment


I think this makes sense, but in this code it's good to get more eyes. @squito @tgravescs @cloud-fan

 * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
 * the maximum possible value of `context.partitionId`).
 */
-private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
+private[scheduler] def stageStart(
+    stage: Int, stageAttemptNumber: Int, maxPartitionId: Int): Unit = synchronized {
Contributor


nit: one parameter per line, double indented

@vanzin
Contributor

vanzin commented Feb 13, 2019

ok to test

stageStates.get(stage) match {
  case Some(state) =>
    require(state.authorizedCommitters.length == maxPartitionId + 1)
    state.latestStageAttempt = stageAttemptNumber
    logInfo(s"Reusing state from previous attempt of stage $stage.")

  case _ =>
Member


It is better to assign stageAttemptNumber to latestStageAttempt of the newly created StageState too.
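
In other words, something along these lines (a hypothetical fragment applied to the diff above, assuming latestStageAttempt is a var on StageState as shown there):

    case _ =>
      val newState = new StageState(maxPartitionId + 1)
      newState.latestStageAttempt = stageAttemptNumber  // set on creation, not only on reuse
      stageStates(stage) = newState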

stageStates.get(stage) match {
  case Some(state) =>
    require(state.authorizedCommitters.length == maxPartitionId + 1)
    state.latestStageAttempt = stageAttemptNumber
Member


Don't we need to check whether the current latestStageAttempt is less than stageAttemptNumber?
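
That concern could be addressed with a monotonic update, e.g. (hypothetical helper, not part of the patch):

    // Only move the recorded attempt forward, so a stale stageStart from an older
    // attempt cannot roll latestStageAttempt back.
    def latestOf(current: Int, incoming: Int): Int = math.max(current, incoming)

    // state.latestStageAttempt = latestOf(state.latestStageAttempt, stageAttemptNumber)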

@viirya
Member

viirya commented Feb 14, 2019

  1. Task 1.0 in stage 0.0 completes successfully after the launch of stage 0.1, so it acquires the commit lock for partition 1 (because no authorized committer exists yet for partition 1 and the attempt has not failed).

Hmm, doesn't task 1.0 in stage 0.0 hit the FetchFailure? Or can it be successfully completed after the FetchFailure? When does stage 0.1 get launched? After the FetchFailure of stage 0.0 but before it is completed?

@vanzin
Contributor

vanzin commented Feb 14, 2019

Hmm, doesn't task 1.0 in stage 0.0 hit the FetchFailure?

No, some other task fails and causes the stage to be recomputed. That's what causes the scenario where task 1.0 commits after the new stage attempt starts.

@viirya
Member

viirya commented Feb 14, 2019

Oh, I see. Thanks @vanzin.

Won't task 1.0 in stage 0.0 release the commit lock later? Will it hold the lock forever?

@vanzin
Contributor

vanzin commented Feb 14, 2019

No, once you commit, that's it. The partition stays "locked" until the stage finishes.
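
A tiny illustration of that lifecycle (hypothetical names, not Spark's API): the set of committed partitions is only cleared when the whole stage ends.

    import scala.collection.mutable

    class PartitionLocks {
      private val committed = mutable.Set.empty[Int]

      // True only for the first commit of a partition; later calls find it locked.
      def tryCommit(partition: Int): Boolean = synchronized {
        committed.add(partition)
      }

      // The only point at which the "locks" are released.
      def stageEnd(): Unit = synchronized {
        committed.clear()
      }
    }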

@SparkQA

SparkQA commented Feb 14, 2019

Test build #102310 has finished for PR 23563 at commit 5967f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

  4. Task 1 will be re-added to pendingTasks and the scheduler will schedule task 1.1 later.

Partition 1 of stage 0 is marked as completed, so why does the scheduler need to schedule this task?

@liupc
Author

liupc commented Feb 14, 2019

@cloud-fan @vanzin @viirya @HyukjinKwon
Sorry, I think this issue is already resolved by #21131, which allows markPartitionCompletedInAllTaskSets. That is a better way to fix the issue we encountered in Spark 2.1, as it improves performance by letting the partition be marked as successful by the first completed attempt.

@vanzin
Contributor

vanzin commented Feb 14, 2019

Partition 1 of stage 0 is marked as completed, so why does the scheduler need to schedule this task?

Because it completes after the next stage attempt starts, the scheduler creates new tasks for all the tasks that had not finished in the previous attempt up to that point.

Anyway, if Imran's fix covers this case, that's good, no need to make this more complicated.

@squito
Contributor

squito commented Feb 14, 2019

btw that fix also had a bug, see followup in #22806 (which unfortunately still needs to be figured out). But that is needed for more than OutputCommitters, so that issue should still cover this.

@vanzin
Contributor

vanzin commented Mar 15, 2019

Based on the above I'm closing this.

@vanzin vanzin closed this Mar 15, 2019