Skip to content

Commit

Permalink
[SPARK-30325][CORE] markPartitionCompleted cause task status inconsis…
Browse files Browse the repository at this point in the history
…tent

### **What changes were proposed in this pull request?**
 Fix task status inconsistent in `executorLost` which caused by `markPartitionCompleted`

### **Why are the changes needed?**
The inconsistent will cause app hung up.
The bugs occurs in the corer case as follows:
1. The stage occurs during stage retry, scheduler will resubmit a new stage with unfinished tasks.
2. Those unfinished tasks in origin stage finished and the same task on the new retry stage hasn't finished, it will mark the task partition on the current retry stage as succesuful in TSM `successful` array variable.
3. The executor crashed when it is running tasks which have succeeded by origin stage, it cause TSM run `executorLost` to rescheduler the task on the executor, and it will change the partition's running status in `copiesRunning` twice to -1.
4. 'dequeueTaskFromList' will use `copiesRunning` equal 0 as reschedule basis when rescheduler tasks, and now it is -1, can't to reschedule, and the app will hung forever.

### **Does this PR introduce any user-facing change?**
No

### **How was this patch tested?**

Closes #26975 from seayoun/fix_stageRetry_executorCrash_cause_problems.

Authored-by: yu <you@example.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  • Loading branch information
yu authored and cloud-fan committed Jan 14, 2020
1 parent e0efd21 commit 4462756
Showing 1 changed file with 4 additions and 1 deletion.
Expand Up @@ -945,7 +945,10 @@ private[spark] class TaskSetManager(
&& !isZombie) {
for ((tid, info) <- taskInfos if info.executorId == execId) {
val index = taskInfos(tid).index
if (successful(index) && !killedByOtherAttempt.contains(tid)) {
// We may have a running task whose partition has been marked as successful,
// this partition has another task completed in another stage attempt.
// We treat it as a running task and will call handleFailedTask later.
if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid)) {
successful(index) = false
copiesRunning(index) -= 1
tasksSuccessful -= 1
Expand Down

0 comments on commit 4462756

Please sign in to comment.