[SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks #40393
Conversation
CC @otterc
@Stove-hust Thank you for reporting and the patch. Would you be able to share the driver logs?
Sure (adding some comments): stage 10 and its parent stage 9 are resubmitted. Stage 10's first-attempt tasks complete one after another and call notifyDriverAboutPushCompletion to end stage 10 and mark the finalize task, but because the stage is no longer in runningStages, the stage cannot be marked shuffleMergeFinalized. Timeline:
- Removed TaskSet 10.0, whose tasks have all completed
- notifyDriverAboutPushCompletion for stage 10
- stage 9 finished
- resubmit stage 10
- stage 10 cannot finish
@otterc Hello, is there anything else I should add?
@Stove-hust Haven't had a chance to look at it yet. I'll take a look at it this week.
Thanks
So this is an interesting coincidence: I literally encountered a production job which seems to be hitting this exact same issue :-) Can you create a test case to validate this behavior @Stove-hust? Thanks for working on this fix.
No problem
Force-pushed from c133877 to e7061bf
Added UT
@Stove-hust The main change in
Instead of only testing specifically for the flag - which is subject to change as the implementation evolves - we should also test for behavior here. This is the reproducible test I was using (with some changes) to test approaches for this bug, and it mimics the case I saw in our production reasonably well.
It would be good to adapt/clean it up for your PR, in addition to the existing test, so that the observed bug does not recur. (Good news is, this PR works against it :-) )
Thank you for your advice on the UT I wrote; it was very important to me. I will delete my UT. Thanks again very much.
Force-pushed from e7061bf to 60a5d07
@Stove-hust To clarify - I meant add this as well (after you had a chance to look at it and clean it up if required - this was from my test setup).
Sorry, I misunderstood what you meant. 😂
Technically, 3 :-) You don't need to mark it as written by me! We can include it in your PR, with any changes you make as part of adding it.
Thanks for your answer, I have added all three UTs (including the one you wrote).
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
Mostly looks good - just a few minor nits.
The test failure is unrelated to this PR - once the changes above are made, the re-execution should pass.
Looks good to me!
Merged to master.
I could not cherry-pick this into 3.4 and 3.3 - we should fix those branches as well, IMO.
No problem
Is your Apache JIRA id
@mridulm
What changes were proposed in this pull request?
Copy the logic of handleTaskCompletion in DAGScheduler for processing the last ShuffleMapTask into submitMissingTasks.
Why are the changes needed?
With push-based shuffle enabled and speculative tasks present, a ShuffleMapStage is resubmitted once a FetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the ShuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but those speculative-task success events cannot trigger shuffleMergeFinalized (shuffleBlockPusher.notifyDriverAboutPushCompletion) because the stage has already been removed from runningStages.
When the stage is then resubmitted, the speculative tasks have already registered their map output, so there are no missing tasks to compute, and resubmitting the stage does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false.
AQE then submits the next stages, which depend on this ShuffleMapStage and hit FetchFailed. In getMissingParentStages this stage is marked as missing and is resubmitted, but the downstream stages are only added to waitingStages after this stage finishes, so they are never submitted even after the stage's resubmission has completed.
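The race described above can be illustrated with a small, self-contained toy model (illustrative only — this is Python, not Spark code, and all class and method names here are invented stand-ins for the DAGScheduler behavior described):

```python
# Toy model of the race: merge finalization is only triggered for stages
# still in runningStages, so a resubmitted stage with no missing tasks
# is never finalized unless the resubmit path also checks for it.

class Stage:
    def __init__(self, stage_id, num_tasks):
        self.stage_id = stage_id
        self.num_tasks = num_tasks
        self.completed = set()        # partitions with registered map output
        self.merge_finalized = False

    def missing_tasks(self):
        return self.num_tasks - len(self.completed)

class Scheduler:
    def __init__(self, fix_enabled):
        self.running_stages = set()
        self.fix_enabled = fix_enabled

    def submit(self, stage):
        self.running_stages.add(stage.stage_id)
        # The fix: a resubmitted stage with no missing tasks must still
        # schedule merge finalization, mirroring handleTaskCompletion.
        if self.fix_enabled and stage.missing_tasks() == 0:
            stage.merge_finalized = True
            self.running_stages.discard(stage.stage_id)

    def on_task_success(self, stage, partition):
        stage.completed.add(partition)
        # Finalization only fires for stages still in runningStages.
        if stage.missing_tasks() == 0 and stage.stage_id in self.running_stages:
            stage.merge_finalized = True
            self.running_stages.discard(stage.stage_id)

def run(fix_enabled):
    sched = Scheduler(fix_enabled)
    stage10 = Stage(10, num_tasks=2)
    sched.submit(stage10)
    sched.on_task_success(stage10, 0)
    # FetchFailed: the stage is removed from runningStages for resubmission.
    sched.running_stages.discard(stage10.stage_id)
    # A speculative task succeeds *after* removal: its map output is
    # registered, but finalization is skipped (stage not running).
    sched.on_task_success(stage10, 1)
    # Resubmission finds no missing tasks to compute.
    sched.submit(stage10)
    return stage10.merge_finalized

print(run(fix_enabled=False))  # False: stage stuck, never merge-finalized
print(run(fix_enabled=True))   # True: resubmit path finalizes the merge
```

Without the fix the stage ends with merge_finalized still false, matching the stuck stage 10 in the timeline above; with the fix, the no-missing-tasks check in the submit path finalizes it.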
Does this PR introduce any user-facing change?
No
How was this patch tested?
This extreme case is very difficult to construct, so we added logs to our production environment to capture occurrences of the problem and to verify the stability of the job. I am happy to provide a timeline of the various events in which the problem arose.