
[SPARK-46052][CORE] Remove function TaskScheduler.killAllTaskAttempts #43954

Closed
wants to merge 20 commits

Conversation

@Ngone51 (Member) commented Nov 22, 2023

What changes were proposed in this pull request?

This PR removes the interface TaskScheduler.killAllTaskAttempts and its implementations, and replaces it with TaskScheduler.cancelTasks. This PR also removes "abort stage" from TaskScheduler.cancelTasks and moves it to after the call of TaskScheduler.cancelTasks, behind a control flag spark.legacy.scheduler.stage.abortAfterCancelTasks (true by default to keep the same behaviour for now), because "abort stage" is not necessary while canceling tasks; see the comment at #43954 (comment).

Besides, this PR fixes a bug where a stage attempt could potentially launch new tasks after all of its tasks had been killed. The fix is to mark the stage attempt as zombie (i.e., suspend()) after the killing.
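
For illustration, here is a rough Scala sketch of the consolidated cancelTasks behaviour described above (a simplified sketch based on the snippets discussed in the review below; identifiers such as taskSetsByStageIdAndAttempt, taskIdToExecutorId and the legacyAbortStageAfterCancelTasks field are illustrative and may not match the merged code exactly):

override def cancelTasks(stageId: Int, interruptThread: Boolean, reason: String): Unit = synchronized {
  // Walk every attempt of the stage and kill its running tasks.
  taskSetsByStageIdAndAttempt.get(stageId).foreach { attempts =>
    attempts.values.foreach { tsm =>
      tsm.runningTasksSet.foreach { tid =>
        val execId = taskIdToExecutorId(tid)
        backend.killTask(tid, execId, interruptThread, s"Stage cancelled: $reason")
      }
      // The bug fix: mark the attempt as zombie so it cannot launch new tasks.
      tsm.suspend()
      // Legacy behaviour, kept behind the control flag (true by default).
      if (legacyAbortStageAfterCancelTasks) {
        tsm.abort(s"Stage $stageId cancelled: $reason")
      }
    }
  }
}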

Why are the changes needed?

Spark has two functions to kill all tasks in a Stage:

  • cancelTasks: kills all the running tasks in all the stage attempts and also aborts all the stage attempts
  • killAllTaskAttempts: only kills all the running tasks in all the stage attempts but does not abort the attempts.

However, there is no use case in Spark where a stage would launch new tasks after all of its tasks have been killed. So I think we can replace killAllTaskAttempts with cancelTasks directly.

Does this PR introduce any user-facing change?

No. TaskScheduler is internal.

How was this patch tested?

Pass existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the CORE label Nov 22, 2023
val reason = s"Task $task from barrier stage $failedStage (${failedStage.name}) " +
"failed."
val job = jobIdToActiveJob.get(failedStage.firstJobId)
val shouldInterrupt = job.exists(j => shouldInterruptTaskThread(j))
taskScheduler.killAllTaskAttempts(stageId, shouldInterrupt, reason)
Contributor

I think that there is a subtle difference here: killAllTaskAttempts only kills tasks, whereas cancelTasks also calls tsm.abort() on the stage attempts, which might enqueue a new taskSetFailed event for each task set and I think that could have unintended side effects. Can you double-check whether we think that change is okay?

Member Author

This is really a good point. taskSetFailed will abort the stage and in turn fail the whole job, which is not the intended behaviour here. The problem with killAllTaskAttempts is that it doesn't mark the TaskSetManager as zombie after killing all the tasks, so the TaskSetManager could still launch new tasks (by retry), which is not expected.

But I'm also wondering whether we really want to abort the stages in cancelTasks. cancelTasks is currently called only inside cancelRunningIndependentStages, and cancelRunningIndependentStages is directly or indirectly called in 3 cases:

  • When a job successfully finishes: in this case, we expect all the stages in the job to release their computation resources (i.e., kill all the tasks via cleanupStateForJobAndIndependentStages) immediately. But we don't expect this "release" action to lead to stage abortion and in turn fail the job in the end. Today it doesn't fail the already-succeeded job because the succeeded job has been cleaned up (it no longer exists in the activeJobs list) by the time the taskSetFailed event arrives.

  • When a job is requested to be cancelled: this case is essentially the same as the one above, except that the job finishes in a different state.

  • When a stage aborts: in this case, we expect all the active jobs which depend on this stage to be canceled. Thus, we need to call cancelRunningIndependentStages on each active job. This eventually falls back to the first case, as the active job will be cleaned up (via cleanupStateForJobAndIndependentStages) before the taskSetFailed event arrives.

Contributor

I agree. The worst case is we manually trigger "abort stage" in the existing callers of cancelTasks to keep the old behavior.

Member Author

Updated. I removed the "abort stage" inside cancelTasks and moved it to after the callers of cancelTasks. I also added a new conf spark.scheduler.stage.abortStageAfterCancelTasks to control whether we should abort the stage after cancelTasks. By default, we don't abort.

@@ -54,7 +54,7 @@ private[spark] trait TaskScheduler {
// Submit a sequence of tasks to run.
def submitTasks(taskSet: TaskSet): Unit

// Kill all the tasks in a stage and fail the stage and all the jobs that depend on the stage.
// Kill all the tasks in all the stage attempts of the same stage Id
Contributor

Please add a comment about marking all the stage attempts as zombie.

Member Author

I considered mentioning zombie here but gave up in the end, as I think this is only an API and we don't enforce the same functionality for every implementation.

Contributor

Sounds good to me.

val tsm = taskScheduler.taskSetManagerForAttempt(0, 0).get
assert(2 === tsm.runningTasks)

taskScheduler.killAllTaskAttempts(0, false, "test")
Contributor

Shall we update the test case with taskScheduler.cancelTasks?

Member Author

Sounds good.

Member Author

I just realized that we have a test "cancelTasks shall kill all the running tasks and fail the stage" right above, so I think keeping that test should be enough.

Contributor

I see. Thank you.

@@ -296,18 +296,32 @@ private[spark] class TaskSchedulerImpl(
new TaskSetManager(this, taskSet, maxTaskFailures, healthTrackerOpt, clock)
}

// Kill all the tasks in all the stage attempts of the same stage Id. Note stage attempts won't
// be aborted but will be marked as zombie. The stage attempt will be finished and cleaned up
// once all the tasks has been finished. The stage attempt could be aborted after the call of
Contributor

I'm trying to understand the rationale here. What's the key difference between marking as zombie and aborting the stage? Is it for the still-running tasks?

Member Author

def abort(message: String, exception: Option[Throwable] = None): Unit = sched.synchronized {
  sched.dagScheduler.taskSetFailed(taskSet, message, exception)
  isZombie = true
  maybeFinishTaskSet()
}

When there is a call to abort, the TSM must be marked as zombie. So the key difference should come from dagScheduler.taskSetFailed, which essentially cleans up the data related to this stage and fails the jobs that depend on it.

From the TSM's perspective there is no difference between zombie and abort. Tasks in the TSM can still run until they finish (whether killed or succeeded).
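
For comparison, a minimal sketch of what suspend() amounts to (per the summary later in this thread: isZombie = true plus maybeFinishTaskSet(), without the dagScheduler.taskSetFailed call; this is a sketch, not the exact merged code):

def suspend(): Unit = sched.synchronized {
  // Mark as zombie so no new tasks are launched for this task set.
  isZombie = true
  // Clean up the task set if nothing is still running.
  maybeFinishTaskSet()
}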

@@ -1894,24 +1894,8 @@ private[spark] class DAGScheduler(
job.numFinished += 1
// If the whole job has finished, remove it
if (job.numFinished == job.numPartitions) {
markStageAsFinished(resultStage)
Contributor

Why is markStageAsFinished no longer needed?

Member Author

cancelRunningIndependentStages already does that, see:

taskScheduler.cancelTasks(stageId, shouldInterruptTaskThread(job), reason)
markStageAsFinished(stage, Some(reason))

Contributor

Got it.

@mridulm (Contributor), Nov 29, 2023

They are not the same @Ngone51 - markStageAsFinished is for successful stage completions (when no reason), while cancelRunningIndependentStages is aborting other stages for the job which now need to be killed due to the job successfully terminating.

One impact of this is in the errorMessage - it will be nonEmpty from cancelRunningIndependentStages and so trigger failure paths.

Member Author

@mridulm Oh I see. Thanks for catching this.

backend.killTask(tid, execId, interruptThread, s"Stage cancelled: $reason")
}
}
tsm.suspend()
Contributor

I guess you expect that the tsm should be finished. But it may not necessarily happen.

Member Author

No, I don't. Killing tasks may take some time, so I don't expect the tsm to finish immediately. suspend() calls maybeFinishTaskSet() to be safe in case the tsm can't finish normally in the end. I don't see this being an issue in prod (the tsm should finish normally when all tasks finish), but I did see a test ("cancelTasks shall kill all the running tasks") fail without maybeFinishTaskSet().

@@ -2603,4 +2603,13 @@ package object config {
.stringConf
.toSequence
.createWithDefault("org.apache.spark.sql.connect.client" :: Nil)

private[spark] val LEGACY_ABORT_STAGE_AFTER_CANCEL_TASKS =
ConfigBuilder("spark.scheduler.stage.legacyAbortStageAfterCancelTasks")
Contributor

should be spark.legacy....

// Suspends this TSM to avoid launching new tasks.
//
// Unlike `abort()`, this function intentionally to not notify DAGScheduler to avoid
// redundant operations. So the invocation to this function should assume DAGScheduler
Contributor

So how expensive is the redundant operation? We may choose to always do it if it's cheap, to simplify the code.

Member Author

It's not expensive. The operation ("abort stage") is a no-op in the end, as I mentioned at https://github.com/apache/spark/pull/43954/files#r1402869071. I want to remove "abort stage" because I think it's not the right behaviour. "abort stage" always means that any active job that depends on the stage needs to fail. So it doesn't make sense to me, for example, that when a result stage succeeds and the job succeeds, we then cancel the straggling running tasks in that result stage and abort that stage. The "abort" here will try to fail the job (which has already succeeded); it just doesn't happen today because DAGScheduler is thread-safe and the succeeded job has already been removed from the active job list.

If we want to be conservative, I'm fine with keeping it as it is.

@mridulm (Contributor) left a comment

I will need a bit more time to go over this; in the meantime, can you please check my other comments?
Changes (if any) due to those would impact how we analyze the rest of the PR.

Thanks for working on this @Ngone51 !

@@ -2860,6 +2844,11 @@ private[spark] class DAGScheduler(
if (runningStages.contains(stage)) {
try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
taskScheduler.cancelTasks(stageId, shouldInterruptTaskThread(job), reason)
if (sc.getConf.get(LEGACY_ABORT_STAGE_AFTER_CANCEL_TASKS)) {
Contributor

Pull this out as a field
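
e.g., a small sketch of the suggestion (the field name is illustrative, not the merged code):

// Read the flag once when the DAGScheduler is constructed...
private val legacyAbortStageAfterCancelTasks: Boolean =
  sc.getConf.get(LEGACY_ABORT_STAGE_AFTER_CANCEL_TASKS)

// ...and reuse it at the cancellation site instead of querying the conf each time.
if (legacyAbortStageAfterCancelTasks) {
  // abort the stage (legacy behaviour)
}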

@cloud-fan (Contributor) left a comment

LGTM, but let's wait for @mridulm to do a final sign off.

@mridulm (Contributor) left a comment

I will need to look into it in more detail, but I am not sure this change is without side effects.

I am dumping my thoughts here, though it is spread out in the code.
Considering existing state:

  • killAllTaskAttempts will simply kill any running attempts and then we are done - with all additional suffix work done by the caller.
  • cancelTasks was using killAllTaskAttempts, and then additionally doing a tsm.abort.

The new behavior is:

  • killAllTaskAttempts is now cancelTasks
  • cancelTasks becomes the earlier killAllTaskAttempts + a tsm.suspend (which does isZombie = true + tsm.maybeFinishTaskSet)

This has two impacts:

  • Additional call to suspend for existing killAllTaskAttempts
  • Lack of TaskSetFailed for existing cancelTasks (which could be impacting use from abortStage, job cancellation, etc)

Note - I would love to reduce the api surface, and this might still be exactly the same as existing behavior - but it has resulted in a bit of nontrivial change already and I would like us to make sure this is safe and equivalent.

+CC @JoshRosen in case I am missing something.

@@ -2603,4 +2603,13 @@ package object config {
.stringConf
.toSequence
.createWithDefault("org.apache.spark.sql.connect.client" :: Nil)

private[spark] val LEGACY_ABORT_STAGE_AFTER_CANCEL_TASKS =
ConfigBuilder("spark.legacy.scheduler.stage.abortAfterCancelTasks")
Contributor

Namespace it below scheduler ? Something like spark.scheduler.stage.legacyAbortAfterCancelTasks ...
Also, mark it as internal ?

Btw, we should flip this switch on and off in the relevant tests to check if the behavior is preserved.

Member Author

Btw, we should flip this switch on and off in the relevant tests to check if the behavior is preserved.

It is a bit annoying to flip this conf switch on and off only for the relevant unit tests in DAGSchedulerSuite, since it uses a global SparkContext. One way to do this might be to introduce an entire new suite, e.g. DAGSchedulerWithAbortStageDisabledSuite, but I'm not sure it's worth doing that. Or we could extract the relevant unit tests into a separate suite and then flip the conf with two on/off suites. A bit complicated though.

"from TaskScheduler.cancelTasks()")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
Contributor

Do we want to default to true and switch to false in a later version?

@Ngone51 (Member Author) commented Dec 4, 2023

@mridulm Thanks for the detailed comment.

Additional call to suspend for existing killAllTaskAttempts

Note that we always call markStageAsFinished after the call to killAllTaskAttempts. So I think we don't expect such a stage to launch new tasks. And suspend, especially "mark as zombie", avoids exactly that. I think this is actually a bug that we need to fix.

Lack of TaskSetFailed for existing cancelTasks (which could be impacting use from abortStage, job cancellation, etc)

TaskSetFailed leads to abortStage, and I think the essential (main) effect of abortStage is to fail the active jobs which depend on the stage. I have the analysis here explaining why I think abortStage ends up being a no-op during the call of cancelTasks. Could you take a look there?

@mridulm (Contributor) commented Dec 6, 2023

Thanks for the details @Ngone51, sorry for the delay in going over this - your explanation makes sense to me.
Can you update the PR description to mention that we are also fixing a bug through the change with suspend?

Did another pass, and I don't have additional notes beyond the comments I have already left - thanks for working on this!

@Ngone51 (Member Author) commented Dec 6, 2023

Thanks @mridulm . Will address your comments.

@Ngone51 (Member Author) commented Dec 7, 2023

Btw, we should flip this switch on and off in the relevant tests to check if the behavior is preserved.

@mridulm I tried to enable spark.scheduler.stage.legacyAbortAfterCancelTasks in DAGSchedulerSuite. It does appear to cause several test failures 😢. I'm investigating the failures right now.

@Ngone51 (Member Author) commented Dec 8, 2023

To fix the tests, I had to move (fe70ba9) the "abort stage" call back into cancelTasks() behind the control flag, rather than calling "abort stage" after cancelTasks(). After the move, the timing of the "abort stage" call strictly follows the original behavior.

The problem with calling "abort stage" after cancelTasks() is that, e.g., DAGSchedulerSuite has its own implementation of TaskScheduler which overrides cancelTasks() with no "abort stage". But our change extracted "abort stage" outside of cancelTasks(), so the intention of not calling "abort stage" in cancelTasks() is broken and the test fails.
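
For illustration, a rough sketch of the kind of test stub being described (simplified and hypothetical, not the exact suite code): the test TaskScheduler only records the cancellation and deliberately does not abort the stage, so moving the abort outside cancelTasks() changes what the test observes.

// Inside a DAGSchedulerSuite-style stub TaskScheduler:
override def cancelTasks(stageId: Int, interruptThread: Boolean, reason: String): Unit = {
  // Record the cancellation for assertions; intentionally no "abort stage" here.
  cancelledStages += stageId
}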

@Ngone51 (Member Author) commented Dec 12, 2023

Damn! The barrier stage seems to be a special case. It called killAllTaskAttempts() to kill all the other tasks when there was a task failure, but didn't abort the stage, as it would retry later. In this PR, we replace killAllTaskAttempts() with cancelTasks() and enable stage abortion by default within cancelTasks(). This leads to the barrier stage failing instead of retrying. It would only work for all cases if we're confident enough to remove stage abortion from cancelTasks() entirely, without any control flag.

@mridulm (Contributor) commented Dec 13, 2023

Ah, interesting - I had not looked at barrier stage in as much detail; my initial observation was it worked fine, but you are right - this does break the assumption.

@@ -54,7 +54,7 @@ private[spark] trait TaskScheduler {
// Submit a sequence of tasks to run.
def submitTasks(taskSet: TaskSet): Unit

// Kill all the tasks in a stage and fail the stage and all the jobs that depend on the stage.
// Kill all the tasks in all the stage attempts of the same stage Id
Contributor

Looking at the comment, shall we keep killAllTaskAttempts instead of cancelTasks, as the naming of killAllTaskAttempts fits the comment better?

Member Author

Updated, thanks!

// Kill all the tasks in all the stage attempts of the same stage Id. Note stage attempts won't
// be aborted but will be marked as zombie. The stage attempt will be finished and cleaned up
// once all the tasks has been finished. The stage attempt could be aborted after the call of
// `cancelTasks` if required.
Contributor

Suggested change
// `cancelTasks` if required.
// `killAllTaskAttempts ` if required.

// Unlike `abort()`, this function intentionally to not notify DAGScheduler to avoid
// redundant operations. So the invocation to this function should assume DAGScheduler
// already knows about this TSM failure. For example, this function can be called from
// `TaskScheduler.cancelTasks` by DAGScheduler.
Contributor

ditto

// Throw UnsupportedOperationException if the backend doesn't support kill tasks.
def cancelTasks(stageId: Int, interruptThread: Boolean, reason: String): Unit
def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit
Contributor

super nit: shall we put this method after def killTaskAttempt? The same for the testing TaskScheduler implementations to reduce the code diff.

Member Author

Updated, thanks!

@Ngone51 (Member Author) commented Jan 12, 2024

@mridulm @cloud-fan Could you help merge the PR if you don't have other comments? Thanks!

@cloud-fan (Contributor)
thanks, merging to master!

@cloud-fan closed this in 96f34bb on Jan 12, 2024