
[SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated #39011

Closed · wants to merge 6 commits from Ngone51/decom-executor-lost into apache:master

Conversation

@Ngone51 (Member) commented Dec 9, 2022

What changes were proposed in this pull request?

This PR proposes to avoid rerunning the finished shuffle map task in TaskSetManager.executorLost() if the executor lost is caused by decommission and the shuffle data has been successfully migrated.
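A simplified, self-contained sketch of the check (illustrative names only, not Spark's internal API or the literal diff) looks like this:

// Simplified model of the decision; the real logic lives inside
// TaskSetManager.executorLost() and uses Spark-internal types.
object DecommissionRerunSketch extends App {

  final case class BlockLocation(host: String)

  sealed trait ExecutorLossReason
  final case class ExecutorDecommission(workerHost: Option[String]) extends ExecutorLossReason
  case object ExecutorKilled extends ExecutorLossReason

  /** Returns true if a finished shuffle map task on the lost executor must be rerun. */
  def shouldResubmit(
      reason: ExecutorLossReason,
      externalShuffleServiceEnabled: Boolean,
      lostExecutorHost: String,
      currentMapOutputLocation: Option[BlockLocation]): Boolean = {
    // A decommission-triggered loss is considered a possible map-output loss,
    // but outputs already migrated (now registered on a different host) are not lost.
    val maybeShuffleMapOutputLoss =
      reason.isInstanceOf[ExecutorDecommission] || !externalShuffleServiceEnabled
    val migrated = currentMapOutputLocation.exists(_.host != lostExecutorHost)
    maybeShuffleMapOutputLoss && !migrated
  }

  // Decommissioned executor whose shuffle blocks were migrated to hostB: no rerun.
  println(shouldResubmit(ExecutorDecommission(Some("hostA")),
    externalShuffleServiceEnabled = false, lostExecutorHost = "hostA",
    currentMapOutputLocation = Some(BlockLocation("hostB"))))  // false
}

In other words, a decommission-triggered executor loss only forces a rerun when the map output is still registered on the lost host.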

Why are the changes needed?

To avoid unnecessary task recomputation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UT

@Ngone51 (Member, Author) commented Dec 9, 2022

cc @warrenzhu25 too

@Ngone51 Ngone51 changed the title [SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated [WIP][SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated Dec 9, 2022
@Ngone51 (Member, Author) commented Dec 9, 2022

Marked as WIP for now due to the compilation error and the missing UT. Any feedback is still welcome.

@dongjoon-hyun (Member) left a comment

Thank you for working on this improvement.

@warrenzhu25 (Contributor) commented:
> cc @warrenzhu25 too

It's really the change I want. Great work.

@Ngone51 Ngone51 changed the title [WIP][SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated [SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated Dec 10, 2022
@mridulm (Contributor) left a comment

Nice fix @Ngone51 !
Had a couple of comments.

@Ngone51 (Member, Author) commented Dec 12, 2022

The failed test seems to be flaky:

- decommission workers ensure that shuffle output is regenerated even with shuffle service *** FAILED *** (18 seconds, 479 milliseconds)
[info]   5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:0:0-SUCCESS, 0:0:1:0-SUCCESS, 0:0:0:1-SUCCESS, 1:0:0:0-SUCCESS) (DecommissionWorkerSuite.scala:191)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.deploy.DecommissionWorkerSuite.$anonfun$new$6(DecommissionWorkerSuite.scala:191)
....

@dongjoon-hyun (Member) left a comment

Could you take a look at the failure? It looks relevant.

[error] Failed: Total 3462, Failed 1, Errors 0, Passed 3461, Ignored 9, Canceled 2
[error] Failed tests:
[error] 	org.apache.spark.deploy.DecommissionWorkerSuite

@Ngone51 (Member, Author) commented Dec 15, 2022

> Could you take a look at the failure? It looks relevant.

The failure is not reproducible every time. I suspect it is still a flaky test. I will keep an eye on it.

@dongjoon-hyun (Member) commented:
Got it. In that case, it's okay. Thank you for checking and confirming.

@mridulm (Contributor) commented Dec 15, 2022

The failure indicates two task end events for the same task? I don't see how that can happen due to this PR, but I'm wondering how it could have happened in general... any thoughts, @Ngone51?

Let me take that back; this can actually be explained, and it is an effect of this PR.
The task resubmission is the duplicate event.
The earlier condition was isShuffleMapTasks && !env.blockManager.externalShuffleServiceEnabled && !isZombie, which will always be false for this test, since externalShuffleServiceEnabled == true.

Now, we also check for ExecutorDecommission, and it is a race whether or not the worker was killed before this event was processed in the TSM.

We should fix the test to account for the change in behavior, or take another look at whether this is a legitimate case we are missing? (I don't think so, but want to be sure.)
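For concreteness, a minimal self-contained restatement of the two guards (the parameters stand in for the TaskSetManager fields quoted above; not the literal diff):

object GuardComparison extends App {

  // Guard before this PR: skipped entirely whenever an external shuffle service is enabled.
  def beforePR(isShuffleMapTasks: Boolean, essEnabled: Boolean, isZombie: Boolean): Boolean =
    isShuffleMapTasks && !essEnabled && !isZombie

  // Guard after this PR: a decommission-triggered loss also enters the resubmission block.
  def afterPR(isShuffleMapTasks: Boolean, essEnabled: Boolean, decommissioned: Boolean): Boolean =
    isShuffleMapTasks && (decommissioned || !essEnabled)

  // DecommissionWorkerSuite scenario: shuffle map tasks, ESS enabled, executor decommissioned.
  println(beforePR(isShuffleMapTasks = true, essEnabled = true, isZombie = false))      // false: never resubmits
  println(afterPR(isShuffleMapTasks = true, essEnabled = true, decommissioned = true))  // true: may resubmit, racing with the worker kill
}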

@Ngone51 (Member, Author) commented Dec 26, 2022

@mridulm Thanks for the help. Let me take another look. (Sorry for the delay; I wasn't well last week.)

@mridulm (Contributor) left a comment

Looks good to me, thanks for fixing this @Ngone51 !

@mridulm mridulm closed this in b219f27 Dec 27, 2022
@mridulm (Contributor) commented Dec 27, 2022

Merged to master.
Thanks for working on this @Ngone51 !
Thanks for review @dongjoon-hyun :-)

@Ngone51 (Member, Author) commented Dec 28, 2022

Thanks @mridulm @dongjoon-hyun

warrenzhu25 pushed commits to warrenzhu25/spark that referenced this pull request on Feb 17, Feb 26, and Mar 28, 2023; the commit message repeats the PR description:

[SPARK-41469][CORE] Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated

### What changes were proposed in this pull request?

This PR proposes to avoid rerunning the finished shuffle map task in `TaskSetManager.executorLost()` if the executor lost is caused by decommission and the shuffle data has been successfully migrated.

### Why are the changes needed?

To avoid unnecessary task recomputation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added UT

Closes apache#39011 from Ngone51/decom-executor-lost.

Authored-by: Yi Wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Comment on lines +1056 to +1057
val maybeShuffleMapOutputLoss = isShuffleMapTasks &&
(reason.isInstanceOf[ExecutorDecommission] || !env.blockManager.externalShuffleServiceEnabled)
@JoshRosen (Contributor) left a comment:

@Ngone51 @mridulm I have a question about the logic here:

Executor decommissioning does not necessarily imply worker decommissioning. If an external shuffle service is used and the executor is decommissioned without the worker also being decommissioned then the shuffle files will continue to be available at the original host.

Prior to this PR, I don't think this executorLost method would have scheduled re-runs in that case, because the && !env.blockManager.externalShuffleServiceEnabled condition would evaluate to false when the ESS was used, causing us to skip all of the resubmission logic here.

With this PR, though, I think these changes might actually cause unnecessary task re-submission in that case: the (reason.isInstanceOf[ExecutorDecommission] || !env.blockManager.externalShuffleServiceEnabled) condition would evaluate to true, while locationOpt.exists(_.host != host) would evaluate to false, since the original outputs are still available and no migration is needed.

I think this could be addressed by checking whether ExecutorDecommission.workerHost is defined, i.e. to do

val workerIsDecommissioned = reason match {
  case e: ExecutorDecommission if e.workerHost.isDefined => true
  case _ => false
}

val maybeShuffleMapOutputLoss = isShuffleMapTasks &&
  (workerIsDecommissioned || !env.blockManager.externalShuffleServiceEnabled)

That said, my argument above doesn't hold for Spark-on-Kubernetes because it never sets workerHost:

executorsPendingDecommission.get(id) match {
  case Some(host) =>
    // We don't pass through the host because by convention the
    // host is only populated if the entire host is going away
    // and we don't know if that's the case or just one container.
    removeExecutor(id, ExecutorDecommission(None))

Given this, I am wondering whether this PR's change might represent a regression when Dynamic Allocation is used alongside an external shuffle service in YARN.

WDYT? Am I interpreting the code correctly or have I overlooked something here?

@mridulm (Contributor) commented Jul 1, 2023

Sorry for the delay - this message got lost in my inbox.
You are correct, @JoshRosen; we should indeed check for ExecutorDecommission.workerHost.isDefined for standalone.

+CC @Ngone51 in case I am missing something.


@Ngone51 @mridulm Is the above issue being tracked elsewhere?

@09306677806 commented Apr 21, 2024 via email
