[SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down #38622
Conversation
cc @tgravescs
Can one of the admins verify this patch?
cc @mridulm @tgravescs FYI
I will let @tgravescs take a look at this - I don't have as much context as he does.
Can you please add a description to the issue: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39601
@@ -815,6 +815,7 @@ private[spark] class ApplicationMaster(
      case Shutdown(code) =>
        exitCode = code
        shutdown = true
        allocator.setShutdown(true)
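To make the change above concrete, here is a minimal, hypothetical sketch of how the `Shutdown` message handler propagates the flag into the allocator. The class and message names are simplified stand-ins, not the real `ApplicationMaster`/`YarnAllocator` signatures:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Simplified stand-in for YarnAllocator: only the shutdown flag is modeled.
class Allocator {
  private val shutdownFlag = new AtomicBoolean(false)
  def setShutdown(value: Boolean): Unit = shutdownFlag.set(value)
  def isShutdown: Boolean = shutdownFlag.get()
}

// Simplified stand-in for the ApplicationMaster RPC endpoint.
class AppMasterEndpoint(allocator: Allocator) {
  var exitCode = 0
  var shutdown = false

  // Mirrors the Shutdown(code) branch in the diff above.
  def receive(msg: Any): Unit = msg match {
    case ("Shutdown", code: Int) =>
      exitCode = code
      shutdown = true
      allocator.setShutdown(true) // the line this PR adds
    case _ => // other messages ignored in this sketch
  }
}
```

The point of routing the flag through the allocator is that the allocator, not the AM endpoint, is the component that later classifies completed containers.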
So maybe I'm missing a call, but it looks like this Shutdown message is only sent in client mode. Is that the only time you are seeing this issue, i.e., when the driver is local and the ApplicationMaster and YARN allocator are on the cluster?
But the log message in the description has YarnClusterSchedulerBackend, which makes me think this is cluster mode.
it looks like this Shutdown message is only sent in Client mode
Yeah, I just noticed that. Do you think it's a good idea to send Shutdown in cluster mode as well? Or any other suggestions? cc @AngersZhuuuu as you are the author of that code.
the log message in the description has YarnClusterSchedulerBackend which makes me think this is cluster mode.
You are right, my reported job failed in cluster mode, and I think both YARN client and cluster modes have this issue.
Yeah, this won't fix your issue as-is, so we would need something for cluster mode. I'm fine with sending the shutdown message. It would be nice if you could test the fix on your use case to make sure it fixes it as well.
It's OK to send Shutdown in cluster mode too, not only in client mode.
Thanks for the reminder, updated.
…usedByApp when driver is shutting down
@@ -835,6 +839,8 @@ private[yarn] class YarnAllocator(
    // now I think its ok as none of the containers are expected to exit.
    val exitStatus = completedContainer.getExitStatus
    val (exitCausedByApp, containerExitReason) = exitStatus match {
      case _ if shutdown =>
        (false, s"Executor for container $containerId exited after Application shutdown.")
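As a hedged illustration of what the new guard does, here is a simplified classification function in the shape of the `exitStatus match` above. The `SUCCESS`/`PREEMPTED` values mirror YARN's `ContainerExitStatus` constants; everything else is a stand-in, not the full YarnAllocator logic:

```scala
// Illustrative subset of YARN's ContainerExitStatus constants.
object ContainerExitStatus {
  val SUCCESS = 0
  val PREEMPTED = -102
}

// Returns (exitCausedByApp, containerExitReason), simplified.
def classify(exitStatus: Int, shutdown: Boolean, containerId: String): (Boolean, String) =
  exitStatus match {
    // The new case: any exit observed after shutdown is not the app's fault.
    case _ if shutdown =>
      (false, s"Executor for container $containerId exited after Application shutdown.")
    case ContainerExitStatus.SUCCESS =>
      (false, s"Executor for container $containerId exited normally.")
    case ContainerExitStatus.PREEMPTED =>
      (false, s"Container $containerId was preempted.")
    case other =>
      (true, s"Container $containerId exited with status $other.")
  }
```

Because the `case _ if shutdown` guard comes first, a launch failure (exit status 1) during shutdown no longer counts toward the executor-failure threshold.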
Ignore all executor exits after Spark shutdown.
Please try not to force-push, as it makes reviewing what changed impossible. Overall I think this looks good. Did you manually test your cases to make sure it's fixed? The unit tests here don't really exercise client/cluster mode fully, which I understand would be hard to do in a unit test.
Thanks @tgravescs. I find it's not easy to construct a simple reproducible test case in a small testing cluster, because the issue only happens when the RM is busy. I plan to apply this patch to our internal branch and run production workloads to see if it helps, and will report back in a few days.
@tgravescs I have rolled out this patch to one of our internal clusters for a week, and everything looks OK so far. Previously, we increased the I think it's ready to go.
@tgravescs could you please help merge this one?
Merged to master. This didn't cherry-pick cleanly to branch-3.3, so if you want it there we should put up a separate PR.
@@ -35,6 +35,10 @@ private[spark] class YarnClusterSchedulerBackend(
    startBindings()
  }

  override def stop(exitCode: Int): Unit = {
    yarnSchedulerEndpoint.signalDriverStop(exitCode)
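The shape of the followup fix (#39053) can be sketched as follows. This is an illustrative skeleton only; the real `YarnClusterSchedulerBackend` extends `YarnSchedulerBackend`, and `signalDriverStop` is an RPC to the ApplicationMaster, not a local call:

```scala
// Stand-in for the parent backend, whose stop() does real cleanup.
class BaseBackend {
  var stopped = false
  def stop(exitCode: Int): Unit = { stopped = true }
}

class ClusterBackend extends BaseBackend {
  var signalled: Option[Int] = None
  // Stand-in for the RPC that tells the AM the driver is stopping.
  private def signalDriverStop(exitCode: Int): Unit = signalled = Some(exitCode)

  override def stop(exitCode: Int): Unit = {
    signalDriverStop(exitCode)
    super.stop(exitCode) // the call #39053 restores; without it, cleanup is skipped
  }
}
```

Overriding `stop` without delegating to `super.stop` is what left the parent backend's shutdown work undone, which is why Spark could fail to shut down properly in YARN cluster mode.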
@tgravescs I just noticed `super.stop` is missing here; opened #39053 for that.
@tgravescs branch-3.3 has some code conflicts; given this is a corner case and not a regression, I think it's fine to land this in master only.
… super.stop()

### What changes were proposed in this pull request?
This is a followup of #38622; I just noticed that `YarnClusterSchedulerBackend#stop` missed calling `super.stop()`.

### Why are the changes needed?
Follows up the previous change; otherwise Spark may not shut down properly in YARN cluster mode.

### Does this PR introduce _any_ user-facing change?
No, unreleased change.

### How was this patch tested?
Existing UT.

Closes #39053 from pan3793/SPARK-39601-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…usedByApp when driver is shutting down

### What changes were proposed in this pull request?
Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down. The approach is suggested at apache#36991 (comment).

### Why are the changes needed?
I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with reason: Max number of executor failures (16) reached. The timeline is:

Driver - Job succeeds, Spark starts the shutdown procedure.
```
2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
```

Driver - A container is allocated successfully during the shutdown phase.
```
2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0
```

Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
	at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
```

Driver - YarnAllocator receives the container launch error message and treats it as `exitCausedByApp`.
```
2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
	at scala.concurrent.Promise.trySuccess(Promise.scala:94)
	at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
	at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
	at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
```

Driver - Eventually the application fails because the "failed" executor count reached the threshold.
```
2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated UT.

Closes apache#38622 from pan3793/SPARK-39601.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
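The failure mode in the timeline above boils down to a counter that keeps incrementing after the app has already succeeded. A minimal, hypothetical sketch of such a tracker, assuming a threshold like `spark.yarn.max.executor.failures` (this is not the actual ApplicationMaster code):

```scala
// Illustrative failure counter; `maxFailures` plays the role of the
// max-executor-failures threshold from the log above.
class FailureTracker(maxFailures: Int) {
  private var failures = 0
  @volatile var shutdown = false

  // Returns true when the application should be marked FAILED.
  // With the fix, exits during shutdown are not exitCausedByApp,
  // so they never increment the counter.
  def onExecutorExit(exitCausedByApp: Boolean): Boolean = {
    if (exitCausedByApp && !shutdown) failures += 1
    failures >= maxFailures
  }
}
```

Without the shutdown guard, the 16 post-success launch failures in the log each counted as `exitCausedByApp`, tipping the tracker over its threshold and turning a successful run into `Final app status: FAILED, exitCode: 11`.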
…r to 0 to cancel pending allocate request when driver stop

### What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocate requests when the driver stops. Now for this issue we do:
1. AllocationFailure should not be treated as exitCausedByApp when driver is shutting down (#38622)
2. Avoid new allocation requests when sc.stop stuck (#43906)
3. Cancel pending allocation requests, this PR (#44036)

### Why are the changes needed?
Avoid unnecessary allocate requests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
MT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
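The "set target to 0" idea from step 3 can be sketched as follows. This is a hypothetical simplification: the real `YarnAllocator` tracks targets per ResourceProfile and rebuilds `ContainerRequest`s against the RM, none of which is modeled here:

```scala
// Toy allocator: pending asks are simply target minus running.
class DriverStopAllocator {
  private var targetExecutors = 0
  private var runningExecutors = 0

  def pending: Int = math.max(0, targetExecutors - runningExecutors)

  def requestTotalExecutors(n: Int): Unit = targetExecutors = n

  // On driver stop, zeroing the target makes every outstanding
  // allocate request cancellable on the next RM heartbeat.
  def stop(): Unit = requestTotalExecutors(0)
}
```

The design point is that cancellation falls out of the existing target-reconciliation path: no new "cancel" RPC is needed, the allocator just stops asking.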