
[SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down #38622

Closed
wants to merge 4 commits

Conversation

@pan3793 (Member) commented Nov 11, 2022

What changes were proposed in this pull request?

Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down.

The approach is suggested at #36991 (comment)
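To illustrate the intent, here is a minimal standalone sketch of the classification rule, assuming a driver-shutdown flag is visible to the allocator. Names and the non-shutdown branch are simplified placeholders, not the actual patch; the real change is in `YarnAllocator`, shown in the diff further down.

```scala
object ExitClassificationSketch {
  // Returns (exitCausedByApp, reason) for a completed container.
  // Simplified sketch: the real YarnAllocator matches on many exit statuses;
  // only the shutdown short-circuit is the point here.
  def classify(containerId: String, exitStatus: Int, driverShuttingDown: Boolean): (Boolean, String) = {
    if (driverShuttingDown) {
      // Driver-initiated shutdown: executor exits are expected and should not
      // count toward spark.yarn.max.executor.failures.
      (false, s"Executor for container $containerId exited after Application shutdown.")
    } else {
      // Placeholder for the existing exit-status based classification.
      (true, s"Container $containerId exited with status $exitStatus.")
    }
  }
}
```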

Why are the changes needed?

I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with the reason "Max number of executor failures (16) reached". The timeline is:

Driver - The job succeeds and Spark starts the shutdown procedure.

2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark@74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors

Driver - A container is allocated successfully during the shutdown phase.

2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0

Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
  at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
  at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
  at scala.collection.immutable.Range.foreach(Range.scala:158)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
  ... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
  at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
  at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)

Driver - YarnAllocator receives the container launch error message and treats it as exitCausedByApp.

2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...


[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.trySuccess(Promise.scala:94)
  at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
  at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
  at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
  at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)

Driver - Eventually the application fails because the number of "failed" executors reaches the threshold.

2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Update UT.

@pan3793 (Member, Author) commented Nov 11, 2022

cc @tgravescs

@AmplabJenkins

Can one of the admins verify this patch?

@LuciferYang (Contributor)

cc @mridulm @tgravescs FYI

@mridulm (Contributor) commented Nov 14, 2022

I will let @tgravescs take a look at this; I don't have as much context as he does.

@tgravescs (Contributor)

can you please add a description to the issue: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39601

@@ -815,6 +815,7 @@ private[spark] class ApplicationMaster(
      case Shutdown(code) =>
        exitCode = code
        shutdown = true
        allocator.setShutdown(true)
Contributor

So maybe I'm missing a call, but it looks like this Shutdown message is only sent in client mode. Is that the only time you are seeing this issue? I.e., when the driver is local and the ApplicationMaster and YARN allocator are on the cluster?
But the log message in the description has YarnClusterSchedulerBackend, which makes me think this is cluster mode.

Member Author

> it looks like this Shutdown message is only sent in Client mode

Yea, I just noticed that. Do you think it's a good idea to send Shutdown in cluster mode as well, or do you have any other suggestions? cc @AngersZhuuuu as you are the author of that code.

> the log message in the description has YarnClusterSchedulerBackend which makes me think this is cluster mode.

You are right, my reported job failed in cluster mode, and I think both YARN client and cluster modes have this issue.

Contributor

Yeah, this won't fix your issue as is, so we would need something for cluster mode. I'm fine with sending the shutdown message. It would be nice if you could test the fix on your use case to make sure it fixes it as well.

Contributor

It's OK to send Shutdown in cluster mode too, not only in client mode.

@pan3793 (Member, Author) commented Nov 14, 2022

> can you please add a description to the issue: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39601

Thanks for the reminder, updated.

@@ -835,6 +839,8 @@ private[yarn] class YarnAllocator(
      // now I think its ok as none of the containers are expected to exit.
      val exitStatus = completedContainer.getExitStatus
      val (exitCausedByApp, containerExitReason) = exitStatus match {
        case _ if shutdown =>
          (false, s"Executor for container $containerId exited after Application shutdown.")
Member Author

Ignore all executor exits after Spark shutdown.
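Putting the two hunks together: the ApplicationMaster flips a flag on the allocator when it handles Shutdown, and the completed-container handling consults that flag before blaming the app. A self-contained sketch of that plumbing (class and member names here are assumed for illustration, not verbatim Spark code):

```scala
// Sketch of the allocator-side flag; the real code lives in YarnAllocator.
class AllocatorShutdownFlagSketch {
  // Written by the AM's Shutdown handler, read by the container-completion path,
  // hence volatile.
  @volatile private var shutdown = false

  def setShutdown(value: Boolean): Unit = {
    shutdown = value
  }

  // Mirrors the guard added in the diff above:
  //   case _ if shutdown =>
  //     (false, s"Executor for container $containerId exited after Application shutdown.")
  def exitCausedByApp(defaultClassification: Boolean): Boolean =
    if (shutdown) false else defaultClassification
}
```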

@tgravescs (Contributor)

Please try not to force push, as it makes reviewing what changed impossible. Overall I think this looks good.

Did you manually test your cases to make sure it's fixed? The unit tests here don't really test client/cluster mode fully end to end, which I understand would be hard in a unit test.

@pan3793 (Member, Author) commented Nov 22, 2022

Thanks @tgravescs. I find it's not easy to construct a simple reproduction in a small testing cluster, because the issue only happens when the RM is busy. I plan to apply this patch to our internal branch and run production workloads to see if it helps, and will report back in a few days.

@pan3793 (Member, Author) commented Dec 2, 2022

@tgravescs I have rolled out this patch to one of our internal clusters for a week, and everything has looked OK so far.

Previously, we increased spark.yarn.max.executor.failures as a workaround for the bad case (see the sketch after this comment); we removed it after applying the patch, and no failures have occurred since.

I think it's ready to go.
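The interim workaround mentioned above amounts to raising the executor-failure threshold so that shutdown-phase "failures" do not kill the application. A minimal sketch follows; the value is illustrative only, and with the patch applied the setting is no longer needed:

```scala
import org.apache.spark.SparkConf

object WorkaroundSketch {
  // Illustrative only: bump the threshold the failing applications were hitting
  // during shutdown. 64 is an example value, not a recommendation.
  val conf: SparkConf = new SparkConf()
    .set("spark.yarn.max.executor.failures", "64")
}
```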

@pan3793 (Member, Author) commented Dec 13, 2022

@tgravescs could you please help merge this one?

@asfgit asfgit closed this in e857b7a Dec 13, 2022
@tgravescs (Contributor)

Merged to master. This wasn't cherry-picked to branch-3.3, so if you want it there we should put up a separate PR.

@@ -35,6 +35,10 @@ private[spark] class YarnClusterSchedulerBackend(
    startBindings()
  }

  override def stop(exitCode: Int): Unit = {
    yarnSchedulerEndpoint.signalDriverStop(exitCode)
Member Author

@tgravescs I just noticed `super.stop()` is missed here; opened #39053 for that.
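Based on that comment and the follow-up's commit message, the net shape of the corrected override is presumably: signal the AM to stop the driver, then still run the parent's normal stop logic. A sketch with stand-in types (not Spark's real class hierarchy):

```scala
// Stand-in types only, to show the shape of the #39053 follow-up.
class SchedulerBackendSketch {
  def stop(exitCode: Int): Unit = {
    // parent cleanup that must not be skipped
  }
}

class YarnClusterSchedulerBackendSketch(signalDriverStop: Int => Unit)
    extends SchedulerBackendSketch {
  override def stop(exitCode: Int): Unit = {
    signalDriverStop(exitCode) // ask the ApplicationMaster to stop the driver
    super.stop(exitCode)       // the parent call the original change missed
  }
}
```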

@pan3793 (Member, Author) commented Dec 13, 2022

@tgravescs branch-3.3 has some code conflicts; given this is a corner case and not a regression, I think it's fine for this to go into master only.

dongjoon-hyun pushed a commit that referenced this pull request Dec 13, 2022
… super.stop()

### What changes were proposed in this pull request?

This is a follow-up of #38622; I just noticed that `YarnClusterSchedulerBackend#stop` missed calling `super.stop()`.

### Why are the changes needed?

Follow-up to the previous change; otherwise Spark may not shut down properly in YARN cluster mode.

### Does this PR introduce _any_ user-facing change?

No, unreleased change.

### How was this patch tested?

Existing UT.

Closes #39053 from pan3793/SPARK-39601-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
…r to 0 to cancel pending allocate request when driver stop

### What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocation requests when the driver stops.
Now for this issue we do:

1. AllocationFailure should not be treated as exitCausedByApp when driver is shutting down #38622
2. Avoid new allocation requests when sc.stop stuck #43906
3. Cancel pending allocation request, this pr #44036

### Why are the changes needed?
Avoid unnecessary allocation requests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
MT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>