[SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down #38622
Conversation
cc @tgravescs
Can one of the admins verify this patch?
cc @mridulm @tgravescs FYI
I will let @tgravescs take a look at this - I don't have as much context as he does.
Can you please add a description to the issue: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39601
@@ -815,6 +815,7 @@ private[spark] class ApplicationMaster(
      case Shutdown(code) =>
        exitCode = code
        shutdown = true
        allocator.setShutdown(true)
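To make the change above concrete, here is a minimal, hypothetical sketch of how the `Shutdown` message handler propagates the flag into the allocator. The class and message names are simplified stand-ins, not the real `ApplicationMaster`/`YarnAllocator` signatures:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Simplified stand-in for YarnAllocator: only the shutdown flag is modeled.
class Allocator {
  private val shutdownFlag = new AtomicBoolean(false)
  def setShutdown(value: Boolean): Unit = shutdownFlag.set(value)
  def isShutdown: Boolean = shutdownFlag.get()
}

// Simplified stand-in for the ApplicationMaster RPC endpoint.
class AppMasterEndpoint(allocator: Allocator) {
  var exitCode = 0
  var shutdown = false

  // Mirrors the Shutdown(code) branch in the diff above.
  def receive(msg: Any): Unit = msg match {
    case ("Shutdown", code: Int) =>
      exitCode = code
      shutdown = true
      allocator.setShutdown(true) // the line this PR adds
    case _ => // other messages ignored in this sketch
  }
}
```

The point of routing the flag through the allocator is that the allocator, not the AM endpoint, is the component that later classifies completed containers.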
So maybe I'm missing a call, but it looks like this Shutdown message is only sent in client mode. Is that the only time you are seeing this issue, i.e., when the driver is local and the ApplicationMaster and YARN allocator are on the cluster?
But the log message in the description has YarnClusterSchedulerBackend, which makes me think this is cluster mode.
it looks like this Shutdown message is only sent in Client mode
Yeah, I just noticed that. Do you think it's a good idea to send Shutdown in cluster mode as well? Or any other suggestions? cc @AngersZhuuuu as you are the author of that code.
the log message in the description has YarnClusterSchedulerBackend which makes me think this is cluster mode.
You are right, my reported job failed in cluster mode, and I think both YARN client and cluster modes have this issue.
Yeah, this won't fix your issue as-is, so we would need something for cluster mode. I'm fine with sending the shutdown message. It would be nice if you could test the fix on your use case to make sure it fixes it as well.
It's OK to send Shutdown in cluster mode too, not only in client mode.
Thanks for the reminder, updated.
…usedByApp when driver is shutting down
@@ -835,6 +839,8 @@ private[yarn] class YarnAllocator(
    // now I think its ok as none of the containers are expected to exit.
    val exitStatus = completedContainer.getExitStatus
    val (exitCausedByApp, containerExitReason) = exitStatus match {
      case _ if shutdown =>
        (false, s"Executor for container $containerId exited after Application shutdown.")
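As a hedged illustration of what the new guard does, here is a simplified classification function in the shape of the `exitStatus match` above. The `SUCCESS`/`PREEMPTED` values mirror YARN's `ContainerExitStatus` constants; everything else is a stand-in, not the full YarnAllocator logic:

```scala
// Illustrative subset of YARN's ContainerExitStatus constants.
object ContainerExitStatus {
  val SUCCESS = 0
  val PREEMPTED = -102
}

// Returns (exitCausedByApp, containerExitReason), simplified.
def classify(exitStatus: Int, shutdown: Boolean, containerId: String): (Boolean, String) =
  exitStatus match {
    // The new case: any exit observed after shutdown is not the app's fault.
    case _ if shutdown =>
      (false, s"Executor for container $containerId exited after Application shutdown.")
    case ContainerExitStatus.SUCCESS =>
      (false, s"Executor for container $containerId exited normally.")
    case ContainerExitStatus.PREEMPTED =>
      (false, s"Container $containerId was preempted.")
    case other =>
      (true, s"Container $containerId exited with status $other.")
  }
```

Because the `case _ if shutdown` guard comes first, a launch failure (exit status 1) during shutdown no longer counts toward the executor-failure threshold.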
Ignore all executor exits after Spark shutdown.
Please try not to force-push, as it makes reviewing what changed impossible. Overall I think this looks good. Did you manually test your cases to make sure it's fixed? The unit tests here don't really exercise client/cluster mode fully, which I understand would be hard to do in a unit test.
Thanks @tgravescs. I find it's not easy to construct a simple reproducible test case in a small testing cluster, because the issue only happens when the RM is busy. I plan to apply this patch to our internal branch and run production workloads to see if it helps, and will report back in a few days.
@tgravescs I have rolled out this patch to one of our internal clusters for a week, and everything looks OK so far. Previously, we increased the I think it's ready to go.
@tgravescs could you please help merge this one?
Merged to master. This didn't cherry-pick cleanly to branch-3.3, so if you want it there we should put up a separate PR.
@@ -35,6 +35,10 @@ private[spark] class YarnClusterSchedulerBackend(
    startBindings()
  }

  override def stop(exitCode: Int): Unit = {
    yarnSchedulerEndpoint.signalDriverStop(exitCode)
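The shape of the followup fix (#39053) can be sketched as follows. This is an illustrative skeleton only; the real `YarnClusterSchedulerBackend` extends `YarnSchedulerBackend`, and `signalDriverStop` is an RPC to the ApplicationMaster, not a local call:

```scala
// Stand-in for the parent backend, whose stop() does real cleanup.
class BaseBackend {
  var stopped = false
  def stop(exitCode: Int): Unit = { stopped = true }
}

class ClusterBackend extends BaseBackend {
  var signalled: Option[Int] = None
  // Stand-in for the RPC that tells the AM the driver is stopping.
  private def signalDriverStop(exitCode: Int): Unit = signalled = Some(exitCode)

  override def stop(exitCode: Int): Unit = {
    signalDriverStop(exitCode)
    super.stop(exitCode) // the call #39053 restores; without it, cleanup is skipped
  }
}
```

Overriding `stop` without delegating to `super.stop` is what left the parent backend's shutdown work undone, which is why Spark could fail to shut down properly in YARN cluster mode.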
@tgravescs I just noticed `super.stop` is missing here; opened #39053 for that.
@tgravescs branch-3.3 has some code conflicts; given this is a corner case and not a regression, I think it's fine to land this in master only.
… super.stop()

### What changes were proposed in this pull request?
This is a followup of #38622; I just noticed that `YarnClusterSchedulerBackend#stop` missed calling `super.stop()`.

### Why are the changes needed?
Follows up the previous change; otherwise Spark may not shut down properly in YARN cluster mode.

### Does this PR introduce _any_ user-facing change?
No, unreleased change.

### How was this patch tested?
Existing UT.

Closes #39053 from pan3793/SPARK-39601-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…usedByApp when driver is shutting down

### What changes were proposed in this pull request?
Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down. The approach is suggested at apache#36991 (comment).

### Why are the changes needed?
I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with reason: Max number of executor failures (16) reached. The timeline is:

Driver - Job succeeds, Spark starts the shutdown procedure.
```
2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
```

Driver - A container is allocated successfully during the shutdown phase.
```
2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0
```

Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
	at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
```

Driver - YarnAllocator receives the container launch error message and treats it as `exitCausedByApp`.
```
2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
	at scala.concurrent.Promise.trySuccess(Promise.scala:94)
	at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
	at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
	at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
	at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
```

Driver - Eventually the application fails because the "failed" executor count reached the threshold.
```
2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated UT.

Closes apache#38622 from pan3793/SPARK-39601.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
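The failure mode in the timeline above boils down to a counter that keeps incrementing after the app has already succeeded. A minimal, hypothetical sketch of such a tracker, assuming a threshold like `spark.yarn.max.executor.failures` (this is not the actual ApplicationMaster code):

```scala
// Illustrative failure counter; `maxFailures` plays the role of the
// max-executor-failures threshold from the log above.
class FailureTracker(maxFailures: Int) {
  private var failures = 0
  @volatile var shutdown = false

  // Returns true when the application should be marked FAILED.
  // With the fix, exits during shutdown are not exitCausedByApp,
  // so they never increment the counter.
  def onExecutorExit(exitCausedByApp: Boolean): Boolean = {
    if (exitCausedByApp && !shutdown) failures += 1
    failures >= maxFailures
  }
}
```

Without the shutdown guard, the 16 post-success launch failures in the log each counted as `exitCausedByApp`, tipping the tracker over its threshold and turning a successful run into `Final app status: FAILED, exitCode: 11`.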
…r to 0 to cancel pending allocate request when driver stop

### What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocate requests when the driver stops. Now for this issue we do:
1. AllocationFailure should not be treated as exitCausedByApp when driver is shutting down (#38622)
2. Avoid new allocation requests when sc.stop stuck (#43906)
3. Cancel pending allocation requests, this PR (#44036)

### Why are the changes needed?
Avoid unnecessary allocate requests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
MT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
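The "set target to 0" idea from step 3 can be sketched as follows. This is a hypothetical simplification: the real `YarnAllocator` tracks targets per ResourceProfile and rebuilds `ContainerRequest`s against the RM, none of which is modeled here:

```scala
// Toy allocator: pending asks are simply target minus running.
class DriverStopAllocator {
  private var targetExecutors = 0
  private var runningExecutors = 0

  def pending: Int = math.max(0, targetExecutors - runningExecutors)

  def requestTotalExecutors(n: Int): Unit = targetExecutors = n

  // On driver stop, zeroing the target makes every outstanding
  // allocate request cancellable on the next RM heartbeat.
  def stop(): Unit = requestTotalExecutors(0)
}
```

The design point is that cancellation falls out of the existing target-reconciliation path: no new "cancel" RPC is needed, the allocator just stops asking.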