[SPARK-6924][YARN] Fix driver hangs in yarn-client mode when network is disconnected #5663
Conversation
…read 'Yarn application state monitor' within yarn-client mode and cause the driver hang forever.
Test build #30840 has finished for PR 5663 at commit
```diff
@@ -17,7 +17,7 @@
 package org.apache.spark.deploy.yarn

-import java.io.{File, FileOutputStream}
+import java.io.{IOException, File, FileOutputStream}
```
nit: sorting
Test build #30898 has finished for PR 5663 at commit
@andrewor14 Can you also have a look at this problem?
@SaintBacchus, what is the exact behavior with this patch? Looking at the client code, it appears that if there is any network issue at all, it will fail and kill the application. I don't think that is what we want here. If there is a very temporary network glitch, it could kill the application even though it would be fine if you waited a couple of seconds. At the very least we should have a certain number of retries. Also note that monitorApplication is used in both client and cluster modes. In cluster mode nothing is running on the gateway, so we don't care if it loses network temporarily, and it won't actually affect the running application; so here we would be returning something saying the application is killed when we don't really know that.
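To make the retry suggestion concrete, here is a minimal sketch of a bounded-retry wrapper around `getApplicationReport`. This is purely illustrative: the method name, retry limit, and backoff are made up, and (as noted in the follow-up below) the YARN client turns out to handle retries internally anyway.

```scala
import java.io.IOException

import org.apache.hadoop.yarn.api.records.{ApplicationId, ApplicationReport}
import org.apache.hadoop.yarn.client.api.YarnClient

// Hypothetical bounded-retry wrapper (illustration only, not part of any patch).
def reportWithRetries(
    yarnClient: YarnClient,
    appId: ApplicationId,
    maxRetries: Int): ApplicationReport = {
  var attempt = 0
  while (true) {
    try {
      return yarnClient.getApplicationReport(appId) // may throw on network loss
    } catch {
      case e: IOException if attempt < maxRetries =>
        attempt += 1
        Thread.sleep(1000L * attempt) // simple linear backoff before retrying
    }
  }
  throw new IllegalStateException("unreachable") // loop exits only via return or throw
}
```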
As you said:

Retries may not be necessary in this case in yarn-client mode, since the ExecutorLauncher has already been disconnected from the driver and has reported the success of the application to YARN.
Ok, I went and dug deeper into the YARN client, and it handles retries and things like RM rolling upgrades for you. So you can ignore my comment about retries, as it's handled there. My comment about yarn-cluster mode is because the yarn client, which runs on the gateway (where the application was launched from), calls monitorApplication to report status to the user. So what does YARN show the status of the application as in this case (when run in yarn-client mode)? I'm guessing what happens is the application master disconnects from the driver (onDisconnected) and then reports the status as succeeded? So if you bring the network back up, do you just continue to see these exceptions? If that is the case, perhaps we should be doing something different with our yarn client.
Hi @tgravescs, I think that's actually out of the scope of this change. The backend here is fine; when the network is shaky, connections will be torn down and executors will go away, since they lost their connection to the driver. Retrying to connect to the RM on the driver side won't help fix that (although you could potentially recover at some point and then have new executors allocated for you). What I understand from the bug report is that when that happens, the driver process will stay around and not exit for some reason. Looking at the code I don't fully understand how that happens (seeing a jstack of the hung driver would help).
@vanzin The question was basically trying to get at what is fully happening. I did not say we should fix the reporting; I agree with you that that is a separate issue. You are also right that retrying to connect to the RM won't help fix things that have already gone away, but it can give you the real state. Just because some network connections are down doesn't mean all of them are. Reconnecting to the RM could get you the real state of the application on the cluster. We shouldn't just look at the exact case reported here; you have to look at other possibilities and how this patch would affect them. For instance, if the RM goes down and we can't get the application status, it doesn't mean the application master has disconnected from the client/driver or that the executors have died.
Note that in the end I just want to understand what is fully happening. Like you say, a jstack and more information would be useful. It seems like once the network comes back up, the client should reconnect. If it has exhausted its retries, then perhaps we should be handling that in a general case and either exiting like we are here or recreating the client.
Test build #30984 has started for PR 5663 at commit
@tgravescs Yeah, in that case the status is set like this:

```scala
logError("Can't gain the status of application from Yarn because of exception: ", e)
return (YarnApplicationState.FAILED, FinalApplicationStatus.UNDEFINED)
```

Following the suggestion from @vanzin, I used jstack to dump the hung driver:

```
"main" prio=10 tid=0x000000000060a800 nid=0xd7ae in Object.wait() [0x00007fe7dcfb3000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000e02b9a58> (a org.apache.spark.scheduler.JobWaiter)
	at java.lang.Object.wait(Object.java:503)
	at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
	- locked <0x00000000e02b9a58> (a org.apache.spark.scheduler.JobWaiter)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:526)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1586)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1655)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:906)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:35)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:611)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:171)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:115)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

"VM Thread" prio=10 tid=0x0000000000665800 nid=0xd7b1 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000000620000 nid=0xd7af runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000000622000 nid=0xd7b0 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fe7d003d800 nid=0xd7b8 waiting on condition

JNI global references: 29
```

I don't know the deeper reason why the main thread is still waiting, but my modification works for this problem.
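To illustrate the hang mechanism visible in this dump, here is a toy program (not Spark code) showing how an uncaught exception kills only the helper thread while the main thread keeps waiting on its monitor object, just like `JobWaiter.awaitResult` above:

```scala
// Toy demo of the hang: the "monitor" thread dies from its exception, nobody
// ever notifies the waiter, and main blocks forever.
object HangDemo {
  def main(args: Array[String]): Unit = {
    val waiter = new Object
    val monitor = new Thread(new Runnable {
      override def run(): Unit = {
        // Stands in for the IOException thrown while polling YARN.
        throw new java.io.IOException("Network is unreachable")
      }
    }, "Yarn application state monitor")
    monitor.start()
    // No one will ever call waiter.notify(), so this waits forever.
    waiter.synchronized { waiter.wait() }
  }
}
```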
How long do you wait before bringing the network back up? Is there anything in the logs about the yarn client during this time? Perhaps it retries and finally disconnects?
I shut the network down for about 3 minutes. The following exception was thrown:

```
Exception in thread "Yarn application state monitor" java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "vm-120/9.91.11.120"; destination host is: "linux-223":45022;
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
	at org.apache.hadoop.ipc.Client.call(Client.java:1480)
	at org.apache.hadoop.ipc.Client.call(Client.java:1407)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy8.getApplicationReport(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:429)
	at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:148)
	at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:655)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:132)
Caused by: java.net.SocketException: Network is unreachable
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:465)
	at sun.nio.ch.Net.connect(Net.java:457)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
	at org.apache.hadoop.ipc.Client.call(Client.java:1446)
	... 14 more
```

So I tried to catch this exception in the "Yarn application state monitor" thread, and it works.
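For context, here is a simplified sketch of the monitor-thread side of this approach. It is not the verbatim patch; the thread name and the fact that `monitorApplication` blocks until a terminal state follow from the stack trace above, while the function-parameter shape is a stand-in for the real `Client` API:

```scala
import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, YarnApplicationState}
import org.apache.spark.SparkContext

// Simplified sketch: once monitorApplication maps the IOException to a FAILED
// terminal state instead of dying, this thread returns, stops the SparkContext,
// and the main thread parked in JobWaiter.awaitResult wakes up instead of
// hanging forever.
class MonitorThread(
    // Stand-in for Client.monitorApplication, which blocks until a terminal state.
    monitorApplication: () => (YarnApplicationState, FinalApplicationStatus),
    sc: SparkContext)
  extends Thread("Yarn application state monitor") {

  override def run(): Unit = {
    val (state, _) = monitorApplication()
    System.err.println(s"Yarn application has exited with state $state")
    sc.stop() // tearing down the context unblocks jobs waiting on results
  }
}
```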
When the driver's network is disconnected for a while in yarn-client mode, an IOException occurs in the "Yarn application state monitor" thread and causes the driver to hang forever.
To replay this scenario, do as follows:

1. `ifconfig {your NIC} down`
2. `ifconfig {same NIC} up`

The SparkSubmit JVM process will hang forever. The exception log is like the one shown above.
Jira: https://issues.apache.org/jira/browse/HDFS-3032 may be a related problem.
My solution was to catch the IOException raised when the `renewLease` logic happened and shut down the Spark driver.
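A minimal sketch of that catch-and-fail idea, assuming the `YarnClient`/`ApplicationReport` APIs that appear in the stack trace above (not the verbatim patch):

```scala
import java.io.IOException

import org.apache.hadoop.yarn.api.records.{ApplicationId, FinalApplicationStatus, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient

// Sketch: on IOException, report a FAILED terminal state instead of letting the
// exception kill the monitor thread, so the caller can shut the driver down.
def getState(
    yarnClient: YarnClient,
    appId: ApplicationId): (YarnApplicationState, FinalApplicationStatus) = {
  try {
    val report = yarnClient.getApplicationReport(appId)
    (report.getYarnApplicationState, report.getFinalApplicationStatus)
  } catch {
    case e: IOException =>
      System.err.println(s"Can't get the status of the application from YARN: $e")
      (YarnApplicationState.FAILED, FinalApplicationStatus.UNDEFINED)
  }
}
```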