
[SPARK-6924][YARN] Fix driver hangs in yarn-client mode when network is disconnected #5663

Closed

Conversation

SaintBacchus
Contributor

When the driver's network is disconnected for a while in yarn-client mode, an IOException occurs in the 'Yarn application state monitor' thread and causes the driver to hang forever.

To reproduce this scenario, do the following:

  • Run a Spark job in yarn-client mode
  • Run ifconfig {your NIC} down
  • After a while, run ifconfig {same NIC} up
  • The SparkSubmit JVM process will hang forever

The exception log looks like this:

INFO RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over linux-223/9.91.8.223:65110 after 12 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy15.renewLease(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:879)
        at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
        at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
        at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
        at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Network is unreachable
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:465)
        at sun.nio.ch.Net.connect(Net.java:457)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1438)

Jira: https://issues.apache.org/jira/browse/HDFS-3032 may be a related problem.
My solution is to catch the IOException from the renewLease logic and shut down the Spark driver.

…read 'Yarn application state monitor' within yarn-client mode and cause the driver hang forever.
@SaintBacchus
Contributor Author

It's the same problem discussed in #5523. @srowen, can you take a look at this problem?

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30840 has finished for PR 5663 at commit 5a28319.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -17,7 +17,7 @@

package org.apache.spark.deploy.yarn

import java.io.{File, FileOutputStream}
import java.io.{IOException, File, FileOutputStream}
Contributor


nit: sorting
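
Presumably the nit is to keep the imported members alphabetized, i.e. something like:

    import java.io.{File, FileOutputStream, IOException}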

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30898 has finished for PR 5663 at commit cf80049.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SaintBacchus
Contributor Author

@andrewor14 Can you also have a look at this problem?

@tgravescs
Contributor

@SaintBacchus, what is the exact behavior with this patch? Looking at the client code, it appears that if there is any network issue at all it will fail and kill the application. I don't think that is what we want here. If there is a very temporary network glitch, it could kill the application even though it would be fine if you waited a couple of seconds. At the very least we should have a certain number of retries.

Also note that monitorApplication is used in both client and cluster modes. In cluster mode nothing is running on the gateway, so we don't care if it loses the network temporarily; it won't actually affect the running application, yet here we would be returning something saying the application was killed when we don't really know that.

@SaintBacchus
Contributor Author

As you said:

At the very least we should have a certain number of retries.

Retries may not be necessary in this case in yarn-client mode, since the ExecutorLauncher has already been disconnected from the driver and has reported the application as successful to YARN.
Yarn-cluster mode will be OK because the driver and the AM are in the same process, so this won't cause any inconsistency in the application status.

@tgravescs
Contributor

Ok, I went and dug deeper into the YARN client, and it is handling retries and things like RM rolling upgrade for you. So you can ignore my comment about retries, as it's handled there.

My comment about yarn-cluster mode is because the YARN client, which runs on the gateway (where the application was launched from), calls monitorApplication to report status to the user.
If just the gateway node loses its network connection with the RM, but everything else in the cluster is fine, it would report to the user that the application has been killed when it could really be running fine. So I think the main thing here would be to just report to the user that the state is unknown. For that, perhaps we can return FinalApplicationStatus.UNDEFINED instead.

So what does YARN show the status of the application as in this case (when run in yarn-client mode)? I'm guessing what happens is the application master disconnects from the driver (onDisconnected) and then reports status as succeeded?

So if you bring the network back up, you just continue to see these exceptions then? If that is the case perhaps we should be doing something different with our yarn client.

@vanzin
Contributor

vanzin commented Apr 27, 2015

So what does YARN show the status of the application as in this case (when run in yarn-client mode)?

Hi @tgravescs, I think that's actually out of scope for this change. The backend here is fine; when the network is shaky, connections will be torn down and executors will go away since they lost their connection to the driver. Retrying to connect to the RM on the driver side won't help fix that (although you could potentially recover at some point and then have new executors allocated for you).

What I understand from the bug report is that when that happens, the driver process will stay around and not exit for some reason. Looking at the code I don't fully understand how that happens (seeing jstack output for the process would maybe clear that up), but that's what the change is trying to fix here.

@tgravescs
Contributor

@vanzin The question was basically trying to get at what is fully happening. I did not say we should fix the reporting; I agree with you that that is a separate issue.

You are also right that retrying to connect to the RM won't help fix things that have already gone away, but it can give you the real state. Just because some network connections are down doesn't mean all of them are. Reconnecting to the RM could get you the real state of the application on the cluster. We shouldn't just look at the exact case reported here; you have to look at other possibilities and how this patch would affect them. For instance, if the RM goes down and we can't get the application status, it doesn't mean the application master has disconnected from the client/driver or that the executors have died.

@tgravescs
Contributor

Note that in the end I just want to understand what is fully happening. Like you say, jstack and more information would be useful. It seems like once the network comes back up, the client should reconnect. If it has exhausted its retries, then perhaps we should be handling that in a general case and either exiting like we are here or recreating the client.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30984 has started for PR 5663 at commit cf80049.

@SaintBacchus
Contributor Author

@tgravescs Yeah, returning FinalApplicationStatus.UNDEFINED for the application is better than FAILED or KILLED, since the client can't really know the state of the application when the network is shaky.
So the modification will be like this:

    logError("Can't gain the status of application from Yarn because of exception: ", e)
    return (YarnApplicationState.FAILED, FinalApplicationStatus.UNDEFINED)
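
For context, here is a rough sketch of how such a catch could sit inside the polling loop of Client.monitorApplication. Only getApplicationReport, monitorApplication, and the YARN record classes appear in the stack traces above; the loop structure, polling interval, and surrounding details are assumptions, not the actual patch:

    import java.io.IOException
    import org.apache.hadoop.yarn.api.records.{ApplicationId, FinalApplicationStatus, YarnApplicationState}

    // Sketch only, not the exact change in this PR. Assumes getApplicationReport and
    // logError are available from the surrounding Client class, as in the stack traces.
    def monitorApplication(appId: ApplicationId): (YarnApplicationState, FinalApplicationStatus) = {
      while (true) {
        Thread.sleep(1000) // polling interval (assumed)
        val report =
          try {
            getApplicationReport(appId)
          } catch {
            case e: IOException =>
              // If the RM is unreachable, stop polling instead of letting this thread
              // die with an uncaught exception, and report an unknown final status.
              logError("Can't gain the status of application from Yarn because of exception: ", e)
              return (YarnApplicationState.FAILED, FinalApplicationStatus.UNDEFINED)
          }
        val state = report.getYarnApplicationState
        if (state == YarnApplicationState.FINISHED ||
            state == YarnApplicationState.FAILED ||
            state == YarnApplicationState.KILLED) {
          return (state, report.getFinalApplicationStatus)
        }
      }
      throw new IllegalStateException("unreachable") // satisfies the declared return type
    }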

Back to @vanzin's suggestion, I used jstack to capture the process (the network is back but the process still waits). It showed that the main thread was the only non-daemon thread, apart from some JVM threads such as the GC tasks.

"main" prio=10 tid=0x000000000060a800 nid=0xd7ae in Object.wait() [0x00007fe7dcfb3000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000e02b9a58> (a org.apache.spark.scheduler.JobWaiter)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
    - locked <0x00000000e02b9a58> (a org.apache.spark.scheduler.JobWaiter)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:526)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1586)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1655)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:906)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:35)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:611)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:171)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:115)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

"VM Thread" prio=10 tid=0x0000000000665800 nid=0xd7b1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000000620000 nid=0xd7af runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000000622000 nid=0xd7b0 runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007fe7d003d800 nid=0xd7b8 waiting on condition 

JNI global references: 29

I don't know the deeper reason why the main thread is still waiting, but my modification works for this problem.

@tgravescs
Contributor

@SaintBacchus

How long do you wait before bringing the network back up?

Is there anything in the logs from the YARN client during this time? Perhaps it retries and finally disconnects?

@SaintBacchus
Contributor Author

I shut down the network for about 3 minutes.
@tgravescs The main exception was shown in the description.
And I found another log like this:

Exception in thread "Yarn application state monitor" java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "vm-120/9.91.11.120"; des
tination host is: "linux-223":45022;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy8.getApplicationReport(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:429)
        at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:148)
        at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:655)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:132)
Caused by: java.net.SocketException: Network is unreachable
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:465)
        at sun.nio.ch.Net.connect(Net.java:457)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
        at org.apache.hadoop.ipc.Client.call(Client.java:1446)
        ... 14 more

So I tried to catch this exception in the "Yarn application state monitor" thread, and it works.
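
For anyone following along, here is a heavily simplified sketch of where that catch matters, based only on the frames in the stack trace above (YarnClientSchedulerBackend and Client.monitorApplication); the thread body and the sc.stop() call are assumptions about the surrounding code, not quotes of it:

    // Heavily simplified sketch of the "Yarn application state monitor" thread named in
    // the stack trace above. The exact body in YarnClientSchedulerBackend is assumed.
    val monitorThread = new Thread {
      override def run(): Unit = {
        // With the IOException caught inside monitorApplication, an unreachable RM makes
        // this call return a (FAILED, UNDEFINED) pair instead of killing the thread...
        val (state, _) = client.monitorApplication(appId)
        logError(s"Yarn application has exited with state $state")
        // ...so the backend can stop the SparkContext, which releases the main thread that
        // the jstack output shows blocked in JobWaiter.awaitResult, instead of hanging forever.
        sc.stop()
      }
    }
    monitorThread.setName("Yarn application state monitor")
    monitorThread.setDaemon(true)
    monitorThread.start()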

@SaintBacchus deleted the YarnClientNetWorkUnreachable branch on December 26, 2015, 06:40