[Spark-6924]Fix driver hangs in yarn-client mode when net is broken #5523

Closed
wants to merge 1 commit into from

Conversation

viper-kun
Contributor

In yarn-client mode, the client is deployed outside the cluster. When the network between the client and the cluster is broken, the driver loses all its executors. Ideally the client would return and the application would fail; in practice the driver hangs and the user cannot tell whether the application is still alive. The driver should return instead of hanging.
The solution: in the HeartbeatReceiver thread, check at a fixed rate whether any executor has sent a heartbeat to the driver. If no executor has sent a heartbeat, close the SparkContext.

https://issues.apache.org/jira/browse/SPARK-6924
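A minimal sketch of the proposed check, in Scala. The class and member names below (HeartbeatMonitor, executorLastSeen, executorTimeoutMs) are illustrative assumptions based on this description, not the actual patch.

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext

// Illustrative only: track the last heartbeat time per executor and stop the
// SparkContext once every executor has been silent for longer than the timeout.
class HeartbeatMonitor(sc: SparkContext, executorTimeoutMs: Long) {
  private val executorLastSeen = mutable.HashMap.empty[String, Long]

  // Called whenever an executor heartbeat arrives.
  def recordHeartbeat(executorId: String): Unit = {
    executorLastSeen(executorId) = System.currentTimeMillis()
  }

  // Called periodically by the timeout-checking thread.
  def expireDeadHosts(): Unit = {
    val now = System.currentTimeMillis()
    // Drop executors whose last heartbeat is older than the timeout.
    executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
    // If no executor is left, assume the network to the cluster is broken
    // and shut the driver down instead of letting it hang.
    if (executorLastSeen.isEmpty) {
      sc.stop()
    }
  }
}
```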

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 15, 2015

@viper-kun I closed the JIRA to make a point, since it did not describe a problem. Do you mean "hangs"? Still, I think you need to elaborate somewhere on what the nature of the problem is and why this is a resolution.

@viper-kun viper-kun changed the title [Spark-6924]Fix client hands in yarn-client mode when net is broken [Spark-6924]Fix client hangs in yarn-client mode when net is broken Apr 16, 2015
@viper-kun viper-kun changed the title [Spark-6924]Fix client hangs in yarn-client mode when net is broken [Spark-6924]Fix driver hangs in yarn-client mode when net is broken Apr 16, 2015
@viper-kun
Contributor Author

@srowen I updated the JIRA. Please review it. Thanks.

@@ -75,6 +75,8 @@ import org.apache.spark.util._
*/
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {

var isInited: Boolean = false
Member

This is going to conflict with the overhaul of the SparkContext constructor. I don't see why this depends on the constructor finishing since where you reference the SparkContext, it has been constructed. I also don't think that a lack of executor messages indicates a disconnection; it's not possible to distinguish from temporary loss of connectivity this way. I think you'd have to explain this a lot more (with tests) or close this.

Contributor Author

This is going to conflict with the overhaul of the SparkContext constructor. I don't see why this depends on the constructor finishing since where you reference the SparkContext, it has been constructed.

In the SparkContext constructor, when the HeartbeatReceiver is created, its timeoutCheckingThread starts checking for expired dead hosts. If executorLastSeen is empty, it executes sc.stop() while the SparkContext is still being constructed.
It then throws this exception:
java.lang.NullPointerException
at org.apache.spark.SparkContext.stop(SparkContext.scala:1416)
at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:134)
at org.apache.spark.HeartbeatReceiver$$anonfun$receive$1.applyOrElse(HeartbeatReceiver.scala:92)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:176)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:125)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:196)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:124)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:91)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/16 15:12:16 INFO netty.NettyBlockTransferService: Server created on 53493
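Extending the earlier sketch, a hedged guess at how the new isInited flag is meant to prevent this NPE. Only the flag itself appears in the diff; the placement of the guard is an assumption.

```scala
// Hypothetical expiry check that never stops a half-built SparkContext:
// the guard skips sc.stop() until the constructor has flipped isInited to
// true as its last statement, so the NullPointerException above cannot occur.
def expireDeadHosts(): Unit = {
  val now = System.currentTimeMillis()
  executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
  if (executorLastSeen.isEmpty && sc.isInited) {
    sc.stop()
  }
}
```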

Contributor Author

I also don't think that a lack of executor messages indicates a disconnection; it's not possible to distinguish from temporary loss of connectivity this way.

All executors send heartbeats to the driver at a fixed rate. If, over a period of time, all executors have expired, I think there is a disconnection. Is there a better way to distinguish this from a temporary loss of connectivity?
Could we check several times, and only conclude there is a disconnection if all executors are still expired? A rough sketch of that idea follows.
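A rough sketch of that refinement, again building on the illustrative HeartbeatMonitor sketch above; the counter and threshold are assumptions, not part of the patch.

```scala
// Require several consecutive expiry checks with zero live executors before
// concluding the network is gone, so a brief blip does not kill the driver.
private var consecutiveEmptyChecks = 0
private val maxEmptyChecks = 3 // illustrative threshold

def expireDeadHosts(): Unit = {
  val now = System.currentTimeMillis()
  executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
  if (executorLastSeen.isEmpty) {
    consecutiveEmptyChecks += 1
    if (consecutiveEmptyChecks >= maxEmptyChecks) {
      sc.stop()
    }
  } else {
    consecutiveEmptyChecks = 0
  }
}
```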

@srowen
Member

srowen commented Apr 16, 2015

Have a look at de4fa6b which may resolve the issue of bad state after construction.

I don't think it's correct to make callers check some status of SparkContext to decide whether calling a method is safe. stop() should handle this. I don't think you can call stop() just because you didn't hear from executors recently.
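For contrast, the alternative being suggested here is roughly that stop() itself tolerates a partially constructed context. A purely illustrative, self-contained sketch (the class and field names are invented, and this is not the actual de4fa6b change):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative pattern: stop() is idempotent and guards every subsystem
// reference, so calling it mid-construction skips fields that are still null
// instead of dereferencing them.
class PartiallyConstructedService {
  private val stopped = new AtomicBoolean(false)
  // These may still be null if stop() is called before construction finishes.
  private var scheduler: AutoCloseable = _
  private var env: AutoCloseable = _

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      Option(scheduler).foreach(_.close())
      Option(env).foreach(_.close())
    }
  }
}
```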

@viper-kun
Contributor Author

During construction, it is normal not to hear from executors. Only after construction, once executors have connected and sent heartbeats to the driver, can we tell whether there is a disconnection. SparkContext.isInited shows whether SparkContext construction has completed.

I don't think you can call stop() just because you didn't hear from executors recently.

Is there any better way?

@srowen
Member

srowen commented Apr 23, 2015

I think #5663 stands a better chance of addressing the issue and being merged. Do you mind commenting on that one, and closing this PR?

@viper-kun
Contributor Author

ok. I will close it.

@viper-kun viper-kun closed this Apr 24, 2015
@viper-kun viper-kun deleted the spark-6924 branch January 18, 2017 09:20