[Spark-6924]Fix driver hangs in yarn-client mode when net is broken #5523

Closed
wants to merge 1 commit into from

Conversation

viper-kun
Contributor

In yarn-client mode, the client is deployed outside the cluster. When the network between the client and the cluster is broken, the driver loses all its executors. Ideally the client would return and the application would fail; in practice the driver hangs and the user cannot tell whether the application is still alive. The driver should return instead of hanging.
The solution: in the HeartbeatReceiver thread, check at a fixed rate whether any executor has sent a heartbeat to the driver. If no executor has sent a heartbeat, close the SparkContext.

https://issues.apache.org/jira/browse/SPARK-6924
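A minimal sketch of the proposed check, in Scala. The class and member names below (HeartbeatMonitor, executorLastSeen, executorTimeoutMs) are illustrative assumptions based on this description, not the actual patch.

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext

// Illustrative only: track the last heartbeat time per executor and stop the
// SparkContext once every executor has been silent for longer than the timeout.
class HeartbeatMonitor(sc: SparkContext, executorTimeoutMs: Long) {
  private val executorLastSeen = mutable.HashMap.empty[String, Long]

  // Called whenever an executor heartbeat arrives.
  def recordHeartbeat(executorId: String): Unit = {
    executorLastSeen(executorId) = System.currentTimeMillis()
  }

  // Called periodically by the timeout-checking thread.
  def expireDeadHosts(): Unit = {
    val now = System.currentTimeMillis()
    // Drop executors whose last heartbeat is older than the timeout.
    executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
    // If no executor is left, assume the network to the cluster is broken
    // and shut the driver down instead of letting it hang.
    if (executorLastSeen.isEmpty) {
      sc.stop()
    }
  }
}
```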

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 15, 2015

@viper-kun I closed the JIRA to make a point, since it did not describe a problem. Do you mean "hangs"? Still, I think you need to elaborate somewhere on what the nature of the problem is and why this is a resolution.

@viper-kun viper-kun changed the title [Spark-6924]Fix client hands in yarn-client mode when net is broken [Spark-6924]Fix client hangs in yarn-client mode when net is broken Apr 16, 2015
@viper-kun viper-kun changed the title [Spark-6924]Fix client hangs in yarn-client mode when net is broken [Spark-6924]Fix driver hangs in yarn-client mode when net is broken Apr 16, 2015
@viper-kun
Contributor Author

@srowen I updated the JIRA. Please review it. Thanks.

@@ -75,6 +75,8 @@ import org.apache.spark.util._
*/
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {

var isInited: Boolean = false
Member

This is going to conflict with the overhaul of the SparkContext constructor. I don't see why this depends on the constructor finishing since where you reference the SparkContext, it has been constructed. I also don't think that a lack of executor messages indicates a disconnection; it's not possible to distinguish from temporary loss of connectivity this way. I think you'd have to explain this a lot more (with tests) or close this.

Contributor Author

This is going to conflict with the overhaul of the SparkContext constructor. I don't see why this depends on the constructor finishing since where you reference the SparkContext, it has been constructed.

In the SparkContext constructor, when the HeartbeatReceiver is created, its timeoutCheckingThread starts checking for expired dead hosts. If executorLastSeen is empty, it executes sc.stop() while the SparkContext is still being constructed.
It then throws this exception:
java.lang.NullPointerException
at org.apache.spark.SparkContext.stop(SparkContext.scala:1416)
at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:134)
at org.apache.spark.HeartbeatReceiver$$anonfun$receive$1.applyOrElse(HeartbeatReceiver.scala:92)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:176)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:125)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:196)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:124)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:91)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/16 15:12:16 INFO netty.NettyBlockTransferService: Server created on 53493
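Extending the earlier sketch, a hedged guess at how the new isInited flag is meant to prevent this NPE. Only the flag itself appears in the diff; the placement of the guard is an assumption.

```scala
// Hypothetical expiry check that never stops a half-built SparkContext:
// the guard skips sc.stop() until the constructor has flipped isInited to
// true as its last statement, so the NullPointerException above cannot occur.
def expireDeadHosts(): Unit = {
  val now = System.currentTimeMillis()
  executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
  if (executorLastSeen.isEmpty && sc.isInited) {
    sc.stop()
  }
}
```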

Contributor Author

I also don't think that a lack of executor messages indicates a disconnection; it's not possible to distinguish from temporary loss of connectivity this way.

All executors send heartbeats to the driver at a fixed rate. If, over a period of time, all executors have expired, I think there is a disconnection. Is there a better way to distinguish this from a temporary loss of connectivity?
Could we check several times, and only conclude there is a disconnection if all executors are still expired? A rough sketch of that idea follows.
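A rough sketch of that refinement, again building on the illustrative HeartbeatMonitor sketch above; the counter and threshold are assumptions, not part of the patch.

```scala
// Require several consecutive expiry checks with zero live executors before
// concluding the network is gone, so a brief blip does not kill the driver.
private var consecutiveEmptyChecks = 0
private val maxEmptyChecks = 3 // illustrative threshold

def expireDeadHosts(): Unit = {
  val now = System.currentTimeMillis()
  executorLastSeen.retain((_, lastSeen) => now - lastSeen < executorTimeoutMs)
  if (executorLastSeen.isEmpty) {
    consecutiveEmptyChecks += 1
    if (consecutiveEmptyChecks >= maxEmptyChecks) {
      sc.stop()
    }
  } else {
    consecutiveEmptyChecks = 0
  }
}
```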

@srowen
Member

srowen commented Apr 16, 2015

Have a look at de4fa6b which may resolve the issue of bad state after construction.

I don't think it's correct to make callers check some status of SparkContext to decide whether calling a method is safe. stop() should handle this. I don't think you can call stop() just because you didn't hear from executors recently.
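For contrast, the alternative being suggested here is roughly that stop() itself tolerates a partially constructed context. A purely illustrative, self-contained sketch (the class and field names are invented, and this is not the actual de4fa6b change):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative pattern: stop() is idempotent and guards every subsystem
// reference, so calling it mid-construction skips fields that are still null
// instead of dereferencing them.
class PartiallyConstructedService {
  private val stopped = new AtomicBoolean(false)
  // These may still be null if stop() is called before construction finishes.
  private var scheduler: AutoCloseable = _
  private var env: AutoCloseable = _

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      Option(scheduler).foreach(_.close())
      Option(env).foreach(_.close())
    }
  }
}
```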

@viper-kun
Contributor Author

During construction, it is normal not to hear from executors. Only after construction, once executors have connected and sent heartbeats to the driver, can we tell whether there is a disconnection. SparkContext.isInited shows whether SparkContext construction has completed.

I don't think you can call stop() just because you didn't hear from executors recently.

Is there any better way?

@srowen
Member

srowen commented Apr 23, 2015

I think #5663 stands a better chance of addressing the issue and being merged. Do you mind commenting on that one, and closing this PR?

@viper-kun
Contributor Author

ok. I will close it.

@viper-kun viper-kun closed this Apr 24, 2015
@viper-kun viper-kun deleted the spark-6924 branch January 18, 2017 09:20