[SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook

1. Wait for workers and masters to exit, so that none of them is still running when the shutdown hook is triggered.
2. Set ExecutorState to FAILED if it is still RUNNING when the shutdown hook executes.

This should fix the following potential exceptions when exiting a local cluster:
```
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10269 from zsxwing/executor-state.
zsxwing committed Dec 14, 2015
1 parent 8af2f8c commit 2aecda2
Showing 3 changed files with 9 additions and 3 deletions.
@@ -75,6 +75,8 @@ class LocalSparkCluster(
     // Stop the workers before the master so they don't get upset that it disconnected
     workerRpcEnvs.foreach(_.shutdown())
     masterRpcEnvs.foreach(_.shutdown())
+    workerRpcEnvs.foreach(_.awaitTermination())
+    masterRpcEnvs.foreach(_.awaitTermination())
     masterRpcEnvs.clear()
     workerRpcEnvs.clear()
   }
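
The hunk above separates "request shutdown" from "wait until actually stopped". A minimal sketch of that two-phase pattern follows. It does not use Spark's `RpcEnv`: `java.util.concurrent.ExecutorService` is swapped in as a stand-in, and `ShutdownThenAwait`, `stopAll`, and the 10-second bound are assumptions made for illustration only.

```scala
import java.util.concurrent.{Executors, ExecutorService, TimeUnit}

object ShutdownThenAwait {
  // `workers` and `masters` are hypothetical stand-ins for the worker and
  // master RpcEnvs in the diff; they are plain JDK executor services here.
  def stopAll(workers: Seq[ExecutorService], masters: Seq[ExecutorService]): Boolean = {
    // Phase 1: ask everything to shut down, workers before masters.
    workers.foreach(_.shutdown())
    masters.foreach(_.shutdown())
    // Phase 2: block until each one has terminated. The bounded wait is an
    // assumption of this sketch. Returns false as soon as one wait times out.
    (workers ++ masters).forall(_.awaitTermination(10, TimeUnit.SECONDS))
  }

  def main(args: Array[String]): Unit = {
    val workers = Seq.fill(2)(Executors.newSingleThreadExecutor())
    val masters = Seq.fill(1)(Executors.newSingleThreadExecutor())
    // Give each worker a short task so there is something to wait for.
    workers.foreach(_.execute(new Runnable { def run(): Unit = Thread.sleep(50) }))
    println(s"all terminated: ${stopAll(workers, masters)}")
  }
}
```

Without the second phase, `stopAll` could return while worker threads are still finishing, which is the kind of window the commit message describes for the shutdown hook.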
@@ -257,9 +257,8 @@ private[deploy] class Master(
         exec.state = state

         if (state == ExecutorState.RUNNING) {
-          if (oldState != ExecutorState.LAUNCHING) {
-            logWarning(s"Executor $execId state transfer from $oldState to RUNNING is unexpected")
-          }
+          assert(oldState == ExecutorState.LAUNCHING,
+            s"executor $execId state transfer from $oldState to RUNNING is illegal")
           appInfo.resetRetryCount()
         }
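
To make the assertion in this hunk concrete, here is a small self-contained sketch of the same check. `TransitionCheck`, `checkTransition`, and the three-value `ExecutorState` enumeration are hypothetical reductions written for this note; they are not the types in `Master`.

```scala
object TransitionCheck {
  // Hypothetical, reduced set of executor states; only the values named in
  // this commit are modelled.
  object ExecutorState extends Enumeration {
    val LAUNCHING, RUNNING, FAILED = Value
  }

  // Same rule as the assertion in the hunk: RUNNING may only be entered
  // from LAUNCHING. Any other transition is left unchecked in this sketch.
  def checkTransition(
      execId: Int,
      oldState: ExecutorState.Value,
      newState: ExecutorState.Value): Unit = {
    if (newState == ExecutorState.RUNNING) {
      assert(oldState == ExecutorState.LAUNCHING,
        s"executor $execId state transfer from $oldState to RUNNING is illegal")
    }
  }
}
```

Calling `checkTransition(4, ExecutorState.RUNNING, ExecutorState.RUNNING)` fails with an `AssertionError`, which is the shape of the first stack trace in the commit message: a second RUNNING report arrives for an executor that is already RUNNING.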

@@ -71,6 +71,11 @@ private[deploy] class ExecutorRunner(
     workerThread.start()
     // Shutdown hook that kills actors on shutdown.
     shutdownHook = ShutdownHookManager.addShutdownHook { () =>
+      // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
+      // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
+      if (state == ExecutorState.RUNNING) {
+        state = ExecutorState.FAILED
+      }
       killProcess(Some("Worker shutting down")) }
   }
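
The effect of the added guard can be sketched in isolation. Everything below is a hypothetical stand-in: `HookStateSketch`, `Runner`, `onShutdown`, and `killedWith` are invented for illustration, and the hook body is called directly rather than registered with `ShutdownHookManager`.

```scala
object HookStateSketch {
  // Hypothetical, reduced set of executor states.
  object ExecutorState extends Enumeration {
    val LAUNCHING, RUNNING, FAILED = Value
  }

  // Stripped-down stand-in for a runner that holds a mutable state field.
  class Runner(@volatile var state: ExecutorState.Value) {
    // Records the kill message instead of killing a real process.
    @volatile var killedWith: Option[String] = None

    private def killProcess(message: Option[String]): Unit = {
      killedWith = message
    }

    // Mirrors the hook body in the hunk: never leave the state at RUNNING
    // once shutdown has begun, then kill the process.
    def onShutdown(): Unit = {
      if (state == ExecutorState.RUNNING) {
        state = ExecutorState.FAILED
      }
      killProcess(Some("Worker shutting down"))
    }
  }
}
```

In this sketch a runner that is still RUNNING when `onShutdown()` runs ends up FAILED, so a later state report cannot claim RUNNING again; a runner in any other state keeps its state.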
