[SPARK-13604][Core]Sync worker's state after registering with master #11455

Closed
wants to merge 2 commits into apache:master from zsxwing:orphan-executors

Conversation

@zsxwing (Member) commented Mar 2, 2016

What changes were proposed in this pull request?

Here are all the cases in which the Master cannot talk with a Worker for a while and then the network comes back.

  1. Master doesn't know about the network issue (the Worker has not yet timed out)

    a. Worker doesn't know about the network issue (onDisconnected is not called)

    • Worker keeps sending Heartbeat. Neither Worker nor Master knows about the network issue, so there is nothing to do. (Eventually, Master will notice the heartbeat timeout if the network is not recovered.)

    b. Worker knows about the network issue (onDisconnected is called)

    • Worker stops sending Heartbeat and sends RegisterWorker to Master. Master replies RegisterWorkerFailed("Duplicate worker ID") and Worker calls System.exit(1). (Eventually, Master will notice the heartbeat timeout if the network is not recovered.) (May leak driver processes; see SPARK-13602.)
  2. Worker times out (Master knows about the network issue). In this case, Master removes the Worker and its executors and drivers.

    a. Worker doesn't know about the network issue (onDisconnected is not called)

    • Worker keeps sending Heartbeat.
    • If the network comes back, i.e. Master receives a Heartbeat, Master sends ReconnectWorker to the Worker.
    • Worker sends RegisterWorker to the Master.
    • Master accepts RegisterWorker but doesn't know about the executors and drivers in the Worker. (May leak executors.)

    b. Worker knows about the network issue (onDisconnected is called)

    • Worker stops sending Heartbeat and sends RegisterWorker to the Master.
    • Master accepts RegisterWorker but doesn't know about the executors and drivers in the Worker. (May leak executors.)

This PR fixes the executor and driver leaks in 2.a and 2.b when the Worker re-registers with the Master. The approach is to make the Worker send a WorkerLatestState message to sync its state right after it successfully registers with the Master; the Master then asks the Worker to kill any executors and drivers it does not recognize.
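At its core, the fix adds a new deploy message carrying the Worker's view of its own state. A rough, self-contained sketch of the idea follows (the type and field names here are simplified stand-ins for illustration, not the exact definitions in the patch):

// Simplified stand-ins for illustration only; not the actual Spark definitions.
case class ExecutorDescription(appId: String, execId: Int)
case class WorkerLatestState(
    workerId: String,                     // the ID the Worker registered under
    executors: Seq[ExecutorDescription],  // executors the Worker believes it is running
    driverIds: Seq[String])               // drivers the Worker believes it is running

// Worker side (sketch): once RegisteredWorker arrives, report the current
// executor/driver lists so the Master can reconcile its own view, e.g.
//   masterRef.send(WorkerLatestState(workerId, runningExecutors, runningDriverIds))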

Note: the Worker cannot simply kill executors on its own after registering with the Master, because inside the Worker, LaunchExecutor and RegisteredWorker are processed on two different threads. If LaunchExecutor happens before RegisteredWorker, the Worker's executor list may already contain new executors by the time the Master accepts RegisterWorker, and those executors must not be killed. So the Worker sends its list to the Master and lets the Master tell the Worker which executors should be killed.
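On the Master side, the reconciliation then boils down to filtering the reported state against what the Master has on record for that worker; the real handler asks the Worker to kill every mismatch (KillDriver is visible in the hunk quoted later in this review). A sketch over the simplified types above:

// Stand-ins for the Master's own bookkeeping; again illustrative only.
case class MasterExecutorRecord(appId: String, execId: Int)
case class MasterWorkerRecord(
    executors: Map[String, MasterExecutorRecord],
    driverIds: Set[String])

// Anything the Worker reports that the Master has no record of should be killed.
def unknownWork(
    reported: WorkerLatestState,
    recorded: MasterWorkerRecord): (Seq[ExecutorDescription], Seq[String]) = {
  val executorsToKill = reported.executors.filterNot { exec =>
    recorded.executors.values.exists(e => e.appId == exec.appId && e.execId == exec.execId)
  }
  val driversToKill = reported.driverIds.filterNot(recorded.driverIds.contains)
  // The real Master would now send a kill message to the Worker's endpoint
  // for every entry in these two lists.
  (executorsToKill, driversToKill)
}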

How was this patch tested?

test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")

@zsxwing changed the title from "Sync worker's state after registering with master" to "[SPARK-13604][Core]Sync worker's state after registering with master" on Mar 2, 2016
@zsxwing (Member, Author) commented Mar 2, 2016

cc @andrewor14

@SparkQA commented Mar 2, 2016

Test build #52279 has finished for PR 11455 at commit 7e0b2a2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WorkerState(

@SparkQA commented Mar 2, 2016

Test build #52272 has finished for PR 11455 at commit 6c13702.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 2, 2016

Test build #52275 has finished for PR 11455 at commit 97002e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 2, 2016

Test build #52339 has finished for PR 11455 at commit 1b95f5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WorkerLatestState(

@zsxwing (Member, Author) commented Mar 9, 2016

ping @andrewor14

case Some(worker) =>
  for (exec <- executors) {
    if (!worker.executors.exists(
        e => e._2.application.id == exec.appId && e._2.id == exec.execId)) {
Contributor (inline comment):

style: can you use .exists { case (_, something) => something.application.id ... } and store it in a variable? e.g.

for (exec <- executors) {
  val executorMatches = worker.executors.exists { ... }
  if (!executorMatches) {
    worker.endpoint.send(...)
  }
}
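Concretely, the suggested refactor might read as follows (a sketch: the exists-predicate comes from the hunk quoted above, while the exact kill message and its arguments are assumed here rather than taken from the patch):

for (exec <- executors) {
  // Does the Master already have a record of this executor on this worker?
  val executorMatches = worker.executors.exists {
    case (_, e) => e.application.id == exec.appId && e.id == exec.execId
  }
  if (!executorMatches) {
    // Unknown to the Master, so ask the Worker to kill it.
    worker.endpoint.send(KillExecutor(masterUrl, exec.appId, exec.execId))
  }
}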

@andrewor14 (Contributor) commented:

LGTM, just style nits.

@SparkQA commented Mar 10, 2016

Test build #52838 has finished for PR 11455 at commit 51ac6dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Mar 10, 2016

retest this please

@SparkQA commented Mar 10, 2016

Test build #52853 has finished for PR 11455 at commit 51ac6dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Mar 10, 2016
zsxwing commented Mar 10, 2016

retest this please

@SparkQA commented Mar 11, 2016

Test build #52861 has finished for PR 11455 at commit 51ac6dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented:

Merged into master.

@asfgit closed this in 27fe6ba on Mar 11, 2016
@zsxwing deleted the orphan-executors branch on March 11, 2016 01:02
val driverMatches = worker.drivers.exists { case (id, _) => id == driverId }
if (!driverMatches) {
  // master doesn't recognize this driver. So just tell worker to kill it.
  worker.endpoint.send(KillDriver(driverId))
Contributor (inline comment):

Looks like there may be a scenario where an executor gets killed but its driver is kept, or vice versa.

Is that desirable?

@zsxwing (Member, Author) replied:

I don't get it. This just compares them with the executors and drivers of the worker stored in the Master; if we find any mismatch, we just kill it.

Contributor replied:

I don't think so. Which part of the code leads you to believe that?

Contributor replied:

Let me look at other parts of Master.scala and see if I can find anything.

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#11455 from zsxwing/orphan-executors.