[SPARK-13604][Core]Sync worker's state after registering with master #11455
cc @andrewor14
Test build #52279 has finished for PR 11455 at commit
Test build #52272 has finished for PR 11455 at commit
Test build #52275 has finished for PR 11455 at commit
Test build #52339 has finished for PR 11455 at commit
ping @andrewor14
    case Some(worker) =>
      for (exec <- executors) {
        if (!worker.executors.exists(
          e => e._2.application.id == exec.appId && e._2.id == exec.execId)) {
style: can you use `.exists { case (_, something) => something.application.id ... }` and store it in a variable? e.g.

    for (exec <- executors) {
      val executorMatches = worker.executors.exists { ... }
      if (!executorMatches) {
        worker.endpoint.send(...)
      }
    }

LGTM, just style nits.
Test build #52838 has finished for PR 11455 at commit
retest this please
Test build #52853 has finished for PR 11455 at commit
retest this please
Test build #52861 has finished for PR 11455 at commit
Merged into master.
    val driverMatches = worker.drivers.exists { case (id, _) => id == driverId }
    if (!driverMatches) {
      // master doesn't recognize this driver. So just tell worker to kill it.
      worker.endpoint.send(KillDriver(driverId))
It looks like there may be a scenario where an executor gets killed but its driver is kept, or vice versa.
Is that desirable?
I don't get it. Here we just compare them with the executors and drivers of the worker as stored in the master. If we find any mismatch, we just kill it.
I don't think so. Which part of the code leads you to believe that?
Let me look at other parts of Master.scala and see if I can find anything.
## What changes were proposed in this pull request?

The following lists all the cases in which Master cannot talk with Worker for a while and the network then comes back.

1. Master doesn't know about the network issue (no timeout yet).

   a. Worker doesn't know about the network issue (`onDisconnected` is not called).
      - Worker keeps sending `Heartbeat`. Neither Worker nor Master knows about the network issue, so there is nothing to do. (Eventually, Master will notice the heartbeat timeout if the network does not recover.)

   b. Worker knows about the network issue (`onDisconnected` is called).
      - Worker stops sending `Heartbeat` and sends `RegisterWorker` to Master. Master replies `RegisterWorkerFailed("Duplicate worker ID")`, and Worker calls `System.exit(1)`. (Eventually, Master will notice the heartbeat timeout if the network does not recover.) (This may leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602).)

2. Worker timeout (Master knows about the network issue). In this case, Master removes the Worker and its executors and drivers.

   a. Worker doesn't know about the network issue (`onDisconnected` is not called).
      - Worker keeps sending `Heartbeat`.
      - If the network comes back, i.e. Master receives a `Heartbeat`, Master sends `ReconnectWorker` to Worker.
      - Worker sends `RegisterWorker` to Master.
      - Master accepts `RegisterWorker` but doesn't know about the executors and drivers in Worker. (This may leak executors.)

   b. Worker knows about the network issue (`onDisconnected` is called).
      - Worker stops sending `Heartbeat` and sends `RegisterWorker` to Master.
      - Master accepts `RegisterWorker` but doesn't know about the executors and drivers in Worker. (This may leak executors.)

This PR fixes the executor and driver leaks in 2.a and 2.b when Worker re-registers with Master. The approach is to make Worker send `WorkerLatestState` to sync its state after registering with Master successfully. Master then asks Worker to kill the executors and drivers it does not know about.
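The reconciliation step described above can be sketched as a pure function. This is only an illustration under stated assumptions, not the code in `Master.scala`: `ExecutorRef`, `MasterView`, `KillExecutor` and `reconcile` are hypothetical, simplified stand-ins, and the `WorkerLatestState` and `KillDriver` case classes here carry only the fields needed for the example.

```scala
object WorkerStateSync {
  // Hypothetical, simplified stand-ins for Spark's internal records and messages.
  final case class ExecutorRef(appId: String, execId: Int)

  sealed trait KillCommand
  final case class KillExecutor(appId: String, execId: Int) extends KillCommand
  final case class KillDriver(driverId: String) extends KillCommand

  /** What the master currently believes is running on one worker. */
  final case class MasterView(executors: Set[ExecutorRef], driverIds: Set[String])

  /** What the worker reports right after it has re-registered. */
  final case class WorkerLatestState(
      workerId: String,
      executors: Seq[ExecutorRef],
      driverIds: Seq[String])

  /** Returns a kill command for everything the worker reports but the master does not know. */
  def reconcile(view: MasterView, report: WorkerLatestState): Seq[KillCommand] = {
    val executorKills: Seq[KillCommand] =
      report.executors
        .filterNot(e => view.executors.contains(e))
        .map(e => KillExecutor(e.appId, e.execId))
    val driverKills: Seq[KillCommand] =
      report.driverIds
        .filterNot(id => view.driverIds.contains(id))
        .map(id => KillDriver(id))
    executorKills ++ driverKills
  }
}
```

In this sketch the master only ever asks for kills of entries missing from its own view, so anything it launched itself on the worker is left untouched.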
Note: Worker cannot simply kill its executors after registering with Master, because in the Worker `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` is handled before `RegisteredWorker`, Worker's executor list will already contain new executors launched after Master accepted `RegisterWorker`. These executors should not be killed. So Worker sends its list to Master and lets Master tell it which executors should be killed.

## How was this patch tested?

test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#11455 from zsxwing/orphan-executors.
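The ordering problem in the note on `LaunchExecutor` and `RegisteredWorker` can be illustrated with a small simulation. It is a sketch under assumptions: executors are reduced to hypothetical integer IDs, the `Event` types below are toy models rather than Spark's messages, and `heldWhenRegistered` and `unknownToMaster` are names invented for this example.

```scala
object RegistrationRace {
  // Toy model of the two messages the worker may handle in either order.
  sealed trait Event
  final case class LaunchExecutor(execId: Int) extends Event
  case object RegisteredWorker extends Event

  /** Executors the worker holds at the moment it handles RegisteredWorker:
    * the stale ones from before the outage plus any launched before the ack was handled. */
  def heldWhenRegistered(stale: Set[Int], events: Seq[Event]): Set[Int] = {
    val handledBeforeAck = events.takeWhile(_ != RegisteredWorker)
    stale ++ handledBeforeAck.collect { case LaunchExecutor(id) => id }
  }

  /** Master-directed policy: only executors the master does not know are killed. */
  def unknownToMaster(reported: Set[Int], knownToMaster: Set[Int]): Set[Int] =
    reported.diff(knownToMaster)
}
```

With stale executors 1 and 2 and the events `LaunchExecutor(7)` followed by `RegisteredWorker`, the worker holds 1, 2 and 7 when it handles the acknowledgement. A worker-local "kill everything I hold" policy would also kill the fresh executor 7, whereas `unknownToMaster` with a master that knows only 7 selects just 1 and 2.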