[SPARK-14699][Core] Stop endpoints before closing the connections and don't stop client in Outbox #12481
zsxwing wants to merge 3 commits into apache:master from zsxwing:SPARK-14699
LGTM. I noticed this problem while testing the DriverRunner for another issue and was looking for a fix. After applying this patch, the Driver completes with a FINISHED status instead of FAILED (tested on a local cluster). I did notice some new warnings in the Driver log, but they don't seem to cause a problem. The app log still has an error saying the worker watcher was disconnected, but that may be a separate issue, and it doesn't seem to affect the application's final state. |
Yep, Netty may emit the network events after RpcEnv stops. But it doesn't matter since there are only a few warning logs.
Was the worker stopped at the same time? If the connection does disconnect before |
|
Test build #56141 has finished for PR 12481 at commit
|
No, the worker wasn't stopped while running this. I think this error could come from the following: when the application completes, the Master removes the application and kills the executors, but if the rpcEnv in the |
|
cc @andrewor14 |
Looking at the related code. |
|
@BryanCutler Fixed it in this PR. Outbox should not close the client since it will be reused by others. |
|
cc @vanzin |
@volatile var onDisconnectedCalled = false
@volatile var onNetworkErrorCalled = false
val anotherEnv = createRpcEnv(new SparkConf(), "remote", 0)
anotherEnv.setupEndpoint("SPARK-14699", new RpcEndpoint {
Minor, but using mockito (mock then verify) here would be a lot less code.
|
Looks alright; I assume no existing code relies on the "onDisconnected" event for cleaning things up? |
No. Just went through all |
|
Test build #56395 has finished for PR 12481 at commit
|
|
Test build #56399 has finished for PR 12481 at commit
|
|
retest this please |
|
Test build #56417 has finished for PR 12481 at commit
|
|
retest this please |
|
@zsxwing that test is failing everywhere, you don't need to bother retesting because of it. |
|
Test build #56445 has finished for PR 12481 at commit
|
|
retest this please |
|
Test build #56491 has finished for PR 12481 at commit
|
I confirmed this fix takes care of the error in the app log |
|
Thanks, merging to master. |
… and don't stop client in Outbox apache#12481
What changes were proposed in this pull request?
In general, onDisconnected is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected, so RpcEnv should not fire these events.
This PR moves dispatcher.stop() above closing the connections so that when stopping RpcEnv, the endpoints won't receive onDisconnected events.
In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well.
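The ordering argument can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyDispatcher, ToyConnection, RecordingEndpoint and ShutdownOrder are hypothetical stand-ins invented for illustration, not Spark's actual classes, and the real RpcEnv shutdown path involves more steps than shown here. The only point is that a dispatcher that has already been stopped drops disconnection events, so stopping it first keeps endpoints from seeing expected disconnections.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for illustration; these are NOT Spark's real classes.
trait Endpoint {
  def onDisconnected(remote: String): Unit
}

final class ToyDispatcher {
  private val endpoints = ArrayBuffer.empty[Endpoint]
  private var stopped = false

  def register(e: Endpoint): Unit = synchronized { endpoints += e }

  // After stop(), network events are dropped instead of delivered.
  def stop(): Unit = synchronized { stopped = true }

  def postDisconnected(remote: String): Unit = synchronized {
    if (!stopped) endpoints.foreach(_.onDisconnected(remote))
  }
}

final class ToyConnection(remote: String, dispatcher: ToyDispatcher) {
  // Closing a connection reports a disconnection to the dispatcher.
  def close(): Unit = dispatcher.postDisconnected(remote)
}

final class RecordingEndpoint extends Endpoint {
  var disconnects = 0
  def onDisconnected(remote: String): Unit = disconnects += 1
}

object ShutdownOrder {
  // Returns how many onDisconnected events the endpoint saw during shutdown.
  def shutdown(stopDispatcherFirst: Boolean): Int = {
    val dispatcher = new ToyDispatcher
    val endpoint = new RecordingEndpoint
    dispatcher.register(endpoint)
    val connections = Seq("a:1", "b:2").map(new ToyConnection(_, dispatcher))

    if (stopDispatcherFirst) {
      // Ordering proposed in this PR: stop endpoints, then close connections.
      dispatcher.stop()
      connections.foreach(_.close())
    } else {
      // Previous ordering: closing connections first leaks events to endpoints.
      connections.foreach(_.close())
      dispatcher.stop()
    }
    endpoint.disconnects
  }
}
```

In this toy model, ShutdownOrder.shutdown(stopDispatcherFirst = true) returns 0, while ShutdownOrder.shutdown(stopDispatcherFirst = false) returns 2, one event per closed connection.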
How was this patch tested?
test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")