Skip to content

[SPARK-14699][Core]Stop endpoints before closing the connections and don't stop client in Outbox#12481

Closed
zsxwing wants to merge 3 commits intoapache:masterfrom
zsxwing:SPARK-14699
Closed

[SPARK-14699][Core]Stop endpoints before closing the connections and don't stop client in Outbox#12481
zsxwing wants to merge 3 commits intoapache:masterfrom
zsxwing:SPARK-14699

Conversation

@zsxwing
Copy link
Member

@zsxwing zsxwing commented Apr 18, 2016

What changes were proposed in this pull request?

In general, onDisconnected is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected so RpcEnv should not fire these events.

This PR moves dispatcher.stop() above closing the connections so that when stopping RpcEnv, the endpoints won't receive onDisconnected events.

In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well.

How was this patch tested?

test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")

@BryanCutler
Copy link
Member

BryanCutler commented Apr 18, 2016

LGTM. I noticed this problem in testing out the DriverRunner for another issue and was looking for a fix. After applying this patch, the Driver will complete with a FINISHED status instead of FAILED (tested on a local cluster).

I did notice some new warnings in the Driver log though, but doesn't seem to cause a problem

WARN Dispatcher: Message RemoteProcessDisconnected(oc7258617548.ibm.com:7077) dropped. RpcEnv already stopped.

And the app log still has an error that the worker watcher was disconnect, but maybe that is a separate issue and it doesn't seem to affect the application final state

16/04/18 14:55:57 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/04/18 14:55:57 INFO MemoryStore: MemoryStore cleared
16/04/18 14:55:57 INFO BlockManager: BlockManager stopped
16/04/18 14:55:57 ERROR WorkerWatcher: Lost connection to worker rpc endpoint spark://Worker@192.168.1.22:39166. Exiting.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 18, 2016

WARN Dispatcher: Message RemoteProcessDisconnected(oc7258617548.ibm.com:7077) dropped. RpcEnv already stopped.

Yep, Netty may emit the network events after RpcEnv stops. But it doesn't matter since there are only a few warning logs.

16/04/18 14:55:57 ERROR WorkerWatcher: Lost connection to worker rpc endpoint spark://Worker@192.168.1.22:39166. Exiting.

Was the worker stopped at the same time? If the connection does disconnect before dispatcher.stop, this could happen.

@SparkQA
Copy link

SparkQA commented Apr 18, 2016

Test build #56141 has finished for PR 12481 at commit 91c5bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

Was the worker stopped at the same time?

No the worker wasn't stopped while running this. I think this error could be from this: when the application completes, the Master removes the application and kill executors, but if the rpcEnv in the CoarseGrainedExecutorBackend hasn't fully terminated then the WorkerWatcher will receive the onDisconnect message.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

cc @andrewor14

@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

No the worker wasn't stopped while running this. I think this error could be from this: when the application completes, the Master removes the application and kill executors, but if the rpcEnv in the CoarseGrainedExecutorBackend hasn't fully terminated then the WorkerWatcher will receive the onDisconnect message.

Looking the related codes.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

@BryanCutler Fixed it in this PR. Outbox should not close the client since it will be reused by others.

@zsxwing zsxwing changed the title [SPARK-14699][Core]Stop endpoints before closing the connections [SPARK-14699][Core]Stop endpoints before closing the connections and don't stop client in Outbox Apr 20, 2016
@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

cc @vanzin

@volatile var onDisconnectedCalled = false
@volatile var onNetworkErrorCalled = false
val anotherEnv = createRpcEnv(new SparkConf(), "remote", 0)
anotherEnv.setupEndpoint("SPARK-14699", new RpcEndpoint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but using mockito (mock then verify) here would be a lot less code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@vanzin
Copy link
Contributor

vanzin commented Apr 20, 2016

Looks alright; I assume no existing code relies on the "onDisconnected" event for cleaning things up?

@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

Looks alright; I assume no existing code relies on the "onDisconnected" event for cleaning things up?

No. Just went through all onDisconnected methods. Cleaning up should be done in onStop.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56395 has finished for PR 12481 at commit 7c78927.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56399 has finished for PR 12481 at commit cee0e42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 20, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56417 has finished for PR 12481 at commit cee0e42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 21, 2016

retest this please

@vanzin
Copy link
Contributor

vanzin commented Apr 21, 2016

@zsxwing that test is failing everywhere, you don't need to bother retesting because of it.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56445 has finished for PR 12481 at commit cee0e42.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Apr 21, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56491 has finished for PR 12481 at commit cee0e42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

BryanCutler commented Apr 21, 2016

@BryanCutler Fixed it in this PR. Outbox should not close the client since it will be reused by others.

I confirmed this fix takes care of the error in the app log

@zsxwing
Copy link
Member Author

zsxwing commented Apr 21, 2016

Thanks, merging to master.

@asfgit asfgit closed this in e4904d8 Apr 21, 2016
@zsxwing zsxwing deleted the SPARK-14699 branch April 21, 2016 19:06
zzcclp added a commit to zzcclp/spark that referenced this pull request Apr 22, 2016
bruckner pushed a commit to bruckner/spark that referenced this pull request Jul 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants