
[SPARK-12967][Netty] Avoid NettyRpc error message during sparkContext shutdown #10881

Closed
wants to merge 4 commits

Conversation

nishkamravi2
Contributor

If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence.

@vanzin

@SparkQA

SparkQA commented Jan 23, 2016

Test build #49924 has finished for PR 10881 at commit 2c07186.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2183,6 +2183,13 @@ object SparkContext extends Logging {
}

/**
* Return SparkContext
*/
def get(): SparkContext = {
Contributor

Is this meant to be part of the public API or should this be private?

Contributor Author

@holdenk, figured this may be a good utility to have in SparkContext alongside getOrCreate. Unless you think otherwise?

Contributor

No, please avoid introducing random new APIs in an unrelated pull request.

Also, it would be great not to have a lower level depend on an upper level (e.g. the network RPC layer depending on SparkContext).

Contributor Author

@rxin Modified it to private. Any ideas on how to fix this without using SparkContext?

Contributor Author

We could potentially just drop a message if it cannot be sent and log a warning?

@@ -2183,6 +2183,13 @@ object SparkContext extends Logging {
}

/**
* Return SparkContext
*/
private[spark] def get(): SparkContext = {
Contributor

I think it's best to just remove this get method. We should have the upper layer (i.e. SparkContext) send an explicit signal via function calls to the lower layer (e.g. the RPC layer).

@zsxwing might know more about whether this is needed at all.

Member

Actually, there is already such an explicit signal: RpcEnv.shutdown. Previously, I was worried about the immature NettyRpcEnv, so the stack trace of the error is still logged to help debugging after RpcEnv is shut down. However, since NettyRpcEnv has matured, we can remove the stack trace.

@zsxwing
Member

zsxwing commented Jan 23, 2016

@nishkamravi2 I suggest that you add a new exception RpcEnvStopException that extends IllegalStateException. Then you can use RpcEnvStopException in this line:

new IllegalStateException("RpcEnv already stopped.")

After that, there are two places that need to handle this exception:

  1. Try-catch RpcEnvStopException around the following line and only log a simple message:

dispatcher.postOneWayMessage(message)

  2. Check the exception type at the following line:

logWarning(s"Failed to send one-way RPC.", e)

If it's RpcEnvStopException, log a simple message.
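
A rough sketch of how those two pieces could fit together (an illustration of the suggestion above, not the merged patch; the Dispatcher trait and logWarning stub below are stand-ins for the real NettyRpcEnv/Dispatcher members):

object RpcEnvStopSketch {

  // Dedicated exception type so a shutdown race can be told apart from a real failure.
  class RpcEnvStopException extends IllegalStateException("RpcEnv already stopped.")

  // Stand-ins for the real Dispatcher and Logging members (illustrative only).
  trait Dispatcher { def postOneWayMessage(message: Any): Unit }
  private def logWarning(msg: String, e: Throwable = null): Unit =
    Console.err.println(s"WARN: $msg" + Option(e).map(t => s" ($t)").getOrElse(""))

  // 1. One-way send path: catch the new exception and log only a short message.
  def send(dispatcher: Dispatcher, message: Any): Unit =
    try dispatcher.postOneWayMessage(message)
    catch { case e: RpcEnvStopException => logWarning(e.getMessage) }

  // 2. Failure callback: keep the full stack trace only for unexpected errors.
  def onSendFailure(e: Throwable): Unit = e match {
    case _: RpcEnvStopException => logWarning("Failed to send one-way RPC: RpcEnv is stopped.")
    case other                  => logWarning("Failed to send one-way RPC.", other)
  }
}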

@nishkamravi2
Contributor Author

Thanks for the pointers @zsxwing. Makes sense to just log an error message. Instead of throwing an exception from postMessage and logging it in send, what if we just log the warning in postMessage directly?

@zsxwing
Member

zsxwing commented Jan 23, 2016

> Thanks for the pointers @zsxwing. Makes sense to just log an error message. Instead of throwing an exception from postMessage and logging it in send, what if we just log the warning in postMessage directly?

If this is called from ask, we still need to throw the exception so that we can notify the caller. The user may use something like Await.result(endpointRef.ask(...)).
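
For context, the caller pattern being protected looks roughly like this (a sketch with an illustrative EndpointRefLike stand-in, not the real RpcEndpointRef API); if the message were silently dropped instead of failing the Future, this Await would simply block until its timeout:

object AskExample {
  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration._

  // Stand-in for the real endpoint reference; ask must complete the Future
  // (with a value or an exception) for a blocked caller to make progress.
  trait EndpointRefLike { def ask[T](message: Any): Future[T] }

  def waitForReply(endpointRef: EndpointRefLike): String =
    Await.result(endpointRef.ask[String]("CheckStatus"), 30.seconds)
}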

@nishkamravi2
Contributor Author

Ok, makes sense. Btw, we also want to mask the exception being thrown at line 161 (I see that in the logs as well).

@nishkamravi2
Contributor Author

@zsxwing Updated the PR after some testing. Most of the stack traces in the log were coming from the postToAll RPC for RemoteProcessDisconnected, so I have removed the stack trace from the callback.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #49979 has finished for PR 10881 at commit 3376bfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 26, 2016

LGTM.

@@ -106,7 +106,7 @@ private[netty] class Dispatcher(nettyEnv: NettyRpcEnv) extends Logging {
val iter = endpoints.keySet().iterator()
while (iter.hasNext) {
val name = iter.next
postMessage(name, message, (e) => logWarning(s"Message $message dropped.", e))
postMessage(name, message, (e) => logWarning(s"Message $message dropped."))
Member

Could you update the message to s"Message $message dropped. ${e.getMessage}"? It's worth adding the exception message here since there are two types of exceptions.
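
Presumably the line in the diff above then becomes something like the following (a sketch of the requested change, not a quote of the merged commit):

postMessage(name, message, (e) => logWarning(s"Message $message dropped. ${e.getMessage}"))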

@zsxwing
Member

zsxwing commented Jan 26, 2016

Looks pretty good. Will merge it once you update the warning message.

@nishkamravi2
Contributor Author

Updated the warning message. Thanks.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50166 has finished for PR 10881 at commit f91afed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jan 27, 2016

@nishkamravi2 thanks, merging to master

asfgit closed this in bae3c9a on Jan 27, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Mar 29, 2016
[SPARK-12967][Netty] Avoid NettyRpc error message during sparkContext shutdown

If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence.

vanzin

Author: Nishkam Ravi <nishkamravi@gmail.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes apache#10881 from nishkamravi2/master_netty.

(cherry picked from commit bae3c9a)
@JerryLead
Contributor

This bug still exists in the latest Spark 1.6.2. How about merging it into branch-1.6? @nishkamravi2 @zsxwing

@1236897

1236897 commented Oct 21, 2016

Could I know how to merge the updated code into my project to avoid this error?

@zsxwing
Member

zsxwing commented Oct 21, 2016

@1236897 Just try git cherry-pick this PR.

@1236897

1236897 commented Nov 3, 2016

Actually, I face the "RpcEnv already stopped" issue when I call sparkContext.stop at the end of the program. I added Spark 1.6 through the Maven POM. Could I know how to fix this issue?

@zsxwing
Member

zsxwing commented Nov 3, 2016

@1236897 You can check out the Spark 1.6.2 tag and apply this patch. Then build Spark and use that build to submit your Spark application. You can still use the Spark Maven artifact to build your application.

@1236897

1236897 commented Nov 3, 2016

@zsxwing Thank you for your reply, and sorry to disturb you; this project is very important to me, so let me describe what I need to do. First, check out Spark 1.6 from GitHub. Second, use git cherry-pick to apply this patch and compile the code through Eclipse. Lastly, add the output jar to the Maven repository and add that jar as a dependency to my project. Right?

@zsxwing
Member

zsxwing commented Nov 3, 2016

> compile the code through Eclipse.

No. Take a look at this page about how to build Spark: http://spark.apache.org/docs/1.6.2/building-spark.html

> Lastly, add the output jar to the Maven repository and add that jar as a dependency to my project.

You don't need to. You can still use the official Maven artifact to build your application. However, you need to use the new spark-submit inside your own Spark build to submit your application. If you use a standalone Spark cluster and run the Spark app in cluster mode, you must upgrade your Spark cluster as well.

@zsxwing
Member

zsxwing commented Nov 3, 2016

@1236897 If you don't want to build Spark, it's fine to just catch this special exception thrown from SparkContext.stop and ignore it.
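
A minimal sketch of that workaround, assuming the error surfaces as the IllegalStateException("RpcEnv already stopped.") discussed above (the stopQuietly helper name is just illustrative):

import org.apache.spark.SparkContext

object StopQuietly {
  // Swallow only the known shutdown-race error; anything else still propagates.
  def stopQuietly(sc: SparkContext): Unit =
    try sc.stop()
    catch {
      case e: IllegalStateException
          if Option(e.getMessage).exists(_.contains("RpcEnv already stopped")) =>
        // Known, harmless race while stopping on an unpatched 1.6.x; safe to ignore.
        ()
    }
}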

@1236897

1236897 commented Nov 3, 2016

@zsxwing If I just ignore it, will that avoid my issue? My issue is that I need to complete my job within 5 minutes, but the "RpcEnv already stopped" issue wastes a lot of time disconnecting and makes my job take longer than I expect.

@zsxwing
Member

zsxwing commented Nov 3, 2016

the issue "RpcEnv already stopped" waste a lot of time to disconnect

Did you measure how long it takes to stop the SparkContext? I don't think it will take several minutes.

@1236897

1236897 commented Nov 3, 2016

Because I watch the monitoring page: my last step is to save as Parquet, and the monitoring page shows the stage as completed while the job keeps running. After a few minutes the job completes and throws the "RpcEnv already stopped" exception; that's the reason I assume the exception leads to the long RUNNING state instead of COMPLETE. PS: the job uses 40 executors in a 200-server environment.

@zsxwing
Member

zsxwing commented Nov 3, 2016

@1236897 The Spark job status change from RUNNING to COMPLETE happens before SparkContext.stop.

@fan1emon

fan1emon commented Mar 9, 2017

Hi, @zsxwing. Is there any way to reproduce this issue?
