[SPARK-21321][Spark Core] Spark very verbose on shutdown #18547

Closed
wants to merge 6 commits

Conversation


@yoonlee95 yoonlee95 commented Jul 5, 2017

What changes were proposed in this pull request?

The current code is very verbose on shutdown.

The change I propose is to lower the log level of the messages emitted when the driver is shutting down and the RPC connections are already closed (RpcEnvStoppedException).
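
To make the intent concrete, here is a minimal, self-contained sketch of the pattern (all names, such as `QuietShutdownSketch`, are stand-ins rather than the actual Spark classes the patch touches): once shutdown has started, expected failures such as `RpcEnvStoppedException` are logged at debug instead of warning.

```
object QuietShutdownSketch {

  // Stand-in for org.apache.spark.rpc.RpcEnvStoppedException.
  class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

  @volatile private var stopped = false

  private def logWarning(msg: String): Unit = println(s"WARN  $msg")
  private def logDebug(msg: String): Unit = println(s"DEBUG $msg")

  /** Post a message, demoting the "dropped" warning once shutdown has begun. */
  def postMessage(message: String): Unit = {
    try {
      deliver(message)
    } catch {
      case e: RpcEnvStoppedException =>
        if (stopped) {
          // Expected while shutting down: keep the log quiet.
          logDebug(s"Message $message dropped. ${e.getMessage}")
        } else {
          // Unexpected while running: keep it visible.
          logWarning(s"Message $message dropped. ${e.getMessage}")
        }
    }
  }

  private def deliver(message: String): Unit = {
    if (stopped) throw new RpcEnvStoppedException
    println(s"delivered: $message")
  }

  def stop(): Unit = { stopped = true }

  def main(args: Array[String]): Unit = {
    postMessage("heartbeat")   // delivered
    stop()
    postMessage("heartbeat")   // dropped, logged at debug only
  }
}
```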

How was this patch tested?

Tested with a word count job (deploy-mode = cluster, master = yarn, num-executors = 4) over 300 GB of data.

@srowen srowen (Member) left a comment

I still am not sure it's worth revising these. What's logged at shutdown isn't that big.

```
@@ -206,6 +206,13 @@ private[netty] class Inbox(
       case NonFatal(e) =>
         try endpoint.onError(e) catch {
-          case NonFatal(ee) => logError(s"Ignoring error", ee)
+          case NonFatal(ee) =>
+            if (stopped) {
+              logDebug(s"Ignoring error", ee)
```
Member left an inline comment:
logInfo, not debug, probably. I can see reducing the level if you know you're stopping, but not in other cases. The code style here would have to be fixed. For example, the `s` string interpolator is used here without actually interpolating anything.
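
For context, the diff excerpt above is truncated; presumably the new branch is completed by an `else` that keeps the existing `logError`. Below is a self-contained sketch of that shape, with stand-in logging and endpoint (and note that the final log levels were still under discussion at this point):

```
import scala.util.control.NonFatal

object InboxErrorHandlingSketch {
  @volatile var stopped = false

  private def logError(msg: String, t: Throwable): Unit = println(s"ERROR $msg: $t")
  private def logDebug(msg: String, t: Throwable): Unit = println(s"DEBUG $msg: $t")

  // Stand-in for endpoint.onError, which may itself throw.
  private def endpointOnError(e: Throwable): Unit =
    throw new IllegalStateException("onError failed", e)

  def handle(e: Throwable): Unit = {
    try endpointOnError(e) catch {
      case NonFatal(ee) =>
        if (stopped) {
          // The inbox is already stopping, so this is expected noise.
          logDebug("Ignoring error", ee)
        } else {
          logError("Ignoring error", ee)
        }
    }
  }

  def main(args: Array[String]): Unit = {
    handle(new RuntimeException("boom"))   // logged at ERROR
    stopped = true
    handle(new RuntimeException("boom"))   // logged at DEBUG
  }
}
```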

@tgravescs (Contributor)

There is no reason to print out messages that aren't useful to users. Many users see warnings, read them, and think there is a problem with their application or configuration, when most of these messages are really for advanced users or devs trying to debug what is going on. So changing them to debug makes sense to me.

There are also times when shutdown can take a very long time and all it's doing is spewing these log messages, for instance when there is a backlog of events in the event queue. I have seen multiple minutes of it just throwing warnings like `Message $message dropped. ${e.getMessage}`. Dynamic allocation can also add a lot of unneeded warning messages. When I'm shutting down, things should stop quickly and release the resources for others to use.

I still need to review in detail, but these changes should apply only in the shutdown case; at other times I agree that you may want to see these messages.

We should definitely update the description here and in the JIRA to have more details, though.

@tgravescs (Contributor)

Jenkins, test this please


SparkQA commented Jul 7, 2017

Test build #79340 has finished for PR 18547 at commit 80e160a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

@yoonlee95 please look at the test failures; it might be a race in clearing the events.

@tgravescs (Contributor)

Jenkins, test this please

1 similar comment
@tgravescs (Contributor)

Jenkins, test this please

vanzin (Contributor) commented Jul 13, 2017

ok to test


SparkQA commented Jul 13, 2017

Test build #79591 has finished for PR 18547 at commit 8975eda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -200,7 +200,7 @@ private[spark] class LiveListenerBus(conf: SparkConf) extends SparkListenerBus {
   private def queueIsEmpty: Boolean = synchronized { eventQueue.isEmpty && !processingEvent }
 
   /**
-   * Stop the listener bus. It will wait until the queued events have been processed, but drop the
+   * Stop the listener bus. It will clear the queued events for faster shutdown and drop the
```
Contributor left an inline comment:
remove the comment now
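
For context on the docstring change above, a simplified sketch of the behaviour it describes: on stop(), clear whatever is still queued instead of draining it, so shutdown doesn't block on a backlog. The names below are stand-ins rather than Spark's `LiveListenerBus` internals, and whether this behaviour survived review isn't clear from this excerpt.

```
import java.util.concurrent.ConcurrentLinkedQueue

class ListenerBusSketch {
  private val eventQueue = new ConcurrentLinkedQueue[String]()
  @volatile private var stopped = false

  def post(event: String): Unit = {
    if (stopped) {
      println(s"DEBUG Dropping event $event because the bus has been stopped")
    } else {
      eventQueue.add(event)
    }
  }

  /** Stop the bus: clear the queued events for a faster shutdown. */
  def stop(): Unit = {
    stopped = true
    val dropped = eventQueue.size()
    eventQueue.clear()
    if (dropped > 0) {
      println(s"DEBUG Dropped $dropped queued events on stop")
    }
  }
}
```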


SparkQA commented Jul 14, 2017

Test build #79611 has finished for PR 18547 at commit a2aa7ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 14, 2017

Test build #79617 has finished for PR 18547 at commit edfcb71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

+1

asfgit pushed a commit that referenced this pull request Jul 17, 2017
## What changes were proposed in this pull request?

The current code is very verbose on shutdown.

The change I propose is to lower the log level of the messages emitted when the driver is shutting down and the RPC connections are already closed (RpcEnvStoppedException).

## How was this patch tested?

Tested with a word count job (deploy-mode = cluster, master = yarn, num-executors = 4) over 300 GB of data.

Author: John Lee <jlee2@yahoo-inc.com>

Closes #18547 from yoonlee95/SPARK-21321.

(cherry picked from commit 0e07a29)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
@asfgit asfgit closed this in 0e07a29 Jul 17, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
dongjoon-hyun pushed a commit that referenced this pull request Jul 22, 2020
…parkCluster shutdown

### What changes were proposed in this pull request?

Catch the `RpcEnvStoppedException` and log it at debug level when stop is called for a `LocalSparkCluster`.

This PR also contains two small changes to fix potential issues.

### Why are the changes needed?

Currently, there's always an "RpcEnv already stopped" error if we exit spark-shell in local-cluster mode:

```
20/06/07 14:54:18 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
        at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167)
        at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
        at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691)
        at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

```

When we call stop on `StandaloneSchedulerBackend`, the backend first sends `UnregisterApplication` to `Master` and then immediately calls stop on `LocalSparkCluster`. On the other side, `Master` sends messages to `Worker` when it receives `UnregisterApplication`. However, the rpcEnv of the `Worker` has already been stopped by the backend. Therefore, the error message shows up when the `Worker` tries to handle the messages.

It's only an error on shutdown, and users don't really need to care about it. So we can hide it in the debug log, which is also what was done previously in #18547.
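
A minimal sketch of the described fix (stand-in types and names, not the actual `LocalSparkCluster` code): a late `RpcEnvStoppedException` raised while stopping is caught and logged at debug rather than surfacing as an error.

```
object LocalClusterStopSketch {

  // Stand-in for org.apache.spark.rpc.RpcEnvStoppedException.
  class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

  private def logDebug(msg: String, t: Throwable): Unit =
    println(s"DEBUG $msg: ${t.getMessage}")

  // Stand-in for a message that arrives after the worker's RpcEnv is already down.
  private def sendToStoppedWorker(msg: String): Unit = throw new RpcEnvStoppedException

  def stop(): Unit = {
    try {
      sendToStoppedWorker("UnregisterApplication")
    } catch {
      case e: RpcEnvStoppedException =>
        // Expected: the worker's RpcEnv was shut down first during local-cluster stop.
        logDebug("Ignoring RpcEnvStoppedException during LocalSparkCluster shutdown", e)
    }
  }

  def main(args: Array[String]): Unit = stop()
}
```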

### Does this PR introduce _any_ user-facing change?

Yes, users will not see the error message after this PR.

### How was this patch tested?

Tested manually.

Closes #28746 from Ngone51/fix-spark-31922.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>