[SPARK-21321][Spark Core] Spark very verbose on shutdown #18547

Closed
wants to merge 6 commits

Conversation


@yoonlee95 yoonlee95 commented Jul 5, 2017

What changes were proposed in this pull request?

The current code is very verbose on shutdown.

The change I propose is to lower the log level of the messages emitted when the driver is shutting down and the RPC connections are already closed (RpcEnvStoppedException).
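
To make the intent concrete, here is a minimal, self-contained sketch of the pattern (all names, such as `QuietShutdownSketch`, are stand-ins rather than the actual Spark classes the patch touches): once shutdown has started, expected failures such as `RpcEnvStoppedException` are logged at debug instead of warning.

```
object QuietShutdownSketch {

  // Stand-in for org.apache.spark.rpc.RpcEnvStoppedException.
  class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

  @volatile private var stopped = false

  private def logWarning(msg: String): Unit = println(s"WARN  $msg")
  private def logDebug(msg: String): Unit = println(s"DEBUG $msg")

  /** Post a message, demoting the "dropped" warning once shutdown has begun. */
  def postMessage(message: String): Unit = {
    try {
      deliver(message)
    } catch {
      case e: RpcEnvStoppedException =>
        if (stopped) {
          // Expected while shutting down: keep the log quiet.
          logDebug(s"Message $message dropped. ${e.getMessage}")
        } else {
          // Unexpected while running: keep it visible.
          logWarning(s"Message $message dropped. ${e.getMessage}")
        }
    }
  }

  private def deliver(message: String): Unit = {
    if (stopped) throw new RpcEnvStoppedException
    println(s"delivered: $message")
  }

  def stop(): Unit = { stopped = true }

  def main(args: Array[String]): Unit = {
    postMessage("heartbeat")   // delivered
    stop()
    postMessage("heartbeat")   // dropped, logged at debug only
  }
}
```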

How was this patch tested?

Tested with a word count job (deploy-mode = cluster, master = yarn, num-executors = 4) over 300 GB of data.

@srowen srowen (Member) left a comment

I still am not sure it's worth revising these. What's logged at shutdown isn't that big.

```
@@ -206,6 +206,13 @@ private[netty] class Inbox(
       case NonFatal(e) =>
         try endpoint.onError(e) catch {
-          case NonFatal(ee) => logError(s"Ignoring error", ee)
+          case NonFatal(ee) =>
+            if (stopped) {
+              logDebug(s"Ignoring error", ee)
```
Member left an inline comment:
logInfo, not debug, probably. I can see reducing the level if you know you're stopping, but not in other cases. The code style here would have to be fixed. For example, the `s` string interpolator is used here without actually interpolating anything.
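
For context, the diff excerpt above is truncated; presumably the new branch is completed by an `else` that keeps the existing `logError`. Below is a self-contained sketch of that shape, with stand-in logging and endpoint (and note that the final log levels were still under discussion at this point):

```
import scala.util.control.NonFatal

object InboxErrorHandlingSketch {
  @volatile var stopped = false

  private def logError(msg: String, t: Throwable): Unit = println(s"ERROR $msg: $t")
  private def logDebug(msg: String, t: Throwable): Unit = println(s"DEBUG $msg: $t")

  // Stand-in for endpoint.onError, which may itself throw.
  private def endpointOnError(e: Throwable): Unit =
    throw new IllegalStateException("onError failed", e)

  def handle(e: Throwable): Unit = {
    try endpointOnError(e) catch {
      case NonFatal(ee) =>
        if (stopped) {
          // The inbox is already stopping, so this is expected noise.
          logDebug("Ignoring error", ee)
        } else {
          logError("Ignoring error", ee)
        }
    }
  }

  def main(args: Array[String]): Unit = {
    handle(new RuntimeException("boom"))   // logged at ERROR
    stopped = true
    handle(new RuntimeException("boom"))   // logged at DEBUG
  }
}
```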

@tgravescs (Contributor)

There is no reason to print out messages that aren't useful to users. Many users see warnings, read them, and think there is a problem with their application or configuration, when most of these messages are really for advanced users or devs trying to debug what is going on. So changing them to debug makes sense to me.

There are also times when shutdown can take a very long time and all it's doing is spewing these log messages, for instance when there is a backlog of events in the event queue. I have seen multiple minutes of it just throwing warnings like `Message $message dropped. ${e.getMessage}`. Dynamic allocation can also add a lot of unneeded warning messages. When I'm shutting down, things should stop quickly and release the resources for others to use.

I still need to review in detail, but these changes should apply only in the shutdown case; at other times I agree that you may want to see these messages.

We should definitely update the description here and in the JIRA to have more details, though.

@tgravescs (Contributor)

Jenkins, test this please


SparkQA commented Jul 7, 2017

Test build #79340 has finished for PR 18547 at commit 80e160a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

@yoonlee95 please look at the test failures; it might be a race in clearing the events.

@tgravescs (Contributor)

Jenkins, test this please

1 similar comment
@tgravescs (Contributor)

Jenkins, test this please

vanzin (Contributor) commented Jul 13, 2017

ok to test


SparkQA commented Jul 13, 2017

Test build #79591 has finished for PR 18547 at commit 8975eda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -200,7 +200,7 @@ private[spark] class LiveListenerBus(conf: SparkConf) extends SparkListenerBus {
   private def queueIsEmpty: Boolean = synchronized { eventQueue.isEmpty && !processingEvent }
 
   /**
-   * Stop the listener bus. It will wait until the queued events have been processed, but drop the
+   * Stop the listener bus. It will clear the queued events for faster shutdown and drop the
```
Contributor left an inline comment:
remove the comment now
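
For context on the docstring change above, a simplified sketch of the behaviour it describes: on stop(), clear whatever is still queued instead of draining it, so shutdown doesn't block on a backlog. The names below are stand-ins rather than Spark's `LiveListenerBus` internals, and whether this behaviour survived review isn't clear from this excerpt.

```
import java.util.concurrent.ConcurrentLinkedQueue

class ListenerBusSketch {
  private val eventQueue = new ConcurrentLinkedQueue[String]()
  @volatile private var stopped = false

  def post(event: String): Unit = {
    if (stopped) {
      println(s"DEBUG Dropping event $event because the bus has been stopped")
    } else {
      eventQueue.add(event)
    }
  }

  /** Stop the bus: clear the queued events for a faster shutdown. */
  def stop(): Unit = {
    stopped = true
    val dropped = eventQueue.size()
    eventQueue.clear()
    if (dropped > 0) {
      println(s"DEBUG Dropped $dropped queued events on stop")
    }
  }
}
```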


SparkQA commented Jul 14, 2017

Test build #79611 has finished for PR 18547 at commit a2aa7ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 14, 2017

Test build #79617 has finished for PR 18547 at commit edfcb71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

+1

asfgit pushed a commit that referenced this pull request Jul 17, 2017
## What changes were proposed in this pull request?

The current code is very verbose on shutdown.

The change I propose is to lower the log level of the messages emitted when the driver is shutting down and the RPC connections are already closed (RpcEnvStoppedException).

## How was this patch tested?

Tested with a word count job (deploy-mode = cluster, master = yarn, num-executors = 4) over 300 GB of data.

Author: John Lee <jlee2@yahoo-inc.com>

Closes #18547 from yoonlee95/SPARK-21321.

(cherry picked from commit 0e07a29)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
@asfgit asfgit closed this in 0e07a29 Jul 17, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
dongjoon-hyun pushed a commit that referenced this pull request Jul 22, 2020
…parkCluster shutdown

### What changes were proposed in this pull request?

Catch the `RpcEnvStoppedException` and log it at debug level when stop is called for a `LocalSparkCluster`.

This PR also contains two small changes to fix potential issues.

### Why are the changes needed?

Currently, there's always an "RpcEnv already stopped" error if we exit spark-shell in local-cluster mode:

```
20/06/07 14:54:18 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
        at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167)
        at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
        at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691)
        at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

```

When we call stop on `StandaloneSchedulerBackend`, the backend first sends `UnregisterApplication` to `Master` and then immediately calls stop on `LocalSparkCluster`. On the other side, `Master` sends messages to `Worker` when it receives `UnregisterApplication`. However, the rpcEnv of the `Worker` has already been stopped by the backend. Therefore, the error message shows up when the `Worker` tries to handle the messages.

It's only an error on shutdown, and users don't really need to care about it. So we can hide it in the debug log, which is also what was done previously in #18547.
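
A minimal sketch of the described fix (stand-in types and names, not the actual `LocalSparkCluster` code): a late `RpcEnvStoppedException` raised while stopping is caught and logged at debug rather than surfacing as an error.

```
object LocalClusterStopSketch {

  // Stand-in for org.apache.spark.rpc.RpcEnvStoppedException.
  class RpcEnvStoppedException extends IllegalStateException("RpcEnv already stopped.")

  private def logDebug(msg: String, t: Throwable): Unit =
    println(s"DEBUG $msg: ${t.getMessage}")

  // Stand-in for a message that arrives after the worker's RpcEnv is already down.
  private def sendToStoppedWorker(msg: String): Unit = throw new RpcEnvStoppedException

  def stop(): Unit = {
    try {
      sendToStoppedWorker("UnregisterApplication")
    } catch {
      case e: RpcEnvStoppedException =>
        // Expected: the worker's RpcEnv was shut down first during local-cluster stop.
        logDebug("Ignoring RpcEnvStoppedException during LocalSparkCluster shutdown", e)
    }
  }

  def main(args: Array[String]): Unit = stop()
}
```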

### Does this PR introduce _any_ user-facing change?

Yes, users will not see the error message after this PR.

### How was this patch tested?

Tested manually.

Closes #28746 from Ngone51/fix-spark-31922.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>