
[SPARK-6014] [core][hotfix] Add try-catch block around ShutDownHook #5672

Closed
wants to merge 60 commits into from

Conversation

nishkamravi2
Contributor

Add a try/catch block around removeShutDownHook, else an IllegalStateException is thrown in YARN cluster mode (see #4690)

cc @andrewor14, @srowen

nishkamravi2 and others added 30 commits June 3, 2014 15:28
…extFiles

The prefix "file:" is missing in the string inserted as key in HashMap
…multiplier (redone to resolve merge conflicts)
…nravi

Conflicts:
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
…nravi

Conflicts:
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@nishkamravi2
Contributor Author

See Utils.inShutDown discussion in 4690.

@nishkamravi2
Contributor Author

Can this be tested, please? Thanks.

@vanzin
Contributor

vanzin commented Apr 24, 2015

See Utils.inShutDown discussion in 4690.

The part about it being expensive? This is hardly a performance-sensitive path. It could also be made faster by first checking shutdownHooks.shutdown before trying the Runtime.blah route.
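For reference, that reordering would look roughly like this (a sketch only; `shutdownHooks.shutdown` is assumed to be the manager's "shutdown has started" flag, and the fallback mirrors the dummy-hook probe that makes inShutdown expensive):

```scala
// Sketch: consult the cheap local flag first, and only fall back to the
// expensive probe of adding/removing a throwaway JVM shutdown hook.
def inShutdown(): Boolean = {
  if (shutdownHooks.shutdown) {
    return true  // our manager already knows shutdown has started
  }
  try {
    val hook = new Thread {
      override def run(): Unit = {}
    }
    Runtime.getRuntime.addShutdownHook(hook)
    Runtime.getRuntime.removeShutdownHook(hook)
    false
  } catch {
    // The JVM rejects hook registration once shutdown is in progress.
    case _: IllegalStateException => true
  }
}
```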

But in any case, not a big deal.

@@ -148,7 +148,11 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkConf
   /** Cleanup local dirs and stop shuffle sender. */
   private[spark] def stop() {
     // Remove the shutdown hook. It causes memory leaks if we leave it around.
-    Utils.removeShutdownHook(shutdownHook)
+    try {
+      Utils.removeShutdownHook(shutdownHook)
Member

Can this simply not throw an exception?

Contributor

What would it do instead? Just ignore the call and return false?

Probably not a big deal in practice, but it does overload the return value to mean "either the hook doesn't exist, or I can't remove it at this time".
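For concreteness, the variant under discussion would look roughly like this (a hypothetical sketch; `hooks` and `shuttingDown` are assumed names for the manager's queue and flag):

```scala
// Sketch: instead of throwing IllegalStateException during shutdown,
// silently no-op and return false -- which conflates "no such hook"
// with "too late to remove it".
def removeShutdownHook(ref: AnyRef): Boolean = hooks.synchronized {
  if (shuttingDown) {
    false  // shutdown already started; the hook may be running or done
  } else {
    hooks.remove(ref)  // java.util.PriorityQueue.remove returns a boolean
  }
}
```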

Member

Yes. Asking to remove a hook always races with shutdown, so I suppose code can never guarantee it has removed a hook before shutdown; it can't be sure the removal succeeded before the hook executes. So it doesn't seem too bad to just try to remove the hook, which might or might not have already executed and been discarded, and return normally either way. That is, do you want to prohibit removal of a hook that hasn't run, even during shutdown? Meh, this might be a simpler way to resolve it.

Contributor

Hmm. It should be ok to allow removing the hooks during shutdown. SparkShutdownHookManager.runAll is safe against modifications to the underlying queue.

@nishkamravi2
Contributor Author

I would be fine with 'not explicitly throwing an exception' and keeping the try-catch block.

@vanzin
Contributor

vanzin commented Apr 24, 2015

LGTM. try..catch seems unnecessary now but doesn't hurt either.

@srowen
Member

srowen commented Apr 24, 2015

Why keep the block? What exception is expected now, and why is it safe to squash it (without even logging)?

@nishkamravi2
Contributor Author

A case of erring on the conservative side: the implementation of Utils.removeShutdownHook could change over time. The invocation from DiskBlockManager should not have to rely on it being exception-free.

@nishkamravi2
Contributor Author

With regard to logging, I thought you had recommended not logging (in 4690). Do the same reasons apply?

@srowen
Member

srowen commented Apr 24, 2015

There I was talking about a normal, expected exception. If this method changes so that it no longer throws an exception we think of as normal and expectable, then whatever this catches is surprising and should be logged. By this logic we should catch a lot of exceptions in a lot of places... a try-catch makes me believe that the author expected a particular exception and specifically knows it can be silently ignored. That doesn't seem to be the argument anymore.

@nishkamravi2
Contributor Author

The original PR in 4690 didn't make assumptions about the type of Exception thrown, and it logged it; I guess we were thinking slightly differently back then too. Agreed: generalizing and logging seems like the best way forward.
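Concretely, the call site in DiskBlockManager.stop would end up looking something like this (a sketch of the agreed direction, with doStop() standing in for the rest of the cleanup):

```scala
/** Cleanup local dirs and stop shuffle sender. */
private[spark] def stop() {
  // Remove the shutdown hook; it causes memory leaks if left around.
  // Catch broadly and log, rather than assuming removeShutdownHook
  // stays exception-free as its implementation evolves.
  try {
    Utils.removeShutdownHook(shutdownHook)
  } catch {
    case e: Exception =>
      logError("Exception while removing shutdown hook.", e)
  }
  doStop()
}
```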

@srowen
Member

srowen commented Apr 24, 2015

In general I'd argue against catching exceptions broadly as a defense. This is a stop() method though, and often the right behavior is to just keep going even if stopping failed due to some I/O problem or messed-up state. Here I don't see what exception this method can throw, so I'd still find the try-catch surprising and would not expect it.

@vanzin
Contributor

vanzin commented Apr 24, 2015

Since we're talking about exceptions, SparkShutdownHookManager.runAll could be modified to catch exceptions for individual hooks and ignore them. It already logs them, but logUncaughtExceptions re-throws, so an exception would cause downstream hooks to not be executed.

@nishkamravi2
Contributor Author

@srowen I think I'd approach this one as a why-not (especially with logging enabled). This particular call site has warranted two separate PRs attempting to add a try-catch block (because the code in Utils changed), and thus stands out from the rest of the Spark code in that regard.

@nishkamravi2
Contributor Author

@vanzin Are we suggesting adding a try-catch block inside the while loop in runAll, or deleting the throws in logUncaughtExceptions?

@vanzin
Contributor

vanzin commented Apr 24, 2015

@nishkamravi2 just adding Try() around the logUncaughtExceptions call should be sufficient.
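That is, roughly this (a sketch; `hooks` is assumed to be the manager's PriorityQueue of SparkShutdownHook, guarded by the manager's lock):

```scala
import scala.util.Try

def runAll(): Unit = synchronized {
  shuttingDown = true
  while (!hooks.isEmpty()) {
    // logUncaughtExceptions logs and re-throws; Try() swallows the
    // re-throw, so one failing hook can't block the hooks behind it.
    Try(Utils.logUncaughtExceptions(hooks.poll().run()))
  }
}
```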

@nishkamravi2
Contributor Author

While we are at it: is there any particular reason for using PriorityQueue, or can we replace it with ConcurrentLinkedQueue or something and get rid of a few synchronized blocks?

@vanzin
Contributor

vanzin commented Apr 24, 2015

Hooks have a priority so that they can choose the order they're executed in; e.g., if some hook really needs to execute before another one, that's possible, similar to how the Spark hook needs to execute before the HDFS one to avoid exceptions.

The locking shouldn't cause issues; this is not a hot path where avoiding locks brings any benefit. Outside of shutdown, there are just a handful of places where the methods are called. During shutdown, everything happens from runAll, so any re-entrant calls already hold the lock and thus can go through.
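For illustration, the priority ordering boils down to something like this (a sketch with approximate names):

```scala
import java.util.PriorityQueue

// Each hook carries a priority; compareTo is inverted so that the
// PriorityQueue's head is the highest-priority hook, i.e. the one that
// must run first (e.g. Spark's hook ahead of the HDFS one).
private class SparkShutdownHook(private val priority: Int, hook: () => Unit)
  extends Comparable[SparkShutdownHook] {

  override def compareTo(other: SparkShutdownHook): Int = {
    other.priority - priority
  }

  def run(): Unit = hook()
}

// Usage sketch: hooks.add(new SparkShutdownHook(priority, body));
// hooks.poll() then always hands back the hook that should run next.
```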

@nishkamravi2
Contributor Author

@vanzin Sounds reasonable.

@srowen
Member

srowen commented Apr 25, 2015

@nishkamravi2 this isn't the same method or exception handling as before; the question is what's right to do given the current state of the code. It feels a little arbitrary to wrap this invocation but not others, but it's a stop() method and at least logs the exception, so I won't argue it further.

@srowen
Member

srowen commented Apr 25, 2015

ok to test

@SparkQA

SparkQA commented Apr 25, 2015

Test build #701 has started for PR 5672 at commit 0f1abd0.

@srowen
Member

srowen commented Apr 26, 2015

This test actually succeeded normally; it just wasn't able to post back here. Going ahead.

@asfgit asfgit closed this in f5473c2 Apr 26, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
Add a try/catch block around removeShutDownHook else IllegalStateException thrown in YARN cluster mode (see apache#4690)

cc andrewor14, srowen

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes apache#5672 from nishkamravi2/master_nravi and squashes the following commits:

0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015