
[SPARK-6014] [core][hotfix] Add try-catch block around ShutDownHook #5672

Closed
wants to merge 60 commits into from

Conversation

nishkamravi2
Contributor

Add a try/catch block around removeShutDownHook, else an IllegalStateException is thrown in YARN cluster mode (see #4690)

cc @andrewor14, @srowen

nishkamravi2 and others added 30 commits June 3, 2014 15:28
…extFiles

The prefix "file:" is missing in the string inserted as key in HashMap
…multiplier (redone to resolve merge conflicts)
…nravi

Conflicts:
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
…nravi

Conflicts:
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@nishkamravi2
Contributor Author

See Utils.inShutDown discussion in 4690.

@nishkamravi2
Contributor Author

Can this be tested, please? Thanks.

@vanzin
Contributor

vanzin commented Apr 24, 2015

See Utils.inShutDown discussion in 4690.

The part about it being expensive? This is hardly a performance-sensitive path. It could also be made faster by first checking shutdownHooks.shutdown before trying the Runtime.blah route.
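For reference, that reordering would look roughly like this (a sketch only; `shutdownHooks.shutdown` is assumed to be the manager's "shutdown has started" flag, and the fallback mirrors the dummy-hook probe that makes inShutdown expensive):

```scala
// Sketch: consult the cheap local flag first, and only fall back to the
// expensive probe of adding/removing a throwaway JVM shutdown hook.
def inShutdown(): Boolean = {
  if (shutdownHooks.shutdown) {
    return true  // our manager already knows shutdown has started
  }
  try {
    val hook = new Thread {
      override def run(): Unit = {}
    }
    Runtime.getRuntime.addShutdownHook(hook)
    Runtime.getRuntime.removeShutdownHook(hook)
    false
  } catch {
    // The JVM rejects hook registration once shutdown is in progress.
    case _: IllegalStateException => true
  }
}
```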

But in any case, not a big deal.

@@ -148,7 +148,11 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkConf
   /** Cleanup local dirs and stop shuffle sender. */
   private[spark] def stop() {
     // Remove the shutdown hook. It causes memory leaks if we leave it around.
-    Utils.removeShutdownHook(shutdownHook)
+    try {
+      Utils.removeShutdownHook(shutdownHook)
Member

Can this simply not throw an exception?

Contributor

What would it do instead? Just ignore the call and return false?

Probably not a big deal in practice, but it does overload the return value to mean "either the hook doesn't exist, or I can't remove it at this time".
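For concreteness, the variant under discussion would look roughly like this (a hypothetical sketch; `hooks` and `shuttingDown` are assumed names for the manager's queue and flag):

```scala
// Sketch: instead of throwing IllegalStateException during shutdown,
// silently no-op and return false -- which conflates "no such hook"
// with "too late to remove it".
def removeShutdownHook(ref: AnyRef): Boolean = hooks.synchronized {
  if (shuttingDown) {
    false  // shutdown already started; the hook may be running or done
  } else {
    hooks.remove(ref)  // java.util.PriorityQueue.remove returns a boolean
  }
}
```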

Member

Yes. Asking to remove a hook always races with shutdown, so I suppose code can never guarantee it has removed a hook before shutdown; it can't be sure the removal succeeded before the hook executes. So it doesn't seem too bad to just try to remove the hook, which might or might not have already executed and been discarded, and return normally either way. That is, do you want to prohibit removal of a hook that hasn't run, even during shutdown? Meh, this might be a simpler way to resolve it.

Contributor

Hmm. It should be ok to allow removing the hooks during shutdown. SparkShutdownHookManager.runAll is safe against modifications to the underlying queue.

@nishkamravi2
Contributor Author

I would be fine with 'not explicitly throwing an exception' and keeping the try-catch block.

@vanzin
Contributor

vanzin commented Apr 24, 2015

LGTM. try..catch seems unnecessary now but doesn't hurt either.

@srowen
Member

srowen commented Apr 24, 2015

Why keep the block? What exception is expected now, and why is it safe to squash it (without even logging)?

@nishkamravi2
Contributor Author

A case of erring on the conservative side: the implementation of Utils.removeShutdownHook could change over time. The invocation from DiskBlockManager should not have to rely on it being exception-free.

@nishkamravi2
Contributor Author

With regard to logging, I thought you had recommended not logging (in 4690). Do the same reasons apply?

@srowen
Member

srowen commented Apr 24, 2015

There I was talking about a normal, expected exception. If this method changes so that it no longer throws an exception we think of as normal and expectable, then whatever this catches is surprising and should be logged. By this logic we should catch a lot of exceptions in a lot of places... a try-catch makes me believe that the author expected a particular exception and specifically knows it can be silently ignored. That doesn't seem to be the argument anymore.

@nishkamravi2
Contributor Author

The original PR in 4690 didn't make assumptions about the type of Exception thrown, and it logged it; I guess we were thinking slightly differently back then too. Agreed: generalizing and logging seems like the best way forward.
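Concretely, the call site in DiskBlockManager.stop would end up looking something like this (a sketch of the agreed direction, with doStop() standing in for the rest of the cleanup):

```scala
/** Cleanup local dirs and stop shuffle sender. */
private[spark] def stop() {
  // Remove the shutdown hook; it causes memory leaks if left around.
  // Catch broadly and log, rather than assuming removeShutdownHook
  // stays exception-free as its implementation evolves.
  try {
    Utils.removeShutdownHook(shutdownHook)
  } catch {
    case e: Exception =>
      logError("Exception while removing shutdown hook.", e)
  }
  doStop()
}
```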

@srowen
Member

srowen commented Apr 24, 2015

In general I'd argue against catching exceptions broadly as a defense. This is a stop() method though, and often the right behavior is to just keep going even if stopping failed due to some I/O problem or messed-up state. Here I don't see what exception this method can throw, so I'd still find the try-catch surprising and would not expect it.

@vanzin
Contributor

vanzin commented Apr 24, 2015

Since we're talking about exceptions, SparkShutdownHookManager.runAll could be modified to catch exceptions for individual hooks and ignore them. It already logs them, but logUncaughtExceptions re-throws, so an exception would cause downstream hooks to not be executed.

@nishkamravi2
Contributor Author

@srowen I think I'd approach this one as a why-not (especially with logging enabled). This particular call site has warranted two separate PRs attempting to add a try-catch block (because the code in Utils changed), and thus stands out from the rest of the Spark code in that regard.

@nishkamravi2
Contributor Author

@vanzin Are we suggesting adding a try-catch block inside the while loop in runAll, or deleting the throws in logUncaughtExceptions?

@vanzin
Contributor

vanzin commented Apr 24, 2015

@nishkamravi2 just adding Try() around the logUncaughtExceptions call should be sufficient.
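That is, roughly this (a sketch; `hooks` is assumed to be the manager's PriorityQueue of SparkShutdownHook, guarded by the manager's lock):

```scala
import scala.util.Try

def runAll(): Unit = synchronized {
  shuttingDown = true
  while (!hooks.isEmpty()) {
    // logUncaughtExceptions logs and re-throws; Try() swallows the
    // re-throw, so one failing hook can't block the hooks behind it.
    Try(Utils.logUncaughtExceptions(hooks.poll().run()))
  }
}
```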

@nishkamravi2
Contributor Author

While we are at it: is there any particular reason for using PriorityQueue, or can we replace it with ConcurrentLinkedQueue or something and get rid of a few synchronized blocks?

@vanzin
Contributor

vanzin commented Apr 24, 2015

Hooks have a priority so that they can choose the order they're executed in; e.g., if some hook really needs to execute before another one, that's possible, similar to how the Spark hook needs to execute before the HDFS one to avoid exceptions.

The locking shouldn't cause issues; this is not a hot path where avoiding locks brings any benefit. Outside of shutdown, there are just a handful of places where the methods are called. During shutdown, everything happens from runAll, so any re-entrant calls already hold the lock and thus can go through.
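For illustration, the priority ordering boils down to something like this (a sketch with approximate names):

```scala
import java.util.PriorityQueue

// Each hook carries a priority; compareTo is inverted so that the
// PriorityQueue's head is the highest-priority hook, i.e. the one that
// must run first (e.g. Spark's hook ahead of the HDFS one).
private class SparkShutdownHook(private val priority: Int, hook: () => Unit)
  extends Comparable[SparkShutdownHook] {

  override def compareTo(other: SparkShutdownHook): Int = {
    other.priority - priority
  }

  def run(): Unit = hook()
}

// Usage sketch: hooks.add(new SparkShutdownHook(priority, body));
// hooks.poll() then always hands back the hook that should run next.
```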

@nishkamravi2
Contributor Author

@vanzin Sounds reasonable.

@srowen
Member

srowen commented Apr 25, 2015

@nishkamravi2 this isn't the same method or exception handling as before; the question is what's right to do given the current state of the code. It feels a little arbitrary to wrap this invocation but not others, but it's a stop() method and at least logs the exception, so I won't argue it further.

@srowen
Member

srowen commented Apr 25, 2015

ok to test

@SparkQA

SparkQA commented Apr 25, 2015

Test build #701 has started for PR 5672 at commit 0f1abd0.

@srowen
Member

srowen commented Apr 26, 2015

This test actually succeeded normally; it just wasn't able to post back here. Going ahead.

@asfgit asfgit closed this in f5473c2 Apr 26, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
Add a try/catch block around removeShutDownHook else IllegalStateException thrown in YARN cluster mode (see apache#4690)

cc andrewor14, srowen

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes apache#5672 from nishkamravi2/master_nravi and squashes the following commits:

0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015