[SPARK-2889] Create Hadoop config objects consistently. #1843

Closed
vanzin wants to merge 11 commits into apache:master from vanzin:SPARK-2889

Conversation

@vanzin
Contributor

vanzin commented Aug 7, 2014

Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and makes that method do the translation that previously was done only inside SparkContext (a sketch of that translation follows the list below).

The places that were not changed fall into one of the following categories:

  • Test code where this doesn't really matter
  • Places deep in the code where plumbing SparkConf would be too difficult for very little gain
  • Default values for arguments - since the caller can provide their own config in that case
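
To make the translation above concrete, here is a minimal illustrative sketch (the object name and method body are assumptions for illustration, not the exact patch) of how "spark.hadoop.*" entries can be copied from a SparkConf into a Hadoop Configuration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object HadoopConfSketch {
  // Copy every "spark.hadoop.foo" entry from the Spark configuration into a
  // Hadoop Configuration as "foo", so user-supplied Hadoop options take
  // effect wherever this config object ends up being used.
  def newConfiguration(conf: SparkConf): Configuration = {
    val hadoopConf = new Configuration()
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
      }
    }
    hadoopConf
  }
}
```

With a helper along these lines, a setting such as spark.hadoop.fs.defaultFS on the SparkConf surfaces as fs.defaultFS in the resulting Hadoop config.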

Marcelo Vanzin added 4 commits August 7, 2014 13:28
This is the basic grunt work; code doesn't fully compile yet, since
I'll do some of the more questionable changes in separate commits.

Instead of using "new Configuration()" where a configuration is
needed, let the caller provide a context-appropriate config
object.

This is sort of hackish, since it doesn't account for any customization
someone might make to SparkConf before they actually start executing spark
code. Instead, this will only consider options available in the
system properties when creating the hadoop conf.
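
As a hedged illustration of the limitation described in this commit message (the property name below is an arbitrary example, not taken from the patch), a config built eagerly from system properties sees values set with -D or System.setProperty, but not values added to a separate SparkConf instance later in user code:

```scala
import org.apache.spark.SparkConf

object EagerConfLimitation {
  def main(args: Array[String]): Unit = {
    // Seen by an eagerly built config: spark.* system properties are loaded
    // by default when a SparkConf is constructed.
    System.setProperty("spark.hadoop.io.file.buffer.size", "65536")
    val fromSysProps = new SparkConf()
    println(fromSysProps.contains("spark.hadoop.io.file.buffer.size")) // true

    // Not seen: a value set only on a user's own SparkConf instance afterwards;
    // that instance is never consulted when the eager config was built.
    val userConf = new SparkConf().set("spark.hadoop.io.file.buffer.size", "131072")
    println(fromSysProps.get("spark.hadoop.io.file.buffer.size")) // still "65536"
  }
}
```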
@vanzin
Contributor Author

vanzin commented Aug 7, 2014

BTW: I'd like to add a couple of simple tests for the YarnSparkHadoopUtil class, but #1724 adds the test suite for that class and I'll wait until that PR is merged before adding the tests.

@SparkQA

SparkQA commented Aug 7, 2014

QA tests have started for PR 1843. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18155/consoleFull

@SparkQA

SparkQA commented Aug 8, 2014

QA results for PR 1843:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18155/consoleFull

@vanzin
Contributor Author

vanzin commented Aug 8, 2014

Python errors only, unrelated?

@vanzin
Contributor Author

vanzin commented Aug 8, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 8, 2014

QA tests have started for PR 1843. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18210/consoleFull

@SparkQA

SparkQA commented Aug 8, 2014

QA results for PR 1843:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18210/consoleFull

Conflicts:
	core/src/main/scala/org/apache/spark/util/FileLogger.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
@vanzin
Contributor Author

vanzin commented Aug 20, 2014

Jenkins, test this please.

1 similar comment
@vanzin
Contributor Author

vanzin commented Aug 21, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 21, 2014

QA tests have started for PR 1843 at commit 0ac3fdf.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 21, 2014

QA tests have finished for PR 1843 at commit 0ac3fdf.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@@ -68,7 +68,26 @@ class SparkHadoopUtil extends Logging {
* Return an appropriate (subclass) of Configuration. Creating config can initializes some Hadoop
* subsystems.
*/
def newConfiguration(): Configuration = new Configuration()
def newConfiguration(conf: SparkConf): Configuration = {
Contributor

This is technically a breaking API change, we can't just do it like this. We have to add the old version.

Also, somewhat worryingly, I don't think SparkHadoopUtil was meant to be a public API, so it's weird that it gets used in our examples. We should probably mark it as @DeveloperApi and make sure that the examples don't use it.
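
A sketch of the simplest way to address the compatibility point, assuming the old signature is kept and forwarded to the new overload (the class name is a stand-in and the translation logic is elided):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

class SparkHadoopUtilCompatSketch {
  // New overload: builds the Hadoop config from an explicit SparkConf
  // ("spark.hadoop.*" translation elided here for brevity).
  def newConfiguration(conf: SparkConf): Configuration = new Configuration()

  // Old zero-argument signature retained so existing callers still compile
  // and link; it delegates using a freshly loaded SparkConf.
  def newConfiguration(): Configuration = newConfiguration(new SparkConf())
}
```

The squashed commit list at the bottom of this thread ("Restore old method for backwards compat") suggests the final patch took essentially this route.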

Contributor Author

I know the whole "deploy" package is excluded from MiMa checks (because I added the exclude at @pwendell's request). How is it documented that these packages are "private", if at all? Do we need explicit annotations in that case?

(http://spark.apache.org/docs/1.0.0/api/scala/#package does not list the package, so maybe that's it?)

Contributor

It's the same as the rest of the codebase -- everything that is "private" should be marked private[spark]. Things that we need to make public for advanced developers are @DeveloperApi. In this case, this thing has been public so we can't remove it, but we could at least mark it to tell people not to depend on it.

Contributor

BTW in this case you should mark this class and all its methods as @DeveloperApi.
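
For context, this is roughly what that request looks like when applied; the class name below is a stand-in and the exact placement in the merged patch may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.annotation.DeveloperApi

/**
 * :: DeveloperApi ::
 * Historically public, but not intended as a stable user-facing API.
 */
@DeveloperApi
class SparkHadoopUtilAnnotatedSketch {

  @DeveloperApi
  def newConfiguration(conf: SparkConf): Configuration = new Configuration()
}
```

In Spark, @DeveloperApi flags a public class or method as a lower-level API aimed at advanced developers that may change between minor releases.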

Contributor Author

ok, I added the annotation.

Marcelo Vanzin added 2 commits August 26, 2014 09:31
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/SimrSchedulerBackend.scala
@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 1843 at commit 3d345cb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 1843 at commit 3d345cb.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 1843 at commit 51e71cf.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have finished for PR 1843 at commit 51e71cf.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,
    • "$FWDIR"/bin/spark-submit --class $CLASS "$

@mateiz
Contributor

mateiz commented Aug 27, 2014

@vanzin unfortunately this no longer merges cleanly, probably due to your YARN change. Mind rebasing it?

Conflicts:
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
	yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientClusterScheduler.scala
	yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterScheduler.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
@vanzin
Contributor Author

vanzin commented Aug 27, 2014

Jenkins, test this please.

@mateiz
Contributor

mateiz commented Aug 28, 2014

add to whitelist and test this please

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 1843 at commit f179013.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 1843 at commit f179013.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala
@vanzin
Contributor Author

vanzin commented Aug 29, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have started for PR 1843 at commit 52daf35.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have finished for PR 1843 at commit 52daf35.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@mateiz
Contributor

mateiz commented Aug 30, 2014

Thanks Marcelo! I've merged this in.

@asfgit asfgit closed this in b6cf134 Aug 30, 2014
@vanzin vanzin deleted the SPARK-2889 branch September 2, 2014 17:37
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and makes that method do the translation that previously was done only inside SparkContext.

The places that were not changed fall into one of the following categories:
- Test code where this doesn't really matter
- Places deep in the code where plumbing SparkConf would be too difficult for very little gain
- Default values for arguments - since the caller can provide their own config in that case

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#1843 from vanzin/SPARK-2889 and squashes the following commits:

52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works.
53f9506 [Marcelo Vanzin] Add DeveloperApi annotation.
3d345cb [Marcelo Vanzin] Restore old method for backwards compat.
fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
3f26760 [Marcelo Vanzin] Compilation fix.
f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil.
b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument.
1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023