[SPARK-2889] Create Hadoop config objects consistently. #1843

Closed
vanzin wants to merge 11 commits into apache:master from vanzin:SPARK-2889

Conversation

@vanzin
Contributor

vanzin commented Aug 7, 2014

Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and makes that method do the translation that previously was done only inside SparkContext (a sketch of that translation follows the list below).

The places that were not changed fall into one of the following categories:

  • Test code where this doesn't really matter
  • Places deep in the code where plumbing SparkConf would be too difficult for very little gain
  • Default values for arguments - since the caller can provide their own config in that case
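
To make the translation above concrete, here is a minimal illustrative sketch (the object name and method body are assumptions for illustration, not the exact patch) of how "spark.hadoop.*" entries can be copied from a SparkConf into a Hadoop Configuration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object HadoopConfSketch {
  // Copy every "spark.hadoop.foo" entry from the Spark configuration into a
  // Hadoop Configuration as "foo", so user-supplied Hadoop options take
  // effect wherever this config object ends up being used.
  def newConfiguration(conf: SparkConf): Configuration = {
    val hadoopConf = new Configuration()
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
      }
    }
    hadoopConf
  }
}
```

With a helper along these lines, a setting such as spark.hadoop.fs.defaultFS on the SparkConf surfaces as fs.defaultFS in the resulting Hadoop config.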

Marcelo Vanzin added 4 commits August 7, 2014 13:28
This is the basic grunt work; code doesn't fully compile yet, since
I'll do some of the more questionable changes in separate commits.

Instead of using "new Configuration()" where a configuration is
needed, let the caller provide a context-appropriate config
object.

This is sort of hackish, since it doesn't account for any customization
someone might make to SparkConf before they actually start executing spark
code. Instead, this will only consider options available in the
system properties when creating the hadoop conf.
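
As a hedged illustration of the limitation described in this commit message (the property name below is an arbitrary example, not taken from the patch), a config built eagerly from system properties sees values set with -D or System.setProperty, but not values added to a separate SparkConf instance later in user code:

```scala
import org.apache.spark.SparkConf

object EagerConfLimitation {
  def main(args: Array[String]): Unit = {
    // Seen by an eagerly built config: spark.* system properties are loaded
    // by default when a SparkConf is constructed.
    System.setProperty("spark.hadoop.io.file.buffer.size", "65536")
    val fromSysProps = new SparkConf()
    println(fromSysProps.contains("spark.hadoop.io.file.buffer.size")) // true

    // Not seen: a value set only on a user's own SparkConf instance afterwards;
    // that instance is never consulted when the eager config was built.
    val userConf = new SparkConf().set("spark.hadoop.io.file.buffer.size", "131072")
    println(fromSysProps.get("spark.hadoop.io.file.buffer.size")) // still "65536"
  }
}
```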
@vanzin
Contributor Author

vanzin commented Aug 7, 2014

BTW: I'd like to add a couple of simple tests for the YarnSparkHadoopUtil class, but #1724 adds the test suite for that class and I'll wait until that PR is merged before adding the tests.

@SparkQA

SparkQA commented Aug 7, 2014

QA tests have started for PR 1843. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18155/consoleFull

@SparkQA

SparkQA commented Aug 8, 2014

QA results for PR 1843:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18155/consoleFull

@vanzin
Contributor Author

vanzin commented Aug 8, 2014

Python errors only, unrelated?

@vanzin
Contributor Author

vanzin commented Aug 8, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 8, 2014

QA tests have started for PR 1843. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18210/consoleFull

@SparkQA

SparkQA commented Aug 8, 2014

QA results for PR 1843:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18210/consoleFull

Conflicts:
	core/src/main/scala/org/apache/spark/util/FileLogger.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
@vanzin
Contributor Author

vanzin commented Aug 20, 2014

Jenkins, test this please.

1 similar comment
@vanzin
Contributor Author

vanzin commented Aug 21, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 21, 2014

QA tests have started for PR 1843 at commit 0ac3fdf.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 21, 2014

QA tests have finished for PR 1843 at commit 0ac3fdf.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@@ -68,7 +68,26 @@ class SparkHadoopUtil extends Logging {
* Return an appropriate (subclass) of Configuration. Creating config can initializes some Hadoop
* subsystems.
*/
def newConfiguration(): Configuration = new Configuration()
def newConfiguration(conf: SparkConf): Configuration = {
Contributor

This is technically a breaking API change, we can't just do it like this. We have to add the old version.

Also, somewhat worryingly, I don't think SparkHadoopUtil was meant to be a public API, so it's weird that it gets used in our examples. We should probably mark it as @DeveloperApi and make sure that the examples don't use it.
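
A sketch of the simplest way to address the compatibility point, assuming the old signature is kept and forwarded to the new overload (the class name is a stand-in and the translation logic is elided):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

class SparkHadoopUtilCompatSketch {
  // New overload: builds the Hadoop config from an explicit SparkConf
  // ("spark.hadoop.*" translation elided here for brevity).
  def newConfiguration(conf: SparkConf): Configuration = new Configuration()

  // Old zero-argument signature retained so existing callers still compile
  // and link; it delegates using a freshly loaded SparkConf.
  def newConfiguration(): Configuration = newConfiguration(new SparkConf())
}
```

The squashed commit list at the bottom of this thread ("Restore old method for backwards compat") suggests the final patch took essentially this route.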

Contributor Author

I know the whole "deploy" package is excluded from MiMa checks (because I added the exclude at @pwendell's request). How is it documented that these packages are "private", if at all? Do we need explicit annotations in that case?

(http://spark.apache.org/docs/1.0.0/api/scala/#package does not list the package, so maybe that's it?)

Contributor

It's the same as the rest of the codebase -- everything that is "private" should be marked private[spark]. Things that we need to make public for advanced developers are @DeveloperApi. In this case, this thing has been public so we can't remove it, but we could at least mark it to tell people not to depend on it.

Contributor

BTW in this case you should mark this class and all its methods as @DeveloperApi.
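
For context, this is roughly what that request looks like when applied; the class name below is a stand-in and the exact placement in the merged patch may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.annotation.DeveloperApi

/**
 * :: DeveloperApi ::
 * Historically public, but not intended as a stable user-facing API.
 */
@DeveloperApi
class SparkHadoopUtilAnnotatedSketch {

  @DeveloperApi
  def newConfiguration(conf: SparkConf): Configuration = new Configuration()
}
```

In Spark, @DeveloperApi flags a public class or method as a lower-level API aimed at advanced developers that may change between minor releases.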

Contributor Author

ok, I added the annotation.

Marcelo Vanzin added 2 commits August 26, 2014 09:31
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/SimrSchedulerBackend.scala
@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 1843 at commit 3d345cb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 1843 at commit 3d345cb.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 1843 at commit 51e71cf.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have finished for PR 1843 at commit 51e71cf.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,
    • "$FWDIR"/bin/spark-submit --class $CLASS "$

@mateiz
Contributor

mateiz commented Aug 27, 2014

@vanzin unfortunately this no longer merges cleanly, probably due to your YARN change. Mind rebasing it?

Conflicts:
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
	yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientClusterScheduler.scala
	yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterScheduler.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala
@vanzin
Contributor Author

vanzin commented Aug 27, 2014

Jenkins, test this please.

@mateiz
Contributor

mateiz commented Aug 28, 2014

add to whitelist and test this please

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 1843 at commit f179013.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 1843 at commit f179013.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala
@vanzin
Contributor Author

vanzin commented Aug 29, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have started for PR 1843 at commit 52daf35.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have finished for PR 1843 at commit 52daf35.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,

@mateiz
Contributor

mateiz commented Aug 30, 2014

Thanks Marcelo! I've merged this in.

@asfgit asfgit closed this in b6cf134 Aug 30, 2014
@vanzin vanzin deleted the SPARK-2889 branch September 2, 2014 17:37
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and makes that method do the translation that previously was done only inside SparkContext.

The places that were not changed fall into one of the following categories:
- Test code where this doesn't really matter
- Places deep in the code where plumbing SparkConf would be too difficult for very little gain
- Default values for arguments - since the caller can provide their own config in that case

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#1843 from vanzin/SPARK-2889 and squashes the following commits:

52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works.
53f9506 [Marcelo Vanzin] Add DeveloperApi annotation.
3d345cb [Marcelo Vanzin] Restore old method for backwards compat.
fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
3f26760 [Marcelo Vanzin] Compilation fix.
f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil.
b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument.
1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023