[SPARK-8302] Support heterogeneous cluster install paths on YARN. #6752
Conversation
Some users have Hadoop installations on different paths across their cluster. Currently, that makes it hard to set up some configuration in Spark, since that requires hardcoding paths to jar files or native libraries, which wouldn't work on such a cluster.

This change introduces a couple of YARN-specific configurations that instruct the backend to replace certain paths when launching remote processes. That way, if the configuration says the Spark jar is in "/spark/spark.jar", and also says that "/spark" should be replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location of the jar.

Coupled with YARN's environment whitelist (which allows certain env variables to be exposed to containers), this allows users to support such heterogeneous environments, as long as a single replacement is enough. (Otherwise, this feature would need to be extended to support multiple path replacements.)
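At its core, the translation described above is a plain string substitution performed on the gateway before the container launch context is built. A minimal sketch of the idea (the helper name `translate` and the paths are illustrative, not from the PR):

```scala
// Illustrative sketch of the substitution described above; `translate` is a
// made-up name, and the PR performs this inside the YARN client code.
def translate(localPrefix: String, clusterPrefix: String, path: String): String =
  path.replace(localPrefix, clusterPrefix)

// The gateway knows the jar under /usr/lib/spark; the NM expands the
// {{SPARK_INSTALL_DIR}} reference when it starts the container.
val remote = translate("/usr/lib/spark", "{{SPARK_INSTALL_DIR}}", "/usr/lib/spark/spark.jar")
// remote == "{{SPARK_INSTALL_DIR}}/spark.jar"
```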
note: check comments on SPARK-8302 for an alternative approach that I thought was too intrusive.
Test build #34648 has finished for PR 6752 at commit
Test build #34652 has finished for PR 6752 at commit
Jenkins, retest this please.
Test build #34658 has finished for PR 6752 at commit
      path: String,
      env: HashMap[String, String]): Unit =
    YarnSparkHadoopUtil.addPathToEnvironment(env, Environment.CLASSPATH.name,
      getClusterPath(conf, path))
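For context, a simplified stand-in for the `addPathToEnvironment` call in the fragment above (a sketch, not the real helper: the actual `YarnSparkHadoopUtil.addPathToEnvironment` uses YARN's platform-aware classpath separator, while `:` here is an assumption for brevity):

```scala
import scala.collection.mutable.HashMap

// Simplified sketch: append a path to an environment variable, creating the
// entry if it is absent. The real helper behaves like this but uses YARN's
// cross-platform separator instead of a hardcoded ':'.
def addPathToEnvironment(env: HashMap[String, String], key: String, value: String): Unit =
  env(key) = env.get(key).map(_ + ":" + value).getOrElse(value)
```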
How about calling getClusterPath outside of addClasspathEntry? Many entries don't need the replacement, for example spark_conf and pyfiles.
I think it should be safe, because it's very unlikely that those paths would contain the string being replaced, but I see your point. Let me take a look.
Test build #34702 has finished for PR 6752 at commit
   *
   * If either config is not available, the input path is returned.
   */
  def getClusterPath(conf: SparkConf, path: String): String = {
I think it would be better to put getClusterPath into the YarnSparkHadoopUtil object.
It's very tightly coupled with all the code in this class that handles paths, and not really used anywhere else... perhaps when we clean up this code we can move all the path-handling code to a more appropriate location.
LGTM, pending any further comments
    val localPath = conf.get("spark.yarn.config.localPath", null)
    val clusterPath = conf.get("spark.yarn.config.clusterPath", null)
    if (localPath != null && clusterPath != null) {
      path.replace(localPath, clusterPath)
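Assembled from the diff fragments above, a self-contained sketch of getClusterPath (a plain Map stands in for SparkConf so the example runs without Spark on the classpath; key names match the diff at this point in the review):

```scala
// Self-contained version of the diff above; Map[String, String] stands in
// for SparkConf.get(key, defaultValue). Returns the input path unchanged if
// either config is missing.
def getClusterPath(conf: Map[String, String], path: String): String = {
  val localPath = conf.getOrElse("spark.yarn.config.localPath", null)
  val clusterPath = conf.getOrElse("spark.yarn.config.clusterPath", null)
  if (localPath != null && clusterPath != null) {
    path.replace(localPath, clusterPath)
  } else {
    path
  }
}
```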
do you think it makes sense to replace anywhere in the path? maybe it would be better to only replace a prefix? I know this is more general, but I'm just worried it might do some inadvertent replacement.
(if we really want it to be general, we could use a regex ... but I feel like simpler is probably better)
The problem is that configs such as SPARK_DIST_CLASSPATH may contain multiple instances of the path. So either you have to do this (which is simpler but in some really rare cases might cause an issue) or you have to parse classpaths (i.e. break them down into individual entries) and perform a prefix replacement on each entry.
makes sense ... again, I was thinking that parsing classpaths would be simpler because I forgot about all the platforms.
I just left some minor comments on the code. I like the simplicity of this over some more complicated proposals. The only thing I'm still thinking about is how to best document this -- both within the code and for users. I'll make some more comments for docs, but they are just suggestions, I'm not entirely certain about the best way to do it.
I'm also not sure where to document this; while it's obviously a user config, it's not something I expect users to fiddle with; this is something that admins would set once, or even would be set automatically by management tools, so users wouldn't ever need to worry about it.
   *   only be valid in the local process.
   * - spark.yarn.config.clusterPath: a string with which to replace the local path. This may
   *   contain, for example, env variable references, which will be expanded by the NMs when
   *   starting containers.
I find "local" to be very confusing -- local from the viewpoint of which node? It's really "local" from the viewpoint of the gateway node. (Maybe it ends up being the same thing, since this is always run on the gateway, but just looking at this in isolation it's not clear.)
So I'm not crazy about these names either, but what about something like:
spark.yarn.config.gatewayPath
and spark.yarn.config.gatewayClusterReplacementPath
?
Also, can you expand the initial line of the doc slightly to include more from the overall PR description? E.g., something like "Returns the path to be sent to the NM for building the command line to launch Spark containers. The NM will perform variable substitution of the expanded path". I know this is in your description of clusterPath, but I would like it a little more prominent ... also just a suggestion ...
The other part that needs to get added is the user docs in running-on-yarn.md -- I suppose that can wait till we agree on the approach.
I think our comments about user config crossed each other ... anyhow:
I agree that this is not the kind of thing a normal user will ever want to touch, but it does need to be discoverable to admins. I think we need to stick something in the docs which at least explains the purpose of the feature.
No, I was actually replying to you. :-) It's just that github doesn't understand the concept of "threads"... |
Test build #34958 has finished for PR 6752 at commit
lgtm |
Jenkins, retest this please. |
Test build #35790 has finished for PR 6752 at commit
thanks @vanzin, merging to master