
[SPARK-8302] Support heterogeneous cluster install paths on YARN. #6752

Closed
vanzin wants to merge 4 commits into master from SPARK-8302

Conversation

@vanzin
Contributor

vanzin commented Jun 11, 2015

Some users have Hadoop installations on different paths across
their cluster. Currently, that makes it hard to set up some
configuration in Spark since that requires hardcoding paths to
jar files or native libraries, which wouldn't work on such a cluster.

This change introduces a couple of YARN-specific configurations
that instruct the backend to replace certain paths when launching
remote processes. That way, if the configuration says the Spark
jar is in "/spark/spark.jar", and also says that "/spark" should be
replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
of the jar.

Coupled with YARN's environment whitelist (which allows certain
env variables to be exposed to containers), this allows users to
support such heterogeneous environments, as long as a single
replacement is enough. (Otherwise, this feature would need to be
extended to support multiple path replacements.)
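
To make the mechanics concrete, here is a minimal, hedged sketch of the substitution described above. The config names come from the diff discussed below in this thread; the paths and the SPARK_INSTALL_DIR variable are purely illustrative:

```scala
// Illustrative values only. SPARK_INSTALL_DIR is a hypothetical env variable
// that each NM exposes; it must be listed in YARN's
// yarn.nodemanager.env-whitelist for containers to inherit it.
val localPath = "/spark"                  // spark.yarn.config.localPath
val clusterPath = "{{SPARK_INSTALL_DIR}}" // spark.yarn.config.clusterPath

// The gateway knows the jar as /spark/spark.jar; the container launch context
// sent to the NMs references it through the replacement instead, and the NM
// expands the {{...}} reference when starting the container.
val localJar = "/spark/spark.jar"
val remoteJar = localJar.replace(localPath, clusterPath)
println(remoteJar) // {{SPARK_INSTALL_DIR}}/spark.jar
```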

@vanzin
Contributor Author

vanzin commented Jun 11, 2015

note: check comments on SPARK-8302 for an alternative approach that I thought was too intrusive.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34648 has finished for PR 6752 at commit a5e1f68.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34652 has finished for PR 6752 at commit 2e9cc9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OverrideFunctionRegistry(underlying: FunctionRegistry) extends FunctionRegistry
    • abstract class LeafMathExpression(c: Double, name: String)
    • case class EulerNumber() extends LeafMathExpression(math.E, "E")
    • case class Pi() extends LeafMathExpression(math.Pi, "PI")

@vanzin
Contributor Author

vanzin commented Jun 11, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34658 has finished for PR 6752 at commit 2e9cc9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
private def addClasspathEntry(
    path: String,
    env: HashMap[String, String]): Unit =
  YarnSparkHadoopUtil.addPathToEnvironment(env, Environment.CLASSPATH.name,
    getClusterPath(conf, path))
```
Contributor

how about calling getClusterPath outside addClasspathEntry? Many paths don't need the replacement, for example spark_conf and pyfiles.

Contributor Author

I think it should be safe, because it's very unlikely that those paths would contain the string being replaced, but I see your point. Let me take a look.
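
For concreteness, a small self-contained sketch of what the suggestion amounts to, with stand-in stubs (getClusterPath's real config lookup and addClasspathEntry's real behavior are simplified here, and the example paths are made up):

```scala
import scala.collection.mutable.HashMap

// Stand-in for the real implementation: replace the configured local prefix.
def getClusterPath(path: String): String =
  path.replace("/spark", "{{SPARK_INSTALL_DIR}}")

// Stand-in for the real helper: append an entry to the CLASSPATH variable.
def addClasspathEntry(path: String, env: HashMap[String, String]): Unit =
  env("CLASSPATH") = env.get("CLASSPATH").map(_ + ":" + path).getOrElse(path)

val env = new HashMap[String, String]()
// Apply the replacement only where it is wanted...
addClasspathEntry(getClusterPath("/spark/spark.jar"), env)
// ...and pass local-only paths (e.g. spark_conf, pyfiles) through verbatim.
addClasspathEntry("/tmp/app123/spark_conf", env)
```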

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34702 has finished for PR 6752 at commit 0aa2a02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
 *
 * If either config is not available, the input path is returned.
 */
def getClusterPath(conf: SparkConf, path: String): String = {
```
Contributor

I think it would be better to put getClusterPath into the YarnSparkHadoopUtil object.

Contributor Author

It's very tightly coupled with all the code in this class that handles paths, and not really used anywhere else... perhaps when we clean up this code we can move all the path-handling code to a more appropriate location.

@srowen
Member

srowen commented Jun 14, 2015

LGTM, pending any further comments

```scala
val localPath = conf.get("spark.yarn.config.localPath", null)
val clusterPath = conf.get("spark.yarn.config.clusterPath", null)
if (localPath != null && clusterPath != null) {
  path.replace(localPath, clusterPath)
```
Contributor

do you think it makes sense to replace anywhere in the path? maybe it would be better to only replace a prefix? I know this is more general, but I'm just worried it might do some inadvertent replacement.

(if we really want it to be general, we could use a regex ... but I feel like simpler is probably better)

Contributor Author

The problem is that configs such as SPARK_DIST_CLASSPATH may contain multiple instances of the path. So either you have to do this (which is simpler but in some really rare cases might cause an issue) or you have to parse classpaths (i.e. break them down into individual entries) and perform a prefix replacement on each entry.
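
A small illustration of the point (made-up paths; {{SPARK_HOME}} stands in for whatever variable the cluster exposes):

```scala
// A classpath-style config can contain the local prefix several times.
val distClasspath = "/spark/lib/a.jar:/spark/lib/b.jar:/etc/hadoop/conf"

// String.replace substitutes every literal occurrence in one pass, with no
// need to split on a platform-specific classpath separator.
val rewritten = distClasspath.replace("/spark", "{{SPARK_HOME}}")
println(rewritten) // {{SPARK_HOME}}/lib/a.jar:{{SPARK_HOME}}/lib/b.jar:/etc/hadoop/conf
```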

Contributor

Makes sense ... again, I was thinking that parsing classpaths would be simpler because I forgot about all the platforms.

@squito
Contributor

squito commented Jun 15, 2015

I just left some minor comments on the code. I like the simplicity of this over some of the more complicated proposals. The only thing I'm still thinking about is how best to document this -- both within the code and for users. I'll make some more comments for the docs, but they are just suggestions; I'm not entirely certain about the best way to do it.

@vanzin
Contributor Author

vanzin commented Jun 15, 2015

I'm also not sure where to document this; while it's obviously a user config, it's not something I expect users to fiddle with. This is something that admins would set once, or that would even be set automatically by management tools, so users wouldn't ever need to worry about it.

```scala
 * only be valid in the local process.
 * - spark.yarn.config.clusterPath: a string with which to replace the local path. This may
 *   contain, for example, env variable references, which will be expanded by the NMs when
 *   starting containers.
```
Contributor

I find "local" to be very confusing -- local from the viewpoint of which node? It's really "local" from the viewpoint of the gateway node. (Maybe it ends up being the same thing, since this is always run on the gateway, but just looking at this in isolation it's not clear.)

So I'm not crazy about these names either, but how about something like spark.yarn.config.gatewayPath and spark.yarn.config.gatewayClusterReplacementPath?

Also, can you expand the initial line of the doc slightly to include more from the overall PR description? E.g., something like "Returns the path to be sent to the NM for building the command line to launch Spark containers. The NM will perform variable substitution of the expanded path." I know this is in your description of clusterPath, but I would like it a little more prominent ... also just a suggestion ...
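
For reference, a hedged sketch of what the expanded doc comment might look like, adapting the wording suggested above (this is not the final merged text):

```scala
/**
 * Returns the path to be sent to the NM for building the command line to
 * launch Spark containers. The NM will perform variable substitution of
 * env-variable references in the path when starting containers.
 *
 * If either config is not available, the input path is returned.
 */
def getClusterPath(conf: SparkConf, path: String): String = ...
```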

@squito
Contributor

squito commented Jun 15, 2015

The other part that needs to get added is the user docs in running-on-yarn.md -- I suppose that can wait till we agree on the approach.

@squito
Contributor

squito commented Jun 15, 2015

I think our comments about user config crossed each other ... anyhow:

> I'm also not sure where to document this; while it's obviously a user config, it's not something I expect users to fiddle with. This is something that admins would set once, or that would even be set automatically by management tools, so users wouldn't ever need to worry about it.

I agree that this is not the kind of thing a normal user will ever want to touch, but it does need to be discoverable to admins. I think we need to stick something in the docs which at least explains the purpose of the feature.

@vanzin
Contributor Author

vanzin commented Jun 15, 2015

> I think our comments about user config crossed each other ... anyhow:

No, I was actually replying to you. :-) It's just that github doesn't understand the concept of "threads"...

@SparkQA

SparkQA commented Jun 15, 2015

Test build #34958 has finished for PR 6752 at commit 4bff8d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jun 16, 2015

lgtm

@vanzin
Contributor Author

vanzin commented Jun 25, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 25, 2015

Test build #35790 has finished for PR 6752 at commit 4bff8d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jun 26, 2015

thanks @vanzin, merging to master

@asfgit asfgit closed this in 37bf76a Jun 26, 2015
@vanzin vanzin deleted the SPARK-8302 branch June 26, 2015 22:45