[SPARK-8302] Support heterogeneous cluster install paths on YARN. #6752
Conversation
Some users have Hadoop installations on different paths across their cluster. Currently, that makes it hard to set up some configuration in Spark, since that requires hardcoding paths to jar files or native libraries, which wouldn't work on such a cluster.

This change introduces a couple of YARN-specific configurations that instruct the backend to replace certain paths when launching remote processes. That way, if the configuration says the Spark jar is in "/spark/spark.jar", and also says that "/spark" should be replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location of the jar.

Coupled with YARN's environment whitelist (which allows certain env variables to be exposed to containers), this allows users to support such heterogeneous environments, as long as a single replacement is enough. (Otherwise, this feature would need to be extended to support multiple path replacements.)
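At its core, the translation described above is a plain string substitution performed on the gateway before the container launch context is built. A minimal sketch of the idea (the helper name `translate` and the paths are illustrative, not from the PR):

```scala
// Illustrative sketch of the substitution described above; `translate` is a
// made-up name, and the PR performs this inside the YARN client code.
def translate(localPrefix: String, clusterPrefix: String, path: String): String =
  path.replace(localPrefix, clusterPrefix)

// The gateway knows the jar under /usr/lib/spark; the NM expands the
// {{SPARK_INSTALL_DIR}} reference when it starts the container.
val remote = translate("/usr/lib/spark", "{{SPARK_INSTALL_DIR}}", "/usr/lib/spark/spark.jar")
// remote == "{{SPARK_INSTALL_DIR}}/spark.jar"
```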
note: check comments on SPARK-8302 for an alternative approach that I thought was too intrusive.
Test build #34648 has finished for PR 6752 at commit
Test build #34652 has finished for PR 6752 at commit
Jenkins, retest this please.
Test build #34658 has finished for PR 6752 at commit
      path: String,
      env: HashMap[String, String]): Unit =
    YarnSparkHadoopUtil.addPathToEnvironment(env, Environment.CLASSPATH.name,
      getClusterPath(conf, path))
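For context, a simplified stand-in for the `addPathToEnvironment` call in the fragment above (a sketch, not the real helper: the actual `YarnSparkHadoopUtil.addPathToEnvironment` uses YARN's platform-aware classpath separator, while `:` here is an assumption for brevity):

```scala
import scala.collection.mutable.HashMap

// Simplified sketch: append a path to an environment variable, creating the
// entry if it is absent. The real helper behaves like this but uses YARN's
// cross-platform separator instead of a hardcoded ':'.
def addPathToEnvironment(env: HashMap[String, String], key: String, value: String): Unit =
  env(key) = env.get(key).map(_ + ":" + value).getOrElse(value)
```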
How about calling getClusterPath outside of addClasspathEntry? Many entries don't need the replacement, for example spark_conf and pyfiles.
I think it should be safe, because it's very unlikely that those paths would contain the string being replaced, but I see your point. Let me take a look.
Test build #34702 has finished for PR 6752 at commit
   *
   * If either config is not available, the input path is returned.
   */
  def getClusterPath(conf: SparkConf, path: String): String = {
I think it would be better to put getClusterPath into the YarnSparkHadoopUtil object.
It's very tightly coupled with all the code in this class that handles paths, and not really used anywhere else... perhaps when we clean up this code we can move all the path-handling code to a more appropriate location.
LGTM, pending any further comments
    val localPath = conf.get("spark.yarn.config.localPath", null)
    val clusterPath = conf.get("spark.yarn.config.clusterPath", null)
    if (localPath != null && clusterPath != null) {
      path.replace(localPath, clusterPath)
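Assembled from the diff fragments above, a self-contained sketch of getClusterPath (a plain Map stands in for SparkConf so the example runs without Spark on the classpath; key names match the diff at this point in the review):

```scala
// Self-contained version of the diff above; Map[String, String] stands in
// for SparkConf.get(key, defaultValue). Returns the input path unchanged if
// either config is missing.
def getClusterPath(conf: Map[String, String], path: String): String = {
  val localPath = conf.getOrElse("spark.yarn.config.localPath", null)
  val clusterPath = conf.getOrElse("spark.yarn.config.clusterPath", null)
  if (localPath != null && clusterPath != null) {
    path.replace(localPath, clusterPath)
  } else {
    path
  }
}
```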
do you think it makes sense to replace anywhere in the path? maybe it would be better to only replace a prefix? I know this is more general, but I'm just worried it might do some inadvertent replacement.
(if we really want it to be general, we could use a regex ... but I feel like simpler is probably better)
The problem is that configs such as SPARK_DIST_CLASSPATH may contain multiple instances of the path. So either you have to do this (which is simpler but in some really rare cases might cause an issue) or you have to parse classpaths (i.e. break them down into individual entries) and perform a prefix replacement on each entry.
makes sense ... again, I was thinking that parsing classpaths would be simpler because I forgot about all the platforms.
I just left some minor comments on the code. I like the simplicity of this over some more complicated proposals. The only thing I'm still thinking about is how to best document this -- both within the code and for users. I'll make some more comments for docs, but they are just suggestions, I'm not entirely certain about the best way to do it.
I'm also not sure where to document this; while it's obviously a user config, it's not something I expect users to fiddle with; this is something that admins would set once, or even would be set automatically by management tools, so users wouldn't ever need to worry about it.
   *   only be valid in the local process.
   * - spark.yarn.config.clusterPath: a string with which to replace the local path. This may
   *   contain, for example, env variable references, which will be expanded by the NMs when
   *   starting containers.
I find "local" to be very confusing -- local from the viewpoint of which node? It's really "local" from the viewpoint of the gateway node. (Maybe it ends up being the same thing, since this is always run on the gateway, but just looking at this in isolation it's not clear.)
So I'm not crazy about these names either, but what about something like:
spark.yarn.config.gatewayPath
and spark.yarn.config.gatewayClusterReplacementPath
?
Also, can you expand the initial line of the doc slightly to include more from the overall PR description? E.g., something like "Returns the path to be sent to the NM for building the command line to launch Spark containers. The NM will perform variable substitution of the expanded path". I know this is in your description of clusterPath, but I would like it a little more prominent ... also just a suggestion ...
The other part that needs to get added is the user docs in running-on-yarn.md -- I suppose that can wait till we agree on the approach.
I think our comments about user config crossed each other ... anyhow:
I agree that this is not the kind of thing a normal user will ever want to touch, but it does need to be discoverable to admins. I think we need to stick something in the docs which at least explains the purpose of the feature.
No, I was actually replying to you. :-) It's just that github doesn't understand the concept of "threads"... |
Test build #34958 has finished for PR 6752 at commit
lgtm |
Jenkins, retest this please. |
Test build #35790 has finished for PR 6752 at commit
thanks @vanzin, merging to master