[SPARK-9825][yarn] Do not overwrite final Hadoop config entries. #18370

Closed
wants to merge 3 commits

Conversation

vanzin
Contributor

vanzin commented Jun 20, 2017

When localizing the gateway config files in a YARN application, avoid
overwriting final configs by distributing the gateway files to a separate
directory, and explicitly loading them into the Hadoop config, instead
of placing those files before the cluster's files in the classpath.

This is done by saving the gateway's config to a separate XML file
distributed with the rest of the Spark app's config, and loading that
file when creating a new config through YarnSparkHadoopUtil.

Tested with existing unit tests, and by verifying the behavior in a YARN
cluster (final values are not overridden, non-final values are).
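
For reference, a minimal sketch of the overlay idea described above (not the PR's exact code): Hadoop's Configuration refuses to let a later-loaded resource replace a property that an earlier resource marked final, so adding the gateway file after the cluster's files yields exactly the behavior described. The method name and the use of `__spark_conf__.xml` as the overlay file name are assumptions for illustration (the file name matches the constant shown later in this thread).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: build a config from the cluster's *-site.xml files (picked up from the
// classpath), then overlay the localized gateway file. Entries the cluster marks
// <final>true</final> keep their cluster values; non-final entries take the gateway values.
def newConfWithGatewayOverlay(localizedConfDir: String): Configuration = {
  val conf = new Configuration()  // loads the cluster's site files from the classpath
  conf.addResource(new Path(localizedConfDir, "__spark_conf__.xml"))
  conf
}
```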

The implementation is a little hacky, but mostly because there's no API
that tells you which are the files that are loaded by a YarnConfiguration
object; so the list of files was obtained empirically and is hardcoded
in the code.
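
Purely to illustrate what "hardcoded" means here, the list would be something like the *-site.xml files YarnConfiguration is observed to pull in; the actual names in the patch may differ.

```scala
// Illustrative only: file names gathered empirically, since YarnConfiguration exposes
// no API that lists the resources it loads.
val hadoopConfFiles = Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml")
```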

@vanzin
Contributor Author

vanzin commented Jun 20, 2017

I also tried a different approach (dumping the conf object to a new XML file and adding that as a resource), but the dumped file contained all the default values from the config object (in spite of what the API docs say), so even though my config files only had a handful of entries, the final XML file was over 100k (way bigger than the sum of the original files).

That might be cleaner, but it is slower.
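
For context, that alternative amounts to something like the following sketch (helper and output file names are assumed). Configuration.writeXml serializes every resolved property, defaults included, which is why the output ballooned to ~100k even though the source files only had a handful of entries.

```scala
import java.io.{File, FileOutputStream}
import org.apache.hadoop.conf.Configuration

// Sketch of the "dump the whole conf" alternative: writeXml emits all resolved
// properties, including defaults, not just the entries from the gateway's files.
def dumpConf(conf: Configuration, dir: File): File = {
  val out = new File(dir, "__spark_hadoop_conf__.xml")  // assumed name, for illustration
  val os = new FileOutputStream(out)
  try conf.writeXml(os) finally os.close()
  out
}
```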

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78323 has finished for PR 18370 at commit 4c35cce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 7, 2017

@tgravescs any opinions on this (or the alternate solution described above)?

@tgravescs
Contributor

So the issue here is that the gateway boxes have a different Hadoop config on purpose, because from those boxes you need a proxy, but on the actual YARN nodes you don't need the proxy?

I thought someone had the exact opposite issue, where they wanted the gateway configs to take precedence over the server-side configs. I guess that is fine if it's not final.

I agree with you that it's a bit hacky and not very nice that we have to hardcode the files.

I was actually thinking of the second approach you tried. Like you mention, it will definitely pull in all the defaults, so it might be larger than the config files, but that's going to depend on the setup. How large is that file (in KB/MB)?

@vanzin
Contributor Author

vanzin commented Jul 10, 2017

Yeah, the original feature was for someone who wanted to override certain configs locally; that makes sense (e.g. some user has a custom config with a different replication factor than the cluster's), but it shouldn't overwrite final configs. It's still possible to override final configs if the user really wants to (e.g. by setting spark.hadoop.* in Spark's config).
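As a rough illustration of that escape hatch (the property name is just an example): Spark copies any spark.hadoop.*-prefixed setting into the Hadoop Configuration it builds, so a user who really wants a different value can still force one from the Spark side.

```scala
import org.apache.spark.SparkConf

// Example only: spark.hadoop.* entries are applied programmatically to the application's
// Hadoop Configuration, so they win over values coming from the config files.
val sparkConf = new SparkConf().set("spark.hadoop.dfs.replication", "2")
```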

How large is that file (in KB/MB)?

The generated XML is about 100k (compared to about 10k for the XML files in the config directory). It's not too horrible if you think about it. Maybe it's better to go that route.

@tgravescs
Contributor

Yeah, I think that is going to be safer. That is what MapReduce does with its job.xml file. I'm not sure about Tez.

@SparkQA

SparkQA commented Jul 10, 2017

Test build #79480 has finished for PR 18370 at commit 3d59f0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// File containing the conf archive in the AM. See prepareLocalResources().
val LOCALIZED_CONF_ARCHIVE = LOCALIZED_CONF_DIR + ".zip"

// Name of the file in the conf archive containing Spark configuration.
val SPARK_CONF_FILE = "__spark_conf__.properties"

// Name of the file containing the gateway's Hadoop configuration, to be overlayed on top of the
// cluster's Hadoop config.
val SPARK_HADOOP_CONF_FILE = "__spark_conf__.xml"

perhaps call it spark_hadoop_conf.xml for clarity

@SparkQA

SparkQA commented Jul 12, 2017

Test build #79573 has finished for PR 18370 at commit 302f16c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

+1. The only downside here is that we are now shipping the XML files twice: still the entire Hadoop conf dir, and now also the single XML file. It feels like we shouldn't ship the Hadoop conf dir, but that might break some people. When I ran this, my combined spark_hadoop_conf.xml was about 160k.

@vanzin
Contributor Author

vanzin commented Jul 14, 2017

it feels like we shouldn't ship the hadoop conf dir but that might break some people

Yeah, we rely on that to ship other configs (like Hive and HBase) with the application without the user having to mess with --files and whatnot.

Thanks, merging to master.
