[SPARK-9825][yarn] Do not overwrite final Hadoop config entries. #18370

Closed
wants to merge 3 commits

Conversation

vanzin
Contributor

vanzin commented Jun 20, 2017

When localizing the gateway config files in a YARN application, avoid
overwriting final configs by distributing the gateway files to a separate
directory, and explicitly loading them into the Hadoop config, instead
of placing those files before the cluster's files in the classpath.

This is done by saving the gateway's config to a separate XML file
distributed with the rest of the Spark app's config, and loading that
file when creating a new config through YarnSparkHadoopUtil.

Tested with existing unit tests, and by verifying the behavior in a YARN
cluster (final values are not overridden, non-final values are).
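
For reference, a minimal sketch of the overlay idea described above (not the PR's exact code): Hadoop's Configuration refuses to let a later-loaded resource replace a property that an earlier resource marked final, so adding the gateway file after the cluster's files yields exactly the behavior described. The method name and the use of `__spark_conf__.xml` as the overlay file name are assumptions for illustration (the file name matches the constant shown later in this thread).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: build a config from the cluster's *-site.xml files (picked up from the
// classpath), then overlay the localized gateway file. Entries the cluster marks
// <final>true</final> keep their cluster values; non-final entries take the gateway values.
def newConfWithGatewayOverlay(localizedConfDir: String): Configuration = {
  val conf = new Configuration()  // loads the cluster's site files from the classpath
  conf.addResource(new Path(localizedConfDir, "__spark_conf__.xml"))
  conf
}
```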

The implementation is a little hacky, but mostly because there's no API
that tells you which are the files that are loaded by a YarnConfiguration
object; so the list of files was obtained empirically and is hardcoded
in the code.
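
Purely to illustrate what "hardcoded" means here, the list would be something like the *-site.xml files YarnConfiguration is observed to pull in; the actual names in the patch may differ.

```scala
// Illustrative only: file names gathered empirically, since YarnConfiguration exposes
// no API that lists the resources it loads.
val hadoopConfFiles = Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml")
```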

@vanzin
Contributor Author

vanzin commented Jun 20, 2017

I also tried a different approach (dumping the conf object to a new XML file and adding that as a resource), but the dumped file contained all the default values from the config object (in spite of what the API docs say), so even though my config files only had a handful of entries, the final XML file was over 100k (way bigger than the sum of the original files).

That might be cleaner, but it is slower.
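
For context, that alternative amounts to something like the following sketch (helper and output file names are assumed). Configuration.writeXml serializes every resolved property, defaults included, which is why the output ballooned to ~100k even though the source files only had a handful of entries.

```scala
import java.io.{File, FileOutputStream}
import org.apache.hadoop.conf.Configuration

// Sketch of the "dump the whole conf" alternative: writeXml emits all resolved
// properties, including defaults, not just the entries from the gateway's files.
def dumpConf(conf: Configuration, dir: File): File = {
  val out = new File(dir, "__spark_hadoop_conf__.xml")  // assumed name, for illustration
  val os = new FileOutputStream(out)
  try conf.writeXml(os) finally os.close()
  out
}
```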

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78323 has finished for PR 18370 at commit 4c35cce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 7, 2017

@tgravescs any opinions on this (or the alternate solution described above)?

@tgravescs
Contributor

So the issue here is that the gateway boxes have a different Hadoop config on purpose, because from those boxes you need a proxy, but on the actual YARN nodes you don't need the proxy?

I thought someone had the exact opposite issue, where they wanted the gateway configs to take precedence over the server-side configs. I guess that is fine if it's not final.

I agree with you that it's a bit hacky and not very nice that we have to hardcode the files.

I was actually thinking of the second approach you tried. Like you mention, it will definitely pull in all the defaults, so it might be larger than the config files, but that's going to depend on the setup. How large is that file (in KB/MB)?

@vanzin
Contributor Author

vanzin commented Jul 10, 2017

Yeah, the original feature was for someone who wanted to override certain configs locally; that makes sense (e.g. some user has a custom config with a different replication factor than the cluster's), but it shouldn't overwrite final configs. It's still possible to override final configs if the user really wants to (e.g. by setting spark.hadoop.* in Spark's config).
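As a rough illustration of that escape hatch (the property name is just an example): Spark copies any spark.hadoop.*-prefixed setting into the Hadoop Configuration it builds, so a user who really wants a different value can still force one from the Spark side.

```scala
import org.apache.spark.SparkConf

// Example only: spark.hadoop.* entries are applied programmatically to the application's
// Hadoop Configuration, so they win over values coming from the config files.
val sparkConf = new SparkConf().set("spark.hadoop.dfs.replication", "2")
```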

How large is that file (in KB/MB)?

The generated XML is about 100k (compared to about 10k for the XML files in the config directory). It's not too horrible if you think about it. Maybe it's better to go that route.

@tgravescs
Contributor

Yeah, I think that is going to be safer. That is what MapReduce does with its job.xml file. I'm not sure about Tez.

@SparkQA

SparkQA commented Jul 10, 2017

Test build #79480 has finished for PR 18370 at commit 3d59f0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// File containing the conf archive in the AM. See prepareLocalResources().
val LOCALIZED_CONF_ARCHIVE = LOCALIZED_CONF_DIR + ".zip"

// Name of the file in the conf archive containing Spark configuration.
val SPARK_CONF_FILE = "__spark_conf__.properties"

// Name of the file containing the gateway's Hadoop configuration, to be overlayed on top of the
// cluster's Hadoop config.
val SPARK_HADOOP_CONF_FILE = "__spark_conf__.xml"

perhaps call it spark_hadoop_conf.xml for clarity

@SparkQA

SparkQA commented Jul 12, 2017

Test build #79573 has finished for PR 18370 at commit 302f16c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

+1. The only downside here is that we are now shipping the XML files twice: still the entire Hadoop conf dir, and now also the single XML file. It feels like we shouldn't ship the Hadoop conf dir, but that might break some people. When I ran this, my combined spark_hadoop_conf.xml was about 160k.

@vanzin
Contributor Author

vanzin commented Jul 14, 2017

it feels like we shouldn't ship the hadoop conf dir but that might break some people

Yeah, we rely on that to ship other configs (like Hive and HBase) with the application without the user having to mess with --files and whatnot.

Thanks, merging to master.
