Conversation

@brennonyork

By default, SPARK_CONF_DIR cannot be set from the spark-submit script, though a Spark properties file can. After diving through the code, it turned out that SPARK_CONF_DIR is involved in a cyclic reference: it is read by the load-spark-env.sh script, yet it can also be reset from within the spark-env.sh file that script loads. Worse, if spark-env.sh defined a SPARK_CONF_DIR, that value wouldn't be picked up and used to locate the default spark-defaults.conf when no --properties-file flag was present. As such, it seemed best to provide an --environment-file flag that accepts an arbitrary bash file which would preload any necessary environment configuration options (including SPARK_CONF_DIR). This solves the original problem that SPARK_CONF_DIR was ineffective within spark-submit, and it also removes the cyclic dependency on where and when SPARK_CONF_DIR is loaded.
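The ordering problem described above can be sketched in a few lines of bash. This is a simplified stand-in for load-spark-env.sh, not the real script; the temp-directory names are illustrative only:

```shell
#!/usr/bin/env bash
# Sketch of the cyclic reference: the conf dir is resolved *before*
# spark-env.sh is sourced, so a SPARK_CONF_DIR exported inside
# spark-env.sh arrives too late to change which files get loaded.
set -euo pipefail

work="$(mktemp -d)"
mkdir -p "$work/default-conf" "$work/custom-conf"

# The user's spark-env.sh tries to point Spark at a custom conf dir.
cat > "$work/default-conf/spark-env.sh" <<EOF
export SPARK_CONF_DIR="$work/custom-conf"
EOF

# Simplified stand-in for load-spark-env.sh: SPARK_CONF_DIR is read here,
# before spark-env.sh has had a chance to set it.
conf_dir="${SPARK_CONF_DIR:-$work/default-conf}"
. "$conf_dir/spark-env.sh"

echo "sourced spark-env.sh from: $conf_dir"   # the default dir
echo "env now claims conf dir:   $SPARK_CONF_DIR"  # the custom dir, too late
```

The value exported inside spark-env.sh is visible afterwards, but the script has already been sourced from the default location, so the setting never influences which spark-env.sh (or spark-defaults.conf) is used.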

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Dec 5, 2014

Test build #24152 has started for PR 3559 at commit d857bf7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 5, 2014

Test build #24152 has finished for PR 3559 at commit d857bf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24152/

@brennonyork
Author

@JoshRosen Is there anything else needed for this patch to be pushed in? Any feedback / review would be great as well!

@andrewor14
Contributor

Hey @brennonyork I'm actually not sure we should introduce an additional command line option to specify the environment file. I think it makes the semantics very confusing because we can't be sure which environment file is being used: is it the one specified through --environment-file? Is it the one in $SPARK_CONF_DIR/spark-env.sh? Is it both? If it's both, how do we merge the values, and which file overwrites the other?

I think, because of this circular dependency between SPARK_CONF_DIR and spark-env.sh, we should just add a note in spark-env.sh.template warning the user not to set SPARK_CONF_DIR there.
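The warning being suggested for spark-env.sh.template could look roughly like the following. This is illustrative wording only, not the text that was eventually committed:

```shell
# Note: do NOT set SPARK_CONF_DIR in this file. By the time spark-env.sh
# is sourced, the configuration directory has already been resolved, so a
# value exported here is silently ignored when locating spark-defaults.conf
# (and this file itself). Set SPARK_CONF_DIR in the calling environment
# instead, before invoking spark-submit.
```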

@brennonyork
Author

@andrewor14 definitely understand that reasoning. I guess my only question is how we would a) address the bug while still b) supporting the --properties-file option. Even if we remove the --environment-file CLI option, I'm still not seeing a solution for setting SPARK_CONF_DIR. If we allowed SPARK_CONF_DIR to be set from the CLI, we would still run into a merge problem between SPARK_CONF_DIR/spark-defaults.conf and whatever was set via --properties-file. I agree though that, regardless, we should minimize merging of configuration files, as that would likely cause more confusion than help for the end user.

What about this solution: we completely remove the --properties-file CLI option in favor of the --environment-file option? I'm thinking that, at the end of the day, ensuring proper environment settings (i.e. SPARK_CONF_DIR, etc.) matters more than a single properties file, especially when SPARK_CONF_DIR would point to the configuration file anyway. I'm just throwing that out there, and I'm wondering whether any other solution would work without necessitating file merging. That said, doesn't Spark already merge defaults into the given --properties-file if one was provided? Now I'm wondering whether doing the same thing with the Spark environment file would actually be more confusing.

@brennonyork
Author

@andrewor14 @JoshRosen wondering what should be done with this issue; any thoughts on my comments above?

@andrewor14
Contributor

@brennonyork we can't remove --properties-file because we need to maintain backwards compatibility. We currently don't attempt to merge the properties at all, but instead allow --properties-file to take precedence. Note that the --environment-file case is different in that it's not as simple as having the command line option override an equivalent environment variable.
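The precedence rule described above, where an explicit --properties-file wins outright and there is no merging, can be sketched like this. This is a simplified illustration, not the real spark-submit logic, and the paths are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: an explicit --properties-file value takes precedence; otherwise
# fall back to $SPARK_CONF_DIR/spark-defaults.conf. No merging occurs.
resolve_properties_file() {
  local cli_file="$1"   # value of --properties-file, possibly empty
  if [ -n "$cli_file" ]; then
    echo "$cli_file"
  else
    echo "${SPARK_CONF_DIR:-/opt/spark/conf}/spark-defaults.conf"
  fi
}

SPARK_CONF_DIR=/etc/spark/conf
resolve_properties_file ""            # -> /etc/spark/conf/spark-defaults.conf
resolve_properties_file /tmp/my.conf  # -> /tmp/my.conf
```

Only one source is ever consulted, which is why the semantics stay simple; the --environment-file proposal has no such single-winner rule, since both the flag's file and spark-env.sh would be sourced into the same environment.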

I think the better approach is to simply warn the user in spark-env.sh not to set SPARK_CONF_DIR there, since this is a somewhat minor issue in my opinion. Given that there is no clean way to fix this, I recommend that we close this issue.

@brennonyork
Author

Roger that. What do you think about, for this PR, merely putting a blurb into spark-env.sh.template and/or changing the docs to reflect this issue? I just don't want the issue to be closed without documenting it anywhere for future users.

@andrewor14
Contributor

That sounds good. Feel free to make a brief mention in the docs where appropriate.

@asfgit asfgit closed this in c082385 Jan 8, 2015