[SPARK-21637][SPARK-21451][SQL] Get spark.hadoop.* properties from sysProps to hiveconf (#18668)
Conversation
ping @cloud-fan @gatorsmile
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")

// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
why should we do so?
@cloud-fan if we run bin/spark-sql --conf spark.hadoop.hive.exec.scratchdir=/some/dir, or set it in spark-defaults.conf, SessionState.start(cliSessionState) in SparkSQLCliDriver will not use this dir but the default one.
do we have documents saying that spark.hadoop.xxx is supported? or are you proposing a new feature?
let's move this to a util method so that we know this is done in 2 places
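A sketch of what such a shared util method could look like (the object name and signature here are assumptions for illustration only; the final shape is settled later in this thread):

```scala
// Hypothetical shared helper: copy any "spark.hadoop.foo=bar" entries from a
// source map into a destination properties map as "foo=bar", so the
// prefix-stripping logic lives in one place instead of two.
object SparkHadoopConfUtil {
  private val Prefix = "spark.hadoop."

  def appendSparkHadoopConfigs(
      srcMap: Map[String, String],
      destMap: java.util.Map[String, String]): Unit = {
    for ((key, value) <- srcMap if key.startsWith(Prefix)) {
      destMap.put(key.stripPrefix(Prefix), value)
    }
  }
}
```

Both newTemporaryConfiguration and the CLI driver could then call the same helper rather than duplicating the loop.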
ok to test
Test build #79792 has finished for PR 18668 at commit
@cloud-fan tests passed, mind taking a look?
ping @cloud-fan
ping @cloud-fan again
@@ -33,4 +33,13 @@ class HiveUtilsSuite extends QueryTest with SQLTestUtils with TestHiveSingleton
assert(conf(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname) === "")
}
}

test("newTemporaryConfiguration respect spark.hadoop.foo=bar in SparkConf") {
sys.props.put("spark.hadoop.foo", "bar")
The test says we should respect hadoop conf in SparkConf, but why do we handle system properties?
@cloud-fan at the very beginning, spark-submit does the same thing: it adds properties from --conf and spark-defaults.conf to sys.props.
sys.props.put("spark.hadoop.foo", "bar")
Seq(true, false) foreach { useInMemoryDerby =>
val hiveConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby)
intercept[NoSuchElementException](hiveConf("spark.hadoop.foo") === "bar")
nit: assert(!hiveConf.contains("spark.hadoop.foo"))
LGTM except one minor comment
If this is an issue in
Test build #80063 has finished for PR 18668 at commit
@cloud-fan would you please take a look? Anyway, in CliSuite
Hi @gatorsmile, I added some UTs in CliSuite, please check!
Test build #80077 has finished for PR 18668 at commit
test this please
runCliWithin(1.minute)("set spark.sql.warehouse.dir;" -> warehousePath.getAbsolutePath)
}

test("SPARK-21451: Apply spark.hadoop.* configurations") {
Without the fix, this test case can still succeed.
@gatorsmile Yes, after sc is initialized, spark.hadoop.hive.metastore.warehouse.dir will be translated into the hadoop conf hive.metastore.warehouse.dir as an alternative to the warehouse dir. This test case couldn't tell whether this PR works. CliSuite can only see these values if we explicitly set them in SQLConf. The original code did break another test case anyway.
@@ -50,6 +50,7 @@ private[hive] object SparkSQLCLIDriver extends Logging {
private val prompt = "spark-sql"
private val continuedPrompt = "".padTo(prompt.length, ' ')
private var transport: TSocket = _
private final val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."
Just a question: why does the prefix have to be spark.hadoop.? See the related PR: #2379
good point. I see spark.hive in some of my configs
spark.hadoop. was tribal knowledge and a sneaky way to stick values into the Hadoop Configuration object (which can later also be passed on to HiveConf). What does spark.hive. do? I have never seen such configs and would like to know. Keeping that aside, are you proposing to drop that prefix at L145?
After thinking more, I think we should just consider spark.hadoop. in this PR, unless we get other feedback from the community.
Since the most related PR #1843 was submitted by @vanzin: could you please review this PR? Actually, the prefix ... Could you explain more details about this?
Test build #80110 has finished for PR 18668 at commit
As for this particular PR, I'm not so sure about it. I'm not exactly familiar with how the "execution" Hive instance uses the metastore and other Hadoop configs, but at the very least this creates a conflict between the configuration of the execution Hive and that of the metadata Hive (which does use ...). So with this change you cannot use ...
There is a bug in HiveClientImpl about reusing cliSessionState, see HiveClientImpl.scala#L140. Actually, it has never been reached and reused. You can run ...
Does this mean a hive client initialized by HiveUtils.newClientForExecution? If true, this is ONLY used in HiveThriftServer2 after SparkContext is initialized. Example: ...
Test build #80158 has finished for PR 18668 at commit
I see. Seems like this code changed a bit since I last looked at it closely, when there was an explicit "execution Hive" in ... If that's the case then it's probably ok to add this.
Test build #80227 has finished for PR 18668 at commit
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")

// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
@yaooqinn Please follow what @tejasapatil said and create a util function.
ok
while (it.hasNext) {
val kv = it.next()
SparkSQLEnv.sqlContext.setConf(kv.getKey, kv.getValue)
newHiveConf.foreach{ kv =>
foreach{ -> foreach {
thanks
// If the same property is configured by spark.hadoop.xxx, we ignore it and
// obey settings from spark properties
val k = kv.getKey
val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
Let me try to summarize the impacts of these changes. The initial call of newTemporaryConfiguration happens before we set sys.props. The subsequent call of newTemporaryConfiguration in newClientForExecution will be used for Hive execution clients. Thus, the changes will affect Hive execution clients. Could you check all the code in Spark that uses sys.props? Will this change impact it?
When we build SparkConf in SparkSQLEnv, we get the conf from system properties because loadDefaults is set to true. That is the way we pass --hiveconf values to sc.hadoopConfiguration.
newClientForExecution is used ONLY in HiveThriftServer2, where it is used to get a hiveconf. There is no longer an execution hive client; IMO this method could be removed. This activity happens after SparkSQLEnv.init, so it is OK for spark.hadoop. properties. I realize that --hiveconf should be added to sys.props as spark.hadoop.xxx before SparkSQLEnv.init.
newClientForExecution is used for us to read/write hive serde tables. This is the major concern I have. Let us add another parameter in newTemporaryConfiguration. When newClientForExecution is calling newTemporaryConfiguration, we should not get the hive conf from sys.props.
I have checked the whole project; newClientForExecution is only used at HiveThriftServer2.scala#L58 and HiveThriftServer2.scala#L86.
Then, that sounds ok to me.
@@ -157,12 +168,8 @@ private[hive] object SparkSQLCLIDriver extends Logging {
// Execute -i init files (always in silent mode)
cli.processInitFiles(sessionState)

// Respect the configurations set by --hiveconf from the command line
// (based on Hive's CliDriver).
val it = sessionState.getOverriddenConfigurations.entrySet().iterator()
What is the reason you moved it to line 140?
--hiveconf abc.def will be added to the system properties as spark.hadoop.abc.def if it does not already exist, before SparkSQLEnv.init.
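The ordering being described could look roughly like this (a sketch; hiveConfFromCommandLine stands in for the parsed --hiveconf pairs and is not an actual variable in the PR):

```scala
// Sketch: before SparkSQLEnv.init, fold each --hiveconf key into sys.props
// under the "spark.hadoop." prefix. getOrElseUpdate leaves any existing
// spark.hadoop.* entry untouched, so explicit spark properties win.
val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."

def promoteHiveConf(hiveConfFromCommandLine: Map[String, String]): Unit = {
  hiveConfFromCommandLine.foreach { case (k, v) =>
    sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, v)
  }
}
```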
Please open another JIRA for the issue of #18668 (comment). Also put the JIRA number in this PR. It can help us track the issues. Thanks!
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")

// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
sys.props.foreach { case (key, value) =>
As I mentioned above, we should not do this for newClientForExecution.
Test build #80239 has finished for PR 18668 at commit
Test build #80241 has finished for PR 18668 at commit
retest this please
@@ -404,6 +405,8 @@ private[spark] object HiveUtils extends Logging {
propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")

SparkHadoopUtil.get.appendSparkHadoopConfigs(propMap)
We are unable to know this is getting the values from sys.props. How about changing the interface to:
// xyz
SparkHadoopUtil.get.appendSparkHadoopConfigs(sys.props.toMap, propMap)
ok
sys.props.foreach { case (key, value) =>
if (key.startsWith("spark.hadoop.")) {
propMap.put(key.substring("spark.hadoop.".length), value)
}
We can shorten it to something like
for ((key, value) <- conf if key.startsWith("spark.hadoop.")) {
propMap.put(key.substring("spark.hadoop.".length), value)
}
ok
Test build #80278 has finished for PR 18668 at commit
// Copy any "spark.hadoop.foo=bar" system properties into destMap as "foo=bar"
srcMap.foreach { case (key, value) if key.startsWith("spark.hadoop.") =>
destMap.put(key.substring("spark.hadoop.".length), value)
}
Your change is different from what I posted before
for ((key, value) <- conf if key.startsWith("spark.hadoop.")) {
propMap.put(key.substring("spark.hadoop.".length), value)
}
Your solution requires another case _ =>
ok
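For context on the case _ => remark: an anonymous pattern-matching function with a guard is partial, and foreach applies it to every element, so any non-matching entry throws scala.MatchError at runtime; the for-comprehension form filters instead. A minimal demonstration (plain Scala, independent of the PR's code):

```scala
val props = Map("spark.hadoop.foo" -> "bar", "unrelated" -> "x")

// Partial function passed to foreach: "unrelated" fails the guard and is not
// handled by any case, so the iteration throws scala.MatchError.
val partialSucceeded =
  try {
    props.foreach { case (k, v) if k.startsWith("spark.hadoop.") => () }
    true
  } catch { case _: MatchError => false }

// The for-comprehension silently skips non-matching entries instead.
val copied = for ((k, v) <- props if k.startsWith("spark.hadoop."))
  yield (k.stripPrefix("spark.hadoop."), v)
```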
Test build #80284 has finished for PR 18668 at commit
retest this please
LGTM pending Jenkins.
Test build #80288 has finished for PR 18668 at commit
Thanks everyone! Merging it to master. If there are any other comments, we can address them in follow-up PRs.
What changes were proposed in this pull request?

When we use the bin/spark-sql command configuring --conf spark.hadoop.foo=bar, the SparkSQLCliDriver initializes an instance of hiveconf but does not add foo->bar to it.

This PR gets spark.hadoop.* properties from sysProps to this hiveconf.

How was this patch tested?
UT