[SPARK-16610][SQL] Add `orc.compress` as an alias for `compression` option. #14518
Conversation
Hi @yhuai, thanks for your kind suggestion; I have opened the PR as suggested. Just to triple-check, I would appreciate confirmation that the behaviour written in the PR description is correct.
(Title changed: "`orc.compress` when `compression` option is unset" → "`orc.compress` when `compression` option is unset in ORC datasource")
Test build #63304 has finished for PR 14518 at commit
val availableCodecs = shortOrcCompressionCodecNames.keys.map(_.toLowerCase)
throw new IllegalArgumentException(s"Codec [$codecName] " +
  s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
val default = conf.get(OrcRelation.ORC_COMPRESSION, "SNAPPY")
Sorry, maybe I did not explain clearly in the JIRA. The use case I mentioned was `df.write.option("orc.compress", ...)`. We do not need to look at the Hadoop conf.
Ah, I see. Then, it all adds up. Sorry for not reading your comments carefully.
(Title changed: "`orc.compress` when `compression` option is unset in ORC datasource" → "`orc.compress` as an alias for `compression` option")
Test build #63331 has finished for PR 14518 at commit
Test build #63333 has finished for PR 14518 at commit
What if users set both? cc @yhuai, what do you think?
IMHO, I would prefer … As far as I know, … for this reason, … If using …
@@ -31,7 +30,8 @@ private[orc] class OrcOptions(
   * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
   */
  val compressionCodec: String = {
-   val codecName = parameters.getOrElse("compression", "snappy").toLowerCase
+   val codecName = parameters.getOrElse(
+     "compression", parameters.getOrElse("orc.compress", "snappy")).toLowerCase
Use `OrcRelation.ORC_COMPRESSION` (since we have a val defined)? Let's add comments to explain what we are doing (we should mention that `orc.compress` is an ORC conf and which conf will take precedence). Also, will the following lines look better?
val orcCompressionConf = parameters.get(OrcRelation.ORC_COMPRESSION)
val codecName = parameters
.get("compression")
.orElse(orcCompressionConf)
.getOrElse("snappy")
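To make the precedence in the suggested chain concrete, here is a minimal standalone sketch (outside Spark): the `Map` stands in for the data source options, and `OrcCompression` is a hypothetical stand-in for `OrcRelation.ORC_COMPRESSION`.

```scala
// Standalone sketch of the suggested option-resolution chain.
// The names here are illustrative stand-ins, not Spark's actual classes.
object OptionPrecedence {
  val OrcCompression = "orc.compress" // stand-in for OrcRelation.ORC_COMPRESSION

  def resolveCodec(parameters: Map[String, String]): String = {
    val orcCompressionConf = parameters.get(OrcCompression)
    parameters
      .get("compression")         // Spark's own writer option wins ...
      .orElse(orcCompressionConf) // ... then the ORC conf ...
      .getOrElse("snappy")        // ... then the default
      .toLowerCase
  }

  def main(args: Array[String]): Unit = {
    println(resolveCodec(Map("compression" -> "zlib", "orc.compress" -> "none"))) // zlib
    println(resolveCodec(Map("orc.compress" -> "ZLIB")))                          // zlib
    println(resolveCodec(Map.empty))                                              // snappy
  }
}
```

This shows the answer to the "what if users set both" question above: with `orElse`, `compression` always shadows `orc.compress` when both are present.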
Sure, thanks for the cleaner snippet!
I think it is fine that …
Just FYI, …
LGTM, pending jenkins.
Test build #63397 has finished for PR 14518 at commit
…ption.

## What changes were proposed in this pull request?

For ORC source, Spark SQL has a writer option `compression`, which is used to set the codec and its value will be also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer option, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source not ignore `orc.compress` when `compression` is unset. So, here is the behaviour:

1. Check `compression` and use this if it is set.
2. If `compression` is not set, check `orc.compress` and use it.
3. If `compression` and `orc.compress` are not set, then use the default snappy.

## How was this patch tested?

Unit test in `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14518 from HyukjinKwon/SPARK-16610.

(cherry picked from commit bb2b9d0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master and 2.0!
What changes were proposed in this pull request?

For ORC source, Spark SQL has a writer option `compression`, which is used to set the codec and its value will be also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer option, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source not ignore `orc.compress` when `compression` is unset.

So, here is the behaviour:

1. Check `compression` and use this if it is set.
2. If `compression` is not set, check `orc.compress` and use it.
3. If `compression` and `orc.compress` are not set, then use the default snappy.

How was this patch tested?

Unit test in `OrcQuerySuite`.
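The three-step behaviour above, combined with the codec-name validation seen earlier in the diff, can be sketched outside Spark roughly as follows. The codec map here is an assumption standing in for Spark's private `shortOrcCompressionCodecNames`; the real implementation lives in `OrcOptions`.

```scala
// Sketch of the full resolution: precedence plus codec validation.
// shortOrcCompressionCodecNames below is an illustrative stand-in for
// the private map in Spark's OrcOptions, not the actual definition.
object CodecResolution {
  private val shortOrcCompressionCodecNames =
    Map("none" -> "NONE", "snappy" -> "SNAPPY", "zlib" -> "ZLIB", "lzo" -> "LZO")

  def compressionCodec(parameters: Map[String, String]): String = {
    // 1. `compression` if set; 2. `orc.compress` if set; 3. default snappy.
    val codecName = parameters
      .get("compression")
      .orElse(parameters.get("orc.compress"))
      .getOrElse("snappy")
      .toLowerCase
    if (!shortOrcCompressionCodecNames.contains(codecName)) {
      val availableCodecs = shortOrcCompressionCodecNames.keys
      throw new IllegalArgumentException(s"Codec [$codecName] " +
        s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
    }
    shortOrcCompressionCodecNames(codecName)
  }
}
```

An unknown codec (e.g. `compression=gzip`) fails fast with the same style of `IllegalArgumentException` shown in the diff, rather than being passed through to the ORC writer.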