
[SPARK-16610][SQL] Add orc.compress as an alias for compression option. #14518

Closed · wants to merge 7 commits

Conversation

@HyukjinKwon (Member) commented Aug 6, 2016

What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec; its value is also propagated to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source no longer ignore `orc.compress` when `compression` is unset.

So, here is the behaviour (see the usage sketch after this list):

  1. Check `compression` and use it if it is set.
  2. If `compression` is not set, check `orc.compress` and use it.
  3. If neither `compression` nor `orc.compress` is set, use the default, snappy.
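As an illustration of that precedence, a minimal usage sketch (the `df` DataFrame and output paths are assumptions, not part of this PR):

    val df = spark.range(10).toDF("id")

    // 1. `compression` wins when both options are set.
    df.write
      .option("compression", "zlib")
      .option("orc.compress", "snappy")
      .orc("/tmp/orc_zlib")

    // 2. Only `orc.compress` is set: it is now respected (previously ignored).
    df.write.option("orc.compress", "zlib").orc("/tmp/orc_alias")

    // 3. Neither is set: the default snappy codec is used.
    df.write.orc("/tmp/orc_default")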

How was this patch tested?

Unit test in OrcQuerySuite.
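A hedged sketch of what such a test could look like (helper names such as `withTempPath` and `OrcFileOperator` follow the sql/hive test utilities; this is illustrative, not the exact suite code):

    test("SPARK-16610: respect orc.compress when compression is unset") {
      withTempPath { file =>
        // Write with only the ORC-native option set.
        spark.range(0, 10).write
          .option("orc.compress", "ZLIB")
          .orc(file.getCanonicalPath)
        // Inspect the written file's codec via the ORC reader (assumed helper).
        val compression =
          OrcFileOperator.getFileReader(file.getCanonicalPath).get.getCompression
        assert(compression.name() === "ZLIB")
      }
    }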

@HyukjinKwon (Member, Author) commented Aug 6, 2016

Hi @yhuai, thanks for your kind suggestion; I have opened the PR as suggested. Just to triple-check, I would appreciate confirmation that the behaviour written in the PR description is correct:

  1. Check `compression` and use it if it is set.
  2. If `compression` is not set, check `orc.compress` and use it.
  3. If neither `compression` nor `orc.compress` is set, use the default, snappy.

(BTW, I apologise for asking similar things again and again, but please bear with me: I just want to avoid making multiple PRs that change things back and forth. This is almost identical to the initial version of the past PR.)

@HyukjinKwon changed the title from "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset" to "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset in ORC datasource" on Aug 6, 2016
@SparkQA commented Aug 6, 2016

Test build #63304 has finished for PR 14518 at commit af1a3b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review thread on the diff (context from an earlier revision of `OrcOptions`):

    val availableCodecs = shortOrcCompressionCodecNames.keys.map(_.toLowerCase)
    throw new IllegalArgumentException(s"Codec [$codecName] " +
      s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
    val default = conf.get(OrcRelation.ORC_COMPRESSION, "SNAPPY")
Contributor:

Sorry. Maybe I did not explain clearly in the JIRA. The use case I mentioned was `df.write.option("orc.compress", ...)`. We do not need to look at the Hadoop conf.

@HyukjinKwon (Member, Author):

Ah, I see. Then it all adds up. Sorry for not reading your comments carefully.

@HyukjinKwon changed the title from "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset in ORC datasource" to "[SPARK-16610][SQL] Adds orc.compress as an alias for compression option." on Aug 8, 2016
@HyukjinKwon changed the title from "[SPARK-16610][SQL] Adds orc.compress as an alias for compression option." to "[SPARK-16610][SQL] Add orc.compress as an alias for compression option." on Aug 8, 2016
@SparkQA commented Aug 8, 2016

Test build #63331 has finished for PR 14518 at commit 3f70f25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2016

Test build #63333 has finished for PR 14518 at commit be04706.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

What if users set both `compression` and `orc.compress`? It looks to me that `orc.compress` is ORC-only and should take precedence over `compression` in the ORC data source.

cc @yhuai , what do you think?

@HyukjinKwon (Member, Author) commented Aug 8, 2016

IMHO, I would prefer `compression` over `orc.compress`, because I believe we should promote `compression` rather than `orc.compress` for consistency with other data sources.

As far as I know, for the same reason, `sep` takes precedence over `delimiter` and `encoding` takes precedence over `charset` in CSV (both `delimiter` and `charset` are undocumented); see the sketch after this comment.

If using `orc.compress` is preferred over `compression` (for ORC), I agree with changing it, with documentation.
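To illustrate the CSV analogy above, a minimal sketch (the input path is an assumption; this shows how the documented option wins over its undocumented alias):

    // `sep` takes precedence over the undocumented alias `delimiter`,
    // so `;` is used as the separator here.
    spark.read
      .option("delimiter", "|")
      .option("sep", ";")
      .csv("/tmp/people.csv")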

    @@ -31,7 +30,8 @@ private[orc] class OrcOptions(
       * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
       */
      val compressionCodec: String = {
    -   val codecName = parameters.getOrElse("compression", "snappy").toLowerCase
    +   val codecName = parameters.getOrElse(
    +     "compression", parameters.getOrElse("orc.compress", "snappy")).toLowerCase
Contributor:

Use `OrcRelation.ORC_COMPRESSION` (since we have a val defined)? Let's add comments to explain what we are doing (we should mention that `orc.compress` is an ORC conf and which conf takes precedence). Also, would the following lines look better?

    val orcCompressionConf = parameters.get(OrcRelation.ORC_COMPRESSION)
    val codecName = parameters
      .get("compression")
      .orElse(orcCompressionConf)
      .getOrElse("snappy")

@HyukjinKwon (Member, Author):

Sure, thanks for the cleaner snippet!

@yhuai (Contributor) commented Aug 8, 2016

I think it is fine that `compression` takes precedence. BTW, is this option used by other data sources?

@HyukjinKwon (Member, Author) commented Aug 9, 2016

Yes, Parquet (here), JSON (here), CSV (here), Text (here) and ORC all have this `compression` option.
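As a hedged illustration of that shared writer option (output paths and `df` are assumptions for this sketch; accepted codec values vary by format):

    // The same `compression` option name works across these writers.
    df.write.option("compression", "gzip").json("/tmp/out_json")
    df.write.option("compression", "snappy").parquet("/tmp/out_parquet")
    df.write.option("compression", "gzip").csv("/tmp/out_csv")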

@HyukjinKwon (Member, Author) commented Aug 9, 2016

Just FYI, `compression` overriding `orc.compress` is already documented here in `DataFrameWriter`, but I will definitely mention it in `OrcOptions` too.

@cloud-fan (Contributor):

LGTM, pending Jenkins.

@SparkQA commented Aug 9, 2016

Test build #63397 has finished for PR 14518 at commit e4d6999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 9, 2016
[SPARK-16610][SQL] Add orc.compress as an alias for compression option.

## What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec; its value is also propagated to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source no longer ignore `orc.compress` when `compression` is unset.

So, here is the behaviour:

 1. Check `compression` and use it if it is set.
 2. If `compression` is not set, check `orc.compress` and use it.
 3. If neither `compression` nor `orc.compress` is set, use the default, snappy.

## How was this patch tested?

Unit test in `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14518 from HyukjinKwon/SPARK-16610.

(cherry picked from commit bb2b9d0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit closed this in bb2b9d0 on Aug 9, 2016
@cloud-fan (Contributor):

Thanks, merging to master and 2.0!

@HyukjinKwon deleted the SPARK-16610 branch on January 2, 2018