
[SPARK-16610][SQL] Add orc.compress as an alias for compression option. #14518

Closed · wants to merge 7 commits

Conversation

@HyukjinKwon (Member) commented Aug 6, 2016

What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec; its value is also propagated to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source no longer ignore `orc.compress` when `compression` is unset.

So, here is the behaviour (see the usage sketch after this list):

  1. Check `compression` and use it if it is set.
  2. If `compression` is not set, check `orc.compress` and use it.
  3. If neither `compression` nor `orc.compress` is set, use the default, snappy.
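As an illustration of that precedence, a minimal usage sketch (the `df` DataFrame and output paths are assumptions, not part of this PR):

    val df = spark.range(10).toDF("id")

    // 1. `compression` wins when both options are set.
    df.write
      .option("compression", "zlib")
      .option("orc.compress", "snappy")
      .orc("/tmp/orc_zlib")

    // 2. Only `orc.compress` is set: it is now respected (previously ignored).
    df.write.option("orc.compress", "zlib").orc("/tmp/orc_alias")

    // 3. Neither is set: the default snappy codec is used.
    df.write.orc("/tmp/orc_default")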

How was this patch tested?

Unit test in OrcQuerySuite.
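A hedged sketch of what such a test could look like (helper names such as `withTempPath` and `OrcFileOperator` follow the sql/hive test utilities; this is illustrative, not the exact suite code):

    test("SPARK-16610: respect orc.compress when compression is unset") {
      withTempPath { file =>
        // Write with only the ORC-native option set.
        spark.range(0, 10).write
          .option("orc.compress", "ZLIB")
          .orc(file.getCanonicalPath)
        // Inspect the written file's codec via the ORC reader (assumed helper).
        val compression =
          OrcFileOperator.getFileReader(file.getCanonicalPath).get.getCompression
        assert(compression.name() === "ZLIB")
      }
    }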

@HyukjinKwon (Member, Author) commented Aug 6, 2016

Hi @yhuai, thanks for your kind suggestion; I have opened the PR as suggested. Just to triple-check, I would appreciate confirmation that the behaviour written in the PR description is correct:

  1. Check `compression` and use it if it is set.
  2. If `compression` is not set, check `orc.compress` and use it.
  3. If neither `compression` nor `orc.compress` is set, use the default, snappy.

(BTW, I apologise for asking similar things again and again, but please bear with me: I just want to avoid making multiple PRs that change things back and forth. This is almost identical to the initial version of the past PR.)

@HyukjinKwon changed the title from "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset" to "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset in ORC datasource" on Aug 6, 2016
@SparkQA commented Aug 6, 2016

Test build #63304 has finished for PR 14518 at commit af1a3b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review thread on the diff (context from an earlier revision of `OrcOptions`):

    val availableCodecs = shortOrcCompressionCodecNames.keys.map(_.toLowerCase)
    throw new IllegalArgumentException(s"Codec [$codecName] " +
      s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
    val default = conf.get(OrcRelation.ORC_COMPRESSION, "SNAPPY")
Contributor:

Sorry. Maybe I did not explain clearly in the JIRA. The use case I mentioned was `df.write.option("orc.compress", ...)`. We do not need to look at the Hadoop conf.

@HyukjinKwon (Member, Author):

Ah, I see. Then it all adds up. Sorry for not reading your comments carefully.

@HyukjinKwon changed the title from "[SPARK-16610][SQL] Do not ignore orc.compress when compression option is unset in ORC datasource" to "[SPARK-16610][SQL] Adds orc.compress as an alias for compression option." on Aug 8, 2016
@HyukjinKwon changed the title from "[SPARK-16610][SQL] Adds orc.compress as an alias for compression option." to "[SPARK-16610][SQL] Add orc.compress as an alias for compression option." on Aug 8, 2016
@SparkQA commented Aug 8, 2016

Test build #63331 has finished for PR 14518 at commit 3f70f25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2016

Test build #63333 has finished for PR 14518 at commit be04706.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

What if users set both `compression` and `orc.compress`? It looks to me that `orc.compress` is ORC-only and should take precedence over `compression` in the ORC data source.

cc @yhuai , what do you think?

@HyukjinKwon (Member, Author) commented Aug 8, 2016

IMHO, I would prefer `compression` over `orc.compress`, because I believe we should promote `compression` rather than `orc.compress` for consistency with other data sources.

As far as I know, for the same reason, `sep` takes precedence over `delimiter` and `encoding` takes precedence over `charset` in CSV (both `delimiter` and `charset` are undocumented); see the sketch after this comment.

If using `orc.compress` is preferred over `compression` (for ORC), I agree with changing it, with documentation.
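To illustrate the CSV analogy above, a minimal sketch (the input path is an assumption; this shows how the documented option wins over its undocumented alias):

    // `sep` takes precedence over the undocumented alias `delimiter`,
    // so `;` is used as the separator here.
    spark.read
      .option("delimiter", "|")
      .option("sep", ";")
      .csv("/tmp/people.csv")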

    @@ -31,7 +30,8 @@ private[orc] class OrcOptions(
       * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
       */
      val compressionCodec: String = {
    -   val codecName = parameters.getOrElse("compression", "snappy").toLowerCase
    +   val codecName = parameters.getOrElse(
    +     "compression", parameters.getOrElse("orc.compress", "snappy")).toLowerCase
Contributor:

Use `OrcRelation.ORC_COMPRESSION` (since we have a val defined)? Let's add comments to explain what we are doing (we should mention that `orc.compress` is an ORC conf and which conf takes precedence). Also, would the following lines look better?

    val orcCompressionConf = parameters.get(OrcRelation.ORC_COMPRESSION)
    val codecName = parameters
      .get("compression")
      .orElse(orcCompressionConf)
      .getOrElse("snappy")

@HyukjinKwon (Member, Author):

Sure, thanks for the cleaner snippet!

@yhuai (Contributor) commented Aug 8, 2016

I think it is fine that `compression` takes precedence. BTW, is this option used by other data sources?

@HyukjinKwon (Member, Author) commented Aug 9, 2016

Yes, Parquet (here), JSON (here), CSV (here), Text (here) and ORC all have this `compression` option.
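As a hedged illustration of that shared writer option (output paths and `df` are assumptions for this sketch; accepted codec values vary by format):

    // The same `compression` option name works across these writers.
    df.write.option("compression", "gzip").json("/tmp/out_json")
    df.write.option("compression", "snappy").parquet("/tmp/out_parquet")
    df.write.option("compression", "gzip").csv("/tmp/out_csv")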

@HyukjinKwon (Member, Author) commented Aug 9, 2016

Just FYI, `compression` overriding `orc.compress` is already documented here in `DataFrameWriter`, but I will definitely mention it in `OrcOptions` too.

@cloud-fan (Contributor):

LGTM, pending Jenkins.

@SparkQA commented Aug 9, 2016

Test build #63397 has finished for PR 14518 at commit e4d6999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 9, 2016
[SPARK-16610][SQL] Add orc.compress as an alias for compression option.

## What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec; its value is also propagated to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source no longer ignore `orc.compress` when `compression` is unset.

So, here is the behaviour:

 1. Check `compression` and use it if it is set.
 2. If `compression` is not set, check `orc.compress` and use it.
 3. If neither `compression` nor `orc.compress` is set, use the default, snappy.

## How was this patch tested?

Unit test in `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14518 from HyukjinKwon/SPARK-16610.

(cherry picked from commit bb2b9d0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit closed this in bb2b9d0 on Aug 9, 2016
@cloud-fan (Contributor):

Thanks, merging to master and 2.0!

@HyukjinKwon deleted the SPARK-16610 branch on January 2, 2018