Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() #11464

Closed
wants to merge 17 commits into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

How was this patch tested?

unittests within IDE and code style tests with dev/run_tests.

@HyukjinKwon
Copy link
Member Author

cc @rxin

@@ -487,6 +487,12 @@ def parquet(self, path, mode=None, partitionBy=None):
* ``error`` (default case): Throw an exception if data already exists.
:param partitionBy: names of partitioning columns

You can set the following Parquet-specific option(s) for writing Parquet files:
* ``compression`` (default ``None``): compression codec to use when saving to file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just put this as an argument?

@rxin
Copy link
Contributor

rxin commented Mar 2, 2016

How does this relate to #11408?

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52305 has finished for PR 11464 at commit a96e510.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@rxin I think I should change the codec names to lower cases here.

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52306 has finished for PR 11464 at commit d107d50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52307 has finished for PR 11464 at commit 9c53d10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

I will handle the consistent stuff in a new PR based on #11408.

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52312 has finished for PR 11464 at commit 12e7275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@rxin Could you maybe merge this first before dealing with consistent compression name stuff if it looks good?

++Let me fix the conflicts first.

@rxin
Copy link
Contributor

rxin commented Mar 3, 2016

There is a conflict. Maybe just make stuff consistent here?

@HyukjinKwon
Copy link
Member Author

@rxin Sure.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52349 has finished for PR 11464 at commit baf7a63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Hm.. This will not explicitly set uncompressed or none for JSON, CSV and TEXT. Let me correct them and add some tests.

@HyukjinKwon
Copy link
Member Author

@rxin I think this is ready to be reviewed.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52360 has finished for PR 11464 at commit 4e212b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52368 has finished for PR 11464 at commit 2817c9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@rxin
Copy link
Contributor

rxin commented Mar 3, 2016

@davies can you help review this? Thanks.

@davies
Copy link
Contributor

davies commented Mar 3, 2016

LGTM

@@ -396,6 +402,46 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
}
}

test("SPARK-13543 Set explicitly the output as uncompressed") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly does this test do? it's not super clear here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it to test whether uncompressed mode would work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, which one would be better?
Should I just write like write the output as uncompressed via option?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - i think that's better

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52378 has finished for PR 11464 at commit 2817c9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52385 has finished for PR 11464 at commit e4d4cfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52384 has finished for PR 11464 at commit d2ace37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Mar 3, 2016

Thanks - merging this in master.

@asfgit asfgit closed this in cf95d72 Mar 3, 2016
@marmbrus
Copy link
Contributor

marmbrus commented Mar 3, 2016

Changes like this should probably also update the programming guide.

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…et/ORC via option()

## What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

## How was this patch tested?

unittests within IDE and code style tests with `dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11464 from HyukjinKwon/SPARK-13543.
@HyukjinKwon HyukjinKwon deleted the SPARK-13543 branch October 1, 2016 06:42
asfgit pushed a commit that referenced this pull request Jan 6, 2018
…quetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
Since Hive 1.1, Hive allows users to set parquet compression codec via table-level properties parquet.compression. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We do support orc.compression for ORC. Thus, for external users, it is more straightforward to support both. See the stackflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
In Spark side, our table-level compression conf compression was added by #11464 since Spark 2.0.
We need to support both table-level conf. Users might also use session-level conf spark.sql.parquet.compression.codec. The priority rule will be like
If other compression codec configuration was found through hive or parquet, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
The rule for Parquet is consistent with the ORC after the change.

Changes:
1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the precedence order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.

2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none".

3.Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Add test.

Author: fjh100456 <fu.jinhua6@zte.com.cn>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 6, 2018
…quetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
Since Hive 1.1, Hive allows users to set parquet compression codec via table-level properties parquet.compression. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We do support orc.compression for ORC. Thus, for external users, it is more straightforward to support both. See the stackflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
In Spark side, our table-level compression conf compression was added by apache#11464 since Spark 2.0.
We need to support both table-level conf. Users might also use session-level conf spark.sql.parquet.compression.codec. The priority rule will be like
If other compression codec configuration was found through hive or parquet, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.
The rule for Parquet is consistent with the ORC after the change.

Changes:
1.Increased acquiring 'compressionCodecClassName' from `parquet.compression`,and the precedence order is `compression`,`parquet.compression`,`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.

2.Change `spark.sql.parquet.compression.codec` to support "none".Actually in `ParquetOptions`,we do support "none" as equivalent to "uncompressed", but it does not allowed to configured to "none".

3.Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Add test.

Author: fjh100456 <fu.jinhua6@zte.com.cn>

Closes apache#20076 from fjh100456/ParquetOptionIssue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants