-
Notifications
You must be signed in to change notification settings - Fork 479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1463: [Java] Support Brotli codec #1714
Conversation
ByteBuffer out, | ||
ByteBuffer overflow, | ||
Options options) throws IOException { | ||
BrotliOptions brotliOptions = (BrotliOptions) options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a lot of testing I found out there is a problem that if the input data does not start from 0, this will cause the decompression to fail. So here is an extra copy when dealing with this situation.
I create a issue to Brotli4j: hyperxpro/Brotli4j#126
Since this PR has been merged, it is easy for us to re-implement it if new version of Brotli4j is released. hyperxpro/Brotli4j#124
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by brotli4j v1.15.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM, Pending CI.
Thank you, @deshanxiao
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM, too.
Thank you @williamhyun @dongjoon-hyun ! |
Oh, this seems to cause more transitive dependencies in addition to
It seems to cause a failure in Apache spark test because the dependency changes based on the underlying OSes.
|
Thank you, I will make it optional. |
Thank you! For Spark community, I'll exclude it officially and add a documentation later to guide the users to add that dependency from application side. |
### What changes were proposed in this pull request? This PR is aimed to add Brotli codec defined by [RFC7932](https://www.ietf.org/rfc/rfc7932.txt). This is a new PR implemented based on [Brotli4j](https://github.com/hyperxpro/Brotli4j). Previous PR: apache#1565 ### Why are the changes needed? To support more codec in ORC. ### How was this patch tested? UT and CI pipeline. Test brotli in example `CompressionWriter`. It will create a 10.2kb file while snappy produces a 44.2kb file. Closes apache#1714 from deshanxiao/orc-1463. Lead-authored-by: Deshan Xiao <deshanxiao@microsoft.com> Co-authored-by: deshanxiao <deshanxiao@microsoft.com> Signed-off-by: William Hyun <william@apache.org>
Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. (SPARK-44115) Can I propose a PR that Spark supports ORC brotli ? set spark.sql.orc.compression.codec=BROTLI;
By the way, https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125 |
Yes, it was intentional, @cxzl25 , because it's a new feature from Spark's perspective. We need to focus on Spark perspective to do that like Spark documentation and configs. Of course, you are welcome always. 😄 Go for it~
|
### What changes were proposed in this pull request? This PR aims to support ORC Brotli codec. ### Why are the changes needed? Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. ([SPARK-44115](https://issues.apache.org/jira/browse/SPARK-44115)) However, due to some hardcode checks in Spark, brotli cannot be used normally. ```sql set spark.sql.orc.compression.codec=BROTLI; ``` ``` java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli ``` See apache/orc#1714 ### Does this PR introduce _any_ user-facing change? Yes In this PR, add the corresponding brotli jar package, we can use orc brotli encoding. ```xml <dependency> <groupId>com.aayushatharva.brotli4j</groupId> <artifactId>brotli4j</artifactId> </dependency> ``` REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125 ### How was this patch tested? local test <img width="850" alt="image" src="https://github.com/apache/spark/assets/3898450/4e4f9422-1cf2-45ef-8bcc-8bae6188beb7"> ### Was this patch authored or co-authored using generative AI tooling? No Closes #45584 from cxzl25/orc_brotli_codec. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? This PR aims to support ORC Brotli codec. ### Why are the changes needed? Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. ([SPARK-44115](https://issues.apache.org/jira/browse/SPARK-44115)) However, due to some hardcode checks in Spark, brotli cannot be used normally. ```sql set spark.sql.orc.compression.codec=BROTLI; ``` ``` java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli ``` See apache/orc#1714 ### Does this PR introduce _any_ user-facing change? Yes In this PR, add the corresponding brotli jar package, we can use orc brotli encoding. ```xml <dependency> <groupId>com.aayushatharva.brotli4j</groupId> <artifactId>brotli4j</artifactId> </dependency> ``` REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125 ### How was this patch tested? local test <img width="850" alt="image" src="https://github.com/apache/spark/assets/3898450/4e4f9422-1cf2-45ef-8bcc-8bae6188beb7"> ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45584 from cxzl25/orc_brotli_codec. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR is aimed to add Brotli codec defined by RFC7932.
This is a new PR implemented based on Brotli4j. Previous PR: #1565
Why are the changes needed?
To support more codec in ORC.
How was this patch tested?
UT and CI pipeline.
Test brotli in example
CompressionWriter
. It will create a 10.2kb file while snappy produces a 44.2kb file.