Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1463: [Java] Support Brotli codec #1714

Closed
wants to merge 6 commits into from
Closed

ORC-1463: [Java] Support Brotli codec #1714

wants to merge 6 commits into from

Conversation

deshanxiao
Copy link
Contributor

@deshanxiao deshanxiao commented Dec 28, 2023

What changes were proposed in this pull request?

This PR is aimed to add Brotli codec defined by RFC7932.
This is a new PR implemented based on Brotli4j. Previous PR: #1565

Why are the changes needed?

To support more codec in ORC.

How was this patch tested?

UT and CI pipeline.

Test brotli in example CompressionWriter. It will create a 10.2kb file while snappy produces a 44.2kb file.

ByteBuffer out,
ByteBuffer overflow,
Options options) throws IOException {
BrotliOptions brotliOptions = (BrotliOptions) options;
Copy link
Contributor Author

@deshanxiao deshanxiao Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a lot of testing I found out there is a problem that if the input data does not start from 0, this will cause the decompression to fail. So here is an extra copy when dealing with this situation.

I create a issue to Brotli4j: hyperxpro/Brotli4j#126

Since this PR has been merged, it is easy for us to re-implement it if new version of Brotli4j is released. hyperxpro/Brotli4j#124

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by brotli4j v1.15.0

@williamhyun williamhyun self-assigned this Jan 2, 2024
@williamhyun williamhyun added this to the 2.0.0 milestone Jan 2, 2024
Copy link
Member

@williamhyun williamhyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM, Pending CI.

Thank you, @deshanxiao

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM, too.

@deshanxiao
Copy link
Contributor Author

Thank you @williamhyun @dongjoon-hyun !

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 3, 2024

Oh, this seems to cause more transitive dependencies in addition to brotli4j.jar. Are these expected and required ones, @deshanxiao ? Can we remove service-1.15.0.jar?

[INFO] +- com.aayushatharva.brotli4j:brotli4j:jar:1.15.0:compile
[INFO] |  +- com.aayushatharva.brotli4j:service:jar:1.15.0:compile
[INFO] |  \- com.aayushatharva.brotli4j:native-osx-aarch64:jar:1.15.0:compile

It seems to cause a failure in Apache spark test because the dependency changes based on the underlying OSes.

+native-linux-x86_64/1.15.0//native-linux-x86_64-1.15.0.jar

@williamhyun
Copy link
Member

Thank you, I will make it optional.

@dongjoon-hyun
Copy link
Member

Thank you! For Spark community, I'll exclude it officially and add a documentation later to guide the users to add that dependency from application side.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
### What changes were proposed in this pull request?
This PR is aimed to add Brotli codec defined by [RFC7932](https://www.ietf.org/rfc/rfc7932.txt).
This is a new PR implemented based on [Brotli4j](https://github.com/hyperxpro/Brotli4j). Previous PR: apache#1565

### Why are the changes needed?
To support more codec in ORC.

### How was this patch tested?
UT and CI pipeline.

Test brotli in example `CompressionWriter`. It will create a 10.2kb file while snappy produces a 44.2kb file.

Closes apache#1714 from deshanxiao/orc-1463.

Lead-authored-by: Deshan Xiao <deshanxiao@microsoft.com>
Co-authored-by: deshanxiao <deshanxiao@microsoft.com>
Signed-off-by: William Hyun <william@apache.org>
@cxzl25
Copy link
Contributor

cxzl25 commented Mar 15, 2024

For Spark community, I'll exclude it officially and add a documentation later to guide the users to add that dependency from application side.

Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. (SPARK-44115)
However, due to some hardcode checks in Spark, brotli cannot be used normally.

Can I propose a PR that Spark supports ORC brotli ?
cxzl25/spark@0cd5d03

set spark.sql.orc.compression.codec=BROTLI;
java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli

By the way, netty-codec declares brotli-related dependencies for multiple OSs.

https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125

@dongjoon-hyun
Copy link
Member

Yes, it was intentional, @cxzl25 , because it's a new feature from Spark's perspective. We need to focus on Spark perspective to do that like Spark documentation and configs.

Of course, you are welcome always. 😄 Go for it~

Can I propose a PR that Spark supports ORC brotli ?

dongjoon-hyun pushed a commit to apache/spark that referenced this pull request Mar 19, 2024
### What changes were proposed in this pull request?
This PR aims to support ORC Brotli codec.

### Why are the changes needed?
Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. ([SPARK-44115](https://issues.apache.org/jira/browse/SPARK-44115))
However, due to some hardcode checks in Spark, brotli cannot be used normally.

```sql
set spark.sql.orc.compression.codec=BROTLI;
```

```
java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli
```

See apache/orc#1714

### Does this PR introduce _any_ user-facing change?
Yes

In this PR, add the corresponding brotli jar package, we can use orc brotli encoding.

```xml
    <dependency>
      <groupId>com.aayushatharva.brotli4j</groupId>
      <artifactId>brotli4j</artifactId>
    </dependency>
```

REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125

### How was this patch tested?
local test

<img width="850" alt="image" src="https://github.com/apache/spark/assets/3898450/4e4f9422-1cf2-45ef-8bcc-8bae6188beb7">

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45584 from cxzl25/orc_brotli_codec.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
### What changes were proposed in this pull request?
This PR aims to support ORC Brotli codec.

### Why are the changes needed?
Currently, the master branch of Spark has used ORC 2.0.0 version and supports brotli compression encoding. ([SPARK-44115](https://issues.apache.org/jira/browse/SPARK-44115))
However, due to some hardcode checks in Spark, brotli cannot be used normally.

```sql
set spark.sql.orc.compression.codec=BROTLI;
```

```
java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli
```

See apache/orc#1714

### Does this PR introduce _any_ user-facing change?
Yes

In this PR, add the corresponding brotli jar package, we can use orc brotli encoding.

```xml
    <dependency>
      <groupId>com.aayushatharva.brotli4j</groupId>
      <artifactId>brotli4j</artifactId>
    </dependency>
```

REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125

### How was this patch tested?
local test

<img width="850" alt="image" src="https://github.com/apache/spark/assets/3898450/4e4f9422-1cf2-45ef-8bcc-8bae6188beb7">

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45584 from cxzl25/orc_brotli_codec.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants