[SPARK-47456][SQL] Support ORC Brotli codec #45584

Closed
wants to merge 3 commits

Conversation

@cxzl25 (Contributor) commented Mar 19, 2024

What changes were proposed in this pull request?

This PR aims to support ORC Brotli codec.

Why are the changes needed?

Currently, the master branch of Spark uses ORC 2.0.0, which supports Brotli compression encoding (SPARK-44115).
However, because of hardcoded codec checks in Spark, Brotli cannot actually be used:

```sql
set spark.sql.orc.compression.codec=BROTLI;
```

```
java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli
```

See apache/orc#1714

Does this PR introduce any user-facing change?

Yes

This PR adds the corresponding brotli4j dependency so that ORC Brotli encoding can be used:

```xml
<dependency>
  <groupId>com.aayushatharva.brotli4j</groupId>
  <artifactId>brotli4j</artifactId>
</dependency>
```

REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125
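
For illustration only (not part of this PR), here is a minimal sketch of how the codec could be exercised once the brotli4j jar and its OS-specific native library are on the classpath; the session settings, path, and data below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; assumes brotli4j and its native library are on the classpath.
val spark = SparkSession.builder()
  .appName("orc-brotli-demo")
  .master("local[*]")
  .config("spark.sql.orc.compression.codec", "brotli") // rejected by the old hardcoded check
  .getOrCreate()

// Write and read back a small ORC dataset compressed with Brotli.
spark.range(1000).toDF("id")
  .write.mode("overwrite")
  .orc("/tmp/orc_brotli_demo")

spark.read.orc("/tmp/orc_brotli_demo").show(5)
```

Before this change, the `config` call above would have been rejected with the IllegalArgumentException shown earlier.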

How was this patch tested?

local test

[screenshot of the local test result]

Was this patch authored or co-authored using generative AI tooling?

No

@yaooqinn (Member) left a comment:

LGTM from my side, also cc @dongjoon-hyun

```
@@ -71,6 +71,9 @@ class OrcCodecSuite extends FileSourceCodecSuite {
  override def format: String = "orc"
  override val codecConfigName: String = SQLConf.ORC_COMPRESSION.key
  // Exclude "BROTLI" because its dependencies
  // require adding different jars according to different OSs
```
A contributor asked:

Which jar should be used when the user wants to enable this feature?

A member replied:
This one. I excluded it intentionally during the upgrade to Apache ORC 2.0.0.

spark/pom.xml (lines 2615 to 2618 in 681b41f):

```xml
<exclusion>
  <groupId>com.aayushatharva.brotli4j</groupId>
  <artifactId>brotli4j</artifactId>
</exclusion>
```

Another member commented:
As he mentioned, native brotli should be available in the system, too.

The contributor replied:
OK~

The author (@cxzl25) replied:
Thanks for the comment. I've added the required dependencies to the PR description.
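
As an aside (not part of this PR), here is a hedged sketch of how one could verify at runtime that the native Brotli library discussed above is actually loadable, assuming brotli4j's `Brotli4jLoader` API:

```scala
import com.aayushatharva.brotli4j.Brotli4jLoader

// Assumption: ensureAvailability() throws if the OS-specific native
// Brotli library cannot be loaded from the classpath or the system.
Brotli4jLoader.ensureAvailability()
println("Brotli native library is available")
```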

```diff
@@ -241,7 +241,7 @@ Data source options of ORC can be set via:
   <tr>
     <td><code>compression</code></td>
     <td><code>zstd</code></td>
-    <td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, zstd and lz4). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>.</td>
+    <td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, zstd, lz4 and brotli). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>.</td>
```
@dongjoon-hyun (Member) commented on Mar 19, 2024:
Like Parquet, we need a Note here, @cxzl25.

Note that <code>brotli</code> requires <code>BrotliCodec</code> to be installed.

The author (@cxzl25) replied:
Thanks. Updated.
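
For reference, a hedged sketch of the per-write `compression` option described in the documentation row above; the path and data are hypothetical, and the Brotli native library must be installed as noted:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical write that overrides the codec for a single ORC output.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.range(100).toDF("id")
  .write
  .option("compression", "brotli") // takes precedence over orc.compress and spark.sql.orc.compression.codec
  .mode("overwrite")
  .orc("/tmp/orc_brotli_option_demo")
```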

@dongjoon-hyun (Member) commented:
Merged to master for Apache Spark 4.0.0.

Thank you, @cxzl25, @yaooqinn, @LuciferYang!

sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024

Closes apache#45584 from cxzl25/orc_brotli_codec.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>