
[SPARK-45664][SQL] Introduce a mapper for orc compression codecs #43528

Closed
wants to merge 1 commit

Conversation

@beliefer (Contributor)

What changes were proposed in this pull request?

Currently, Spark supports all of the ORC compression codecs, but the codecs supported by ORC and those supported by Spark do not map one-to-one, because Spark introduces two codec names, NONE and UNCOMPRESSED, that both mean no compression.

On the other hand, there are a lot of magic strings copied from the ORC compression codecs, so developers must manually keep them consistent. This is error-prone and reduces development efficiency.
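The mapper described above might look roughly like the following sketch. This is an illustration based on the PR description, not the merged Spark code: the class and member names (`OrcCodecMapperSketch`, `SparkOrcCodec`, `fromName`) are assumptions, and the nested `OrcKind` enum stands in for ORC's `org.apache.orc.CompressionKind` so the example compiles without the ORC jar.

```java
// Hypothetical sketch of a Spark-to-ORC codec mapper (not Spark's actual class).
public class OrcCodecMapperSketch {
    // Stand-in for org.apache.orc.CompressionKind.
    enum OrcKind { NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD }

    // Spark-facing codec names. NONE and UNCOMPRESSED are the two
    // Spark-only names; both map to ORC's NONE, which is why the
    // mapping is not one-to-one.
    enum SparkOrcCodec {
        NONE(OrcKind.NONE),
        UNCOMPRESSED(OrcKind.NONE),
        ZLIB(OrcKind.ZLIB),
        SNAPPY(OrcKind.SNAPPY),
        LZO(OrcKind.LZO),
        LZ4(OrcKind.LZ4),
        ZSTD(OrcKind.ZSTD);

        private final OrcKind kind;
        SparkOrcCodec(OrcKind kind) { this.kind = kind; }
        OrcKind toOrc() { return kind; }

        // Case-insensitive lookup: one place to resolve codec names
        // instead of magic strings scattered across the code base.
        static SparkOrcCodec fromName(String name) {
            return valueOf(name.toUpperCase(java.util.Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        System.out.println(SparkOrcCodec.fromName("uncompressed").toOrc()); // NONE
        System.out.println(SparkOrcCodec.fromName("zstd").toOrc());        // ZSTD
    }
}
```

With such an enum, tests can iterate over `SparkOrcCodec.values()` instead of hard-coding codec-name strings, which is the consistency benefit the description mentions.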

Why are the changes needed?

Make it easier for developers to use the ORC compression codecs.

Does this PR introduce any user-facing change?

'No'. It introduces a new class.

How was this patch tested?

Existing test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

github-actions bot added the SQL label Oct 25, 2023
beliefer force-pushed the SPARK-45664 branch 2 times, most recently from 44f3030 to b349200 (October 25, 2023 09:33)
@beliefer (Contributor Author)

ping @dongjoon-hyun cc @srowen @viirya

beliefer force-pushed the SPARK-45664 branch 2 times, most recently from 00b7b83 to 655fbeb (October 26, 2023 03:37)
@dongjoon-hyun (Member) left a comment

Just a question. May I ask why we do this ORC-specific change? Are you going to do the same things for all data sources like Parquet and Avro at Apache Spark 4.0.0?

@beliefer (Contributor Author)

> Just a question. May I ask why we do this ORC-specific change? Are you going to do the same things for all data sources like Parquet and Avro at Apache Spark 4.0.0?

Because the compression codecs supported by ORC and those supported by Spark do not map one-to-one: Spark introduces two codec names, NONE and UNCOMPRESSED. This change also makes tests easier to write and reduces the number of magic strings.

I'm doing the same thing for Parquet and Avro at Apache Spark 4.0.0.

beliefer force-pushed the SPARK-45664 branch 2 times, most recently from 3258d65 to a1b8ddd (October 28, 2023 11:19)
@dongjoon-hyun (Member) left a comment

> I'm doing the same things for Parquet and Avro at Apache Spark 4.0.0.

Got it. Thank you for keeping Apache Spark consistent.

@beliefer (Contributor Author)

@dongjoon-hyun Thank you!

dongjoon-hyun pushed a commit that referenced this pull request Oct 31, 2023
…rings copy from parquet|orc|avro compression codes

### What changes were proposed in this pull request?
This PR follows up #43562, #43528 and #43308.
The aim of this PR is to avoid magic strings copied from the `parquet|orc|avro` compression codecs.

This PR also simplifies some test cases.

### Why are the changes needed?
Avoid copying magic strings from the parquet|orc|avro compression codecs.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43604 from beliefer/parquet_orc_avro.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>