@dongjoon-hyun
Member

What changes were proposed in this pull request?

Apache Spark 2.3 introduced native ORC support with vectorization and many fixes. However, it shipped as a non-default option. This PR enables the native ORC implementation and predicate pushdown by default for Apache Spark 2.4. We will improve and stabilize the ORC data source before Apache Spark 2.4. Eventually, Apache Spark will drop the old Hive-based ORC code.
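
For readers who want to pin the ORC behavior explicitly rather than rely on the default, here is a minimal sketch. It assumes the two settings involved are `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown`, and that the pre-change defaults were `hive` and `false`. The dictionaries and the `to_submit_args` helper are hypothetical names used only for illustration; they are not part of Spark or of this patch.

```python
# Illustrative only: the dictionaries and helper below are hypothetical.
# The config keys are assumed to be the ones this PR flips by default.

# Values this PR proposes as defaults for Spark 2.4.
ORC_CONF_NATIVE = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.filterPushdown": "true",
}

# Assumed earlier defaults, for users who need the Hive-based reader back.
ORC_CONF_HIVE = {
    "spark.sql.orc.impl": "hive",
    "spark.sql.orc.filterPushdown": "false",
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments that could be appended to a spark-submit command.
    print(" ".join(to_submit_args(ORC_CONF_HIVE)))
```

The same key/value pairs could instead be set on a running session or in `spark-defaults.conf`; the helper only shows one way to keep the choice explicit per job.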

How was this patch tested?

Passes Jenkins with the existing tests.

@SparkQA

SparkQA commented Feb 17, 2018

Test build #87525 has finished for PR 20634 at commit bde6818.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 83c0087 Feb 20, 2018
@dongjoon-hyun
Member Author

Thank you, @gatorsmile.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23456 branch February 20, 2018 18:03
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#20634 from dongjoon-hyun/SPARK-23456.

Change-Id: Ib7ec85d2ae6b96451fd28370ef5f5e3924d10de8
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#20634 from dongjoon-hyun/SPARK-23456.
@pan3793
Member

pan3793 commented Aug 4, 2023

... eventually, Apache Spark will drop old Hive-based ORC code.

@dongjoon-hyun Do you think it's a good time to drop the implementation spark.sql.orc.impl=hive in Spark 4.0? If you don't object, I will open a JIRA to track it.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2023

Yes, it's a good chance. BTW, IIRC, there was a difference in the Hive ORC CHAR implementation before, so we couldn't remove it because of backward-compatibility issues. Since Spark now implements many CHAR features, we need to re-verify that the native implementation has all the legacy Hive-based ORC features, @pan3793.

@pan3793
Member

pan3793 commented Aug 4, 2023

@dongjoon-hyun Thanks for your response and detailed comment. SPARK-44677 has been created to track the removal work; I will start to take a look after 3.5.0 is released.

@dongjoon-hyun
Member Author

Thank you, @pan3793!
