[SPARK-23456][SPARK-21783] Turn on native ORC impl and PPD by default
#20634
|
Test build #87525 has finished for PR 20634 at commit
|
gatorsmile
left a comment
LGTM
|
Thanks! Merged to master. |
|
Thank you, @gatorsmile . |
Apache Spark 2.3 introduced `native` ORC support with vectorization and many fixes. However, it shipped as a non-default option. This PR enables the `native` ORC implementation and predicate pushdown by default for Apache Spark 2.4. We will improve and stabilize the ORC data source before Apache Spark 2.4. Eventually, Apache Spark will drop the old Hive-based ORC code.
Pass the Jenkins with existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes apache#20634 from dongjoon-hyun/SPARK-23456.
Change-Id: Ib7ec85d2ae6b96451fd28370ef5f5e3924d10de8
@dongjoon-hyun Do you think it's a good time to drop the implementation |
|
Yes, it's a good opportunity. BTW, IIRC, there was a difference in the Hive ORC CHAR implementation before, so we couldn't remove it because of backward-compatibility issues. Since Spark now implements many CHAR features, we need to re-verify that |
|
@dongjoon-hyun Thanks for your response and detailed comment. SPARK-44677 has been created to track the removal work; I will start looking into it after 3.5.0 is released. |
|
Thank you, @pan3793 ! |
What changes were proposed in this pull request?
Apache Spark 2.3 introduced `native` ORC support with vectorization and many fixes. However, it shipped as a non-default option. This PR enables the `native` ORC implementation and predicate pushdown by default for Apache Spark 2.4. We will improve and stabilize the ORC data source before Apache Spark 2.4. Eventually, Apache Spark will drop the old Hive-based ORC code.
How was this patch tested?
Passes Jenkins with the existing tests.
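For readers who want to pin or revert the behavior this PR changes, the fragment below is a minimal sketch of a `spark-defaults.conf` entry. It assumes the two defaults in question are controlled by the `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` settings; those key names are not stated in the text above, so check them against the SQL configuration of your Spark version before relying on them.

```properties
# Assumed keys (not named in the PR text above).
# Opt back into the old Hive-based ORC reader and turn off predicate pushdown,
# i.e. the behavior that applied before this change.
spark.sql.orc.impl            hive
spark.sql.orc.filterPushdown  false

# To state the new defaults explicitly instead, use:
# spark.sql.orc.impl            native
# spark.sql.orc.filterPushdown  true
```

Setting the values explicitly, rather than depending on the version default, keeps a job's ORC behavior stable across an upgrade from 2.3 to 2.4.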