[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC #17943

dongjoon-hyun · 2017-05-11T01:59:50Z

What changes were proposed in this pull request?

Since SPARK-2883, Apache Spark has supported Apache ORC inside the sql/hive module, with a Hive dependency. This issue aims to add a new and faster ORC data source inside sql/core and to eventually replace the old ORC data source. This PR uses the latest Apache ORC 1.4.0 (released yesterday).

There are four key benefits.

Speed: The plan is to use Spark ColumnarBatch and ORC RowBatch together later. This PR uses only RowBatch, which is already faster than the current implementation in Spark. For ColumnarBatch, we need to benchmark and choose the fastest way to use it. (Please refer to the discussion in [SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924.)
Stability: Apache ORC 1.4.0 has many fixes, and we can rely more on the ORC community.
Usability: Users can use ORC data sources without the hive module, i.e., without building with -Phive.
Maintainability: This reduces the Hive dependency and lets us remove old legacy code later.

The following are two example comparisons from OrcReadBenchmark.scala.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL ORC Vectorized Reader                      278 /  320         56.5          17.7       1.0X
SQL ORC MR Reader                              348 /  358         45.2          22.1       0.8X
HIVE ORC MR Reader                             418 /  430         37.6          26.6       0.7X

Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL Read data column                           273 /  283         57.6          17.4       1.0X
SQL Read partition column                      252 /  266         62.5          16.0       1.1X
SQL Read both columns                          283 /  293         55.5          18.0       1.0X
HIVE Read data column                          510 /  520         30.8          32.4       0.5X
HIVE Read partition column                     420 /  425         37.5          26.7       0.7X
HIVE Read both columns                         527 /  538         29.9          33.5       0.5X
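
For readers checking the Relative column: the figures above appear consistent with dividing the baseline row's best time by each row's best time, so values below 1.0X are slower than the baseline. The sketch below is only an illustration under that assumption; the object name RelativeSpeed and the helper relative are hypothetical and are not part of OrcReadBenchmark.scala. The best times are copied from the "SQL Single Int Column Scan" table.

```scala
// Hypothetical helper, not part of the PR: recomputes the Relative column
// from best times, assuming Relative = baseline best time / row best time.
object RelativeSpeed {
  // Ratio of the baseline's best time to this run's best time.
  // A value below 1.0 means the run is slower than the baseline.
  def relative(baselineMs: Double, ms: Double): Double = baselineMs / ms

  def main(args: Array[String]): Unit = {
    // Best times (ms) from the "SQL Single Int Column Scan" table.
    val bestMs = Seq(
      "SQL ORC Vectorized Reader" -> 278.0,
      "SQL ORC MR Reader" -> 348.0,
      "HIVE ORC MR Reader" -> 418.0)

    // The first row is treated as the baseline (assumption).
    val baseline = bestMs.head._2

    bestMs.foreach { case (name, ms) =>
      println(f"$name%-28s ${relative(baseline, ms)}%.2fX")
    }
  }
}
```

The same ratio can be applied to the "Partitioned Table" rows by using 273 ms (SQL Read data column) as the baseline.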

How was this patch tested?

Passes the Jenkins tests, including newly added test suites in sql/core.

SparkQA · 2017-05-11T04:15:39Z

Test build #76764 has finished for PR 17943 at commit 70bc00e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-11T05:12:33Z

Test build #76776 has started for PR 17943 at commit d1417aa.

dongjoon-hyun · 2017-05-11T07:34:19Z

Retest this please.

SparkQA · 2017-05-11T10:30:29Z

Test build #76790 has finished for PR 17943 at commit d1417aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-05-11T20:58:10Z

Hi, @cloud-fan and @viirya.
Could you review the new ORC data source (without the ColumnarBatch part) again?

dongjoon-hyun · 2017-06-16T21:55:45Z

Retest this please

SparkQA · 2017-06-17T00:53:07Z

Test build #78193 has finished for PR 17943 at commit d1417aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-09-08T20:04:05Z

Please refer to the superset PR #17980.

dongjoon-hyun mentioned this pull request May 11, 2017

[SPARK-20682][SQL] Support a new faster ORC data source based on Apache ORC #17924

Closed

[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

d1417aa

dongjoon-hyun mentioned this pull request May 15, 2017

[SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and sql/core #17980

Closed

dongjoon-hyun mentioned this pull request Aug 4, 2017

[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640

Closed

dongjoon-hyun mentioned this pull request Aug 16, 2017

[SPARK-20682][SQL] Update ORC data source based on Apache ORC library #18953

Closed

dongjoon-hyun closed this Sep 8, 2017

dongjoon-hyun deleted the SPARK-20682-2 branch September 9, 2017 23:33
