
[SPARK-16060][SQL] Vectorized Orc reader #13775

Closed. viirya wants to merge 24 commits.

@viirya (Member) commented Jun 20, 2016

What changes were proposed in this pull request?

Currently the Orc reader in Spark SQL doesn't support vectorized reading. Since Hive's Orc reader already supports vectorization, we can add this support to improve Orc reading performance.

Benchmark 1

Benchmark code:

test("Benchmark for Orc") {
  val N = 500 << 12
    withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
      val benchmark = new Benchmark("Orc reader", N)
      benchmark.addCase("reading Orc file", 10) { iter =>
        sql("SELECT  sum(_1), count(_2), sum(_3), avg(_4) FROM t").collect()
      }
      benchmark.run()
  }
}

Before this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------   
reading Orc file                              1405 / 1593          1.5         686.1       1.0X

After this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               944 / 1126          2.2         460.8       1.0X

Before this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               900 /  962          2.3         439.3       1.0X

After this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               286 /  330          7.2         139.8       1.0X

Notice: After supporting Spark's column batch, performance improves significantly because downstream operators can benefit from batch processing.

Benchmark 2

Benchmark code:

test("Benchmark for Orc") {
  val N = 500 << 12
  withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
    val benchmark = new Benchmark("Orc reader", N)
    benchmark.addCase("reading Orc file", 10) { iter =>
      sql("SELECT * FROM t").count()
    }
    benchmark.run()
  }
}

Before this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               813 / 1018          2.5         397.1       1.0X

After this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               508 /  693          4.0         248.2       1.0X

Before this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               208 /  235          9.9         101.4       1.0X

After this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               102 /  128         20.1          49.7       1.0X

Notice: For a simple counting operation, batch processing doesn't show as much improvement as in Benchmark 1, which is reasonable, but the speedup is still significant.

How was this patch tested?

Existing tests.

@SparkQA commented Jun 20, 2016

Test build #60827 has finished for PR 13775 at commit 7e7bb6c.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor)

@viirya, could you re-run the benchmarks without calling collect()? Do a count or a simple aggregate instead; collect spends a tonne of time serializing results from InternalRow to Row.

@hvanhovell (Contributor)

Would PR #13676 help to improve performance?

@rxin (Contributor) commented Jun 20, 2016

@viirya, when you construct a performance benchmark, you want to minimize the overhead of things outside the code path you are testing. In this case, a lot of the time was spent in the collect operation.

@SparkQA commented Jun 20, 2016

Test build #60828 has finished for PR 13775 at commit 20b832e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 20, 2016

@hvanhovell @rxin Got it. Thanks! I will re-run the benchmark.

@viirya (Member, Author) commented Jun 20, 2016

@hvanhovell @rxin I've re-run the benchmark and updated the results.

@rxin (Contributor) commented Jun 20, 2016

This is still wrong, unfortunately: count(*) is going to prune all the columns ...
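
To see why, compare the query plans (an illustrative sketch, not from the original thread; it assumes an active SparkSession named spark and the benchmark table t from above):

// count(*) needs no column values, so the optimizer prunes every column and the
// ORC reader materializes none of them; an aggregate that touches each column
// forces the reader to decode all four.
spark.sql("SELECT count(*) FROM t").explain();
spark.sql("SELECT sum(_1), count(_2), sum(_3), avg(_4) FROM t").explain();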

@viirya (Member, Author) commented Jun 20, 2016

@rxin Oh, right... I will update this later. Thanks!

@viirya (Member, Author) commented Jun 21, 2016

@hvanhovell @rxin I've updated the benchmark. Please let me know if it is appropriate this time. Thanks!

 * @throws HiveException
 */
private VectorizedRowBatch constructVectorizedRowBatch(
    StructObjectInspector oi) throws HiveException {
A Contributor commented on this diff:

IOException instead of HiveException?

@SparkQA commented Jun 22, 2016

Test build #61037 has finished for PR 13775 at commit 855bcfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

BytesColumnVector bv = (BytesColumnVector) columns[columnIDs.get(ordinal)];
String str = null;
if (bv.isRepeating) {
  str = new String(bv.vector[0], bv.start[0], bv.length[0], StandardCharsets.UTF_8);
A Contributor commented on this diff:
Can creation of a String be avoided by using UTF8String.fromBytes? My understanding is that the encode/decode in new String(..) and UTF8String.fromString can add up.
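
A minimal sketch of that suggestion (assuming bv from the snippet above; UTF8String.fromBytes(byte[], int, int) is Spark's org.apache.spark.unsafe.types.UTF8String):

import org.apache.spark.unsafe.types.UTF8String;

// Wrap the ORC-owned bytes directly, skipping the UTF-16 decode in
// new String(...) and the re-encode that UTF8String.fromString(...) would do.
UTF8String str = UTF8String.fromBytes(bv.vector[0], bv.start[0], bv.length[0]);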

@viirya (Member, Author) commented Jun 22, 2016

retest this please.

@SparkQA commented Jun 23, 2016

Test build #61075 has finished for PR 13775 at commit 855bcfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 23, 2016

Test build #61088 has finished for PR 13775 at commit 66ab632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ader3

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@SparkQA commented Jun 28, 2016

Test build #61384 has finished for PR 13775 at commit 4c14278.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 29, 2016

@rxin @hvanhovell Are you available to review this? Or should it wait until after the 2.0 release?

@viirya (Member, Author) commented Jul 10, 2016

also cc @liancheng @yhuai

@SparkQA commented Nov 23, 2016

Test build #69082 has finished for PR 13775 at commit c297678.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class OrcColumnVector extends org.apache.spark.sql.execution.vectorized.ColumnVector

@SparkQA commented Nov 24, 2016

Test build #69124 has finished for PR 13775 at commit 8638a0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 24, 2016

retest this please.

@viirya (Member, Author) commented Nov 24, 2016

With support for Spark's ColumnarBatch, the benchmarks show this vectorized Orc reader delivers a 2-3x improvement.

I will continue to add more tests.

@SparkQA commented Nov 24, 2016

Test build #69131 has finished for PR 13775 at commit 8638a0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2016

Test build #69133 has finished for PR 13775 at commit 55bb19f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 25, 2016

Test build #69148 has started for PR 13775 at commit 3014834.

@SparkQA commented Nov 25, 2016

Test build #69149 has started for PR 13775 at commit bd15842.

@SparkQA commented Nov 25, 2016

Test build #69147 has finished for PR 13775 at commit 160e924.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 25, 2016

@rxin @davies @hvanhovell @yhuai @tejasapatil @zjffdu @dafrista I've addressed the review comments and added tests. Please help review this if you can. Thanks.

@SparkQA commented Nov 25, 2016

Test build #69156 has finished for PR 13775 at commit 0ac61b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

@viirya, if possible, I'd like to benchmark this PR on my laptop.

@dongjoon-hyun (Member)

Otherwise, may I implement it this way in my PR, as @viirya's approach?

@viirya (Member, Author) commented May 10, 2017

@dongjoon-hyun Sure. Do you need any help?

@viirya (Member, Author) commented May 10, 2017

@dongjoon-hyun No problem.

@dongjoon-hyun (Member)

Thank you! First, I'll try to rebase and run with my OrcReadBenchmark (which is similar to ParquetReadBenchmark).

@dongjoon-hyun (Member)

Hmm. It seems the "Merge remote-tracking branch" commits make rebasing confusing. Let me think about how to compare this.

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…rc reader

## What changes were proposed in this pull request?

This is mostly from #13775

The wrapper solution is pretty good for string/binary types, as the ORC column vector doesn't keep bytes in a contiguous memory region and there is significant overhead when copying the data to a Spark columnar batch. For other cases, the wrapper solution is almost the same as the current solution.

I think we can treat the wrapper solution as a baseline and keep improving the write-to-Spark solution.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20205 from cloud-fan/orc.

(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
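
For reference, a hypothetical, heavily simplified sketch of the wrapper idea described in that commit (the real OrcColumnVector implements Spark's full ColumnVector interface; the class and method names here are illustrative only):

import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

// Instead of copying ORC data into Spark's columnar batch, serve reads
// directly from the Hive column vector on demand.
class OrcLongColumnWrapper {
  private final LongColumnVector orc;

  OrcLongColumnWrapper(LongColumnVector orc) { this.orc = orc; }

  // isRepeating means every row holds the value stored at index 0
  long getLong(int rowId) {
    return orc.vector[orc.isRepeating ? 0 : rowId];
  }

  boolean isNullAt(int rowId) {
    return !orc.noNulls && orc.isNull[orc.isRepeating ? 0 : rowId];
  }
}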
@viirya deleted the vectorized-orc-reader3 branch December 27, 2023 18:34