
[SPARK-16060][SQL] Vectorized Orc reader #13775

Closed. viirya wants to merge 24 commits.

@viirya (Member) commented Jun 20, 2016

What changes were proposed in this pull request?

Currently the Orc reader in Spark SQL doesn't support vectorized reading. Since Hive's Orc reader already supports vectorization, we can add this support to improve Orc reading performance.

Benchmark 1

Benchmark code:

test("Benchmark for Orc") {
  val N = 500 << 12
    withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
      val benchmark = new Benchmark("Orc reader", N)
      benchmark.addCase("reading Orc file", 10) { iter =>
        sql("SELECT  sum(_1), count(_2), sum(_3), avg(_4) FROM t").collect()
      }
      benchmark.run()
  }
}

Before this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------   
reading Orc file                              1405 / 1593          1.5         686.1       1.0X

After this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               944 / 1126          2.2         460.8       1.0X

Before this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               900 /  962          2.3         439.3       1.0X

After this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               286 /  330          7.2         139.8       1.0X

Notice: After supporting Spark's column batch, performance improves significantly because downstream operators can benefit from batch processing.

Benchmark 2

Benchmark code:

test("Benchmark for Orc") {
  val N = 500 << 12
  withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
    val benchmark = new Benchmark("Orc reader", N)
    benchmark.addCase("reading Orc file", 10) { iter =>
      sql("SELECT * FROM t").count()
    }
    benchmark.run()
  }
}

Before this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               813 / 1018          2.5         397.1       1.0X

After this patch (version 1, no column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               508 /  693          4.0         248.2       1.0X

Before this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               208 /  235          9.9         101.4       1.0X

After this patch (version 2, column batch support):

Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.27-moby
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               102 /  128         20.1          49.7       1.0X

Notice: For a simple counting operation, batch processing doesn't show as much improvement as in Benchmark 1, which is reasonable, but the speedup is still significant.

How was this patch tested?

Existing tests.

@SparkQA commented Jun 20, 2016

Test build #60827 has finished for PR 13775 at commit 7e7bb6c.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor)

@viirya, could you re-run the benchmarks without calling collect()? Do a count or a simple aggregate instead; collect spends a tonne of time serializing results from InternalRow to Row.

@hvanhovell (Contributor)

Would PR #13676 help to improve performance?

@rxin (Contributor) commented Jun 20, 2016

@viirya, when you construct a performance benchmark, you want to minimize the overhead of things outside the code path you are testing. In this case, a lot of the time was spent in the collect operation.

@SparkQA commented Jun 20, 2016

Test build #60828 has finished for PR 13775 at commit 20b832e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 20, 2016

@hvanhovell @rxin Got it. Thanks! I will re-run the benchmark.

@viirya (Member, Author) commented Jun 20, 2016

@hvanhovell @rxin I've re-run the benchmark and updated the results.

@rxin (Contributor) commented Jun 20, 2016

This is still wrong, unfortunately: count(*) is going to prune all the columns ...
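
To see why, compare the query plans (an illustrative sketch, not from the original thread; it assumes an active SparkSession named spark and the benchmark table t from above):

// count(*) needs no column values, so the optimizer prunes every column and the
// ORC reader materializes none of them; an aggregate that touches each column
// forces the reader to decode all four.
spark.sql("SELECT count(*) FROM t").explain();
spark.sql("SELECT sum(_1), count(_2), sum(_3), avg(_4) FROM t").explain();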

@viirya (Member, Author) commented Jun 20, 2016

@rxin Oh, right... I will update this later. Thanks!

@viirya (Member, Author) commented Jun 21, 2016

@hvanhovell @rxin I've updated the benchmark. Please let me know if it is appropriate this time. Thanks!

 * @throws HiveException
 */
private VectorizedRowBatch constructVectorizedRowBatch(
    StructObjectInspector oi) throws HiveException {
A Contributor commented on this diff:

IOException instead of HiveException?

@SparkQA commented Jun 22, 2016

Test build #61037 has finished for PR 13775 at commit 855bcfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

BytesColumnVector bv = (BytesColumnVector) columns[columnIDs.get(ordinal)];
String str = null;
if (bv.isRepeating) {
  str = new String(bv.vector[0], bv.start[0], bv.length[0], StandardCharsets.UTF_8);
A Contributor commented on this diff:
Can creation of a String be avoided by using UTF8String.fromBytes? My understanding is that the encode/decode in new String(..) and UTF8String.fromString can add up.
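
A minimal sketch of that suggestion (assuming bv from the snippet above; UTF8String.fromBytes(byte[], int, int) is Spark's org.apache.spark.unsafe.types.UTF8String):

import org.apache.spark.unsafe.types.UTF8String;

// Wrap the ORC-owned bytes directly, skipping the UTF-16 decode in
// new String(...) and the re-encode that UTF8String.fromString(...) would do.
UTF8String str = UTF8String.fromBytes(bv.vector[0], bv.start[0], bv.length[0]);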

@viirya (Member, Author) commented Jun 22, 2016

retest this please.

@SparkQA commented Jun 23, 2016

Test build #61075 has finished for PR 13775 at commit 855bcfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 23, 2016

Test build #61088 has finished for PR 13775 at commit 66ab632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ader3

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@SparkQA commented Jun 28, 2016

Test build #61384 has finished for PR 13775 at commit 4c14278.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 29, 2016

@rxin @hvanhovell Are you available to review this? Or should it wait until after the 2.0 release?

@viirya (Member, Author) commented Jul 10, 2016

also cc @liancheng @yhuai

@SparkQA commented Nov 23, 2016

Test build #69082 has finished for PR 13775 at commit c297678.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class OrcColumnVector extends org.apache.spark.sql.execution.vectorized.ColumnVector

@SparkQA commented Nov 24, 2016

Test build #69124 has finished for PR 13775 at commit 8638a0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 24, 2016

retest this please.

@viirya (Member, Author) commented Nov 24, 2016

With support for Spark's ColumnarBatch, the benchmarks show this vectorized Orc reader delivers a 2-3x improvement.

I will continue to add more tests.

@SparkQA commented Nov 24, 2016

Test build #69131 has finished for PR 13775 at commit 8638a0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2016

Test build #69133 has finished for PR 13775 at commit 55bb19f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 25, 2016

Test build #69148 has started for PR 13775 at commit 3014834.

@SparkQA commented Nov 25, 2016

Test build #69149 has started for PR 13775 at commit bd15842.

@SparkQA commented Nov 25, 2016

Test build #69147 has finished for PR 13775 at commit 160e924.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 25, 2016

@rxin @davies @hvanhovell @yhuai @tejasapatil @zjffdu @dafrista I've addressed the review comments and added tests. Please help review this if you can. Thanks.

@SparkQA commented Nov 25, 2016

Test build #69156 has finished for PR 13775 at commit 0ac61b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

@viirya, if possible, I'd like to benchmark this PR on my laptop.

@dongjoon-hyun (Member)

Otherwise, may I implement it this way in my PR, as @viirya's approach?

@viirya (Member, Author) commented May 10, 2017

@dongjoon-hyun Sure. Do you need any help?

@viirya (Member, Author) commented May 10, 2017

@dongjoon-hyun No problem.

@dongjoon-hyun (Member)

Thank you! First, I'll try to rebase and run with my OrcReadBenchmark (which is similar to ParquetReadBenchmark).

@dongjoon-hyun (Member)

Hmm. It seems the "Merge remote-tracking branch" commits make rebasing confusing. Let me think about how to compare this.

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…rc reader

## What changes were proposed in this pull request?

This is mostly from #13775

The wrapper solution is pretty good for string/binary types, as the ORC column vector doesn't keep bytes in a contiguous memory region and there is significant overhead when copying the data to a Spark columnar batch. For other cases, the wrapper solution is almost the same as the current solution.

I think we can treat the wrapper solution as a baseline and keep improving the write-to-Spark solution.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20205 from cloud-fan/orc.

(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
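
For reference, a hypothetical, heavily simplified sketch of the wrapper idea described in that commit (the real OrcColumnVector implements Spark's full ColumnVector interface; the class and method names here are illustrative only):

import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

// Instead of copying ORC data into Spark's columnar batch, serve reads
// directly from the Hive column vector on demand.
class OrcLongColumnWrapper {
  private final LongColumnVector orc;

  OrcLongColumnWrapper(LongColumnVector orc) { this.orc = orc; }

  // isRepeating means every row holds the value stored at index 0
  long getLong(int rowId) {
    return orc.vector[orc.isRepeating ? 0 : rowId];
  }

  boolean isNullAt(int rowId) {
    return !orc.noNulls && orc.isNull[orc.isRepeating ? 0 : rowId];
  }
}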
@viirya deleted the vectorized-orc-reader3 branch December 27, 2023 18:34