[SPARK-16060][SQL] Vectorized Orc reader #13775
Conversation
Test build #60827 has finished for PR 13775 at commit
@viirya could you re-run the benchmarks without calling collect()? Do a count or a simple aggregate instead; collect spends a tonne of time serializing results from
Would PR #13676 help to improve performance?
@viirya when you construct a performance benchmark, you want to minimize the overhead of things outside the code path you are testing. In this case, a lot of the time was spent in the collect operation.
Test build #60828 has finished for PR 13775 at commit
@hvanhovell @rxin Got it. Thanks! I will re-run the benchmark.
@hvanhovell @rxin I've re-run the benchmark and updated the results.
This is still wrong unfortunately -- count(*) is going to prune all the columns ...
@rxin oh, right... I will update this later. Thanks!
@hvanhovell @rxin I've updated the benchmark. Please let me know if it is appropriate this time. Thanks!
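The methodology point in this thread can be reduced to a plain-Java sketch (the class and method names here are illustrative, not from the PR): to benchmark a scan, consume the rows with a cheap aggregate that still touches the column values, rather than materializing every row the way collect() does, and rather than a count(*) that lets the reader prune all columns.

```java
import java.util.ArrayList;
import java.util.List;

public class BenchmarkShape {
    // collect()-style: materialize every row before aggregating.
    // The per-row allocations dominate and hide the cost of the scan under test.
    static long sumByMaterializing(long[] column) {
        List<Long> rows = new ArrayList<>();
        for (long v : column) rows.add(v);   // per-row boxing + list growth
        long sum = 0;
        for (long v : rows) sum += v;
        return sum;
    }

    // aggregate-style: fold on the fly. Every column value is still read
    // (so nothing gets pruned), but no per-row result object is built.
    static long sumByAggregating(long[] column) {
        long sum = 0;
        for (long v : column) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        long[] column = new long[1000];
        for (int i = 0; i < column.length; i++) column[i] = i;
        System.out.println(sumByMaterializing(column)); // 499500
        System.out.println(sumByAggregating(column));   // 499500
    }
}
```

Both paths read every value, so the reader cannot skip columns; the second path just avoids the materialization overhead that would otherwise swamp the measurement.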
   * @throws HiveException
   */
  private VectorizedRowBatch constructVectorizedRowBatch(
      StructObjectInspector oi) throws HiveException {
IOException instead of HiveException?
Test build #61037 has finished for PR 13775 at commit
BytesColumnVector bv = (BytesColumnVector) columns[columnIDs.get(ordinal)];
String str = null;
if (bv.isRepeating) {
  str = new String(bv.vector[0], bv.start[0], bv.length[0], StandardCharsets.UTF_8);
Can creation of a String be avoided by using UTF8String.fromBytes? My understanding is that the encode/decode in new String(..) and UTF8String.fromString can add up.
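The concern above is that new String(bytes, start, len, UTF_8) decodes UTF-8 into UTF-16, and UTF8String.fromString then encodes it back to UTF-8, while UTF8String.fromBytes can take the raw UTF-8 bytes directly. Here is a self-contained sketch of the two paths; a plain byte-array copy stands in for UTF8String.fromBytes, since Spark's class is not on the classpath here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8PathDemo {
    // Path the review flags: UTF-8 bytes -> UTF-16 String -> UTF-8 bytes.
    // Two full charset conversions per value.
    static byte[] viaString(byte[] vector, int start, int length) {
        String s = new String(vector, start, length, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Suggested path: slice the UTF-8 bytes directly, no charset conversion.
    // (UTF8String.fromBytes(vector, start, length) plays this role in Spark.)
    static byte[] directSlice(byte[] vector, int start, int length) {
        return Arrays.copyOfRange(vector, start, start + length);
    }

    public static void main(String[] args) {
        byte[] batch = "héllo wörld".getBytes(StandardCharsets.UTF_8);
        // Both paths yield the same bytes; only the work done differs.
        System.out.println(Arrays.equals(viaString(batch, 0, batch.length),
                                         directSlice(batch, 0, batch.length)));
    }
}
```

The (vector, start, length) parameters mirror BytesColumnVector's layout, where each value is an offset and length into a shared byte array.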
retest this please.
Test build #61075 has finished for PR 13775 at commit
Test build #61088 has finished for PR 13775 at commit
…ader3 Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
Test build #61384 has finished for PR 13775 at commit
@rxin @hvanhovell Are you available to review this? Or should it wait until after the 2.0 release?
also cc @liancheng @yhuai
Test build #69082 has finished for PR 13775 at commit
Test build #69124 has finished for PR 13775 at commit
retest this please.
By supporting Spark's ColumnarBatch, the benchmarks show this vectorized Orc reader delivers a 2x to 3x improvement. I will continue to add more tests.
Test build #69131 has finished for PR 13775 at commit
Test build #69133 has finished for PR 13775 at commit
Test build #69148 has started for PR 13775 at commit |
Compare: 3014834 to bd15842
Test build #69149 has started for PR 13775 at commit |
Test build #69147 has finished for PR 13775 at commit
@rxin @davies @hvanhovell @yhuai @tejasapatil @zjffdu @dafrista I've addressed the review comments and added tests. Please help review this if you can. Thanks.
Test build #69156 has finished for PR 13775 at commit
@viirya. If possible, I'd like to benchmark this PR on my laptop.
Otherwise, may I implement this approach in my PR and credit it as @viirya's approach?
@dongjoon-hyun Sure. Do you need any help?
@dongjoon-hyun No problem.
Thank you! First, I'll try to rebase and run with my
Hmm. It seems
…rc reader

What changes were proposed in this pull request?
This is mostly from #13775. The wrapper solution is pretty good for string/binary types, as the ORC column vector doesn't keep bytes in a contiguous memory region and has significant overhead when copying the data to a Spark columnar batch. For other cases, the wrapper solution is almost the same as the current solution. I think we can treat the wrapper solution as a baseline and keep improving the writing-to-Spark solution.

How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #20205 from cloud-fan/orc.
(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Currently the Orc reader in Spark SQL doesn't support vectorized reading. As Hive's Orc reader already supports vectorization, we can add this support to improve Orc reading performance.
Benchmark 1
Benchmark code:
Before this patch (version 1, no column batch support):
After this patch (version 1, no column batch support):
Before this patch (version 2, column batch support):
After this patch (version 2, column batch support):
Notice: after supporting Spark's column batch, performance is largely improved because the operators can benefit from batch processing.
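The batch-processing benefit can be illustrated with a minimal sketch (the types here are hypothetical stand-ins, not Spark's actual Row or ColumnarBatch API): a row-at-a-time reader pays an interface call per value, while a columnar batch lets the operator run a tight loop over a primitive array.

```java
public class BatchVsRow {
    // Row-at-a-time: one virtual call per value, hard for the JIT to optimize.
    interface Row { long getLong(int ordinal); }

    static long sumRows(Row[] rows) {
        long sum = 0;
        for (Row r : rows) sum += r.getLong(0);
        return sum;
    }

    // Batched: the operator sees the column as a primitive array and the
    // hot loop compiles to straight-line (often SIMD-friendly) code.
    static long sumBatch(long[] column, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) sum += column[i];
        return sum;
    }

    public static void main(String[] args) {
        long[] column = {1, 2, 3, 4, 5};
        Row[] rows = new Row[column.length];
        for (int i = 0; i < column.length; i++) {
            final long v = column[i];
            rows[i] = ordinal -> v;
        }
        System.out.println(sumRows(rows));       // 15
        System.out.println(sumBatch(column, 5)); // 15
    }
}
```

Both loops compute the same result; the batched form is what downstream operators can exploit once the reader produces column batches instead of individual rows.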
Benchmark 2
Benchmark code:
Before this patch (version 1, no column batch support):
After this patch (version 1, no column batch support):
Before this patch (version 2, column batch support):
After this patch (version 2, column batch support):
Notice: for a simple counting operation, batch processing doesn't show as much improvement as in benchmark 1, which is reasonable. But the improvement is still significant.
How was this patch tested?
Existing tests.