[SPARK-13255][SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows. #11435
Conversation
Test build #52198 has finished for PR 11435 at commit
Test build #52199 has finished for PR 11435 at commit
Test build #2592 has finished for PR 11435 at commit
Test build #52344 has finished for PR 11435 at commit
This patch is abusing the existing abstractions which will be/is being cleaned up in other patches. For a benchmark of the partition reading speed, I tested this on a table with 2 int cols, one in the data, one as the partition column. The results are:
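For reference, a hypothetical sketch of how the kind of table described above could be set up with the modern SparkSession API; the row count, column names, and output path are assumptions for illustration, not details from the PR:

```
import org.apache.spark.sql.SparkSession;

public class PartitionReadBenchmarkSetup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partition-read-benchmark-setup")
        .master("local[*]")
        .getOrCreate();

    // One int column stored in the data files, one int column used as the partition key.
    spark.range(0, 10_000_000L)
        .selectExpr("cast(id as int) as value", "cast(id % 100 as int) as part")
        .write()
        .partitionBy("part")
        .parquet("/tmp/partition_read_benchmark");  // illustrative path

    spark.stop();
  }
}
```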
Test build #52345 has finished for PR 11435 at commit
Test build #52350 has finished for PR 11435 at commit
Test build #52351 has finished for PR 11435 at commit
```
} else if (t == DataTypes.StringType) {
  UTF8String v = row.getUTF8String(fieldIdx);
  for (int i = 0; i < capacity; i++) {
    col.putByteArray(i, v.getBytes());
```
If `row` is an `UnsafeRow`, `v.getBytes()` will copy the bytes, so we should pull that call out of the loop.
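A minimal sketch of the suggested fix, mirroring the snippet above and assuming the same surrounding method (`row`, `col`, `fieldIdx`, and `capacity` come from that context):

```
} else if (t == DataTypes.StringType) {
  UTF8String v = row.getUTF8String(fieldIdx);
  // getBytes() may copy the underlying bytes (e.g. when row is an UnsafeRow),
  // so do the copy once instead of once per row.
  byte[] bytes = v.getBytes();
  for (int i = 0; i < capacity; i++) {
    col.putByteArray(i, bytes);
  }
}
```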
Test build #2607 has finished for PR 11435 at commit
Test build #52410 has finished for PR 11435 at commit
Test build #2611 has finished for PR 11435 at commit
Test build #2612 has finished for PR 11435 at commit
…arBatch instead of InternalRows.

Currently, the parquet reader returns rows one by one which is bad for performance. This patch updates the reader to directly return ColumnarBatches. This is only enabled with whole stage codegen, which is the only operator currently that is able to consume ColumnarBatches (instead of rows). The current implementation is a bit of a hack to get this to work and we should do more refactoring of these low level interfaces to make this work better.

Results:

```
TPCDS:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
---------------------------------------------------------------------------------
q55 (before)                       8897 / 9265         12.9          77.2
q55                                5486 / 5753         21.0          47.6
```
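For illustration, a minimal sketch of how a consumer can walk a ColumnarBatch row by row; the package paths match recent Spark releases (at the time of this PR, ColumnarBatch lived under `org.apache.spark.sql.execution.vectorized`):

```
import java.util.Iterator;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

final class BatchScanSketch {
  // Count the non-null values of one column by iterating the batch's rows.
  static long countNonNull(ColumnarBatch batch, int colIdx) {
    long count = 0;
    Iterator<InternalRow> it = batch.rowIterator();
    while (it.hasNext()) {
      InternalRow row = it.next();
      if (!row.isNullAt(colIdx)) {
        count++;
      }
    }
    return count;
  }
}
```

The whole-stage-codegen path in this patch goes further and reads the column vectors directly rather than through a per-row wrapper, which is where much of the speedup comes from.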
Test build #52446 has finished for PR 11435 at commit
Test build #52450 has finished for PR 11435 at commit
```
| ${consume(ctx, columns).trim}
| if (shouldStop()) {
|   return;
| if ($batch != null || $input.hasNext()) {
```
How about this:

```
if ($batch != null) {
  $scanBatches();
} else if ($input.hasNext()) {
  Object $value = $input.next();
  if ($value instanceof $columnarBatchClz) {
    $batch = ($columnarBatchClz) $value;
    $scanBatches();
  } else {
    $scanRows((InternalRow) $value);
  }
}
```
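If I read the suggestion right, it first drains any pending batch, and otherwise dispatches on the runtime type of the next element, so the same generated loop can consume both columnar batches and plain rows from the upstream iterator.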
Test build #52479 has finished for PR 11435 at commit
LGTM, merging this into master, thanks!
…narBatch instead of InternalRows.

## What changes were proposed in this pull request?

Currently, the parquet reader returns rows one by one which is bad for performance. This patch updates the reader to directly return ColumnarBatches. This is only enabled with whole stage codegen, which is the only operator currently that is able to consume ColumnarBatches (instead of rows). The current implementation is a bit of a hack to get this to work and we should do more refactoring of these low level interfaces to make this work better.

## How was this patch tested?

```
Results:
TPCDS:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
---------------------------------------------------------------------------------
q55 (before)                       8897 / 9265         12.9          77.2
q55                                5486 / 5753         21.0          47.6
```

Author: Nong Li <nong@databricks.com>

Closes apache#11435 from nongli/spark-13255.
…ntervalType correctly

### What changes were proposed in this pull request?

[`ColumnVectorUtils.populate()` does not handle the CalendarInterval type correctly](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java#L93-L94). The CalendarInterval type has the format [(months: int, days: int, microseconds: long)](https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java#L58). However, the function above misses the `days` field and sets the `microseconds` field in the wrong position.

`ColumnVectorUtils.populate()` is used by the [Parquet](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L258) and [ORC](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java#L171) vectorized readers to read partition columns, so Spark can potentially produce wrong results when reading a table with a CalendarInterval partition column. However, Spark [explicitly disallows writing data with CalendarInterval type](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L586), so it might not be a big deal for users. It is worth fixing anyway.

Caveat: I found the bug when reading through the related code path, but I never encountered the issue in production for a partition column with CalendarInterval type. I think it is an obvious fix unless someone more experienced can find more historical context. The code was introduced a long time ago and I could not find any more information about why it was implemented the way it is (#11435).

### Why are the changes needed?

To fix a potential correctness issue.

### Does this PR introduce _any_ user-facing change?

No, but it fixes the existing correctness issue when reading a partition column with CalendarInterval type.

### How was this patch tested?

Added a unit test in `ColumnVectorSuite.scala`. Verified that the unit test failed with the exception below without this PR:

```
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:345)
at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:94)
at org.apache.spark.sql.execution.vectorized.ColumnVectorSuite.$anonfun$new$99(ColumnVectorSuite.scala:613)
```

Closes #35314 from c21/vector-fix.

Authored-by: Cheng Su <chengsu@fb.com>

Signed-off-by: Wenchen Fan <wenchen@databricks.com>
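Based on the description above, a sketch of the corrected population logic: the interval value fans out into three child vectors for months, days, and microseconds. Method names follow Spark's `WritableColumnVector`; treat this as an illustration rather than the exact patch:

```
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.unsafe.types.CalendarInterval;

final class IntervalPopulateSketch {
  // Fill every slot of `col` (a struct-backed vector with children
  // months:int, days:int, microseconds:long) with one constant interval value,
  // the way a vectorized reader populates a partition column.
  static void populateInterval(WritableColumnVector col, CalendarInterval c, int capacity) {
    col.getChild(0).putInts(0, capacity, c.months);
    col.getChild(1).putInts(0, capacity, c.days);          // the field the buggy code missed
    col.getChild(2).putLongs(0, capacity, c.microseconds); // was written into the wrong child
  }
}
```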