[SPARK-22643][SQL] ColumnarArray should be an immutable view #19842

cloud-fan · 2017-11-29T07:22:34Z

What changes were proposed in this pull request?

To make ColumnVector public, ColumnarArray needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make ColumnarArray an immutable view of the data, and to always create a new instance of ColumnarArray in ColumnVector#getArray.
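
As a rough illustration of the idea, the sketch below shows an immutable array view over a flat data buffer. The classes `IntData`, `ArrayView`, and `ArrayColumn` are hypothetical stand-ins invented for this example, not Spark's actual `ColumnVector` or `ColumnarArray`; only the (data, offset, length) constructor shape follows the description above.

```java
public class ImmutableViewSketch {
    /** Hypothetical child vector holding the elements of all arrays back to back. */
    static final class IntData {
        private final int[] values;
        IntData(int... values) { this.values = values; }
        int getInt(int i) { return values[i]; }
    }

    /** Immutable view over a slice of IntData: every field is final and set once. */
    static final class ArrayView {
        private final IntData data;
        private final int offset;
        private final int length;

        ArrayView(IntData data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        int numElements() { return length; }

        int getInt(int ordinal) {
            if (ordinal < 0 || ordinal >= length) {
                throw new IndexOutOfBoundsException("ordinal " + ordinal);
            }
            return data.getInt(offset + ordinal);
        }
    }

    /** Hypothetical array column: row i covers data[offsets[i], offsets[i] + lengths[i]). */
    static final class ArrayColumn {
        private final IntData data;
        private final int[] offsets;
        private final int[] lengths;

        ArrayColumn(IntData data, int[] offsets, int[] lengths) {
            this.data = data;
            this.offsets = offsets;
            this.lengths = lengths;
        }

        /** Returns a fresh view on every call instead of mutating a shared one. */
        ArrayView getArray(int rowId) {
            return new ArrayView(data, offsets[rowId], lengths[rowId]);
        }
    }

    /** Builds a two-row column: row 0 = [10, 11], row 1 = [20, 21, 22]. */
    static ArrayColumn demoColumn() {
        return new ArrayColumn(new IntData(10, 11, 20, 21, 22),
                               new int[] {0, 2}, new int[] {2, 3});
    }

    public static void main(String[] args) {
        ArrayColumn col = demoColumn();
        ArrayView row0 = col.getArray(0);
        ArrayView row1 = col.getArray(1);
        // row0 still describes row 0 after row 1 was fetched.
        System.out.println(row0.numElements() + " " + row1.numElements()); // prints "2 3"
    }
}
```

With a single reused, mutable view object, fetching row 1 would silently change what `row0` points at; with a fresh immutable view per call, both references stay valid.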

How was this patch tested?

New benchmark in ColumnarBatchBenchmark.

cloud-fan · 2017-11-29T07:22:55Z

cc @michal-databricks @hvanhovell @kiszk @gatorsmile

SparkQA · 2017-11-29T08:05:01Z

Test build #84285 has finished for PR 19842 at commit aaa33dd.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-11-29T10:27:44Z

Jenkins, retest this please

SparkQA · 2017-11-29T13:14:05Z

Test build #84294 has finished for PR 19842 at commit aaa33dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-11-29T13:21:08Z

@cloud-fan TPCDS does not have nested data or arrays. So I think we have to redo the benchmarks. A simple micro benchmark that touches a few elements in the array should probably do it.

hvanhovell · 2017-11-29T13:23:23Z

LGTM - pending benchmarks :)

kiszk · 2017-11-29T15:56:34Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

-    resultArray.length = getArrayLength(rowId);
-    resultArray.offset = getArrayOffset(rowId);
-    return resultArray;
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));


Is it better to create ColumnarArray for each rowID only once (e.g. by using caching)? I am curious whether we would see performance overhead for creating ColumnarArray to access elements of a multi-dimensional array (e.g. a[1][2] + a[1][3]).

hvanhovell · 2017-11-29

I don't think that is a good idea. That would require us to keep an array of ColumnarArray around. That might mess with both GC and escape analysis. Let's just create a benchmark and check that we do not regress.
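
The `a[1][2] + a[1][3]` question can be made concrete with a small counting sketch. Everything in it is hypothetical: `NestedColumn`, `OuterView`, `InnerView`, and the allocation counter are invented for illustration and are neither Spark classes nor generated code. It only shows how many view objects a naive evaluation allocates compared with one that fetches `a[1]` once. Whether the JIT's escape analysis removes those allocations in practice is what a benchmark would have to show.

```java
public class NestedViewSketch {
    /** Counts view allocations so the two evaluation strategies can be compared. */
    static int viewsCreated = 0;

    /** Hypothetical array<array<int>> column backed by flat arrays. */
    static final class NestedColumn {
        final int[] data;          // all ints of all inner arrays, back to back
        final int[] innerOffsets;  // start of each inner array in data
        final int[] innerLengths;  // length of each inner array
        final int[] outerOffsets;  // index of the first inner array of each row
        final int[] outerLengths;  // number of inner arrays in each row

        NestedColumn(int[] data, int[] innerOffsets, int[] innerLengths,
                     int[] outerOffsets, int[] outerLengths) {
            this.data = data;
            this.innerOffsets = innerOffsets;
            this.innerLengths = innerLengths;
            this.outerOffsets = outerOffsets;
            this.outerLengths = outerLengths;
        }

        OuterView getArray(int rowId) {
            return new OuterView(this, outerOffsets[rowId], outerLengths[rowId]);
        }
    }

    /** View over one outer array value (a); its elements are inner arrays. */
    static final class OuterView {
        private final NestedColumn col;
        private final int offset;
        private final int length;

        OuterView(NestedColumn col, int offset, int length) {
            this.col = col;
            this.offset = offset;
            this.length = length;
            viewsCreated++;
        }

        int numElements() { return length; }

        InnerView getArray(int ordinal) {
            int inner = offset + ordinal;
            return new InnerView(col.data, col.innerOffsets[inner], col.innerLengths[inner]);
        }
    }

    /** View over one inner int array (for example a[1]). */
    static final class InnerView {
        private final int[] data;
        private final int offset;
        private final int length;

        InnerView(int[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
            viewsCreated++;
        }

        int numElements() { return length; }
        int getInt(int ordinal) { return data[offset + ordinal]; }
    }

    /** One row: a = [[1, 2, 3, 4], [10, 20, 30, 40]]. */
    static NestedColumn demoColumn() {
        return new NestedColumn(new int[] {1, 2, 3, 4, 10, 20, 30, 40},
                                new int[] {0, 4}, new int[] {4, 4},
                                new int[] {0}, new int[] {2});
    }

    /** Naive: every a[1][k] walks the full path from the column. */
    static int naive(NestedColumn col, int rowId) {
        return col.getArray(rowId).getArray(1).getInt(2)
             + col.getArray(rowId).getArray(1).getInt(3);
    }

    /** Hoisted: a[1] is fetched once and reused for both element reads. */
    static int hoisted(NestedColumn col, int rowId) {
        InnerView a1 = col.getArray(rowId).getArray(1);
        return a1.getInt(2) + a1.getInt(3);
    }

    public static void main(String[] args) {
        NestedColumn col = demoColumn();
        viewsCreated = 0;
        int n = naive(col, 0);
        System.out.println(n + " using " + viewsCreated + " views"); // 70 using 4 views
        viewsCreated = 0;
        int h = hoisted(col, 0);
        System.out.println(h + " using " + viewsCreated + " views"); // 70 using 2 views
    }
}
```

Both strategies avoid a per-row cache of view objects, which is the structure the reply above argues against.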

cloud-fan · 2017-11-29T17:10:51Z

sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala

@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
    }

    /*
-    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz


I reran all the benchmarks in this file and updated the results.

cloud-fan · 2017-11-29T17:11:42Z

sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala

+    ByteBuffer API                                1411 / 1418        232.2           4.3       0.1X
+    DirectByteBuffer                               467 /  474        701.8           1.4       0.4X
+    Unsafe Buffer                                  178 /  185       1843.6           0.5       1.0X
+    Column(on heap)                                178 /  184       1840.8           0.5       1.0X


Previously, the on-heap column vector was much faster than the Java array, which is unreasonable, and I can't reproduce it now.

cloud-fan · 2017-11-29T17:12:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala

+    String Read/Write:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------------------------
+    On Heap                                        332 /  338         49.3          20.3       1.0X
+    Off Heap                                       466 /  467         35.2          28.4       0.7X


This is due to the data copy saved in https://github.com/apache/spark/pull/19815/files#diff-f43d67d60091eab39c1310330bf7a8ffR211

cloud-fan · 2017-11-29T17:14:44Z

sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala

+    On Heap Read Size Only                         415 /  422        394.7           2.5       1.0X
+    Off Heap Read Size Only                        394 /  402        415.9           2.4       1.1X
+    On Heap Read Elements                         2558 / 2593         64.0          15.6       0.2X
+    Off Heap Read Elements                        3316 / 3317         49.4          20.2       0.1X


The results before this PR:

    On Heap Read Size Only                          83 /   92       1970.3           0.5       1.0X
    Off Heap Read Size Only                         98 /  110       1669.1           0.6       0.8X
    On Heap Read Elements                         3190 / 3203         51.4          19.5       0.0X
    Off Heap Read Elements                        3106 / 3146         52.8          19.0       0.0X

In the worst case, where we only get the array and read its size, reusing the object is a clear improvement. However, if we also access the array elements (which should be the most common case), the overhead is negligible.
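
The two benchmark cases differ only in what is done with the view once it has been fetched. Below is a minimal sketch of the two loops. It uses hypothetical stand-in classes rather than Spark's `ColumnVector`; the names `ArrayColumn`, `ArrayView`, `readSizeOnly`, and `readElements` are assumptions made for this example, and it does not measure anything.

```java
public class AccessPatternSketch {
    /** Hypothetical immutable view over a slice of a flat int buffer. */
    static final class ArrayView {
        private final int[] data;
        private final int offset;
        private final int length;

        ArrayView(int[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        int numElements() { return length; }
        int getInt(int ordinal) { return data[offset + ordinal]; }
    }

    /** Hypothetical array column that allocates a new view per getArray call. */
    static final class ArrayColumn {
        private final int[] data;
        private final int[] offsets;
        private final int[] lengths;

        ArrayColumn(int[] data, int[] offsets, int[] lengths) {
            this.data = data;
            this.offsets = offsets;
            this.lengths = lengths;
        }

        int numRows() { return offsets.length; }

        ArrayView getArray(int rowId) {
            return new ArrayView(data, offsets[rowId], lengths[rowId]);
        }
    }

    /** "Read Size Only": one view per row, and only its length is touched. */
    static long readSizeOnly(ArrayColumn col) {
        long sum = 0;
        for (int row = 0; row < col.numRows(); row++) {
            sum += col.getArray(row).numElements();
        }
        return sum;
    }

    /** "Read Elements": one view per row, then every element is read. */
    static long readElements(ArrayColumn col) {
        long sum = 0;
        for (int row = 0; row < col.numRows(); row++) {
            ArrayView arr = col.getArray(row);
            for (int i = 0; i < arr.numElements(); i++) {
                sum += arr.getInt(i);
            }
        }
        return sum;
    }

    /** Rows: [1, 2], [3], [], [4, 5, 6]. */
    static ArrayColumn demoColumn() {
        return new ArrayColumn(new int[] {1, 2, 3, 4, 5, 6},
                               new int[] {0, 2, 3, 3}, new int[] {2, 1, 0, 3});
    }

    public static void main(String[] args) {
        ArrayColumn col = demoColumn();
        System.out.println(readSizeOnly(col));  // prints 6
        System.out.println(readElements(col));  // prints 21
    }
}
```

In the first loop the per-row view is nearly all of the work, so its creation cost is most visible there. In the second loop the element reads are added on top of it, which fits the observation that the overhead is negligible when elements are accessed.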

kiszk · 2017-11-30

Thank you for running the benchmark. I understand that reusing the object performs well.
I am curious whether the current Catalyst can generate such Java code for accessing nested array elements in SQL, e.g. selectExpr("a[1][1] + a[1][2] + a[1][3] + a[1][4] + a[1][5]").

SparkQA · 2017-11-29T19:45:02Z

Test build #84307 has finished for PR 19842 at commit 83d5120.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-30T10:34:57Z

Since the benchmark shows negligible overhead for normal cases, I'm merging it to master, thanks!

## What changes were proposed in this pull request?

Similar to apache#19842, we should also make `ColumnarRow` an immutable view, and move forward to make `ColumnVector` public.

## How was this patch tested?

Existing tests. The performance concern should be same as apache#19842.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19898 from cloud-fan/row-id.

ColumnarArray should be an immutable view

aaa33dd

kiszk reviewed Nov 29, 2017

add benchmark

83d5120

cloud-fan commented Nov 29, 2017

asfgit closed this in 9c29c55 Nov 30, 2017

cloud-fan mentioned this pull request Dec 5, 2017

[SPARK-22703][SQL] make ColumnarRow an immutable view #19898

Closed
