New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22643][SQL] ColumnarArray should be an immutable view #19842
Conversation
Test build #84285 has finished for PR 19842 at commit
|
Jenkins, retest this please |
Test build #84294 has finished for PR 19842 at commit
|
@cloud-fan TPCDS does not have nested data or arrays. So I think we have to redo the benchmarks. A simple micro benchmark that touches a few elements in the array should probably do it. |
LGTM - pending benchmarks :) |
resultArray.length = getArrayLength(rowId); | ||
resultArray.offset = getArrayOffset(rowId); | ||
return resultArray; | ||
return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to create ColumnarArray
for each rowID
only once (e.g. by using caching)? I am curious whether we would see performance overhead for creating ColumnarArray
to access elements of a multi-dimensional array (e.g. a[1][2] + a[1][3]
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that is a good idea. That would require us to keep an array of ColumnarArray
around. That might mess with both GC and escape analysis. Let's just create a benchmark and check if we do not regress.
@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark { | |||
} | |||
|
|||
/* | |||
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rerun all the benchmarks in this file and update the results
ByteBuffer API 1411 / 1418 232.2 4.3 0.1X | ||
DirectByteBuffer 467 / 474 701.8 1.4 0.4X | ||
Unsafe Buffer 178 / 185 1843.6 0.5 1.0X | ||
Column(on heap) 178 / 184 1840.8 0.5 1.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previusly onheap column vector was much faster than java array, which is unreasonable and I can't reproduce it now.
String Read/Write: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------ | ||
On Heap 332 / 338 49.3 20.3 1.0X | ||
Off Heap 466 / 467 35.2 28.4 0.7X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is due to the data copy saved in https://github.com/apache/spark/pull/19815/files#diff-f43d67d60091eab39c1310330bf7a8ffR211
On Heap Read Size Only 415 / 422 394.7 2.5 1.0X | ||
Off Heap Read Size Only 394 / 402 415.9 2.4 1.1X | ||
On Heap Read Elements 2558 / 2593 64.0 15.6 0.2X | ||
Off Heap Read Elements 3316 / 3317 49.4 20.2 0.1X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result before this PR
On Heap Read Size Only 83 / 92 1970.3 0.5 1.0X
Off Heap Read Size Only 98 / 110 1669.1 0.6 0.8X
On Heap Read Elements 3190 / 3203 51.4 19.5 0.0X
Off Heap Read Elements 3106 / 3146 52.8 19.0 0.0X
For the worst case, we just get the array and get its size, reusing the object has a good improvement. However if we also need to access the array elements(should be the most common case), the overhead is negligible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for running a benchmark. I understand reusing the object has a good performance.
I am curious whether the current catalyst can generate such a Java code for accessing nested array elements in SQL selectExpr("a[1][1] + a[1][2] + a[1][3] + a[1][4] + a[1][5]")
.
Test build #84307 has finished for PR 19842 at commit
|
Since the benchmark shows negligible overhead for normal cases, I'm merging it to master, thanks! |
## What changes were proposed in this pull request? Similar to apache#19842 , we should also make `ColumnarRow` an immutable view, and move forward to make `ColumnVector` public. ## How was this patch tested? Existing tests. The performance concern should be same as apache#19842 . Author: Wenchen Fan <wenchen@databricks.com> Closes apache#19898 from cloud-fan/row-id.
What changes were proposed in this pull request?
To make
ColumnVector
public,ColumnarArray
need to be public too, and we should not have mutable public fields in a public class. This PR proposes to makeColumnarArray
an immutable view of the data, and always create a new instance ofColumnarArray
inColumnVector#getArray
How was this patch tested?
new benchmark in
ColumnarBatchBenchmark