Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22643][SQL] ColumnarArray should be an immutable view #19842

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Nov 29, 2017

What changes were proposed in this pull request?

To make ColumnVector public, ColumnarArray need to be public too, and we should not have mutable public fields in a public class. This PR proposes to make ColumnarArray an immutable view of the data, and always create a new instance of ColumnarArray in ColumnVector#getArray

How was this patch tested?

new benchmark in ColumnarBatchBenchmark

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84285 has finished for PR 19842 at commit aaa33dd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Nov 29, 2017

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84294 has finished for PR 19842 at commit aaa33dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

@cloud-fan TPCDS does not have nested data or arrays. So I think we have to redo the benchmarks. A simple micro benchmark that touches a few elements in the array should probably do it.

@hvanhovell
Copy link
Contributor

LGTM - pending benchmarks :)

resultArray.length = getArrayLength(rowId);
resultArray.offset = getArrayOffset(rowId);
return resultArray;
return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to create ColumnarArray for each rowID only once (e.g. by using caching)? I am curious whether we would see performance overhead for creating ColumnarArray to access elements of a multi-dimensional array (e.g. a[1][2] + a[1][3]).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is a good idea. That would require us to keep an array of ColumnarArray around. That might mess with both GC and escape analysis. Let's just create a benchmark and check if we do not regress.

@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
}

/*
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rerun all the benchmarks in this file and update the results

ByteBuffer API 1411 / 1418 232.2 4.3 0.1X
DirectByteBuffer 467 / 474 701.8 1.4 0.4X
Unsafe Buffer 178 / 185 1843.6 0.5 1.0X
Column(on heap) 178 / 184 1840.8 0.5 1.0X
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previusly onheap column vector was much faster than java array, which is unreasonable and I can't reproduce it now.

String Read/Write: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
On Heap 332 / 338 49.3 20.3 1.0X
Off Heap 466 / 467 35.2 28.4 0.7X
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Heap Read Size Only 415 / 422 394.7 2.5 1.0X
Off Heap Read Size Only 394 / 402 415.9 2.4 1.1X
On Heap Read Elements 2558 / 2593 64.0 15.6 0.2X
Off Heap Read Elements 3316 / 3317 49.4 20.2 0.1X
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result before this PR

On Heap Read Size Only                          83 /   92       1970.3           0.5       1.0X
Off Heap Read Size Only                         98 /  110       1669.1           0.6       0.8X
On Heap Read Elements                         3190 / 3203         51.4          19.5       0.0X
Off Heap Read Elements                        3106 / 3146         52.8          19.0       0.0X

For the worst case, we just get the array and get its size, reusing the object has a good improvement. However if we also need to access the array elements(should be the most common case), the overhead is negligible

Copy link
Member

@kiszk kiszk Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for running a benchmark. I understand reusing the object has a good performance.
I am curious whether the current catalyst can generate such a Java code for accessing nested array elements in SQL selectExpr("a[1][1] + a[1][2] + a[1][3] + a[1][4] + a[1][5]").

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84307 has finished for PR 19842 at commit 83d5120.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

Since the benchmark shows negligible overhead for normal cases, I'm merging it to master, thanks!

@asfgit asfgit closed this in 9c29c55 Nov 30, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 7, 2017
## What changes were proposed in this pull request?

Similar to apache#19842 , we should also make `ColumnarRow` an immutable view, and move forward to make `ColumnVector` public.

## How was this patch tested?

Existing tests.

The performance concern should be same as apache#19842 .

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19898 from cloud-fan/row-id.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants