ARROW-10812: [Rust] Make BooleanArray not a PrimitiveArray #8842

jorgecarleitao · 2020-12-05T21:21:57Z

This PR adds a new struct, BooleanArray, that replaces PrimitiveArray<BooleanType>, so that PrimitiveArray no longer has to handle the difference between bit-packed and non-bit-packed values.

This difference is causing the significant performance degradation described in ARROW-10453 and #8837.

Most of our kernels already use separate logic for the two cases: the code for byte-width and bit-packed values is almost always different, because offsets are computed differently. With this PR, that offset computation no longer depends on bit-packed vs non-bit-packed.
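To illustrate why the two cases need different offset logic, here is a minimal sketch. It is not the code from this PR; the function names `get_byte_width` and `get_bit_packed` are hypothetical, and the only assumption taken from the Arrow format is that bit-packed values are stored least-significant bit first.

```rust
/// Reads the value at logical index `i` from a byte-width buffer
/// (one byte per value). `offset` is the array offset, in elements.
/// Hypothetical helper, for illustration only.
fn get_byte_width(values: &[u8], offset: usize, i: usize) -> u8 {
    // Element offset and byte offset coincide: a single addition.
    values[offset + i]
}

/// Reads the value at logical index `i` from a bit-packed buffer
/// (eight values per byte, least-significant bit first).
/// Hypothetical helper, for illustration only.
fn get_bit_packed(values: &[u8], offset: usize, i: usize) -> bool {
    // The element offset is a bit offset: split it into a byte index
    // and a position inside that byte.
    let bit = offset + i;
    (values[bit / 8] >> (bit % 8)) & 1 == 1
}
```

With a byte-width buffer, an array offset can be applied by simply advancing a pointer; with a bit-packed buffer it cannot, since the offset may fall in the middle of a byte. Code that is generic over both has to branch on this, which is the branching a dedicated BooleanArray avoids.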

IMPORTANT: this removes support for Boolean arrays in UnionArray, as UnionArray currently only supports PrimitiveType.

Micro benchmarks (worst to best; statistically insignificant results ignored):

benchmark	variation (%)
min nulls 512	33.7
record_batches_to_csv	23.1
array_string_from_vec 256	5.6
array_string_from_vec 512	5.2
take bool nulls 512	4.9
cast int32 to int64 512	2.5
equal_512	2.3
filter u8 very low selectivity	2.2
array_slice 512	2.1
take bool nulls 1024	2.0
cast int64 to int32 512	1.6
min 512	1.6
take i32 512	1.1
add 512	1.1
array_slice 2048	1.0
length	1.0
filter u8 low selectivity	0.9
filter u8 high selectivity	0.9
array_string_from_vec 128	0.9
cast int32 to float64 512	0.9
cast timestamp_ms to i64 512	0.8
take str null indices 512	0.6
sum 512	0.4
filter context u8 very low selectivity	-0.7
take i32 1024	-0.9
filter context f32 very low selectivity	-0.9
cast float64 to float32 512	-1.0
equal_nulls_512	-1.0
cast time32s to time32ms 512	-1.1
sort 2^12	-1.2
struct_array_from_vec 128	-1.4
array_from_vec 256	-1.4
array_from_vec 128	-1.5
filter context u8 high selectivity	-1.6
limit 512, 512	-1.7
equal_string_nulls_512	-1.8
take i32 nulls 1024	-1.8
struct_array_from_vec 512	-1.9
filter context f32 high selectivity	-2.0
cast timestamp_ms to timestamp_ns 512	-2.2
take i32 nulls 512	-2.3
buffer_bit_ops or	-2.4
array_from_vec 512	-2.6
cast float64 to uint64 512	-2.7
take str 512	-2.8
min nulls string 512	-3.1
cast int32 to int32 512	-3.3
array_slice 128	-3.3
filter context u8 w NULLs very low selectivity	-3.3
buffer_bit_ops and	-3.4
struct_array_from_vec 256	-4.2
cast int32 to uint32 512	-4.5
multiply 512	-5.2
equal_string_512	-5.5
take str null values null indices 1024	-6.8
sum nulls 512	-13.3
add_nulls_512	-17.6
like_utf8 scalar contains	-17.8
nlike_utf8 scalar contains	-17.9
nlike_utf8 scalar complex	-24.6
like_utf8 scalar complex	-25.2
cast time64ns to time32s 512	-42.7
cast date64 to date32 512	-49.1
cast date32 to date64 512	-50.7
nlike_utf8 scalar starts with	-51.1
nlike_utf8 scalar ends with	-55.1
like_utf8 scalar ends with	-55.5
like_utf8 scalar starts with	-56.3
nlike_utf8 scalar equals	-67.8
like_utf8 scalar equals	-74.2
eq Float32	-75.7
gt_eq Float32	-76.1
lt_eq Float32	-76.5
not	-77.1
and	-78.6
or	-78.7
lt_eq scalar Float32	-79.4
eq scalar Float32	-82.1
neq Float32	-82.1
lt scalar Float32	-82.1
lt Float32	-82.3
gt Float32	-82.4
gt_eq scalar Float32	-82.4
neq scalar Float32	-82.6
gt scalar Float32	-84.7

jorgecarleitao · 2020-12-05T21:23:16Z

@Dandandan , I think that this addresses your optimization at #8837, and enables the behavior more generally across the code that uses builders, if I am not mistaken. The benchmark speedups also seem similar to yours.

jorgecarleitao · 2020-12-05T21:24:23Z

cc @andygrove , as the performance degradation due to specialization was initially raised by you :)

github-actions · 2020-12-05T21:24:29Z

https://issues.apache.org/jira/browse/ARROW-10812

Dandandan · 2020-12-05T21:42:34Z

This is great work 👌

I think indeed this gets similar results and is a more general approach to this with wins across the board. I was actually also looking at this earlier based on a comment on a PR by you before starting the other PR, but couldn't totally figure out what had to be changed in order for this to work. Are the first benchmark results just noise?

I think after this, there are likely some remaining performance differences related to the overhead of append and related functions, which do some additional checks/bookkeeping, but I think we can look at that in follow-up work. Maybe we can think of APIs similar to those used in Vec.

Dandandan · 2020-12-05T21:47:49Z

rust/arrow/src/array/array_boolean.rs

Would it be possible/better to have one for slices instead of Vec?

I tried to keep it the same to how all the other ones are, however, I agree with you that we should change them. Do you think we could make it on a separate PR, where we apply it to all of them?

Note also that in general it is preferable to use From<Iterator>, as we avoid a double allocation (Vec + Buffer)

I think we could do that in a separate PR!
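As an illustration of the double-allocation point above, the sketch below packs booleans directly from an iterator into a bit-packed byte vector, without first materializing a `Vec<bool>`. The function `pack_bits` is hypothetical and is not the arrow crate's API; a plain `Vec<u8>` stands in for the crate's buffer type.

```rust
/// Packs booleans from any iterator into a bit-packed byte vector
/// (least-significant bit first) and returns it with the element count.
/// Hypothetical helper; `Vec<u8>` stands in for an Arrow buffer.
fn pack_bits<I: IntoIterator<Item = bool>>(iter: I) -> (Vec<u8>, usize) {
    let iter = iter.into_iter();
    // The lower bound is used only to pre-size the single allocation.
    let (lower, _) = iter.size_hint();
    let mut bytes: Vec<u8> = Vec::with_capacity((lower + 7) / 8);
    let mut len = 0usize;
    for value in iter {
        // Start a new byte every eight values.
        if len % 8 == 0 {
            bytes.push(0u8);
        }
        if value {
            let last = bytes.len() - 1;
            bytes[last] |= 1u8 << (len % 8);
        }
        len += 1;
    }
    (bytes, len)
}
```

A caller can then write `pack_bits(data.iter().map(|v| v % 2 == 0))` and only the packed bytes are ever allocated, whereas going through `Vec<bool>` first allocates one byte per value and then copies into the packed form.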

rust/arrow/src/array/equal_json.rs

rust/arrow/src/datatypes.rs

rust/arrow/src/array/builder.rs

rust/arrow/src/compute/kernels/sort.rs

Dandandan · 2020-12-07T18:57:01Z

Needs a fmt :)

Dandandan · 2020-12-10T11:01:23Z

@alamb @andygrove @nevi-me

If you have time, it would be nice to have this PR and #8863 reviewed; both bring ~10-20% performance improvements to queries in DataFusion.

I would like to do some more profiling this weekend to find and prioritize performance improvements. One might be implementing vectorized hashing to speed up hash joins / aggregates, but there is probably more low-hanging fruit.

andygrove

I haven't had time to review this in detail (I only really have time at weekends for that, and that time is limited at the moment), but I agree with the direction of this and I don't want to hold things up. I see that @Dandandan has already reviewed this, so I am good with merging this.

alamb · 2020-12-10T15:52:47Z

I will review it carefully shortly


alamb

I went through this PR pretty carefully. I had a question about safety when creating Boolean arrays from iterators that I think should be addressed; however, I think this PR could be merged as is, with that done as a follow-on task.

IMPORTANT: this removes support for Boolean arrays in UnionArray, as UnionArray currently only supports PrimitiveType.

I recommend we file this as a bug (aka "lost support for Boolean in Union array") so that it is visible (and so that anyone who relies on that feature can perhaps help implement it again).

All in all, really nice work @jorgecarleitao -- I think it is a common approach to special-case Vec<bool> as you have effectively done here (it is written into the C++ standard actually), so I think the design is 👍

alamb · 2020-12-10T16:10:33Z

rust/arrow/src/array/array_boolean.rs

It seems to me that this implementation effectively relies on size_hint returning the precise size of the iterator, otherwise it will (maybe) crash -- but I think the intention of size_hint is, as the name suggests, a hint.

I may be misreading the code, but if it is possible to crash / trigger undefined behavior when the size_hint is not accurate, I suggest we file a ticket (I can do so, but I want to check whether I am misreading this code first) and try to fix that before 3.0.0

I agree.

This was based on the implementation for the primitive types, which also does that. I made this decision on purpose at the time because Rust does not offer a From<FixedSizediter> type thing afaik.

However, in retrospect, we could remove this and instead allow the Mutable to grow (there is no reason not to, since there will be a bound check anyway).

Let's do this in another PR, as this change affects all iterators.

Filed as https://issues.apache.org/jira/browse/ARROW-10886
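The concern in this thread can be sketched as follows. This is not the code from the PR; `VagueHint` and `collect_bits` are hypothetical names. The iterator reports a size_hint that is legal but imprecise, and the collector treats the hint only as an initial size, growing on demand as suggested above.

```rust
/// An iterator whose size_hint is legal but imprecise.
/// Hypothetical type, for illustration only.
struct VagueHint {
    remaining: usize,
}

impl Iterator for VagueHint {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(self.remaining % 2 == 0)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Always a valid answer, but says nothing about the real length.
        (0, None)
    }
}

/// Collects booleans into bit-packed bytes, using size_hint only to
/// pre-size the buffer. Hypothetical helper, for illustration only.
fn collect_bits<I: Iterator<Item = bool>>(iter: I) -> (Vec<u8>, usize) {
    let (lower, _) = iter.size_hint();
    let mut bytes = vec![0u8; (lower + 7) / 8];
    let mut len = 0usize;
    for value in iter {
        let byte = len / 8;
        if byte >= bytes.len() {
            // The hint was too small: grow instead of writing out of bounds.
            bytes.resize(byte + 1, 0);
        }
        if value {
            bytes[byte] |= 1u8 << (len % 8);
        }
        len += 1;
    }
    // The hint may also have been too large: drop unused trailing bytes.
    bytes.truncate((len + 7) / 8);
    (bytes, len)
}
```

The bound check on every write is the cost of not trusting the hint. Code that writes through raw pointers into a buffer sized from size_hint would avoid that check, but would then depend on the hint being exact, which the Iterator contract does not guarantee.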

rust/arrow/src/array/array_boolean.rs

codecov-io · 2020-12-11T04:03:26Z

Codecov Report

Merging #8842 (f07c600) into master (3deae8d) will increase coverage by 24.08%.
The diff coverage is 84.68%.

@@             Coverage Diff             @@
##           master    #8842       +/-   ##
===========================================
+ Coverage   52.98%   77.06%   +24.08%     
===========================================
  Files         172      174        +2     
  Lines       30750    40353     +9603     
===========================================
+ Hits        16294    31100    +14806     
+ Misses      14456     9253     -5203

Impacted Files	Coverage Δ
rust/arrow/src/compute/kernels/cast.rs	`96.34% <0.00%> (+0.12%)`	⬆️
rust/arrow/src/compute/kernels/filter.rs	`58.72% <0.00%> (ø)`
rust/arrow/src/util/string_writer.rs	`73.33% <ø> (ø)`
rust/parquet/src/arrow/record_reader.rs	`94.53% <ø> (+94.53%)`	⬆️
rust/arrow/src/array/ord.rs	`58.57% <20.00%> (-1.73%)`	⬇️
rust/arrow/src/array/iterator.rs	`93.45% <70.83%> (-6.55%)`	⬇️
rust/arrow/src/array/builder.rs	`81.73% <79.86%> (+<0.01%)`	⬆️
rust/arrow/src/csv/reader.rs	`93.10% <85.29%> (-0.90%)`	⬇️
rust/arrow/src/array/array_boolean.rs	`86.71% <86.71%> (ø)`
rust/arrow/src/array/array_primitive.rs	`91.78% <100.00%> (-1.32%)`	⬇️
... and 76 more


Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

rdettai · 2020-12-12T16:58:37Z

Great work on this! This is a really significant performance improvement!

@jorgecarleitao

This PR shows that there is still a ~2x performance difference (compared to ~8x earlier) between using a builder and using a mutable buffer directly after #8842. This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature, where the implementation doesn't use the builder). Also, the bounds checks are a bit expensive. In some `value` functions they are explicitly not there, whereas in others (like for string) they are there. I guess there will always be _some_ overhead in the builder, as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance. FYI @jorgecarleitao

```
Gnuplot not found, using plotters backend
eq Float32              time:   [107.02 us 107.29 us 107.60 us]
                        change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [70.271 us 70.356 us 70.446 us]
                        change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

neq Float32             time:   [71.580 us 71.655 us 71.732 us]
                        change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [70.011 us 70.079 us 70.155 us]
                        change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild

lt Float32              time:   [70.945 us 70.991 us 71.038 us]
                        change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [50.708 us 50.789 us 50.882 us]
                        change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

lt_eq Float32           time:   [106.29 us 106.40 us 106.52 us]
                        change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [71.089 us 71.170 us 71.261 us]
                        change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [71.759 us 71.939 us 72.131 us]
                        change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [38.748 us 38.782 us 38.821 us]
                        change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [102.79 us 102.87 us 102.96 us]
                        change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

gt_eq scalar Float32    time:   [55.034 us 55.109 us 55.201 us]
                        change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
```

Closes #8900 from Dandandan/comparison_kernels

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>


github-actions bot added Component: Rust Component: Parquet labels Dec 5, 2020

Dandandan reviewed Dec 5, 2020

rust/arrow/src/array/equal_json.rs

Dandandan reviewed Dec 5, 2020

rust/arrow/src/datatypes.rs

Dandandan mentioned this pull request Dec 6, 2020

ARROW-10810: [Rust] Speed up comparison kernels #8837

Closed

github-actions bot added and removed the needs-rebase label (a PR that needs to be rebased by the author) Dec 6, 2020

Dandandan approved these changes Dec 7, 2020

jhorstmann reviewed Dec 7, 2020

rust/arrow/src/array/builder.rs

jhorstmann reviewed Dec 7, 2020

rust/arrow/src/compute/kernels/sort.rs

Dandandan approved these changes Dec 9, 2020

andygrove approved these changes Dec 10, 2020

alamb approved these changes Dec 10, 2020

jorgecarleitao and others added 6 commits December 11, 2020 05:05

WIP

ebae5af

Simplified code.

292f97f

Update rust/arrow/src/array/builder.rs

f6c01b9

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

Added test and removed duplicated code.

902179d

Simplified function signature.

4e0c7ae

Minor simplification.

f07c600

jorgecarleitao closed this in 2816f37 Dec 11, 2020

jorgecarleitao deleted the boolean branch December 11, 2020 04:40

Dandandan mentioned this pull request Dec 12, 2020

ARROW-10810: [Rust] Improve comparison kernels performance #8900

Closed

alamb mentioned this pull request Apr 26, 2021

Potential crash when creating arrays from Iterators if size_hint is not correct apache/arrow-rs#138

Closed

This was referenced Dec 12, 2020

[Rust] [DataFusion] Performance degredation after removing specialization #26430

Closed

[Rust] Make BooleanArray not a PrimitiveArray #26750

Closed
