@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Dec 5, 2020

This PR creates a new struct, BooleanArray, which replaces PrimitiveArray<BooleanType>, so that we no longer have to account for the difference between bit-packed and non-bit-packed values.

This difference is causing the significant performance degradation described in ARROW-10453 and #8837.

This split in logic already shows up in most of our kernels: the code for byte-width and bit-packed types is almost always different because of how offsets are computed. With this PR, that offset computation no longer depends on bit-packed vs non-bit-packed.
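
To illustrate the offset difference described above, here is a minimal sketch; it is not the Arrow implementation. For a byte-width type, an array offset is a plain index shift, whereas for a bit-packed boolean buffer the offset has to be split into a byte index and a bit position. The function names `get_primitive` and `get_bit` are hypothetical, and the sketch assumes LSB-first bit numbering within each byte, as in the Arrow format.

```rust
/// Value `i` of a byte-width (here `i32`) array that starts at element `offset`.
fn get_primitive(values: &[i32], offset: usize, i: usize) -> i32 {
    values[offset + i]
}

/// Value `i` of a bit-packed boolean array that starts at bit `offset`.
/// Bits are numbered LSB-first within each byte.
fn get_bit(bytes: &[u8], offset: usize, i: usize) -> bool {
    let pos = offset + i;
    (bytes[pos / 8] >> (pos % 8)) & 1 == 1
}

fn main() {
    let ints = [10, 20, 30, 40];
    assert_eq!(get_primitive(&ints, 1, 2), 40);

    // Byte 0 has bits 0 and 2 set; byte 1 has bit 1 set (overall bit 9).
    let bits = [0b0000_0101u8, 0b0000_0010u8];
    assert!(get_bit(&bits, 0, 2)); // overall bit 2
    assert!(get_bit(&bits, 7, 2)); // overall bit 9: byte 1, bit 1
    assert!(!get_bit(&bits, 7, 1)); // overall bit 8: byte 1, bit 0
}
```

The bit-packed lookup needs a division and a remainder on every access, which is why kernels written generically over both layouts tend to branch on the layout.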

IMPORTANT: this removes support for Boolean arrays in UnionArray, as UnionArray currently only supports PrimitiveType.

Microbenchmarks (worst to best; statistically insignificant results ignored):

| benchmark | variation (%) |
|---|---|
| min nulls 512 | 33.7 |
| record_batches_to_csv | 23.1 |
| array_string_from_vec 256 | 5.6 |
| array_string_from_vec 512 | 5.2 |
| take bool nulls 512 | 4.9 |
| cast int32 to int64 512 | 2.5 |
| equal_512 | 2.3 |
| filter u8 very low selectivity | 2.2 |
| array_slice 512 | 2.1 |
| take bool nulls 1024 | 2.0 |
| cast int64 to int32 512 | 1.6 |
| min 512 | 1.6 |
| take i32 512 | 1.1 |
| add 512 | 1.1 |
| array_slice 2048 | 1.0 |
| length | 1.0 |
| filter u8 low selectivity | 0.9 |
| filter u8 high selectivity | 0.9 |
| array_string_from_vec 128 | 0.9 |
| cast int32 to float64 512 | 0.9 |
| cast timestamp_ms to i64 512 | 0.8 |
| take str null indices 512 | 0.6 |
| sum 512 | 0.4 |
| filter context u8 very low selectivity | -0.7 |
| take i32 1024 | -0.9 |
| filter context f32 very low selectivity | -0.9 |
| cast float64 to float32 512 | -1.0 |
| equal_nulls_512 | -1.0 |
| cast time32s to time32ms 512 | -1.1 |
| sort 2^12 | -1.2 |
| struct_array_from_vec 128 | -1.4 |
| array_from_vec 256 | -1.4 |
| array_from_vec 128 | -1.5 |
| filter context u8 high selectivity | -1.6 |
| limit 512, 512 | -1.7 |
| equal_string_nulls_512 | -1.8 |
| take i32 nulls 1024 | -1.8 |
| struct_array_from_vec 512 | -1.9 |
| filter context f32 high selectivity | -2.0 |
| cast timestamp_ms to timestamp_ns 512 | -2.2 |
| take i32 nulls 512 | -2.3 |
| buffer_bit_ops or | -2.4 |
| array_from_vec 512 | -2.6 |
| cast float64 to uint64 512 | -2.7 |
| take str 512 | -2.8 |
| min nulls string 512 | -3.1 |
| cast int32 to int32 512 | -3.3 |
| array_slice 128 | -3.3 |
| filter context u8 w NULLs very low selectivity | -3.3 |
| buffer_bit_ops and | -3.4 |
| struct_array_from_vec 256 | -4.2 |
| cast int32 to uint32 512 | -4.5 |
| multiply 512 | -5.2 |
| equal_string_512 | -5.5 |
| take str null values null indices 1024 | -6.8 |
| sum nulls 512 | -13.3 |
| add_nulls_512 | -17.6 |
| like_utf8 scalar contains | -17.8 |
| nlike_utf8 scalar contains | -17.9 |
| nlike_utf8 scalar complex | -24.6 |
| like_utf8 scalar complex | -25.2 |
| cast time64ns to time32s 512 | -42.7 |
| cast date64 to date32 512 | -49.1 |
| cast date32 to date64 512 | -50.7 |
| nlike_utf8 scalar starts with | -51.1 |
| nlike_utf8 scalar ends with | -55.1 |
| like_utf8 scalar ends with | -55.5 |
| like_utf8 scalar starts with | -56.3 |
| nlike_utf8 scalar equals | -67.8 |
| like_utf8 scalar equals | -74.2 |
| eq Float32 | -75.7 |
| gt_eq Float32 | -76.1 |
| lt_eq Float32 | -76.5 |
| not | -77.1 |
| and | -78.6 |
| or | -78.7 |
| lt_eq scalar Float32 | -79.4 |
| eq scalar Float32 | -82.1 |
| neq Float32 | -82.1 |
| lt scalar Float32 | -82.1 |
| lt Float32 | -82.3 |
| gt Float32 | -82.4 |
| gt_eq scalar Float32 | -82.4 |
| neq scalar Float32 | -82.6 |
| gt scalar Float32 | -84.7 |

@jorgecarleitao
Member Author

@Dandandan, I think this addresses your optimization in #8837 and, if I am not mistaken, enables the behavior more generally across the code that uses builders. The benchmark speedups also seem similar to yours.

@jorgecarleitao
Member Author

cc @andygrove, as the performance degradation due to specialization was initially raised by you :)

@github-actions

github-actions bot commented Dec 5, 2020

@Dandandan
Contributor

This is great work 👌

I think this indeed gets similar results, and it is a more general approach with wins across the board. I had actually looked at this earlier, based on a comment of yours on a PR, before starting the other PR, but couldn't quite figure out what had to change for it to work. Are the first benchmark results just noise?

I think that after this there are likely some remaining performance differences related to the overhead of append and related functions, which do some additional checks/bookkeeping, but I think we can look at that in follow-up work. Maybe we can think of APIs similar to those of Vec.

Contributor

@Dandandan Dandandan Dec 5, 2020

Would it be possible/better to have one for slices instead of Vec?

Member Author

I tried to keep it consistent with how all the other ones are. However, I agree with you that we should change them. Do you think we could do it in a separate PR, where we apply it to all of them?

Note also that in general it is preferable to use From<Iterator>, as we avoid a double allocation (Vec + Buffer)
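
As a rough illustration of the double-allocation point, the sketch below uses a hypothetical `BitBuffer` type (not Arrow's `Buffer` or `BooleanArray`) that packs booleans directly from an iterator, so no intermediate `Vec<bool>` is materialized. The `From<Vec<bool>>` implementation is included only for contrast.

```rust
use std::iter::FromIterator;

/// Hypothetical bit-packed buffer; stands in for an Arrow boolean buffer.
#[derive(Debug, PartialEq)]
struct BitBuffer {
    bytes: Vec<u8>,
    len: usize, // number of bits stored
}

impl FromIterator<bool> for BitBuffer {
    fn from_iter<I: IntoIterator<Item = bool>>(iter: I) -> Self {
        let iter = iter.into_iter();
        // The lower bound is only used to pre-size the byte vector.
        let (lower, _) = iter.size_hint();
        let mut bytes = Vec::with_capacity((lower + 7) / 8);
        let mut len = 0usize;
        for value in iter {
            if len % 8 == 0 {
                bytes.push(0u8); // start a new byte
            }
            if value {
                bytes[len / 8] |= 1u8 << (len % 8);
            }
            len += 1;
        }
        BitBuffer { bytes, len }
    }
}

impl From<Vec<bool>> for BitBuffer {
    /// The caller has already paid for the `Vec<bool>` allocation.
    fn from(values: Vec<bool>) -> Self {
        values.into_iter().collect()
    }
}

fn main() {
    // Values are packed as they are produced; no Vec<bool> is built.
    let direct: BitBuffer = (0..10).map(|i| i % 3 == 0).collect();
    // The Vec<bool> is allocated first, then the packed bytes.
    let via_vec = BitBuffer::from((0..10).map(|i| i % 3 == 0).collect::<Vec<bool>>());
    assert_eq!(direct, via_vec);
    assert_eq!(direct.len, 10);
    assert_eq!(direct.bytes, vec![0b0100_1001u8, 0b0000_0010u8]);
}
```

Both paths produce the same bytes; the iterator path simply skips the temporary vector.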

Contributor

I think we could do that in a separate PR!

@github-actions github-actions bot added needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 6, 2020
@Dandandan
Contributor

Needs a fmt :)

@Dandandan
Contributor

Dandandan commented Dec 10, 2020

@alamb @andygrove @nevi-me

If you have time, it would be nice to have this PR and #8863 reviewed; both bring ~10-20% perf. improvements to queries in DataFusion.

I would like to do some more profiling this weekend to find and prioritize performance improvements. One might be implementing vectorized hashing to speed up hash joins / aggregates, but there is probably more low-hanging fruit.

Member

@andygrove andygrove left a comment

I haven't had time to review this in detail (I only really have time at weekends for that, and that time is limited at the moment), but I agree with the direction of this and I don't want to hold things up. I see that @Dandandan has already reviewed this, so I am good with merging it.

@alamb
Contributor

alamb commented Dec 10, 2020 via email

Contributor

@alamb alamb left a comment

I went through this PR pretty carefully. I had a question about safety when creating Boolean arrays from iterators that I think should be addressed; however, I think this PR could be merged as is, with that done as a follow-on task.

IMPORTANT: this removes support for Boolean arrays in UnionArray, as UnionArray currently only supports PrimitiveType.

I recommend we file this as a bug (aka "lost support for Boolean in Union array") so that it is visible (and so that anyone who relies on that feature can perhaps help implement it again).

All in all, really nice work @jorgecarleitao -- I think it is common to special-case a vector of booleans, as you have effectively done here (the C++ standard actually mandates it for std::vector<bool>), so I think the design is 👍

Contributor

It seems to me that this implementation effectively relies on size_hint returning the precise size of the iterator; otherwise it will (maybe) crash -- but I think the intention of size_hint is, as the name suggests, a hint.

I may be misreading the code, but if it is possible to crash or trigger undefined behavior when the size_hint is not accurate, I suggest we file a ticket (I can do so, but I want to check whether I am misreading this code first) and try to fix that before 3.0.0.

Member Author

@jorgecarleitao jorgecarleitao Dec 10, 2020

I agree.

This was based on the implementation for the primitive types, which also does that. I made this decision on purpose at the time because, afaik, Rust does not offer something like From<FixedSizediter>.

However, in retrospect, we could remove this and instead allow the Mutable to grow (there is no reason not to, since there will be a bounds check anyway).
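
The concern and the proposed fix can be sketched as follows, with hypothetical names throughout (`Understated`, `collect_bits`); this is not the code under review. `Iterator::size_hint` is allowed to be imprecise, so the sketch uses the lower bound only to reserve capacity and lets the buffer grow as values arrive.

```rust
/// Iterator wrapper whose `size_hint` deliberately carries no information.
/// This is legal: `size_hint` is advisory, not a guarantee.
struct Understated<I>(I);

impl<I: Iterator> Iterator for Understated<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.0.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None) // the least informative hint
    }
}

/// Packs booleans LSB-first, growing the buffer as needed instead of
/// trusting the hint to be exact. Returns the bytes and the bit count.
fn collect_bits<I: Iterator<Item = bool>>(iter: I) -> (Vec<u8>, usize) {
    let (lower, _) = iter.size_hint();
    let mut bytes = Vec::with_capacity((lower + 7) / 8);
    let mut len = 0usize;
    for value in iter {
        if len / 8 == bytes.len() {
            bytes.push(0u8); // grow: safe even if the hint was too small
        }
        if value {
            bytes[len / 8] |= 1u8 << (len % 8);
        }
        len += 1;
    }
    (bytes, len)
}

fn main() {
    let exact = collect_bits(vec![true, false, true].into_iter());
    let lying = collect_bits(Understated(vec![true, false, true].into_iter()));
    assert_eq!(exact, lying);
    assert_eq!(exact, (vec![0b0000_0101u8], 3));
}
```

With this shape, an inaccurate hint costs at most a reallocation; it cannot cause an out-of-bounds write.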

Member Author

Let's do this in another PR, as this change affects all iterators.

@codecov-io

codecov-io commented Dec 11, 2020

Codecov Report

Merging #8842 (f07c600) into master (3deae8d) will increase coverage by 24.08%.
The diff coverage is 84.68%.


@@             Coverage Diff             @@
##           master    #8842       +/-   ##
===========================================
+ Coverage   52.98%   77.06%   +24.08%     
===========================================
  Files         172      174        +2     
  Lines       30750    40353     +9603     
===========================================
+ Hits        16294    31100    +14806     
+ Misses      14456     9253     -5203     
| Impacted Files | Coverage Δ |
|---|---|
| rust/arrow/src/compute/kernels/cast.rs | `96.34% <0.00%> (+0.12%)` ⬆️ |
| rust/arrow/src/compute/kernels/filter.rs | `58.72% <0.00%> (ø)` |
| rust/arrow/src/util/string_writer.rs | `73.33% <ø> (ø)` |
| rust/parquet/src/arrow/record_reader.rs | `94.53% <ø> (+94.53%)` ⬆️ |
| rust/arrow/src/array/ord.rs | `58.57% <20.00%> (-1.73%)` ⬇️ |
| rust/arrow/src/array/iterator.rs | `93.45% <70.83%> (-6.55%)` ⬇️ |
| rust/arrow/src/array/builder.rs | `81.73% <79.86%> (+<0.01%)` ⬆️ |
| rust/arrow/src/csv/reader.rs | `93.10% <85.29%> (-0.90%)` ⬇️ |
| rust/arrow/src/array/array_boolean.rs | `86.71% <86.71%> (ø)` |
| rust/arrow/src/array/array_primitive.rs | `91.78% <100.00%> (-1.32%)` ⬇️ |
| ... and 76 more | |


@jorgecarleitao jorgecarleitao deleted the boolean branch December 11, 2020 04:40
@rdettai
Contributor

rdettai commented Dec 12, 2020

Great work on this! This is a really significant performance improvement!

nevi-me pushed a commit that referenced this pull request Dec 15, 2020
This PR shows that, after #8842, there is still roughly a 2x performance difference (down from ~8x earlier) between using a builder and using a mutable buffer directly.
This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature, where the implementation doesn't use the builder). The bounds checks are also a bit expensive. In some `value` functions they are explicitly omitted, whereas in others (like for string) they are present.

I guess there will always be _some_ overhead in the builder, as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels without losing too much performance.
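
To make the builder-versus-buffer comparison concrete, here is a simplified sketch using hypothetical stand-ins (`BoolBuilder`, `eq_with_builder`, `eq_direct`) rather than Arrow's actual builder or `MutableBuffer`. The builder path performs a growth check and a length update on every append; the direct path sizes the output once and writes bits in place. It makes no claim about the magnitudes measured in the benchmarks.

```rust
/// Hypothetical builder: every `append` checks capacity and updates `len`.
struct BoolBuilder {
    bytes: Vec<u8>,
    len: usize,
}

impl BoolBuilder {
    fn new() -> Self {
        BoolBuilder { bytes: Vec::new(), len: 0 }
    }

    fn append(&mut self, value: bool) {
        if self.len / 8 == self.bytes.len() {
            self.bytes.push(0); // per-element growth check
        }
        if value {
            self.bytes[self.len / 8] |= 1u8 << (self.len % 8);
        }
        self.len += 1; // per-element bookkeeping
    }

    fn finish(self) -> Vec<u8> {
        self.bytes
    }
}

/// Element-wise equality via the builder, one append per value.
fn eq_with_builder(left: &[f32], right: &[f32]) -> Vec<u8> {
    let mut builder = BoolBuilder::new();
    for (l, r) in left.iter().zip(right) {
        builder.append(l == r);
    }
    builder.finish()
}

/// Direct path: allocate the output once, then fill one byte per 8 values.
fn eq_direct(left: &[f32], right: &[f32]) -> Vec<u8> {
    let len = left.len().min(right.len());
    let mut out = vec![0u8; (len + 7) / 8];
    let chunks = left[..len].chunks(8).zip(right[..len].chunks(8));
    for (byte, (ls, rs)) in out.iter_mut().zip(chunks) {
        for (bit, (l, r)) in ls.iter().zip(rs).enumerate() {
            *byte |= ((l == r) as u8) << bit;
        }
    }
    out
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    let b = [1.0f32, 0.0, 3.0, 0.0, 5.0, 0.0, 7.0, 0.0, 9.0];
    assert_eq!(eq_with_builder(&a, &b), eq_direct(&a, &b));
    assert_eq!(eq_direct(&a, &b), vec![0b0101_0101u8, 0b0000_0001u8]);
}
```

Both functions produce identical bitmaps; the difference is only in how much per-element work surrounds each comparison.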

FYI @jorgecarleitao

```
Gnuplot not found, using plotters backend
eq Float32              time:   [107.02 us 107.29 us 107.60 us]
                        change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [70.271 us 70.356 us 70.446 us]
                        change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

neq Float32             time:   [71.580 us 71.655 us 71.732 us]
                        change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [70.011 us 70.079 us 70.155 us]
                        change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild

lt Float32              time:   [70.945 us 70.991 us 71.038 us]
                        change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [50.708 us 50.789 us 50.882 us]
                        change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

lt_eq Float32           time:   [106.29 us 106.40 us 106.52 us]
                        change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [71.089 us 71.170 us 71.261 us]
                        change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [71.759 us 71.939 us 72.131 us]
                        change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [38.748 us 38.782 us 38.821 us]
                        change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [102.79 us 102.87 us 102.96 us]
                        change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

gt_eq scalar Float32    time:   [55.034 us 55.109 us 55.201 us]
                        change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
```

Closes #8900 from Dandandan/comparison_kernels

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021