-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10812: [Rust] Make BooleanArray not a PrimitiveArray #8842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@Dandandan , I think that this addresses your optimization at #8837, and enables the behavior more generally across the code that uses builders, if I am not mistaken. The benchmark speedups also seem similar to yours. |
|
cc @andygrove , as the performance degradation due to specialization was initially raised by you :) |
|
This is great work 👌 I think indeed this gets similar results and is a more general approach to this with wins across the board. I was actually also looking at this earlier based on a comment on a PR by you before starting the other PR, but couldn't totally figure out what had to be changed in order for this to work. Are the first benchmark results just noise? I think after this, there are likely some remaining performance differences related to overhead of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible/better to have one for slices instead of Vec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to keep it the same to how all the other ones are, however, I agree with you that we should change them. Do you think we could make it on a separate PR, where we apply it to all of them?
Note also that in general it is preferable to use From<Iterator>, as we avoid a double allocation (Vec + Buffer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do that in a separate PR!
|
Needs a |
|
If you have time would be nice to have this PR reviewed and #8863 both bringing ~10-20% perf. improvements to queries in DataFusion. Would like to do some more profiling this weekend to see find / prioritize performance improvements, one might be implementing vectorized hashing to speed up hash joins / aggregates but probably there is more low hanging fruit. |
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had time to review this in detail (I only really have time at weekends for that, and that time is lilmited at the moment) but I agree with the direction of this and I don't want to hold things up. I see that @Dandandan has already reviewed this so I am good with merging this.
|
I will review it carefully shortly
…On Thu, Dec 10, 2020 at 9:53 AM Andy Grove ***@***.***> wrote:
***@***.**** approved this pull request.
I haven't had time to review this in detail (I only really have time at
weekends for that, and that time is lilmited at the moment) but I agree
with the direction of this and I don't want to hold things up. I see that
@Dandandan <https://github.com/Dandandan> has already reviewed this so I
am good with merging this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8842 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADXZMPWZI2CJ2OIOTLPJDLSUDOFFANCNFSM4UO2EIRQ>
.
|
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this PR pretty carefully. I had a question about safety when creating Boolean arrays from iterators that I think should be addressed; I think this PR could be merged as is however, and that done as a follow on task,.
IMPORTANT: this removed support from Boolean array to UnionArray, as UnionArray currently only supports PrimitiveType.
I recommend we file this as a bug (aka "lost support for Boolean in Union array") so that it is visible (and so that anyone who relies on that feature can perhaps help implement it again).
All in all, really nice work @jorgecarleitao -- I think it is a common approach to special case Vec as you have effectively done here (it is written into the C++ standard actually) so I think the design is 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that this implementation effectively is relying on size_hint to return the precise size of the iterator otherwise it will crash (maybe) -- but I think the intention of size_hint is, as the name suggests, a hint.
I may be misreading the code, but if it is possible to crash / do some undefined behavior if the size_hint is not accurate, I suggest we file a ticket (I can do so, but I want to see if I am mis reading this code first) and try and fix that before 3.0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
This was based on the implementation for the primitive types, that also does that. I made this decision on purpose at the time because rust does not offer a From<FixedSizediter> type thing afaik.
However, in retrospect, we could remove this and instead allow the Mutable to grow (there is no reason not to since there will be a bound check anyways).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do this in another PR, as this change affects all iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #8842 +/- ##
===========================================
+ Coverage 52.98% 77.06% +24.08%
===========================================
Files 172 174 +2
Lines 30750 40353 +9603
===========================================
+ Hits 16294 31100 +14806
+ Misses 14456 9253 -5203
Continue to review full report at Codecov.
|
Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
|
Great work on this! This is a really significant performance improvement! |
This PR shows that there is still about a ~2x performance (compared to ~8x earlier) difference between using a builder vs using a mutable buffer directly after #8842 . This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature, where the implementation doesn't use the builder). Also the bounds checks are a bit expensive. In some `value` functions they are explicitly not there whereas in other (like for string) they are there. I guess there will be always _some_ overhead in the builder as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance. FYI @jorgecarleitao ``` Gnuplot not found, using plotters backend eq Float32 time: [107.02 us 107.29 us 107.60 us] change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe eq scalar Float32 time: [70.271 us 70.356 us 70.446 us] change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe neq Float32 time: [71.580 us 71.655 us 71.732 us] change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe neq scalar Float32 time: [70.011 us 70.079 us 70.155 us] change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) low severe 3 (3.00%) high mild lt Float32 time: [70.945 us 70.991 us 71.038 us] change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05) Performance has improved. lt scalar Float32 time: [50.708 us 50.789 us 50.882 us] change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe lt_eq Float32 time: [106.29 us 106.40 us 106.52 us] change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low severe 1 (1.00%) low mild 2 (2.00%) high mild 2 (2.00%) high severe lt_eq scalar Float32 time: [71.089 us 71.170 us 71.261 us] change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe gt Float32 time: [71.759 us 71.939 us 72.131 us] change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) high mild 2 (2.00%) high severe gt scalar Float32 time: [38.748 us 38.782 us 38.821 us] change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe gt_eq Float32 time: [102.79 us 102.87 us 102.96 us] change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low severe 4 (4.00%) high mild 3 (3.00%) high severe gt_eq scalar Float32 time: [55.034 us 55.109 us 55.201 us] change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 7 (7.00%) high mild ``` Closes #8900 from Dandandan/comparison_kernels Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
This PR shows that there is still about a ~2x performance (compared to ~8x earlier) difference between using a builder vs using a mutable buffer directly after apache/arrow#8842 . This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature, where the implementation doesn't use the builder). Also the bounds checks are a bit expensive. In some `value` functions they are explicitly not there whereas in other (like for string) they are there. I guess there will be always _some_ overhead in the builder as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance. FYI @jorgecarleitao ``` Gnuplot not found, using plotters backend eq Float32 time: [107.02 us 107.29 us 107.60 us] change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe eq scalar Float32 time: [70.271 us 70.356 us 70.446 us] change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe neq Float32 time: [71.580 us 71.655 us 71.732 us] change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe neq scalar Float32 time: [70.011 us 70.079 us 70.155 us] change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) low severe 3 (3.00%) high mild lt Float32 time: [70.945 us 70.991 us 71.038 us] change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05) Performance has improved. lt scalar Float32 time: [50.708 us 50.789 us 50.882 us] change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe lt_eq Float32 time: [106.29 us 106.40 us 106.52 us] change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low severe 1 (1.00%) low mild 2 (2.00%) high mild 2 (2.00%) high severe lt_eq scalar Float32 time: [71.089 us 71.170 us 71.261 us] change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe gt Float32 time: [71.759 us 71.939 us 72.131 us] change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) high mild 2 (2.00%) high severe gt scalar Float32 time: [38.748 us 38.782 us 38.821 us] change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe gt_eq Float32 time: [102.79 us 102.87 us 102.96 us] change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low severe 4 (4.00%) high mild 3 (3.00%) high severe gt_eq scalar Float32 time: [55.034 us 55.109 us 55.201 us] change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 7 (7.00%) high mild ``` Closes #8900 from Dandandan/comparison_kernels Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
This PR creates a new struct
BooleanArray, that replacesPrimitiveArray<BooleanType>, so that we do not have to consider the differences between being bit-packed and non-bit packed.This difference is causing a significant performance degradation described on ARROW-10453 and #8837 .
This usage of different logic is already observed in most of our kernels, as the code for byte-width and bit-packed is almost always different, due to how offsets are computed. With this PR, that offset computation no longer depends on bit-packed vs non-bit-packed.
IMPORTANT: this removed support from Boolean array to UnionArray, as
UnionArraycurrently only supportsPrimitiveType.Micro benchmarks (worse to best, statistically insignificant ignored):