
ARROW-10216: [Rust] Simd implementation for primitive min/max kernels #8685

Conversation

@jhorstmann (Contributor) commented Nov 16, 2020

This refactors the simd aggregation into a reusable trait and adds implementations for min and max. NaN values are now handled the same way as in the sort kernel, meaning that NaN is considered greater than any other non-null value.

Some tests were failing with the simd feature active because the different order of additions rounded to a slightly different result. I reused the comparison function that @vertexclique implemented in his recent PR.

Update: Benchmarks (aggregate_kernels) show an 8x improvement with the simd feature:

  • sum performance stays the same at 37ns without nulls and 95ns with nulls
  • min without nulls improves from 595ns to 75ns
  • min with nulls improves from 1725ns to 197ns

Without the simd feature, performance decreased due to the added NaN handling:

  • min without nulls from 595ns to 1057ns

One strange result is that min with nulls improved from 1725ns to 1371ns despite now including NaN handling.
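
To make the NaN semantics concrete, here is a minimal scalar sketch (my own illustration, ignoring nulls; not the PR's actual simd kernel) of a min that treats NaN as greater than every other value, matching the sort kernel:

fn min_nan_aware(values: &[f64]) -> Option<f64> {
    values.iter().copied().fold(None, |acc, v| match acc {
        None => Some(v),
        Some(current) => {
            if v.is_nan() {
                // a NaN candidate never beats the current minimum
                Some(current)
            } else if current.is_nan() || v < current {
                // replace a NaN accumulator or a larger value
                Some(v)
            } else {
                Some(current)
            }
        }
    })
}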

@vertexclique (Contributor):

This looks nice, I will check this out tomorrow!

@vertexclique (Contributor) left a comment:

If the numeric trait bounds are written with the correct layering, it will definitely improve the sum kernel and shave some time off it, in addition to min and max. On my machine I see a 5 percent regression in sum aggregation. But these are not a must from my point of view.

rust/arrow/src/compute/kernels/aggregate.rs (resolved)
Comment on lines 793 to 801
#[inline]
fn identity_for_min_op() -> Self::Native {
    $max_value
}

#[inline]
fn identity_for_max_op() -> Self::Native {
    $min_value
}
Contributor:

These can be removed.

Contributor Author:

I'm using those in https://github.com/apache/arrow/pull/8685/files#diff-45705886920f322ea90b662e46ed728e11c40402cedbe42d5718fe73d63c69abR273

Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?

Contributor:

> Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?

yes: https://docs.rs/num/0.3.1/num/trait.Bounded.html
That will be useful in many places where we are manually figuring out bounds.
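
A sketch of how that trait could replace the hand-written constants (assumed usage of the num crate, not code from this PR): Bounded::max_value() is the identity for a min fold and Bounded::min_value() for a max fold.

use num::Bounded;

// Identity for a min fold: start from the largest representable value.
fn identity_for_min<T: Bounded>() -> T {
    T::max_value()
}

// Identity for a max fold: start from the smallest representable value.
fn identity_for_max<T: Bounded>() -> T {
    T::min_value()
}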

Contributor Author:

"smallest finite number this type can represent" -> I have to think about or write some tests to see whether that works for floats.

Contributor Author:

Refactored so the implementation no longer needs those bounds.

rust/arrow/src/datatypes.rs (outdated, resolved)
@@ -618,7 +628,7 @@ macro_rules! make_numeric_type {
     fn mask_from_u64(mask: u64) -> Self::SimdMask {
         match Self::lanes() {
             8 => {
-                let vecidx = i64x8::new(128, 64, 32, 16, 8, 4, 2, 1);
+                let vecidx = i64x8::new(1, 2, 4, 8, 16, 32, 64, 128);
Contributor Author:

Tests for min/max exposed that these constants were actually in the wrong order. This didn't turn up in the sum tests because all invalid elements were already set to 0 there. I think when I initially implemented this, I copied the initialization from code that used Intel AVX intrinsics, which take the parameters in the opposite order 🤦
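
For reference, a scalar sketch of how those constants act (my reading of the code, not taken from the PR): each lane is assigned the power of two for its bit position, so masking the u64 against a lane's weight recovers one boolean per lane.

fn mask_bits_to_lanes(mask: u64) -> [bool; 8] {
    let vecidx: [u64; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    let mut lanes = [false; 8];
    for (i, weight) in vecidx.iter().enumerate() {
        // lane i is "valid" when bit i of the mask is set
        lanes[i] = (mask & weight) != 0;
    }
    lanes
}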

Contributor Author:

And the tests for mask generation were apparently also incorrect.

Member:

Maybe a line comment before this assignment could help others understand these numbers?

@jhorstmann (Contributor Author):

I still need to add a few more tests involving infinities and NaNs

@jhorstmann (Contributor Author):

Both the scalar and simd implementations have issues involving null values; I added ignored tests with comments. The handling of NaN is also not consistent with the sort kernel, which follows the Postgres model of sorting NaN as larger than any other number (including infinity). Solving those issues could be quite tricky and might degrade the performance again.

@jorgecarleitao (Member):

Hi @jhorstmann, just to double check: are you blocked and need help, or can we review it?

@jhorstmann (Contributor Author):

Hi @jorgecarleitao, I did not have much time to work on this over the last two weeks, but I have now found a solution for handling NaN values consistently. The benchmarks still need to be updated, but I think it is ready for review.

@jorgecarleitao (Member) left a comment:

I went through this PR and it looks good to me. However, I am way out of my league in some of the code here, and thus I think we should have another person look at this.

@alamb (Contributor) commented Dec 7, 2020

I am checking it out

@alamb (Contributor) left a comment:

Nice work @jhorstmann, @vertexclique and @jorgecarleitao -- in my opinion, this PR is looking quite good and is ready to merge. I verified the performance speedups (they are impressive) as well as reviewed the tests and code. While I won't say I understood all of the implementation in detail, the structure was clear and the tests look thorough.

I vote :shipit:

Here are the measurements I took -- I saw the slowdown for min without the simd feature -- but I think correctness trumps performance any day (if you don't constrain us to be correct, we can make it go as fast as you want)

On this branch @ 9beb585

cargo bench -- "min 512"
min 512                 time:   [1.5557 us 1.5695 us 1.5856 us]
min nulls 512           time:   [1.1191 us 1.1322 us 1.1479 us]
max 512                 time:   [1.5022 us 1.5107 us 1.5202 us]
max nulls 512           time:   [2.0813 us 2.1004 us 2.1196 us]
sum 512                 time:   [443.06 ns 447.89 ns 454.22 ns]
sum nulls 512           time:   [272.30 ns 275.02 ns 278.16 ns]


cargo bench --features simd -- "min 512"
min 512                 time:   [121.73 ns 122.88 ns 124.18 ns]
min nulls 512           time:   [231.45 ns 233.08 ns 234.85 ns]
max 512                 time:   [114.63 ns 115.75 ns 117.24 ns]
max nulls 512           time:   [233.71 ns 236.93 ns 240.56 ns]
sum 512                 time:   [47.217 ns 48.041 ns 48.873 ns]
sum nulls 512           time:   [88.283 ns 89.846 ns 92.079 ns]

On master @ b290e23

cargo bench -- "min 512" # and similar for max and sum
min 512                 time:   [467.62 ns 479.97 ns 494.11 ns]
min nulls 512           time:   [1.3551 us 1.3738 us 1.3952 us]
sum 512                 time:   [446.15 ns 451.55 ns 457.65 ns]
sum nulls 512           time:   [90.057 ns 91.178 ns 92.428 ns]

cargo bench --features simd -- "min 512"
min 512                 time:   [464.29 ns 470.35 ns 476.89 ns]
min nulls 512           time:   [1.4172 us 1.4427 us 1.4697 us]
sum 512                 time:   [44.391 ns 44.823 ns 45.358 ns]
sum nulls 512           time:   [232.58 ns 233.82 ns 235.28 ns]

@Dandandan (Contributor):

@jhorstmann
I wonder if we should support a standardized ordering like IEEE 754 instead?

There is an (unstable) implementation in Rust std for it:
https://doc.rust-lang.org/beta/std/primitive.f64.html#method.total_cmp

Sounds like something we could have in the source for now (it is only a few lines). I think it might also simplify some code like the current sort kernels? What do you think?
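
For illustration, those few lines would look roughly like this sketch of the IEEE 754 totalOrder comparison (modeled on the unstable std method, not code from this PR). With this ordering, negative NaN sorts below negative infinity and positive NaN above positive infinity.

use std::cmp::Ordering;

fn total_cmp_f64(a: f64, b: f64) -> Ordering {
    let mut left = a.to_bits() as i64;
    let mut right = b.to_bits() as i64;
    // For negative numbers, flip the magnitude bits so that a plain integer
    // comparison yields the IEEE 754 total order.
    left ^= (((left >> 63) as u64) >> 1) as i64;
    right ^= (((right >> 63) as u64) >> 1) as i64;
    left.cmp(&right)
}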

@jhorstmann (Contributor Author):

@Dandandan that sounds like a good idea as it is a bit more standardized. The current ordering was initially chosen for the sort kernels, as it is the same ordering that Postgres implements. The main difference seems to be the handling of negative NaN. If we decide on a different ordering we should keep sort and min/max consistent.

One nice thing about the current impl for min/max is that all numeric types are handled the same, with the compiler optimizing away the is_nan function for integer types.
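
For illustration, a minimal sketch of that idea (assumed, not the exact trait in this PR): a per-type is_nan where the integer implementation is a constant false, so the compiler can remove the NaN branch entirely for integer types.

trait MaybeNan {
    fn is_nan(&self) -> bool;
}

impl MaybeNan for i32 {
    #[inline]
    fn is_nan(&self) -> bool {
        false // constant, so the NaN check is optimized away for integers
    }
}

impl MaybeNan for f64 {
    #[inline]
    fn is_nan(&self) -> bool {
        f64::is_nan(*self)
    }
}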

alamb pushed a commit that referenced this pull request Jan 7, 2021
This implements ordering using the IEEE 754 total ordering as mentioned in this discussion (only for sort for now).

I think this simplifies NaN handling quite a bit. Performance-wise there doesn't seem to be a big difference.

#8685

Closes #8882 from Dandandan/float_order

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021