New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10216: [Rust] Simd implementation for primitive min/max kernels #8685
ARROW-10216: [Rust] Simd implementation for primitive min/max kernels #8685
Conversation
This looks nice, I will check this out tomorrow! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If numeric trait bounds are written with correct layering it will definitely improve the sum kernel and have a shave from it in addition to min and max. On my machine, I have a 5 percent regression in sum aggregation. But these are not a must from my point of view.
rust/arrow/src/datatypes.rs
Outdated
#[inline] | ||
fn identity_for_min_op() -> Self::Native { | ||
$max_value | ||
} | ||
|
||
#[inline] | ||
fn identity_for_max_op() -> Self::Native { | ||
$min_value | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using those in https://github.com/apache/arrow/pull/8685/files#diff-45705886920f322ea90b662e46ed728e11c40402cedbe42d5718fe73d63c69abR273
Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?
yes: https://docs.rs/num/0.3.1/num/trait.Bounded.html
That will be useful in many places where we manually figuring out bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"smallest finite number this type can represent" -> I have to think about or write some tests to see whether that works for floats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored so the implementation no longer needs those bounds
@@ -618,7 +628,7 @@ macro_rules! make_numeric_type { | |||
fn mask_from_u64(mask: u64) -> Self::SimdMask { | |||
match Self::lanes() { | |||
8 => { | |||
let vecidx = i64x8::new(128, 64, 32, 16, 8, 4, 2, 1); | |||
let vecidx = i64x8::new(1, 2, 4, 8, 16, 32, 64, 128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for min/max exposed that these constants were actually in the wrong order. This didn't turn up in the sum tests because there all invalid elements were already set to 0. I think when I initially implemented this, I copied the initialization from code that used intel avx intrinsics which take the parameters in the opposite order 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the tests for mask generation were apparently also incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a line comment before this assignment could help others in understanding these numbers?
I still need to add a few more tests involving infinities and NaNs |
Both scalar and simd implementations have issues involving null values, I added |
Hi @jhorstmann , just to double check: are you blocked and need help, or can we review it? |
…optimize-min-max-kernel
Hi @jorgecarleitao , I did not have much time to work on this the last two weeks, but have now found a solution for handling NaN values consistently. Benchmarks need to be updated still, but I think it is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this PR and looks good to me. However, I am way out of my league in some of the code here, and thus I think we should have another person looking at this.
@@ -618,7 +628,7 @@ macro_rules! make_numeric_type { | |||
fn mask_from_u64(mask: u64) -> Self::SimdMask { | |||
match Self::lanes() { | |||
8 => { | |||
let vecidx = i64x8::new(128, 64, 32, 16, 8, 4, 2, 1); | |||
let vecidx = i64x8::new(1, 2, 4, 8, 16, 32, 64, 128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a line comment before this assignment could help others in understanding these numbers?
I am checking it out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @jhorstmann , @vertexclique and @jorgecarleitao -- In my opinion, this PR is looking quite good and is ready to merge. I verified the performance speed ups (they are impressive) as well reviewed the tests and code. While I won't say I understood all of the implementation in detail the structure was clear and the tests look thorough.
I vote
Here are the measurements I took -- I saw the slowdown for min without NaN handling -- but I think correctness trumps performance any day (if you don't constrain us to be correct, we can make it go as fast as you want)
On this branch @ 9beb585
cargo bench -- "min 512"
min 512 time: [1.5557 us 1.5695 us 1.5856 us]
min nulls 512 time: [1.1191 us 1.1322 us 1.1479 us]
max 512 time: [1.5022 us 1.5107 us 1.5202 us]
max nulls 512 time: [2.0813 us 2.1004 us 2.1196 us]
sum 512 time: [443.06 ns 447.89 ns 454.22 ns]
sum nulls 512 time: [272.30 ns 275.02 ns 278.16 ns]
cargo bench --features simd -- "min 512"
min 512 time: [121.73 ns 122.88 ns 124.18 ns]
min nulls 512 time: [231.45 ns 233.08 ns 234.85 ns]
max 512 time: [114.63 ns 115.75 ns 117.24 ns]
max nulls 512 time: [233.71 ns 236.93 ns 240.56 ns]
sum 512 time: [47.217 ns 48.041 ns 48.873 ns]
sum nulls 512 time: [88.283 ns 89.846 ns 92.079 ns]
On master @ b290e23
cargo bench -- "min 512" # and similar for max and sum
min 512 time: [467.62 ns 479.97 ns 494.11 ns]
min nulls 512 time: [1.3551 us 1.3738 us 1.3952 us]
sum 512 time: [446.15 ns 451.55 ns 457.65 ns]
sum nulls 512 time: [90.057 ns 91.178 ns 92.428 ns]
cargo bench --features simd -- "min 512"
min 512 time: [464.29 ns 470.35 ns 476.89 ns]
min nulls 512 time: [1.4172 us 1.4427 us 1.4697 us]
sum 512 time: [44.391 ns 44.823 ns 45.358 ns]
sum nulls 512 time: [232.58 ns 233.82 ns 235.28 ns]
@jhorstmann There is a (unstable) implementation in rust std for it. Sounds something we could have in the source for now (it is only a few lines), I think it might also simplify some code like the current sort kernels? What do you think? |
@Dandandan that sounds like good idea as it is a bit more standardized. The current ordering was choosen initially for the sort kernels, as it is the same ordering that Postgres implements. The main difference seems to be the handling of negative NaN. If we decide on a different ordering we should keep sort and min/max consistent. One nice thing about the current impl for min/max is that all numeric types are handled the same, with the compiler optimizing away the |
This implements ordering using IEEE 754 ordering as mentioned in this discussion (only for sort now). I think this simplifies NaN-handling quite a bit. Performance-wise doesn't seem to be a big difference. #8685 Closes #8882 from Dandandan/float_order Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This implements ordering using IEEE 754 ordering as mentioned in this discussion (only for sort now). I think this simplifies NaN-handling quite a bit. Performance-wise doesn't seem to be a big difference. apache/arrow#8685 Closes #8882 from Dandandan/float_order Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This refactors the simd aggregation to a reusable trait and adds implementations for min and max. It also handles NaN values now the same way as the sort kernel, meaning that NaN is considered to be greater than any other non-null value. Some tests were failing with the simd feature active because the different order of additions rounded to a slightly different result. I reused the comparison function that @vertexclique implemented in his recent PR. **Update:** Benchmarks (`aggregate_kernels`) show a 8x improvement with the simd feature: - sum performance stays the same at 37ns without nulls and 95 with nulls - min without nulls improves from 595ns to 75ns - min with nulls improves from 1725ns to 197ns Without simd feature the performance decreased due to the added NaN handling: - min without nulls from 595ns to 1057ns One strange result is that min with nulls improved from 1725ns to 1371ns despite now including NaN handling. Closes apache#8685 from jhorstmann/ARROW-10216-simd-optimize-min-max-kernel Lead-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Jörn Horstmann <git@jhorstmann.net> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
This implements ordering using IEEE 754 ordering as mentioned in this discussion (only for sort now). I think this simplifies NaN-handling quite a bit. Performance-wise doesn't seem to be a big difference. apache#8685 Closes apache#8882 from Dandandan/float_order Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This refactors the simd aggregation to a reusable trait and adds implementations for min and max. It also handles NaN values now the same way as the sort kernel, meaning that NaN is considered to be greater than any other non-null value.
Some tests were failing with the simd feature active because the different order of additions rounded to a slightly different result. I reused the comparison function that @vertexclique implemented in his recent PR.
Update: Benchmarks (
aggregate_kernels
) show a 8x improvement with the simd feature:Without simd feature the performance decreased due to the added NaN handling:
One strange result is that min with nulls improved from 1725ns to 1371ns despite now including NaN handling.