
ARROW-10216: [Rust] Simd implementation for primitive min/max kernels #8685

Conversation

@jhorstmann (Contributor) commented Nov 16, 2020

This refactors the simd aggregation into a reusable trait and adds implementations for min and max. NaN values are now handled the same way as in the sort kernel, meaning that NaN is considered greater than any other non-null value.

Some tests were failing with the simd feature active because the different order of additions rounded to a slightly different result. I reused the comparison function that @vertexclique implemented in his recent PR.

Update: Benchmarks (aggregate_kernels) show an 8x improvement with the simd feature:

  • sum performance stays the same at 37ns without nulls and 95ns with nulls
  • min without nulls improves from 595ns to 75ns
  • min with nulls improves from 1725ns to 197ns

Without the simd feature, performance decreased due to the added NaN handling:

  • min without nulls from 595ns to 1057ns

One strange result is that min with nulls improved from 1725ns to 1371ns despite now including NaN handling.
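
To make the NaN semantics concrete, here is a minimal scalar sketch (my own illustration, ignoring nulls; not the PR's actual simd kernel) of a min that treats NaN as greater than every other value, matching the sort kernel:

fn min_nan_aware(values: &[f64]) -> Option<f64> {
    values.iter().copied().fold(None, |acc, v| match acc {
        None => Some(v),
        Some(current) => {
            if v.is_nan() {
                // a NaN candidate never beats the current minimum
                Some(current)
            } else if current.is_nan() || v < current {
                // replace a NaN accumulator or a larger value
                Some(v)
            } else {
                Some(current)
            }
        }
    })
}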

@vertexclique (Contributor):

This looks nice, I will check this out tomorrow!

@vertexclique (Contributor) left a comment:

If the numeric trait bounds are written with the correct layering, it will definitely improve the sum kernel and shave some time off it, in addition to min and max. On my machine I see a 5 percent regression in sum aggregation. But these are not a must from my point of view.

rust/arrow/src/compute/kernels/aggregate.rs (resolved)
Comment on lines 793 to 801
#[inline]
fn identity_for_min_op() -> Self::Native {
    $max_value
}

#[inline]
fn identity_for_max_op() -> Self::Native {
    $min_value
}
Contributor:

These can be removed.

Contributor Author:

I'm using those in https://github.com/apache/arrow/pull/8685/files#diff-45705886920f322ea90b662e46ed728e11c40402cedbe42d5718fe73d63c69abR273

Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?

Contributor:

> Or do you mean there is an existing trait function in num::num that provides those constants for all numeric types?

yes: https://docs.rs/num/0.3.1/num/trait.Bounded.html
That will be useful in many places where we are manually figuring out bounds.
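
A sketch of how that trait could replace the hand-written constants (assumed usage of the num crate, not code from this PR): Bounded::max_value() is the identity for a min fold and Bounded::min_value() for a max fold.

use num::Bounded;

// Identity for a min fold: start from the largest representable value.
fn identity_for_min<T: Bounded>() -> T {
    T::max_value()
}

// Identity for a max fold: start from the smallest representable value.
fn identity_for_max<T: Bounded>() -> T {
    T::min_value()
}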

Contributor Author:

"smallest finite number this type can represent" -> I have to think about or write some tests to see whether that works for floats.

Contributor Author:

Refactored so the implementation no longer needs those bounds.

rust/arrow/src/datatypes.rs (outdated, resolved)
@@ -618,7 +628,7 @@ macro_rules! make_numeric_type {
     fn mask_from_u64(mask: u64) -> Self::SimdMask {
         match Self::lanes() {
             8 => {
-                let vecidx = i64x8::new(128, 64, 32, 16, 8, 4, 2, 1);
+                let vecidx = i64x8::new(1, 2, 4, 8, 16, 32, 64, 128);
Contributor Author:

Tests for min/max exposed that these constants were actually in the wrong order. This didn't turn up in the sum tests because all invalid elements were already set to 0 there. I think when I initially implemented this, I copied the initialization from code that used Intel AVX intrinsics, which take the parameters in the opposite order 🤦
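
For reference, a scalar sketch of how those constants act (my reading of the code, not taken from the PR): each lane is assigned the power of two for its bit position, so masking the u64 against a lane's weight recovers one boolean per lane.

fn mask_bits_to_lanes(mask: u64) -> [bool; 8] {
    let vecidx: [u64; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    let mut lanes = [false; 8];
    for (i, weight) in vecidx.iter().enumerate() {
        // lane i is "valid" when bit i of the mask is set
        lanes[i] = (mask & weight) != 0;
    }
    lanes
}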

Contributor Author:

And the tests for mask generation were apparently also incorrect.

Member:

Maybe a line comment before this assignment could help others understand these numbers?

@jhorstmann (Contributor Author):

I still need to add a few more tests involving infinities and NaNs

@jhorstmann (Contributor Author):

Both the scalar and simd implementations have issues involving null values; I added ignored tests with comments. The handling of NaN is also not consistent with the sort kernel, which follows the Postgres model of sorting NaN as larger than any other number (including infinity). Solving those issues could be quite tricky and might degrade the performance again.

@jorgecarleitao (Member):

Hi @jhorstmann, just to double check: are you blocked and need help, or can we review it?

@jhorstmann (Contributor Author):

Hi @jorgecarleitao, I did not have much time to work on this over the last two weeks, but I have now found a solution for handling NaN values consistently. The benchmarks still need to be updated, but I think it is ready for review.

@jorgecarleitao (Member) left a comment:

I went through this PR and it looks good to me. However, I am way out of my league in some of the code here, and thus I think we should have another person look at this.

@alamb (Contributor) commented Dec 7, 2020

I am checking it out

@alamb (Contributor) left a comment:

Nice work @jhorstmann, @vertexclique and @jorgecarleitao -- in my opinion, this PR is looking quite good and is ready to merge. I verified the performance speedups (they are impressive) as well as reviewed the tests and code. While I won't say I understood all of the implementation in detail, the structure was clear and the tests look thorough.

I vote :shipit:

Here are the measurements I took -- I saw the slowdown for min without the simd feature -- but I think correctness trumps performance any day (if you don't constrain us to be correct, we can make it go as fast as you want)

On this branch @ 9beb585

cargo bench -- "min 512"
min 512                 time:   [1.5557 us 1.5695 us 1.5856 us]
min nulls 512           time:   [1.1191 us 1.1322 us 1.1479 us]
max 512                 time:   [1.5022 us 1.5107 us 1.5202 us]
max nulls 512           time:   [2.0813 us 2.1004 us 2.1196 us]
sum 512                 time:   [443.06 ns 447.89 ns 454.22 ns]
sum nulls 512           time:   [272.30 ns 275.02 ns 278.16 ns]


cargo bench --features simd -- "min 512"
min 512                 time:   [121.73 ns 122.88 ns 124.18 ns]
min nulls 512           time:   [231.45 ns 233.08 ns 234.85 ns]
max 512                 time:   [114.63 ns 115.75 ns 117.24 ns]
max nulls 512           time:   [233.71 ns 236.93 ns 240.56 ns]
sum 512                 time:   [47.217 ns 48.041 ns 48.873 ns]
sum nulls 512           time:   [88.283 ns 89.846 ns 92.079 ns]

On master @ b290e23

cargo bench -- "min 512" # and similar for max and sum
min 512                 time:   [467.62 ns 479.97 ns 494.11 ns]
min nulls 512           time:   [1.3551 us 1.3738 us 1.3952 us]
sum 512                 time:   [446.15 ns 451.55 ns 457.65 ns]
sum nulls 512           time:   [90.057 ns 91.178 ns 92.428 ns]

cargo bench --features simd -- "min 512"
min 512                 time:   [464.29 ns 470.35 ns 476.89 ns]
min nulls 512           time:   [1.4172 us 1.4427 us 1.4697 us]
sum 512                 time:   [44.391 ns 44.823 ns 45.358 ns]
sum nulls 512           time:   [232.58 ns 233.82 ns 235.28 ns]

@Dandandan (Contributor):

@jhorstmann
I wonder if we should support a standardized ordering like IEEE 754 instead?

There is an (unstable) implementation in Rust std for it:
https://doc.rust-lang.org/beta/std/primitive.f64.html#method.total_cmp

Sounds like something we could have in the source for now (it is only a few lines). I think it might also simplify some code like the current sort kernels? What do you think?
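
For illustration, those few lines would look roughly like this sketch of the IEEE 754 totalOrder comparison (modeled on the unstable std method, not code from this PR). With this ordering, negative NaN sorts below negative infinity and positive NaN above positive infinity.

use std::cmp::Ordering;

fn total_cmp_f64(a: f64, b: f64) -> Ordering {
    let mut left = a.to_bits() as i64;
    let mut right = b.to_bits() as i64;
    // For negative numbers, flip the magnitude bits so that a plain integer
    // comparison yields the IEEE 754 total order.
    left ^= (((left >> 63) as u64) >> 1) as i64;
    right ^= (((right >> 63) as u64) >> 1) as i64;
    left.cmp(&right)
}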

@jhorstmann (Contributor Author):

@Dandandan that sounds like a good idea as it is a bit more standardized. The current ordering was initially chosen for the sort kernels, as it is the same ordering that Postgres implements. The main difference seems to be the handling of negative NaN. If we decide on a different ordering we should keep sort and min/max consistent.

One nice thing about the current impl for min/max is that all numeric types are handled the same, with the compiler optimizing away the is_nan function for integer types.
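
For illustration, a minimal sketch of that idea (assumed, not the exact trait in this PR): a per-type is_nan where the integer implementation is a constant false, so the compiler can remove the NaN branch entirely for integer types.

trait MaybeNan {
    fn is_nan(&self) -> bool;
}

impl MaybeNan for i32 {
    #[inline]
    fn is_nan(&self) -> bool {
        false // constant, so the NaN check is optimized away for integers
    }
}

impl MaybeNan for f64 {
    #[inline]
    fn is_nan(&self) -> bool {
        f64::is_nan(*self)
    }
}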

alamb pushed a commit that referenced this pull request Jan 7, 2021
This implements ordering using the IEEE 754 total ordering as mentioned in this discussion (only for sort for now).

I think this simplifies NaN handling quite a bit. Performance-wise there doesn't seem to be a big difference.

#8685

Closes #8882 from Dandandan/float_order

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021