Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10015: [Rust] Simd aggregate kernels #8370

Closed

Conversation

jhorstmann
Copy link
Contributor

Built on top of ARROW-10040 (#8262)

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without simd:

$ cargo bench  --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]                     
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]                           
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

This branch without simd (speedup probably due to accessing the data via a slice):

sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]                     
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]                           
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

This branch with simd:

sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]                    
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]                          
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

@github-actions
Copy link

github-actions bot commented Oct 6, 2020

@jorgecarleitao
Copy link
Member

This is really exciting!!

Just to check my understanding, we are talking 9.48 us vs 250 ns when we use this branch with SIMD vs current master without SIMD, which corresponds to a 36x overall improvement. Did you ran the current master with SIMD (to compare them with the same feature gate)?

I will start by reviewing the other branch first.

fyi @andygrove @alamb , this may have "some" impact on the query benchmarks of DataFusion.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think this looks really nice. I didn't examine the entire implementation in detail but I can if that is needed. Nice work @jhorstmann

@@ -1907,15 +1907,16 @@ impl TryFrom<Vec<(&str, ArrayRef)>> for StructArray {
let mut null: Option<Buffer> = None;
for (field_name, array) in values {
let child_datum = array.data();
let child_datum_len = child_datum.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious (for my own future edification) if this change actually improves performance or if it was just a code cleanup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, there might have been an issue a bit further down where it previously used the len (in bytes) of the null buffer but after the refactoring needs the length of the array. I don't remember if I did this for consistency or if it caused a test failure.

@@ -371,118 +388,165 @@ where

fn bitwise_bin_op_helper<F>(
left: &Buffer,
left_offset: usize,
left_offset_in_bits: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for improved naming

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder too now that you know lots about what bitwise_bin_op_helper does, if you might be able to add a summary in the comments

Like

/// This function creates a new Buffer view aligned on ....
fn bitwise_bin_op_helper<F>(

rust/arrow/src/compute/kernels/aggregate.rs Show resolved Hide resolved
rust/arrow/src/compute/kernels/aggregate.rs Show resolved Hide resolved
@jhorstmann jhorstmann force-pushed the ARROW-10015-simd-aggregate-kernels branch from a6c40a2 to b3ad057 Compare October 8, 2020 14:36
@jhorstmann
Copy link
Contributor Author

Rebased after merge of ARROW-10040 (#8262)

@andygrove
Copy link
Member

I'm planning on running TPC-H benchmarks later today with and without this patch.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo run --release --bin tpch -- --debug --iterations 3 --path /mnt/tpch/s100/parquet --format parquet --query 1 --batch-size 4096 --concurrency 24

Master branch:

Query 1 iteration 0 took 14574 ms
Query 1 iteration 1 took 14623 ms
Query 1 iteration 2 took 14985 ms

This PR:

Query 1 iteration 0 took 14253 ms
Query 1 iteration 1 took 14112 ms
Query 1 iteration 2 took 14290 ms

I also verified that the query produced the same results, so LGTM.

@jhorstmann
Copy link
Contributor Author

@andygrove seems there is a lot else going on in that query and the sum aggregation is just a small part. I'll try to setup the benchmarks myself and have a look. Would be interesting to compare a similar query without the grouping, filtering and sorting.

@andygrove andygrove closed this in 0100121 Oct 8, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Built on top of [ARROW-10040][1] (apache#8262)

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without simd:

```
$ cargo bench  --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```

This branch without simd (speedup probably due to accessing the data via a slice):

```
sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe
```

This branch with simd:

```
sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```

 [1]: https://issues.apache.org/jira/browse/ARROW-10040

Closes apache#8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels

Lead-authored-by: Jörn Horstmann <git@jhorstmann.net>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants