ARROW-10015: [Rust] Simd aggregate kernels #8370

jhorstmann · 2020-10-06T20:48:54Z

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without simd:

$ cargo bench  --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]                     
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]                           
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

This branch without simd (speedup probably due to accessing the data via a slice):

sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]                     
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]                           
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

This branch with simd:

sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]                    
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]                          
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
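
The gap between the three runs above comes down to how the kernel walks the values and the validity bitmap. The following is only a minimal sketch of that idea on plain slices, not the code in this PR: the names `sum_chunked`, `sum_chunked_nullable` and `LANES` are hypothetical, the bitmap is assumed to be least-significant-bit first, and vectorization is left to the compiler rather than done with explicit SIMD types.

```rust
/// Number of independent accumulators; 8 is an arbitrary choice for this sketch.
const LANES: usize = 8;

/// Sums a slice in fixed-size chunks. Keeping LANES separate accumulators
/// removes the dependency between consecutive additions, which gives the
/// compiler room to use vector instructions.
fn sum_chunked(values: &[i64]) -> i64 {
    let mut acc = [0i64; LANES];
    let chunks = values.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are summed separately below.
    let remainder = chunks.remainder();
    for chunk in chunks {
        for (a, &v) in acc.iter_mut().zip(chunk) {
            *a += v;
        }
    }
    acc.iter().sum::<i64>() + remainder.iter().sum::<i64>()
}

/// Null-aware variant. `validity` holds one bit per value (assumed LSB-first),
/// where a set bit means "not null". Returns None if no value is valid.
fn sum_chunked_nullable(values: &[i64], validity: &[u8]) -> Option<i64> {
    assert!(validity.len() * 8 >= values.len(), "validity bitmap too short");
    let mut total = 0i64;
    let mut valid_count = 0usize;
    // One validity byte covers eight values.
    for (byte_idx, chunk) in values.chunks(8).enumerate() {
        let mask = validity[byte_idx];
        for (bit, &v) in chunk.iter().enumerate() {
            let is_valid = (mask >> bit) & 1;
            // Multiply by 0 or 1 instead of branching on the validity bit.
            total += v * (is_valid as i64);
            valid_count += is_valid as usize;
        }
    }
    if valid_count == 0 {
        None
    } else {
        Some(total)
    }
}
```

For example, `sum_chunked(&[1, 2, 3])` returns `6`, and `sum_chunked_nullable(&[1, 2, 3], &[0b0000_0101])` skips the middle value and returns `Some(4)`.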

github-actions · 2020-10-06T21:19:12Z

https://issues.apache.org/jira/browse/ARROW-10015

jorgecarleitao · 2020-10-07T05:20:29Z

This is really exciting!!

Just to check my understanding: for sum nulls 512 we are comparing 9.48 us on current master without SIMD against 250 ns on this branch with SIMD, which corresponds to roughly a 38x overall improvement. Did you run current master with SIMD (to compare them with the same feature gate)?

I will start by reviewing the other branch first.

FYI @andygrove @alamb, this may have "some" impact on the query benchmarks of DataFusion.
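
The overall improvement discussed in this comment can be checked by dividing the median timings from the benchmark output in the PR description. This is a small sketch of that arithmetic; the helper name `speedup` is hypothetical, and the inputs are the middle values of each reported interval converted to nanoseconds.

```rust
/// Ratio of two timings expressed in the same unit (here nanoseconds).
/// A result of 10.0 means the "after" run is ten times faster.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

/// Median timings copied from the benchmark output, in nanoseconds.
const MASTER_SUM_NS: f64 = 3972.2; // 3.9722 us, master without simd
const MASTER_SUM_NULLS_NS: f64 = 9479.6; // 9.4796 us, master without simd
const SIMD_SUM_NS: f64 = 109.47; // this branch with simd
const SIMD_SUM_NULLS_NS: f64 = 250.50; // this branch with simd
```

With these inputs, `speedup(MASTER_SUM_NS, SIMD_SUM_NS)` comes out at about 36x and `speedup(MASTER_SUM_NULLS_NS, SIMD_SUM_NULLS_NS)` at about 38x. Both compare master without the SIMD feature against this branch with it, so they do not isolate the effect of the feature gate alone.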

alamb

I personally think this looks really nice. I didn't examine the entire implementation in detail but I can if that is needed. Nice work @jhorstmann

alamb · 2020-10-07T10:47:41Z

rust/arrow/src/array/array.rs

@@ -1907,15 +1907,16 @@ impl TryFrom<Vec<(&str, ArrayRef)>> for StructArray {
        let mut null: Option<Buffer> = None;
        for (field_name, array) in values {
            let child_datum = array.data();
+            let child_datum_len = child_datum.len();


I am curious (for my own future edification) whether this change actually improves performance or whether it was just a code cleanup.

jhorstmann · 2020-10-07

If I remember correctly, there might have been an issue a bit further down where the code previously used the length (in bytes) of the null buffer but, after the refactoring, needs the length of the array. I don't remember whether I did this for consistency or because of a test failure.
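
The distinction between the two lengths mentioned in this reply is easy to get wrong, because a validity bitmap stores one bit per array slot. The sketch below only illustrates that relationship and is not taken from the StructArray code; the function names are hypothetical and the bitmap is assumed to be least-significant-bit first.

```rust
/// Bytes needed to hold one validity bit for each of `array_len` slots.
fn bitmap_len_in_bytes(array_len: usize) -> usize {
    (array_len + 7) / 8
}

/// Reads the validity bit for slot `i` (assumed LSB-first within each byte).
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] >> (i % 8)) & 1 == 1
}

/// Counts the valid slots, looping over the array length in slots.
/// Looping over `bitmap.len()` instead would visit bytes, not slots.
fn count_valid(bitmap: &[u8], array_len: usize) -> usize {
    (0..array_len).filter(|&i| is_valid(bitmap, i)).count()
}
```

An array with 5 slots needs a 1-byte bitmap, so code that uses the buffer length (1) where the array length (5) is meant would look at a single slot.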

alamb · 2020-10-07T10:48:34Z

rust/arrow/src/buffer.rs

@@ -371,118 +388,165 @@ where

 fn bitwise_bin_op_helper<F>(
    left: &Buffer,
-    left_offset: usize,
+    left_offset_in_bits: usize,


💯 for improved naming

Now that you know a lot about what bitwise_bin_op_helper does, I also wonder whether you might be able to add a summary in the comments.

Like

/// This function creates a new Buffer view aligned on ....
fn bitwise_bin_op_helper<F>(
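
To make the renamed parameter concrete, here is a deliberately simple sketch of a binary bitmap operation whose inputs start at arbitrary bit offsets. It is not the helper from this PR: it works on plain byte slices one bit at a time, the names `get_bit` and `bitwise_bin_op_sketch` are hypothetical, and bits are assumed to be least-significant-bit first. A production version would process many bits per step.

```rust
/// Reads bit `i` of a byte slice (assumed LSB-first within each byte).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Applies `op` to `len_in_bits` pairs of bits, reading `left` and `right`
/// from their own bit offsets, and returns a new bitmap packed from bit 0.
fn bitwise_bin_op_sketch<F>(
    left: &[u8],
    left_offset_in_bits: usize,
    right: &[u8],
    right_offset_in_bits: usize,
    len_in_bits: usize,
    op: F,
) -> Vec<u8>
where
    F: Fn(bool, bool) -> bool,
{
    let mut out = vec![0u8; (len_in_bits + 7) / 8];
    for i in 0..len_in_bits {
        let l = get_bit(left, left_offset_in_bits + i);
        let r = get_bit(right, right_offset_in_bits + i);
        if op(l, r) {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}
```

Calling `bitwise_bin_op_sketch(&[0b1111_0000], 4, &[0b0000_0101], 0, 4, |a, b| a & b)` ANDs the upper four bits of the left byte with the lower four bits of the right byte and returns `vec![0b0000_0101]`. Naming the offset `left_offset_in_bits` makes it clear that the value `4` here counts bits, not bytes.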

rust/arrow/src/compute/kernels/aggregate.rs

jhorstmann · 2020-10-08T14:37:44Z

Rebased after merge of ARROW-10040 (#8262)

andygrove · 2020-10-08T18:29:54Z

I'm planning on running TPC-H benchmarks later today with and without this patch.

andygrove

cargo run --release --bin tpch -- --debug --iterations 3 --path /mnt/tpch/s100/parquet --format parquet --query 1 --batch-size 4096 --concurrency 24

Master branch:

Query 1 iteration 0 took 14574 ms
Query 1 iteration 1 took 14623 ms
Query 1 iteration 2 took 14985 ms

This PR:

Query 1 iteration 0 took 14253 ms
Query 1 iteration 1 took 14112 ms
Query 1 iteration 2 took 14290 ms

I also verified that the query produced the same results, so LGTM.
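
The end-to-end effect of these timings can be summarized by comparing the mean of each run. This is a small sketch of that calculation with hypothetical helper names; three iterations per branch is a small sample, so the result is indicative only.

```rust
/// Arithmetic mean of a set of timings in milliseconds.
fn mean_ms(samples: &[u64]) -> f64 {
    samples.iter().sum::<u64>() as f64 / samples.len() as f64
}

/// Relative change from `before` to `after`; negative means faster.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}

/// Query 1 timings copied from the comment above, in milliseconds.
const MASTER_MS: [u64; 3] = [14574, 14623, 14985];
const PR_MS: [u64; 3] = [14253, 14112, 14290];
```

The means are about 14727 ms for master and about 14218 ms for this PR, so `relative_change(mean_ms(&MASTER_MS), mean_ms(&PR_MS))` is about -0.035, that is, roughly 3.5% faster on this query.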

jhorstmann · 2020-10-08T20:23:20Z

@andygrove it seems there is a lot else going on in that query, and the sum aggregation is just a small part of it. I'll try to set up the benchmarks myself and have a look. It would be interesting to compare a similar query without the grouping, filtering and sorting.

Built on top of [ARROW-10040][1] (apache#8262)

[1]: https://issues.apache.org/jira/browse/ARROW-10040

Closes apache#8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels

Lead-authored-by: Jörn Horstmann <git@jhorstmann.net>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>

github-actions bot added the Component: Rust label Oct 6, 2020

jhorstmann mentioned this pull request Oct 6, 2020

ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets #8262

Closed

github-actions bot added the Component: Rust - DataFusion label Oct 7, 2020

alamb reviewed Oct 7, 2020

jhorstmann and others added 2 commits October 8, 2020 16:35

ARROW-10015: [Rust] Simd implementation of sum aggregate kernel

7db7b17

ARROW-10015: [Rust] Add datafusion test involving sum kernel

b3ad057

jhorstmann force-pushed the ARROW-10015-simd-aggregate-kernels branch from a6c40a2 to b3ad057 Compare October 8, 2020 14:36

andygrove approved these changes Oct 8, 2020

andygrove closed this in 0100121 Oct 8, 2020

asfimport mentioned this pull request Oct 9, 2020

[Rust] Implement SIMD for aggregate kernel sum #26039

Closed
