New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10015: [Rust] Simd aggregate kernels #8370
ARROW-10015: [Rust] Simd aggregate kernels #8370
Conversation
This is really exciting!! Just to check my understanding, we are talking I will start by reviewing the other branch first. fyi @andygrove @alamb , this may have "some" impact on the query benchmarks of DataFusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think this looks really nice. I didn't examine the entire implementation in detail but I can if that is needed. Nice work @jhorstmann
rust/arrow/src/array/array.rs
Outdated
@@ -1907,15 +1907,16 @@ impl TryFrom<Vec<(&str, ArrayRef)>> for StructArray { | |||
let mut null: Option<Buffer> = None; | |||
for (field_name, array) in values { | |||
let child_datum = array.data(); | |||
let child_datum_len = child_datum.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious (for my own future edification) if this change actually improves performance or if it was just a code cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, there might have been an issue a bit further down where it previously used the len (in bytes) of the null buffer but after the refactoring needs the length of the array. I don't remember if I did this for consistency or if it caused a test failure.
rust/arrow/src/buffer.rs
Outdated
@@ -371,118 +388,165 @@ where | |||
|
|||
fn bitwise_bin_op_helper<F>( | |||
left: &Buffer, | |||
left_offset: usize, | |||
left_offset_in_bits: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for improved naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder too now that you know lots about what bitwise_bin_op_helper
does, if you might be able to add a summary in the comments
Like
/// This function creates a new Buffer view aligned on ....
fn bitwise_bin_op_helper<F>(
a6c40a2
to
b3ad057
Compare
Rebased after merge of ARROW-10040 (#8262) |
I'm planning on running TPC-H benchmarks later today with and without this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cargo run --release --bin tpch -- --debug --iterations 3 --path /mnt/tpch/s100/parquet --format parquet --query 1 --batch-size 4096 --concurrency 24
Master branch:
Query 1 iteration 0 took 14574 ms
Query 1 iteration 1 took 14623 ms
Query 1 iteration 2 took 14985 ms
This PR:
Query 1 iteration 0 took 14253 ms
Query 1 iteration 1 took 14112 ms
Query 1 iteration 2 took 14290 ms
I also verified that the query produced the same results, so LGTM.
@andygrove seems there is a lot else going on in that query and the sum aggregation is just a small part. I'll try to setup the benchmarks myself and have a look. Would be interesting to compare a similar query without the grouping, filtering and sorting. |
Built on top of [ARROW-10040][1] (apache#8262) Benchmarks (run on a Ryzen 3700U laptop with some thermal problems) Current master without simd: ``` $ cargo bench --bench aggregate_kernels sum 512 time: [3.9652 us 3.9722 us 3.9819 us] change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05) No change in performance detected. Found 14 outliers among 100 measurements (14.00%) 4 (4.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [9.4577 us 9.4796 us 9.5112 us] change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05) Performance has regressed. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe ``` This branch without simd (speedup probably due to accessing the data via a slice): ``` sum 512 time: [1.1066 us 1.1113 us 1.1168 us] change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 7 (7.00%) high mild 6 (6.00%) high severe sum nulls 512 time: [1.3279 us 1.3364 us 1.3469 us] change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05) Performance has improved. Found 20 outliers among 100 measurements (20.00%) 4 (4.00%) high mild 16 (16.00%) high severe ``` This branch with simd: ``` sum 512 time: [108.58 ns 109.47 ns 110.57 ns] change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [249.95 ns 250.50 ns 251.06 ns] change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe ``` [1]: https://issues.apache.org/jira/browse/ARROW-10040 Closes apache#8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels Lead-authored-by: Jörn Horstmann <git@jhorstmann.net> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>
Built on top of ARROW-10040 (#8262)
Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)
Current master without simd:
This branch without simd (speedup probably due to accessing the data via a slice):
This branch with simd: