Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) #1221

Merged
merged 3 commits into from
Jan 24, 2022

Conversation

jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Jan 22, 2022

Which issue does this PR close?

Removes explicit simd arithmetic kernels, except for division/modulo, as compiler autovectorization actually generates better code. Also adds a new divide_unchecked kernel for floating point division, which does not need to return an error for division by zero. The arithmetic benchmarks are changed to use a bigger batch size because otherwise it measured mostly allocation and branch overhead. Benchmark results on my laptop

Closes #1182.

Rationale for this change

Simplifies the code and increases the performance up to 2x when using the simd feature of this crate. rustc and llvm are quite good at generating optimized simd code.

What changes are included in this PR?

Are there any user-facing changes?

No. ArrowFloatNumericType::pow is now unused though and could be removed. The trait should be kept since some kernels might make only sense on floating point data.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 22, 2022
@@ -153,6 +153,7 @@ impl Buffer {
///
/// Note that this should be used cautiously, and the returned pointer should not be
/// stored anywhere, to avoid dangling pointers.
#[inline]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did not get inlined in one of my benchmarks, which is weird for such a short method. I think it only gets called by a chain of methods marked as inline, so maybe that confused the compiler.

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2022

Codecov Report

Merging #1221 (ac6a599) into master (0377aae) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1221      +/-   ##
==========================================
- Coverage   82.71%   82.70%   -0.01%     
==========================================
  Files         175      175              
  Lines       51711    51711              
==========================================
- Hits        42771    42770       -1     
- Misses       8940     8941       +1     
Impacted Files Coverage Δ
arrow/src/buffer/immutable.rs 98.92% <ø> (ø)
arrow/src/compute/kernels/arithmetic.rs 90.86% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (-0.43%) ⬇️
arrow/src/datatypes/field.rs 53.79% <0.00%> (-0.31%) ⬇️
parquet_derive/src/parquet_field.rs 66.43% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0377aae...ac6a599. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jhorstmann ! I have always wondered how much faster the simd specific kernels actually were (given LLVM's ability to autovectorize)

Can you please provide the benchmark results (and also ideally how you ran them)?

I would like to run them on a hosted server machine

@jhorstmann
Copy link
Contributor Author

Hi @alamb, the benchmark results and my analysis are in the issue. For the removed kernels, the auto-vectorized version was about 40% faster on my laptop. The command was

RUSTFLAGS="-C target-cpu=skylake" cargo +nightly bench --features simd --bench arithmetic_kernels

(and the same without the simd feature). Results on a server class machine and with target-cpu=skx should be even better.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of this PR looks great ❤️ (deleting code quite nice)

I compared the performance on a "cascade lake" class server machine with the following commands:

# Checkout only the changes in benchmark code
git chekcout bc751794b7a82bb2639b63d8065dfb25a951f458
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels

Then I ran the results on the code in this PR

# Check out the tip of this branch
git checkout ac6a599873e0547c0932726562f87ec210595cab
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels

My results are consistent with @jhorstmann's :tada

add                     time:   [11.158 us 11.194 us 11.239 us]                 
                        change: [-46.244% -46.094% -45.905%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

subtract                time:   [12.028 us 12.052 us 12.087 us]                      
                        change: [-23.628% -23.089% -22.472%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

multiply                time:   [12.048 us 12.066 us 12.087 us]                      
                        change: [-25.195% -24.787% -24.365%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

divide                  time:   [22.329 us 22.353 us 22.379 us]                    
                        change: [+24.502% +24.980% +25.378%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

divide_unchecked        time:   [14.145 us 14.150 us 14.156 us]                              
                        change: [-19.806% -19.416% -19.086%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  22 (22.00%) high mild
  2 (2.00%) high severe

divide_scalar           time:   [11.353 us 11.357 us 11.362 us]                           
                        change: [-33.601% -33.478% -33.317%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

modulo                  time:   [377.00 us 377.28 us 377.57 us]                   
                        change: [+1.2601% +1.6124% +1.9082%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

Benchmarking modulo_scalar: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
modulo_scalar           time:   [1.4036 ms 1.4043 ms 1.4052 ms]                           
                        change: [-14.915% -14.615% -14.243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe

add_nulls               time:   [11.334 us 11.381 us 11.454 us]                       
                        change: [-25.846% -25.633% -25.348%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

divide_nulls            time:   [19.081 us 19.109 us 19.135 us]                          
                        change: [-1.4613% -0.9700% -0.5291%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 27 outliers among 100 measurements (27.00%)
  11 (11.00%) low severe
  1 (1.00%) low mild
  13 (13.00%) high mild
  2 (2.00%) high severe

divide_nulls_unchecked  time:   [13.564 us 13.569 us 13.575 us]                                    
                        change: [-24.266% -23.988% -23.792%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

divide_scalar_nulls     time:   [11.252 us 11.257 us 11.264 us]                                 
                        change: [-33.624% -33.527% -33.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

modulo_nulls            time:   [678.82 us 679.17 us 679.60 us]                         
                        change: [-0.7578% -0.4909% -0.2869%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking modulo_scalar_nulls: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
modulo_scalar_nulls     time:   [1.0158 ms 1.0162 ms 1.0167 ms]                                 
                        change: [-15.660% -14.722% -14.057%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

@alamb alamb merged commit 2d6352b into apache:master Jan 24, 2022
@alamb
Copy link
Contributor

alamb commented Jan 24, 2022

Thanks @jhorstmann -- this is a really nice find and piece of work

@alamb alamb changed the title Remove explicit simd arithmetic kernels except for division/modulo Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) Feb 3, 2022
@alamb alamb mentioned this pull request Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evaluate performance of simd on simple arithmetic
3 participants