
Interesting benchmark results of min_max_helper #1400

Closed
HaoYang670 opened this issue Mar 5, 2022 · 7 comments

Labels
question Further information is requested

@HaoYang670
Contributor

Describe your question
I found some interesting benchmark results while trying to speed up the function min_max_helper: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/aggregate.rs#L115-L130

The only change I made was replacing iter().fold() with iter().reduce():

    if null_count == 0 {
        // optimized path for arrays without null values
        m.iter()
            .reduce(|acc, item| if cmp(acc, item) { item } else { acc })
            .copied()
    } else {
        let mut n = T::default_value();
        let mut has_value = false;
        for (i, item) in m.iter().enumerate() {
            if data.is_valid(i) && (!has_value || cmp(&n, item)) {
                has_value = true;
                n = *item
            }
        }
        Some(n)
    }
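
For reference, here is a minimal standalone sketch of the two variants, specialized to f64 so it compiles on its own (the real kernel is generic over ArrowNumericType and takes cmp as a parameter; cmp below is just a stand-in):

    // Standalone sketch, not the arrow-rs code. cmp(acc, item) answers
    // "should item replace acc?"; for a min it is true when item < acc.
    fn cmp(a: &f64, b: &f64) -> bool {
        b < a
    }

    /// fold-based variant (roughly what the non-null path did before the rewrite)
    pub fn min_fold(values: &[f64]) -> Option<f64> {
        if values.is_empty() {
            return None;
        }
        Some(values[1..].iter().fold(values[0], |acc, item| {
            if cmp(&acc, item) {
                *item
            } else {
                acc
            }
        }))
    }

    /// reduce-based variant (the rewrite from this issue)
    pub fn min_reduce(values: &[f64]) -> Option<f64> {
        values
            .iter()
            .reduce(|acc, item| if cmp(acc, item) { item } else { acc })
            .copied()
    }

    fn main() {
        let v = [3.0, 1.0, 2.0];
        assert_eq!(min_fold(&v), Some(1.0));
        assert_eq!(min_reduce(&v), Some(1.0));
        assert_eq!(min_fold(&[]), None);
        assert_eq!(min_reduce(&[]), None);
    }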

Then I ran

cargo bench min

to see whether there was any change in performance, and got the following results:

min 512                 time:   [415.18 ns 415.66 ns 416.25 ns]                    
                        change: [-50.211% -50.109% -50.002%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild
  6 (6.00%) high severe

min nulls 512           time:   [1.0308 us 1.0331 us 1.0356 us]                           
                        change: [+17.936% +18.114% +18.295%] (p = 0.00 < 0.05)
                        Performance has regressed.

The results are a little unexpected, and I have two questions:

  1. Why does min 512 show a 50% performance improvement? I don't think iter.reduce is faster than iter.fold.
  2. Why does min nulls 512 become slower? The null_count > 0 code block was not changed.

Need your help!

@HaoYang670 HaoYang670 added the question Further information is requested label Mar 5, 2022
@jhorstmann
Contributor

Interesting, I'm actually seeing the opposite effect, with fold being faster. This is running on an AMD 3700U laptop. The generated code for fold and reduce seems to differ in that reduce contains several branches while fold looks relatively branchless. Different CPUs might handle one or the other better, although I would expect the branchless one to win in this benchmark since the input consists of random numbers. I don't have any good explanation for why this affects the null handling loop.

@HaoYang670
Contributor Author

> Interesting, I'm actually seeing the opposite effect, with fold being faster. This is running on an AMD 3700U laptop. The generated code for fold and reduce seems to differ in that reduce contains several branches while fold looks relatively branchless. Different CPUs might handle one or the other better, although I would expect the branchless one to win in this benchmark since the input consists of random numbers. I don't have any good explanation for why this affects the null handling loop.

More interesting! My desktop uses an Intel i7-10700K processor.
Also, I am curious why the compiler generates different code for fold and reduce, since reduce just uses fold in its implementation:

    #[inline]
    #[stable(feature = "iterator_fold_self", since = "1.51.0")]
    fn reduce<F>(mut self, f: F) -> Option<Self::Item>
    where
        Self: Sized,
        F: FnMut(Self::Item, Self::Item) -> Self::Item,
    {
        let first = self.next()?;
        Some(self.fold(first, f))
    }

@jhorstmann
Contributor

Good point. I tried some other variations and it seems to come down to the handling of references vs. copying. The following two variations lead to different code:

pub fn min_reduce_ref(array: &[f64]) -> Option<f64> {
    array.iter().reduce(|a, b| if cmp(a, b) { b } else { a }).copied()
}

pub fn min_reduce_copied(array: &[f64]) -> Option<f64> {
    array.iter().copied().reduce(|a, b| if cmp(a, b) { b } else { a })
}

The second one generates the same code that I saw for fold earlier. In theory it should be possible for both to get optimized to the same code, so this might be a limitation in LLVM.
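
(As an aside, for those two snippets to compile on their own, cmp can be any comparator of this shape; a plain stand-in for the min case, not the arrow-rs helper itself, could be:)

// Stand-in comparator so the two functions above compile standalone; the
// arrow-rs kernel passes its own cmp in as a generic parameter.
fn cmp(a: &f64, b: &f64) -> bool {
    b < a // true means "b should replace a", i.e. b is smaller
}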

@HaoYang670
Contributor Author

Strongly agree with you @jhorstmann. fold seems to copy all the elements of the array, while reduce just takes references. If the code is rewritten as

    if null_count == 0 {
        // optimized path for arrays without null values
        m.iter()
            .copied()
            .reduce(|acc, item| if cmp(&acc, &item) { item } else { acc })

then I get the following benchmark result:

min 512                 time:   [833.73 ns 834.34 ns 835.06 ns]                     
                        change: [-0.2047% -0.0757% +0.0383%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

which is the same as fold.

But wait! Which one is faster in this context, copying or references? On my desktop, I get a 50% performance improvement by using references. However, you said fold was faster on your laptop. Isn't that a little weird?

@HaoYang670
Contributor Author

HaoYang670 commented Mar 6, 2022

Also, using copies or references seems to impact the performance of the null handling loop. I get the following benchmark results:

  1. Using m.iter().copied().reduce(...):

min nulls 512           time:   [866.42 ns 867.35 ns 868.33 ns]
                        change: [-1.1100% -0.9494% -0.8086%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

  2. Using m.iter().reduce(...).copied():

min nulls 512           time:   [1.0525 us 1.0551 us 1.0576 us]
                        change: [+20.470% +20.793% +21.115%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Are you able to reproduce these weird results on your laptop, @jhorstmann?

@jhorstmann
Contributor

I can reproduce both results, and it seems AMD CPUs are better at handling one version of the code while Intel is better at the other. For the non-null benchmarks I get:

Intel 10510U
copy + reduce: ~950ns
reduce references + copy: ~450ns

AMD 3700U (timings fluctuate a bit more on this laptop)
copy + reduce: ~745ns
reduce references + copy: ~970ns

For min nulls I don't really see significant differences. Code alignment might create some small performance differences on some CPUs in such microbenchmarks. For even better performance, especially for nullable arrays, you should look into enabling the simd feature.

I'm now actually a bit worried about the correctness of the nullable version: I don't see has_value being used apart from the assignment, and if that flag is false then the result should be None.

@HaoYang670
Contributor Author

HaoYang670 commented Mar 6, 2022

> I'm now actually a bit worried about the correctness of the nullable version: I don't see has_value being used apart from the assignment, and if that flag is false then the result should be None.

There is at least one valid value in the array, because we have already checked for the all-nulls case earlier:

if null_count == array.len() {
    return None;
}

https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/aggregate.rs#L107-L109
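
A small standalone illustration of that invariant (a hypothetical helper over Option<f64>, not the arrow-rs code): once the all-nulls case returns early, the loop is guaranteed to see at least one valid value, so has_value always ends up true.

// Hypothetical standalone sketch mirroring the null-handling loop.
fn min_nullable(values: &[Option<f64>]) -> Option<f64> {
    let null_count = values.iter().filter(|v| v.is_none()).count();
    // The early return also covers the empty-array case.
    if null_count == values.len() {
        return None;
    }
    let mut n = f64::default();
    let mut has_value = false;
    for item in values.iter().flatten() {
        if !has_value || *item < n {
            has_value = true;
            n = *item;
        }
    }
    // Guaranteed by the early return above: at least one value was valid.
    debug_assert!(has_value);
    Some(n)
}

fn main() {
    assert_eq!(min_nullable(&[None, Some(2.0), Some(1.0)]), Some(1.0));
    assert_eq!(min_nullable(&[None, None]), None);
    assert_eq!(min_nullable(&[]), None);
}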
