
Interesting benchmark results of min_max_helper #1400

Closed
HaoYang670 opened this issue Mar 5, 2022 · 7 comments

Labels
question Further information is requested

@HaoYang670
Contributor

Describe your question
I found some interesting benchmark results while trying to speed up the function min_max_helper: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/aggregate.rs#L115-L130

The only change I made was replacing iter().fold() with iter().reduce():

    if null_count == 0 {
        // optimized path for arrays without null values
        m.iter()
            .reduce(|acc, item| if cmp(acc, item) { item } else { acc })
            .copied()
    } else {
        let mut n = T::default_value();
        let mut has_value = false;
        for (i, item) in m.iter().enumerate() {
            if data.is_valid(i) && (!has_value || cmp(&n, item)) {
                has_value = true;
                n = *item
            }
        }
        Some(n)
    }
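
For reference, here is a minimal standalone sketch of the two variants, specialized to f64 so it compiles on its own (the real kernel is generic over ArrowNumericType and takes cmp as a parameter; cmp below is just a stand-in):

    // Standalone sketch, not the arrow-rs code. cmp(acc, item) answers
    // "should item replace acc?"; for a min it is true when item < acc.
    fn cmp(a: &f64, b: &f64) -> bool {
        b < a
    }

    /// fold-based variant (roughly what the non-null path did before the rewrite)
    pub fn min_fold(values: &[f64]) -> Option<f64> {
        if values.is_empty() {
            return None;
        }
        Some(values[1..].iter().fold(values[0], |acc, item| {
            if cmp(&acc, item) {
                *item
            } else {
                acc
            }
        }))
    }

    /// reduce-based variant (the rewrite from this issue)
    pub fn min_reduce(values: &[f64]) -> Option<f64> {
        values
            .iter()
            .reduce(|acc, item| if cmp(acc, item) { item } else { acc })
            .copied()
    }

    fn main() {
        let v = [3.0, 1.0, 2.0];
        assert_eq!(min_fold(&v), Some(1.0));
        assert_eq!(min_reduce(&v), Some(1.0));
        assert_eq!(min_fold(&[]), None);
        assert_eq!(min_reduce(&[]), None);
    }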

Then I ran

cargo bench min

to see whether there was any change in performance, and got the following results:

min 512                 time:   [415.18 ns 415.66 ns 416.25 ns]                    
                        change: [-50.211% -50.109% -50.002%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild
  6 (6.00%) high severe

min nulls 512           time:   [1.0308 us 1.0331 us 1.0356 us]                           
                        change: [+17.936% +18.114% +18.295%] (p = 0.00 < 0.05)
                        Performance has regressed.

The results are a little unexpected, and I have two questions:

  1. Why does min 512 show a 50% performance improvement? I don't think iter.reduce is faster than iter.fold.
  2. Why does min nulls 512 become slower? The null_count > 0 code block was not changed.

Need your help!

@HaoYang670 HaoYang670 added the question Further information is requested label Mar 5, 2022
@jhorstmann
Contributor

Interesting, I'm actually seeing the opposite effect, with fold being faster. This is running on an AMD 3700U laptop. The generated code for fold and reduce seems to differ in that reduce contains several branches while fold looks relatively branchless. Different CPUs might handle one or the other better, although I would expect the branchless one to win in this benchmark since the input consists of random numbers. I don't have any good explanation for why this affects the null handling loop.

@HaoYang670
Contributor Author

> Interesting, I'm actually seeing the opposite effect, with fold being faster. This is running on an AMD 3700U laptop. The generated code for fold and reduce seems to differ in that reduce contains several branches while fold looks relatively branchless. Different CPUs might handle one or the other better, although I would expect the branchless one to win in this benchmark since the input consists of random numbers. I don't have any good explanation for why this affects the null handling loop.

More interesting! My desktop uses an Intel i7-10700K processor.
Also, I am curious why the compiler generates different code for fold and reduce, since reduce just uses fold in its implementation:

    #[inline]
    #[stable(feature = "iterator_fold_self", since = "1.51.0")]
    fn reduce<F>(mut self, f: F) -> Option<Self::Item>
    where
        Self: Sized,
        F: FnMut(Self::Item, Self::Item) -> Self::Item,
    {
        let first = self.next()?;
        Some(self.fold(first, f))
    }

@jhorstmann
Contributor

Good point. I tried some other variations and it seems to come down to the handling of references vs. copying. The following two variations lead to different code:

pub fn min_reduce_ref(array: &[f64]) -> Option<f64> {
    array.iter().reduce(|a, b| if cmp(a, b) { b } else { a }).copied()
}

pub fn min_reduce_copied(array: &[f64]) -> Option<f64> {
    array.iter().copied().reduce(|a, b| if cmp(a, b) { b } else { a })
}

The second one generates the same code that I saw for fold earlier. In theory it should be possible for both to get optimized to the same code, so this might be a limitation in LLVM.
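
(As an aside, for those two snippets to compile on their own, cmp can be any comparator of this shape; a plain stand-in for the min case, not the arrow-rs helper itself, could be:)

// Stand-in comparator so the two functions above compile standalone; the
// arrow-rs kernel passes its own cmp in as a generic parameter.
fn cmp(a: &f64, b: &f64) -> bool {
    b < a // true means "b should replace a", i.e. b is smaller
}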

@HaoYang670
Contributor Author

Strongly agree with you @jhorstmann. fold seems to copy all the elements of the array, while reduce just takes references. If the code is rewritten as

    if null_count == 0 {
        // optimized path for arrays without null values
        m.iter()
            .copied()
            .reduce(|acc, item| if cmp(&acc, &item) { item } else { acc })

then I get the following benchmark result:

min 512                 time:   [833.73 ns 834.34 ns 835.06 ns]                     
                        change: [-0.2047% -0.0757% +0.0383%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

which is the same as fold.

But wait! Which one is faster in this context, copying or references? On my desktop, I get a 50% performance improvement by using references. However, you said fold was faster on your laptop. Isn't that a little weird?

@HaoYang670
Contributor Author

HaoYang670 commented Mar 6, 2022

Also, using copies or references seems to impact the performance of the null handling loop. I get the following benchmark results:

  1. Using m.iter().copied().reduce(...):

min nulls 512           time:   [866.42 ns 867.35 ns 868.33 ns]
                        change: [-1.1100% -0.9494% -0.8086%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

  2. Using m.iter().reduce(...).copied():

min nulls 512           time:   [1.0525 us 1.0551 us 1.0576 us]
                        change: [+20.470% +20.793% +21.115%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Are you able to reproduce these weird results on your laptop, @jhorstmann?

@jhorstmann
Contributor

I can reproduce both results, and it seems AMD CPUs are better at handling one version of the code while Intel is better at the other. For the non-null benchmarks I get:

Intel 10510U
copy + reduce: ~950ns
reduce references + copy: ~450ns

AMD 3700U (timings fluctuate a bit more on this laptop)
copy + reduce: ~745ns
reduce references + copy: ~970ns

For min nulls I don't really see significant differences. Code alignment might create some small performance differences on some CPUs in such microbenchmarks. For even better performance, especially for nullable arrays, you should look into enabling the simd feature.

I'm now actually a bit worried about the correctness of the nullable version: I don't see has_value being used apart from the assignment, and if that flag is false then the result should be None.

@HaoYang670
Contributor Author

HaoYang670 commented Mar 6, 2022

> I'm now actually a bit worried about the correctness of the nullable version: I don't see has_value being used apart from the assignment, and if that flag is false then the result should be None.

There is at least one valid value in the array, because we have already checked for the all-nulls case earlier:

if null_count == array.len() {
    return None;
}

https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/aggregate.rs#L107-L109
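
A small standalone illustration of that invariant (a hypothetical helper over Option<f64>, not the arrow-rs code): once the all-nulls case returns early, the loop is guaranteed to see at least one valid value, so has_value always ends up true.

// Hypothetical standalone sketch mirroring the null-handling loop.
fn min_nullable(values: &[Option<f64>]) -> Option<f64> {
    let null_count = values.iter().filter(|v| v.is_none()).count();
    // The early return also covers the empty-array case.
    if null_count == values.len() {
        return None;
    }
    let mut n = f64::default();
    let mut has_value = false;
    for item in values.iter().flatten() {
        if !has_value || *item < n {
            has_value = true;
            n = *item;
        }
    }
    // Guaranteed by the early return above: at least one value was valid.
    debug_assert!(has_value);
    Some(n)
}

fn main() {
    assert_eq!(min_nullable(&[None, Some(2.0), Some(1.0)]), Some(1.0));
    assert_eq!(min_nullable(&[None, None]), None);
    assert_eq!(min_nullable(&[]), None);
}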
