ARROW-9873: [C++][Compute] Optimize mode kernel for integers in small value range #8091

cyb70289 · 2020-09-01T09:32:27Z

For int16/32/64 arrays of reasonable length, first scan the array to find
the min and max values. If (max - min) is within some threshold, counting
occurrences in a value-indexed array instead of a general hashmap improves
performance significantly.
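
A minimal sketch of the idea described above, not the code from this patch. The name `ModeSketch`, the cutoff `kMaxValueRange`, and the tie-breaking rule (smallest value wins) are assumptions made for illustration, and nulls are ignored for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cutoff for (max - min); the threshold used by the patch is not stated here.
constexpr uint64_t kMaxValueRange = 1u << 16;

// Returns {mode, count} for a non-empty input without nulls.
// Ties are resolved toward the smallest value (an assumption of this sketch).
template <typename T>
std::pair<T, int64_t> ModeSketch(const std::vector<T>& values) {
  assert(!values.empty());
  const auto mm = std::minmax_element(values.begin(), values.end());
  const int64_t min_value = static_cast<int64_t>(*mm.first);
  // Unsigned subtraction avoids signed overflow when the int64 range is very wide.
  const uint64_t range =
      static_cast<uint64_t>(*mm.second) - static_cast<uint64_t>(min_value);

  if (range <= kMaxValueRange) {
    // Fast path: counts[i] holds the number of occurrences of (min_value + i).
    std::vector<int64_t> counts(static_cast<size_t>(range) + 1, 0);
    for (const T v : values) {
      ++counts[static_cast<size_t>(static_cast<uint64_t>(v) -
                                   static_cast<uint64_t>(min_value))];
    }
    // max_element returns the first maximum, i.e. the smallest value on ties.
    const auto best = std::max_element(counts.begin(), counts.end());
    const int64_t offset = best - counts.begin();
    return {static_cast<T>(min_value + offset), *best};
  }

  // General path: hashmap keyed by value.
  std::unordered_map<T, int64_t> counts;
  for (const T v : values) ++counts[v];
  std::pair<T, int64_t> result{T{}, 0};
  for (const auto& kv : counts) {
    if (kv.second > result.second ||
        (kv.second == result.second && kv.first < result.first)) {
      result = {kv.first, kv.second};
    }
  }
  return result;
}
```

The fast path replaces one hash lookup per element with a single indexed increment, at the cost of an extra pass to find min/max.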

To stay compatible with chunked arrays, the value count array is converted
to a hashmap before merging with other chunks. This adds overhead for short
arrays. Finding min/max may also introduce a performance penalty in some cases.
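
The conversion step could look roughly like the sketch below. `DenseCounts`, `ToMap`, and `MergeInto` are hypothetical names, not identifiers from the patch; the point is only that each chunk's dense counts are turned into value-keyed entries so that chunks with different min values can be combined.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using CountMap = std::unordered_map<int64_t, int64_t>;

// Per-chunk dense counts: counts[i] is the number of occurrences of (min + i).
struct DenseCounts {
  int64_t min;
  std::vector<int64_t> counts;
};

// Convert dense counts to a value-keyed map, skipping values that never occurred.
CountMap ToMap(const DenseCounts& dense) {
  CountMap out;
  for (size_t i = 0; i < dense.counts.size(); ++i) {
    if (dense.counts[i] > 0) {
      out[dense.min + static_cast<int64_t>(i)] = dense.counts[i];
    }
  }
  return out;
}

// Accumulate one chunk's counts into the running total.
void MergeInto(CountMap* dst, const CountMap& src) {
  for (const auto& kv : src) (*dst)[kv.first] += kv.second;
}
```

For a short chunk, the conversion walks the whole count array even though few entries are non-zero, which is one plausible source of the overhead mentioned above.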

Please note that it's hard to benefit all use cases. With this patch applied:

- about 2x performance uplift for integers in a small value range
- no obvious performance drop for normal cases
- non-trivial performance drop in some cases
  * 40% drop for short int8 arrays (8k length)
  * 10% drop for sparse arrays (few distinct values, big value gap)

github-actions · 2020-09-01T09:46:31Z

https://issues.apache.org/jira/browse/ARROW-9873

pitrou · 2020-09-01T16:25:13Z

cpp/src/arrow/compute/kernels/aggregate_mode.cc

I don't think polymorphism is a good idea for performance. Instead, you should probably use templated code.
Also, when null_count == length, the implementation can be trivial.

Changed to template. Did get better performance. Thanks.
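
To illustrate the two review points, here is a hedged sketch rather than the code in aggregate_mode.cc. The counting strategy is a template parameter, so `Add` can be inlined instead of going through a virtual call per element, and the all-null case returns before any counting. `ModeResult`, `HashCounter`, and `ComputeMode` are hypothetical names, and `null_count` is assumed to be already known to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ModeResult {
  bool valid;     // false when the input has no non-null value
  int64_t mode;
  int64_t count;
};

// One counting strategy; a dense-array counter could expose the same two methods.
struct HashCounter {
  std::unordered_map<int64_t, int64_t> counts;

  void Add(int64_t v) { ++counts[v]; }

  // Ties are resolved toward the smallest value (an assumption of this sketch).
  ModeResult Finish() const {
    ModeResult r{false, 0, 0};
    for (const auto& kv : counts) {
      if (kv.second > r.count || (kv.second == r.count && kv.first < r.mode)) {
        r = ModeResult{true, kv.first, kv.second};
      }
    }
    return r;
  }
};

// Counter is resolved at compile time, so there is no virtual dispatch in the loop.
template <typename Counter>
ModeResult ComputeMode(const std::vector<int64_t>& values,
                       const std::vector<bool>& is_valid, int64_t null_count) {
  // Trivial case from the review: every slot is null, so nothing needs counting.
  if (null_count == static_cast<int64_t>(values.size())) {
    return ModeResult{false, 0, 0};
  }
  Counter counter;
  for (size_t i = 0; i < values.size(); ++i) {
    if (is_valid[i]) counter.Add(values[i]);
  }
  return counter.Finish();
}
```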

cyb70289 · 2020-09-02T06:38:19Z

Latest benchmark result after re-implementation.

# Tested on skylake (knight landing)
$ archery benchmark diff --suite-filter="arrow-compute-aggregate-benchmark" --benchmark-filter="^Mode" --cc=clang-9 --cxx=clang++-9

                      benchmark         baseline        contender    change %
// ignore these 100% null cases: huge boost due to a simple trick, not very useful
    ModeKernelBoolean/1048576/1  123.216 MiB/sec  847.125 GiB/sec  703915.356 'null_percent': 100.0
       ModeKernelInt8/1048576/1  896.330 MiB/sec  617.997 GiB/sec   70502.192 'null_percent': 100.0
      ModeKernelInt16/1048576/1    2.886 GiB/sec  965.237 GiB/sec   33340.541 'null_percent': 100.0
      ModeKernelInt32/1048576/1    5.732 GiB/sec  960.476 GiB/sec   16657.027 'null_percent': 100.0
      ModeKernelInt64/1048576/1    7.925 GiB/sec  974.705 GiB/sec   12198.487 'null_percent': 100.0

// big improvement for int16/32/64 with limited value range 
      ModeKernelInt16/1048576/0  128.522 MiB/sec  495.771 MiB/sec     285.749  'null_percent': 0.0
      ModeKernelInt32/1048576/0  257.694 MiB/sec  953.232 MiB/sec     269.909  'null_percent': 0.0
      ModeKernelInt64/1048576/0  516.624 MiB/sec    1.715 GiB/sec     240.027  'null_percent': 0.0
  ModeKernelInt32/1048576/10000  227.404 MiB/sec  690.032 MiB/sec     203.439 'null_percent': 0.01
  ModeKernelInt16/1048576/10000  115.419 MiB/sec  349.055 MiB/sec     202.425 'null_percent': 0.01
    ModeKernelInt32/1048576/100  229.661 MiB/sec  684.149 MiB/sec     197.895  'null_percent': 1.0
    ModeKernelInt16/1048576/100  116.084 MiB/sec  342.620 MiB/sec     195.148  'null_percent': 1.0
  ModeKernelInt64/1048576/10000  481.409 MiB/sec    1.302 GiB/sec     176.913 'null_percent': 0.01
    ModeKernelInt64/1048576/100  486.266 MiB/sec    1.297 GiB/sec     173.114  'null_percent': 1.0
     ModeKernelInt16/1048576/10  121.865 MiB/sec  315.932 MiB/sec     159.247 'null_percent': 10.0
     ModeKernelInt32/1048576/10  242.074 MiB/sec  625.162 MiB/sec     158.252 'null_percent': 10.0
     ModeKernelInt64/1048576/10  527.976 MiB/sec    1.199 GiB/sec     132.580 'null_percent': 10.0
      ModeKernelInt32/1048576/2  320.156 MiB/sec  429.196 MiB/sec      34.058 'null_percent': 50.0
      ModeKernelInt16/1048576/2  162.121 MiB/sec  196.310 MiB/sec      21.089 'null_percent': 50.0

// no obvious difference for bool/int8
     ModeKernelInt8/1048576/100  234.422 MiB/sec  251.464 MiB/sec       7.270  'null_percent': 1.0
      ModeKernelInt8/1048576/10  246.324 MiB/sec  258.110 MiB/sec       4.785 'null_percent': 10.0
   ModeKernelInt8/1048576/10000  239.496 MiB/sec  250.469 MiB/sec       4.582 'null_percent': 0.01
      ModeKernelInt64/1048576/2  812.020 MiB/sec  832.610 MiB/sec       2.536 'null_percent': 50.0
ModeKernelBoolean/1048576/10000   26.318 MiB/sec   26.509 MiB/sec       0.728 'null_percent': 0.01
  ModeKernelBoolean/1048576/100   26.510 MiB/sec   26.597 MiB/sec       0.327  'null_percent': 1.0
    ModeKernelBoolean/1048576/0   28.271 MiB/sec   28.274 MiB/sec       0.009  'null_percent': 0.0
       ModeKernelInt8/1048576/0  270.401 MiB/sec  269.025 MiB/sec      -0.509  'null_percent': 0.0
       ModeKernelInt8/1048576/2  190.410 MiB/sec  187.876 MiB/sec      -1.331 'null_percent': 50.0
   ModeKernelBoolean/1048576/10   28.007 MiB/sec   27.599 MiB/sec      -1.455 'null_percent': 10.0
    ModeKernelBoolean/1048576/2   27.157 MiB/sec   24.209 MiB/sec     -10.857 'null_percent': 50.0

cyb70289 · 2020-09-02T09:14:51Z

The "RTools 35" CI failure is unrelated to this PR.

pitrou · 2020-09-02T10:01:44Z

I get similar speedups on AMD Ryzen.


pitrou

+1, thank you @cyb70289

kou changed the title ~~ARROW-9873: [C++][Compute] Optimize mode kernel for integers in small…~~ ARROW-9873: [C++][Compute] Optimize mode kernel for integers in small value range Sep 1, 2020

pitrou reviewed Sep 1, 2020


cyb70289 marked this pull request as draft September 2, 2020 07:10

cyb70289 marked this pull request as ready for review September 2, 2020 09:15

cyb70289 and others added 3 commits September 2, 2020 15:36

Re-implement without polymorphism

fe98257

Improve tests

f4c42bd

pitrou approved these changes Sep 2, 2020


pitrou closed this in 823fe60 Sep 2, 2020

cyb70289 deleted the mode-count branch September 2, 2020 23:48

asfimport mentioned this pull request Sep 2, 2020

[C++][Compute] Improve mode kernel for intergers within limited value range #25909

Closed
