@cyb70289 cyb70289 commented Sep 1, 2020

For int16/32/64 arrays of reasonable length, first scan the array to find
the min/max values. If (max-min) is within some threshold, counting
occurrences in a value-indexed array instead of a general hashmap improves
performance significantly.
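
A minimal sketch of the idea described above, not the code in this patch. The function name `ModeOf`, the threshold `kMaxValueRange` and the tie-breaking rule (smallest value wins) are assumptions made for illustration, and nulls are ignored for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical threshold: take the fast path only when (max - min) fits.
constexpr uint64_t kMaxValueRange = 1 << 14;

// Returns {mode, count} of a non-empty vector; ties go to the smallest value.
template <typename T>
std::pair<T, int64_t> ModeOf(const std::vector<T>& values) {
  assert(!values.empty());
  const auto min_max = std::minmax_element(values.begin(), values.end());
  const T min = *min_max.first;
  const T max = *min_max.second;
  // Subtract in unsigned arithmetic so a wide int64 range cannot overflow.
  const uint64_t range = static_cast<uint64_t>(max) - static_cast<uint64_t>(min);

  if (range <= kMaxValueRange) {
    // Fast path: counts[i] holds the number of occurrences of (min + i).
    std::vector<int64_t> counts(range + 1, 0);
    for (const T v : values) {
      ++counts[static_cast<uint64_t>(v) - static_cast<uint64_t>(min)];
    }
    // max_element returns the first maximum, i.e. the smallest value on ties.
    const auto best = std::max_element(counts.begin(), counts.end());
    const int64_t offset = best - counts.begin();
    return {static_cast<T>(min + offset), *best};
  }

  // General path: hashmap from value to count.
  std::unordered_map<T, int64_t> counts;
  for (const T v : values) ++counts[v];
  T mode = values.front();
  int64_t best_count = 0;
  for (const auto& kv : counts) {
    if (kv.second > best_count || (kv.second == best_count && kv.first < mode)) {
      mode = kv.first;
      best_count = kv.second;
    }
  }
  return {mode, best_count};
}
```

The fast path replaces a hash lookup per element with a plain array increment, which is presumably where the speedup for small value ranges comes from; the extra min/max pass is the cost paid up front.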

To stay compatible with chunked arrays, the value count array is converted to
a hashmap before merging with other chunks. This adds overhead for short arrays.
Finding min/max may also introduce a performance penalty in some cases.
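
The conversion step might look roughly like the following. This is a sketch under assumptions: `MergeCountsIntoMap` is a hypothetical helper, and the layout (index `i` counts the value `min + i`) is assumed rather than taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Folds one chunk's counting array into a shared hashmap.
// counts[i] is the number of occurrences of (min + i) in that chunk.
template <typename T>
void MergeCountsIntoMap(const std::vector<int64_t>& counts, T min,
                        std::unordered_map<T, int64_t>* merged) {
  assert(merged != nullptr);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    if (counts[i] == 0) continue;  // skip values absent from this chunk
    const T value =
        static_cast<T>(static_cast<int64_t>(min) + static_cast<int64_t>(i));
    (*merged)[value] += counts[i];
  }
}
```

Because each chunk may have a different min and range, a common hashmap is a simple way to combine them, at the cost of one extra pass over the counting array per chunk.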

Note that it is hard to benefit every use case. With this patch:

  • about 2x performance uplift for integers in a small value range
  • no obvious performance drop for normal cases
  • non-trivial performance drop in some cases
    • 40% drop for short int8 arrays (8k length)
    • 10% drop for sparse arrays (few distinct values, big value gaps)

@kou kou changed the title from "ARROW-9873: [C++][Compute] Optimize mode kernel for integers in small…" to "ARROW-9873: [C++][Compute] Optimize mode kernel for integers in small value range" Sep 1, 2020
Member

I don't think polymorphism is a good idea for performance. Instead, you should probably use templated code.
Also, when null_count == length, the implementation can be trivial.
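
To illustrate both suggestions, here is a hedged sketch rather than the actual kernel code. `SimpleArray`, `ModeResult`, `HashCounter` and `ComputeMode` are hypothetical names, not Arrow classes. The counting strategy is passed as a template parameter instead of through a virtual interface, and the all-null case returns early.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a nullable array; not an Arrow class.
template <typename T>
struct SimpleArray {
  std::vector<T> values;
  std::vector<bool> valid;  // valid[i] == false marks a null slot

  int64_t length() const { return static_cast<int64_t>(values.size()); }
  int64_t null_count() const {
    return static_cast<int64_t>(std::count(valid.begin(), valid.end(), false));
  }
};

template <typename T>
struct ModeResult {
  bool has_value;  // false when every slot is null
  T mode;
  int64_t count;
};

// One possible counting policy; others can expose the same Add/Finish pair.
template <typename T>
struct HashCounter {
  std::unordered_map<T, int64_t> counts;

  void Add(T value) { ++counts[value]; }

  ModeResult<T> Finish() const {
    ModeResult<T> result{false, T{}, 0};
    for (const auto& kv : counts) {
      if (kv.second > result.count ||
          (kv.second == result.count && kv.first < result.mode)) {
        result = {true, kv.first, kv.second};
      }
    }
    return result;
  }
};

// Counter is a template parameter, so Add() is bound at compile time and can
// be inlined into the loop; there is no virtual call per element.
template <typename T, typename Counter>
ModeResult<T> ComputeMode(const SimpleArray<T>& array) {
  assert(array.values.size() == array.valid.size());
  if (array.null_count() == array.length()) {
    return {false, T{}, 0};  // all null (or empty): nothing to count
  }
  Counter counter;
  for (std::size_t i = 0; i < array.values.size(); ++i) {
    if (array.valid[i]) counter.Add(array.values[i]);
  }
  return counter.Finish();
}
```

A call such as `ComputeMode<int32_t, HashCounter<int32_t>>(array)` selects the policy at compile time; a value-indexed counter with the same `Add`/`Finish` interface could be swapped in without touching the loop.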

Contributor Author

Changed to templates, which did give better performance. Thanks.


cyb70289 commented Sep 2, 2020

Latest benchmark results after the re-implementation.

# Tested on skylake (knight landing)
$ archery benchmark diff --suite-filter="arrow-compute-aggregate-benchmark" --benchmark-filter="^Mode" --cc=clang-9 --cxx=clang++-9

                      benchmark         baseline        contender    change %
// ignore these 100% null cases: the huge boost comes from a simple trick and is not very useful
    ModeKernelBoolean/1048576/1  123.216 MiB/sec  847.125 GiB/sec  703915.356 'null_percent': 100.0
       ModeKernelInt8/1048576/1  896.330 MiB/sec  617.997 GiB/sec   70502.192 'null_percent': 100.0
      ModeKernelInt16/1048576/1    2.886 GiB/sec  965.237 GiB/sec   33340.541 'null_percent': 100.0
      ModeKernelInt32/1048576/1    5.732 GiB/sec  960.476 GiB/sec   16657.027 'null_percent': 100.0
      ModeKernelInt64/1048576/1    7.925 GiB/sec  974.705 GiB/sec   12198.487 'null_percent': 100.0

// big improvement for int16/32/64 with limited value range 
      ModeKernelInt16/1048576/0  128.522 MiB/sec  495.771 MiB/sec     285.749  'null_percent': 0.0
      ModeKernelInt32/1048576/0  257.694 MiB/sec  953.232 MiB/sec     269.909  'null_percent': 0.0
      ModeKernelInt64/1048576/0  516.624 MiB/sec    1.715 GiB/sec     240.027  'null_percent': 0.0
  ModeKernelInt32/1048576/10000  227.404 MiB/sec  690.032 MiB/sec     203.439 'null_percent': 0.01
  ModeKernelInt16/1048576/10000  115.419 MiB/sec  349.055 MiB/sec     202.425 'null_percent': 0.01
    ModeKernelInt32/1048576/100  229.661 MiB/sec  684.149 MiB/sec     197.895  'null_percent': 1.0
    ModeKernelInt16/1048576/100  116.084 MiB/sec  342.620 MiB/sec     195.148  'null_percent': 1.0
  ModeKernelInt64/1048576/10000  481.409 MiB/sec    1.302 GiB/sec     176.913 'null_percent': 0.01
    ModeKernelInt64/1048576/100  486.266 MiB/sec    1.297 GiB/sec     173.114  'null_percent': 1.0
     ModeKernelInt16/1048576/10  121.865 MiB/sec  315.932 MiB/sec     159.247 'null_percent': 10.0
     ModeKernelInt32/1048576/10  242.074 MiB/sec  625.162 MiB/sec     158.252 'null_percent': 10.0
     ModeKernelInt64/1048576/10  527.976 MiB/sec    1.199 GiB/sec     132.580 'null_percent': 10.0
      ModeKernelInt32/1048576/2  320.156 MiB/sec  429.196 MiB/sec      34.058 'null_percent': 50.0
      ModeKernelInt16/1048576/2  162.121 MiB/sec  196.310 MiB/sec      21.089 'null_percent': 50.0

// no obvious difference for bool/int8
     ModeKernelInt8/1048576/100  234.422 MiB/sec  251.464 MiB/sec       7.270  'null_percent': 1.0
      ModeKernelInt8/1048576/10  246.324 MiB/sec  258.110 MiB/sec       4.785 'null_percent': 10.0
   ModeKernelInt8/1048576/10000  239.496 MiB/sec  250.469 MiB/sec       4.582 'null_percent': 0.01
      ModeKernelInt64/1048576/2  812.020 MiB/sec  832.610 MiB/sec       2.536 'null_percent': 50.0
ModeKernelBoolean/1048576/10000   26.318 MiB/sec   26.509 MiB/sec       0.728 'null_percent': 0.01
  ModeKernelBoolean/1048576/100   26.510 MiB/sec   26.597 MiB/sec       0.327  'null_percent': 1.0
    ModeKernelBoolean/1048576/0   28.271 MiB/sec   28.274 MiB/sec       0.009  'null_percent': 0.0
       ModeKernelInt8/1048576/0  270.401 MiB/sec  269.025 MiB/sec      -0.509  'null_percent': 0.0
       ModeKernelInt8/1048576/2  190.410 MiB/sec  187.876 MiB/sec      -1.331 'null_percent': 50.0
   ModeKernelBoolean/1048576/10   28.007 MiB/sec   27.599 MiB/sec      -1.455 'null_percent': 10.0
    ModeKernelBoolean/1048576/2   27.157 MiB/sec   24.209 MiB/sec     -10.857 'null_percent': 50.0
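
For reading the table: the `change %` column appears to be the relative throughput difference, (contender - baseline) / baseline * 100. This is an assumption checked against the rows above, not a description of archery's internals, and `ChangePercent` is a hypothetical helper.

```cpp
#include <cassert>

// Relative change of contender over baseline, in percent.
// Both throughputs must be expressed in the same unit (e.g. MiB/sec).
double ChangePercent(double baseline, double contender) {
  assert(baseline > 0.0);
  return (contender - baseline) / baseline * 100.0;
}
```

For the ModeKernelInt16/1048576/0 row, (495.771 - 128.522) / 128.522 * 100 is roughly 285.7, which agrees with the reported 285.749 up to rounding of the printed throughputs.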

@cyb70289 cyb70289 marked this pull request as draft September 2, 2020 07:10

cyb70289 commented Sep 2, 2020

The "RTools 35" CI failure is unrelated to this change.

@cyb70289 cyb70289 marked this pull request as ready for review September 2, 2020 09:15

pitrou commented Sep 2, 2020

I get similar speedups on AMD Ryzen.

cyb70289 and others added 3 commits September 2, 2020 15:36

@pitrou pitrou left a comment

+1, thank you @cyb70289
