ARROW-8496: [C++] Refine ByteStreamSplitDecodeScalar #6962

cyb70289 · 2020-04-17T07:18:53Z

I simplified the DecoderScalar code and see a huge performance boost from
clang-generated code. In my test on an Intel E5-2650 with clang-9, the
Decode_Float_Scalar benchmark jumps from 600M/s to 20G/s, even faster
than the SSE version (17G/s). Similar behaviour is observed on Arm64.

Some digging shows that clang auto-vectorizes the simplified decoder code,
but gcc does not: https://godbolt.org/z/kq9FAs
Interestingly, gcc is able to auto-vectorize the EncoderFloatScalar code,
but clang is not: https://godbolt.org/z/E3LnZD

NOTE: This scalar code is not tested in the default x86_64 build, which
uses the SSE version. The Arm64 build takes this scalar code path.
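
For readers unfamiliar with the encoding, here is a minimal sketch of what a simplified scalar decoder can look like. It is not the code from this patch: the function names are hypothetical, and it assumes the usual byte-stream-split layout in which stream b holds byte b of every value and consecutive streams are `stride` bytes apart. The encoder is included only so the decoder can be round-tripped. Whether a given compiler auto-vectorizes the loop depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical name, for illustration only (not the Arrow function).
// Assumed layout: data[b * stride + i] holds byte b of value i.
template <typename T>
void DecodeByteStreamSplitSketch(const uint8_t* data, int64_t num_values,
                                 int64_t stride, T* out) {
  constexpr std::size_t kNumStreams = sizeof(T);
  assert(stride >= num_values);  // streams must not overlap
  auto* out_bytes = reinterpret_cast<uint8_t*>(out);
  const auto n = static_cast<std::size_t>(num_values);
  const auto s = static_cast<std::size_t>(stride);
  // Plain nested loop with no branches or temporaries: gather one byte
  // from each stream and write the bytes of value i contiguously.
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t b = 0; b < kNumStreams; ++b) {
      out_bytes[i * kNumStreams + b] = data[b * s + i];
    }
  }
}

// Hypothetical matching encoder: scatter byte b of value i into stream b.
template <typename T>
std::vector<uint8_t> EncodeByteStreamSplitSketch(const std::vector<T>& values) {
  const std::size_t n = values.size();
  std::vector<uint8_t> encoded(n * sizeof(T));
  const auto* raw = reinterpret_cast<const uint8_t*>(values.data());
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t b = 0; b < sizeof(T); ++b) {
      encoded[b * n + i] = raw[i * sizeof(T) + b];
    }
  }
  return encoded;
}
```

Decoding the output of the encoder with `stride` equal to the number of values should reproduce the input bit for bit.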

github-actions · 2020-04-17T07:31:32Z

https://issues.apache.org/jira/browse/ARROW-8496

pitrou · 2020-04-18T09:17:37Z

Impressive! Here are some benchmarks on AMD Ryzen 3900X, clang 9.0.0:

Before:

BM_ByteStreamSplitDecode_Float_Scalar/1024         5913 ns         5913 ns       118357 bytes_per_second=660.663M/s
BM_ByteStreamSplitDecode_Float_Scalar/4096        23659 ns        23656 ns        29596 bytes_per_second=660.516M/s
BM_ByteStreamSplitDecode_Float_Scalar/32768      190297 ns       190273 ns         3682 bytes_per_second=656.952M/s
BM_ByteStreamSplitDecode_Float_Scalar/65536      381438 ns       381388 ns         1835 bytes_per_second=655.5M/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        5925 ns         5924 ns       118225 bytes_per_second=1.28782G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       23884 ns        23881 ns        29317 bytes_per_second=1.2779G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768     191488 ns       191464 ns         3658 bytes_per_second=1.27513G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     383239 ns       383184 ns         1828 bytes_per_second=1.27427G/s

After:

BM_ByteStreamSplitDecode_Float_Scalar/1024         68.7 ns         68.7 ns     10215790 bytes_per_second=55.5503G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096          249 ns          249 ns      2807852 bytes_per_second=61.2114G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768        2063 ns         2063 ns       339260 bytes_per_second=59.1792G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536        4894 ns         4893 ns       151927 bytes_per_second=49.8979G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024         650 ns          650 ns      1071757 bytes_per_second=11.7349G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        2623 ns         2623 ns       266614 bytes_per_second=11.6351G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      20962 ns        20959 ns        33429 bytes_per_second=11.6484G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536      42958 ns        42952 ns        16292 bytes_per_second=11.3681G/s

Here the clang-vectorized ByteStreamSplitDecode_Float_Scalar is even faster than the hand-written AVX2 version (60 GB/s vs. 40 GB/s).
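
As a reading aid for the tables, the reported rates are consistent with the benchmark argument being the number of values and with M and G being binary units (MiB, GiB). This is an inference from the numbers, not something stated in the benchmark source. For the first "Before" row, 1024 floats are 4096 bytes, and 4096 bytes in 5913 ns is about 660.6 MiB/s. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: throughput in MiB/s, assuming the benchmark
// argument counts values (not bytes) and 1 MiB = 1024 * 1024 bytes.
double ThroughputMiBPerSec(int64_t num_values, std::size_t value_size,
                           double elapsed_ns) {
  assert(elapsed_ns > 0.0);
  const double bytes =
      static_cast<double>(num_values) * static_cast<double>(value_size);
  const double seconds = elapsed_ns * 1e-9;
  return bytes / seconds / (1024.0 * 1024.0);
}
```

Calling `ThroughputMiBPerSec(1024, sizeof(float), 5913.0)` gives roughly 660.6, and the first "After" row (68.7 ns) gives roughly 55.5 GiB/s after dividing by a further 1024. Small differences from the printed rates come from the rounded times in the table.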

pitrou · 2020-04-18T09:18:40Z

@jianxind There may be some inspiration for the hand-vectorized versions here.

pitrou · 2020-04-18T09:22:51Z

Benchmarks on the same AMD CPU, but with gcc 7.5.0:

Before:

BM_ByteStreamSplitDecode_Float_Scalar/1024         1233 ns         1233 ns       569058 bytes_per_second=3.09409G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4926 ns         4926 ns       142097 bytes_per_second=3.09783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       37446 ns        37442 ns        17618 bytes_per_second=3.26028G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       75117 ns        75107 ns         9138 bytes_per_second=3.25056G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        3254 ns         3253 ns       212756 bytes_per_second=2.34521G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       13026 ns        13025 ns        52728 bytes_per_second=2.34308G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768     104245 ns       104232 ns         6632 bytes_per_second=2.34227G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     209211 ns       209187 ns         3293 bytes_per_second=2.33418G/s

After:

BM_ByteStreamSplitDecode_Float_Scalar/1024          184 ns          183 ns      3651302 bytes_per_second=20.7903G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096          734 ns          734 ns       947941 bytes_per_second=20.7934G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768        5881 ns         5880 ns       118941 bytes_per_second=20.7589G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       11925 ns        11923 ns        58902 bytes_per_second=20.4761G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024         454 ns          453 ns      1543895 bytes_per_second=16.8246G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        1834 ns         1834 ns       379755 bytes_per_second=16.6404G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      14701 ns        14699 ns        47171 bytes_per_second=16.6095G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536      29892 ns        29888 ns        23436 bytes_per_second=16.3373G/s

pitrou

+1

Closes #6962 from cyb70289/bytesplit
Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

pitrou approved these changes Apr 18, 2020

pitrou closed this in d3a9a84 Apr 18, 2020

cyb70289 deleted the bytesplit branch April 18, 2020 14:25

asfimport mentioned this pull request Jul 14, 2020

[C++] Refine ByteStreamSplitDecodeScalar #24668

Closed
