ARROW-8496: [C++] Refine ByteStreamSplitDecodeScalar #6962

Closed
cyb70289
Contributor

I simplified the DecoderScalar code and saw a huge performance boost from
the clang-generated code. In my test on an Intel E5-2650 with clang-9, the
Decode_Float_Scalar benchmark jumps from 600M/s to 20G/s, which is even
faster than the SSE version (17G/s). Similar behaviour is observed on Arm64.

Some digging shows that clang auto-vectorizes the simplified decoder code,
but gcc cannot: https://godbolt.org/z/kq9FAs
Interestingly, gcc is able to auto-vectorize the EncoderFloatScalar code,
but clang cannot: https://godbolt.org/z/E3LnZD

NOTE: This scalar code is not exercised in the default x86_64 build, which
uses the SSE version. The Arm64 build takes this scalar code path.
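
For readers who have not seen the encoding, here is a minimal sketch of the kind of plain nested loop the description is talking about. It is not the code from this PR: the names `SplitEncode` and `SplitDecodeScalar` are hypothetical, and the layout (byte `b` of value `i` stored at `b * stride + i`) is an assumption based on how byte-stream-split encoding is usually described. The point is only that the decoder is a fixed-trip-count inner loop with no branches, which is the shape compilers tend to auto-vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: scatter byte b of every value into stream b.
// Stream b occupies out[b * n .. b * n + n - 1].
template <typename T>
std::vector<uint8_t> SplitEncode(const std::vector<T>& values) {
  const size_t n = values.size();
  std::vector<uint8_t> out(n * sizeof(T));
  const uint8_t* raw = reinterpret_cast<const uint8_t*>(values.data());
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(T); ++b) {
      out[b * n + i] = raw[i * sizeof(T) + b];
    }
  }
  return out;
}

// Hypothetical scalar decoder: gather byte b of value i from stream b.
// `stride` is the distance in bytes between the starts of two streams.
template <typename T>
void SplitDecodeScalar(const uint8_t* data, int64_t num_values,
                       int64_t stride, T* out) {
  assert(stride >= num_values);  // streams must not overlap
  constexpr size_t kNumStreams = sizeof(T);
  uint8_t* raw = reinterpret_cast<uint8_t*>(out);
  for (int64_t i = 0; i < num_values; ++i) {
    // Fixed trip count (4 for float, 8 for double), no branches.
    for (size_t b = 0; b < kNumStreams; ++b) {
      raw[i * kNumStreams + b] = data[b * stride + i];
    }
  }
}
```

Decoding the output of `SplitEncode` with `stride` equal to the number of values should give back the original values. Whether a given compiler vectorizes this loop depends on the compiler version and flags; the godbolt links above are the authoritative comparison.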

@pitrou
Member

pitrou commented Apr 18, 2020

Impressive! Here are some benchmarks on AMD Ryzen 3900X, clang 9.0.0:

  • Before:
BM_ByteStreamSplitDecode_Float_Scalar/1024         5913 ns         5913 ns       118357 bytes_per_second=660.663M/s
BM_ByteStreamSplitDecode_Float_Scalar/4096        23659 ns        23656 ns        29596 bytes_per_second=660.516M/s
BM_ByteStreamSplitDecode_Float_Scalar/32768      190297 ns       190273 ns         3682 bytes_per_second=656.952M/s
BM_ByteStreamSplitDecode_Float_Scalar/65536      381438 ns       381388 ns         1835 bytes_per_second=655.5M/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        5925 ns         5924 ns       118225 bytes_per_second=1.28782G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       23884 ns        23881 ns        29317 bytes_per_second=1.2779G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768     191488 ns       191464 ns         3658 bytes_per_second=1.27513G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     383239 ns       383184 ns         1828 bytes_per_second=1.27427G/s
  • After:
BM_ByteStreamSplitDecode_Float_Scalar/1024         68.7 ns         68.7 ns     10215790 bytes_per_second=55.5503G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096          249 ns          249 ns      2807852 bytes_per_second=61.2114G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768        2063 ns         2063 ns       339260 bytes_per_second=59.1792G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536        4894 ns         4893 ns       151927 bytes_per_second=49.8979G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024         650 ns          650 ns      1071757 bytes_per_second=11.7349G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        2623 ns         2623 ns       266614 bytes_per_second=11.6351G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      20962 ns        20959 ns        33429 bytes_per_second=11.6484G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536      42958 ns        42952 ns        16292 bytes_per_second=11.3681G/s

Here the clang-vectorized ByteStreamSplitDecode_Float_Scalar is even faster than the hand-written AVX2 version (60 GB/s vs. 40 GB/s).
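
As a rough guide to reading the `bytes_per_second` column, the figures appear consistent with the benchmark argument being the number of values and with binary prefixes (M = 2^20, G = 2^30). Both points are inferred from the numbers above rather than stated in this thread. A small sketch with a hypothetical helper `GiBPerSecond`:

```cpp
#include <cstdint>

// Hypothetical helper: convert a payload size and an elapsed time into
// GiB/s, assuming 1 GiB = 2^30 bytes.
double GiBPerSecond(int64_t num_values, int64_t bytes_per_value,
                    double elapsed_ns) {
  const double bytes = static_cast<double>(num_values * bytes_per_value);
  const double seconds = elapsed_ns * 1e-9;
  return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

For example, `GiBPerSecond(1024, 4, 68.7)` comes out at roughly 55.5, in line with the first "After" row for floats; small differences are expected because the printed times are rounded.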

@pitrou
Member

pitrou commented Apr 18, 2020

@jianxind There may be some inspiration for the hand-vectorized versions here.

@pitrou
Member

pitrou commented Apr 18, 2020

Benchmarks on the same AMD CPU, but with gcc 7.5.0:

  • Before:
BM_ByteStreamSplitDecode_Float_Scalar/1024         1233 ns         1233 ns       569058 bytes_per_second=3.09409G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4926 ns         4926 ns       142097 bytes_per_second=3.09783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       37446 ns        37442 ns        17618 bytes_per_second=3.26028G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       75117 ns        75107 ns         9138 bytes_per_second=3.25056G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        3254 ns         3253 ns       212756 bytes_per_second=2.34521G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       13026 ns        13025 ns        52728 bytes_per_second=2.34308G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768     104245 ns       104232 ns         6632 bytes_per_second=2.34227G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     209211 ns       209187 ns         3293 bytes_per_second=2.33418G/s
  • After:
BM_ByteStreamSplitDecode_Float_Scalar/1024          184 ns          183 ns      3651302 bytes_per_second=20.7903G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096          734 ns          734 ns       947941 bytes_per_second=20.7934G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768        5881 ns         5880 ns       118941 bytes_per_second=20.7589G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       11925 ns        11923 ns        58902 bytes_per_second=20.4761G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024         454 ns          453 ns      1543895 bytes_per_second=16.8246G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        1834 ns         1834 ns       379755 bytes_per_second=16.6404G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      14701 ns        14699 ns        47171 bytes_per_second=16.6095G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536      29892 ns        29888 ns        23436 bytes_per_second=16.3373G/s

@pitrou pitrou left a comment

+1

@pitrou pitrou closed this in d3a9a84 Apr 18, 2020
@cyb70289 cyb70289 deleted the bytesplit branch April 18, 2020 14:25
kszucs pushed a commit that referenced this pull request Apr 20, 2020
Closes #6962 from cyb70289/bytesplit

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>