ARROW-8496: [C++] Refine ByteStreamSplitDecodeScalar #6962
I simplified the DecoderScalar code and see a huge performance boost from the code clang generates. In my test on an Intel E5-2650 with clang-9, the Decode_Float_Scalar benchmark jumps from 600M/s to 20G/s, which is even better than the SSE version (17G/s). Similar behaviour is observed on Arm64.
Some digging shows that clang auto-vectorizes the simplified decoder code, but gcc cannot: https://godbolt.org/z/kq9FAs
Interestingly, gcc is able to auto-vectorize the EncoderFloatScalar code, but clang cannot: https://godbolt.org/z/E3LnZD
NOTE: This scalar code is not tested in the default x86_64 build, which uses the SSE version. The Arm64 build takes this scalar code path.
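For readers unfamiliar with the encoding, here is a minimal sketch of what a simple scalar byte-stream-split decoder can look like. It is not the code from this PR: the function names `DecodeByteStreamSplit` and `EncodeByteStreamSplit`, their signatures, and the `stride` parameter are hypothetical and chosen for illustration only. The sketch assumes the usual layout, where stream `b` holds byte `b` of every value and consecutive streams are `stride` bytes apart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not the Arrow API): split each value of type T into
// sizeof(T) byte streams. Stream b holds byte b of every value, in order.
template <typename T>
std::vector<uint8_t> EncodeByteStreamSplit(const T* values, size_t num_values) {
  constexpr size_t kNumStreams = sizeof(T);
  std::vector<uint8_t> encoded(num_values * kNumStreams);
  const uint8_t* raw = reinterpret_cast<const uint8_t*>(values);
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t b = 0; b < kNumStreams; ++b) {
      encoded[b * num_values + i] = raw[i * kNumStreams + b];
    }
  }
  return encoded;
}

// Hypothetical helper (not the Arrow API): gather byte b of value i from
// stream b and write it back into its position inside the output value.
// `stride` is the distance in bytes between the starts of two streams.
template <typename T>
void DecodeByteStreamSplit(const uint8_t* data, size_t num_values,
                           size_t stride, T* out) {
  assert(num_values <= stride);  // each stream must hold at least num_values bytes
  constexpr size_t kNumStreams = sizeof(T);
  uint8_t* out_bytes = reinterpret_cast<uint8_t*>(out);
  // A plain nested loop with a compile-time inner trip count. Loops of this
  // shape are the kind a compiler may be able to auto-vectorize, though
  // whether it does depends on the compiler and flags.
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t b = 0; b < kNumStreams; ++b) {
      out_bytes[i * kNumStreams + b] = data[b * stride + i];
    }
  }
}
```

Decoding the output of the encoder with `stride == num_values` should reproduce the input values byte for byte. Whether a given compiler vectorizes either loop is best checked on the generated assembly, as the godbolt links above do.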
Impressive! Here are some benchmarks on AMD Ryzen 3900X, clang 9.0.0:
Here the clang-vectorized
@jianxind There may be some inspiration for the hand-vectorized versions here.
Benchmarks on the same AMD CPU, but with gcc 7.5.0:
+1
Closes #6962 from cyb70289/bytesplit
Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>