PARQUET-1828: [C++] Use SSE2 for the ByteStreamSplit encoder #6723
Conversation
The ByteStreamSplit encoder can benefit from using SSE2 intrinsics to speed up data processing since most of the data is likely in the cache.
The patch improves encoding performance for float types by a little and for double types by a lot. This is in sync with my benchmark results which use the higher-level API call write_table: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf The patch also addresses a spelling mistake and some redundant test code.
The two failing tests seem unrelated.
I was also not able to better unify the code for float and double encoding. There's an extra branch which the compiler should be able to optimize out.
@pitrou would you like to have a look?
Benchmarks on an AMD Ryzen:
I won't try to understand the algorithm :-) Perhaps you want to add a small comment explaining it as you did in the SSE2 decoding function?
Impressive results. It doesn't match my intuition that SSE2 would yield 4-7x speedups, but it's good to know that's possible.
Technically it should be no more than 4x.
Amazing improvement.
Yeah, if you wish.
Do we even run tests on ARM? |
@martinradev The Arrow CI doesn't, but I suppose @cyb70289 does. Also, ARM is often little-endian nowadays. |
Thank you @martinradev ! |
The ByteStreamSplit encoder can benefit from using SSE2 intrinsics to speed up data processing since most of the data is likely in the cache.
Benchmark results on Sandy Bridge 3930k:
BM_ByteStreamSplitEncode_Float_Scalar/1024 264 ns 264 ns 2679735 bytes_per_second=14.4706G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 1003 ns 1003 ns 688811 bytes_per_second=15.2114G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 11933 ns 11926 ns 59187 bytes_per_second=10.2353G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 28137 ns 28137 ns 24634 bytes_per_second=8.67699G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2601 ns 2599 ns 266977 bytes_per_second=2.93583G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 32408 ns 32408 ns 21594 bytes_per_second=964.268M/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 228019 ns 227850 ns 3079 bytes_per_second=1097.21M/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 475051 ns 475051 ns 1477 bytes_per_second=1052.52M/s
BM_ByteStreamSplitEncode_Float_SSE2/1024 219 ns 219 ns 3156405 bytes_per_second=17.4093G/s
BM_ByteStreamSplitEncode_Float_SSE2/4096 860 ns 860 ns 796082 bytes_per_second=17.7457G/s
BM_ByteStreamSplitEncode_Float_SSE2/32768 11556 ns 11556 ns 61763 bytes_per_second=10.5632G/s
BM_ByteStreamSplitEncode_Float_SSE2/65536 27759 ns 27758 ns 25234 bytes_per_second=8.7953G/s
BM_ByteStreamSplitEncode_Double_SSE2/1024 433 ns 433 ns 1596315 bytes_per_second=17.6216G/s
BM_ByteStreamSplitEncode_Double_SSE2/4096 3358 ns 3358 ns 205874 bytes_per_second=9.08763G/s
BM_ByteStreamSplitEncode_Double_SSE2/32768 33812 ns 33812 ns 20505 bytes_per_second=7.22053G/s
BM_ByteStreamSplitEncode_Double_SSE2/65536 68419 ns 68417 ns 10078 bytes_per_second=7.13682G/s