
PARQUET-1828: [C++] Use SSE2 for the ByteStreamSplit encoder #6723

Closed

Conversation

martinradev
Contributor

The ByteStreamSplit encoder can benefit from using SSE2 intrinsics to speed up data processing, since most of the data is likely in the cache.
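For context, here is a minimal scalar sketch of what ByteStreamSplit encoding does (illustrative only, not the code in this patch): for values of width K bytes, byte k of every value is written into the k-th of K contiguous output streams.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative scalar ByteStreamSplit encode for a fixed-width type T
// (float: K = 4 streams, double: K = 8 streams).
template <typename T>
void ByteStreamSplitEncodeScalarSketch(const T* values, int64_t num_values,
                                       uint8_t* out) {
  constexpr int K = sizeof(T);
  for (int64_t i = 0; i < num_values; ++i) {
    uint8_t raw[K];
    std::memcpy(raw, &values[i], K);     // reinterpret the value as raw bytes
    for (int k = 0; k < K; ++k) {
      out[k * num_values + i] = raw[k];  // byte k of value i goes to stream k
    }
  }
}
```

Since same-index bytes of neighboring floating-point values tend to be similar, the rearranged buffer usually compresses better afterwards.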

Benchmark results on a Sandy Bridge 3930K (columns: wall time, CPU time, iterations, throughput):
BM_ByteStreamSplitEncode_Float_Scalar/1024 264 ns 264 ns 2679735 bytes_per_second=14.4706G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 1003 ns 1003 ns 688811 bytes_per_second=15.2114G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 11933 ns 11926 ns 59187 bytes_per_second=10.2353G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 28137 ns 28137 ns 24634 bytes_per_second=8.67699G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2601 ns 2599 ns 266977 bytes_per_second=2.93583G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 32408 ns 32408 ns 21594 bytes_per_second=964.268M/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 228019 ns 227850 ns 3079 bytes_per_second=1097.21M/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 475051 ns 475051 ns 1477 bytes_per_second=1052.52M/s
BM_ByteStreamSplitEncode_Float_SSE2/1024 219 ns 219 ns 3156405 bytes_per_second=17.4093G/s
BM_ByteStreamSplitEncode_Float_SSE2/4096 860 ns 860 ns 796082 bytes_per_second=17.7457G/s
BM_ByteStreamSplitEncode_Float_SSE2/32768 11556 ns 11556 ns 61763 bytes_per_second=10.5632G/s
BM_ByteStreamSplitEncode_Float_SSE2/65536 27759 ns 27758 ns 25234 bytes_per_second=8.7953G/s
BM_ByteStreamSplitEncode_Double_SSE2/1024 433 ns 433 ns 1596315 bytes_per_second=17.6216G/s
BM_ByteStreamSplitEncode_Double_SSE2/4096 3358 ns 3358 ns 205874 bytes_per_second=9.08763G/s
BM_ByteStreamSplitEncode_Double_SSE2/32768 33812 ns 33812 ns 20505 bytes_per_second=7.22053G/s
BM_ByteStreamSplitEncode_Double_SSE2/65536 68419 ns 68417 ns 10078 bytes_per_second=7.13682G/s


@martinradev
Contributor Author

The patch improves encoding performance for float types by a little and for double types by a lot. This is in sync with my benchmark results which use the higher-level API call write_table: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

The patch also fixes a spelling mistake and removes some redundant test code.
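For C++ users, opting a column into this encoding when writing a file looks roughly like the sketch below (assuming the usual parquet::WriterProperties builder API; the column name float_col is hypothetical, and BYTE_STREAM_SPLIT requires dictionary encoding to be disabled for that column):

```cpp
#include <memory>
#include "parquet/properties.h"

// Sketch: request BYTE_STREAM_SPLIT encoding for one column.
std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  return parquet::WriterProperties::Builder()
      .disable_dictionary("float_col")  // BYTE_STREAM_SPLIT is a non-dictionary encoding
      ->encoding("float_col", parquet::Encoding::BYTE_STREAM_SPLIT)
      ->build();
}
```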

@martinradev
Contributor Author

The two failing tests seem unrelated.

@martinradev
Contributor Author

I was also not able to further unify the code for float and double encoding. There's an extra branch, which the compiler should be able to optimize out.

@martinradev
Contributor Author

@pitrou would you like to have a look?

@pitrou
Member

pitrou commented Mar 26, 2020

Benchmarks on an AMD Ryzen:

-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Float_Scalar/1024         1256 ns         1256 ns       556240 bytes_per_second=3.03831G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5019 ns         5019 ns       139246 bytes_per_second=3.04051G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       40658 ns        40653 ns        17224 bytes_per_second=3.00272G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       81418 ns        81406 ns         8597 bytes_per_second=2.99905G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2483 ns         2483 ns       289091 bytes_per_second=3.07281G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       10840 ns        10839 ns        64802 bytes_per_second=2.81557G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      84144 ns        84133 ns         8309 bytes_per_second=2.90186G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     168132 ns       168109 ns         4153 bytes_per_second=2.90456G/s
BM_ByteStreamSplitEncode_Float_SSE2/1024            181 ns          181 ns      3853519 bytes_per_second=21.0244G/s
BM_ByteStreamSplitEncode_Float_SSE2/4096            724 ns          723 ns       967243 bytes_per_second=21.0913G/s
BM_ByteStreamSplitEncode_Float_SSE2/32768          5791 ns         5790 ns       120647 bytes_per_second=21.0813G/s
BM_ByteStreamSplitEncode_Float_SSE2/65536         11632 ns        11630 ns        60418 bytes_per_second=20.9924G/s
BM_ByteStreamSplitEncode_Double_SSE2/1024           584 ns          584 ns      1193251 bytes_per_second=13.0651G/s
BM_ByteStreamSplitEncode_Double_SSE2/4096          2611 ns         2610 ns       267811 bytes_per_second=11.6911G/s
BM_ByteStreamSplitEncode_Double_SSE2/32768        20848 ns        20845 ns        33477 bytes_per_second=11.712G/s
BM_ByteStreamSplitEncode_Double_SSE2/65536        42722 ns        42715 ns        16044 bytes_per_second=11.4312G/s

@pitrou
Member

pitrou left a review comment

I won't try to understand the algorithm :-) Perhaps you want to add a small comment explaining it as you did in the SSE2 decoding function?

Review comment on cpp/src/parquet/encoding_test.cc (resolved)
@wesm
Member

wesm commented Mar 26, 2020

Impressive results. It doesn't match my intuition that SSE2 would yield 4-7x speedups, but it's good to know that's possible.

@martinradev
Contributor Author

Technically it should be no more than 4x.
I haven't examined what the compiler emits, but with SSE2 there are essentially more registers to work with: the SSE registers hold the data values while the standard x64 registers handle address calculations, counting, etc.
I suppose there might be some register spilling, or the compiler is not able to efficiently unroll the scalar loop. I do not know.
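For readers following along, here is a rough sketch of one way such an SSE2 encoder can be structured for 32-bit values (illustrative only, not necessarily the merged implementation; the function name is hypothetical). Three rounds of byte unpacking plus a final 64-bit unpack transpose 16 floats at a time so that equal-index bytes land contiguously:

```cpp
#include <emmintrin.h>
#include <cstdint>

// Sketch of an SSE2 ByteStreamSplit encoder for 32-bit values.
void EncodeFloatSse2Sketch(const uint8_t* raw, int64_t num_values, uint8_t* out) {
  const int64_t full_blocks = num_values / 16;  // 16 floats = 64 bytes per block
  for (int64_t block = 0; block < full_blocks; ++block) {
    const uint8_t* src = raw + block * 64;
    // r0..r3 each hold 4 floats in value-major byte order.
    __m128i r0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i r1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
    __m128i r2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
    __m128i r3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
    // Three rounds of byte interleaving transpose the 16x4 byte matrix.
    __m128i t0 = _mm_unpacklo_epi8(r0, r1), t1 = _mm_unpackhi_epi8(r0, r1);
    __m128i t2 = _mm_unpacklo_epi8(r2, r3), t3 = _mm_unpackhi_epi8(r2, r3);
    __m128i u0 = _mm_unpacklo_epi8(t0, t1), u1 = _mm_unpackhi_epi8(t0, t1);
    __m128i u2 = _mm_unpacklo_epi8(t2, t3), u3 = _mm_unpackhi_epi8(t2, t3);
    __m128i w0 = _mm_unpacklo_epi8(u0, u1), w1 = _mm_unpackhi_epi8(u0, u1);
    __m128i w2 = _mm_unpacklo_epi8(u2, u3), w3 = _mm_unpackhi_epi8(u2, u3);
    // Combine 64-bit halves: stream k gets byte k of all 16 values.
    __m128i s0 = _mm_unpacklo_epi64(w0, w2);  // byte 0 of values 0..15
    __m128i s1 = _mm_unpackhi_epi64(w0, w2);  // byte 1
    __m128i s2 = _mm_unpacklo_epi64(w1, w3);  // byte 2
    __m128i s3 = _mm_unpackhi_epi64(w1, w3);  // byte 3
    uint8_t* dst = out + block * 16;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 0 * num_values), s0);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 1 * num_values), s1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * num_values), s2);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 3 * num_values), s3);
  }
  // Scalar tail for a value count not divisible by 16.
  for (int64_t i = full_blocks * 16; i < num_values; ++i) {
    for (int b = 0; b < 4; ++b) out[b * num_values + i] = raw[i * 4 + b];
  }
}
```

All of the shuffling stays in xmm registers while the general-purpose registers handle only addresses and counters, which matches the register-pressure argument above.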

@martinradev requested a review from pitrou on March 27, 2020 16:59
@cyb70289
Contributor

Amazing improvement.
@martinradev, is it okay if I port your algorithm to Arm64 Neon? I guess it will also benefit.

@martinradev
Contributor Author

Yeah, if you wish.
I wonder whether the current scalar implementation is correct on big-endian architectures.

@martinradev
Contributor Author

Do we even run tests on ARM?

@pitrou
Member

pitrou commented Mar 30, 2020

@martinradev The Arrow CI doesn't, but I suppose @cyb70289 does. Also, ARM is often little-endian nowadays.

@pitrou closed this in da94098 on Mar 30, 2020
@pitrou
Member

pitrou commented Mar 30, 2020

Thank you @martinradev !
