
PARQUET-1828: [C++] Use SSE2 for the ByteStreamSplit encoder #6723

Closed

Conversation

martinradev
Contributor

The ByteStreamSplit encoder can benefit from using SSE2 intrinsics to speed up data processing, since most of the data is likely in the cache.
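For context, here is a minimal scalar sketch of what ByteStreamSplit encoding does (illustrative only, not the code in this patch): for values of width K bytes, byte k of every value is written into the k-th of K contiguous output streams.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative scalar ByteStreamSplit encode for a fixed-width type T
// (float: K = 4 streams, double: K = 8 streams).
template <typename T>
void ByteStreamSplitEncodeScalarSketch(const T* values, int64_t num_values,
                                       uint8_t* out) {
  constexpr int K = sizeof(T);
  for (int64_t i = 0; i < num_values; ++i) {
    uint8_t raw[K];
    std::memcpy(raw, &values[i], K);     // reinterpret the value as raw bytes
    for (int k = 0; k < K; ++k) {
      out[k * num_values + i] = raw[k];  // byte k of value i goes to stream k
    }
  }
}
```

Since same-index bytes of neighboring floating-point values tend to be similar, the rearranged buffer usually compresses better afterwards.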

Benchmark results on a Sandy Bridge 3930K (columns: wall time, CPU time, iterations, throughput):
BM_ByteStreamSplitEncode_Float_Scalar/1024 264 ns 264 ns 2679735 bytes_per_second=14.4706G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 1003 ns 1003 ns 688811 bytes_per_second=15.2114G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 11933 ns 11926 ns 59187 bytes_per_second=10.2353G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 28137 ns 28137 ns 24634 bytes_per_second=8.67699G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2601 ns 2599 ns 266977 bytes_per_second=2.93583G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 32408 ns 32408 ns 21594 bytes_per_second=964.268M/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 228019 ns 227850 ns 3079 bytes_per_second=1097.21M/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 475051 ns 475051 ns 1477 bytes_per_second=1052.52M/s
BM_ByteStreamSplitEncode_Float_SSE2/1024 219 ns 219 ns 3156405 bytes_per_second=17.4093G/s
BM_ByteStreamSplitEncode_Float_SSE2/4096 860 ns 860 ns 796082 bytes_per_second=17.7457G/s
BM_ByteStreamSplitEncode_Float_SSE2/32768 11556 ns 11556 ns 61763 bytes_per_second=10.5632G/s
BM_ByteStreamSplitEncode_Float_SSE2/65536 27759 ns 27758 ns 25234 bytes_per_second=8.7953G/s
BM_ByteStreamSplitEncode_Double_SSE2/1024 433 ns 433 ns 1596315 bytes_per_second=17.6216G/s
BM_ByteStreamSplitEncode_Double_SSE2/4096 3358 ns 3358 ns 205874 bytes_per_second=9.08763G/s
BM_ByteStreamSplitEncode_Double_SSE2/32768 33812 ns 33812 ns 20505 bytes_per_second=7.22053G/s
BM_ByteStreamSplitEncode_Double_SSE2/65536 68419 ns 68417 ns 10078 bytes_per_second=7.13682G/s


@martinradev
Contributor Author

The patch improves encoding performance for float types by a little and for double types by a lot. This is in sync with my benchmark results which use the higher-level API call write_table: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

The patch also fixes a spelling mistake and removes some redundant test code.
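For C++ users, opting a column into this encoding when writing a file looks roughly like the sketch below (assuming the usual parquet::WriterProperties builder API; the column name float_col is hypothetical, and BYTE_STREAM_SPLIT requires dictionary encoding to be disabled for that column):

```cpp
#include <memory>
#include "parquet/properties.h"

// Sketch: request BYTE_STREAM_SPLIT encoding for one column.
std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  return parquet::WriterProperties::Builder()
      .disable_dictionary("float_col")  // BYTE_STREAM_SPLIT is a non-dictionary encoding
      ->encoding("float_col", parquet::Encoding::BYTE_STREAM_SPLIT)
      ->build();
}
```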

@martinradev
Contributor Author

The two failing tests seem unrelated.

@martinradev
Contributor Author

I was also not able to further unify the code for float and double encoding. There's an extra branch, which the compiler should be able to optimize out.

@martinradev
Contributor Author

@pitrou would you like to have a look?

@pitrou
Member

pitrou commented Mar 26, 2020

Benchmarks on an AMD Ryzen:

-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Float_Scalar/1024         1256 ns         1256 ns       556240 bytes_per_second=3.03831G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5019 ns         5019 ns       139246 bytes_per_second=3.04051G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       40658 ns        40653 ns        17224 bytes_per_second=3.00272G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       81418 ns        81406 ns         8597 bytes_per_second=2.99905G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2483 ns         2483 ns       289091 bytes_per_second=3.07281G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       10840 ns        10839 ns        64802 bytes_per_second=2.81557G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      84144 ns        84133 ns         8309 bytes_per_second=2.90186G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     168132 ns       168109 ns         4153 bytes_per_second=2.90456G/s
BM_ByteStreamSplitEncode_Float_SSE2/1024            181 ns          181 ns      3853519 bytes_per_second=21.0244G/s
BM_ByteStreamSplitEncode_Float_SSE2/4096            724 ns          723 ns       967243 bytes_per_second=21.0913G/s
BM_ByteStreamSplitEncode_Float_SSE2/32768          5791 ns         5790 ns       120647 bytes_per_second=21.0813G/s
BM_ByteStreamSplitEncode_Float_SSE2/65536         11632 ns        11630 ns        60418 bytes_per_second=20.9924G/s
BM_ByteStreamSplitEncode_Double_SSE2/1024           584 ns          584 ns      1193251 bytes_per_second=13.0651G/s
BM_ByteStreamSplitEncode_Double_SSE2/4096          2611 ns         2610 ns       267811 bytes_per_second=11.6911G/s
BM_ByteStreamSplitEncode_Double_SSE2/32768        20848 ns        20845 ns        33477 bytes_per_second=11.712G/s
BM_ByteStreamSplitEncode_Double_SSE2/65536        42722 ns        42715 ns        16044 bytes_per_second=11.4312G/s

@pitrou
Member

pitrou left a review comment

I won't try to understand the algorithm :-) Perhaps you want to add a small comment explaining it as you did in the SSE2 decoding function?

Review comment on cpp/src/parquet/encoding_test.cc (resolved)
@wesm
Member

wesm commented Mar 26, 2020

Impressive results. It doesn't match my intuition that SSE2 would yield 4-7x speedups, but it's good to know that's possible.

@martinradev
Contributor Author

Technically it should be no more than 4x.
I haven't examined what the compiler emits, but with SSE2 there are essentially more registers to work with: the SSE registers hold the data values while the standard x64 registers handle address calculations, counting, etc.
I suppose there might be some register spilling, or the compiler is not able to efficiently unroll the scalar loop. I do not know.
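For readers following along, here is a rough sketch of one way such an SSE2 encoder can be structured for 32-bit values (illustrative only, not necessarily the merged implementation; the function name is hypothetical). Three rounds of byte unpacking plus a final 64-bit unpack transpose 16 floats at a time so that equal-index bytes land contiguously:

```cpp
#include <emmintrin.h>
#include <cstdint>

// Sketch of an SSE2 ByteStreamSplit encoder for 32-bit values.
void EncodeFloatSse2Sketch(const uint8_t* raw, int64_t num_values, uint8_t* out) {
  const int64_t full_blocks = num_values / 16;  // 16 floats = 64 bytes per block
  for (int64_t block = 0; block < full_blocks; ++block) {
    const uint8_t* src = raw + block * 64;
    // r0..r3 each hold 4 floats in value-major byte order.
    __m128i r0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i r1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
    __m128i r2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
    __m128i r3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
    // Three rounds of byte interleaving transpose the 16x4 byte matrix.
    __m128i t0 = _mm_unpacklo_epi8(r0, r1), t1 = _mm_unpackhi_epi8(r0, r1);
    __m128i t2 = _mm_unpacklo_epi8(r2, r3), t3 = _mm_unpackhi_epi8(r2, r3);
    __m128i u0 = _mm_unpacklo_epi8(t0, t1), u1 = _mm_unpackhi_epi8(t0, t1);
    __m128i u2 = _mm_unpacklo_epi8(t2, t3), u3 = _mm_unpackhi_epi8(t2, t3);
    __m128i w0 = _mm_unpacklo_epi8(u0, u1), w1 = _mm_unpackhi_epi8(u0, u1);
    __m128i w2 = _mm_unpacklo_epi8(u2, u3), w3 = _mm_unpackhi_epi8(u2, u3);
    // Combine 64-bit halves: stream k gets byte k of all 16 values.
    __m128i s0 = _mm_unpacklo_epi64(w0, w2);  // byte 0 of values 0..15
    __m128i s1 = _mm_unpackhi_epi64(w0, w2);  // byte 1
    __m128i s2 = _mm_unpacklo_epi64(w1, w3);  // byte 2
    __m128i s3 = _mm_unpackhi_epi64(w1, w3);  // byte 3
    uint8_t* dst = out + block * 16;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 0 * num_values), s0);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 1 * num_values), s1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * num_values), s2);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 3 * num_values), s3);
  }
  // Scalar tail for a value count not divisible by 16.
  for (int64_t i = full_blocks * 16; i < num_values; ++i) {
    for (int b = 0; b < 4; ++b) out[b * num_values + i] = raw[i * 4 + b];
  }
}
```

All of the shuffling stays in xmm registers while the general-purpose registers handle only addresses and counters, which matches the register-pressure argument above.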

@martinradev requested a review from pitrou on March 27, 2020 16:59
@cyb70289
Contributor

Amazing improvement.
@martinradev, is it okay if I port your algorithm to Arm64 Neon? I guess it will also benefit.

@martinradev
Contributor Author

Yeah, if you wish.
I wonder whether the current scalar implementation is correct on big-endian architectures.

@martinradev
Contributor Author

Do we even run tests on ARM?

@pitrou
Member

pitrou commented Mar 30, 2020

@martinradev The Arrow CI doesn't, but I suppose @cyb70289 does. Also, ARM is often little-endian nowadays.

@pitrou closed this in da94098 on Mar 30, 2020
@pitrou
Member

pitrou commented Mar 30, 2020

Thank you @martinradev !
