GH-38542: [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38529
Conversation
@cyb70289 Do you want to take a look at how this performs on ARM?
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title in the expected format? (In the case of PARQUET issues on JIRA, the title also supports an additional format.)
inline void DoMergeStreams(const uint8_t** src_streams, int width, int64_t nvalues,
Can width be a template argument here (and would it benefit performance)?
Yes, it can, and no, it probably wouldn't, as the compiler should inline this function into the caller.
Note that one reason for this work is that we might want to generalize BYTE_STREAM_SPLIT for more datatypes.
Big performance improvement observed on Arm Neoverse-N2 CPU. Great work!
It looks like the major improvement is that the total instruction count drops significantly. For the `BM_ByteStreamSplitEncode_Double_Scalar/65536` test, instructions per iteration drop from 5,478,109 to 1,915,403. I roughly estimated the total instructions needed to process 32 doubles.
(What about decoding?)
Should be similar. Instructions per iteration drop significantly (5,213,354 -> 1,703,533).
Another improvement I see is better CPU pipeline parallelism in the optimized code.
uint64_t f = src[stream + (i + 5) * width];
uint64_t g = src[stream + (i + 6) * width];
uint64_t h = src[stream + (i + 7) * width];
#if ARROW_LITTLE_ENDIAN
What if the endianness of the machine compiling it differs from that of the machine at runtime?
Then the whole of Arrow C++ is miscompiled?
Is there any situation where compile-time ARROW_LITTLE_ENDIAN might be different from the runtime platform's endianness?
Cross-compiling?
Here is the current definition. Can it be miscompiled?
```
#ifndef __BYTE_ORDER__
#error "__BYTE_ORDER__ not defined"
#endif
#
#ifndef __ORDER_LITTLE_ENDIAN__
#error "__ORDER_LITTLE_ENDIAN__ not defined"
#endif
#
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARROW_LITTLE_ENDIAN 1
#else
#define ARROW_LITTLE_ENDIAN 0
#endif
```
You may want to open an issue for this. There doesn't seem to be a reliable way to make a compile-time or constexpr check in C++17, AFAICT.
> Then the whole of Arrow C++ is miscompiled?
Maybe it's OK to test it when the user-side program starts, since this is not the only place that may cause problems if cross-compilation goes wrong.
What about opening an issue to switch to std::endian once we start C++20 adoption?
I'm not familiar with this part of the code, but that would probably mean introducing a kind of dynamic dispatch: if LITTLE_ENDIAN, take one path; otherwise go to the BIG_ENDIAN code.
As you referenced:
> As stated earlier, the only "real" way to detect Big Endian is to use runtime tests.
std::endian also seems to do a compile-time check rather than a runtime one. See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0463r1.html
IMO, a lot of code already uses compile-time macros. If we hit this problem, we would need runtime tests for SIMD instructions, endianness, etc., rather than dynamic dispatch just here...
This patch LGTM.
Also, if src in DoSplit* doesn't fit in the L1 cache, would it get much slower?
uint8_t** dest_streams) {
// Value empirically chosen to provide the best performance on the author's machine
constexpr int kBlockSize = 32;
Nit: do we need to DCHECK that the width is expected?
What do you mean?
width can only be 4/8 now, right? After FP16 is introduced, we might support 2/4/8; should other values be rejected?
Well, this function should support any non-zero width, so I don't think it is necessary to reject anything here.
Hmm, here it only supports FLOAT/DOUBLE? But I think it's also OK to accept any width > 0. It's fine with me if you prefer that.
Yes, Parquet only supports FLOAT/DOUBLE, but there is no reason for this function to be restricted.
I agree not to restrict the width unless required. This is now a common util not specific to parquet.
It's also already out of the L1 cache when num_values = 65536.
Oh, finally I got it. It might benefit from CPU prefetching? Thanks for the great code here.
Yes, I think it should, though to be honest I haven't tried to check it. But 64k
The code LGTM. I'm not so familiar with high-performance code, so maybe wait for @cyb70289's further review.
I compared L1 and L2 miss events. L1 data cache MPKI (misses per kilo instructions) drops significantly from 49.6 to 10.2, while L2 MPKI increases from 0.062 to 0.236. The optimized code is also better for the L1 data cache. My understanding is as below:
For the optimized code, the 1st loop populates the L1 data cache with all the data needed by the next 7 loops.
LGTM
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7281851. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.
…e#38529)

### Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

### What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

* before:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
```
* after:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: apache#38542

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Haven't read through the discussion here, but from the release performance report, I noticed that the … Not sure why the Conbench report above didn't say anything about that, though. The speedups of this PR might also be hardware-related?
Maybe it's compiler-related 🤔 Auto-vectorization might optimize code like this patch's original code...
It can be hardware- or compiler-related indeed.
OK, completely ignore my comment. The benchmark is expressed in "bytes per second", so higher is better... Thus the correct conclusion is: the benchmarks nicely captured the significant improvement from this PR! ;)
Aha...
Rationale for this change
BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.
What changes are included in this PR?
Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.
Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):
Are these changes tested?
Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).
Are there any user-facing changes?
No.