New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8579 [C++] SIMD for spaced decoding and encoding. #7029
Conversation
Below is the benchmark data on size 4096. Scalar:
SSE:
AVX512:
|
I'd gladly see a AVX2 or SSE version indeed, as many CPUs don't have AVX512. |
Just curious if you see and impact on parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup. |
No impact, I checked all items for parquet-arrow-reader-writer-benchmark... Below is the perf top on the bench-marking of BM_ReadColumn<true,Int32Type> and BM_WriteColumn<true,Int32Type>, seems these function is not on the path for them. BM_ReadColumn<true,Int32Type>: BM_WriteColumn<true,Int32Type>: |
bb8ef96
to
8f07bb8
Compare
1217f38
to
abc084c
Compare
@pitrou @emkornfield |
Sorry for the late reply. Might as well append it to this PR. |
A general question: why is this limited to |
_mm512_mask_expand_epi16/_mm512_mask_expand_epi18 are based on AVX512_VBMI2 support which is a new feature started from icelake architecture. So for 8-bit, current skylake AVX512 need fall back to SSE path. And dose arrow has 16bit type data? |
Definitely, there's |
1. Create the spaced encoding/decoding benchmark items 2. More unittest for spaced API SIMD implementation 3. SSE(epi8, epi32, epi64) and AVX512(epi32, epi64) added 3. Move spaced scalar and SIMD to new head file Signed-off-by: Frank Du <frank.du@intel.com>
I just put "static_assert(sizeof(T) != 2)" in the entry of spaced API, no error reported. Seems no int16/uint16 in the supported type of parquet. // Mirrors parquet::Type |
SSE(epi8, epi32, epi64) version appended. |
Oh, you're definitely right. I was thinking about Arrow types, sorry. |
Benchmarks on AMD Ryzen:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice improvement. Here are some comments.
# Usage: python3 spaced_sse_codegen.py > spaced_sse_generated.h | ||
|
||
|
||
def print_mask_expand_bitmap(width, length): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have an explanation for the data generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add
namespace arrow { | ||
namespace internal { | ||
|
||
static constexpr uint8_t kMask128SseCompressEpi32[] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it useful to have the definitions in a .h
file rather than .cc
? Those tables are a bit large and it would be better not to duplicate them around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will move to cc file
@@ -199,6 +199,107 @@ static void BM_PlainDecodingFloat(benchmark::State& state) { | |||
|
|||
BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE); | |||
|
|||
static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not how booleans are encoded in Parquet (i.e. 1-bit data), is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it's the one bit encoding. The put space include two stage, first is to filter the value as the valid bit map, later is call the the boolean encode routine on the valid data which remove spaced values already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean boolean encoding converts a bitmap to an array of bool
?
|
||
#if defined(ARROW_HAVE_SSE4_2) | ||
template <typename T> | ||
int SpacedCompressSseShuffle(const T* values, int num_values, const uint8_t* valid_bits, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an explanation of the algorithms somewhere? Can you add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add more
cpp/src/arrow/util/spaced.h
Outdated
// Thin table used, it need add back the offset of high and compact two parts | ||
__m128i src = | ||
_mm_loadu_si128(reinterpret_cast<const __m128i*>(values + idx_values)); | ||
__m128i mask = _mm_set_epi64x(*(reinterpret_cast<const uint64_t*>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be more readable if you moved the reinterpret_cast
s at the start of the function, and wrote something like kMask128SseCompressEpi8Thin[valid_byte_value_high]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same for all other occurrences of a similar pattern)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3x, will change
warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data Signed-off-by: Frank Du <frank.du@intel.com>
table to cc file more doc, better code readable Signed-off-by: Frank Du <frank.du@intel.com>
def print_mask_expand_table(pack_width, mask_length): | ||
""" | ||
Generate the lookup mask table for SSE expand shuffle control mask. | ||
Ex, for epi32 full table(pack_width = 4(32/8), mask_length = 16(128/8)), the available mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, this doesn't really make things much clearer. What is a "epi32"? What does "pack_width = 4(32/8)" mean? Also, what does the special value "0x80" mean? Why "0x80"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more
const int num_values = state.range(0); | ||
bool* values = new bool[num_values]; | ||
// Fixed half spaced pattern | ||
std::vector<uint8_t> valid_bits(arrow::BitUtil::BytesForBits(num_values), 0b10101010); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use std::vector
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems can't, the PutSpaced only accept T* input, but Put has std::vector input.
Does it has a way or necessary to convert a std::vector to bool*?
virtual void Put(const T* src, int num_values) = 0;
virtual void Put(const std::vector& src, int num_values = -1);
virtual void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
int64_t valid_bits_offset) = 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough. Let's keep it like that, then, or use unique_ptr
.
Signed-off-by: Frank Du <frank.du@intel.com>
@pitrou @emkornfield @fsaintjacques Hi, Kindly let me know what I can do more to step forward this patch? Thanks. |
I'm really concerned about continuing to drag our feet on runtime SIMD dispatching. The hole will continue to get dug deeper and deeper |
I just sent an e-mail to the ML about it |
Close this one. Revisit later util a runtime SIMD is settled. I guess I can commit the unit test and benchmark parts firstly, then we can get some survive during revisit at least the test/benchmark. |
Signed-off-by: Frank Du frank.du@intel.com