
ARROW-8579 [C++] SIMD for spaced decoding and encoding. #7029

Closed
wants to merge 5 commits

Conversation

frankdjx
Contributor

@frankdjx frankdjx commented Apr 24, 2020

  1. Create the spaced encoding/decoding benchmark items
  2. More unit tests for the spaced API SIMD implementation
  3. SSE (epi8, epi32, epi64) and AVX512 (epi32, epi64) added
  4. Move spaced scalar and SIMD to a new header file

Signed-off-by: Frank Du <frank.du@intel.com>
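For readers unfamiliar with the "spaced" APIs, here is a minimal scalar sketch of what spaced encode (compress) and decode (expand) do, assuming an LSB-first validity bitmap as in Arrow. The function names are illustrative, not this PR's actual signatures:

```python
# Scalar reference for the "spaced" operations this PR vectorizes.
# A validity bitmap (LSB-first, as in Arrow) marks which logical slots
# hold a value; null slots occupy positions but carry no data.

def spaced_compress(values, num_values, valid_bits):
    """Encode direction: keep only the values whose validity bit is set."""
    return [values[i] for i in range(num_values)
            if valid_bits[i // 8] >> (i % 8) & 1]

def spaced_expand(dense, num_values, valid_bits, fill=None):
    """Decode direction: scatter dense values back to their logical slots."""
    src = iter(dense)
    return [next(src) if valid_bits[i // 8] >> (i % 8) & 1 else fill
            for i in range(num_values)]
```

The `0b10101010` pattern used in the benchmarks below corresponds to every other slot being valid, i.e. 50% null density.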


@frankdjx
Contributor Author

frankdjx commented Apr 24, 2020

Below is the benchmark data on size 4096.

Scalar:

BM_PlainDecodingSpacedBoolean/4096        5894 ns         5888 ns       118928 bytes_per_second=663.435M/s
BM_PlainDecodingSpacedFloat/4096          3362 ns         3358 ns       209023 bytes_per_second=4.54393G/s
BM_PlainDecodingSpacedDouble/4096         3625 ns         3622 ns       192998 bytes_per_second=8.42641G/s

BM_PlainEncodingSpacedBoolean/4096        6319 ns         6313 ns       110791 bytes_per_second=618.767M/s
BM_PlainEncodingSpacedFloat/4096          4432 ns         4426 ns       158687 bytes_per_second=3.44732G/s
BM_PlainEncodingSpacedDouble/4096         4873 ns         4858 ns       143571 bytes_per_second=6.2815G/s

SSE:

BM_PlainDecodingSpacedBoolean/4096        2841 ns         2838 ns       246636 bytes_per_second=1.34394G/s
BM_PlainDecodingSpacedFloat/4096          1272 ns         1271 ns       551415 bytes_per_second=12.0043G/s
BM_PlainDecodingSpacedDouble/4096         2454 ns         2451 ns       285807 bytes_per_second=12.4491G/s

BM_PlainEncodingSpacedBoolean/4096        3744 ns         3740 ns       187030 bytes_per_second=1044.38M/s
BM_PlainEncodingSpacedFloat/4096          3105 ns         3101 ns       223733 bytes_per_second=4.91998G/s
BM_PlainEncodingSpacedDouble/4096         1128 ns         1127 ns       621762 bytes_per_second=6.76959G/s

AVX512:

BM_PlainDecodingSpacedBoolean/4096         848 ns          847 ns       824442 bytes_per_second=4.50542G/s
BM_PlainDecodingSpacedFloat/4096           544 ns          543 ns      1290583 bytes_per_second=28.076G/s
BM_PlainDecodingSpacedDouble/4096         1244 ns         1243 ns       567255 bytes_per_second=24.5575G/s

BM_PlainEncodingSpacedBoolean/4096        3481 ns         3478 ns       201293 bytes_per_second=1123.16M/s
BM_PlainEncodingSpacedFloat/4096          1666 ns         1664 ns       413996 bytes_per_second=9.16787G/s
BM_PlainEncodingSpacedDouble/4096          636 ns          635 ns      1103326 bytes_per_second=12.0144G/s

@pitrou
Member

pitrou commented Apr 24, 2020

I'd gladly see an AVX2 or SSE version indeed, as many CPUs don't have AVX512.

@pitrou pitrou self-requested a review April 24, 2020 09:54
@emkornfield
Contributor

Just curious, do you see any impact on the parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup.

@frankdjx
Contributor Author

Just curious, do you see any impact on the parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup.

No impact, I checked all items for parquet-arrow-reader-writer-benchmark...

Below is the perf top output while bench-marking BM_ReadColumn<true,Int32Type> and BM_WriteColumn<true,Int32Type>; it seems these functions are not on the hot path for them.

BM_ReadColumn<true,Int32Type>:
31.60% libparquet.so.18.0.0 [.] _ZN5arrow4util10RleDecoder22GetBatchWithDictSpacedIiEEiPKT_iPS3_iiPKhl
21.74% libparquet.so.18.0.0 [.] _ZN7parquet8internalL24DefinitionLevelsToBitmapEPKslssPlS3_Phl

BM_WriteColumn<true,Int32Type>:
20.64% libparquet.so.18.0.0 [.] _ZN5mpark6detail10visitation4base17make_fmatrix_implIONS1_7variant13value_visitorIRZN7parquet5arrow12_GLOBAL__N_19WritePathENS7_12Ele
16.19% libparquet.so.18.0.0 [.] _ZN7parquet15DictEncoderImplINS_12PhysicalTypeILNS_4Type4typeE1EEEE3PutERKi.constprop.455
11.50% libparquet.so.18.0.0 [.] _ZN7parquet12LevelEncoder6EncodeEiPKs
7.93% libparquet.so.18.0.0 [.] _ZN5arrow4util10RleEncoder15FlushLiteralRunEb

@frankdjx frankdjx marked this pull request as draft April 26, 2020 10:45
@frankdjx frankdjx force-pushed the spaced-avx512 branch 2 times, most recently from bb8ef96 to 8f07bb8 Compare April 26, 2020 11:24
@frankdjx frankdjx marked this pull request as ready for review April 26, 2020 11:46
@frankdjx frankdjx force-pushed the spaced-avx512 branch 2 times, most recently from 1217f38 to abc084c Compare April 29, 2020 06:21
@frankdjx
Contributor Author

I'd gladly see an AVX2 or SSE version indeed, as many CPUs don't have AVX512.

@pitrou @emkornfield
Yeah, I have an SSE version; would you like me to append it to this PR, or open a new PR after this one is closed? frankdjx@dce2949

@emkornfield
Contributor

Sorry for the late reply. Might as well append it to this PR.

@pitrou
Member

pitrou commented May 4, 2020

A general question: why is this limited to sizeof(T) == 4 and sizeof(T) == 8? There are 8-bit and 16-bit types as well.

@frankdjx
Contributor Author

frankdjx commented May 5, 2020

A general question: why is this limited to sizeof(T) == 4 and sizeof(T) == 8? There are 8-bit and 16-bit types as well.

_mm512_mask_expand_epi16/_mm512_mask_expand_epi8 rely on AVX512_VBMI2 support, a new feature introduced with the Ice Lake architecture. So for 8-bit, current Skylake AVX512 needs to fall back to the SSE path. And does Arrow have 16-bit data types?
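For context, the 32-bit expand that does not need VBMI2, _mm512_maskz_expand_epi32, takes the contiguous low elements of the source and scatters them to the lanes whose mask bit is set, zeroing the rest. A scalar model of that per-lane behavior, purely for illustration:

```python
def maskz_expand_epi32(k, a):
    """Scalar model of AVX-512 _mm512_maskz_expand_epi32: contiguous
    elements of `a` are written to the lanes whose bit in mask `k` is
    set; unselected lanes are zeroed. `a` has 16 32-bit lanes."""
    out, j = [], 0
    for i in range(16):
        if k >> i & 1:
            out.append(a[j])
            j += 1
        else:
            out.append(0)
    return out
```

This is exactly the shape of the spaced-decode inner loop: the validity bits become the mask, and the dense Parquet values become the contiguous source.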

@pitrou
Member

pitrou commented May 5, 2020

Definitely, there's int16 and uint16 in addition to float16 (which isn't really supported currently for anything but storage). I may be missing others.

1. Create the spaced encoding/decoding benchmark items
2. More unit tests for the spaced API SIMD implementation
3. SSE (epi8, epi32, epi64) and AVX512 (epi32, epi64) added
4. Move spaced scalar and SIMD to a new header file

Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx changed the title ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding. ARROW-8579 [C++] SIMD for spaced decoding and encoding. May 6, 2020
@frankdjx
Contributor Author

frankdjx commented May 6, 2020

Definitely, there's int16 and uint16 in addition to float16 (which isn't really supported currently for anything but storage). I may miss others.

I just put static_assert(sizeof(T) != 2) at the entry of the spaced API, and no error was reported. It seems there is no int16/uint16 among Parquet's supported types.

// Mirrors parquet::Type
struct Type {
  enum type {
    BOOLEAN = 0,
    INT32 = 1,
    INT64 = 2,
    INT96 = 3,
    FLOAT = 4,
    DOUBLE = 5,
    BYTE_ARRAY = 6,
    FIXED_LEN_BYTE_ARRAY = 7,
    // Should always be last element.
    UNDEFINED = 8
  };
};

@frankdjx
Contributor Author

frankdjx commented May 6, 2020

Sorry for the late reply. Might as well append it to this PR.

SSE(epi8, epi32, epi64) version appended.

@pitrou
Member

pitrou commented May 6, 2020

Oh, you're definitely right. I was thinking about Arrow types, sorry.

@pitrou
Member

pitrou commented May 6, 2020

Benchmarks on AMD Ryzen:

  • Scalar:
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_PlainEncodingSpacedBoolean/32768      40353 ns        40343 ns        55868 bytes_per_second=774.601M/s
BM_PlainDecodingSpacedBoolean/32768      26626 ns        26620 ns        78965 bytes_per_second=1.14643G/s
BM_PlainEncodingSpacedFloat/32768        21256 ns        21251 ns        97590 bytes_per_second=5.74415G/s
BM_PlainEncodingSpacedDouble/32768       24643 ns        24638 ns        85219 bytes_per_second=9.90919G/s
BM_PlainDecodingSpacedFloat/32768        21135 ns        21129 ns       100701 bytes_per_second=5.77725G/s
BM_PlainDecodingSpacedDouble/32768       22763 ns        22757 ns        92009 bytes_per_second=10.7281G/s
  • AVX2:
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_PlainEncodingSpacedBoolean/32768      20718 ns        20715 ns       102447 bytes_per_second=1.47321G/s
BM_PlainDecodingSpacedBoolean/32768      19129 ns        19127 ns       110421 bytes_per_second=1.59553G/s
BM_PlainEncodingSpacedFloat/32768         8117 ns         8116 ns       258508 bytes_per_second=15.0408G/s
BM_PlainEncodingSpacedDouble/32768       15342 ns        15339 ns       137662 bytes_per_second=15.9159G/s
BM_PlainDecodingSpacedFloat/32768         8384 ns         8383 ns       259971 bytes_per_second=14.5617G/s
BM_PlainDecodingSpacedDouble/32768       14877 ns        14875 ns       136365 bytes_per_second=16.4123G/s


@pitrou pitrou left a comment


This is a very nice improvement. Here are some comments.

# Usage: python3 spaced_sse_codegen.py > spaced_sse_generated.h


def print_mask_expand_bitmap(width, length):
Member

Is it possible to have an explanation for the data generation?

Contributor Author

Will add

namespace arrow {
namespace internal {

static constexpr uint8_t kMask128SseCompressEpi32[] = {
Member

Is it useful to have the definitions in a .h file rather than .cc? Those tables are a bit large and it would be better not to duplicate them around.

Contributor Author

Good catch, will move to the .cc file

@@ -199,6 +199,107 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {

BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);

static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
Member

That's not how booleans are encoded in Parquet (i.e. 1-bit data), is it?

Contributor Author

Indeed, it's the one-bit encoding. PutSpaced has two stages: the first filters the values using the valid bitmap, and the second calls the boolean encode routine on the valid data, from which the spaced values have already been removed.
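The two stages described above can be sketched in scalar form (an illustration with a hypothetical name, not the PR's code): first densify the bools using the validity bitmap, then bit-pack the survivors as PLAIN boolean encoding does.

```python
def put_spaced_boolean(values, num_values, valid_bits):
    # Stage 1: filter out the spaced (null) slots using the valid bitmap.
    dense = [values[i] for i in range(num_values)
             if valid_bits[i // 8] >> (i % 8) & 1]
    # Stage 2: one-bit-pack the remaining booleans (LSB-first).
    packed = bytearray((len(dense) + 7) // 8)
    for i, v in enumerate(dense):
        if v:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)
```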

Member

You mean boolean encoding converts a bitmap to an array of bool?


#if defined(ARROW_HAVE_SSE4_2)
template <typename T>
int SpacedCompressSseShuffle(const T* values, int num_values, const uint8_t* valid_bits,
Member

Is there an explanation of the algorithms somewhere? Can you add a comment?

Contributor Author

Will add more

// The thin table is used here; add back the offset for the high part, then compact the two parts
__m128i src =
_mm_loadu_si128(reinterpret_cast<const __m128i*>(values + idx_values));
__m128i mask = _mm_set_epi64x(*(reinterpret_cast<const uint64_t*>(
Member

Would be more readable if you moved the reinterpret_casts at the start of the function, and wrote something like kMask128SseCompressEpi8Thin[valid_byte_value_high].

Member

(same for all other occurrences of a similar pattern)

Contributor Author

Thanks, will change

@frankdjx frankdjx marked this pull request as draft May 7, 2020 03:11
frankdjx and others added 2 commits May 7, 2020 03:19
warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data

Signed-off-by: Frank Du <frank.du@intel.com>
table to cc file
more doc, better code readable

Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx marked this pull request as ready for review May 7, 2020 06:20
@frankdjx frankdjx requested a review from pitrou May 7, 2020 06:20
def print_mask_expand_table(pack_width, mask_length):
"""
Generate the lookup mask table for SSE expand shuffle control mask.
Ex, for epi32 full table(pack_width = 4(32/8), mask_length = 16(128/8)), the available mask
Member

For me, this doesn't really make things much clearer. What is a "epi32"? What does "pack_width = 4(32/8)" mean? Also, what does the special value "0x80" mean? Why "0x80"?

Contributor Author

Add more
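To expand on the 0x80 question: in _mm_shuffle_epi8, a control byte with its high bit set zeroes the corresponding destination byte, so 0x80 is the natural filler for spaced lanes. One row of an epi32 expand table could be generated roughly like this (a sketch mirroring the generator's intent, not its exact code):

```python
def expand_shuffle_mask_epi32(valid_mask):
    """Build the 16-byte pshufb control for a 4-lane (epi32) validity
    mask: valid lanes pull the next four source bytes in order; invalid
    lanes get 0x80, which _mm_shuffle_epi8 turns into zero bytes."""
    row, src_lane = [], 0
    for lane in range(4):
        if valid_mask >> lane & 1:
            row += [src_lane * 4 + b for b in range(4)]
            src_lane += 1
        else:
            row += [0x80] * 4
    return row
```

A full table enumerates all 16 possible 4-bit validity masks, which is why these lookup tables grow large enough to be worth keeping in a .cc file.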

const int num_values = state.range(0);
bool* values = new bool[num_values];
// Fixed half spaced pattern
std::vector<uint8_t> valid_bits(arrow::BitUtil::BytesForBits(num_values), 0b10101010);
Member

Should use std::vector<bool> instead.

Contributor Author

Seems we can't; PutSpaced only accepts T* input, while Put has a std::vector overload.

Is there a way (or is it necessary) to convert a std::vector<bool> to bool*?

virtual void Put(const T* src, int num_values) = 0;
virtual void Put(const std::vector<T>& src, int num_values = -1);
virtual void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
                       int64_t valid_bits_offset) = 0;

Member

Ah, fair enough. Let's keep it like that, then, or use unique_ptr.

@fsaintjacques fsaintjacques self-requested a review May 7, 2020 22:23
Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx requested a review from pitrou May 8, 2020 02:16
@frankdjx
Contributor Author

@pitrou @emkornfield @fsaintjacques

Hi,

Kindly let me know what more I can do to move this patch forward. Thanks.

@wesm
Member

wesm commented May 13, 2020

I'm really concerned about continuing to drag our feet on runtime SIMD dispatching. The hole will continue to get dug deeper and deeper

@wesm
Member

wesm commented May 13, 2020

I just sent an e-mail to the mailing list about it

@frankdjx
Contributor Author

Closing this one; will revisit later once runtime SIMD dispatch is settled. I guess I can commit the unit test and benchmark parts first, so at least the tests/benchmarks survive until the revisit.
