
ARROW-8579 [C++] SIMD for spaced decoding and encoding. #7029

Closed
wants to merge 5 commits

Conversation

frankdjx
Contributor

@frankdjx frankdjx commented Apr 24, 2020

  1. Create the spaced encoding/decoding benchmark items
  2. More unit tests for the spaced API SIMD implementation
  3. SSE (epi8, epi32, epi64) and AVX512 (epi32, epi64) added
  4. Move spaced scalar and SIMD to a new header file

Signed-off-by: Frank Du <frank.du@intel.com>
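For readers unfamiliar with the "spaced" APIs, here is a minimal scalar sketch of what spaced encode (compress) and decode (expand) do, assuming an LSB-first validity bitmap as in Arrow. The function names are illustrative, not this PR's actual signatures:

```python
# Scalar reference for the "spaced" operations this PR vectorizes.
# A validity bitmap (LSB-first, as in Arrow) marks which logical slots
# hold a value; null slots occupy positions but carry no data.

def spaced_compress(values, num_values, valid_bits):
    """Encode direction: keep only the values whose validity bit is set."""
    return [values[i] for i in range(num_values)
            if valid_bits[i // 8] >> (i % 8) & 1]

def spaced_expand(dense, num_values, valid_bits, fill=None):
    """Decode direction: scatter dense values back to their logical slots."""
    src = iter(dense)
    return [next(src) if valid_bits[i // 8] >> (i % 8) & 1 else fill
            for i in range(num_values)]
```

The `0b10101010` pattern used in the benchmarks below corresponds to every other slot being valid, i.e. 50% null density.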


@frankdjx
Contributor Author

frankdjx commented Apr 24, 2020

Below is the benchmark data on size 4096.

Scalar:

BM_PlainDecodingSpacedBoolean/4096        5894 ns         5888 ns       118928 bytes_per_second=663.435M/s
BM_PlainDecodingSpacedFloat/4096          3362 ns         3358 ns       209023 bytes_per_second=4.54393G/s
BM_PlainDecodingSpacedDouble/4096         3625 ns         3622 ns       192998 bytes_per_second=8.42641G/s

BM_PlainEncodingSpacedBoolean/4096        6319 ns         6313 ns       110791 bytes_per_second=618.767M/s
BM_PlainEncodingSpacedFloat/4096          4432 ns         4426 ns       158687 bytes_per_second=3.44732G/s
BM_PlainEncodingSpacedDouble/4096         4873 ns         4858 ns       143571 bytes_per_second=6.2815G/s

SSE:

BM_PlainDecodingSpacedBoolean/4096        2841 ns         2838 ns       246636 bytes_per_second=1.34394G/s
BM_PlainDecodingSpacedFloat/4096          1272 ns         1271 ns       551415 bytes_per_second=12.0043G/s
BM_PlainDecodingSpacedDouble/4096         2454 ns         2451 ns       285807 bytes_per_second=12.4491G/s

BM_PlainEncodingSpacedBoolean/4096        3744 ns         3740 ns       187030 bytes_per_second=1044.38M/s
BM_PlainEncodingSpacedFloat/4096          3105 ns         3101 ns       223733 bytes_per_second=4.91998G/s
BM_PlainEncodingSpacedDouble/4096         1128 ns         1127 ns       621762 bytes_per_second=6.76959G/s

AVX512:

BM_PlainDecodingSpacedBoolean/4096         848 ns          847 ns       824442 bytes_per_second=4.50542G/s
BM_PlainDecodingSpacedFloat/4096           544 ns          543 ns      1290583 bytes_per_second=28.076G/s
BM_PlainDecodingSpacedDouble/4096         1244 ns         1243 ns       567255 bytes_per_second=24.5575G/s

BM_PlainEncodingSpacedBoolean/4096        3481 ns         3478 ns       201293 bytes_per_second=1123.16M/s
BM_PlainEncodingSpacedFloat/4096          1666 ns         1664 ns       413996 bytes_per_second=9.16787G/s
BM_PlainEncodingSpacedDouble/4096          636 ns          635 ns      1103326 bytes_per_second=12.0144G/s

@pitrou
Member

pitrou commented Apr 24, 2020

I'd gladly see an AVX2 or SSE version indeed, as many CPUs don't have AVX512.

@pitrou pitrou self-requested a review April 24, 2020 09:54
@emkornfield
Contributor

Just curious, do you see any impact on the parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup.

@frankdjx
Contributor Author

Just curious, do you see any impact on the parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup.

No impact, I checked all items for parquet-arrow-reader-writer-benchmark...

Below is the perf top output while bench-marking BM_ReadColumn<true,Int32Type> and BM_WriteColumn<true,Int32Type>; it seems these functions are not on the hot path for them.

BM_ReadColumn<true,Int32Type>:
31.60% libparquet.so.18.0.0 [.] _ZN5arrow4util10RleDecoder22GetBatchWithDictSpacedIiEEiPKT_iPS3_iiPKhl
21.74% libparquet.so.18.0.0 [.] _ZN7parquet8internalL24DefinitionLevelsToBitmapEPKslssPlS3_Phl

BM_WriteColumn<true,Int32Type>:
20.64% libparquet.so.18.0.0 [.] _ZN5mpark6detail10visitation4base17make_fmatrix_implIONS1_7variant13value_visitorIRZN7parquet5arrow12_GLOBAL__N_19WritePathENS7_12Ele
16.19% libparquet.so.18.0.0 [.] _ZN7parquet15DictEncoderImplINS_12PhysicalTypeILNS_4Type4typeE1EEEE3PutERKi.constprop.455
11.50% libparquet.so.18.0.0 [.] _ZN7parquet12LevelEncoder6EncodeEiPKs
7.93% libparquet.so.18.0.0 [.] _ZN5arrow4util10RleEncoder15FlushLiteralRunEb

@frankdjx frankdjx marked this pull request as draft April 26, 2020 10:45
@frankdjx frankdjx force-pushed the spaced-avx512 branch 2 times, most recently from bb8ef96 to 8f07bb8 Compare April 26, 2020 11:24
@frankdjx frankdjx marked this pull request as ready for review April 26, 2020 11:46
@frankdjx frankdjx force-pushed the spaced-avx512 branch 2 times, most recently from 1217f38 to abc084c Compare April 29, 2020 06:21
@frankdjx
Contributor Author

I'd gladly see an AVX2 or SSE version indeed, as many CPUs don't have AVX512.

@pitrou @emkornfield
Yeah, I have an SSE version; would you like me to append it to this PR, or open a new PR after this one is closed? frankdjx@dce2949

@emkornfield
Contributor

Sorry for the late reply. Might as well append it to this PR.

@pitrou
Member

pitrou commented May 4, 2020

A general question: why is this limited to sizeof(T) == 4 and sizeof(T) == 8? There are 8-bit and 16-bit types as well.

@frankdjx
Contributor Author

frankdjx commented May 5, 2020

A general question: why is this limited to sizeof(T) == 4 and sizeof(T) == 8? There are 8-bit and 16-bit types as well.

_mm512_mask_expand_epi16/_mm512_mask_expand_epi8 rely on AVX512_VBMI2 support, a new feature introduced with the Ice Lake architecture. So for 8-bit, current Skylake AVX512 needs to fall back to the SSE path. And does Arrow have 16-bit data types?
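For context, the 32-bit expand that does not need VBMI2, _mm512_maskz_expand_epi32, takes the contiguous low elements of the source and scatters them to the lanes whose mask bit is set, zeroing the rest. A scalar model of that per-lane behavior, purely for illustration:

```python
def maskz_expand_epi32(k, a):
    """Scalar model of AVX-512 _mm512_maskz_expand_epi32: contiguous
    elements of `a` are written to the lanes whose bit in mask `k` is
    set; unselected lanes are zeroed. `a` has 16 32-bit lanes."""
    out, j = [], 0
    for i in range(16):
        if k >> i & 1:
            out.append(a[j])
            j += 1
        else:
            out.append(0)
    return out
```

This is exactly the shape of the spaced-decode inner loop: the validity bits become the mask, and the dense Parquet values become the contiguous source.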

@pitrou
Member

pitrou commented May 5, 2020

Definitely, there's int16 and uint16 in addition to float16 (which isn't really supported currently for anything but storage). I may be missing others.

1. Create the spaced encoding/decoding benchmark items
2. More unit tests for the spaced API SIMD implementation
3. SSE (epi8, epi32, epi64) and AVX512 (epi32, epi64) added
4. Move spaced scalar and SIMD to a new header file

Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx changed the title ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding. ARROW-8579 [C++] SIMD for spaced decoding and encoding. May 6, 2020
@frankdjx
Contributor Author

frankdjx commented May 6, 2020

Definitely, there's int16 and uint16 in addition to float16 (which isn't really supported currently for anything but storage). I may miss others.

I just put static_assert(sizeof(T) != 2) at the entry of the spaced API, and no error was reported. It seems there is no int16/uint16 among Parquet's supported types.

// Mirrors parquet::Type
struct Type {
  enum type {
    BOOLEAN = 0,
    INT32 = 1,
    INT64 = 2,
    INT96 = 3,
    FLOAT = 4,
    DOUBLE = 5,
    BYTE_ARRAY = 6,
    FIXED_LEN_BYTE_ARRAY = 7,
    // Should always be last element.
    UNDEFINED = 8
  };
};

@frankdjx
Contributor Author

frankdjx commented May 6, 2020

Sorry for the late reply. Might as well append it to this PR.

SSE(epi8, epi32, epi64) version appended.

@pitrou
Member

pitrou commented May 6, 2020

Oh, you're definitely right. I was thinking about Arrow types, sorry.

@pitrou
Member

pitrou commented May 6, 2020

Benchmarks on AMD Ryzen:

  • Scalar:
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_PlainEncodingSpacedBoolean/32768      40353 ns        40343 ns        55868 bytes_per_second=774.601M/s
BM_PlainDecodingSpacedBoolean/32768      26626 ns        26620 ns        78965 bytes_per_second=1.14643G/s
BM_PlainEncodingSpacedFloat/32768        21256 ns        21251 ns        97590 bytes_per_second=5.74415G/s
BM_PlainEncodingSpacedDouble/32768       24643 ns        24638 ns        85219 bytes_per_second=9.90919G/s
BM_PlainDecodingSpacedFloat/32768        21135 ns        21129 ns       100701 bytes_per_second=5.77725G/s
BM_PlainDecodingSpacedDouble/32768       22763 ns        22757 ns        92009 bytes_per_second=10.7281G/s
  • AVX2:
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_PlainEncodingSpacedBoolean/32768      20718 ns        20715 ns       102447 bytes_per_second=1.47321G/s
BM_PlainDecodingSpacedBoolean/32768      19129 ns        19127 ns       110421 bytes_per_second=1.59553G/s
BM_PlainEncodingSpacedFloat/32768         8117 ns         8116 ns       258508 bytes_per_second=15.0408G/s
BM_PlainEncodingSpacedDouble/32768       15342 ns        15339 ns       137662 bytes_per_second=15.9159G/s
BM_PlainDecodingSpacedFloat/32768         8384 ns         8383 ns       259971 bytes_per_second=14.5617G/s
BM_PlainDecodingSpacedDouble/32768       14877 ns        14875 ns       136365 bytes_per_second=16.4123G/s


@pitrou pitrou left a comment


This is a very nice improvement. Here are some comments.

# Usage: python3 spaced_sse_codegen.py > spaced_sse_generated.h


def print_mask_expand_bitmap(width, length):
Member

Is it possible to have an explanation for the data generation?

Contributor Author

Will add

namespace arrow {
namespace internal {

static constexpr uint8_t kMask128SseCompressEpi32[] = {
Member

Is it useful to have the definitions in a .h file rather than .cc? Those tables are a bit large and it would be better not to duplicate them around.

Contributor Author

Good catch, will move to the .cc file

@@ -199,6 +199,107 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {

BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);

static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
Member

That's not how booleans are encoded in Parquet (i.e. 1-bit data), is it?

Contributor Author

Indeed, it's the one-bit encoding. PutSpaced has two stages: the first filters the values using the valid bitmap, and the second calls the boolean encode routine on the valid data, from which the spaced values have already been removed.
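The two stages described above can be sketched in scalar form (an illustration with a hypothetical name, not the PR's code): first densify the bools using the validity bitmap, then bit-pack the survivors as PLAIN boolean encoding does.

```python
def put_spaced_boolean(values, num_values, valid_bits):
    # Stage 1: filter out the spaced (null) slots using the valid bitmap.
    dense = [values[i] for i in range(num_values)
             if valid_bits[i // 8] >> (i % 8) & 1]
    # Stage 2: one-bit-pack the remaining booleans (LSB-first).
    packed = bytearray((len(dense) + 7) // 8)
    for i, v in enumerate(dense):
        if v:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)
```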

Member

You mean boolean encoding converts a bitmap to an array of bool?


#if defined(ARROW_HAVE_SSE4_2)
template <typename T>
int SpacedCompressSseShuffle(const T* values, int num_values, const uint8_t* valid_bits,
Member

Is there an explanation of the algorithms somewhere? Can you add a comment?

Contributor Author

Will add more

// The thin table is used here; add back the offset for the high part, then compact the two parts
__m128i src =
_mm_loadu_si128(reinterpret_cast<const __m128i*>(values + idx_values));
__m128i mask = _mm_set_epi64x(*(reinterpret_cast<const uint64_t*>(
Member

Would be more readable if you moved the reinterpret_casts at the start of the function, and wrote something like kMask128SseCompressEpi8Thin[valid_byte_value_high].

Member

(same for all other occurrences of a similar pattern)

Contributor Author

Thanks, will change

@frankdjx frankdjx marked this pull request as draft May 7, 2020 03:11
frankdjx and others added 2 commits May 7, 2020 03:19
warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data

Signed-off-by: Frank Du <frank.du@intel.com>
table to cc file
more doc, better code readable

Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx marked this pull request as ready for review May 7, 2020 06:20
@frankdjx frankdjx requested a review from pitrou May 7, 2020 06:20
def print_mask_expand_table(pack_width, mask_length):
"""
Generate the lookup mask table for SSE expand shuffle control mask.
Ex, for epi32 full table(pack_width = 4(32/8), mask_length = 16(128/8)), the available mask
Member

For me, this doesn't really make things much clearer. What is a "epi32"? What does "pack_width = 4(32/8)" mean? Also, what does the special value "0x80" mean? Why "0x80"?

Contributor Author

Add more
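To expand on the 0x80 question: in _mm_shuffle_epi8, a control byte with its high bit set zeroes the corresponding destination byte, so 0x80 is the natural filler for spaced lanes. One row of an epi32 expand table could be generated roughly like this (a sketch mirroring the generator's intent, not its exact code):

```python
def expand_shuffle_mask_epi32(valid_mask):
    """Build the 16-byte pshufb control for a 4-lane (epi32) validity
    mask: valid lanes pull the next four source bytes in order; invalid
    lanes get 0x80, which _mm_shuffle_epi8 turns into zero bytes."""
    row, src_lane = [], 0
    for lane in range(4):
        if valid_mask >> lane & 1:
            row += [src_lane * 4 + b for b in range(4)]
            src_lane += 1
        else:
            row += [0x80] * 4
    return row
```

A full table enumerates all 16 possible 4-bit validity masks, which is why these lookup tables grow large enough to be worth keeping in a .cc file.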

const int num_values = state.range(0);
bool* values = new bool[num_values];
// Fixed half spaced pattern
std::vector<uint8_t> valid_bits(arrow::BitUtil::BytesForBits(num_values), 0b10101010);
Member

Should use std::vector<bool> instead.

Contributor Author

Seems we can't; PutSpaced only accepts T* input, while Put has a std::vector overload.

Is there a way (or is it necessary) to convert a std::vector<bool> to bool*?

virtual void Put(const T* src, int num_values) = 0;
virtual void Put(const std::vector<T>& src, int num_values = -1);
virtual void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
                       int64_t valid_bits_offset) = 0;

Member

Ah, fair enough. Let's keep it like that, then, or use unique_ptr.

@fsaintjacques fsaintjacques self-requested a review May 7, 2020 22:23
Signed-off-by: Frank Du <frank.du@intel.com>
@frankdjx frankdjx requested a review from pitrou May 8, 2020 02:16
@frankdjx
Contributor Author

@pitrou @emkornfield @fsaintjacques

Hi,

Kindly let me know what more I can do to move this patch forward. Thanks.

@wesm
Member

wesm commented May 13, 2020

I'm really concerned about continuing to drag our feet on runtime SIMD dispatching. The hole will continue to get dug deeper and deeper

@wesm
Member

wesm commented May 13, 2020

I just sent an e-mail to the mailing list about it

@frankdjx
Contributor Author

Closing this one; will revisit later once runtime SIMD dispatch is settled. I guess I can commit the unit test and benchmark parts first, so at least the tests/benchmarks survive until the revisit.
