GH-38542: [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38529
Conversation
@cyb70289 Do you want to take a look at how this performs on ARM?
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title in the expected format? (In the case of PARQUET issues on JIRA, the title also supports an additional format.)
inline void DoMergeStreams(const uint8_t** src_streams, int width, int64_t nvalues,
Can width be a template argument here (and would it benefit performance)?
Yes, it can, and no, it probably wouldn't, as the compiler should inline this function into the caller.
Note that one reason for this work is that we might want to generalize BYTE_STREAM_SPLIT for more datatypes.
Big performance improvement observed on Arm Neoverse-N2 CPU. Great work!
It looks like the major improvement is that the total instruction count drops significantly. For the `BM_ByteStreamSplitEncode_Double_Scalar/65536` test, instructions per iteration drop from 5,478,109 to 1,915,403. I roughly estimated the total instructions needed to process 32 doubles.
(What about decoding?)
Should be similar. Instructions per iteration drop significantly (5,213,354 -> 1,703,533).
Another improvement I see is better CPU pipeline parallelism in the optimized code.
uint64_t f = src[stream + (i + 5) * width];
uint64_t g = src[stream + (i + 6) * width];
uint64_t h = src[stream + (i + 7) * width];
#if ARROW_LITTLE_ENDIAN
What if the endianness of the machine compiling it differs from that of the machine at runtime?
Then the whole of Arrow C++ is miscompiled?
Is there any situation where compile-time ARROW_LITTLE_ENDIAN might be different from the runtime platform's endianness?
Cross-compiling?
Here is the current definition. Can it be miscompiled?
```
#ifndef __BYTE_ORDER__
#error "__BYTE_ORDER__ not defined"
#endif
#
#ifndef __ORDER_LITTLE_ENDIAN__
#error "__ORDER_LITTLE_ENDIAN__ not defined"
#endif
#
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARROW_LITTLE_ENDIAN 1
#else
#define ARROW_LITTLE_ENDIAN 0
#endif
```
You may want to open an issue for this. There doesn't seem to be a reliable way to make a compile-time or constexpr check in C++17, AFAICT.
> Then the whole of Arrow C++ is miscompiled?
Maybe it's OK to test it when the user-side program starts, since this is not the only place that may cause problems if cross-compilation goes wrong.
What about opening an issue to switch to std::endian once we start C++20 adoption?
I'm not familiar with this part of the code, but that would probably mean introducing a kind of dynamic dispatch: if LITTLE_ENDIAN, take one path; otherwise go to the BIG_ENDIAN code.
As you referenced:
> As stated earlier, the only "real" way to detect Big Endian is to use runtime tests.
std::endian also seems to do a compile-time check rather than a runtime one. See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0463r1.html
IMO, a lot of code already uses compile-time macros. If we hit this problem, we would need runtime tests for SIMD instructions, endianness, etc., rather than dynamic dispatch just here...
This patch LGTM.
Also, if src in DoSplit* doesn't fit in the L1 cache, would it get much slower?
uint8_t** dest_streams) {
// Value empirically chosen to provide the best performance on the author's machine
constexpr int kBlockSize = 32;
Nit: do we need to DCHECK that the width is expected?
What do you mean?
width can only be 4/8 now, right? After FP16 is introduced, we might support 2/4/8; should other values be rejected?
Well, this function should support any non-zero width, so I don't think it is necessary to reject anything here.
Hmm, here it only supports FLOAT/DOUBLE? But I think it's also OK to accept any width > 0. It's fine with me if you prefer that.
Yes, Parquet only supports FLOAT/DOUBLE, but there is no reason for this function to be restricted.
I agree not to restrict the width unless required. This is now a common util not specific to parquet.
It's also already out of the L1 cache when num_values = 65536.
Oh, finally I got it. It might benefit from CPU prefetching? Thanks for the great code here.
Yes, I think it should, though to be honest I haven't tried to check it. But 64k
The code LGTM. I'm not so familiar with high-performance code, so maybe wait for @cyb70289's further review.
I compared L1 and L2 miss events. L1 data cache MPKI (misses per kilo instructions) drops significantly from 49.6 to 10.2, while L2 MPKI increases from 0.062 to 0.236. The optimized code is also better for the L1 data cache. My understanding is as below:
For the optimized code, the 1st loop populates the L1 data cache with all the data needed by the next 7 loops.
LGTM
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7281851. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.
…e#38529)

### Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

### What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

* before:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
```
* after:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: apache#38542

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Haven't read through the discussion here, but from the release performance report, I noticed that the … Not sure why the Conbench report above didn't say anything about that, though. The speedups of this PR might also be hardware-related?
Maybe it's compiler-related 🤔 Auto-vectorization might optimize code like this patch's original code...
It can be hardware- or compiler-related indeed.
OK, completely ignore my comment. The benchmark is expressed in "bytes per second", so higher is better... Thus the correct conclusion is: the benchmarks nicely captured the significant improvement from this PR! ;)
Aha...
Rationale for this change
BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.
What changes are included in this PR?
Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.
Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):
Are these changes tested?
Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).
Are there any user-facing changes?
No.