
GH-38542: [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38529

Merged
merged 6 commits into apache:main from faster_byte_stream_split on Nov 2, 2023

Conversation

pitrou
Member

@pitrou pitrou commented Oct 31, 2023

Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

  • before:
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
  • after:
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s

Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented Oct 31, 2023

@cyb70289 Do you want to take a look at how this performs on ARM?

}
}

inline void DoMergeStreams(const uint8_t** src_streams, int width, int64_t nvalues,
Member

Can width be a template argument here (and would it benefit performance)?

Member Author

Yes, it can, and no, it probably wouldn't, as the compiler should inline this function into the caller.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 31, 2023
@pitrou
Member Author

pitrou commented Oct 31, 2023

Note that one reason for this work is that we might want to generalize BYTE_STREAM_SPLIT to more data types.

@cyb70289
Contributor

cyb70289 commented Nov 1, 2023

Big performance improvement observed on Arm Neoverse-N2 CPU. Great work!

  • before
BM_ByteStreamSplitDecode_Float_Scalar/1024         2241 ns         2241 ns       312363 bytes_per_second=1.70208G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         8949 ns         8949 ns        78233 bytes_per_second=1.70515G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       71555 ns        71557 ns         9780 bytes_per_second=1.70592G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536      143178 ns       143174 ns         4889 bytes_per_second=1.7052G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        4035 ns         4035 ns       173607 bytes_per_second=1.89097G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       16115 ns        16114 ns        43351 bytes_per_second=1.89381G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768     138440 ns       138442 ns         5055 bytes_per_second=1.76348G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     277661 ns       277651 ns         2522 bytes_per_second=1.75862G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024         2275 ns         2275 ns       307740 bytes_per_second=1.6771G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         9107 ns         9107 ns        77037 bytes_per_second=1.67555G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       72365 ns        72364 ns         9670 bytes_per_second=1.68689G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536      145514 ns       145518 ns         4809 bytes_per_second=1.67773G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        6262 ns         6262 ns       111792 bytes_per_second=1.21838G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       25252 ns        25253 ns        27710 bytes_per_second=1.20847G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     162099 ns       162096 ns         4319 bytes_per_second=1.50614G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     323703 ns       323682 ns         2162 bytes_per_second=1.50852G/s
  • after
BM_ByteStreamSplitDecode_Float_Scalar/1024         1129 ns         1129 ns       619351 bytes_per_second=3.37958G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4499 ns         4499 ns       155591 bytes_per_second=3.39128G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       36209 ns        36210 ns        19331 bytes_per_second=3.37113G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       72520 ns        72521 ns         9650 bytes_per_second=3.36649G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        2999 ns         2999 ns       233336 bytes_per_second=2.54407G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       11979 ns        11979 ns        58390 bytes_per_second=2.54754G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      96324 ns        96323 ns         7266 bytes_per_second=2.5346G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     194568 ns       194574 ns         3595 bytes_per_second=2.50949G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          936 ns          936 ns       744854 bytes_per_second=4.07715G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         3705 ns         3705 ns       188993 bytes_per_second=4.11788G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       29817 ns        29816 ns        23459 bytes_per_second=4.09408G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       60051 ns        60050 ns        11632 bytes_per_second=4.06565G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1872 ns         1871 ns       373845 bytes_per_second=4.07669G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        7471 ns         7471 ns        93678 bytes_per_second=4.0849G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      60360 ns        60362 ns        11600 bytes_per_second=4.04463G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     121104 ns       121107 ns         5776 bytes_per_second=4.0318G/s

@cyb70289
Contributor

cyb70289 commented Nov 1, 2023

It looks like the major improvement is that the total instruction count drops significantly. For the BM_ByteStreamSplitEncode_Double_Scalar/65536 test, instructions per iteration drop from 5,478,109 to 1,915,403.

I roughly estimated the total instructions needed to process 32 doubles: the original code executes 1760 instructions, while the optimized code only needs 783.

  • Original:
BM_ByteStreamSplitEncode_Double_Scalar -> DoSplitStreams(num_values)
kNumStreams=sizeof(double)=8

  +--->19f78:       8b0200e1        add     x1, x7, x2
  |    19f7c:       8b050266        add     x6, x19, x5
  |    19f80:       d2800000        mov     x0, #0x0                        // #0
  | +->19f84:       386068c3        ldrb    w3, [x6, x0]
  | |  19f88:       91000400        add     x0, x0, #0x1
  | |  19f8c:       39000023        strb    w3, [x1]
  | |  19f90:       8b040021        add     x1, x1, x4
  | |  19f94:       f100201f        cmp     x0, #0x8  // for (size_t j = 0U; j < kNumStreams=8; ++j)
  | +--19f98:       54ffff61        b.ne    19f84 <parquet::BM_ByteStreamSplitEncode_Double_Scalar(benchmark::State&)+0x224>  // b.any
  |    19f9c:       91000442        add     x2, x2, #0x1
  |    19fa0:       910020a5        add     x5, x5, #0x8
  |    19fa4:       eb02009f        cmp     x4, x2    // for (size_t i = 0U; i < num_values; ++i)
  +----19fa8:       54fffe81        b.ne    19f78 <parquet::BM_ByteStreamSplitEncode_Double_Scalar(benchmark::State&)+0x218>  // b.any
Assuming num_values = 32, the total instruction count is calculated as below:
outer loop: count = 32,   instructions = 7*32 = 224 (exclude inner loop)
inner loop: count = 32*8, instructions = 8*32*6 = 1536
Total instructions = 224+1536 = 1760
  • Optimized:
BM_ByteStreamSplitEncode_Double_Scalar -> DoSplitStreams(nvalues, width=sizeof(double)=8)

+----->19cb8:       aa1303ef        mov     x15, x19
|      19cbc:       aa0003ee        mov     x14, x0
|      19cc0:       5280000c        mov     w12, #0x0                       // #0
| +--->19cc4:       f94001f0        ldr     x16, [x15]
| |    19cc8:       aa0e03e4        mov     x4, x14
| |    19ccc:       d2800005        mov     x5, #0x0                        // #0
| | +->19cd0:       39404086        ldrb    w6, [x4, #16]
| | |  19cd4:       3940808b        ldrb    w11, [x4, #32]
| | |  19cd8:       39402082        ldrb    w2, [x4, #8]
| | |  19cdc:       3940608a        ldrb    w10, [x4, #24]
| | |  19ce0:       d370bcc6        lsl     x6, x6, #16
| | |  19ce4:       3940c088        ldrb    w8, [x4, #48]
| | |  19ce8:       d3607d6b        lsl     x11, x11, #32
| | |  19cec:       3940a087        ldrb    w7, [x4, #40]
| | |  19cf0:       aa0220c2        orr     x2, x6, x2, lsl #8
| | |  19cf4:       aa0a616a        orr     x10, x11, x10, lsl #24
| | |  19cf8:       3940e086        ldrb    w6, [x4, #56]
| | |  19cfc:       3844048b        ldrb    w11, [x4], #64
| | |  19d00:       d3503d08        lsl     x8, x8, #48
| | |  19d04:       aa0a0042        orr     x2, x2, x10
| | |  19d08:       aa07a107        orr     x7, x8, x7, lsl #40
| | |  19d0c:       aa070042        orr     x2, x2, x7
| | |  19d10:       aa06e166        orr     x6, x11, x6, lsl #56
| | |  19d14:       aa060042        orr     x2, x2, x6
| | |  19d18:       f8256a02        str     x2, [x16, x5]
| | |  19d1c:       910020a5        add     x5, x5, #0x8
| | |  19d20:       f10080bf        cmp     x5, #0x20  // for (int i = 0; i < kBlockSize=32; i += 8)
| | +--19d24:       54fffd61        b.ne    19cd0 <parquet::BM_ByteStreamSplitEncode_Double_Scalar(benchmark::State&)+0x260>  // b.any
| |    19d28:       91008210        add     x16, x16, #0x20
| |    19d2c:       1100058c        add     w12, w12, #0x1
| |    19d30:       f80085f0        str     x16, [x15], #8
| |    19d34:       910005ce        add     x14, x14, #0x1
| |    19d38:       7100219f        cmp     w12, #0x8  // for (int stream = 0; stream < width=8; ++stream)
| +----19d3c:       54fffc41        b.ne    19cc4 <parquet::BM_ByteStreamSplitEncode_Double_Scalar(benchmark::State&)+0x254>  // b.any
|      19d40:       d1008021        sub     x1, x1, #0x20
|      19d44:       91040000        add     x0, x0, #0x100
|      19d48:       f1007c3f        cmp     x1, #0x1f  // while (nvalues >= kBlockSize=32)
+------19d4c:       54fffb6c        b.gt    19cb8 <parquet::BM_ByteStreamSplitEncode_Double_Scalar(benchmark::State&)+0x248>
Assuming nvalues = 32, the total instruction count is calculated as below:
outer loop:  count = 1,   instructions = 7 (exclude middle and inner loops)
middle loop: count = 8,   instructions = 8*9 = 72 (exclude inner loop)
inner loop:  count = 8*4, instructions = 8*4*22 = 704
Total instructions = 7+72+704 = 783

@mapleFU
Member

mapleFU commented Nov 1, 2023

(What about decoding?)

@cyb70289
Contributor

cyb70289 commented Nov 1, 2023

(What about decoding?)

Should be similar: instructions per iteration drop significantly (5,213,354 -> 1,703,533).
Take the exact values with a grain of salt; I just divided the total instructions (from perf) by the iteration count shown in the benchmark report, and I guess warmup time is not included in the reported iterations.

@cyb70289
Contributor

cyb70289 commented Nov 1, 2023

Another improvement I see is better CPU pipeline parallelism in the optimized code.
IPC (instructions per cycle) increases from 4.06 to 4.83. This is amazing, as the highest theoretical IPC of Neoverse-N2 is 5.
The top-down metric "retiring" increases from 67.7% to 92.6%, which means 92.6% of CPU cycles are doing useful work, a very high value.

uint64_t f = src[stream + (i + 5) * width];
uint64_t g = src[stream + (i + 6) * width];
uint64_t h = src[stream + (i + 7) * width];
#if ARROW_LITTLE_ENDIAN
Member

What if the endianness of the machine compiling this code differs from that of the machine running it?

Member Author

Then the whole of Arrow C++ is miscompiled?

Member Author

Is there any situation where compile-time ARROW_LITTLE_ENDIAN might be different from the runtime platform's endianness?

Member

Cross-compiling?

Member Author

Here is the current definition. Can it be mis-compiled?

#ifndef __BYTE_ORDER__
#error "__BYTE_ORDER__ not defined"
#endif
#
#ifndef __ORDER_LITTLE_ENDIAN__
#error "__ORDER_LITTLE_ENDIAN__ not defined"
#endif
#
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARROW_LITTLE_ENDIAN 1
#else
#define ARROW_LITTLE_ENDIAN 0
#endif

Member Author

You may want to open an issue for this. There doesn't seem to be a reliable way to make a compile-time or constexpr check in C++17, AFAICT.

Member

Then the whole of Arrow C++ is miscompiled?

Maybe it's OK to test it when starting the user-side program, since this is not the only place that could cause problems if cross-compilation went wrong.

Member

What about opening an issue to switch to std::endian once we start C++20 adoption?

Member

I'm not familiar with this part of the code, but that might mean introducing a kind of dynamic dispatch: if LITTLE_ENDIAN, take one code path, otherwise go to the BIG_ENDIAN code.

As you referenced:

As stated earlier, the only "real" way to detect Big Endian is to use runtime tests.

std::endian also seems to be a compile-time check rather than a runtime one. See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0463r1.html

Member

IMO, a lot of code already uses compile-time macros. If we run into this problem, we could have a startup test for SIMD instructions, endianness, etc., rather than dynamically dispatching here...

@pitrou pitrou changed the title [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT GH-38542: [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT Nov 1, 2023

github-actions bot commented Nov 1, 2023

⚠️ GitHub issue #38542 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou marked this pull request as ready for review November 1, 2023 16:25
@pitrou
Member Author

pitrou commented Nov 1, 2023

I've added some more extensive tests for the various implementations and this is now ready for re-review. @wgtmac @mapleFU @cyb70289

@mapleFU mapleFU left a comment

This patch LGTM

Also, if src in DoSplit* doesn't fit in L1 Cache, would it get much slower?

uint8_t** dest_streams) {
// Value empirically chosen to provide the best performance on the author's machine
constexpr int kBlockSize = 32;

Member

Nit: do we need a DCHECK that the width is one of the expected values?

Member Author

What do you mean?

Member

width can only be 4/8 now, right? After FP16 is introduced, we might support 2/4/8; should other values be rejected?

Member Author

Well, this function should support any non-zero width, so I don't think it is necessary to reject anything here.

Member

Hmmmmm here it only supports FLOAT/DOUBLE?

https://github.com/apache/parquet-format/blob/master/Encodings.md#byte-stream-split-byte_stream_split--9

But I think it's also fine to accept any width > 0, if you prefer.

Member Author

Yes, Parquet only supports FLOAT/DOUBLE, but there is no reason for this function to be restricted.

Member

I agree not to restrict the width unless required. This is now a common util not specific to parquet.

@pitrou
Member Author

pitrou commented Nov 1, 2023

Also, if src in DoSplit* doesn't fit in L1 Cache, would it get much slower?

It's also already out of L1 cache when num_values = 65536.

@mapleFU
Member

mapleFU commented Nov 1, 2023

It's also already out of L1 cache when num_values = 65536.

Oh, now I finally get it. It might benefit from CPU prefetching? Thanks for the great code here.

@pitrou
Member Author

pitrou commented Nov 1, 2023

It might benefit from CPU prefetching?

Yes, I think it should, though to be honest I haven't tried to check. But 64k double values is already 512 kB, so 1 MB with both the input and output buffers, which is more than the L2 cache size on my CPU...

@mapleFU mapleFU left a comment

The code LGTM. I'm not very familiar with writing high-performance code, so let's wait for @cyb70289's further review.

@cyb70289
Contributor

cyb70289 commented Nov 2, 2023

I compared L1 and L2 miss events. L1 data cache MPKI (misses per kilo instructions) drops significantly from 49.6 to 10.2, while L2 MPKI increases from 0.062 to 0.236.

So the optimized code is also better for the L1 data cache. My understanding is as below:

  • Original code loads data sequentially: offset 0,1,2,3,...
  • Optimized code loads data block by block, in an interleaved pattern:
1st outer loop: offset 0,  8, 16, 24, ..., 120
2nd outer loop: offset 1,  9, 17, 25, ..., 121
...
8th outer loop: offset 7, 15, 23, 31, ..., 127

For the optimized code, the 1st loop populates the L1 data cache with all the data needed by the next 7 loops.
The original code may be better for the L2 prefetcher, as its pattern is quite easy to predict, but the optimized code reduces L1 misses, which is a big win.

@cyb70289 cyb70289 left a comment

LGTM

@pitrou pitrou merged commit 7281851 into apache:main Nov 2, 2023
49 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Nov 2, 2023
@pitrou pitrou deleted the faster_byte_stream_split branch November 2, 2023 09:07

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7281851.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
@jorisvandenbossche
Member

Haven't read through the discussion here, but from the release performance report, I noticed that the BM_ByteStreamSplitDecode_xx_Scalar benchmarks showed a regression apparently for this commit. See for example https://conbench.ursa.dev/benchmark-results/06543cccc43d719180005886ca3643dd/

Not sure why the conbench report above didn't say anything about that though. The speedups of this PR might also be hardware related?

@mapleFU
Member

mapleFU commented Jan 12, 2024

The speedups of this PR might also be hardware related?

Maybe it's compiler-related 🤔 Auto-vectorization might optimize code like this patch's original version...

@pitrou
Member Author

pitrou commented Jan 12, 2024

It can be hardware- or compiler-related indeed.

@jorisvandenbossche
Member

OK, completely ignore my comment. The benchmark is expressed in "bytes per second", so higher is better ... Thus the correct conclusion is: the benchmarks nicely captured the significant improvement from this PR! ;)

@mapleFU
Member

mapleFU commented Jan 12, 2024

Aha...

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: apache#38542

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>