
GH-39122: [C++][Parquet] Optimize FLBA record reader #39124

Merged 2 commits into apache:main on Dec 9, 2023

Conversation

@pitrou pitrou commented Dec 7, 2023

Rationale for this change

The FLBA implementation of RecordReader is suboptimal:

  • it doesn't preallocate the output array
  • it reads the decoded validity bitmap one bit at a time and recreates it, one bit at a time

What changes are included in this PR?

Optimize the FLBA implementation of RecordReader so as to avoid the aforementioned inefficiencies.
I did a quick-and-dirty benchmark on a Parquet file with two columns:

  • column 1: uncompressed, PLAIN-encoded, FLBA<3> with no nulls
  • column 2: uncompressed, PLAIN-encoded, FLBA<3> with 25% nulls

With git main, the file can be read at 465 MB/s. With this PR, the file can be read at 700 MB/s.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

pitrou commented Dec 7, 2023

This is an alternative to #39120. @Hattonuri could you try it out?

pitrou commented Dec 7, 2023

@ursabot please benchmark

@pitrou pitrou marked this pull request as ready for review December 7, 2023 15:02
@pitrou pitrou requested a review from wgtmac as a code owner December 7, 2023 15:02
ursabot commented Dec 7, 2023

Benchmark runs are scheduled for commit e0b5609. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou commented Dec 7, 2023

@github-actions crossbow submit -g cpp

github-actions bot commented Dec 7, 2023

Revision: e0b5609

Submitted crossbow builds: ursacomputing/crossbow @ actions-c4f6660a25

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-38-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@mapleFU mapleFU left a comment

+1

Review comment on cpp/src/parquet/column_reader.cc (outdated; resolved)
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 7, 2023
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit e0b5609.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details.

Hattonuri commented Dec 7, 2023

Compared to my pull request, this one gives a 5% speedup.

But I accidentally found that, between this commit and 2dcee3f, some commit doubled the number of page faults.

With this commit I get:

Command being timed: "./parquet_playground"
User time (seconds): 132.78
System time (seconds): 101.82
Percent of CPU this job got: 497%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:47.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 654652
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 56145278
Voluntary context switches: 69281
Involuntary context switches: 716
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

And before it was:

Command being timed: "./parquet_playground"
User time (seconds): 153.61
System time (seconds): 44.16
Percent of CPU this job got: 506%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:39.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 707356
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 27169121
Voluntary context switches: 46224
Involuntary context switches: 570
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

I guess your pull request optimizes user time. But as we can see, page faults increased from 27169121 to 56145278, system time increased by about 2.5 times, and context switches increased by about 1.5 times.
In my "playground" I just create an acero ScanNode + SinkNode and use MakeGeneratorReader, calling Next() to read ListArrays of Decimal128 with length <= 20 (almost always 20).
Fragment readahead is 2, batch readahead is 1, and require_sequenced_output is enabled with the default backpressure.

pitrou commented Dec 7, 2023

I don't think we can infer anything from that number of page faults, unless it is shown to be otherwise stable.

@Hattonuri

When capturing perf record -D 1000 -e minor-faults -g --freq=max --call-graph=dwarf,32768,
this is the old one:
[perf report screenshot]

and this is the new one:
[perf report screenshot]

The big difference shows up in ZSTD_decompressSequences_bmi2: it started to cause 8 times more minor faults.
CreateBuffer also started to take much more time. RawBytesToDecimal probably takes more time too, or it is part of the transfer to decimal, as is reading values spaced.

mapleFU commented Dec 8, 2023

🤔 Benchmarking with the SystemAllocator is a bit weird, though; should we use something like jemalloc?

Also, ZSTD_decompressSequences_bmi2 shouldn't behave any differently if the pages we read haven't changed, unless we are reading more pages?

pitrou commented Dec 8, 2023

Big difference is seen in ZSTD_decompressSequences_bmi2 - it started to cause 8 times more minor faults

Are you just comparing percentages? This is silly, isn't it?

Hattonuri commented Dec 8, 2023

🤔 Benchmarking with the SystemAllocator is a bit weird, though; should we use something like jemalloc?

Also, ZSTD_decompressSequences_bmi2 shouldn't behave any differently if the pages we read haven't changed, unless we are reading more pages?

I compile my whole project with tcmalloc.

Are you just comparing percentages? This is silly, isn't it?

I don't think so in this case, because both the page fault count and the zstd share increased, so we can still see that something changed.

I will try to bisect the breaking commit. But I think this needs another issue, because the problem is evidently not with this PR.

pitrou commented Dec 8, 2023

I will try to bisect the breaking commit.

No "breakage" has been proven, though. You've found a variation in the number of page faults. So what? Are you sure the system was quiet? Are you sure the number of page faults is otherwise a stable quantity?

That said, it will be interesting to see the results of your investigation.

Hattonuri commented Dec 8, 2023

Are you sure the system was quiet?

It did not affect the measurements, because I interleaved them like this: build + perf new, build + perf old, build + perf new, and so on, for 5 rounds.

Are you sure the number of page faults is otherwise a stable quantity?

Yes, the results are stable; the page fault numbers repeat.

@Hattonuri

About the investigation: I mean that I will file the results in another issue, because they repeat stably, so I can find the cause with a bisect.

wgtmac commented Dec 9, 2023

BTW, this patch reminds me of this ancient PR: #14353

@pitrou pitrou merged commit 20c975d into apache:main Dec 9, 2023
29 of 31 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Dec 9, 2023
@pitrou pitrou deleted the gh39122-flba-record-reader branch December 9, 2023 16:23
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 20c975d.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Dec 13, 2023
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024

Successfully merging this pull request may close these issues.

[C++] [Parquet] FLBA reader does not pre-reserve memory
5 participants