
GH-39122: [C++][Parquet] Optimize FLBA record reader #39124

Merged 2 commits into apache:main on Dec 9, 2023

Conversation

@pitrou pitrou commented Dec 7, 2023

Rationale for this change

The FLBA implementation of RecordReader is suboptimal:

  • it doesn't preallocate the output array
  • it reads the decoded validity bitmap one bit at a time and recreates it, one bit at a time

What changes are included in this PR?

Optimize the FLBA implementation of RecordReader so as to avoid the aforementioned inefficiencies.
I did a quick-and-dirty benchmark on a Parquet file with two columns:

  • column 1: uncompressed, PLAIN-encoded, FLBA<3> with no nulls
  • column 2: uncompressed, PLAIN-encoded, FLBA<3> with 25% nulls

With git main, the file can be read at 465 MB/s. With this PR, the file can be read at 700 MB/s.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

pitrou commented Dec 7, 2023

This is an alternative to #39120. @Hattonuri could you try it out?

pitrou commented Dec 7, 2023

@ursabot please benchmark

@pitrou pitrou marked this pull request as ready for review December 7, 2023 15:02
@pitrou pitrou requested a review from wgtmac as a code owner December 7, 2023 15:02
ursabot commented Dec 7, 2023

Benchmark runs are scheduled for commit e0b5609. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou commented Dec 7, 2023

@github-actions crossbow submit -g cpp

github-actions bot commented Dec 7, 2023

Revision: e0b5609

Submitted crossbow builds: ursacomputing/crossbow @ actions-c4f6660a25

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-38-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@mapleFU mapleFU left a comment

+1

Review comment on cpp/src/parquet/column_reader.cc (outdated; resolved)
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 7, 2023
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit e0b5609.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details.

Hattonuri commented Dec 7, 2023

Compared to my pull request, this one gives a 5% speedup.

But I accidentally found that, between this commit and 2dcee3f, some commit doubled the number of page faults.

With this commit I get:

Command being timed: "./parquet_playground"
User time (seconds): 132.78
System time (seconds): 101.82
Percent of CPU this job got: 497%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:47.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 654652
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 56145278
Voluntary context switches: 69281
Involuntary context switches: 716
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

And before it was:

Command being timed: "./parquet_playground"
User time (seconds): 153.61
System time (seconds): 44.16
Percent of CPU this job got: 506%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:39.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 707356
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 27169121
Voluntary context switches: 46224
Involuntary context switches: 570
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

I guess your pull request optimizes user time. But as we can see, page faults increased from 27169121 to 56145278, system time increased by about 2.5 times, and context switches increased by about 1.5 times.
In my "playground" I just create an acero ScanNode + SinkNode and use MakeGeneratorReader, calling Next() to read ListArrays of Decimal128 with length <= 20 (almost always 20).
Fragment readahead is 2, batch readahead is 1, and require_sequenced_output is enabled with the default backpressure.

pitrou commented Dec 7, 2023

I don't think we can infer anything from that number of page faults, unless it is shown to be otherwise stable.

@Hattonuri

When capturing perf record -D 1000 -e minor-faults -g --freq=max --call-graph=dwarf,32768,
this is the old one:
[perf report screenshot]

and this is the new one:
[perf report screenshot]

The big difference shows up in ZSTD_decompressSequences_bmi2: it started to cause 8 times more minor faults.
CreateBuffer also started to take much more time. RawBytesToDecimal probably takes more time too, or it is part of the transfer to decimal, as is reading values spaced.

mapleFU commented Dec 8, 2023

🤔 Benchmarking with the SystemAllocator is a bit weird, though; should we use something like jemalloc?

Also, ZSTD_decompressSequences_bmi2 shouldn't behave any differently if the pages we read haven't changed, unless we are reading more pages?

pitrou commented Dec 8, 2023

Big difference is seen in ZSTD_decompressSequences_bmi2 - it started to cause 8 times more minor faults

Are you just comparing percentages? This is silly, isn't it?

Hattonuri commented Dec 8, 2023

🤔 Benchmarking with the SystemAllocator is a bit weird, though; should we use something like jemalloc?

Also, ZSTD_decompressSequences_bmi2 shouldn't behave any differently if the pages we read haven't changed, unless we are reading more pages?

I compile my whole project with tcmalloc.

Are you just comparing percentages? This is silly, isn't it?

I don't think so in this case, because both the page fault count and the zstd share increased, so we can still see that something changed.

I will try to bisect the breaking commit. But I think this needs another issue, because the problem is evidently not with this PR.

pitrou commented Dec 8, 2023

I will try to bisect the breaking commit.

No "breakage" has been proven, though. You've found a variation in the number of page faults. So what? Are you sure the system was quiet? Are you sure the number of page faults is otherwise a stable quantity?

That said, it will be interesting to see the results of your investigation.

Hattonuri commented Dec 8, 2023

Are you sure the system was quiet?

It did not affect the measurements, because I interleaved them like this: build + perf new, build + perf old, build + perf new, and so on, for 5 rounds.

Are you sure the number of page faults is otherwise a stable quantity?

Yes, the results are stable; the page fault numbers repeat.

@Hattonuri

About the investigation: I mean that I will file the results in another issue, because they repeat stably, so I can find the cause with a bisect.

wgtmac commented Dec 9, 2023

BTW, this patch reminds me of this ancient PR: #14353

@pitrou pitrou merged commit 20c975d into apache:main Dec 9, 2023
29 of 31 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Dec 9, 2023
@pitrou pitrou deleted the gh39122-flba-record-reader branch December 9, 2023 16:23
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 20c975d.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Dec 13, 2023
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024

Successfully merging this pull request may close these issues.

[C++] [Parquet] FLBA reader does not pre-reserve memory
5 participants