ARROW-16294: [C++] Improve performance of parquet readahead #12967

westonpace · 2022-04-23T00:11:02Z

It turns out the batch size used for internal slicing in parquet (in the TableBatchReader) was not the batch size from the scanner. I added batch slicing in file_parquet so the parquet reader itself stays more or less unchanged (and it's not clear that slicing to the smaller batch size inside the reader would make more sense anyway).
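
As an illustration of the batch slicing described above, here is a minimal sketch that splits a block of rows into slices no larger than the scanner's batch size. It is not the code from this PR: `SliceOffsets` is a hypothetical helper, and it only computes (offset, length) pairs rather than slicing real record batches.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): split `num_rows` rows into
// (offset, length) slices of at most `batch_size` rows each.
std::vector<std::pair<int64_t, int64_t>> SliceOffsets(int64_t num_rows,
                                                      int64_t batch_size) {
  std::vector<std::pair<int64_t, int64_t>> slices;
  if (batch_size <= 0) {
    return slices;  // Nothing sensible to slice with; return no slices.
  }
  for (int64_t offset = 0; offset < num_rows; offset += batch_size) {
    // The last slice may be shorter than batch_size.
    slices.emplace_back(offset, std::min(batch_size, num_rows - offset));
  }
  return slices;
}
```

Each pair could then be handed to a slice call on the decoded batch, so that downstream consumers never see a batch larger than the requested size.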
Changed parquet readahead so it can read ahead more than one row group. It now tries to keep batch_size * batch_readahead rows in flight. Users who want more parallel reads can increase batch readahead.
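
To make the readahead target concrete, here is a rough sketch, assuming readahead is counted in rows and that whole row groups are requested until the target is met. `RowGroupsToFetch` and all of its parameters are hypothetical names chosen for illustration, not the implementation in reader.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow API): given the row count of each row
// group, return how many row groups, starting at `next_index`, would need to
// be requested so that at least batch_size * batch_readahead rows are in
// flight. `rows_in_flight` counts rows already requested but not yet consumed.
std::size_t RowGroupsToFetch(const std::vector<int64_t>& row_group_rows,
                             std::size_t next_index, int64_t rows_in_flight,
                             int64_t batch_size, int64_t batch_readahead) {
  const int64_t rows_to_readahead = batch_size * batch_readahead;
  std::size_t count = 0;
  while (next_index + count < row_group_rows.size() &&
         rows_in_flight < rows_to_readahead) {
    rows_in_flight += row_group_rows[next_index + count];
    ++count;
  }
  return count;
}
```

With row groups of 1000 rows each, a batch size of 500 and a batch readahead of 4, the target is 2000 rows, so two row groups are requested. With a batch readahead of 0 the loop never runs, which in this sketch corresponds to fetching only when a batch is asked for.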

github-actions · 2022-04-23T00:11:23Z

https://issues.apache.org/jira/browse/ARROW-16294

github-actions · 2022-04-23T00:11:24Z

⚠️ Ticket has no components in JIRA, make sure you assign one.

github-actions · 2022-04-23T00:11:25Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm

Looks like there's a lint failure, and MSVC is unhappy as usual:


D:/a/arrow/arrow/cpp/src/arrow/dataset/file_parquet.cc(465,1): error C2220: the following warning is treated as an error [D:\a\arrow\arrow\build\cpp\src\arrow\dataset\arrow_dataset_shared.vcxproj]
D:/a/arrow/arrow/cpp/src/arrow/dataset/file_parquet.cc(465,1): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [D:\a\arrow\arrow\build\cpp\src\arrow\dataset\arrow_dataset_shared.vcxproj]

westonpace · 2022-04-27T18:34:43Z

> Looks like there's a lint failure, and MSVC is unhappy as usual:

The dangers of thinking I can fix an MSVC complaint using GitHub's edit feature 😆

I think this is good now. I changed the variable itself to int64_t instead of adding a static_cast since it is now a "rows" variable and not a "batches" variable.

The behavior for 0 readahead is now "fetch batch when asked for" instead of defaulting to something like 1 row (which probably would have been the same thing anyways but I think this is more clear about our intentions).
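
The choice of int64_t over a static_cast can be illustrated with a small sketch. The function names and the example values are assumptions for illustration only; the point is that a product of two row-related quantities can exceed what a 32-bit `int` holds, so keeping the variable as int64_t avoids the narrowing that C4244 warns about.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: the readahead target expressed in rows.
// Keeping the result as int64_t means no narrowing conversion is needed.
int64_t RowsToReadahead(int64_t batch_size, int64_t batch_readahead) {
  return batch_size * batch_readahead;
}

// Hypothetical helper: true if `value` could be stored in an `int` without
// loss, i.e. if a static_cast<int> at the call site would be safe.
bool FitsInInt(int64_t value) {
  return value >= std::numeric_limits<int>::min() &&
         value <= std::numeric_limits<int>::max();
}
```

For example, `RowsToReadahead(1 << 20, 4096)` is 2^32 rows, which `FitsInInt` rejects on platforms with a 32-bit `int`, so a cast would have hidden a real loss of data rather than fixing it.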

lidavidm · 2022-04-27T18:39:31Z

cpp/src/parquet/arrow/reader.cc

+  }
+
+  void FetchNext() {
+    int row_group_index = readahead_index_++;


Seems MSVC is still unhappy here:

C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): error C2220: warning treated as error - no 'object' file generated
C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data
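
One way to avoid a C4267 warning of this kind is to keep the local index in the same type as the counter it is read from. The sketch below is a hypothetical reduction of the pattern in the diff, not the fix that was eventually committed; `RowGroupCursor` and its members are made-up names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reduction of the readahead cursor: walks a list of row group
// ids using a size_t counter.
class RowGroupCursor {
 public:
  explicit RowGroupCursor(std::vector<int> row_groups)
      : row_groups_(std::move(row_groups)) {}

  bool HasNext() const { return readahead_index_ < row_groups_.size(); }

  // Returns the id of the next row group and advances the cursor.
  // The caller must check HasNext() first.
  int FetchNext() {
    // The index stays a size_t, matching the counter, so there is no
    // size_t -> int conversion for the compiler to warn about.
    const std::size_t index = readahead_index_++;
    return row_groups_[index];
  }

 private:
  std::vector<int> row_groups_;
  std::size_t readahead_index_ = 0;
};
```

An explicit static_cast would also silence the warning, but matching the types keeps the intent visible and avoids any truncation.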

ursabot · 2022-05-04T00:18:22Z

Benchmark runs are scheduled for baseline = f03f090 and contender = 24f3722. 24f3722 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.31% ⬆️0.04%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.2% ⬆️0.08%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 24f37229 ec2-t3-xlarge-us-east-2
[Finished] 24f37229 test-mac-arm
[Finished] 24f37229 ursa-i9-9960x
[Finished] 24f37229 ursa-thinkcentre-m75q
[Finished] f03f090d ec2-t3-xlarge-us-east-2
[Failed] f03f090d test-mac-arm
[Finished] f03f090d ursa-i9-9960x
[Finished] f03f090d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-05-04T00:18:32Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

westonpace requested a review from lidavidm April 23, 2022 00:11

github-actions bot added Component: C++ Component: Python Component: Parquet labels Apr 23, 2022

lidavidm approved these changes Apr 23, 2022

cpp/src/parquet/arrow/reader.cc (outdated, resolved)

lidavidm reviewed Apr 23, 2022

cpp/src/parquet/arrow/reader.cc (outdated, resolved)

westonpace added 2 commits April 26, 2022 16:45

ARROW-16294: Adds batch slicing to parquet. Changes parquet readahead so instead of reading 1 row group ahead it reads enough row groups ahead to satisfy batch_size * batch_readahead.

731841e

ARROW-16294: Added logic for the no-readahead case. Renamed argument in the header to match the impl.

9783e31

westonpace force-pushed the feature/ARROW-16294--improve-parquet-readahead branch from 58d9269 to 9783e31 Compare April 27, 2022 03:04

lidavidm reviewed Apr 27, 2022

cpp/src/arrow/compute/exec/source_node.cc (outdated, resolved)

ARROW-16294: Changed rows_to_readahead to an int64_t to match other 'row count' variables in the area. Got rid of accidental iostream include.

46d2094

westonpace force-pushed the feature/ARROW-16294--improve-parquet-readahead branch from c1b2c38 to 46d2094 Compare April 27, 2022 16:12

westonpace requested a review from lidavidm April 27, 2022 18:34

lidavidm approved these changes Apr 27, 2022

ARROW-16294: signed/unsigned mismatch

d3b8e2f

lidavidm closed this in 24f3722 Apr 27, 2022

asfimport mentioned this pull request May 4, 2022

[C++] Improve performance of parquet readahead #31683

Closed
