-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16294: [C++] Improve performance of parquet readahead #12967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16294: [C++] Improve performance of parquet readahead #12967
Conversation
|
|
|
|
… so instead of reading 1 row group ahead it reads enough row groups ahead to satisfy batch_size * batch_readahead.
…in the header to match the impl.
58d9269 to
9783e31
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's a lint failure, and MSVC is unhappy as usual:
D:/a/arrow/arrow/cpp/src/arrow/dataset/file_parquet.cc(465,1): error C2220: the following warning is treated as an error [D:\a\arrow\arrow\build\cpp\src\arrow\dataset\arrow_dataset_shared.vcxproj]
D:/a/arrow/arrow/cpp/src/arrow/dataset/file_parquet.cc(465,1): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data [D:\a\arrow\arrow\build\cpp\src\arrow\dataset\arrow_dataset_shared.vcxproj]
…row count' variables in the area. Got rid of accidental iostream include.
c1b2c38 to
46d2094
Compare
The dangers of thinking I can fix an MSVC complaint using Github's edit feature 😆 I think this is good now. I changed the variable itself to int64_t instead of adding static_cast since it is now a "rows" variable and not a "batches" variable. The behavior for 0 readahead is now "fetch batch when asked for" instead of defaulting to something like 1 row (which probably would have been the same thing anyways but I think this is more clear about our intentions). |
cpp/src/parquet/arrow/reader.cc
Outdated
| } | ||
|
|
||
| void FetchNext() { | ||
| int row_group_index = readahead_index_++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems MSVC is still unhappy here:
C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): error C2220: warning treated as error - no 'object' file generated
C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data
|
Benchmark runs are scheduled for baseline = f03f090 and contender = 24f3722. 24f3722 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
batch_size * batch_readaheadreads in flight. Users that want more parallel reads can increase batch readahead.