Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace filter_row_groups with ReadOptions in parquet SerializedFileReader #1389

Merged
merged 7 commits into from
Mar 6, 2022

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Mar 3, 2022

Which issue does this PR close?

Closes #158.

Rationale for this change

To support parallel parquet reading at row group level.

What changes are included in this PR?

One extra parameter while filtering row groups using row group metadata.

The midpoint and range comparison are from parquet-mr ParquetInputSplit semantic by selecting belonged row groups.

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 3, 2022
@yjshen yjshen changed the title Filter row groups by comparing midpoint with offset range Introduce ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range. Mar 4, 2022
@yjshen yjshen requested review from alamb and sunchao March 4, 2022 08:33
@codecov-commenter
Copy link

codecov-commenter commented Mar 4, 2022

Codecov Report

Merging #1389 (fcf10f9) into master (89ee9ac) will increase coverage by 0.09%.
The diff coverage is 98.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1389      +/-   ##
==========================================
+ Coverage   83.03%   83.12%   +0.09%     
==========================================
  Files         181      181              
  Lines       52936    53329     +393     
==========================================
+ Hits        43955    44332     +377     
- Misses       8981     8997      +16     
Impacted Files Coverage Δ
parquet/src/file/serialized_reader.rs 95.65% <98.93%> (+1.01%) ⬆️
parquet/src/file/statistics.rs 91.73% <0.00%> (-2.07%) ⬇️
parquet/src/util/cursor.rs 77.31% <0.00%> (-1.69%) ⬇️
parquet/src/schema/types.rs 85.64% <0.00%> (-1.51%) ⬇️
arrow/src/csv/writer.rs 71.32% <0.00%> (-0.82%) ⬇️
arrow/src/array/array_dictionary.rs 91.12% <0.00%> (-0.47%) ⬇️
arrow/src/array/data.rs 83.15% <0.00%> (-0.44%) ⬇️
arrow/src/json/writer.rs 92.11% <0.00%> (-0.37%) ⬇️
arrow/src/ipc/writer.rs 83.14% <0.00%> (-0.32%) ⬇️
arrow/src/compute/kernels/temporal.rs 97.13% <0.00%> (-0.22%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 89ee9ac...fcf10f9. Read the comment docs.

@yjshen
Copy link
Member Author

yjshen commented Mar 4, 2022

Integration test failure seems unrelated.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yjshen . Looks great to me!

self
}

/// Add a range predicate on filtering row groups if their midpoints are within the range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe indicate whether the start and end is inclusive or exclusive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is important. I've updated the doc.

/// Get midpoint offset for a row group
fn get_midpoint_offset(meta: &RowGroupMetaData) -> i64 {
let col = meta.column(0);
let mut offset = col.data_page_offset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For encrypted Parquet files we'll need to use file_offset but it's fine for now since it's not supported anyways.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb alamb added the api-change Changes to the arrow API label Mar 6, 2022
@alamb alamb changed the title Introduce ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range. Introduce ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range. Mar 6, 2022
@alamb alamb merged commit 2bca71e into apache:master Mar 6, 2022
@alamb
Copy link
Contributor

alamb commented Mar 6, 2022

Thanks @yjshen

@alamb alamb changed the title Introduce ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range. Replace filter_row_groups with ReadOptions in parquet SerializedFileReader Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet ArrayReader should allow reading a subset of row groups
6 participants