feat:Add function for row alignment with page mask #1791

Merged
merged 13 commits into from
Jun 6, 2022
1 change: 1 addition & 0 deletions parquet/src/file/metadata.rs
@@ -223,6 +223,7 @@ pub struct RowGroupMetaData {
num_rows: i64,
total_byte_size: i64,
schema_descr: SchemaDescPtr,
// Todo add filter result -> row range
Contributor

The more I think about this the more I wonder whether the metadata structs are the right place to put the index information. They're parsed and interpreted separately from the main metadata, and so I think it makes sense for them to be stored separately?

Member Author

Yes, you are right. The page index is stored at the file-meta level.
My thought is to read less of the page index by reading it after the row group filter:

let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
for (i, rg_meta) in row_groups.into_iter().enumerate() {
    let mut keep = true;
    for predicate in &mut predicates {
        if !predicate(&rg_meta, i) {
            keep = false;
            break;
        }
    }
    if keep {
        filtered_row_groups.push(rg_meta);
    }
}

metadata: ParquetMetaData::new(
    metadata.file_metadata().clone(),
    filtered_row_groups,
),

So I want to read the index here and insert it into RowGroupMetaData.
It was just a simple idea at first; maybe we can find a better way in the process of implementation.
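
The filtering loop quoted above can be tried in isolation with the minimal sketch below. `RowGroupMeta`, `RowGroupPredicate`, and `filter_row_groups` are hypothetical stand-ins introduced only for illustration; they are not the crate's types. The sketch also keeps each surviving row group's original index, on the assumption that a later step would need it to decide which page indexes to read.

```rust
/// Hypothetical stand-in for the crate's `RowGroupMetaData`;
/// only the row count is modeled here.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    num_rows: i64,
}

/// A predicate receives the row group metadata and its index in the file.
type RowGroupPredicate = Box<dyn FnMut(&RowGroupMeta, usize) -> bool>;

/// Keeps the row groups accepted by every predicate, paired with their
/// original index in the file.
fn filter_row_groups(
    row_groups: Vec<RowGroupMeta>,
    predicates: &mut [RowGroupPredicate],
) -> Vec<(usize, RowGroupMeta)> {
    let mut kept = Vec::new();
    for (i, rg_meta) in row_groups.into_iter().enumerate() {
        // `all` stops at the first predicate that rejects the row group,
        // which matches the `break` in the loop quoted above.
        if predicates.iter_mut().all(|predicate| predicate(&rg_meta, i)) {
            kept.push((i, rg_meta));
        }
    }
    kept
}
```

Because the result records which row groups survived, index data would only have to be fetched for those entries.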

Contributor

> The page index is stored at the file-meta level.

It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅

> It was just a simple idea at first; maybe we can find a better way in the process of implementation.

Provided we take care to ensure we keep things pub(crate) so we don't break APIs, this seems like a good strategy 👍

Member Author (@Ted-Jiang, Jun 6, 2022)

> It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅

Yes, it is stored separately from the row groups, before the footer! 😂
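
Since the index lives outside the footer, a reader has to locate it through offsets recorded per column chunk. The sketch below is one way to compute the byte span holding a row group's column indexes, so that only the spans of retained row groups need to be fetched. `ColumnChunkIndexLocation` and `column_index_range` are hypothetical names; the struct only mirrors the optional `column_index_offset` and `column_index_length` fields that the Parquet format defines on a column chunk, and nothing here is the crate's API.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the fields that locate one column chunk's
/// column index. Both are optional because a writer may omit the index.
#[derive(Debug, Clone, Copy)]
struct ColumnChunkIndexLocation {
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
}

/// Returns one byte range covering the column indexes of every chunk in a
/// row group, or `None` if there are no chunks or any chunk has no index.
fn column_index_range(chunks: &[ColumnChunkIndexLocation]) -> Option<Range<i64>> {
    let mut covered: Option<Range<i64>> = None;
    for chunk in chunks {
        let start = chunk.column_index_offset?;
        let end = start.checked_add(i64::from(chunk.column_index_length?))?;
        covered = Some(match covered {
            None => start..end,
            // Widen the span so that it also covers this chunk's index.
            Some(r) => r.start.min(start)..r.end.max(end),
        });
    }
    covered
}
```

Fetching one span per retained row group is a design choice of this sketch: it assumes the indexes of a row group's chunks sit close together in the file, which makes a single read cheaper than one read per chunk.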

}

impl RowGroupMetaData {
1 change: 1 addition & 0 deletions parquet/src/file/page_index/mod.rs
@@ -17,3 +17,4 @@

pub mod index;
pub mod index_reader;
pub(crate) mod range;
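
The contents of the new `range` module are not shown in this excerpt. As a rough illustration of what the PR title and the `filter result -> row range` TODO describe, the sketch below turns a per-page keep mask into merged row ranges. `page_mask_to_row_ranges` and its parameters are hypothetical; the only assumption taken from the Parquet format is that the offset index records the first row index of each page.

```rust
use std::ops::Range;

/// Converts a per-page keep mask into row ranges, merging adjacent pages.
/// `first_row_indexes[i]` is the first row of page `i`; the last page ends
/// at `total_rows`.
fn page_mask_to_row_ranges(
    first_row_indexes: &[usize],
    total_rows: usize,
    mask: &[bool],
) -> Vec<Range<usize>> {
    assert_eq!(first_row_indexes.len(), mask.len());
    let mut ranges: Vec<Range<usize>> = Vec::new();
    for (i, &keep) in mask.iter().enumerate() {
        if !keep {
            continue;
        }
        let start = first_row_indexes[i];
        // A page ends where the next page starts, or at the end of the row group.
        let end = first_row_indexes.get(i + 1).copied().unwrap_or(total_rows);
        if let Some(last) = ranges.last_mut() {
            if last.end == start {
                // This page directly follows the previous kept page: extend it.
                last.end = end;
                continue;
            }
        }
        ranges.push(start..end);
    }
    ranges
}
```

Row ranges, unlike page numbers, are comparable across columns whose pages break at different rows, which is presumably why the filter result is expressed as rows.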