Tustvold/extract parquet statistics #16

tustvold

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead and others added 6 commits November 27, 2023 17:34
* fix: wrong result of range function

* fix test

* add ut

* add ut

* nit

* nit

---------

Co-authored-by: zhongjingxiong <zhongjingxiong@bytedance.com>
* refactor: output-ordering

* chore: test

* chore: cr comment

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

---------

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
pub(crate) fn prune_row_groups_by_statistics(
parquet_schema: &SchemaDescriptor,
Author

Tbh we need the full FileMetadata here in order to inspect the ColumnOrder - I decided against this as it would result in a load of test churn

}

// This could be made more efficient (#TBD)
let parquet_idx = (0..parquet_schema.columns().len())
Author

This is the fix for apache#8335

return Ok(new_empty_array(self.field.data_type()));
}
/// Extracts the min statistics from an iterator of [`ParquetStatistics`] to an [`ArrayRef`]
pub fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
Author

Originally I had this as ColumnChunkMetadata; however, we don't actually need anything beyond the ParquetStatistics, so it seemed peculiar to require the full ColumnChunkMetadata. Additionally, one way to support the column index using the same array logic would be to coerce both kinds of statistics to the same representation.
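
A minimal sketch of the shape described above, using simplified stand-in types rather than the real parquet and arrow ones. `Stats` and `min_i32` are hypothetical names introduced for illustration, and the result is a plain `Vec<Option<i32>>` instead of an `ArrayRef`. The point is only that the extraction needs nothing beyond an iterator of optional statistics, so any source that can be coerced into that form could reuse the same logic.

```rust
/// Hypothetical stand-in for the statistics of one column chunk.
/// Some fields and variants are unused in this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Stats {
    Int32 { min: Option<i32>, max: Option<i32> },
    Boolean { min: Option<bool>, max: Option<bool> },
}

/// Collects the minimum of each chunk into a nullable column.
/// A chunk with no statistics, no recorded minimum, or a different
/// physical type contributes `None`.
fn min_i32<'a, I>(iter: I) -> Vec<Option<i32>>
where
    I: Iterator<Item = Option<&'a Stats>>,
{
    iter.map(|s| match s {
        Some(Stats::Int32 { min, .. }) => *min,
        _ => None,
    })
    .collect()
}

fn main() {
    // One entry per row group; the second has no statistics at all.
    let row_groups = vec![
        Some(Stats::Int32 { min: Some(1), max: Some(10) }),
        None,
        Some(Stats::Int32 { min: None, max: Some(7) }),
    ];
    let mins = min_i32(row_groups.iter().map(|s| s.as_ref()));
    println!("{:?}", mins); // prints [Some(1), None, None]
}
```

Because the function only sees `Option<&Stats>`, the caller decides where the statistics come from, which is what keeps the signature independent of the full column chunk metadata.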

pub(crate) struct RowGroupStatisticsConverter<'a> {
field: &'a Field,
/// Returns the parquet column index and the corresponding arrow field
pub fn parquet_column<'a>(
Author

This method is basically a hack; if/when we upstream this, we can use the parquet-private ParquetField, which handles this properly.
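
To show why such a lookup is needed at all, here is a rough sketch with simplified stand-in types. `LeafColumn`, `Field`, and their members are hypothetical and are not the parquet or arrow APIs; skipping nested fields and matching on the root index are assumptions made for this illustration, not a description of the PR's actual code. The underlying fact is that parquet numbers its leaf columns, so a top-level field's position in the arrow schema can differ from its parquet column index when a nested field comes before it.

```rust
/// Hypothetical stand-in for a parquet leaf column: it records only the
/// index of the top-level (root) schema field that the leaf belongs to.
struct LeafColumn {
    root_idx: usize,
}

/// Hypothetical stand-in for an arrow field.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    nested: bool,
}

/// Returns the leaf column index and the matching field for `name`.
/// Assumption for this sketch: nested fields are not supported and yield `None`.
fn parquet_column<'a>(
    leaves: &[LeafColumn],
    fields: &'a [Field],
    name: &str,
) -> Option<(usize, &'a Field)> {
    // Position of the field within the top-level schema.
    let (root_idx, field) = fields.iter().enumerate().find(|(_, f)| f.name == name)?;
    if field.nested {
        return None;
    }
    // Linear scan for the first leaf whose root is that field.
    let leaf_idx = leaves.iter().position(|c| c.root_idx == root_idx)?;
    Some((leaf_idx, field))
}

fn main() {
    // Schema `a: struct<x, y>, b: int` has three leaves: a.x, a.y, b.
    let leaves = [
        LeafColumn { root_idx: 0 },
        LeafColumn { root_idx: 0 },
        LeafColumn { root_idx: 1 },
    ];
    let fields = [
        Field { name: "a".to_string(), nested: true },
        Field { name: "b".to_string(), nested: false },
    ];
    // `b` is field 1 in the top-level schema but leaf column 2.
    assert_eq!(parquet_column(&leaves, &fields, "b").map(|(i, _)| i), Some(2));
    assert_eq!(parquet_column(&leaves, &fields, "a"), None);
}
```

The linear scan keeps the sketch short; a precomputed map from root index to leaf index would avoid repeating it for every column.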

@alamb alamb merged commit 641142b into alamb:alamb/extract_parquet_statistics Nov 28, 2023
1 check passed
5 participants