Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change StatisticsConverter::row_group_counts to return None for non existent columns in parquet files #10965

Closed
Tracked by #10922
alamb opened this issue Jun 17, 2024 · 1 comment · Fixed by #10973
Closed
Tracked by #10922
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 17, 2024

Is your feature request related to a problem or challenge?

While working on #10926 @marvinlanhenke has an excellent question:

Why do we return row counts for a non existing column in row_group_row_counts?

(this was because there was some inconsistency between data pages and row counts)

The reason StatisticsConverter::row_group_counts returns row counts even for a non existent column is because it is the API needed for PruningStatistics here:

fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
// row counts are the same for all columns in a row group
StatisticsConverter::row_group_row_counts(self.metadata_iter())
.ok()
.map(|counts| Arc::new(counts) as ArrayRef)

It is possible to do because the ParquetMetadata knows how many row groups there are even when there are no row group statistics, but it doesn't make logical sense. Furthermore, for data pages, it is different if the "page index" is not present as then it doesn't even make sense to ask how many rows are in each data page as we don't have any data pages

Thus I think row_group_row_counts should also default to returning None if the column is not present, as @marvinlanhenke has done for null_counts_page_statistics in

So I guess my new proposal would be to return Option like:

impl<'a> StatisticsConverter<'a> {
...
  /// return OK(None) if the column does not exist
  pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
  ...
  }

  /// return OK(None) if the column does not exist
    pub fn data_page_row_counts<I>(
        &self,
        column_offset_index: &ParquetOffsetIndex,
        row_group_metadatas: &[RowGroupMetaData],
        row_group_indices: I,
    ) -> Result<Option<UInt64Array>>
    where
        I: IntoIterator<Item = &'a usize>,
    {
    ...
    }

}

The rationale to return an Option rather than an error is that creating and ignoring DataFusionError via ok() still requires a string allocation, which is wasteful

I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants