[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs #10453
Comments
In terms of sequencing this feature, here is what I would recommend:

- First PR: sketch out the API and the test framework
- Second PR (draft): demonstrate the API can be used in DataFusion, and ensure the test coverage is adequate
- Third + fourth + ... PRs: add support for the remaining data types, along with tests

I will start working on the first PR.
After working through an actual example in #10549 I have a new API proposal: NGA-TRAN#118. Here is what the API looks like:

```rust
/// What type of statistics should be extracted?
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum RequestedStatistics {
    /// Minimum Value
    Min,
    /// Maximum Value
    Max,
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}
/// Extracts Parquet statistics as Arrow arrays
///
/// This is used to convert Parquet statistics to Arrow arrays, with proper type
/// conversions. This information can be used for pruning parquet files or row
/// groups based on the statistics embedded in parquet files
///
/// # Schemas
///
/// The schema of the parquet file and the arrow schema are used to convert the
/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
/// The parquet_schema and arrow_schema do not have to be identical (for
/// example, the columns may be in different orders and one or the other schemas
/// may have additional columns). The function [`parquet_column`] is used to
/// match the column in the parquet file to the column in the arrow schema.
///
/// # Multiple parquet files
///
/// This API is designed to support efficiently extracting statistics from
/// multiple parquet files (which is why the parquet schema is passed in as an
/// argument). This is useful when building an index for a directory of parquet
/// files.
///
#[derive(Debug)]
pub struct StatisticsConverter<'a> {
    /// The name of the column to extract statistics for
    column_name: &'a str,
    /// The type of statistics to extract
    statistics_type: RequestedStatistics,
    /// The arrow schema of the query
    arrow_schema: &'a Schema,
    /// The field (with data type) of the column in the arrow schema
    arrow_field: &'a Field,
}
impl<'a> StatisticsConverter<'a> {
    /// Returns a [`UInt64Array`] with row counts for each row group
    ///
    /// The returned array has no nulls, and has one value for each row group.
    /// Each value is the number of rows in the row group.
    pub fn row_counts(metadata: &ParquetMetaData) -> Result<UInt64Array> {
        ...
    }
    /// Create a new statistics converter
    pub fn try_new(
        column_name: &'a str,
        statistics_type: RequestedStatistics,
        arrow_schema: &'a Schema,
    ) -> Result<Self> {
        ...
    }
    /// Extract the statistics from a parquet file, given the parquet file's
    /// metadata
    ///
    /// The returned array contains 1 value for each row group in the parquet
    /// file, in order
    ///
    /// Each value is either:
    /// * the requested statistics type for the column
    /// * a null value, if the statistics can not be extracted
    ///
    /// Note that a null value does NOT mean the min or max value was actually
    /// `null`; it means the requested statistic is unknown
    ///
    /// Reasons for not being able to extract the statistics include:
    /// * the column is not present in the parquet file
    /// * statistics for the column are not present in the row group
    /// * the stored statistic value can not be converted to the requested type
    pub fn extract(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
        ...
    }
}
```

I am envisioning this API could also easily support:

Extract from multiple files in one go:

```rust
impl<'a> StatisticsConverter<'a> {
    ..

    /// Extract metadata from multiple parquet files into a single arrow
    /// array, one element per row group per file
    fn extract_multi(&self, metadata: impl IntoIterator<Item = &ParquetMetaData>) -> Result<ArrayRef> {
        ...
    }
}
```

Extract information from the page index as well:

```rust
impl<'a> StatisticsConverter<'a> {
    ..

    /// Extract metadata from the page indexes across all row groups. The
    /// returned array has one element per page across all row groups
    fn extract_page(&self, metadata: impl IntoIterator<Item = &ParquetMetaData>) -> Result<ArrayRef> {
        ...
    }
}
```
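For illustration, here is a minimal sketch of how the proposed API might be used to compute per-row-group minimums for one column of one file. It assumes the `StatisticsConverter` shape sketched above (which may differ from whatever is eventually merged); the function name and column name are made up:

```rust
use std::fs::File;

use arrow::array::ArrayRef;
use datafusion::error::Result;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Hypothetical usage of the proposed API: extract the minimum value of
// column "a" for every row group in a single parquet file.
fn min_values_for_file(path: &str) -> Result<ArrayRef> {
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
    let arrow_schema = reader.schema().clone();

    // StatisticsConverter / RequestedStatistics are the proposed API above
    let converter = StatisticsConverter::try_new("a", RequestedStatistics::Min, &arrow_schema)?;

    // One (possibly null) element per row group, as an ArrayRef of the
    // column's arrow type -- ready to hand to pruning logic.
    converter.extract(reader.metadata())
}
```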
`parse_metadata`, `decode_metadata` and `decode_footer`: apache/arrow-rs#5781
@alamb I have created 2 more bug tickets but I cannot edit the description to add them to the subtasks. Can you help with that?
Done
FYI I have a proposed API change in #10806
Given how far we have come with this ticket, I plan to close it and do some organizing of the remaining tasks as follow-on tickets / epics.
This issue is done enough -- I am consolidating the remaining todo items under #10922 |
Is your feature request related to a problem or challenge?
There are at least three places where parquet statistics are extracted into `ArrayRef`s today:

- ParquetExec (pruning pages): https://github.com/apache/datafusion/blob/671cef85c550969ab2c86d644968a048cb181c0c/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L393-L392
- ListingTable (pruning files): https://github.com/apache/datafusion/blob/97148bd105fc2102b0444f2d67ef535937da5dfe/datafusion/core/src/datasource/file_format/parquet.rs#L295-L294
Not only are there three copies of the code, they are all subtly different (e.g. #8295) and have varying degrees of testing
Describe the solution you'd like
I would like one API with the following properties:

- produces `ArrayRef`s suitable to pass to `PruningPredicate`

Describe alternatives you've considered
Some ideas from apache/arrow-rs#4328
Subtasks

- `i8` and `i16` columns in parquet #10585
- `datafusion/core/src/datasource/physical_plan/parquet/statistics.rs` to `datafusion/core/tests/parquet/arrow_statistics.rs`
- `ListingTable` to use the new API for file pruning
- `f16` columns #10757
- `Time32` and `Time64` columns #10751
- `Interval` columns #10752
- `Duration` columns #10754
- `LargeBinary` columns #10753
- `LargeUtf8` columns #10756
- `Decimal256` columns #10755

Follow-on projects:

- `ArrayRef`s #10806

Here is a proposed API:
Maybe it would make sense to have something more builder style:
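For instance, a hypothetical builder-style sketch, reusing the `StatisticsConverter` names from the comment earlier in this thread (none of these method names are real, and `reader` is assumed to be an open parquet reader):

```rust
// Hypothetical builder-style variant of the StatisticsConverter idea
// sketched earlier in this thread; all of these names are made up.
let converter = StatisticsConverter::builder(&arrow_schema)
    .with_column("a")
    .with_statistics(RequestedStatistics::Min)
    .build()?;

// The extraction call itself is unchanged: one value per row group.
let mins = converter.extract(reader.metadata())?;
```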
(This is similar to the existing API `parquet::arrow::parquet_to_arrow_schema`)
Note: `Statistics` above is the parquet crate's `Statistics` type.
There is a version of this code here in DataFusion that could perhaps be adapted: `datafusion/core/src/datasource/physical_plan/parquet/statistics.rs`, lines 179 to 186 at commit `accce97`.
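To give a flavor of what that code does, here is a minimal hedged sketch (not the actual DataFusion code) of converting per-row-group parquet statistics for one physical type into an arrow array, assuming the parquet crate's `Statistics` enum:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use parquet::file::statistics::Statistics;

/// Sketch only: collect the min values of an Int64 column from
/// per-row-group statistics into one arrow array, with a null
/// wherever statistics are missing or of an unexpected type.
fn min_int64<'a>(stats: impl Iterator<Item = Option<&'a Statistics>>) -> ArrayRef {
    Arc::new(Int64Array::from_iter(stats.map(|s| match s {
        Some(Statistics::Int64(s)) if s.has_min_max_set() => Some(*s.min()),
        _ => None,
    })))
}
```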
Testing
I suggest we add a new module to the existing parquet test in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_exec.rs
The tests should look like:
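For example, a hedged sketch of one such test (it assumes the proposed `StatisticsConverter` API from above; the test name and column values are made up):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

#[test]
fn extract_min_i64() {
    // Write a one-column, one-row-group parquet file entirely in memory
    let array: ArrayRef = Arc::new(Int64Array::from(vec![Some(5), None, Some(1)]));
    let batch = RecordBatch::try_from_iter([("a", array)]).unwrap();
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Read the metadata back and extract the per-row-group minimums
    // using the proposed API
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf)).unwrap();
    let schema = batch.schema();
    let converter =
        StatisticsConverter::try_new("a", RequestedStatistics::Min, &schema).unwrap();
    let mins = converter.extract(reader.metadata()).unwrap();

    // One row group was written, so one (non-null) min is expected
    let expected: ArrayRef = Arc::new(Int64Array::from(vec![Some(1)]));
    assert_eq!(mins.as_ref(), expected.as_ref());
}
```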
I can help with writing these tests
I personally suggest:
cc @tustvold in case you have other ideas
Additional context
This code would likely eventually be good to have in the parquet crate -- see apache/arrow-rs#4328. However, I think initially we should do it in DataFusion to iterate faster and figure out the API before moving it up there.
There are a bunch of related improvements that I think become much simpler with this feature: