-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support parquet statistics for struct columns #8334
Comments
It appears that statistics for such types is ready as all nulls |
Yes, this is what I would expect in the absence of a column name collision. My greater concern is that the presence of the struct column will mess up the ordinals of the other columns |
I added tests for this case here: #8294 (comment) (TLDR is it seems to do the right thing, though it would be good to get a second set of eyes) |
I read in statistics /// Lookups up the parquet column by name
///
/// Returns the parquet column index and the corresponding arrow field
pub(crate) fn parquet_column<'a>(
parquet_schema: &SchemaDescriptor,
arrow_schema: &'a Schema,
name: &str,
) -> Option<(usize, &'a FieldRef)> {
let (root_idx, field) = arrow_schema.fields.find(name)?;
if field.data_type().is_nested() {
// Nested fields are not supported and require non-trivial logic
// to correctly walk the parquet schema accounting for the
// logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
//
// For example a ListArray could correspond to anything from 1 to 3 levels
// in the parquet schema
return None;
}
// This could be made more efficient (#TBD)
let parquet_idx = (0..parquet_schema.columns().len())
.find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
Some((parquet_idx, field))
}```
`git blame` shows @alamb as an author of those lines... I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly? |
@alamb I realized the last part of the code is actually the question I have for you, apologies I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly? |
That sounds likely. I think the first thing to do would be to write a test case showing the incorrect / lack of behavior. Then we can work out how to solve the problem |
@alamb I have done some progress in getting a test, but I need guidance here: let (idx, _) = parquet_column(parquet_schema, &schema, "struct_col").unwrap();
assert_eq!(idx, 0);
let row_groups = metadata.row_groups();
let iter = row_groups.iter().map(|x| x.column(idx).statistics()); The test now fails because row_groups.iter().map(|x| x.column(idx).statistics()) doesn't return the statistics of the first column of the RecordBatch, the struct, but of the first column of the struct. I can definitely create myself the statistics doing something like that: let datatype = DataType::Struct(
Fields::from(vec![
Field::new("bool_col", DataType::Boolean, false),
Field::new("int_col", DataType::Boolean, false),
]),
);
// Comment next line, change iter above for the two fields of the struct, and build the min as a struxt
let min_field1 = min_statistics(&datatype, iter1.clone()).unwrap();
let min_field2 = min_statistics(&datatype, iter2.clone()).unwrap();
let min = find out how to do this//
assert_eq!(
&min,
&expected_min,
"Min. Statistics\n\n{}\n\n",
DisplayStats(row_groups)
); but I am not sure that's the right way to proceed |
Is your feature request related to a problem or challenge?
While working on #8294 @tustvold noted that the statistics extraction code does not do the right thing with
Struct
s that were written to parquetI think the usecase is something like the following query:
Describe the solution you'd like
I would like the parquet statistics reading code to handle structs (I think).
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: