Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields #39065

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 4, 2023

Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

Copy link

github-actions bot commented Dec 4, 2023

⚠️ GitHub issue #39064 has been automatically assigned in GitHub to PR creator.

@apache apache deleted a comment from github-actions bot Dec 4, 2023
Comment on lines 360 to 361
const Field& field, const parquet::Statistics& statistics) {
auto field_expr = compute::field_ref(field.name());
const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics) {
auto field_expr = compute::field_ref(field_ref);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there is a better way, but the reason I am passing the FieldRef through the various function calls (through ColumnChunkStatisticsAsExpression to EvaluateStatisticsAsExpression), is that once we are here and we have the "schild" field, we don't know what the original full nested field was.

Neither the SchemaField passed to ColumnChunkStatisticsAsExpression nor the Field (schema_field.field) passed to EvaluateStatisticsAsExpression currently have that information AFAIU.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 4, 2023
Comment on lines 177 to 178
static std::optional<compute::Expression> EvaluateStatisticsAsExpression(
const Field& field, const parquet::Statistics& statistics);
const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public API? So I can add a variant with the original signature that creates a FieldRef from the Field?

@github-actions github-actions bot added Component: Python awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 5, 2023
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review December 6, 2023 07:43
@mapleFU
Copy link
Member

mapleFU commented Dec 6, 2023

I'll take a quick look now

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM

Comment on lines +912 to +914
if (schema_field->field->type()->id() != Type::STRUCT) {
return Status::Invalid("nested paths only supported for structs");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this limit user passing an filter on Map/List?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we currently also don't support any predicate kernels for those data types at the moment AFAIK.

For example for a list column, you can't do something like "list_field > 1" because 1) such kernel isn't implemented, and 2) that actually also doesn't really make sense as a list scalar contains multiple values, so that doesn't evaluate to simple True/False, you need some kind of aggregation like "elementwise_all(list_field > 1)" (i.e. are "all" (or any) values in a list scalar larger than 1).
And even then simplifying such more complex expression based on the parquet statistics would also need to be implemented.

(I would like to see this work at some point, but that's certainly future work)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind add a test for the "List/Map" filter doesn't work in cpp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree list/map filter is so hard to filtering, which might need extra predicates. Let disable it now, but maybe we can test some more complex struct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind add a test for the "List/Map" filter doesn't work in cpp

Filtering with a list or map field actually already fails in an earlier step, when binding the filter expression to the schema (and binding isn't done in FilterRowGroups, it's expected to already be done, also in the test for this it is done up front in the test setup code).

cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 6, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Dec 6, 2023
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM on parquet side, but I think we need a more one familiar with dataset to review it

cc @bkietz

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one nit

@@ -897,16 +907,25 @@ Result<std::vector<compute::Expression>> ParquetFileFragment::TestRowGroups(
ARROW_ASSIGN_OR_RAISE(auto match, ref.FindOneOrNone(*physical_schema_));

if (match.empty()) continue;
if (statistics_expressions_complete_[match[0]]) continue;
statistics_expressions_complete_[match[0]] = true;
const SchemaField* schema_field = &manifest_->schema_fields[match[0]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the same logic as FieldPath::Get, would you mind extracting it as a separate function? It would be nice to have a clear single entry point for future work on nested field references in parquet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean including the for loop below, right?

I think a method on the SchemaManifest might be a logical place to have this. It already has a GetColumnField to return a SchemaField based on a single integer index. There could be a variant which accepts a FieldPath

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to merge this before the 15.0 branch cut-off, so going to merge as is, but will look into factoring it out as a helper function in a follow-up!

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Dec 19, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 8, 2024
…uet-dataset-row-group-filtering-nested-path
@jorisvandenbossche jorisvandenbossche force-pushed the gh-39064-parquet-dataset-row-group-filtering-nested-path branch from 033d6e6 to fb10913 Compare January 8, 2024 12:46
@jorisvandenbossche jorisvandenbossche merged commit ffcfabd into apache:main Jan 8, 2024
33 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Jan 8, 2024
@jorisvandenbossche jorisvandenbossche deleted the gh-39064-parquet-dataset-row-group-filtering-nested-path branch January 8, 2024 15:07
@github-actions github-actions bot added the awaiting changes Awaiting changes label Jan 8, 2024
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ffcfabd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

@raulcd
Copy link
Member

raulcd commented Jan 9, 2024

@github-actions crossbow submit wheel-macos-*

Copy link

github-actions bot commented Jan 9, 2024

Revision: fb10913

Submitted crossbow builds: ursacomputing/crossbow @ actions-e8de21cfb7

Task Status
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Support row group filtering for nested paths
4 participants