GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields #39065

jorisvandenbossche · 2023-12-04T16:39:58Z

Rationale for this change

Previously, when filtering with a nested field reference, we took the corresponding Parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.
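To make the idea concrete, here is a minimal standalone sketch of walking every index of a nested path down to the leaf. `ToySchemaField` and `ResolveLeaf` are hypothetical names invented for this illustration; they are not the Arrow types or the code in this PR, and the struct-only check merely mirrors the restriction discussed in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Parquet schema node (not Arrow's real type).
struct ToySchemaField {
  std::string name;
  bool is_struct;    // true for struct nodes, false for primitive leaves
  int column_index;  // >= 0 only for primitive leaves, which carry statistics
  std::vector<ToySchemaField> children;
};

// Follow every index of the path, not just the first one.
// Returns nullptr if an index is out of range or the path descends
// through a node that is not a struct.
const ToySchemaField* ResolveLeaf(const std::vector<ToySchemaField>& top_level,
                                  const std::vector<int>& path) {
  if (path.empty()) return nullptr;
  if (path[0] < 0 || path[0] >= static_cast<int>(top_level.size())) return nullptr;
  const ToySchemaField* node = &top_level[path[0]];
  for (std::size_t i = 1; i < path.size(); ++i) {
    if (!node->is_struct) return nullptr;  // nested paths only supported for structs
    if (path[i] < 0 || path[i] >= static_cast<int>(node->children.size())) return nullptr;
    node = &node->children[path[i]];
  }
  return node;
}

// Usage: schema is {a: struct<x, y>, b: primitive}.
void ResolveLeafExample() {
  ToySchemaField a{"a", true, -1, {}};
  a.children.push_back(ToySchemaField{"x", false, 0, {}});
  a.children.push_back(ToySchemaField{"y", false, 1, {}});
  std::vector<ToySchemaField> schema;
  schema.push_back(a);
  schema.push_back(ToySchemaField{"b", false, 2, {}});

  const ToySchemaField* leaf = ResolveLeaf(schema, {0, 1});  // path for a.y
  assert(leaf != nullptr && leaf->name == "y" && leaf->column_index == 1);
  assert(ResolveLeaf(schema, {1, 0}) == nullptr);  // b is not a struct
}
```

Stopping at `path[0]` would return the struct node `a`, which has no column index of its own and therefore no statistics to test against.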

Are there any user-facing changes?

No, this only improves performance by doing the filtering at the row group stage, instead of afterwards on the data that has been read.
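As a rough illustration of what filtering at the row group stage means, the sketch below prunes row groups using per-group min/max statistics for a single leaf column and a predicate of the form `value > threshold`. `ToyRowGroupStats` and `RowGroupsToRead` are hypothetical names for this example only, and the sketch ignores nulls and missing statistics, which a real implementation has to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical min/max statistics for one leaf column in one row group.
struct ToyRowGroupStats {
  std::int64_t min;
  std::int64_t max;
};

// Return the indices of row groups that may contain a row with value > threshold.
// Groups whose max is <= threshold cannot match and are skipped without being read.
std::vector<int> RowGroupsToRead(const std::vector<ToyRowGroupStats>& stats,
                                 std::int64_t threshold) {
  std::vector<int> keep;
  for (int i = 0; i < static_cast<int>(stats.size()); ++i) {
    if (stats[i].max > threshold) keep.push_back(i);
  }
  return keep;
}

// Usage: with threshold 5, only the first group can be skipped.
void RowGroupsToReadExample() {
  std::vector<ToyRowGroupStats> stats = {{0, 5}, {6, 10}, {3, 7}};
  std::vector<int> keep = RowGroupsToRead(stats, 5);
  assert(keep.size() == 2 && keep[0] == 1 && keep[1] == 2);
}
```

Rows in the kept groups still have to be filtered after reading; the statistics only rule out groups that cannot contain a match.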

Closes: [C++][Parquet] Support row group filtering for nested paths #39064

github-actions · 2023-12-04T16:40:27Z

⚠️ GitHub issue #39064 has been automatically assigned in GitHub to PR creator.

jorisvandenbossche · 2023-12-04T16:47:54Z

cpp/src/arrow/dataset/file_parquet.cc

-    const Field& field, const parquet::Statistics& statistics) {
-  auto field_expr = compute::field_ref(field.name());
+    const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics) {
+  auto field_expr = compute::field_ref(field_ref);


I don't know if there is a better way, but the reason I am passing the FieldRef through the various function calls (through ColumnChunkStatisticsAsExpression to EvaluateStatisticsAsExpression) is that once we are here and we have the "child" field, we don't know what the original full nested field was.

Neither the SchemaField passed to ColumnChunkStatisticsAsExpression nor the Field (schema_field.field) passed to EvaluateStatisticsAsExpression currently have that information AFAIU.

jorisvandenbossche · 2023-12-04T16:49:24Z

cpp/src/arrow/dataset/file_parquet.h

  static std::optional<compute::Expression> EvaluateStatisticsAsExpression(
-      const Field& field, const parquet::Statistics& statistics);
+      const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics);


This is public API? So I can add a variant with the original signature that creates a FieldRef from the Field?
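One way such a compatibility variant could look, sketched with toy types rather than the real Arrow signatures: the new overload takes the full reference and builds the expression from it, and the old signature forwards to it with a reference made from the field name. All names here (`ToyField`, `ToyFieldRef`, `ToyStatistics`, `StatisticsAsExpression`) are hypothetical, and the expression is rendered as a plain string purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins; not the Arrow or Parquet types.
struct ToyField { std::string name; };
struct ToyFieldRef { std::vector<std::string> path; };  // {"a", "b"} means a.b
struct ToyStatistics { long min; long max; };

// Join the path components with dots, e.g. {"a", "b"} -> "a.b".
std::string DottedPath(const ToyFieldRef& ref) {
  std::string out;
  for (std::size_t i = 0; i < ref.path.size(); ++i) {
    if (i > 0) out += ".";
    out += ref.path[i];
  }
  return out;
}

// New-style signature: the expression is built from the full reference,
// because the leaf field alone does not know its parents.
std::string StatisticsAsExpression(const ToyField& field, const ToyFieldRef& ref,
                                   const ToyStatistics& stats) {
  (void)field;  // a real implementation would use the field's type here
  const std::string name = DottedPath(ref);
  return "(" + name + " >= " + std::to_string(stats.min) + ") and (" + name +
         " <= " + std::to_string(stats.max) + ")";
}

// Compatibility overload with the original signature: derive the reference
// from the field name, which is only right for top-level fields.
std::string StatisticsAsExpression(const ToyField& field, const ToyStatistics& stats) {
  ToyFieldRef ref;
  ref.path.push_back(field.name);
  return StatisticsAsExpression(field, ref, stats);
}

void StatisticsAsExpressionExample() {
  ToyField leaf{"b"};
  ToyStatistics stats{1, 9};
  ToyFieldRef nested;
  nested.path.push_back("a");
  nested.path.push_back("b");
  assert(StatisticsAsExpression(leaf, nested, stats) == "(a.b >= 1) and (a.b <= 9)");
  assert(StatisticsAsExpression(leaf, stats) == "(b >= 1) and (b <= 9)");
}
```

The second assertion shows the information loss described above: given only the leaf field, the resulting expression refers to `b` rather than `a.b`.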

mapleFU · 2023-12-06T07:45:39Z

I'll take a quick look now

mapleFU

General LGTM

mapleFU · 2023-12-06T08:08:41Z

cpp/src/arrow/dataset/file_parquet.cc

+      if (schema_field->field->type()->id() != Type::STRUCT) {
+        return Status::Invalid("nested paths only supported for structs");
+      }


So this prevents users from passing a filter on Map/List?

Yes, but we currently also don't support any predicate kernels for those data types AFAIK.

For example, for a list column you can't do something like "list_field > 1" because 1) such a kernel isn't implemented, and 2) it doesn't really make sense: a list scalar contains multiple values, so the comparison doesn't evaluate to a simple True/False. You need some kind of aggregation like "elementwise_all(list_field > 1)" (i.e. are "all" (or "any") values in a list scalar larger than 1).
And even then, simplifying such a more complex expression based on the Parquet statistics would also need to be implemented.

(I would like to see this work at some point, but that's certainly future work)
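A small standalone sketch of why a list value needs an aggregation before it yields a single boolean. The helper names are made up for this example, and "elementwise_all" above is itself only a suggested name, not an existing kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One "list scalar": the value of a list column for a single row.
using ToyListValue = std::vector<int>;

// True if every element is greater than the threshold (vacuously true for an empty list).
bool AllGreaterThan(const ToyListValue& values, int threshold) {
  return std::all_of(values.begin(), values.end(),
                     [threshold](int x) { return x > threshold; });
}

// True if at least one element is greater than the threshold.
bool AnyGreaterThan(const ToyListValue& values, int threshold) {
  return std::any_of(values.begin(), values.end(),
                     [threshold](int x) { return x > threshold; });
}

void ListPredicateExample() {
  ToyListValue row = {0, 2, 3};
  // "row > 1" has no single answer: the two aggregations disagree.
  assert(!AllGreaterThan(row, 1));
  assert(AnyGreaterThan(row, 1));
}
```

Which aggregation is intended changes the result for the same row, so it would also change how leaf-level min/max statistics could be used to skip a row group.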

Would you mind adding a test in C++ showing that a "List/Map" filter doesn't work?

I agree that filtering on list/map fields is hard and might need extra predicates. Let's disable it for now, but maybe we can test some more complex structs?

Would you mind adding a test in C++ showing that a "List/Map" filter doesn't work?

Filtering with a list or map field actually already fails in an earlier step, when binding the filter expression to the schema. Binding isn't done in FilterRowGroups; it's expected to already be done (in the test for this, it is done up front in the test setup code).

cpp/src/arrow/dataset/file_parquet.cc

mapleFU

This LGTM on the Parquet side, but I think we need someone more familiar with dataset to review it.

cc @bkietz

bkietz

LGTM, just one nit

bkietz · 2023-12-19T15:10:05Z

cpp/src/arrow/dataset/file_parquet.cc

@@ -897,16 +907,25 @@ Result<std::vector<compute::Expression>> ParquetFileFragment::TestRowGroups(
    ARROW_ASSIGN_OR_RAISE(auto match, ref.FindOneOrNone(*physical_schema_));

    if (match.empty()) continue;
-    if (statistics_expressions_complete_[match[0]]) continue;
-    statistics_expressions_complete_[match[0]] = true;
+    const SchemaField* schema_field = &manifest_->schema_fields[match[0]];


Since this is the same logic as FieldPath::Get, would you mind extracting it as a separate function? It would be nice to have a clear single entry point for future work on nested field references in parquet

You mean including the for loop below, right?

I think a method on the SchemaManifest might be a logical place to have this. It already has a GetColumnField to return a SchemaField based on a single integer index. There could be a variant which accepts a FieldPath

I would like to merge this before the 15.0 branch cut-off, so going to merge as is, but will look into factoring it out as a helper function in a follow-up!

conbench-apache-arrow · 2024-01-09T02:14:59Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ffcfabd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

raulcd · 2024-01-09T16:50:53Z

@github-actions crossbow submit wheel-macos-*

github-actions · 2024-01-09T16:53:34Z

Revision: fb10913

Submitted crossbow builds: ursacomputing/crossbow @ actions-e8de21cfb7

Task	Status
wheel-macos-big-sur-cp310-arm64
wheel-macos-big-sur-cp311-arm64
wheel-macos-big-sur-cp312-arm64
wheel-macos-big-sur-cp38-arm64
wheel-macos-big-sur-cp39-arm64
wheel-macos-catalina-cp310-amd64
wheel-macos-catalina-cp311-amd64
wheel-macos-catalina-cp312-amd64
wheel-macos-catalina-cp38-amd64
wheel-macos-catalina-cp39-amd64

… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

apacheGH-39064: [C++][Parquet] Support row group filtering for nested…

3312381

… paths for struct fields

github-actions bot added the Component: C++ and awaiting committer review labels Dec 4, 2023

apache deleted a comment from github-actions bot Dec 4, 2023

jorisvandenbossche commented Dec 4, 2023

github-actions bot added the awaiting changes label and removed the awaiting committer review label Dec 4, 2023

jorisvandenbossche commented Dec 4, 2023

jorisvandenbossche added 3 commits December 5, 2023 11:24

format

b1115bf

fix for multiple nested fields + add python tests

56476cf

add back compat shim for EvaluateStatisticsAsExpression

b87c829

github-actions bot added the Component: Python and awaiting change review labels and removed the awaiting changes label Dec 5, 2023

fixup

0d3da99

jorisvandenbossche added the Component: Parquet label Dec 5, 2023

github-actions bot removed the Component: Parquet label Dec 5, 2023

add small C++ test

bd8f127

jorisvandenbossche marked this pull request as ready for review December 6, 2023 07:43

jorisvandenbossche requested a review from westonpace as a code owner December 6, 2023 07:43

mapleFU reviewed Dec 6, 2023

github-actions bot added the awaiting changes label and removed the awaiting change review label Dec 6, 2023

address feedback

4205837

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Dec 6, 2023

mapleFU approved these changes Dec 7, 2023

bkietz approved these changes Dec 19, 2023

github-actions bot added the awaiting merge and awaiting changes labels and removed the awaiting changes and awaiting merge labels Dec 19, 2023

github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 8, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-39064-parq…

fb10913

…uet-dataset-row-group-filtering-nested-path

jorisvandenbossche force-pushed the gh-39064-parquet-dataset-row-group-filtering-nested-path branch from 033d6e6 to fb10913 Compare January 8, 2024 12:46

jorisvandenbossche merged commit ffcfabd into apache:main Jan 8, 2024
33 checks passed

jorisvandenbossche removed the awaiting change review label Jan 8, 2024

jorisvandenbossche deleted the gh-39064-parquet-dataset-row-group-filtering-nested-path branch January 8, 2024 15:07

github-actions bot added the awaiting changes label Jan 8, 2024

raulcd mentioned this pull request Jan 9, 2024

GH-39537: [Packaging][Python] Add a numpy<2 pin to the install requirements for the 15.x release branch #39538

Merged

jorisvandenbossche mentioned this pull request Jan 11, 2024

[CI][Python] Some macOS wheels are failing with a segmentation fault when running test_parquet_dataset_lazy_filtering #39562

Closed

rouault mentioned this pull request May 14, 2024

[C++][Parquet] Predicate pushdown through arrow::dataset::ScanBuilder::Filter() not available on list fields #41651

Open
