[C++][Parquet] Support row group filtering for nested paths #39064

jorisvandenbossche · 2023-12-04T16:32:10Z

Currently the filtering of row groups based on a predicate only supports non-nested paths. When getting the statistics, this only works for a leaf node:

arrow/cpp/src/arrow/dataset/file_parquet.cc

Lines 160 to 170 in f7947cc

    
           std::optional<compute::Expression> ColumnChunkStatisticsAsExpression( 
        
               const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) { 
        
             // For the remaining of this function, failure to extract/parse statistics 
        
             // are ignored by returning nullptr. The goal is two fold. First 
        
             // avoid an optimization which breaks the computation. Second, allow the 
        
             // following columns to maybe succeed in extracting column statistics. 
        
             // For now, only leaf (primitive) types are supported. 
        
             if (!schema_field.is_leaf()) { 
        
               return std::nullopt; 
        
             }

but we are calling this ColumnChunkStatisticsAsExpression function with the struct parent, and not with the struct field leaf. The schema_field passed to the function above is created with match[0], i.e. only the first part of the matching field path:

arrow/cpp/src/arrow/dataset/file_parquet.cc

Line 903 in f7947cc

const SchemaField& schema_field = manifest_->schema_fields[match[0]];

To illustrate this, creating a small test file with a nested struct column and consisting of two row groups:

import pyarrow as pa
import pyarrow.parquet as pq

struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})

pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)

Reading this through the Datasets API with a filter seems to filter this correctly:

import pyarrow.dataset as ds
dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")

dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
#    geom                                          bbox
# 0     1  {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
# 1     2  {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}

However, that is only because we correctly filter this with a nested field ref in the second step, i.e. doing an actual filter operation after reading the data. But if we look at APIs that just does the row group filtering step, we can see this is currently not being filtered at the row group stage:

In [2]: fragment = list(dataset.get_fragments())[0]

In [3]: fragment.split_by_row_group()
Out[3]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
Out[4]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

The text was updated successfully, but these errors were encountered:

… paths for struct fields

…uet-dataset-row-group-filtering-nested-path

… for struct fields (#39065) ### Rationale for this change Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from. ### Are there any user-facing changes? No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data * Closes: #39064 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

… paths for struct fields (apache#39065) ### Rationale for this change Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from. ### Are there any user-facing changes? No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data * Closes: apache#39064 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Dec 4, 2023

apacheGH-39064: [C++][Parquet] Support row group filtering for nested…

3312381

… paths for struct fields

github-actions bot mentioned this issue Dec 4, 2023

GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields #39065

Merged

github-actions bot assigned jorisvandenbossche Dec 4, 2023

jorisvandenbossche added Component: Parquet Component: C++ labels Dec 4, 2023

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jan 8, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-39064-parq…

033d6e6

…uet-dataset-row-group-filtering-nested-path

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jan 8, 2024

Merge remote-tracking branch 'upstream/main' into apachegh-39064-parq…

fb10913

…uet-dataset-row-group-filtering-nested-path

jorisvandenbossche closed this as completed in #39065 Jan 8, 2024

jorisvandenbossche added this to the 15.0.0 milestone Jan 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Support row group filtering for nested paths #39064

[C++][Parquet] Support row group filtering for nested paths #39064

jorisvandenbossche commented Dec 4, 2023

[C++][Parquet] Support row group filtering for nested paths #39064

[C++][Parquet] Support row group filtering for nested paths #39064

Comments

jorisvandenbossche commented Dec 4, 2023