Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Support row group filtering for nested paths #39064

Closed
jorisvandenbossche opened this issue Dec 4, 2023 · 0 comments · Fixed by #39065
Closed

[C++][Parquet] Support row group filtering for nested paths #39064

jorisvandenbossche opened this issue Dec 4, 2023 · 0 comments · Fixed by #39065

Comments

@jorisvandenbossche
Copy link
Member

Currently the filtering of row groups based on a predicate only supports non-nested paths. When getting the statistics, this only works for a leaf node:

std::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) {
// For the remaining of this function, failure to extract/parse statistics
// are ignored by returning nullptr. The goal is two fold. First
// avoid an optimization which breaks the computation. Second, allow the
// following columns to maybe succeed in extracting column statistics.
// For now, only leaf (primitive) types are supported.
if (!schema_field.is_leaf()) {
return std::nullopt;
}

but we are calling this ColumnChunkStatisticsAsExpression function with the struct parent, and not with the struct field leaf. The schema_field passed to the function above is created with match[0], i.e. only the first part of the matching field path:

const SchemaField& schema_field = manifest_->schema_fields[match[0]];


To illustrate this, creating a small test file with a nested struct column and consisting of two row groups:

import pyarrow as pa
import pyarrow.parquet as pq

struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})

pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)

Reading this through the Datasets API with a filter seems to filter this correctly:

import pyarrow.dataset as ds
dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")

dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
#    geom                                          bbox
# 0     1  {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
# 1     2  {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}

However, that is only because we correctly filter this with a nested field ref in the second step, i.e. doing an actual filter operation after reading the data. But if we look at APIs that just does the row group filtering step, we can see this is currently not being filtered at the row group stage:

In [2]: fragment = list(dataset.get_fragments())[0]

In [3]: fragment.split_by_row_group()
Out[3]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
Out[4]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Dec 4, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jan 8, 2024
…uet-dataset-row-group-filtering-nested-path
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jan 8, 2024
…uet-dataset-row-group-filtering-nested-path
jorisvandenbossche added a commit that referenced this issue Jan 8, 2024
… for struct fields (#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: #39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 15.0.0 milestone Jan 8, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
… paths for struct fields (apache#39065)

### Rationale for this change

Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.

This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.

### Are there any user-facing changes?

No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data

* Closes: apache#39064

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant