
[Python] RowGroup filtering on file level #17793

Closed

asfimport opened this issue Nov 10, 2017 · 7 comments

Comments


asfimport commented Nov 10, 2017

We can build on the API fastparquet defines for RowGroup filters (https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300) and translate those filters into the C++ enums to be defined in https://issues.apache.org/jira/browse/PARQUET-1158. This should let us offer users a simple predicate pushdown API that we can later extend behind the scenes from the RowGroup to the Page level.
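For reference, a minimal sketch of the fastparquet filter API this proposes to build on; the file name and column values here are placeholders, and filter semantics are assumed to match the linked code:

```python
import fastparquet

# Open the file and read it with a row-group filter; row groups whose
# statistics show they cannot contain matching rows are skipped.
pf = fastparquet.ParquetFile('data.parquet')  # hypothetical file
df = pf.to_pandas(filters=[('name', '==', 'John')])
```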

Reporter: Uwe Korn / @xhochy
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-1796. Please see the migration documentation for further details.


Wes McKinney / @wesm:
If Gandiva becomes a part of Apache Arrow, then we should look at compiling filters and pushing them down into parquet-cpp


Uwe Korn / @xhochy:
I would start by contributing a pure Python implementation that implements all the necessary filters; we can then move the predicate evaluation either to Gandiva or to pre-compiled C++. The pure Python pass is much simpler as a first step and already provides a working interface with acceptable performance.


Uwe Korn / @xhochy:
As an interface, I would add a new kwarg to read_table called filters that accepts a list of lists of tuples, in disjunctive normal form. The innermost triples consist of (column_name, operation, value(s)), e.g. ('name', '==', 'John'). These triples are combined into a list whose predicates are ANDed together, and the outer list is then an OR combination of those AND-combined groups.
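A minimal sketch of this DNF format, assuming the filters kwarg lands on read_table as described (the file path and column names are placeholders):

```python
import pyarrow.parquet as pq

# Disjunctive normal form: the outer list ORs the inner lists together;
# the triples inside each inner list are ANDed together.
filters = [
    [('name', '==', 'John'), ('age', '>', 30)],  # name == 'John' AND age > 30
    [('city', '==', 'Paris')],                   # OR city == 'Paris'
]

table = pq.read_table('data.parquet', filters=filters)  # hypothetical file
```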


Robbie Gruener / @rgruener:
That sounds good to me. It would also be nice to apply this at the ParquetDataset level, extending the filters parameter that already exists for Hive partitions (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777) so that it handles both partition-level and row-group-level filtering. It could do this by using the summary _metadata file or by reading all the footers.
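A sketch of what that could look like at the dataset level, under the assumption that filters accepts the same DNF structure (the directory path and column names are placeholders):

```python
import pyarrow.parquet as pq

# Partition keys and row-group statistics can both be checked against
# the predicate, so non-matching files/row groups are never read.
dataset = pq.ParquetDataset(
    'path/to/dataset/',  # hypothetical partitioned dataset
    filters=[[('year', '>=', 2017), ('name', '==', 'John')]],
)
table = dataset.read()
```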


Wes McKinney / @wesm:
Since we're on a critical path to get 0.11 out in the next week or two, I'm moving this to 0.12.


Joris Van den Bossche / @jorisvandenbossche:
I think we can close this issue, since this is now possible with the dataset API?

(We can open a separate issue about actually using this in the pyarrow.parquet.read_table filters argument.)
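For illustration, a minimal sketch of the equivalent row-group filtering through the pyarrow.dataset API (the path and column name are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset('path/to/dataset/', format='parquet')
# The filter expression is evaluated against row-group statistics, so
# row groups that cannot match are skipped before any data is read.
table = dataset.to_table(filter=ds.field('name') == 'John')
```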


Wes McKinney / @wesm:
Let's close as soon as it's documented.
