
[Python] RowGroup filtering on file level #17793

Closed

asfimport opened this issue Nov 10, 2017 · 7 comments

Comments


asfimport commented Nov 10, 2017

We can build on the API fastparquet defines for RowGroup filters (https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300) and translate those filters into the C++ enums to be defined in https://issues.apache.org/jira/browse/PARQUET-1158. This should let us offer users a simple predicate pushdown API that we can later extend behind the scenes from the RowGroup to the Page level.
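For reference, a minimal sketch of the fastparquet filter API this proposes to build on; the file name and column values here are placeholders, and filter semantics are assumed to match the linked code:

```python
import fastparquet

# Open the file and read it with a row-group filter; row groups whose
# statistics show they cannot contain matching rows are skipped.
pf = fastparquet.ParquetFile('data.parquet')  # hypothetical file
df = pf.to_pandas(filters=[('name', '==', 'John')])
```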

Reporter: Uwe Korn / @xhochy
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-1796. Please see the migration documentation for further details.


Wes McKinney / @wesm:
If Gandiva becomes a part of Apache Arrow, then we should look at compiling filters and pushing them down into parquet-cpp


Uwe Korn / @xhochy:
I would start by contributing a pure Python implementation that implements all the necessary filters; we can then move the predicate evaluation either to Gandiva or to pre-compiled C++. The pure Python pass is much simpler as a first step and already provides a working interface with acceptable performance.


Uwe Korn / @xhochy:
As an interface, I would add a new kwarg to read_table called filters that accepts a list of lists of tuples, in disjunctive normal form. The innermost triples consist of (column_name, operation, value(s)), e.g. ('name', '==', 'John'). These triples are combined into a list whose predicates are ANDed together, and the outer list is then an OR combination of those AND-combined groups.
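A minimal sketch of this DNF format, assuming the filters kwarg lands on read_table as described (the file path and column names are placeholders):

```python
import pyarrow.parquet as pq

# Disjunctive normal form: the outer list ORs the inner lists together;
# the triples inside each inner list are ANDed together.
filters = [
    [('name', '==', 'John'), ('age', '>', 30)],  # name == 'John' AND age > 30
    [('city', '==', 'Paris')],                   # OR city == 'Paris'
]

table = pq.read_table('data.parquet', filters=filters)  # hypothetical file
```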


Robbie Gruener / @rgruener:
That sounds good to me. It would also be nice to apply this at the ParquetDataset level, extending the filters parameter that already exists for Hive partitions (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777) so that it handles both partition-level and row-group-level filtering. It could do this by using the summary _metadata file or by reading all the footers.
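A sketch of what that could look like at the dataset level, under the assumption that filters accepts the same DNF structure (the directory path and column names are placeholders):

```python
import pyarrow.parquet as pq

# Partition keys and row-group statistics can both be checked against
# the predicate, so non-matching files/row groups are never read.
dataset = pq.ParquetDataset(
    'path/to/dataset/',  # hypothetical partitioned dataset
    filters=[[('year', '>=', 2017), ('name', '==', 'John')]],
)
table = dataset.read()
```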


Wes McKinney / @wesm:
Since we're on a critical path to get 0.11 out in the next week or two, I'm moving this to 0.12.


Joris Van den Bossche / @jorisvandenbossche:
I think we can close this issue, since this is now possible with the dataset API?

(We can open a separate issue about actually using this in the pyarrow.parquet.read_table filters argument.)
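For illustration, a minimal sketch of the equivalent row-group filtering through the pyarrow.dataset API (the path and column name are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset('path/to/dataset/', format='parquet')
# The filter expression is evaluated against row-group statistics, so
# row groups that cannot match are skipped before any data is read.
table = dataset.to_table(filter=ds.field('name') == 'John')
```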


Wes McKinney / @wesm:
Let's close as soon as it's documented.
