[Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets) #26538

Closed
asfimport opened this issue Nov 13, 2020 · 3 comments

Comments

@asfimport
Collaborator

I would like to enhance partition filters in methods such as:

pyarrow.parquet.ParquetDataset(path, filters)

I am proposing the below enhancements:

  1. For the operators "in" and "not in", the value should be any typing.Iterable (that is also a container). Currently only set is supported, while other iterables, such as list and tuple, do not work correctly. I would like to change this to accept any iterable.

  2. Improve the documentation of the partition filters.

    I see there is a new version implemented with _ParquetDatasetV2 which already accepts an iterable, so the documentation update is valid for the new version as well.
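
To make the first proposal concrete, here is a minimal sketch of how an "in" / "not in" predicate could accept any iterable. The helper name `partition_matches` is hypothetical and is not part of pyarrow; the sketch only assumes that the filter values are hashable.

```python
def partition_matches(partition_value, op, filter_value):
    """Hypothetical helper: evaluate one 'in' / 'not in' partition predicate.

    `filter_value` may be any iterable of hashable items (set, list,
    tuple, generator, ...). It is copied into a frozenset once so that
    the membership test behaves the same for every input type.
    """
    if op not in ("in", "not in"):
        raise ValueError(f"unsupported operator: {op!r}")
    candidates = frozenset(filter_value)
    if not candidates:
        raise ValueError("cannot use an empty collection as filter value")
    found = partition_value in candidates
    return found if op == "in" else not found


# The same predicate works for a list, a tuple and a set.
list_result = partition_matches("a", "in", ["a", "b"])
tuple_result = partition_matches("c", "not in", ("a", "b"))
set_result = partition_matches("c", "in", {"a", "b"})
```

Copying into a frozenset is one possible design choice; it trades a small up-front cost for constant-time lookups per partition.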
     

Reporter: Weiyang Zhao / @wyzhao
Assignee: Weiyang Zhao / @wyzhao

PRs and other links:

Note: This issue was originally created as ARROW-10574. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Could you already push a branch with your changes to GitHub? Seeing the code might help in understanding what you are proposing.

> for operator "in", "not in", the value currently must be a set.

I would think we already support other iterables that support the Python "in" operator. Do you have an example where it fails?
But I agree that converting it to a set might be good anyway.
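
One reason converting to a set can help, shown as a small sketch in plain Python (nothing here is pyarrow code): a one-shot iterable such as a generator supports the "in" operator, but each membership test consumes it, so repeated checks against many partitions would give inconsistent answers.

```python
def is_member(value, candidates):
    """Plain membership test, as the Python 'in' operator performs it."""
    return value in candidates


# A generator is iterable, but it can only be walked once.
one_shot = (name for name in ["a", "b"])
first_lookup = is_member("a", one_shot)   # consumes the generator up to "a"
second_lookup = is_member("a", one_shot)  # "a" has already been consumed

# Materializing the values once gives stable answers for every lookup.
frozen = frozenset(name for name in ["a", "b"])
stable_lookups = [is_member("a", frozen) for _ in range(3)]
```

Here `first_lookup` is True while `second_lookup` is False, whereas every entry of `stable_lookups` is True.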

> I would like to add a 'like' operator with the semantics of SQL "like". Alternatively, a regular expression could be used. I prefer SQL "like" semantics for consistency with SQL.

The ParquetDataset code is being replaced with a pyarrow.dataset based implementation, so any significant enhancement or new feature should probably target this new implementation. Currently, we do not yet support general filter expressions, but work is ongoing to allow this (I can't directly find the correct JIRA, but see e.g. ARROW-10305 for a similar discussion).

@asfimport
Collaborator Author

Weiyang Zhao / @wyzhao:
Hi, @jorisvandenbossche

Just saw your comments. I don't know where Jira sent me the notification email.

Currently if I pass in a filter like this:

('x', 'in', ['a', 'b'])

It will not work because of line 876 in parquet.py:
if isinstance(f_value, set):
As you can see, it only checks for 'set'.
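
Until a version with the fix is available, a caller-side workaround is to convert the values to sets before building the filters, since sets are the supported type. The sketch below is an assumption about how one might do that; `with_set_values` is a hypothetical helper, not a pyarrow function, and it leaves string values untouched so that a single string is not split into characters.

```python
SET_OPS = {"in", "not in"}


def with_set_values(filters):
    """Return a copy of `filters` whose 'in' / 'not in' values are sets.

    Accepts a flat list of (column, op, value) tuples or a list of such
    lists (disjunctive normal form).
    """
    def convert(predicate):
        column, op, value = predicate
        # Only collection-valued 'in' / 'not in' predicates are rewritten.
        if op in SET_OPS and not isinstance(value, (set, str, bytes)):
            value = set(value)
        return (column, op, value)

    return [
        convert(item) if isinstance(item, tuple) else [convert(p) for p in item]
        for item in filters
    ]


flat = with_set_values([("x", "in", ["a", "b"]), ("y", "=", 1)])
nested = with_set_values([[("x", "not in", ("a", "b"))], [("y", "=", 1)]])
```

The result can then be passed as the `filters` argument in place of the original list.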

I also improved the documentation to make this clear.

I dropped the 'like' enhancement because it is not supported in the Cython version and I am unfamiliar with Cython.

I have submitted a pull request. You can see the details there.

Thanks.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 8672
#8672

@asfimport asfimport added this to the 3.0.0 milestone Jan 11, 2023