[Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets) #26538

asfimport · 2020-11-13T00:45:37Z

I would like to enhance partition filters in methods such as:

pyarrow.parquet.ParquetDataset(path, filters)

I am proposing the following enhancements:

For the "in" and "not in" operators, the value should be allowed to be any typing.Iterable (that is also a container). Currently only set is supported; other iterables, such as list and tuple, do not work correctly. I would like to change this to accept any iterable.
Improve the documentation of the partition filters.

I see there is a new implementation, _ParquetDatasetV2, which already accepts an iterable, so the documentation update applies to the new version as well.

Reporter: Weiyang Zhao / @wyzhao
Assignee: Weiyang Zhao / @wyzhao

PRs and other links:

GitHub Pull Request #8672

Note: This issue was originally created as ARROW-10574. Please see the migration documentation for further details.


asfimport · 2020-11-13T13:02:10Z

Joris Van den Bossche / @jorisvandenbossche:
Could you already push a branch with your changes to GitHub? Seeing the code might help in understanding what you are proposing.

> for operator "in", "not in", the value currently must be a set.

I would think we already support other iterables that support the Python "in" operator. Do you have an example where it fails?
But I agree that converting it to a set might be good anyway.

> I would like to add a 'like' operator with the semantics of SQL "like". Alternatively, a regular expression could be used. I prefer SQL "like" semantics for consistency with SQL.

The ParquetDataset code is being replaced with a pyarrow.dataset-based implementation, so any significant enhancement or new feature should probably target this new implementation. Currently, we do not yet support general filter expressions, but there is ongoing work to allow this (I can't directly find the correct JIRA, but see e.g. ARROW-10305 for a similar discussion).

asfimport · 2020-11-15T20:01:16Z

Weiyang Zhao / @wyzhao:
Hi, @jorisvandenbossche

Just saw your comments. I don't know where Jira sent me the notification email.

Currently if I pass in a filter like this:

('x', 'in', ['a', 'b'])

It will not work because of line 876 in parquet.py:
if isinstance(f_value, set)
You see that it only checks 'set'.
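
To illustrate the difference, here is a minimal standalone sketch. It is not the code in parquet.py; the helper names normalize_membership_value and partition_matches are hypothetical and exist only for this example. It assumes that any sized, iterable container (list, tuple, set, frozenset) should be accepted for "in" / "not in" and converted to a set, while a bare string is rejected because it is almost certainly a mistake.

```python
from collections.abc import Collection


def normalize_membership_value(op, value):
    """Hypothetical helper: validate the value of an 'in' / 'not in' filter.

    Accepts any collection instead of only `set`, and returns a set so that
    membership tests behave the same regardless of the input type.
    """
    if op not in ("in", "not in"):
        return value
    # str and bytes are collections too, but ('x', 'in', 'ab') is
    # almost certainly a mistake, so reject them explicitly.
    if isinstance(value, (str, bytes)) or not isinstance(value, Collection):
        raise TypeError(
            f"'{op}' needs a collection of values, got {type(value).__name__}"
        )
    values = set(value)
    if not values:
        raise ValueError("Cannot use an empty collection as a filter value")
    return values


def partition_matches(partition_value, op, value):
    """Hypothetical helper: evaluate one (column, op, value) membership filter
    against a single partition value."""
    values = normalize_membership_value(op, value)
    if op == "in":
        return partition_value in values
    if op == "not in":
        return partition_value not in values
    raise ValueError(f"Unsupported operator in this sketch: {op!r}")


# A list, a tuple and a set are all treated the same way.
for candidates in (["a", "b"], ("a", "b"), {"a", "b"}):
    assert partition_matches("a", "in", candidates)
    assert not partition_matches("a", "not in", candidates)
```

With a strict isinstance(value, set) check, only the third case in the loop would take the membership branch; the list and the tuple would fall through to the scalar handling.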

I also improved the documentation to make this clear.

I dropped the 'like' enhancement because it is not supported in the Cython version and I am unfamiliar with Cython.

I have submitted a pull request. You can see the details there.

Thanks.

asfimport · 2020-12-21T16:44:48Z

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 8672
#8672

asfimport closed this as completed Dec 21, 2020

asfimport added this to the 3.0.0 milestone Jan 11, 2023
