ARROW-10574: [Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets) #8672

Closed
wants to merge 6 commits into from

@wyzhao wyzhao commented Nov 15, 2020

I would like to enhance partition filters in methods such as:

pyarrow.parquet.ParquetDataset(path, filters)

I propose the following enhancements:

  1. For the operators "in" and "not in", the value should be allowed to be any typing.Iterable that is also a container. Currently only a set is supported; other iterables, such as a list or a tuple, do not work correctly. I would like to change this to accept any iterable.

  2. Improve the documentation of the partition filters.

I see there is a new version implemented with _ParquetDatasetV2 which passed my tests with an iterable for "in" and "not in". So the documentation update is fine for the new version as well.
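
A minimal sketch of the behaviour proposed above, for illustration only. It is not pyarrow's implementation: `normalize_membership_value` and `partition_matches` are hypothetical helpers, and casting the partition string to the type of the filter values is an assumption about how directory-derived partition values would be compared.

```python
from collections.abc import Collection


def normalize_membership_value(op, value):
    """Return the filter value of an 'in' / 'not in' filter as a set.

    Hypothetical helper: accepts any sized, iterable container
    (list, tuple, set, frozenset, ...) instead of only a set.
    """
    if op not in ("in", "not in"):
        raise ValueError(f"unsupported operator: {op!r}")
    # A str is itself a Collection (of characters); treating it as a list
    # of candidate values is almost never what the caller intended.
    if isinstance(value, (str, bytes)) or not isinstance(value, Collection):
        raise TypeError(
            f"{op!r} needs a collection such as a list, tuple or set, "
            f"got {type(value).__name__}"
        )
    if len(value) == 0:
        raise ValueError("cannot use an empty collection as a filter value")
    return set(value)


def partition_matches(partition_value, op, value):
    """Evaluate one (column, op, value) filter against a partition value.

    Assumption: partition values arrive as strings (for example "2019"
    from a directory named year=2019) and are cast to the type of the
    filter values before the membership test.
    """
    candidates = normalize_membership_value(op, value)
    cast = type(next(iter(candidates)))
    found = cast(partition_value) in candidates
    return found if op == "in" else not found


# The same filter written with a list, a tuple and a set behaves identically.
for years in ([2019, 2020], (2019, 2020), {2019, 2020}):
    print(type(years).__name__, partition_matches("2019", "in", years))
```

With the real API, the corresponding call would pass such a filter to the `filters` argument mentioned above, for example a filter tuple like `("year", "in", [2019, 2020])`, where `year` is a made-up partition column.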

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for working on this! Added a few comments

[Four review threads on python/pyarrow/parquet.py, outdated and resolved]
@wyzhao wyzhao force-pushed the feature/partition_filter branch 2 times, most recently from 35cc90a to a7794d4 on November 22, 2020 16:00
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates!

[One review thread on python/pyarrow/parquet.py and two on python/pyarrow/tests/test_parquet.py, outdated and resolved]
@wyzhao
Author

wyzhao commented Dec 6, 2020

Hi Joris,

Thank you for reviewing my work. I saw that Python 3.5 is no longer supported, so I changed back to using "Collection" as you suggested. I believe everything is taken care of. Please review.
Thanks and best regards,

Weiyang
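
For readers following the "Collection" versus iterable discussion, here is a small sketch of the difference, using only the standard library. `collections.abc.Collection` was added in Python 3.6, which is why it was not an option while Python 3.5 was supported. The helper name `describe` is made up for this example.

```python
from collections.abc import Collection, Iterable


def describe(value):
    """Return (is_iterable, is_collection) for a value."""
    return isinstance(value, Iterable), isinstance(value, Collection)


samples = {
    "list": [1, 2],
    "tuple": (1, 2),
    "set": {1, 2},
    # A generator can be iterated once, but has no len() and no `in`
    # test that leaves it intact, so it is not a Collection.
    "generator": (n for n in (1, 2)),
}

for name, value in samples.items():
    print(name, describe(value))
```

A Collection is sized, iterable and supports `in`, so a membership filter can check its length and test values against it repeatedly, which a one-shot generator cannot offer.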

@jorisvandenbossche
Member

@wyzhao thanks for the update! I am waiting until #8816 gets merged, and then will get back to this PR (this will give a merge conflict, but I can handle that)

@jorisvandenbossche jorisvandenbossche changed the title from "ARROW-10574: [Python][Parquet] Enhance hive partition filtering." to "ARROW-10574: [Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets)" on Dec 21, 2020
@jorisvandenbossche
Member

Thanks @wyzhao !

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…filter (in addition to sets)

Closes apache#8672 from wyzhao/feature/partition_filter

Lead-authored-by: Weiyang Zhao <weiyzha@blackrock.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>