I would like to enhance partition filters in methods such as:
pyarrow.parquet.ParquetDataset(path, filters)
I am proposing the below enhancements:
For the operators "in" and "not in", the value should be allowed to be any typing.Iterable (that is also a container). Currently only a set is supported, while other iterables such as list and tuple do not work correctly. I would like to change this to accept any iterable.
Improve the documentation of the partition filters.
I see that the new implementation, _ParquetDatasetV2, already accepts an iterable, so the documentation update would apply to the new version as well.
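To make the first proposal concrete, here is a minimal sketch of what accepting any iterable could look like. It is plain Python and does not reflect the actual pyarrow code; the helper names `normalize_membership_value` and `matches` are hypothetical. It also assumes that a bare string should be rejected, since a string is itself iterable and would otherwise be treated as a collection of single characters.

```python
from collections.abc import Iterable


def normalize_membership_value(op, value):
    """Hypothetical helper: coerce the value of an "in" / "not in" filter to a frozenset."""
    if op not in ("in", "not in"):
        # Other operators keep their value unchanged.
        return value
    # str and bytes are iterable, but iterating over them yields single
    # characters or integers, which is rarely what the filter author means.
    if isinstance(value, (str, bytes)) or not isinstance(value, Iterable):
        raise TypeError(
            f"value for operator {op!r} must be a non-string iterable, "
            f"got {type(value).__name__}"
        )
    # A frozenset gives uniform, fast membership tests for list, tuple,
    # set, or generator inputs (elements must be hashable).
    return frozenset(value)


def matches(partition_value, op, value):
    """Hypothetical helper: evaluate one membership filter against a partition value."""
    members = normalize_membership_value(op, value)
    if op == "in":
        return partition_value in members
    if op == "not in":
        return partition_value not in members
    raise ValueError(f"unsupported operator: {op!r}")
```

With this, `matches("2020", "in", ["2019", "2020"])` and the same call with a tuple or a set behave identically, while passing the string `"2020"` as the value raises a `TypeError` instead of silently doing a substring test.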
Joris Van den Bossche / @jorisvandenbossche:
Could you already push a branch with your changes to GitHub? Seeing the code might help in understanding what you are proposing.
for operator "in", "not in", the value currently must be a set.
I would think we already support other iterables that support the Python "in" operator. Do you have an example where it fails?
But I agree that converting it to a set might be good anyway.
I would like to add a 'like' operator with the semantics of SQL "like". Alternatively, a regular expression could be used. I prefer SQL "like" semantics for consistency with SQL.
The ParquetDataset code is being replaced with a pyarrow.dataset-based implementation, so any significant enhancement or new feature should probably target this new implementation. Currently, we do not yet support general filter expressions, but work on allowing this is ongoing (I can't directly find the correct JIRA, but see e.g. ARROW-10305 for a similar discussion).
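For reference, the SQL "like" semantics discussed above can be expressed through a regular expression, so the two options are not far apart. The following is only a sketch in plain Python, not part of pyarrow: the function names `like_to_regex` and `sql_like` are hypothetical, and it assumes the basic wildcards only (`%` for any run of characters, `_` for exactly one character), with no escape character.

```python
import re


def like_to_regex(pattern):
    """Hypothetical helper: translate a SQL LIKE pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any sequence, including an empty one
        elif ch == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    # DOTALL so that the wildcards also match newline characters.
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """Return True if the whole string value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For example, `sql_like("2020-01", "2020-%")` is true and `sql_like("2021-01", "2020-%")` is false. Using `fullmatch` mirrors the fact that a LIKE pattern has to cover the entire value, not just a part of it.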
Reporter: Weiyang Zhao / @wyzhao
Assignee: Weiyang Zhao / @wyzhao
PRs and other links:
Note: This issue was originally created as ARROW-10574. Please see the migration documentation for further details.