ARROW-5436: [Python] parquet.read_table add filters keyword #4409
pitrou left a comment
Just one comment, otherwise LGTM.
    If the source is a file path, use a memory map to read file, which can
    improve performance in some environments
{1}
filters : List[Tuple] or List[List[Tuple]] or None (default)
This will appear in the read_pandas docstring, so you should probably add the filters argument to read_pandas as well.
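The point about the shared docstring can be illustrated with a minimal sketch. It assumes that both readers build their docstrings from one template with positional placeholders such as `{1}`; the template text, the function bodies, and the name `_READ_DOCSTRING` are hypothetical and are not taken from the PR.

```python
# Hypothetical names throughout: this only illustrates the shared-template
# pattern, not the actual pyarrow source.
_READ_DOCSTRING = """{0}

Parameters
----------
source : str
    Path of the file or dataset to read.
{1}
filters : List[Tuple] or List[List[Tuple]] or None (default)
    Predicates used to skip data that cannot match.
"""


def read_table(source, filters=None):
    # Placeholder body: a real reader would apply `filters` here.
    return {"source": source, "filters": filters}


def read_pandas(source, filters=None):
    # Accepts `filters` as well, so the shared docstring stays accurate.
    return read_table(source, filters=filters)


# Both functions receive the same parameter section, so anything added to
# the template is documented for both of them.
read_table.__doc__ = _READ_DOCSTRING.format(
    "Read a table from Parquet format.", "")
read_pandas.__doc__ = _READ_DOCSTRING.format(
    "Read a table from Parquet format, including pandas metadata.", "")
```

Because the `filters` entry lives in the template, it is documented for `read_pandas` whether or not that function accepts the argument, which is why the signature needs to match.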
_generate_partition_directories(fs, base_path, partition_spec, df)
table = pq.read_table(
Can you also add a test with [[('integers', '<', 3)]]? I consider the [[…]] to be better for end users as it supports the full scope of all possible queries.
Added one (although I don't think it is essential here, as I am only testing that the filters argument is passed through correctly).
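To make the difference between the two filter forms concrete, here is a minimal pure-Python sketch of how they are interpreted: a flat `List[Tuple]` is a single AND group, while the nested `List[List[Tuple]]` form is an OR over AND groups (disjunctive normal form). The helpers `normalize_filters` and `matches`, the operator table, and the sample rows are hypothetical illustrations, not pyarrow code.

```python
import operator

# Comparison operators supported by this sketch (an assumption; the real
# library may accept a different set).
_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def normalize_filters(filters):
    """Return filters in the nested List[List[Tuple]] form."""
    if not filters:
        return []
    # A flat List[Tuple] is one AND group; wrap it in an outer list.
    if isinstance(filters[0], tuple):
        return [list(filters)]
    return [list(group) for group in filters]


def matches(row, filters):
    """Evaluate DNF filters against one row given as a dict."""
    groups = normalize_filters(filters)
    if not groups:
        return True  # no filters: keep everything
    # Outer list is OR, inner list is AND.
    return any(
        all(_OPS[op](row[col], value) for col, op, value in group)
        for group in groups
    )


rows = [{"integers": i, "kind": "even" if i % 2 == 0 else "odd"}
        for i in range(5)]

# The flat and nested spellings of a single predicate select the same rows.
flat = [r["integers"] for r in rows if matches(r, [("integers", "<", 3)])]
nested = [r["integers"] for r in rows if matches(r, [[("integers", "<", 3)]])]

# An OR across columns can only be written in the nested form.
either = [
    r["integers"]
    for r in rows
    if matches(r, [[("integers", "<", 1)], [("kind", "=", "odd")]])
]
```

Here `flat` and `nested` agree, while `either` uses two groups joined by OR, which the flat form cannot express.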
Codecov Report
@@             Coverage Diff              @@
##           master    #4409        +/-   ##
============================================
- Coverage   88.26%   65.16%    -23.11%
============================================
  Files         846      475       -371
  Lines      103360    60446     -42914
  Branches     1253        0      -1253
============================================
- Hits        91233    39389     -51844
- Misses      11880    21057      +9177
+ Partials      247        0       -247
xhochy left a comment
+1, LGTM
Awesome!
https://issues.apache.org/jira/browse/ARROW-5436
I suppose the fact that parquet.read_table dispatched to FileSystem.read_parquet was for historical reasons (that function was added before ParquetDataset was added), but calling ParquetDataset directly there looks cleaner than going through FileSystem.read_parquet, so I changed that as well.

In addition, I made sure the memory_map keyword is actually passed through; I think that was an oversight in #2954.

(Those two changes should be useful anyway, regardless of whether the filters keyword is added.)