ARROW-5436: [Python] parquet.read_table add filters keyword #4409

jorisvandenbossche · 2019-05-29T13:08:24Z

https://issues.apache.org/jira/browse/ARROW-5436

I suppose parquet.read_table dispatched to FileSystem.read_parquet for historical reasons (that function was added before ParquetDataset existed), but calling ParquetDataset directly looks cleaner than going through FileSystem.read_parquet, so I changed that as well.

In addition, I made sure the memory_map keyword is actually passed through; I think that was an oversight in #2954.

(Those two changes should be useful regardless of whether the filters keyword is added.)
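
The delegation described above can be illustrated with a stand-in. This is a minimal sketch, not the pyarrow source: `StubDataset`, the `dataset_cls` parameter, and this simplified `read_table` are hypothetical and only mimic the shape of the call, so that forwarding of `memory_map` and `filters` can be checked without any Parquet files.

```python
class StubDataset:
    """Hypothetical stand-in for ParquetDataset; records what it was given."""

    def __init__(self, source, filesystem=None, filters=None, memory_map=False):
        self.source = source
        self.filesystem = filesystem
        self.filters = filters
        self.memory_map = memory_map

    def read(self, columns=None):
        # A real dataset would return a Table; the stub echoes its settings.
        return {
            "source": self.source,
            "columns": columns,
            "filters": self.filters,
            "memory_map": self.memory_map,
        }


def read_table(source, columns=None, filesystem=None, filters=None,
               memory_map=False, dataset_cls=StubDataset):
    # Construct the dataset directly and forward every keyword explicitly,
    # so options such as memory_map are not silently dropped on the way.
    dataset = dataset_cls(source, filesystem=filesystem, filters=filters,
                          memory_map=memory_map)
    return dataset.read(columns=columns)


result = read_table("data/", columns=["integers"],
                    filters=[("integers", "<", 3)], memory_map=True)
```

Forwarding each keyword by name, rather than relying on an intermediate helper, makes a dropped argument visible in a single place.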

pitrou

Just one comment, otherwise LGTM.

pitrou · 2019-05-30T13:14:50Z

python/pyarrow/parquet.py

    If the source is a file path, use a memory map to read file, which can
    improve performance in some environments
+filters : List[Tuple] or List[List[Tuple]] or None (default)


This will appear in the read_pandas docstring, so you should probably add the filters argument to read_pandas as well.
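
To make the concern concrete: when two functions share one docstring template, every parameter the template documents should be accepted by both. The following is a rough sketch under that assumption; `_READ_DOCSTRING`, `documented_params`, and `documented_but_missing` are hypothetical names, and the function bodies are placeholders rather than the real readers.

```python
import inspect

# Hypothetical shared template, formatted once per public function.
_READ_DOCSTRING = """{summary}

Parameters
----------
source : str
columns : list, optional
memory_map : bool, default False
filters : List[Tuple] or List[List[Tuple]] or None (default)
"""


def read_table(source, columns=None, memory_map=False, filters=None):
    # Placeholder body: a real implementation would read Parquet data.
    return ("table", source, columns, memory_map, filters)


def read_pandas(source, columns=None, memory_map=False, filters=None):
    # Accepts filters too, so the shared docstring stays truthful.
    return read_table(source, columns=columns, memory_map=memory_map,
                      filters=filters)


read_table.__doc__ = _READ_DOCSTRING.format(summary="Read a Table.")
read_pandas.__doc__ = _READ_DOCSTRING.format(
    summary="Read a Table, including pandas metadata.")


def documented_params(func):
    """Names that appear as 'name : type' lines in the docstring."""
    names = []
    for line in func.__doc__.splitlines():
        name, sep, _ = line.partition(" : ")
        if sep and name.isidentifier():
            names.append(name)
    return names


def documented_but_missing(func):
    """Documented parameters that the signature does not actually accept."""
    accepted = inspect.signature(func).parameters
    return [n for n in documented_params(func) if n not in accepted]
```

A check like `documented_but_missing(read_pandas)` would flag `filters` if it were documented in the template but absent from the `read_pandas` signature.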

xhochy · 2019-05-30T21:31:13Z

python/pyarrow/tests/test_parquet.py

+
+    _generate_partition_directories(fs, base_path, partition_spec, df)
+
+    table = pq.read_table(


Can you also add a test with [[('integers', '<', 3)]]? I consider the [[…]] to be better for end users as it supports the full scope of all possible queries.

Added one, although I think it is not essential here, since I am only testing that the filters argument is correctly passed through.
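
For readers unfamiliar with the two accepted shapes, here is a small sketch of how they relate, as I understand the semantics: a flat list of tuples is a conjunction (AND), while a list of lists is a disjunction (OR) of such conjunctions. The `normalize` and `matches` helpers and the operator table are hypothetical illustrations that evaluate filters against plain dictionaries; they are not pyarrow's implementation.

```python
import operator

# Assumed operator table for illustration only.
_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def normalize(filters):
    """Wrap a flat List[Tuple] so both forms become List[List[Tuple]]."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def matches(row, filters):
    """True if any inner list (OR) has all of its predicates true (AND)."""
    return any(
        all(_OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in normalize(filters)
    )


# Dictionaries standing in for partition key values.
rows = [{"integers": i, "strings": s} for i in range(5) for s in ("a", "b")]

flat = [r for r in rows if matches(r, [("integers", "<", 3)])]
nested = [r for r in rows if matches(r, [[("integers", "<", 3)]])]
either = [r for r in rows
          if matches(r, [[("integers", "<", 1)], [("strings", "=", "b")]])]
```

Here `flat` and `nested` select the same rows, while `either` shows a query that only the nested form can express: rows where `integers < 1` or `strings = 'b'`.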

jorisvandenbossche · 2019-06-05T08:13:42Z

@pitrou @xhochy thanks for the reviews, updated the PR.

codecov-io · 2019-06-05T11:49:34Z

Codecov Report

Merging #4409 into master will decrease coverage by 23.1%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #4409       +/-   ##
===========================================
- Coverage   88.26%   65.16%   -23.11%     
===========================================
  Files         846      475      -371     
  Lines      103360    60446    -42914     
  Branches     1253        0     -1253     
===========================================
- Hits        91233    39389    -51844     
- Misses      11880    21057     +9177     
+ Partials      247        0      -247

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/pyarrow/parquet.py | `92.21% <100%> (-0.02%)` | ⬇️ |
| python/pyarrow/tests/test_parquet.py | `96.09% <100%> (+0.03%)` | ⬆️ |
| cpp/src/arrow/util/memory.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/gandiva/date_utils.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/extension_type.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/compute/kernels/compare.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/util/memory.cc | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/filesystem/util-internal.cc | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/util/sse-util.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/gandiva/decimal_type_util.h | `0% <0%> (-100%)` | ⬇️ |

... and 602 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4dad32...85e5b0e.

xhochy

+1, LGTM

rjurney · 2019-06-06T20:27:19Z

Awesome!

jorisvandenbossche added 4 commits May 29, 2019 14:49

simplify read_table (use ParquetDataset directly)

896abb2

fix passing of memory_map (leftover from ARROW-2807)

9c10f70

add filters keyword

4eb2ea7

fix test

4ea7b77

pitrou requested changes May 30, 2019

xhochy reviewed May 30, 2019

jorisvandenbossche added 3 commits June 5, 2019 10:01

Merge remote-tracking branch 'upstream/master' into ARROW-5436-parquet-read_table

0df8c88

add filters to read_pandas

9baf420

add test with nested list

0ae1488

lint

85e5b0e

xhochy approved these changes Jun 6, 2019

xhochy closed this in d235f69 Jun 6, 2019

jorisvandenbossche mentioned this pull request Jun 6, 2019

Add filters parameter to pandas.read_parquet() to enable PyArrow/Parquet partition filtering pandas-dev/pandas#26551

Closed

asfimport mentioned this pull request Nov 19, 2021

[Python] expose filters argument in parquet.read_table #21889

Closed
