Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

https://issues.apache.org/jira/browse/ARROW-5436

I suppose the fact that parquet.read_table dispatched to FileSystem.read_parquet was for historical reasons (that function was added before ParquetDataset was added), but directly calling ParquetDataset there looks cleaner instead of going through FileSystem.read_parquet. So therefore I also changed that.

In addition, I made sure the memory_map keyword was actually passed through, I think an oversight of #2954.

(those two changes should be useful anyway, regardless of adding filters keyword or not)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment, otherwise LGTM.

If the source is a file path, use a memory map to read file, which can
improve performance in some environments
{1}
filters : List[Tuple] or List[List[Tuple]] or None (default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will appear in the read_pandas docstring, so you should probably add the filters argument to read_pandas as well.


_generate_partition_directories(fs, base_path, partition_spec, df)

table = pq.read_table(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test with [[('integers', '<', 3)]]? I consider the [[…]] to be better for end users as it supports the full scope of all possible queries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one (although I think it is not that essential here, as I am only testing the filter argument is correctly passed through)

@jorisvandenbossche
Copy link
Member Author

@pitrou @xhochy thanks for the reviews, updated the PR.

@codecov-io
Copy link

Codecov Report

Merging #4409 into master will decrease coverage by 23.1%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4409       +/-   ##
===========================================
- Coverage   88.26%   65.16%   -23.11%     
===========================================
  Files         846      475      -371     
  Lines      103360    60446    -42914     
  Branches     1253        0     -1253     
===========================================
- Hits        91233    39389    -51844     
- Misses      11880    21057     +9177     
+ Partials      247        0      -247
Impacted Files Coverage Δ
python/pyarrow/parquet.py 92.21% <100%> (-0.02%) ⬇️
python/pyarrow/tests/test_parquet.py 96.09% <100%> (+0.03%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/extension_type.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/kernels/compare.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util-internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/sse-util.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
... and 602 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4dad32...85e5b0e. Read the comment docs.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@rjurney
Copy link

rjurney commented Jun 6, 2019

Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants