Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2801: [Python] Implement split_row_groups for ParquetDataset #2223

Closed
wants to merge 1 commit into from

Conversation

rgruener
Copy link

@rgruener rgruener commented Jul 6, 2018

I still need to write a unit test but figured if there are any glaring issues better to get feedback early

@codecov-io
Copy link

codecov-io commented Jul 6, 2018

Codecov Report

Merging #2223 into master will increase coverage by 2%.
The diff coverage is 5.26%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #2223     +/-   ##
=========================================
+ Coverage   84.49%   86.49%     +2%     
=========================================
  Files         293      232     -61     
  Lines       45318    41132   -4186     
=========================================
- Hits        38290    35576   -2714     
+ Misses       7001     5556   -1445     
+ Partials       27        0     -27
Impacted Files Coverage Δ
python/pyarrow/parquet.py 87.59% <5.26%> (-3.75%) ⬇️
cpp/src/arrow/pretty_print.cc 65.58% <0%> (-18.59%) ⬇️
cpp/src/arrow/builder.h 91.75% <0%> (-5.51%) ⬇️
cpp/src/plasma/common.cc 90.32% <0%> (-4.13%) ⬇️
python/pyarrow/compat.py 74.59% <0%> (-3.5%) ⬇️
cpp/src/arrow/io/file.cc 93.67% <0%> (-2%) ⬇️
cpp/src/arrow/io/io-file-test.cc 92.67% <0%> (-1.94%) ⬇️
cpp/src/arrow/python/helpers.cc 79.87% <0%> (-1.84%) ⬇️
cpp/src/arrow/test-util.h 72.72% <0%> (-1.71%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 551e9ce...c413a67. Read the comment docs.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good but this will be hard to test as we cannot yet write summary files. I guess it would probably better to investigate time into writing them first.

@@ -864,6 +864,42 @@ def open_file(path, meta=None):
common_metadata=self.common_metadata)
return open_file

def _split_row_groups(self):
if not self.metadata or self.metadata.num_row_groups == 0:
raise NotImplementedError("split_row_groups is only implemented "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, we should also be able to support this without the central _metadata file by loading all the footers of the files in the dataset.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this was just a first iteration implementing it with minimal effort.

@rgruener
Copy link
Author

Code looks good but this will be hard to test as we cannot yet write summary files. I guess it would probably better to investigate time into writing them first.

Yeah that would likely be nice to have before merging this. Looking into it briefly it would normally not be too difficult to write (basically just read all the footers of a dataset) however the current method for writing metadata (write_metadata) only takes a schema which does not include row group information. Is there an easy way to alter that method to take FileMetaData which incorporates all file metadata.

@xhochy
Copy link
Member

xhochy commented Jul 12, 2018

@rgruener Can you open a Parquet JIRA about what is missing on the parquet-cpp side to support _metadata files?

@wesm
Copy link
Member

wesm commented Jul 16, 2018

I will move this off 0.10.0 for now; if it can get tested and merged in time, that's great, but I suspect that @rgruener / Uber will be OK building packages for themselves if it gets merged in between 0.10 and 0.11

@rgruener
Copy link
Author

Yeah Im fine waiting until this can be tested so it is likely this wont be getting in for 0.10. Though in general we would like to be able to move to the vanilla open sourced version 😉

@rgruener
Copy link
Author

I am going to close this for now until it can be tests (will reopen when it is ready). When I have some time, I will work on writing the summary metadata files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants