ARROW-2801: [Python] Implement split_row_groups for ParquetDataset #2223
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master    #2223     +/-  ##
=========================================
+ Coverage   84.49%   86.49%       +2%
=========================================
  Files         293      232       -61
  Lines       45318    41132     -4186
=========================================
- Hits        38290    35576     -2714
+ Misses       7001     5556     -1445
+ Partials       27        0       -27
=========================================
```
Continue to review full report at Codecov.
Code looks good, but this will be hard to test as we cannot yet write summary files. I guess it would probably be better to invest time into writing them first.
```diff
@@ -864,6 +864,42 @@ def open_file(path, meta=None):
                              common_metadata=self.common_metadata)
         return open_file

+    def _split_row_groups(self):
+        if not self.metadata or self.metadata.num_row_groups == 0:
+            raise NotImplementedError("split_row_groups is only implemented "
```
In the future, we should also be able to support this without the central `_metadata` file by loading all the footers of the files in the dataset.
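The splitting step itself is straightforward once per-file row-group counts are known from the metadata. A minimal, self-contained sketch of the idea (the name `split_row_groups` and the plain dict/tuple representation here are illustrative, not the actual `ParquetDataset` internals):

```python
from typing import Dict, List, Tuple


def split_row_groups(row_group_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Expand a mapping of file path -> row-group count into one
    (path, row_group_index) piece per row group, so each piece can be
    read independently."""
    pieces = []
    for path, count in row_group_counts.items():
        for index in range(count):
            pieces.append((path, index))
    return pieces


pieces = split_row_groups({"part-0.parquet": 2, "part-1.parquet": 1})
```

In the real implementation the counts would come from the footer (`FileMetaData.num_row_groups`) rather than a hand-built dict.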
Agreed, this was just a first iteration implementing it with minimal effort.
Yeah, that would likely be nice to have before merging this. Looking into it briefly, it would normally not be too difficult to write (basically just read all the footers of a dataset); however, the current method for writing metadata (`write_metadata`) only takes a schema, which does not include row group information. Is there an easy way to alter that method to take `FileMetaData` which incorporates all file metadata?
@rgruener Can you open a Parquet JIRA about what is missing on the
I will move this off 0.10.0 for now; if it can get tested and merged in time, that's great, but I suspect that @rgruener / Uber will be OK building packages for themselves if it gets merged in between 0.10 and 0.11.
Yeah, I'm fine waiting until this can be tested, so it is likely this won't be getting in for 0.10. Though in general we would like to be able to move to the vanilla open-sourced version 😉
I am going to close this for now until it can be tested (I will reopen it when it is ready). When I have some time, I will work on writing the summary metadata files.
I still need to write a unit test, but I figured that if there are any glaring issues, it's better to get feedback early.