ARROW-2801: [Python] Implement split_row_groups for ParquetDataset #2223

rgruener · 2018-07-06T21:12:24Z

I still need to write a unit test, but I figured that if there are any glaring issues, it is better to get feedback early.

codecov-io · 2018-07-06T22:06:56Z

Codecov Report

Merging #2223 into master will increase coverage by 2%.
The diff coverage is 5.26%.

@@            Coverage Diff            @@
##           master    #2223     +/-   ##
=========================================
+ Coverage   84.49%   86.49%     +2%     
=========================================
  Files         293      232     -61     
  Lines       45318    41132   -4186     
=========================================
- Hits        38290    35576   -2714     
+ Misses       7001     5556   -1445     
+ Partials       27        0     -27

Impacted Files	Coverage Δ
python/pyarrow/parquet.py	`87.59% <5.26%> (-3.75%)`	⬇️
cpp/src/arrow/pretty_print.cc	`65.58% <0%> (-18.59%)`	⬇️
cpp/src/arrow/builder.h	`91.75% <0%> (-5.51%)`	⬇️
cpp/src/plasma/common.cc	`90.32% <0%> (-4.13%)`	⬇️
python/pyarrow/compat.py	`74.59% <0%> (-3.5%)`	⬇️
cpp/src/arrow/io/file.cc	`93.67% <0%> (-2%)`	⬇️
cpp/src/arrow/io/io-file-test.cc	`92.67% <0%> (-1.94%)`	⬇️
cpp/src/arrow/python/helpers.cc	`79.87% <0%> (-1.84%)`	⬇️
cpp/src/arrow/test-util.h	`72.72% <0%> (-1.71%)`	⬇️
... and 135 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 551e9ce...c413a67.

xhochy

Code looks good, but this will be hard to test as we cannot yet write summary files. I guess it would probably be better to invest time in writing them first.

xhochy · 2018-07-08T18:10:46Z

python/pyarrow/parquet.py

@@ -864,6 +864,42 @@ def open_file(path, meta=None):
                                   common_metadata=self.common_metadata)
        return open_file

+    def _split_row_groups(self):
+        if not self.metadata or self.metadata.num_row_groups == 0:
+            raise NotImplementedError("split_row_groups is only implemented "


In future, we should also be able to support this without the central _metadata file by loading all the footers of the files in the dataset.

Agreed, this was just a first iteration implementing it with minimal effort.
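
To make the idea in this thread concrete, here is a minimal sketch (not the code from this PR) of splitting a dataset into one piece per row group by reading each file's footer rather than a central `_metadata` file. The names `split_row_groups`, `_count_from_footer`, and the `count_row_groups` parameter are hypothetical. The default counter assumes `pyarrow` is installed and relies on `pq.ParquetFile(path).metadata.num_row_groups`.

```python
from typing import Callable, Iterable, List, Tuple


def _count_from_footer(path: str) -> int:
    """Return the number of row groups recorded in one file's footer."""
    # Imported lazily so the splitting logic below can be exercised
    # without pyarrow by passing a different counter.
    import pyarrow.parquet as pq

    return pq.ParquetFile(path).metadata.num_row_groups


def split_row_groups(
    paths: Iterable[str],
    count_row_groups: Callable[[str], int] = _count_from_footer,
) -> List[Tuple[str, int]]:
    """Expand each file into one (path, row_group_index) piece per row group.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    pieces = []
    for path in paths:
        # Files with zero row groups contribute no pieces.
        for index in range(count_row_groups(path)):
            pieces.append((path, index))
    return pieces
```

Each resulting piece could then be read on its own with `pq.ParquetFile(path).read_row_group(index)`. The cost is one footer read per file, which is the trade-off against reading a single `_metadata` file.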

rgruener · 2018-07-10T14:38:10Z

> Code looks good, but this will be hard to test as we cannot yet write summary files. I guess it would probably be better to invest time in writing them first.

Yeah, that would likely be nice to have before merging this. From a brief look, it would normally not be too difficult to write (basically just read all the footers of a dataset). However, the current method for writing metadata (write_metadata) only takes a schema, which does not include row group information. Is there an easy way to alter that method to take a FileMetaData, which incorporates all file metadata?

xhochy · 2018-07-12T06:49:37Z

@rgruener Can you open a Parquet JIRA about what is missing on the parquet-cpp side to support _metadata files?

wesm · 2018-07-16T23:36:56Z

I will move this off 0.10.0 for now; if it can get tested and merged in time, that's great, but I suspect that @rgruener / Uber will be OK building packages for themselves if it gets merged in between 0.10 and 0.11

rgruener · 2018-07-18T16:21:32Z

Yeah, I'm fine waiting until this can be tested, so it is likely this won't be getting in for 0.10. Though in general we would like to be able to move to the vanilla open-source version 😉

rgruener · 2018-08-20T20:58:14Z

I am going to close this for now until it can be tested (will reopen when it is ready). When I have some time, I will work on writing the summary metadata files.

ARROW-2801: [Python] Implement split_row_groups for ParquetDataset

c413a67

xhochy reviewed Jul 8, 2018

rgruener mentioned this pull request Aug 1, 2018

Switch to using parquet summary metadata file to split row groups uber/petastorm#19

Closed

rgruener closed this Aug 20, 2018

asfimport mentioned this pull request Apr 15, 2021

[Python][C++][Dataset] Implement split_row_groups for ParquetDataset #19181

Closed
