ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups #8317

bkietz · 2020-10-01T12:42:12Z

No description provided.

github-actions · 2020-10-01T12:46:31Z

https://issues.apache.org/jira/browse/ARROW-10134

jorisvandenbossche · 2020-10-02T16:12:18Z

I think the main reason such a property would be interesting for dask's use case is to get the number of row groups when the statistics are not available or not yet parsed. So returning None in that case, as this PR does, is not very useful, I think.
Ideally, if the number of row groups is not known (row_groups is not set), it would retrieve this information from the FileMetaData.
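The fallback suggested in this comment can be illustrated with a small, self-contained sketch. It does not use pyarrow and is not the implementation in this PR: `FragmentSketch`, `FileMetaDataSketch`, `load_metadata`, and `read_footer` are hypothetical names invented for illustration, and the four-row-group file is an assumed example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FileMetaDataSketch:
    """Hypothetical stand-in for the parsed Parquet footer."""
    num_row_groups: int


@dataclass
class FragmentSketch:
    """Hypothetical model of a fragment, not the pyarrow class."""
    # Explicit row-group selection, or None when nothing is known yet.
    row_groups: Optional[List[int]]
    # Callable that reads the footer; in a real reader this performs IO.
    load_metadata: Callable[[], FileMetaDataSketch]

    @property
    def num_row_groups(self) -> int:
        if self.row_groups is not None:
            # A selection is already known, so no IO is needed.
            return len(self.row_groups)
        # Otherwise consult the file metadata instead of returning None.
        return self.load_metadata().num_row_groups


def read_footer() -> FileMetaDataSketch:
    # Assumed example: pretend the file on disk has four row groups.
    return FileMetaDataSketch(num_row_groups=4)


subset = FragmentSketch(row_groups=[1, 3], load_metadata=read_footer)
whole = FragmentSketch(row_groups=None, load_metadata=read_footer)
```

Under these assumptions, `subset.num_row_groups` is answered from the selection alone, while `whole.num_row_groups` has to call `load_metadata`, which is where the IO discussed in this thread would happen.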

bkietz · 2020-10-02T17:30:37Z

@jorisvandenbossche just confirming: you want f.num_row_groups to potentially perform IO?

python/pyarrow/_dataset.pyx

bkietz · 2020-10-05T20:41:36Z

@jorisvandenbossche PTAL

jorisvandenbossche · 2020-10-07T09:21:24Z

> you want f.num_row_groups to potentially perform IO?

Yes, that's indeed the consequence for now (if the metadata was not yet parsed before). Long term I would like us to cache the metadata, though, without necessarily having to parse all statistics etc. right away (https://issues.apache.org/jira/browse/ARROW-10131).
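As a rough illustration of the caching idea mentioned here, the sketch below reads a footer at most once and serves later accesses from memory. It is only a model under stated assumptions: `CachedFooterFragment`, its `reads` counter, and the dictionary used as metadata are hypothetical and do not reflect how Arrow stores or caches Parquet metadata.

```python
from functools import cached_property


class CachedFooterFragment:
    """Hypothetical fragment that reads its footer at most once."""

    def __init__(self, footer_row_groups: int):
        # Assumed example value standing in for what is stored in the file.
        self._footer_row_groups = footer_row_groups
        # Counts how often the simulated footer read happens.
        self.reads = 0

    @cached_property
    def metadata(self) -> dict:
        # Simulated IO: runs on first access only, then the result is cached.
        self.reads += 1
        return {"num_row_groups": self._footer_row_groups}

    @property
    def num_row_groups(self) -> int:
        # May trigger the (simulated) footer read on first use.
        return self.metadata["num_row_groups"]


fragment = CachedFooterFragment(footer_row_groups=4)
first = fragment.num_row_groups
second = fragment.num_row_groups
```

In this model the first access pays for the read and the second does not, so `fragment.reads` stays at one no matter how often the property is queried.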

cpp/src/arrow/dataset/file_parquet.cc

jorisvandenbossche · 2020-10-07T15:46:36Z

Could you add a test for the case I commented about? I think this should do it (I didn't run the code, though):

import pyarrow as pa
import pyarrow.dataset as ds
import pytest


@pytest.mark.parquet
def test_parquet_fragment_num_row_groups(tempdir):
    import pyarrow.parquet as pq

    # 8 rows with row_group_size=2 gives 4 row groups in the file
    table = pa.table({'a': range(8)})
    pq.write_table(table, tempdir / "test.parquet", row_group_size=2)
    dataset = ds.dataset(tempdir / "test.parquet", format="parquet")
    original_fragment = list(dataset.get_fragments())[0]

    # create fragment with subset of row groups
    fragment = original_fragment.format.make_fragment(
        original_fragment.path, original_fragment.filesystem,
        row_groups=[1, 3])
    assert fragment.num_row_groups == 2
    # ensure that parsing metadata preserves correct number of row groups
    fragment.ensure_complete_metadata()
    assert fragment.num_row_groups == 2
    assert len(fragment.row_groups) == 2

bkietz · 2020-10-08T00:25:28Z

CI failure is unrelated. Merging

Closes apache#8317 from bkietz/10134-Add-ParquetFileFragmentnu

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

bkietz requested a review from jorisvandenbossche October 1, 2020 12:42

pitrou reviewed Oct 5, 2020

python/pyarrow/_dataset.pyx (outdated, resolved)

jorisvandenbossche requested changes Oct 7, 2020

cpp/src/arrow/dataset/file_parquet.cc (outdated, resolved)

bkietz added 3 commits October 7, 2020 11:04

ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups

c2a21c4

load num_row_groups if it is not available

01779e7

correct setting num_row_groups_ in the event of subselection

9f5fcd1

bkietz force-pushed the 10134-Add-ParquetFileFragmentnu branch from 2257f86 to 9f5fcd1 Compare October 7, 2020 15:04

jorisvandenbossche reviewed Oct 7, 2020

cpp/src/arrow/dataset/file_parquet.cc (resolved)

bkietz added 2 commits October 7, 2020 13:33

don't use selected row group count to validate row group ids

3a67f37

lint fix

caadfbd

jorisvandenbossche approved these changes Oct 7, 2020

bkietz closed this in 1150c38 Oct 8, 2020

bkietz deleted the 10134-Add-ParquetFileFragmentnu branch February 25, 2021 16:19

asfimport mentioned this pull request Oct 8, 2020

[C++][Dataset] Add ParquetFileFragment::num_row_groups property #26145

Closed
