ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups #8317

Closed
wants to merge 5 commits into from

bkietz
Member

@bkietz bkietz commented Oct 1, 2020

No description provided.

@jorisvandenbossche
Member

I think the main reason such a property would be interesting for dask's use case is to get the number of row groups when the statistics are not all available or not yet parsed. So returning None in that case, as this PR does, is not very useful, I think.
Ideally, if the number of row groups is not known (the row_groups are not set), it would retrieve this information from the FileMetaData.
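
To make the suggestion concrete, here is a rough sketch of the fallback in plain Python. `FakeFileMetaData` and `FakeFragment` are hypothetical stand-ins invented for illustration; they are not pyarrow classes and this is not how the PR implements it. The assumption is only that an explicit row group subset answers the question directly, and that otherwise the footer metadata has to be read.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeFileMetaData:
    """Hypothetical stand-in for the Parquet footer (FileMetaData)."""
    num_row_groups: int


class FakeFragment:
    """Hypothetical stand-in for a Parquet fragment, for illustration only."""

    def __init__(self, footer: FakeFileMetaData,
                 row_groups: Optional[List[int]] = None):
        self._footer = footer         # what reading the file footer would return
        self.row_groups = row_groups  # None means no explicit subset was set
        self.metadata_reads = 0       # counts simulated IO

    def _read_metadata(self) -> FakeFileMetaData:
        # In a real fragment this is where the file would be opened.
        self.metadata_reads += 1
        return self._footer

    @property
    def num_row_groups(self) -> int:
        # An explicit subset is known without touching the file.
        if self.row_groups is not None:
            return len(self.row_groups)
        # Otherwise fall back to the footer instead of returning None.
        return self._read_metadata().num_row_groups


whole_file = FakeFragment(FakeFileMetaData(num_row_groups=4))
subset = FakeFragment(FakeFileMetaData(num_row_groups=4), row_groups=[1, 3])
print(whole_file.num_row_groups, subset.num_row_groups)
```

In this sketch only the fragment without an explicit subset triggers a (simulated) metadata read.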

@bkietz
Member Author

bkietz commented Oct 2, 2020

@jorisvandenbossche just confirming: you want f.num_row_groups to potentially perform IO?

python/pyarrow/_dataset.pyx (outdated, resolved review thread)
@bkietz
Member Author

bkietz commented Oct 5, 2020

@jorisvandenbossche PTAL

@jorisvandenbossche
Member

you want f.num_row_groups to potentially perform IO?

Yes, that's indeed the consequence for now (if the metadata was not yet parsed before). Long term I would like us to cache the metadata, though, without the need to necessarily directly parse all statistics etc (https://issues.apache.org/jira/browse/ARROW-10131-
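
As a rough illustration of the caching idea (not the actual Arrow design), the sketch below reads a simulated footer at most once and keeps only the row group count, without any statistics. `CachingFragment` and its `io_calls` counter are hypothetical names made up for this example.

```python
from functools import cached_property


class CachingFragment:
    """Hypothetical fragment that reads its footer at most once."""

    def __init__(self, footer_num_row_groups: int):
        self._footer_num_row_groups = footer_num_row_groups
        self.io_calls = 0  # counts simulated footer reads

    @cached_property
    def metadata(self) -> dict:
        # Simulated footer read; only the row group count is kept here,
        # no per-column statistics are parsed.
        self.io_calls += 1
        return {"num_row_groups": self._footer_num_row_groups}

    @property
    def num_row_groups(self) -> int:
        # The first access performs the simulated IO, later ones hit the cache.
        return self.metadata["num_row_groups"]


fragment = CachingFragment(footer_num_row_groups=4)
counts = [fragment.num_row_groups for _ in range(3)]
print(counts, fragment.io_calls)
```

With the metadata cached on the fragment, repeated accesses to `num_row_groups` cost one read in total.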

@bkietz bkietz force-pushed the 10134-Add-ParquetFileFragmentnu branch from 2257f86 to 9f5fcd1 Compare October 7, 2020 15:04
@jorisvandenbossche
Member

Could you add a test for the case I commented about? I think this should do it (I didn't run the code, though):

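# module-level imports the test relies on
import pytest
import pyarrow as pa
import pyarrow.dataset as ds


@pytest.mark.parquet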
def test_parquet_fragment_num_row_groups(tempdir):
    import pyarrow.parquet as pq

    table = pa.table({'a': range(8)})
    pq.write_table(table, tempdir / "test.parquet", row_group_size=2)
    dataset = ds.dataset(tempdir / "test.parquet", format="parquet")
    original_fragment = list(dataset.get_fragments())[0]

    # create fragment with subset of row groups
    fragment = original_fragment.format.make_fragment(
        original_fragment.path, original_fragment.filesystem,
        row_groups=[1, 3])
    assert fragment.num_row_groups == 2
    # ensure that parsing metadata preserves correct number of row groups
    fragment.ensure_complete_metadata()
    assert fragment.num_row_groups == 2
    assert len(fragment.row_groups) == 2

@bkietz
Member Author

bkietz commented Oct 8, 2020

CI failure is unrelated. Merging

@bkietz bkietz closed this in 1150c38 Oct 8, 2020
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
Closes apache#8317 from bkietz/10134-Add-ParquetFileFragmentnu

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@bkietz bkietz deleted the 10134-Add-ParquetFileFragmentnu branch February 25, 2021 16:19
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#8317 from bkietz/10134-Add-ParquetFileFragmentnu

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>