ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups #8317
I think the main reason such a property would be interesting for dask's use case is to get the number of row groups when the statistics are not available or not yet parsed. So the way that this PR returns |
@jorisvandenbossche just confirming: you want |
@jorisvandenbossche PTAL |
Yes, that's indeed the consequence for now (if the metadata was not yet parsed before). Long term I would like us to cache the metadata, though, without the need to necessarily directly parse all statistics etc (https://issues.apache.org/jira/browse/ARROW-10131- |
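To make the trade-off in this exchange concrete, here is a minimal pure-Python sketch of a lazily evaluated row-group count. It is a toy model, not pyarrow's implementation: `ToyFragment`, `row_groups_in_file`, `footer_reads`, and `_read_footer` are all hypothetical names, and whether the PR takes the same shortcut for an explicit subset is an assumption made only for illustration.

```python
class ToyFragment:
    """Hypothetical stand-in for a Parquet file fragment (not pyarrow's class)."""

    def __init__(self, row_groups_in_file, row_groups=None):
        # What the file footer would report once it is parsed (assumed known here).
        self._row_groups_in_file = row_groups_in_file
        # Explicit subset of row group indices, or None for the whole file.
        self.row_groups = row_groups
        # Counts simulated metadata parses so the cost is observable.
        self.footer_reads = 0
        self._cached_total = None

    def _read_footer(self):
        # Simulates the expensive step: opening the file and parsing its metadata.
        # The result is cached, so repeated calls parse only once.
        if self._cached_total is None:
            self.footer_reads += 1
            self._cached_total = self._row_groups_in_file
        return self._cached_total

    @property
    def num_row_groups(self):
        # An explicit subset already determines the count, so no parse is needed.
        if self.row_groups is not None:
            return len(self.row_groups)
        # Otherwise the count is only available from the (cached) footer.
        return self._read_footer()


whole_file = ToyFragment(row_groups_in_file=4)
subset = ToyFragment(row_groups_in_file=4, row_groups=[1, 3])

print(subset.num_row_groups, subset.footer_reads)          # 2 0
print(whole_file.num_row_groups, whole_file.footer_reads)  # 4 1
print(whole_file.num_row_groups, whole_file.footer_reads)  # 4 1 (cached)
```

In this sketch, the first access on a whole-file fragment pays for one metadata parse and later accesses reuse the cached value, which is the kind of caching the comment above hopes for in the long term.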
Force-pushed from 2257f86 to 9f5fcd1.
Could you add a test for the case I commented about? I think this should do it (I didn't run the code, though):

# Module-level imports that the surrounding test file is assumed to provide.
import pyarrow as pa
import pyarrow.dataset as ds
import pytest


@pytest.mark.parquet
def test_parquet_fragment_num_row_groups(tempdir):
    import pyarrow.parquet as pq

    # 8 rows written with row_group_size=2 gives 4 row groups.
    table = pa.table({'a': range(8)})
    pq.write_table(table, tempdir / "test.parquet", row_group_size=2)
    dataset = ds.dataset(tempdir / "test.parquet", format="parquet")
    original_fragment = list(dataset.get_fragments())[0]

    # create fragment with subset of row groups
    fragment = original_fragment.format.make_fragment(
        original_fragment.path, original_fragment.filesystem,
        row_groups=[1, 3])
    assert fragment.num_row_groups == 2
    # ensure that parsing metadata preserves correct number of row groups
    fragment.ensure_complete_metadata()
    assert fragment.num_row_groups == 2
    assert len(fragment.row_groups) == 2
CI failure is unrelated. Merging |
Closes apache#8317 from bkietz/10134-Add-ParquetFileFragmentnu Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>