ARROW-8061: [C++][Dataset] Provide RowGroup fragments for ParquetFileFormat #6670
Conversation
I was just trying this out, and the row_groups property seems to work correctly for a dataset that I read with
@jorisvandenbossche my intention was that fragments would exclusively be discovered containing all row groups for a file. Fragments with a refined row group selection could then be created from these whole-file Fragments as desired:

```python
def single_row_group_fragments(parquet_dataset, filter, columns):
    for fragment in parquet_dataset.get_fragments(filter=filter, columns=columns):
        for row_group in range(fragment.metadata.num_row_groups):
            yield parquet_dataset.format.make_fragment(fragment.path,
                                                       row_groups=[row_group])
```

Fragments can be scanned in C++ so I can expose that to Python as well:

```python
assert first_row_group_fragment.row_groups == {0}
first_row_group_fragment.scan(memory_pool)
# NB: filter, columns already specified in get_fragments. See ARROW-8065
```
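The per-row-group pattern discussed above can be sketched in plain Python. The classes and helper below are mocks invented for illustration, not the pyarrow API; they only demonstrate splitting a whole-file fragment into single-row-group fragments:

```python
from dataclasses import dataclass
from typing import List

# Mock stand-in for a Parquet fragment; `row_groups` holds the indices of
# the row groups this fragment views within the file at `path`.
@dataclass
class MockParquetFragment:
    path: str
    row_groups: List[int]

def split_by_row_group(fragment):
    """Yield one single-row-group fragment per row group viewed by `fragment`."""
    for rg in fragment.row_groups:
        yield MockParquetFragment(fragment.path, [rg])

whole_file = MockParquetFragment("part-0.parquet", [0, 1, 2])
parts = list(split_by_row_group(whole_file))
print([p.row_groups for p in parts])  # -> [[0], [1], [2]]
```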
One potential advantage of having this at the dataset level is that … Now, an API where we can construct / scan fragments would certainly already be useful as well.
Another potential problem is that, assuming discovery happens only for full files, for an application like dask where the fragments are then re-created per row group, dask needs to know the number of row groups per file. (just thinking out loud)
I see. In that case, in C++ I'll add:

```python
for fragment in parquet_dataset.get_fragments(filter, columns,
                                              max_row_groups_per_fragment=1):
    yield from fragment.scan(memory_pool)
```

WRT using …
We have the same problem with the number of rows. We need to expose a lazy accessor or something like this. This is related to the … I'd say leave it out for now?
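The "lazy accessor" suggested above can be sketched with `functools.cached_property`: the row count is only computed on first access (e.g. by reading the Parquet footer) and then cached. `_read_footer` here is a stub standing in for real file I/O:

```python
from functools import cached_property  # Python 3.8+

class LazyFragmentMetadata:
    """Sketch of lazily exposing an expensive metadata field."""
    def __init__(self, path):
        self.path = path
        self.footer_reads = 0  # instrumentation to show laziness

    def _read_footer(self):
        # Stub: a real implementation would open `self.path` and parse
        # the Parquet footer here.
        self.footer_reads += 1
        return {"num_rows": 1000}

    @cached_property
    def num_rows(self):
        return self._read_footer()["num_rows"]

meta = LazyFragmentMetadata("part-0.parquet")
print(meta.num_rows, meta.num_rows, meta.footer_reads)  # -> 1000 1000 1
```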
+1 for me on the C++ side, I'll let @jorisvandenbossche decide on the python part.
This example segfaults for me:
Another question: shouldn't there be a …
@jorisvandenbossche I've added an optional schema argument to make_fragment (the default is to inspect the fragment and infer a schema). I've also added a test similar to your segfaulting case, which passes here.
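The optional-schema default described above follows a common pattern: infer only when the caller doesn't supply a schema. A minimal sketch (function and `inspect` callback are invented for illustration, not the pyarrow signature):

```python
def make_fragment(path, schema=None, inspect=None):
    """If no schema is passed, infer one by inspecting the file.
    `inspect` is a stub for reading the file's footer."""
    if schema is None:
        schema = inspect(path)
    return {"path": path, "schema": schema}

# Schema omitted: inferred via the (stubbed) inspection callback.
frag = make_fragment("part-0.parquet",
                     inspect=lambda p: ["a: int64", "b: string"])
print(frag["schema"])  # -> ['a: int64', 'b: string']
```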
Thanks! So now the snippet above doesn't segfault, but creating a fragment with a row group specified still does:
Will do some more thorough testing tomorrow (I can also push some more python docstrings / tests then)
The following review comments are attached to this diff hunk:

```
@@ -400,6 +438,34 @@ cdef class Fragment:
        """
        return Expression.wrap(self.fragment.partition_expression())

    def scan(self, MemoryPool memory_pool=None):
```
Should we be able to pass columns/filter arguments here? (similar to Dataset.scan and Scanner)
As above, these options are already specified in Dataset.get_fragments(). I could add optional parameters which would refine the filter/projection further, but this would involve reconstructing the fragment.
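Refining an existing filter as mentioned above would amount to conjoining the fragment's filter with the new one. A sketch with predicates represented as plain callables (the real dataset API uses expression objects; this representation is invented for illustration):

```python
def conjoin(base, extra):
    """Combine two row predicates; None means 'no filter'."""
    if base is None:
        return extra
    if extra is None:
        return base
    return lambda row: base(row) and extra(row)

# Base filter fixed at fragment creation, extra filter supplied at scan time.
base = lambda row: row["year"] >= 2019
extra = lambda row: row["month"] == 3
refined = conjoin(base, extra)
print(refined({"year": 2020, "month": 3}))  # -> True
print(refined({"year": 2018, "month": 3}))  # -> False
```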
Ah, yes, they are specified there indeed, if creating from a Dataset. But when re-creating a new fragment from its parts, it's still useful to specify filter/columns.
Fragment creation can be accomplished with FileFormat.make_fragment, which does take filter/columns. Please compare this method to Scanner.scan(self) rather than Dataset.scan(self, filter, columns, ...). I'm opposed to putting any more parameters here, as they're already specified elsewhere.
I don't think it's desirable that the fragment stores the filter; is that a temporary solution until ARROW-8065?
Yes, I think most of the discussion / disagreement indeed comes back to that issue.
I find it strange to tie the filter/columns to fragment creation, as for me those are conceptually scanning options. And if we decide to remove ScanOptions from fragment creation, that would indeed entail that?
I am trying this out with my dask POC, and running into the following issue: I would basically need a … I am also still getting segfaults in some cases, see #6670 (comment).
@jorisvandenbossche I'll extract …
One thing I already encountered is that …
Provides ParquetFileFragment, which may view a subset of row groups within a Parquet file. The indices of viewed row groups are available through the row_groups() property, which is exposed to Python. Construction of subset-viewing ParquetFileFragments is not yet exposed to Python.
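The subset-view semantics summarized above can be sketched in plain Python: a fragment constructed without explicit row-group indices views every row group in the file. The class below is a mock for illustration, not the pyarrow implementation:

```python
class MockParquetFileFragment:
    """Sketch: `row_groups=None` means the fragment views the whole file."""
    def __init__(self, num_row_groups_in_file, row_groups=None):
        self._total = num_row_groups_in_file
        self._row_groups = row_groups

    @property
    def row_groups(self):
        # Default view: every row group in the file.
        if self._row_groups is None:
            return set(range(self._total))
        return set(self._row_groups)

print(MockParquetFileFragment(3).row_groups)       # -> {0, 1, 2}
print(MockParquetFileFragment(3, [1]).row_groups)  # -> {1}
```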