Align FastParquetEngine with pyarrow engines #7091
Conversation
rjzamora
left a comment
@martindurant @jrbourbeau - Although this PR may look huge, it is really just moving around logic that was already there :)
I added some review comments to help explain the changes.
martindurant
left a comment
Sorry, more questions, but I think it's all along the right lines!
# TODO: Adding `parquet_file._set_attrs()` here seems to
# cause a failure in `test_append_with_partition[fastparquet]`
Related to this comment - I'll look into this, but let me know if you have thoughts @martindurant
Not immediately - I guess append updates/mutates the original fmd, but we messed with it in the meantime?
Interesting - it seems that there were two issues:

1. Removing `col.meta_data.statistics` was somehow leading to the wrong dtype in the result. As far as I can tell, it is because the `null_count` is queried somewhere in fastparquet (I didn't confirm this, but found that the error vanished if I kept `statistics.null_count`).
2. Calling `_set_attrs` reset `pf.cats`, leading to the wrong Categorical dtype (so we need to save and restore `cats` after the call).
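A minimal sketch of the save-and-restore pattern described for issue (2). `DummyParquetFile` and its `_set_attrs` are stand-ins for the fastparquet internals, not the real API:

```python
class DummyParquetFile:
    """Stand-in for fastparquet's ParquetFile (hypothetical)."""

    def __init__(self):
        # Partition categories inferred from the directory paths
        self.cats = {"year": [2019, 2020]}

    def _set_attrs(self):
        # Stand-in: recomputes attributes from the (now modified)
        # metadata, which clobbers `cats` with the wrong value.
        self.cats = {}


pf = DummyParquetFile()

# Save `cats` before the call and restore it afterwards,
# so the Categorical dtypes are preserved.
cats = pf.cats
pf._set_attrs()
pf.cats = cats

assert pf.cats == {"year": [2019, 2020]}
```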
aha, 2) makes sense: the cats are based on the pathnames that the pf sees. Not sure yet about 1).
Thank you for the thorough review here @martindurant - it was extremely helpful and I really do appreciate it! It seems like the only comment that isn't (at least temporarily) resolved is the `global_lookup` concern. Is there a change that you would be happy to see there, or is your concern mostly related to your intuition that most parquet datasets should comprise single-row-group files?

- If the concern is with the use of a dict: we could move the information into …
- If the concern is about focusing on multi-row-group files: this is the dataset layout we explicitly recommend in …
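For illustration, here is one hypothetical shape such a dict-based lookup could take (the paths, keys, and `global_lookup` structure below are illustrative, not the actual engine code): a mapping from file path to per-row-group info, so multi-row-group files can be split into partitions without re-reading footers.

```python
# Hypothetical lookup: file path -> per-row-group metadata
global_lookup = {
    "part.0.parquet": [
        {"row_group": 0, "num_rows": 50_000},
        {"row_group": 1, "num_rows": 50_000},
    ],
    "part.1.parquet": [
        {"row_group": 0, "num_rows": 100_000},
    ],
}

# One output partition per (path, row-group) pair
partitions = [
    (path, rg["row_group"])
    for path, row_groups in global_lookup.items()
    for rg in row_groups
]

assert len(partitions) == 3
```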
@jrbourbeau - I'd say this PR is ready as soon as Martin gives the okay (unless you have other comments/questions) :)
No more comments
Thanks @jrbourbeau, and thanks @martindurant for the great review/advice!
Depends on #7066
Addresses #6376
This PR (dramatically) improves the read performance of the `FastParquetEngine` for large hive-partitioned datasets. It also rewrites most of the engine to align with `ArrowLegacyEngine` and `ArrowDatasetEngine`. In fact, all three engines now share >100 lines of metadata-processing code.

In order to prepare for the `Blockwise`+IO work in #7042, these changes include the explicit pickling of row-group metadata to ensure all partition-specific arguments are `msgpack`-serializable. I am not measuring much overhead for this round-trip serialization, but it may make sense to avoid pickling until we need it.
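The pickling round-trip can be illustrated with a minimal sketch (`RowGroupMeta` and the attribute names are illustrative stand-ins, not the engine's actual classes): pickling the row-group metadata down to raw `bytes` turns a rich Python object into a plain value that msgpack can carry.

```python
import pickle


class RowGroupMeta:
    """Stand-in for a fastparquet row-group metadata object."""

    def __init__(self, num_rows, columns):
        self.num_rows = num_rows
        self.columns = columns


meta = RowGroupMeta(num_rows=100_000, columns=["x", "y"])

# Serialize the metadata to bytes: unlike an arbitrary Python
# object, a `bytes` value is directly msgpack-serializable.
partition_arg = pickle.dumps(meta)
assert isinstance(partition_arg, bytes)

# The engine can round-trip the metadata when the partition is read.
restored = pickle.loads(partition_arg)
assert restored.num_rows == 100_000
assert restored.columns == ["x", "y"]
```

The trade-off noted above is that this round-trip happens eagerly; deferring `pickle.dumps` until the serialized form is actually needed would avoid the cost on code paths that never ship the metadata anywhere.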