Skip to content

Align FastParquetEngine with pyarrow engines#7091

Merged
jrbourbeau merged 41 commits intodask:masterfrom
rjzamora:fastparquet-manual-metadata
Jan 29, 2021
Merged

Align FastParquetEngine with pyarrow engines#7091
jrbourbeau merged 41 commits intodask:masterfrom
rjzamora:fastparquet-manual-metadata

Conversation

@rjzamora
Copy link
Copy Markdown
Member

Depends on #7066
Addresses #6376

This PR (dramatically) improves the performance of the FastParquetEngine for large hive-partitioned datasets (when reading). It also rewrites most of the engine to align with ArrowLegacyEngine and ArrowDatasetEngine. In fact, all three engines now share >100 lines of metadata-processing code.

In order to prepare for the Blockwise+IO work in #7042, these changes include the explicit pickling of row-group metadata to ensure all partition-specific arguments are msgpack-serializable. I am not measuring much of an overhead for this round-trip serialization, but it mayu make sense to avoid pickling until we need it.

@rjzamora rjzamora marked this pull request as ready for review January 26, 2021 16:56
Copy link
Copy Markdown
Member Author

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant @jrbourbeau - Although this PR may look huge, it is really just moving around logic that was already there :)

I added some review comments to help explain the changes.

Copy link
Copy Markdown
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, more questions, but I think it's all along the right lines!

Comment on lines +744 to +745
# TODO: Adding `parquet_file._set_attrs()` here seems to
# cause a failure in `test_append_with_partition[fastparquet]`
Copy link
Copy Markdown
Member Author

@rjzamora rjzamora Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this comment - I'll look into this, but let me know if you have thoughts @martindurant

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not immediately - I guess append updates/mutates the original fmd, but we messed with it in the meantime?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - It seems that there were two issues: (1) Removing col.meta_data.statistics was somehow leading to the wrong dtype in the result. As far as I can tell, it is because the null_count is queried somewhere in fastparquet (I didn't confirm this, but found that the error vanished if I kept statistics.null_count). (2) Calling _set_attrs reset pf.cats - leading to the wrong Categorical dtype (so we need to save and reset cats after the call).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha, 2) makes sense: the cats are based on the pathnames that the pf sees. Not sure yet about 1).

@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented Jan 27, 2021

Thank you for the thorough review here @martindurant - It was extremely helpful and I really do appreciate it!

It seems like the only comment that isn't (at least temporarily) resolved is the global_lookup concern. Is there a change that you would be happy to see there, or is your concern mostly related to your intuition that most parquet datasets should comprise single-row-group files?

If the concern is with the use of a dict: We could move the information into file_row_groups if we store a list of tuples ( e.g. (local_row_group_id, global_row_group_id)) instead of integers (e.g. just local_row_group_id). [EDIT: This change is now included (see ee0f112)]

If the concern is about focusing on multi-row-group files: This is the dataset layout we explicitly recommend in dask_cudf (so we see it a lot).

@rjzamora
Copy link
Copy Markdown
Member Author

@jrbourbeau - I'd sat this PR is ready as soon as Martin gives the okay (unless you have other comments/questions) :)

@martindurant
Copy link
Copy Markdown
Member

No more comments

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora!

@jrbourbeau jrbourbeau merged commit 39ef483 into dask:master Jan 29, 2021
@rjzamora
Copy link
Copy Markdown
Member Author

Thanks @jrbourbeau, and thanks @martindurant for the great review/advice!

@rjzamora rjzamora deleted the fastparquet-manual-metadata branch May 21, 2024 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants