Align FastParquetEngine with pyarrow engines #7091

Merged

jrbourbeau merged 41 commits into dask:master from rjzamora:fastparquet-manual-metadata

Jan 29, 2021

Member

rjzamora commented Jan 20, 2021

Depends on #7066
Addresses #6376

This PR (dramatically) improves the performance of the FastParquetEngine for large hive-partitioned datasets (when reading). It also rewrites most of the engine to align with ArrowLegacyEngine and ArrowDatasetEngine. In fact, all three engines now share >100 lines of metadata-processing code.

In order to prepare for the Blockwise+IO work in #7042, these changes include the explicit pickling of row-group metadata to ensure all partition-specific arguments are msgpack-serializable. I am not measuring much of an overhead for this round-trip serialization, but it may make sense to avoid pickling until we need it.
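
A rough illustration of the round trip described above, not the PR's actual code: the sketch pickles a stand-in row-group object into bytes so that every value in a partition's argument dict is a plain type that msgpack can encode. The names `make_part`, `load_row_group`, and `MSGPACK_TYPES`, and the use of `SimpleNamespace` as a metadata object, are assumptions made for illustration only.

```python
import pickle
from types import SimpleNamespace

# Plain types that msgpack can encode natively.
MSGPACK_TYPES = (type(None), bool, int, float, str, bytes, list, tuple, dict)


def make_part(path, row_group):
    # Pickle the row-group object so the part holds only strings and bytes.
    return {"piece": path, "row_group": pickle.dumps(row_group)}


def load_row_group(part):
    # Reverse the round trip inside the task that reads the partition.
    return pickle.loads(part["row_group"])


# Hypothetical stand-in for an engine-specific row-group metadata object.
row_group = SimpleNamespace(num_rows=1000, file_path="part.0.parquet")

part = make_part("year=2021/part.0.parquet", row_group)
is_plain = all(isinstance(value, MSGPACK_TYPES) for value in part.values())
restored = load_row_group(part)
```

The cost of this design is one `pickle.dumps`/`pickle.loads` pair per partition, which is the overhead the comment above refers to.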

rjzamora added 20 commits

January 13, 2021 21:13


          introduce use_common_kwargs/common_kwargs to read_parquet

84692db


          fix compatibility issue with dask_cudf

52a1b18


          trigger formatting

eb687cd


          retrigger formatting

069ad60


          retrigger formatting

ac3ea8c


          Merge remote-tracking branch 'upstream/master' into simplify-pyarrow-parquet

d19176a


          avoid frag generation when filtering was already performed

e3589b9


          move meta generation to standalone method

610be3c


          add _construct_parts for fastparquet

36b6d54


          avoid gathering statistics in fastparquet when it isn't necessary

f6a3a96


          passing parquet_file object working in fastparquet, but requires pickling of row-groups

f52b109


          reducing graph size (but also reducing performance)

b89ada9


          remove fast_metadata

de508ee


          all tests passing with new fastparquet algos

21c6493


          add _process_metadata for fastparquet

266c42a


          add _organize_row_groups

2f1b5ea


          add _get_thrift_row_groups

e48a291


          add _row_groups_to_parts utility

c4c8ec3


          pyarrow and fastparquet aligned to use same _row_groups_to_parts

e5c92e7


          update test_split_row_groups tests to test fastparquet engine

59f8544

rjzamora commented


dask/dataframe/io/parquet/fastparquet.py Outdated Show resolved Hide resolved


          remove ParquetFile override

c0106e4

rjzamora mentioned this pull request

Support Blockwise-based IO for DataFrame collections #7042

Closed

rjzamora added 2 commits

January 20, 2021 13:39


          remove unnecessary line

c8b5b45


          sync with latest version of 7066

9e34a87

rjzamora marked this pull request as ready for review

January 26, 2021 16:56


          Merge remote-tracking branch 'upstream/master' into fastparquet-manual-metadata

7bd9964

rjzamora commented


dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved


          revert accidental CategoricalIndex change

e87dd6c

rjzamora commented


dask/dataframe/io/parquet/fastparquet.py Outdated Show resolved Hide resolved

rjzamora added 2 commits

January 26, 2021 10:05


          warn/raise if filters are passed without gather_statistics

602151f


          remove commented line

e6fb493

rjzamora commented


Member Author

rjzamora left a comment

@martindurant @jrbourbeau - Although this PR may look huge, it is really just moving around logic that was already there :)

I added some review comments to help explain the changes.

dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved

dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved

dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved

dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved

dask/dataframe/io/parquet/arrow.py Show resolved Hide resolved

dask/dataframe/io/tests/test_parquet.py Show resolved Hide resolved

dask/dataframe/io/tests/test_parquet.py Show resolved Hide resolved

dask/dataframe/io/tests/test_parquet.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Outdated Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

rjzamora added 3 commits

January 26, 2021 11:33


          cleanup

67d97e7


          fix row_groups vs fmd.row_groups mistake

9838d76


          temporarily remove _set_attrs call

a832217

martindurant reviewed


Member

martindurant left a comment

Sorry, more questions, but I think it's all along the right lines!

dask/dataframe/io/parquet/utils.py Show resolved Hide resolved

dask/dataframe/io/parquet/utils.py Show resolved Hide resolved

dask/dataframe/io/parquet/utils.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Outdated Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Outdated Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

dask/dataframe/io/parquet/fastparquet.py Show resolved Hide resolved

rjzamora commented


dask/dataframe/io/parquet/fastparquet.py Outdated

Comment on lines +744 to +745

		# TODO: Adding `parquet_file._set_attrs()` here seems to
		# cause a failure in `test_append_with_partition[fastparquet]`

Member Author

rjzamora Jan 26, 2021 (edited)

Related to this comment - I'll look into this, but let me know if you have thoughts @martindurant

Member

martindurant Jan 26, 2021

Not immediately - I guess append updates/mutates the original fmd, but we messed with it in the meantime?

Member Author

rjzamora Jan 27, 2021

Interesting - It seems that there were two issues: (1) Removing col.meta_data.statistics was somehow leading to the wrong dtype in the result. As far as I can tell, it is because the null_count is queried somewhere in fastparquet (I didn't confirm this, but found that the error vanished if I kept statistics.null_count). (2) Calling _set_attrs reset pf.cats - leading to the wrong Categorical dtype (so we need to save and reset cats after the call).

Member

martindurant Jan 27, 2021

aha, 2) makes sense: the cats are based on the pathnames that the pf sees. Not sure yet about 1).
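
A minimal sketch of the save-and-restore workaround described in point (2) of this thread. `ParquetFileStub` is a hypothetical stand-in that only mimics the reported side effect (a `_set_attrs` call resetting `cats`); it is not fastparquet's `ParquetFile`, and `refresh_attrs_keeping_cats` is an illustrative name.

```python
class ParquetFileStub:
    """Hypothetical stand-in; it only mimics the reset behaviour reported above."""

    def __init__(self, cats):
        self.cats = cats

    def _set_attrs(self):
        # Mimic the observed side effect: category information is discarded.
        self.cats = {}


def refresh_attrs_keeping_cats(pf):
    saved_cats = pf.cats  # remember the partition categories
    pf._set_attrs()       # refresh derived attributes (this resets cats)
    pf.cats = saved_cats  # restore what the call discarded
    return pf


pf = ParquetFileStub({"year": [2020, 2021]})
refresh_attrs_keeping_cats(pf)
```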

rjzamora added 6 commits

January 26, 2021 13:12


          review-related cleanup

46f2c4d


          fix _set_attrs problem

64007a9


          add back pf.row_groups = None as this dramatically reduces the graph size

ba503ef


          add check for pandas_type in case the user really wants bytes

5010ffd


          remove null_count from statistics gathering since it is not currently used

1a460ca


          fall back to min/max if min_value/max_value are None

19d6479
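
The fallback named in the last commit message can be sketched as follows. This is an illustration under assumptions, not the committed code: `stat_bounds` is a hypothetical helper, and the statistics objects are `SimpleNamespace` stand-ins carrying the four attribute names mentioned in the commit message.

```python
from types import SimpleNamespace


def stat_bounds(stats):
    """Return (lower, upper), preferring min_value/max_value over min/max."""
    # Compare against None explicitly so that a legitimate 0 is not skipped.
    lower = stats.min_value if stats.min_value is not None else stats.min
    upper = stats.max_value if stats.max_value is not None else stats.max
    return lower, upper


# Stand-ins for column statistics written with the newer and older fields.
new_style = SimpleNamespace(min_value=0, max_value=9, min=None, max=None)
old_style = SimpleNamespace(min_value=None, max_value=None, min=2, max=8)
```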

Member Author

rjzamora commented Jan 27, 2021 (edited)

Thank you for the thorough review here @martindurant - It was extremely helpful and I really do appreciate it!

It seems like the only comment that isn't (at least temporarily) resolved is the global_lookup concern. Is there a change that you would be happy to see there, or is your concern mostly related to your intuition that most parquet datasets should comprise single-row-group files?

If the concern is with the use of a dict: We could move the information into file_row_groups if we store a list of tuples (e.g. (local_row_group_id, global_row_group_id)) instead of integers (e.g. just local_row_group_id). [EDIT: This change is now included (see ee0f112)]

If the concern is about focusing on multi-row-group files: This is the dataset layout we explicitly recommend in dask_cudf (so we see it a lot).
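
The tuple layout proposed above can be sketched like this. It is a simplified illustration rather than the code in ee0f112: `organize_row_groups` and its input format (one file path per row group, in global order) are assumptions.

```python
from collections import defaultdict


def organize_row_groups(row_group_paths):
    """Map each file path to [(local_row_group_id, global_row_group_id), ...].

    `row_group_paths` lists the file path of every row group in global order.
    """
    file_row_groups = defaultdict(list)
    for global_id, path in enumerate(row_group_paths):
        local_id = len(file_row_groups[path])  # position within its own file
        file_row_groups[path].append((local_id, global_id))
    return dict(file_row_groups)


layout = organize_row_groups(["a.parquet", "a.parquet", "b.parquet"])
```

Storing both ids in one tuple keeps the global position next to the local one, so no separate lookup dict is needed.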


          remove use of global_lookup

ee0f112

Member Author

rjzamora commented Jan 27, 2021

@jrbourbeau - I'd say this PR is ready as soon as Martin gives the okay (unless you have other comments/questions) :)

Member

martindurant commented Jan 27, 2021

No more comments

rjzamora added 4 commits

January 28, 2021 07:21


          include easy fix for 7122

82afed4


          use dtype.name to check for category

70b39a4


          use dtype.name to check for category

7d0b33e


          update docstring on fastparquet filtering (doesn't support DNF for partitioned columns)

9b4576f

rjzamora mentioned this pull request

BUG: len / column subset failing on filtered parquet dataset using pyarrow.dataset engine #7122

Open

jrbourbeau approved these changes


Member

jrbourbeau left a comment

Thanks @rjzamora!

jrbourbeau merged commit 39ef483 into dask:master

Member Author

rjzamora commented Jan 29, 2021

Thanks @jrbourbeau, and thanks @martindurant for the great review/advice!

rjzamora deleted the fastparquet-manual-metadata branch

May 21, 2024 00:05
