
Change to_parquet default to write_metadata_file=None#8988

Merged
jcrist merged 3 commits into dask:main from jcrist:to_parquet-default-write-metadata-file
Apr 28, 2022

Conversation

@jcrist (Member) commented Apr 27, 2022

This changes `to_parquet` to default to `write_metadata_file=None`. If `None`, a `_metadata` file is only written when `append=True` and the dataset has an existing `_metadata` file; otherwise it defaults to `False`.

If a `_metadata` file doesn't exist when appending, the last file in the dataset is used to validate the schema and divisions.
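As a rough sketch of the decision described above (illustrative only, not dask's actual implementation; `resolve_write_metadata_file` is a hypothetical helper name):

```python
import os


def resolve_write_metadata_file(write_metadata_file, append, path):
    """Hypothetical sketch of resolving a write_metadata_file=None default,
    per the behavior described in this PR."""
    if write_metadata_file is not None:
        # An explicit True/False from the user always wins
        return write_metadata_file
    # None: only keep writing _metadata when appending to a dataset that
    # already has a _metadata file; otherwise behave like False
    return append and os.path.exists(os.path.join(path, "_metadata"))
```

With logic like this, appending keeps maintaining a `_metadata` file for datasets that already have one, while fresh writes skip it.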

There are also some adjacent changes to simplify the generated `to_parquet` graph (in particular, the full dataset metadata isn't actually needed in each `to_parquet` task, but was previously included, bloating graph size).

The majority of this PR is modifying the tests to not fail after this change, since many of them were implicitly relying on the existence of a `_metadata` file.

Fixes #8901.

  • Closes #xxxx
  • Tests added / passed
  • Passes `pre-commit run --all-files`

jcrist added 2 commits April 26, 2022 15:05
A bit of refactoring before changing the default of
`write_metadata_file` to `None` in `to_parquet`.

- Simplify implementation
- Don't include file metadata in `write_partition` calls if it's not needed
- Everything needed to support implementing `write_metadata_file=None` as default *except* changing the value (to ensure tests pass).

Most of the failures are due to divisions not being known by default anymore, since they're only known by default if a `_metadata` file is present.
@jcrist jcrist changed the title To parquet default write metadata file Change to_parquet default to write_metadata_file=None Apr 27, 2022
@jcrist (Member, Author) commented Apr 27, 2022

cc @rjzamora

@rjzamora (Member) left a comment

Thanks for working on this @jcrist!

I didn't look very carefully at the tests yet, but I only have some minor questions so far.

Comment on lines +233 to +235
    res = dd.read_parquet(fn, index=["a"], engine=read_engine)
    sol = ddf.compute()
    assert_eq(res, sol)
Member

Does this just need a `gather_statistics=True` to avoid the `compute()`? (same question for other usage of `sol` below)

Member Author

Or a `check_divisions=False`. The main change is that without a `_metadata` file, `read_parquet` will return a dataframe without divisions by default, and most of the tests call `assert_eq` on two dask dataframes (so divisions are compared). Since there are separate tests that explicitly check divisions, in most cases checking divisions like this was accidental and unimportant. I used a mishmash of:

  • Using a pandas dataframe for the intended solution
  • Passing in `gather_statistics=True` to get a dataframe with known divisions
  • Passing in `check_divisions=False`

to resolve these failures.

@rjzamora (Member) commented Apr 28, 2022

That makes sense. In my branch to deprecate `gather_statistics` (which I haven't submitted yet, and which will likely conflict with this PR's test changes in many places), I did end up adding an explicit `calculate_divisions=True` in many places.

It is somewhat true that we are "accidentally" checking divisions, but my slight preference is to avoid computing before the `assert_eq` unless we really need to. I'd say `check_divisions=False` is a bit better, but I may end up rolling back some of these `sol` changes in the `gather_statistics` PR.

@rjzamora (Member) left a comment

LGTM @jcrist

I do think it is important that we update the Metadata section in the new parquet doc page, but I'm okay with that happening in a separate PR if you'd prefer.

Comment on lines 1075 to -1038
    ddf3 = dd.read_parquet(fn, engine=engine)
    assert ddf3.npartitions < 5
Member

I'm okay with removing this, but if you specify `gather_statistics=True` (soon to be `calculate_divisions=True`), we will filter out the empty partitions. The original test was checking that we don't get these empty partitions.

@jcrist jcrist merged commit 0057207 into dask:main Apr 28, 2022
@jcrist jcrist deleted the to_parquet-default-write-metadata-file branch April 28, 2022 14:03


Development

Successfully merging this pull request may close these issues.

Change to_parquet default to write_metadata_file=False

3 participants