
Change to_parquet default to write_metadata_file=None#8988

Merged
jcrist merged 3 commits into dask:main from jcrist:to_parquet-default-write-metadata-file
Apr 28, 2022

Conversation

@jcrist (Member) commented Apr 27, 2022

This changes `to_parquet` to default to `write_metadata_file=None`. If `None`, a `_metadata` file is only written when `append=True` and the dataset has an existing `_metadata` file; otherwise it defaults to `False`.

If a `_metadata` file doesn't exist when appending, the last file in the dataset is used to validate the schema and divisions.
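As a rough sketch of the decision described above (illustrative only, not dask's actual implementation; `resolve_write_metadata_file` is a hypothetical helper name):

```python
import os


def resolve_write_metadata_file(write_metadata_file, append, path):
    """Hypothetical sketch of resolving a write_metadata_file=None default,
    per the behavior described in this PR."""
    if write_metadata_file is not None:
        # An explicit True/False from the user always wins
        return write_metadata_file
    # None: only keep writing _metadata when appending to a dataset that
    # already has a _metadata file; otherwise behave like False
    return append and os.path.exists(os.path.join(path, "_metadata"))
```

With logic like this, appending keeps maintaining a `_metadata` file for datasets that already have one, while fresh writes skip it.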

There are also some adjacent changes to simplify the generated `to_parquet` graph (in particular, the full dataset metadata isn't actually needed in each `to_parquet` task, but was previously included, bloating graph size).

The majority of this PR is modifying the tests to not fail after this change, since many of them were implicitly relying on the existence of a `_metadata` file.

Fixes #8901.

  • Closes #xxxx
  • Tests added / passed
  • Passes `pre-commit run --all-files`

jcrist added 2 commits April 26, 2022 15:05
A bit of refactoring before changing the default of
`write_metadata_file` to `None` in `to_parquet`.

- Simplify implementation
- Don't include file metadata in `write_partition` calls if it's not needed
- Everything needed to support implementing `write_metadata_file=None` as default *except* changing the value (to ensure tests pass).

Most of the failures are due to divisions not being known by default anymore, since they're only known by default if a `_metadata` file is present.
@jcrist jcrist changed the title To parquet default write metadata file Change to_parquet default to write_metadata_file=None Apr 27, 2022
@jcrist (Member, Author) commented Apr 27, 2022

cc @rjzamora

@rjzamora (Member) left a comment

Thanks for working on this @jcrist!

I didn't look very carefully at the tests yet, but I only have some minor questions so far.

Comment on lines +233 to +235
    res = dd.read_parquet(fn, index=["a"], engine=read_engine)
    sol = ddf.compute()
    assert_eq(res, sol)
Member

Does this just need a `gather_statistics=True` to avoid the `compute()`? (same question for other usage of `sol` below)

Member Author

Or a `check_divisions=False`. The main change is that without a `_metadata` file, `read_parquet` will return a dataframe without divisions by default, and most of the tests call `assert_eq` on two dask dataframes (so divisions are compared). Since there are separate tests that explicitly check divisions, in most cases checking divisions like this was accidental and unimportant. I used a mishmash of:

  • Using a pandas dataframe for the intended solution
  • Passing in `gather_statistics=True` to get a dataframe with known divisions
  • Passing in `check_divisions=False`

to resolve these failures.

@rjzamora (Member) commented Apr 28, 2022

That makes sense. In my branch to deprecate `gather_statistics` (which I haven't submitted yet, and which will likely conflict with this PR's test changes in many places), I did end up adding an explicit `calculate_divisions=True` in many places.

It is somewhat true that we are "accidentally" checking divisions, but my slight preference is to avoid computing before the `assert_eq` unless we really need to. I'd say `check_divisions=False` is a bit better, but I may end up rolling back some of these `sol` changes in the `gather_statistics` PR.

@rjzamora (Member) left a comment

LGTM @jcrist

I do think it is important that we update the Metadata section in the new parquet doc page, but I'm okay with that happening in a separate PR if you'd prefer.

Comment on lines 1075 to -1038
    ddf3 = dd.read_parquet(fn, engine=engine)
    assert ddf3.npartitions < 5
Member

I'm okay with removing this, but if you specify `gather_statistics=True` (soon to be `calculate_divisions=True`), we will filter out the empty partitions. The original test was checking that we don't get these empty partitions.

@jcrist jcrist merged commit 0057207 into dask:main Apr 28, 2022
@jcrist jcrist deleted the to_parquet-default-write-metadata-file branch April 28, 2022 14:03


Development

Successfully merging this pull request may close these issues.

Change to_parquet default to write_metadata_file=False

3 participants