Change to_parquet default to write_metadata_file=None #8988
Conversation
A bit of refactoring before changing the default of `write_metadata_file` to `None` in `to_parquet`:
- Simplify implementation
- Don't include file metadata in `write_partition` calls if it's not needed
- Everything needed to support implementing `write_metadata_file=None` as default *except* changing the value (to ensure tests pass).
Most of the failures are due to divisions not being known by default anymore, since they're only known by default if a `_metadata` file is present.
to_parquet default to write_metadata_file=None
cc @rjzamora
```python
res = dd.read_parquet(fn, index=["a"], engine=read_engine)
sol = ddf.compute()
assert_eq(res, sol)
```
Does this just need a `gather_statistics=True` to avoid the `compute()`? (same question for other usages of `sol` below)
Or a `check_divisions=False`. The main change is that without a `_metadata` file, `read_parquet` will result in a dataframe without divisions by default, and most of the tests are calling `assert_eq` on two dask dataframes (so divisions are compared). Since there are tests that explicitly check divisions, in most cases checking divisions like this was accidental and unimportant. I used a mishmash of:
- Using a pandas dataframe for the intended solution
- Passing in `gather_statistics=True` to get a dataframe with known divisions
- Passing in `check_divisions=False`

to resolve these failures.
That makes sense - in my branch to deprecate `gather_statistics` (which I haven't submitted yet, and which will likely conflict with this PR's test changes in many places), I did end up adding an explicit `calculate_divisions=True` in many places.
It is somewhat true that we are "accidentally" checking divisions, but my slight preference is to avoid computing before the `assert_eq` unless we really need to. I'd say `check_divisions=False` is a bit better, but I may end up rolling back some of these `sol` changes in the `gather_statistics` PR.
rjzamora left a comment
LGTM @jcrist
I do think it is important that we update the Metadata section in the new parquet doc page, but I'm okay with that happening in a separate PR if you'd prefer.
```python
ddf3 = dd.read_parquet(fn, engine=engine)
assert ddf3.npartitions < 5
```
I'm okay with removing this, but if you specify `gather_statistics=True` (soon to be `calculate_divisions=True`), we will filter out the empty partitions. The original test was checking that we don't get these empty partitions.
This changes `to_parquet` to default to `write_metadata_file=None`. If `None`, a `_metadata` file is only written when `append=True` and the dataset has an existing `_metadata` file; otherwise it defaults to `False`.

If a `_metadata` file doesn't exist when appending, the last file in the dataset is used to validate schema and divisions.

There are also some adjacent changes to simplify the generated `to_parquet` graph (in particular, the full dataset metadata isn't actually needed in each `to_parquet` task, but was previously included, bloating graph size).

The majority of this PR is modifying the tests to not fail after this change, since many of them were implicitly relying on the existence of a `_metadata` file.

Fixes #8901.
`pre-commit run --all-files`