Should we deprecate `gather_statistics` in `read_parquet`? #8937

jcrist · 2022-04-14T21:05:00Z

but my preference moving forward may be to deprecate gather_statistics as a public option. Right now, the only time the user should ever specify gather_statistics, is when they are doing it to insist on gather_statistics=False. For the case you are highlighting here, the suggested argument is split_row_groups=True (which doesn't actually require any row-group statistics to be gathered, but does require the parquet metadata to be parsed, as you said).

The text was updated successfully, but these errors were encountered:

rjzamora · 2022-04-18T17:56:58Z

I’m generally +1 on deprecating gather_statistics, because it is not really an “option,” but is a mechanism for changing several default settings that can affect the output you get. At this point, my opinion is that the default output for a given dataset should be the same, no matter where the dataset is stored or if it contains a _metadata file.  

My proposal would probably be to change the split_row_groups default to False, and to replace gather_statistics with calculate_divisions (default False). This way, we will only need to parse metadata statistics if (1) the user passes in filters, (2) if the user sets calculate_divisions=True and there is an index, or (3) the user passes in a chunksize argument. We would also need to read metadata if the user sets split_row_groups=True or aggregate_files=True, but we would not need to parse any statistics.

(Possibly) Related Thoughts…

I’d also be interested to know if anyone is actually using the chunksize argument. I imagine that this option may be useful for smaller datasets, but the graph-construction performance is terribly slow for large data and/or remote storage. I also have the same question for aggregate_files=True. I get the feeling that it would be just as useful to users if we were to replace these options (chunksize= and aggregate_files=True) with the option for the user to pass in a list of lists to read_parquet (meaning each sub list should be mapped to the same output partition). This would no longer support a combination of row-group splitting and file aggregation, but the resulting code would be much simpler and more efficient because file aggregation would no-longer require us to read any metadata at all.  

Note that I am already exploring these kinds of defaults in #8944, and I am still very unsure of what the best balance is.

jcrist · 2022-04-20T21:09:00Z

My proposal would probably be to change the split_row_groups default to False, and to replace gather_statistics with calculate_divisions (default False). This way, we will only need to parse metadata statistics if (1) the user passes in filters, (2) if the user sets calculate_divisions=True and there is an index, or (3) the user passes in a chunksize argument. We would also need to read metadata if the user sets split_row_groups=True or aggregate_files=True, but we would not need to parse any statistics.

👍 👍 👍 This proposal makes sense to me. Changing this without a deprecation period would mainly lead to bad behavior for users reading from datasets containing large files written by some other tool (dask never writes large files). Based on our small survey I suspect this isn't the norm, so we might be able to get by without a deprecation period? Otherwise, do you have thoughts on how we might smoothly make this change?

rjzamora · 2022-04-21T17:49:39Z

so we might be able to get by without a deprecation period? Otherwise, do you have thoughts on how we might smoothly make this change?

I was thinking along the lines of the plan mentioned in 8901: Temporarily raise a warning when we there is a _metadata file present (and split_row_groups is not explicitly set). I was thinking we would do the split_row_groups and gather_statistics changes seperately, but I suppose these change could be done at once.

Either way, I intend to submit a PR to cudf to set write_metadata_file and split_row_groups explicitly in all the tests (and to remove explicit gather_statistics usage wherever possible).

jcrist · 2022-05-05T18:43:21Z

Fixed by #8992 & #8981. Closing.

jcrist added parquet dataframe io labels Apr 14, 2022

jcrist mentioned this issue Apr 14, 2022

Add a docs page for dask & parquet #8899

Merged

jcrist mentioned this issue Apr 20, 2022

Change to_parquet default to write_metadata_file=False #8901

Closed

rjzamora mentioned this issue Apr 26, 2022

Change split_row_groups default to False in read_parquet #8981

Merged

3 tasks

jcrist closed this as completed May 5, 2022

This was referenced May 5, 2022

Should we deprecate chunksize from read_parquet? #9043

Closed

Should we deprecate aggregate_files from read_parquet? #9051

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should we deprecate `gather_statistics` in `read_parquet`? #8937

Should we deprecate `gather_statistics` in `read_parquet`? #8937

jcrist commented Apr 14, 2022

rjzamora commented Apr 18, 2022

jcrist commented Apr 20, 2022

rjzamora commented Apr 21, 2022

jcrist commented May 5, 2022

Should we deprecate gather_statistics in read_parquet? #8937

Should we deprecate gather_statistics in read_parquet? #8937

Comments

jcrist commented Apr 14, 2022

rjzamora commented Apr 18, 2022

jcrist commented Apr 20, 2022

rjzamora commented Apr 21, 2022

jcrist commented May 5, 2022

Should we deprecate `gather_statistics` in `read_parquet`? #8937

Should we deprecate `gather_statistics` in `read_parquet`? #8937