Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we deprecate gather_statistics in read_parquet? #8937

Closed
jcrist opened this issue Apr 14, 2022 · 4 comments
Closed

Should we deprecate gather_statistics in read_parquet? #8937

jcrist opened this issue Apr 14, 2022 · 4 comments

Comments

@jcrist
Copy link
Member

jcrist commented Apr 14, 2022

From @rjzamora in #8899 (comment):

but my preference moving forward may be to deprecate gather_statistics as a public option. Right now, the only time the user should ever specify gather_statistics, is when they are doing it to insist on gather_statistics=False. For the case you are highlighting here, the suggested argument is split_row_groups=True (which doesn't actually require any row-group statistics to be gathered, but does require the parquet metadata to be parsed, as you said).

@rjzamora
Copy link
Member

I’m generally +1 on deprecating gather_statistics, because it is not really an “option,” but is a mechanism for changing several default settings that can affect the output you get. At this point, my opinion is that the default output for a given dataset should be the same, no matter where the dataset is stored or if it contains a _metadata file.



My proposal would probably be to change the split_row_groups default to False, and to replace gather_statistics with calculate_divisions (default False). This way, we will only need to parse metadata statistics if (1) the user passes in filters, (2) if the user sets calculate_divisions=True and there is an index, or (3) the user passes in a chunksize argument. We would also need to read metadata if the user sets split_row_groups=True or aggregate_files=True, but we would not need to parse any statistics.

(Possibly) Related Thoughts…

I’d also be interested to know if anyone is actually using the chunksize argument. I imagine that this option may be useful for smaller datasets, but the graph-construction performance is terribly slow for large data and/or remote storage. I also have the same question for aggregate_files=True. I get the feeling that it would be just as useful to users if we were to replace these options (chunksize= and aggregate_files=True) with the option for the user to pass in a list of lists to read_parquet (meaning each sub list should be mapped to the same output partition). This would no longer support a combination of row-group splitting and file aggregation, but the resulting code would be much simpler and more efficient because file aggregation would no-longer require us to read any metadata at all.



Note that I am already exploring these kinds of defaults in #8944, and I am still very unsure of what the best balance is.

@jcrist
Copy link
Member Author

jcrist commented Apr 20, 2022

My proposal would probably be to change the split_row_groups default to False, and to replace gather_statistics with calculate_divisions (default False). This way, we will only need to parse metadata statistics if (1) the user passes in filters, (2) if the user sets calculate_divisions=True and there is an index, or (3) the user passes in a chunksize argument. We would also need to read metadata if the user sets split_row_groups=True or aggregate_files=True, but we would not need to parse any statistics.

👍 👍 👍 This proposal makes sense to me. Changing this without a deprecation period would mainly lead to bad behavior for users reading from datasets containing large files written by some other tool (dask never writes large files). Based on our small survey I suspect this isn't the norm, so we might be able to get by without a deprecation period? Otherwise, do you have thoughts on how we might smoothly make this change?

@rjzamora
Copy link
Member

so we might be able to get by without a deprecation period? Otherwise, do you have thoughts on how we might smoothly make this change?

I was thinking along the lines of the plan mentioned in 8901: Temporarily raise a warning when we there is a _metadata file present (and split_row_groups is not explicitly set). I was thinking we would do the split_row_groups and gather_statistics changes seperately, but I suppose these change could be done at once.

Either way, I intend to submit a PR to cudf to set write_metadata_file and split_row_groups explicitly in all the tests (and to remove explicit gather_statistics usage wherever possible).

@jcrist
Copy link
Member Author

jcrist commented May 5, 2022

Fixed by #8992 & #8981. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants