Refactor parquet reader/writer #4329
Comments
Certainly a good part of the reason for the complexity of the code is simply history. However, there are functions that should be shared between the two drivers, and for any driver there are a number of possible parquet file layouts and user options to consider. Having said that, the code is not structured too differently from what you suggest, except that there are many internal helper functions. One problem I see with your plan for simplification is that the initial metadata parsing stage will create internal objects that are also required for the second stage, so that parsing would then have to be repeated for the data files in this model. This is particularly true when the …
What is the reason for this? Is it costly to read row groups directly from files without being given the metadata other than columns? What are some situations where you need the full metadata or schema object?
Aside from additional reads at the foot of the file, yes, it can be expensive to parse metadata in the case that there are many columns. This has already been a problem in some situations in the existing code, and has caused some of the alternate code paths like …
OK, so the … Maybe I should reverse the question and ask: "is there anything in the following function signatures that is either not necessary, or should be generalized?"

```python
def _read_pf_simple(fs, path, base, index_names, all_columns, is_series,
                    categories, cats, scheme, storage_name_mapping):
    ...

def _read_pyarrow_parquet_piece(fs, piece, columns, index_cols, is_series,
                                partitions, categories):
    ...
```
Yes, from the fastparquet point of view, it is better to pass in the metadata if you have it, which is what happens in …
This is a proof of concept around the proposal in dask#4329 The different engines are responsible for returning a reader function, a list of tokens/kwargs to pass to that reader function, and an optional list of statistics. We pull out all other book-keeping to a master function to avoid the duplication of tricky logic. I've only done this for pyarrow so far. Feedback appreciated.
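To make the proposed contract concrete, here is a minimal, self-contained sketch of the split described above. The names (`toy_engine`, `read_parquet`) and the eager evaluation are illustrative assumptions, not taken from the actual proof-of-concept branch; the point is only that the engine hands back a reader plus per-piece kwargs, and a single master function owns the remaining book-keeping:

```python
# Hypothetical sketch of the engine contract: the engine returns a reader
# function, a list of kwargs for that reader, and optional statistics.

def toy_engine(paths):
    """Stand-in engine: plan one read task per path."""
    def reader(path, columns=None):
        # A real engine would parse a row group here; we fake the result.
        return {"path": path, "columns": columns}

    piece_kwargs = [{"path": p, "columns": ["x"]} for p in paths]
    statistics = None  # engines may optionally return per-piece statistics
    return reader, piece_kwargs, statistics


def read_parquet(paths, engine=toy_engine):
    """Master function: shared book-keeping lives here, not in the engine."""
    reader, piece_kwargs, statistics = engine(paths)
    # Real code would build a lazy task graph; we call the reader eagerly.
    return [reader(**kw) for kw in piece_kwargs]


parts = read_parquet(["a.parquet", "b.parquet"])
```

Because the master function never inspects the kwargs, each engine is free to put whatever tokens it needs into them, which is what avoids duplicating the tricky logic per engine.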
After #4995, this can probably be closed.
Closing based on #4329 (comment). Others should feel free to re-open if necessary.
Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear what features are handled and where they should be handled. I think that we should consider a rewrite. Looking through the code there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.
I think that in principle a new reading backend should focus on the following two functions:
This would return both … and (`filename`, `row-group-number`) pairs to pass on to the next function that satisfied the filters, and then also the following function:
This function would be given a file-like object and a list of partitions within that object and would produce a single pandas-like DataFrame.
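In code form, the two functions might look something like the sketch below. The name `read_parquet_pieces` appears later in this issue; `read_parquet_metadata` and everything inside the bodies are hypothetical stand-ins (plain tuples and a dict instead of real metadata and a DataFrame):

```python
# Illustrative two-function backend, per the proposal above.

def read_parquet_metadata(paths, filters=None):
    """First pass: parse metadata once and return the
    (filename, row-group-number) pairs that satisfy the filters."""
    pieces = []
    for path in paths:
        for rg in range(2):  # pretend each file holds two row groups
            if filters is None or filters(path, rg):
                pieces.append((path, rg))
    return pieces


def read_parquet_pieces(open_file, pieces, **kwargs):
    """Second pass: given a file-like object and the row groups within it,
    produce a single pandas-like object (a dict stands in here).
    Backend-specific options such as categoricals arrive via kwargs."""
    return {"rows": list(pieces), **kwargs}


pieces = read_parquet_metadata(["a.parquet"], filters=lambda p, rg: rg == 0)
df = read_parquet_pieces(None, pieces, categories=None)
```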
Other considerations like index placement, and whether or not to return a series, would not be handled by the backend, but would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend and be handled by `kwargs` in `read_parquet_pieces`. Am I missing anything here? Again, my objective in this issue is to understand if there are reasons for the complexity of the current system other than history (which is a fine reason, I don't mean to throw stones).