Refactor parquet reader/writer #4329

Closed
mrocklin opened this issue Dec 24, 2018 · 7 comments

@mrocklin
Member

Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear which features are handled, or where they should be handled. I think that we should consider a rewrite. Looking through the code, there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.

I think that in principle a new reading backend should focus on the following two functions:

def get_metadata(
    path: str,
    filters: list,
) -> dict:
    ...

This would return both ...

  • a list of (filename, row-group-number) pairs that satisfy the filters, to pass on to the next function
  • a set of columns that could serve as a sorted index

and then also the following function:

def read_parquet_pieces(
    file: FileLike,
    partitions: List[int],
    columns: List[str],
    **kwargs: dict,  # backend-specific keyword arguments
) -> pandas.DataFrame:
    ...

This function would be given a file-like object and a list of partitions within that object and would produce a single pandas-like DataFrame.

Other considerations like index placement, and whether or not to return a series, would not be handled by the backend, but would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend and be handled by kwargs in read_parquet_pieces.
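
For illustration, here is a minimal sketch of how the backend-agnostic layer might compose these two functions. The engine object, the dict keys ("pieces", "index_columns"), and the eager loop are assumptions for the sake of the example, not an existing API; in Dask proper each piece would become a task in the graph rather than an iteration of a loop.

from typing import List

import pandas as pd


def read_parquet(path: str, columns: List[str], filters: list, engine, **kwargs) -> pd.DataFrame:
    # One up-front metadata pass: which (file, row-group) pieces to read,
    # and which columns could serve as a sorted index.
    meta = engine.get_metadata(path, filters)
    pieces = meta["pieces"]                  # e.g. [(filename, row_group_number), ...]
    index_candidates = meta["index_columns"]

    frames = []
    for filename, row_group in pieces:
        with open(filename, "rb") as f:
            frames.append(
                engine.read_parquet_pieces(f, [row_group], columns, **kwargs)
            )

    df = pd.concat(frames)
    # Index placement and series/frame handling live here, not in the backend.
    if index_candidates:
        df = df.set_index(index_candidates[0])
    return df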

Am I missing anything here? Again, my objective in this issue is to understand if there are reasons for the complexity of the current system other than history (which is a fine reason, I don't mean to throw stones).

@martindurant
Member

Certainly a good part of the reason for the complexity of the code is simply history. However, there are functions that should be shared between the two drivers, and, for any given driver, there are a number of possible parquet file layouts and user options to consider. Having said that, the code is not structured too differently from what you suggest, except that there are many internal helper functions.

One problem I see with your plan for simplification is that the initial metadata parsing stage will create internal objects that are also required for the second stage, so that parsing would have to be repeated for the data files in this model. This is particularly true when the _metadata file exists, since then all of the metadata (including column statistics and in-file offsets) can be read in a single read. Of course, this problem disappears if the output dict contains the parquet objects too. I do think, though, that the module can be made more legible just by moving the existing functions around, naming them carefully, and documenting even the hidden functions.
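
As a rough illustration of the "output dict contains the parquet objects too" idea, the metadata stage could return opaque, engine-specific piece objects that already carry the parsed row-group information, so the data stage never re-parses footers. All helper names here (parse_parquet_footer, satisfies_filters, read_row_group) are hypothetical stand-ins, not fastparquet or pyarrow APIs:

from typing import Any, List

import pandas as pd


def get_metadata(path: str, filters: list) -> dict:
    footer = parse_parquet_footer(path)      # hypothetical: one parse of _metadata / footers
    pieces = [
        (rg.filename, rg)                    # keep the parsed row-group object itself
        for rg in footer.row_groups
        if satisfies_filters(rg, filters)    # hypothetical statistics-based filter
    ]
    return {"pieces": pieces, "index_columns": footer.sorted_columns}


def read_parquet_pieces(file, pieces: List[Any], columns: List[str], **kwargs) -> pd.DataFrame:
    # Each piece already carries offsets and statistics, so no second footer parse.
    return pd.concat(read_row_group(file, rg, columns, **kwargs) for _, rg in pieces)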

@mrocklin
Member Author

the initial metadata parsing stage will create internal objects that are also required for the second stage

What is the reason for this? Is it costly to read row groups directly from files without being given any metadata other than the columns? What are some situations where you need the full metadata or schema object?

@martindurant
Member

Aside from the additional reads at the foot of each file, yes, it can be expensive to parse the metadata when there are many columns. This has already been a problem in some situations in the existing code, and has motivated some of the alternate code paths like _read_fp_multifile (which emulates a metadata file, where none is present, by parsing the first file and using its schema).
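
The schema-reuse trick in _read_fp_multifile can be sketched roughly as follows; the parse_parquet_footer helper is hypothetical, and fastparquet's internals differ in detail:

def emulate_metadata_file(paths, open_with=open):
    # Parse the footer of the first file only and assume the remaining files
    # share its schema, so their (possibly very large) footers are never read.
    with open_with(paths[0], "rb") as f:
        schema = parse_parquet_footer(f).schema  # hypothetical helper
    return {"schema": schema, "paths": list(paths)}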

@mrocklin
Member Author

OK, so partitions: List[int] should be replaced by List[T], where T is some arbitrary token type, a list of which is returned from the get_metadata function called once at the beginning?

Maybe I should reverse the question and ask: "is there anything in the following function signatures that is either not necessary or should be generalized?"

def _read_pf_simple(fs, path, base, index_names, all_columns, is_series,
                    categories, cats, scheme, storage_name_mapping):
def _read_pyarrow_parquet_piece(fs, piece, columns, index_cols, is_series,
                                partitions, categories):
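
A generalized form along the lines discussed above might look like the following, where the piece tokens are opaque to the dispatching code; the exact parameter set here is an assumption for illustration, not a concrete proposal:

from typing import Any, List, TypeVar

import pandas as pd

T = TypeVar("T")  # engine-specific piece token, produced once by get_metadata


def read_parquet_pieces(
    file,
    pieces: List[T],     # opaque tokens instead of bare row-group indices
    columns: List[str],
    **kwargs: Any,       # backend-specific options, e.g. categories
) -> pd.DataFrame:
    ...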

@martindurant
Member

Yes, from the fastparquet point of view, it is better to pass in the metadata if you have it, which is what happens in _read_parquet_row_group.

mrocklin added a commit to mrocklin/dask that referenced this issue Dec 29, 2018
This is a proof of concept around the proposal in
dask#4329

The different engines are responsible for returning a reader function, a list
of tokens/kwargs to pass to that reader function, and an optional list of
statistics.  We pull out all other book-keeping to a master function to avoid
the duplication of tricky logic.

I've only done this for pyarrow so far.  Feedback appreciated.
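
A rough sketch of the engine contract described in that commit message follows; engine.plan and apply_statistics are hypothetical names for illustration, not the API that was merged:

from typing import List

import pandas as pd


def read_parquet(path, columns, filters, engine) -> List[pd.DataFrame]:
    # The engine returns a reader function, one token/kwargs dict per output
    # partition, and (optionally) per-piece statistics such as min/max values.
    reader, piece_kwargs, statistics = engine.plan(path, columns, filters)

    if statistics is not None:
        # Shared book-keeping (filtering pieces, choosing a sorted index)
        # happens once here, rather than being duplicated in every engine.
        piece_kwargs = apply_statistics(piece_kwargs, statistics, filters)  # hypothetical

    # In Dask proper each call would become a task in the graph.
    return [reader(**kw) for kw in piece_kwargs]
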
@rjzamora
Member

After #4995, this can probably be closed.

@jrbourbeau
Member

Closing based on #4329 (comment). Others should feel free to re-open if necessary.
