Refactor parquet reader/writer #4329

Closed
mrocklin opened this issue Dec 24, 2018 · 7 comments

@mrocklin
Member

Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear which features are handled, or where they should be handled. I think that we should consider a rewrite. Looking through the code, there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.

I think that in principle a new reading backend should focus on the following two functions:

def get_metadata(
    path: str,
    filters: list,
) -> dict:
    ...

This would return both ...

  • a list of (filename, row-group-number) pairs that satisfy the filters, to pass on to the next function
  • a set of columns that could serve as a sorted index

and then also the following function:

def read_parquet_pieces(
    file: FileLike,
    partitions: List[int],
    columns: List[str],
    **kwargs: dict,  # backend-specific keyword arguments
) -> pandas.DataFrame:
    ...

This function would be given a file-like object and a list of partitions within that object and would produce a single pandas-like DataFrame.

Other considerations like index placement, and whether or not to return a series, would not be handled by the backend, but would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend and be handled by kwargs in read_parquet_pieces.
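
For illustration, here is a minimal sketch of how the backend-agnostic layer might compose these two functions. The engine object, the dict keys ("pieces", "index_columns"), and the eager loop are assumptions for the sake of the example, not an existing API; in Dask proper each piece would become a task in the graph rather than an iteration of a loop.

from typing import List

import pandas as pd


def read_parquet(path: str, columns: List[str], filters: list, engine, **kwargs) -> pd.DataFrame:
    # One up-front metadata pass: which (file, row-group) pieces to read,
    # and which columns could serve as a sorted index.
    meta = engine.get_metadata(path, filters)
    pieces = meta["pieces"]                  # e.g. [(filename, row_group_number), ...]
    index_candidates = meta["index_columns"]

    frames = []
    for filename, row_group in pieces:
        with open(filename, "rb") as f:
            frames.append(
                engine.read_parquet_pieces(f, [row_group], columns, **kwargs)
            )

    df = pd.concat(frames)
    # Index placement and series/frame handling live here, not in the backend.
    if index_candidates:
        df = df.set_index(index_candidates[0])
    return df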

Am I missing anything here? Again, my objective in this issue is to understand if there are reasons for the complexity of the current system other than history (which is a fine reason, I don't mean to throw stones).

@martindurant
Member

Certainly a good part of the reason for the complexity of the code is simply history. However, there are functions that should be shared between the two drivers, and, for any given driver, there are a number of possible parquet file layouts and user options to consider. Having said that, the code is not structured too differently from what you suggest, except that there are many internal helper functions.

One problem I see with your plan for simplification is that the initial metadata parsing stage will create internal objects that are also required for the second stage, so that parsing would have to be repeated for the data files in this model. This is particularly true when the _metadata file exists, since then all of the metadata (including column statistics and in-file offsets) can be read in a single read. Of course, this problem disappears if the output dict contains the parquet objects too. I do think, though, that the module can be made more legible just by moving the existing functions around, naming them carefully, and documenting even the hidden functions.
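
As a rough illustration of the "output dict contains the parquet objects too" idea, the metadata stage could return opaque, engine-specific piece objects that already carry the parsed row-group information, so the data stage never re-parses footers. All helper names here (parse_parquet_footer, satisfies_filters, read_row_group) are hypothetical stand-ins, not fastparquet or pyarrow APIs:

from typing import Any, List

import pandas as pd


def get_metadata(path: str, filters: list) -> dict:
    footer = parse_parquet_footer(path)      # hypothetical: one parse of _metadata / footers
    pieces = [
        (rg.filename, rg)                    # keep the parsed row-group object itself
        for rg in footer.row_groups
        if satisfies_filters(rg, filters)    # hypothetical statistics-based filter
    ]
    return {"pieces": pieces, "index_columns": footer.sorted_columns}


def read_parquet_pieces(file, pieces: List[Any], columns: List[str], **kwargs) -> pd.DataFrame:
    # Each piece already carries offsets and statistics, so no second footer parse.
    return pd.concat(read_row_group(file, rg, columns, **kwargs) for _, rg in pieces)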

@mrocklin
Member Author

the initial metadata parsing stage will create internal objects that are also required for the second stage

What is the reason for this? Is it costly to read row groups directly from files without being given any metadata other than the columns? What are some situations where you need the full metadata or schema object?

@martindurant
Member

Aside from the additional reads at the foot of each file, yes, it can be expensive to parse the metadata when there are many columns. This has already been a problem in some situations in the existing code, and has motivated some of the alternate code paths like _read_fp_multifile (which emulates a metadata file, where none is present, by parsing the first file and using its schema).
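
The schema-reuse trick in _read_fp_multifile can be sketched roughly as follows; the parse_parquet_footer helper is hypothetical, and fastparquet's internals differ in detail:

def emulate_metadata_file(paths, open_with=open):
    # Parse the footer of the first file only and assume the remaining files
    # share its schema, so their (possibly very large) footers are never read.
    with open_with(paths[0], "rb") as f:
        schema = parse_parquet_footer(f).schema  # hypothetical helper
    return {"schema": schema, "paths": list(paths)}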

@mrocklin
Member Author

OK, so partitions: List[int] should be replaced by List[T], where T is some arbitrary token type, a list of which is returned from the get_metadata function called once at the beginning?

Maybe I should reverse the question and ask: "is there anything in the following function signatures that is either not necessary or should be generalized?"

def _read_pf_simple(fs, path, base, index_names, all_columns, is_series,
                    categories, cats, scheme, storage_name_mapping):
def _read_pyarrow_parquet_piece(fs, piece, columns, index_cols, is_series,
                                partitions, categories):
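
A generalized form along the lines discussed above might look like the following, where the piece tokens are opaque to the dispatching code; the exact parameter set here is an assumption for illustration, not a concrete proposal:

from typing import Any, List, TypeVar

import pandas as pd

T = TypeVar("T")  # engine-specific piece token, produced once by get_metadata


def read_parquet_pieces(
    file,
    pieces: List[T],     # opaque tokens instead of bare row-group indices
    columns: List[str],
    **kwargs: Any,       # backend-specific options, e.g. categories
) -> pd.DataFrame:
    ...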

@martindurant
Member

Yes, from the fastparquet point of view, it is better to pass in the metadata if you have it, which is what happens in _read_parquet_row_group.

mrocklin added a commit to mrocklin/dask that referenced this issue Dec 29, 2018
This is a proof of concept around the proposal in
dask#4329

The different engines are responsible for returning a reader function, a list
of tokens/kwargs to pass to that reader function, and an optional list of
statistics.  We pull out all other book-keeping to a master function to avoid
the duplication of tricky logic.

I've only done this for pyarrow so far.  Feedback appreciated.
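
A rough sketch of the engine contract described in that commit message follows; engine.plan and apply_statistics are hypothetical names for illustration, not the API that was merged:

from typing import List

import pandas as pd


def read_parquet(path, columns, filters, engine) -> List[pd.DataFrame]:
    # The engine returns a reader function, one token/kwargs dict per output
    # partition, and (optionally) per-piece statistics such as min/max values.
    reader, piece_kwargs, statistics = engine.plan(path, columns, filters)

    if statistics is not None:
        # Shared book-keeping (filtering pieces, choosing a sorted index)
        # happens once here, rather than being duplicated in every engine.
        piece_kwargs = apply_statistics(piece_kwargs, statistics, filters)  # hypothetical

    # In Dask proper each call would become a task in the graph.
    return [reader(**kw) for kw in piece_kwargs]
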
@rjzamora
Member

After #4995, this can probably be closed.

@jrbourbeau
Member

Closing based on #4329 (comment). Others should feel free to re-open if necessary.
