Make path for reading many parquet files without prescan #3978

Merged
martindurant merged 4 commits into dask:master from martindurant:fastparquet_multifile
Oct 1, 2018
Conversation

@martindurant
Member

For fastparquet when there is no _metadata.

Fixes #3974

Benchmarks for a small dataset of 100 files, 6 columns × 10 rows each, all int:

Master

In [8]: %timeit d2 = dd.read_parquet('out.parq/')
13.3 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit d2 = dd.read_parquet('out.parq/*.parquet')
43.5 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

this branch

In [4]: %timeit d2 = dd.read_parquet('out.parq/')
13.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit d2 = dd.read_parquet('out.parq/*.parquet')
4.56 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that reading the metadata now takes longer, because it has to parse all of the row-groups up front (even though they are loaded from one place), whereas this is deferred to the lazy/parallel task when supplying a glob/list of paths.
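The eager-versus-deferred trade-off described above can be sketched in plain Python (hypothetical names; this is not the actual dask task graph, just an illustration of where footer parsing happens):

```python
# Hypothetical sketch of the trade-off between prescanning parquet metadata
# and deferring it to per-partition tasks; none of these names come from
# dask or fastparquet.

def make_tasks_with_prescan(paths, parse_footer, read_file):
    # Parse every file's footer up front; each task carries its parsed
    # row-group metadata, so graph construction pays the full cost.
    footers = [parse_footer(p) for p in paths]
    return [lambda p=p, f=f: read_file(p, f) for p, f in zip(paths, footers)]

def make_tasks_deferred(paths, parse_footer, read_file):
    # No prescan: each task parses its own footer only when it runs, so
    # the graph is cheap to build and parsing happens lazily/in parallel.
    return [lambda p=p: read_file(p, parse_footer(p)) for p in paths]
```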

This code is rather long and convoluted.

@martindurant
Member Author

The only failure is dask\tests\test_multiprocessing.py::test_pickle_globals; a load of extra global keys turns up.

@martindurant
Member Author

The failure is #3983.

@martindurant
Member Author

Any thoughts here? I think this is worth doing, and in addition it gives users the choice to supply a glob even when there is a _metadata file, for a possibly faster response and a smaller serialised ParquetFile.

@TomAugspurger (Member) left a comment:

Things look OK at a glance, though I didn't review thoroughly. docstrings would help me follow things a bit better I think.

return out_type(dsk, name, meta, divisions)


def _pf_validation(pf, columns, index, categories, filters):
Member:

A docstring about what-all this does would be helpful for the future.

def _read_pf_simple(fs, path, base, index_names, all_columns, is_series,
categories, cats, scheme, storage_name_mapping):
from fastparquet import ParquetFile
print(path, base)
Member:

Debugging print.

@martindurant
Member Author

Thanks, fixed.

@martindurant
Member Author

Anything more here, @TomAugspurger ?

@TomAugspurger (Member) left a comment:

Gave another quick look. Mostly beyond me, but things look fine overall.

categories=None, index=None):
"""Read dataset with fastparquet by assuming metadata from first file"""
from fastparquet import ParquetFile
from fastparquet.util import analyse_paths, get_file_scheme
Member:

These are always available on the oldest fastparquet we support?

Member Author:

Both have been around for over a year.


# Infer divisions for engines/versions that support it

ddf2 = dd.read_parquet(os.path.join(fn, '*'), engine=read_engine,
Member:

For the old one, were we using the metadata file? Should your tests using just the individual files be in addition to the old test, rather than in place of them?

Member Author:

The only change here was that the set of data files shouldn't include the _common_metadata file by mistake.

Member:

So with a directory

_common_metadata
part.0.parquet
part.1.parquet

_common_metadata is ignored when it's captured by the glob? Is that documented (understood that it's not immediately relevant to this PR)?

Member Author:

It works fine: you get an empty partition for the non-data file. Including it would just complicate the test a little, and I feel this is cleaner. The case where there is no _common_metadata either may be more common.
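Filtering a glob result for parquet sidecar files can be sketched with a small helper like this (hypothetical name and logic; not dask's actual handling, which as noted above simply yields an empty partition):

```python
def data_files(paths):
    # Drop parquet sidecar files that a bare glob such as "out.parq/*"
    # may capture alongside the data files (illustrative helper only).
    sidecars = {'_metadata', '_common_metadata'}
    return [p for p in paths if p.rsplit('/', 1)[-1] not in sidecars]
```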

return out_type(dsk, name, meta, divisions)
def _paths_to_cats(paths, scheme):
"""Extract out fields and labels from directory names"""
# can be factored out in fastparquet
Member:

What does this comment mean? Is there similar code in fastparquet?

Member Author:

Yes, but it's tied into a class method, and I don't have the appetite to change it and put in a set of deprecations.
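For readers unfamiliar with what a helper like `_paths_to_cats` does, here is a rough stdlib-only sketch of extracting partition fields from hive-style directory names (hypothetical code, not the dask or fastparquet implementation; the real helper also deals with other file schemes and type conversion):

```python
def paths_to_cats(paths):
    # Collect field -> sorted values from hive-style directory components
    # such as "year=2018" (illustrative sketch only).
    cats = {}
    for path in paths:
        for part in path.split('/')[:-1]:  # directories only, not the filename
            if '=' in part:
                key, value = part.split('=', 1)
                cats.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in cats.items()}
```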



@martindurant martindurant merged commit 904805b into dask:master Oct 1, 2018
@martindurant martindurant deleted the fastparquet_multifile branch October 1, 2018 17:15
