Make path for reading many parquet files without prescan #3978

Merged
martindurant merged 4 commits into dask:master from martindurant:fastparquet_multifile
Oct 1, 2018
Conversation

@martindurant
Member

For fastparquet when there is no _metadata.

Fixes #3974

Benchmarks for a small dataset of 100 files, 6 columns × 10 rows each, all int:

Master

In [8]: %timeit d2 = dd.read_parquet('out.parq/')
13.3 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit d2 = dd.read_parquet('out.parq/*.parquet')
43.5 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

this branch

In [4]: %timeit d2 = dd.read_parquet('out.parq/')
13.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit d2 = dd.read_parquet('out.parq/*.parquet')
4.56 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that reading the metadata now takes longer, because it has to parse all of the row-groups up front (even though they are loaded from one place), whereas this is deferred to the lazy/parallel task when supplying a glob/list of paths.
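The eager-versus-deferred trade-off described above can be sketched in plain Python (hypothetical names; this is not the actual dask task graph, just an illustration of where footer parsing happens):

```python
# Hypothetical sketch of the trade-off between prescanning parquet metadata
# and deferring it to per-partition tasks; none of these names come from
# dask or fastparquet.

def make_tasks_with_prescan(paths, parse_footer, read_file):
    # Parse every file's footer up front; each task carries its parsed
    # row-group metadata, so graph construction pays the full cost.
    footers = [parse_footer(p) for p in paths]
    return [lambda p=p, f=f: read_file(p, f) for p, f in zip(paths, footers)]

def make_tasks_deferred(paths, parse_footer, read_file):
    # No prescan: each task parses its own footer only when it runs, so
    # the graph is cheap to build and parsing happens lazily/in parallel.
    return [lambda p=p: read_file(p, parse_footer(p)) for p in paths]
```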

This code is rather long and convoluted.

@martindurant
Member Author

The only failure is dask\tests\test_multiprocessing.py::test_pickle_globals; a load of extra global keys turns up.

@martindurant
Member Author

The failure is #3983.

@martindurant
Member Author

Any thoughts here? I think this is worth doing, and in addition it gives users the choice to supply a glob even when there is a _metadata file, for a possibly faster response and a smaller serialised ParquetFile.

@TomAugspurger (Member) left a comment:

Things look OK at a glance, though I didn't review thoroughly. docstrings would help me follow things a bit better I think.

return out_type(dsk, name, meta, divisions)


def _pf_validation(pf, columns, index, categories, filters):
Member:

A docstring about what-all this does would be helpful for the future.

def _read_pf_simple(fs, path, base, index_names, all_columns, is_series,
categories, cats, scheme, storage_name_mapping):
from fastparquet import ParquetFile
print(path, base)
Member:

Debugging print.

@martindurant
Member Author

Thanks, fixed.

@martindurant
Member Author

Anything more here, @TomAugspurger ?

@TomAugspurger (Member) left a comment:

Gave another quick look. Mostly beyond me, but things look fine overall.

categories=None, index=None):
"""Read dataset with fastparquet by assuming metadata from first file"""
from fastparquet import ParquetFile
from fastparquet.util import analyse_paths, get_file_scheme
Member:

These are always available on the oldest fastparquet we support?

Member Author:

Both have been around for over a year.


# Infer divisions for engines/versions that support it

ddf2 = dd.read_parquet(os.path.join(fn, '*'), engine=read_engine,
Member:

For the old one, were we using the metadata file? Should your tests using just the individual files be in addition to the old test, rather than in place of them?

Member Author:

The only change here was that the set of data files shouldn't include the _common_metadata file by mistake.

Member:

So with a directory

_common_metadata
part.0.parquet
part.1.parquet

_common_metadata is ignored when it's captured by the glob? Is that documented (understood that it's not immediately relevant to this PR)?

Member Author:

It works fine: you get an empty partition for the non-data file. Including it would just complicate the test a little, and I feel this is cleaner. The case where there is no _common_metadata either may be more common.
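Filtering a glob result for parquet sidecar files can be sketched with a small helper like this (hypothetical name and logic; not dask's actual handling, which as noted above simply yields an empty partition):

```python
def data_files(paths):
    # Drop parquet sidecar files that a bare glob such as "out.parq/*"
    # may capture alongside the data files (illustrative helper only).
    sidecars = {'_metadata', '_common_metadata'}
    return [p for p in paths if p.rsplit('/', 1)[-1] not in sidecars]
```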

return out_type(dsk, name, meta, divisions)
def _paths_to_cats(paths, scheme):
"""Extract out fields and labels from directory names"""
# can be factored out in fastparquet
Member:

What does this comment mean? Is there similar code in fastparquet?

Member Author:

Yes, but it's tied into a class method, and I don't have the appetite to change it and put in a set of deprecations.
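For readers unfamiliar with what a helper like `_paths_to_cats` does, here is a rough stdlib-only sketch of extracting partition fields from hive-style directory names (hypothetical code, not the dask or fastparquet implementation; the real helper also deals with other file schemes and type conversion):

```python
def paths_to_cats(paths):
    # Collect field -> sorted values from hive-style directory components
    # such as "year=2018" (illustrative sketch only).
    cats = {}
    for path in paths:
        for part in path.split('/')[:-1]:  # directories only, not the filename
            if '=' in part:
                key, value = part.split('=', 1)
                cats.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in cats.items()}
```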



@martindurant martindurant merged commit 904805b into dask:master Oct 1, 2018
@martindurant martindurant deleted the fastparquet_multifile branch October 1, 2018 17:15
