Parquet engine-core refactoring #4995

Merged: 62 commits into dask:master on Jul 19, 2019

Conversation


@rjzamora rjzamora commented Jun 24, 2019

This PR is intended to supersede PR #4336, which was originally driven mostly by @mrocklin and @martindurant (I'm opening a new PR just to refresh/reorganize the discussion a bit). There are still 18 tests failing in test_parquet.py (all for fastparquet), but there should be enough progress for healthy discussion/feedback.

The primary goal here is to refactor the parquet interface into distinct core and engine code that will be easier to maintain in the future. The idea is to pull all engine-agnostic code into dask/dask/dataframe/io/parquet/core.py, and then isolate engine specific code in dask/dask/dataframe/io/parquet/arrow.py and dask/dask/dataframe/io/parquet/fastparquet.py (for pyarrow and fastparquet, respectively).

Engine calls during write phase (core.to_parquet):

  1. engine.initialize_write: Preparation for writing/appending
  2. engine.write_partition: Write operation for each partition
  3. engine.write_metadata: Metadata-write operation

Engine calls during read phase (core.read_parquet):

  1. engine.read_metadata: Read metadata and file statistics
  2. engine.read_partition: Read operation for each partition
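
To make the division of labor concrete, here is a minimal sketch of what an engine class could look like. The five method names come from the lists above; the signatures, return values, and class structure are illustrative assumptions on my part, not the exact code in this PR:

```python
class Engine:
    """Illustrative engine interface; pyarrow and fastparquet would
    each provide their own implementation of these five methods."""

    @staticmethod
    def read_metadata(fs, paths, categories=None, index=None,
                      gather_statistics=None, **kwargs):
        # Return (meta, statistics, parts): an empty pandas DataFrame
        # describing the schema, optional per-piece statistics, and the
        # list of pieces for read_partition to consume.
        raise NotImplementedError

    @staticmethod
    def read_partition(fs, piece, columns, index, **kwargs):
        # Read a single piece into a pandas DataFrame.
        raise NotImplementedError

    @staticmethod
    def initialize_write(df, fs, path, append=False, **kwargs):
        # Prepare for writing/appending (create directories, check that
        # an appended schema matches the existing dataset, etc.).
        raise NotImplementedError

    @staticmethod
    def write_partition(df, path, fs, filename, **kwargs):
        # Write a single partition out to one file.
        raise NotImplementedError

    @staticmethod
    def write_metadata(parts, meta, fs, path, **kwargs):
        # Write shared metadata (e.g. a _metadata file) once all
        # partitions have been written.
        raise NotImplementedError
```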

Another goal of PR #4336 was to limit the responsibilities of engine-specific code. I tried to do this without losing any significant/useful features. For example, I decided to retain index detection in files with pandas metadata. For simplicity, the index is always preserved by resetting it before writing (unless write_index is explicitly set to False). The name(s) of the original index column(s) are then passed to the engines, which are expected to use the pandas metadata to store the index column(s) (and to read them back during read_metadata).
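
As a rough illustration of that index handling (a sketch of the idea, not the PR's exact code; the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]},
                  index=pd.Index([10, 20, 30], name="myindex"))

# Preserve the index by resetting it into an ordinary column before
# writing (skipped when write_index=False).
index_cols = [df.index.name]   # e.g. ["myindex"]
df = df.reset_index()          # "myindex" is now a regular column

# index_cols is then handed to the engine, which records it in the
# pandas metadata so read_metadata can restore the index on read.
```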

Any and all feedback is welcome :)

  • Tests added / passed
  • Passes flake8 dask

@martindurant

Do you have an idea of any changes you need from fastparquet? Or can it all be worked out here if I were to go through the code?

@martindurant

@mrocklin, happy to close #4336? I have not yet looked through any of the code here, so I have no comment on whether it is better.

@rjzamora (Author)

> Do you have an idea of any changes you need from fastparquet? Or can it all be worked out here if I were to go through the code?

I'm actually not sure about this, but hopefully I will have a better idea tomorrow. You are certainly welcome to dig through the code and make changes (or let me know what changes you suggest)

> @mrocklin, happy to close #4336? I have not yet looked through any of the code here, so I have no comment on whether it is better.

I'm realizing that I did not properly track the earlier changes from #4336 for this PR (due to a bit of a rebasing blunder). So, it would also be fine with me if @mrocklin thinks we should move these changes back over there.

@mrocklin

I don't care about which PR we use. I would use whichever is more likely to finish faster :)

@rjzamora (Author)

Update: We are down to 7 failing tests. Here is a grep for FAILED in the pytest output:

```
95:dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df3-write_kwargs3-read_kwargs3] FAILED
145:dask/dataframe/io/tests/test_parquet.py::test_partition_on[pyarrow-fastparquet] FAILED
150:dask/dataframe/io/tests/test_parquet.py::test_divisions_read_with_filters FAILED
151:dask/dataframe/io/tests/test_parquet.py::test_divisions_are_known_read_with_filters FAILED
157:dask/dataframe/io/tests/test_parquet.py::test_timestamp96 FAILED
180:dask/dataframe/io/tests/test_parquet.py::test_writing_parquet_with_unknown_kwargs[fastparquet] FAILED
189:dask/dataframe/io/tests/test_parquet.py::test_passing_parquetfile FAILED
696:=== 7 failed, 172 passed, 20 xfailed, 5 xpassed, 1 warnings in 36.54 seconds ===
```

```diff
@@ -357,7 +362,9 @@ def get_fs_token_paths(urlpath, mode='rb', num=1, name_function=None,
         fs, fs_token = get_fs(protocol, options)
         paths = expand_paths_if_needed(paths, mode, num, fs, name_function)

-    elif isinstance(urlpath, (str, unicode)) or hasattr(urlpath, 'name'):
+    elif (isinstance(urlpath, (str, unicode)) or
```
Member:

We have a new stringify_path helper in dask/bytes/utils.py. Does that suffice here?
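
For readers following along, the helper essentially normalizes path-like objects to plain strings; a rough sketch of the idea (not necessarily the exact implementation in dask/bytes/utils.py):

```python
import pathlib

def stringify_path(filepath):
    # Convert path-like objects (e.g. pathlib.Path) to plain strings,
    # passing anything else through unchanged.
    if isinstance(filepath, str):
        return filepath
    if hasattr(filepath, "__fspath__"):       # os.PathLike protocol
        return filepath.__fspath__()
    if isinstance(filepath, pathlib.PurePath):
        return str(filepath)
    return filepath
```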

```diff
@@ -37,6 +37,8 @@ def infer_storage_options(urlpath, inherit_storage_options=None):
     "host": "node", "port": 123, "path": "/mnt/datasets/test.csv",
     "url_query": "q=1", "extra": "value"}
     """
+    urlpath = str(urlpath)  # re, urllib don't support pathlib.Path objects
```
Member:

stringify_path, though maybe the callers should be doing this.


```diff
@@ -243,6 +243,9 @@ class _Frame(DaskMethodsMixin, OperatorMethodMixin):
     Values along which we partition our blocks on the index
     """
     def __init__(self, dsk, name, meta, divisions):
+        if len(divisions) < 2:  # no partitions
```
Member:

Not saying this is wrong, but it's a bit surprising to see this in a Parquet refactor. Do you recall what prompted it?

Member (Author):

These lines were added in PR#4336 to address cases where an empty dataframe/partition is read in (like test_parquet.py::test_empty). Not sure if there is a more appropriate fix.

Member:

You mean a dataframe without any partitions, I think. Yes, that could happen with parquet.
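
For context on the check itself: divisions stores the partition boundary values along the index, so a collection with n partitions carries n + 1 entries, and fewer than two entries means there are no partitions at all. A small illustration with made-up values:

```python
# A collection with 3 partitions has 4 division boundaries:
divisions = (0, 10, 20, 30)        # partition i spans divisions[i]..divisions[i+1]
npartitions = len(divisions) - 1   # -> 3

# So len(divisions) < 2 implies zero partitions, e.g. an empty
# parquet dataset read back in (the case the check above guards).
```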

@TomAugspurger

Ahh, was this open when we ran black over the codebase? Sorry about that.

@rjzamora (Author)

@TomAugspurger Not a problem - I can certainly understand the advantages of using black moving forward :)

@rjzamora (Author)

@mrocklin @martindurant - Should we allow read_parquet to accept an existing ParquetFile (which the current parquet API supports)? This is certainly doable, but I just wanted to confirm that this case is actually used/desired.

If "yes" (which I am expecting), do we want to address test_passing_parquetfile as is (in which the actual parquet directory is removed), or does a simple solution like this one suffice?

@martindurant

> Should we allow read_parquet to accept an existing ParquetFile

I think this is unnecessary now that we have multiple engines and a stronger desire to structure the code well. Anything that can be opened with ParquetFile(.., open_with=) should be openable using a URL alone in the code here.
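
To illustrate the point (a hedged sketch; the bucket path and the s3fs usage are made up for the example), anything expressible as the first form should be expressible as the second:

```python
import dask.dataframe as dd
import s3fs
from fastparquet import ParquetFile

# Old-style: construct the ParquetFile yourself, supplying open_with
s3 = s3fs.S3FileSystem()
pf = ParquetFile("mybucket/data.parq", open_with=s3.open)

# With the refactor: hand read_parquet the URL and let it resolve
# the filesystem internally
df = dd.read_parquet("s3://mybucket/data.parq")
```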

@rjzamora (Author)

Update: All test_parquet.py tests are passing. However, (1) the latest master branch of pyarrow is required, and (2) not all "new" capabilities are tested for pyarrow.

@mrocklin

> Update: All test_parquet.py tests are passing

Woot!

@TomAugspurger

@martindurant will you have a chance to look through this PR?

@martindurant

Will look today. Was waiting for the new commits to stop coming.

@rjzamora (Author)

Thanks @TomAugspurger and @martindurant - Since the refactor is only "working" for pyarrow 0.13.1, my next step (likely this afternoon) will be to add backward compatibility. Code review is still welcome/appreciated at any time of course.

@martindurant

@mrocklin , do you think we'll need the compatibility? I imagine it may take more effort than it's worth, and we already disallow 0.13.0.

@TomAugspurger

TomAugspurger commented Jun 28, 2019 via email

@martindurant

I'd be ok with that

@martindurant left a comment

First pass: I looked over everything except the tests and the fastparquet module.


```python
import pyarrow as pa
import pyarrow.parquet as pq
from ....delayed import delayed
from ....bytes.core import get_pyarrow_filesystem
```
Member:

Forewarning: when we switch to fsspec, all filesystem implementations will already be subclasses of pyarrow's (if installed)

```python
    fs, paths, categories=None, index=None, gather_statistics=None, **kwargs
):

    # In pyarrow, the physical storage field names may differ from
```
Member:

In pandas-derived parquet (also true for fastparquet)


```python
def apply_filters(parts, statistics, filters):
    """ Apply filters onto parts/statistics pairs
```

Member:

(copied from fastparquet; not sure where in Dask this is documented)
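
For anyone unfamiliar with the feature: these are the (column, op, value) predicates that read_parquet accepts through its filters= keyword to prune row-groups using their statistics. A small usage sketch (the file path and column name are hypothetical):

```python
import dask.dataframe as dd

# Only load pieces whose statistics admit rows with x > 5; predicates
# in the list are AND-ed together.
df = dd.read_parquet("tmp/data.parquet", filters=[("x", ">", 5)])
```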

```python
    index_names = user_index
    if set(column_names).intersection(index_names):
        raise ValueError(
            "Specified index and column names must not " "intersect"
```
Member:

Same type of thoughts. Which copy gets triggered? Probably not both.

```python
    ]

    # Need to reconcile storage and real names. These will differ for
    # pyarrow, which uses __index_level_d__ for the storage name of indexes.
```
Member:

This should also work for fastparquet - to be tested

@rjzamora (Author)

Hmm, the new test_s3.py failures may have something to do with the s3fs conda package being updated to 0.3.0 just hours ago...

@martindurant

Possibly, but all tests are passing in #5064, which is also using the same version. Your help diagnosing would be greatly appreciated.

@martindurant

Actually, if #5064 is to be merged soon, then the best action would be to merge/rebase from it (which did affect the old parquet code slightly).

@rjzamora (Author) commented Jul 18, 2019

> Actually, if #5064 is to be merged soon, then the best action would be to merge/rebase from it (which did affect the old parquet code slightly).

@martindurant This sounds good to me - merging in your fsspec branch seems to fix the failing tests on my local machine. If all goes well, we should be able to merge this after #5064.

@rjzamora rjzamora changed the title [WIP] Parquet engine-core refactoring Parquet engine-core refactoring Jul 18, 2019
@martindurant

Sorry, the merge wasn't quite clean (probably because it was merged).

@rjzamora (Author)

No problem - I can clean this up tonight.

@martindurant

I think the new failures are fixed in #5056

@rjzamora (Author)

> I think the new failures are fixed in #5056

That is what I'm hoping :) - I am assuming I should wait for that to go through before bothering to rebase/merge here again?

@martindurant

I expect yes. @TomAugspurger ?

@martindurant

A hearty +1 from me, as soon as fixes are merged in

@martindurant

#5056 was merged

@rjzamora (Author)

Travis tests are all passing, but that pesky partd problem is still there in appveyor :/

@martindurant

... here we go

@martindurant martindurant merged commit a53d45e into dask:master Jul 19, 2019
@rjzamora (Author)

Thanks @martindurant!
