Parquet engine-core refactoring #4995
Conversation
Do you have an idea of any changes you need from fastparquet? Or can it all be worked out here if I were to go through the code?
I'm actually not sure about this, but hopefully I will have a better idea tomorrow. You are certainly welcome to dig through the code and make changes (or let me know what changes you suggest)
I'm realizing that I did not properly track the earlier changes from #4336 for this PR (due to a bit of a rebasing blunder). So, it would also be fine with me if @mrocklin thinks we should move these changes back over there.
I don't care about which PR we use. I would use whichever is more likely to finish faster :)
Update: We are down to 7 failing tests. Here is the grep of ...
dask/bytes/core.py (Outdated)

@@ -357,7 +362,9 @@ def get_fs_token_paths(urlpath, mode='rb', num=1, name_function=None,
        fs, fs_token = get_fs(protocol, options)
        paths = expand_paths_if_needed(paths, mode, num, fs, name_function)

-    elif isinstance(urlpath, (str, unicode)) or hasattr(urlpath, 'name'):
+    elif (isinstance(urlpath, (str, unicode)) or
We have a new stringify_path helper in dask/bytes/utils.py. Does that suffice here?
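For readers following along, a rough sketch of what such a path-stringifying helper typically does (the actual helper in dask/bytes/utils.py may differ in detail):

```python
def stringify_path(filepath):
    """Attempt to convert a path-like object (e.g. pathlib.Path) to a string."""
    if isinstance(filepath, str):
        return filepath
    if hasattr(filepath, "__fspath__"):  # os.PathLike objects
        return filepath.__fspath__()
    if hasattr(filepath, "path"):  # some "file-like" objects expose .path
        return filepath.path
    return filepath  # leave anything else untouched
```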
dask/bytes/utils.py (Outdated)

@@ -37,6 +37,8 @@ def infer_storage_options(urlpath, inherit_storage_options=None):
        "host": "node", "port": 123, "path": "/mnt/datasets/test.csv",
        "url_query": "q=1", "extra": "value"}
    """
+   urlpath = str(urlpath)  # re, urllib don't support pathlib.Path objects
stringify_path, though maybe the callers should be doing this.
cc @dlovell
dask/dataframe/core.py (Outdated)

@@ -243,6 +243,9 @@ class _Frame(DaskMethodsMixin, OperatorMethodMixin):
        Values along which we partition our blocks on the index
        """
    def __init__(self, dsk, name, meta, divisions):
+       if len(divisions) < 2:  # no partitions
Not saying this is wrong, but it's a bit surprising to see this in a Parquet refactor. Do you recall what prompted it?
These lines were added in PR#4336 to address cases where an empty dataframe/partition is read in (like test_parquet.py::test_empty). Not sure if there is a more appropriate fix.
You mean a dataframe without any partitions, I think. Yes, that could happen with parquet.
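For reference, `divisions` holds the partition boundary values along the index, so a collection with `npartitions` partitions carries `npartitions + 1` division entries; fewer than 2 entries therefore means no partitions at all. A small illustration (the exact division values shown are just an example):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(6)})
ddf = dd.from_pandas(pdf, npartitions=3)

# divisions has npartitions + 1 boundary values along the index
print(ddf.npartitions)  # 3
print(ddf.divisions)    # (0, 2, 4, 5)
```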
Ahh, was this open when we ran black?
@TomAugspurger Not a problem - I can certainly understand the advantages of using black moving forward :)
@mrocklin @martindurant - Should we allow ...? If "yes" (which I am expecting), do we want to address ...?
I think this is unnecessary now that we have multiple engines and a stronger desire to structure the code well. Anything that can be opened with ...
Update: All tests are passing.
Woot!
@martindurant will you have a chance to look through this PR?
Will look today. Was waiting for the new commits to stop coming.
Thanks @TomAugspurger and @martindurant - Since the refactor is only "working" for pyarrow 0.13.1, my next step (likely this afternoon) will be to add backward compatibility. Code review is still welcome/appreciated at any time of course.
@mrocklin, do you think we'll need the compatibility? I imagine it may take more effort than it's worth, and we already disallow 0.13.0.
I’d say just bump the minimum pyarrow version.
I'd be ok with that
First pass: looked over everything except tests and the fastparquet module.
dask/dataframe/io/parquet/arrow.py (Outdated)

import pyarrow as pa
import pyarrow.parquet as pq
from ....delayed import delayed
from ....bytes.core import get_pyarrow_filesystem
Forewarning: when we switch to fsspec, all filesystem implementations will already be subclasses of pyarrow's (if installed)
dask/dataframe/io/parquet/arrow.py (Outdated)

    fs, paths, categories=None, index=None, gather_statistics=None, **kwargs
):

    # In pyarrow, the physical storage field names may differ from
In pandas-derived parquet (also true for fastparquet)
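To illustrate the storage-name vs. "real"-name distinction being discussed, a small pyarrow example (the metadata layout shown is what recent pyarrow versions produce; details may vary by version):

```python
import pandas as pd
import pyarrow as pa

# Unnamed, non-range index, so pyarrow must store it as a physical column
df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])
table = pa.Table.from_pandas(df)

# Physical (storage) field names include the synthesized index column
print(table.schema.names)  # ['x', '__index_level_0__']

# The pandas metadata records which storage columns are really index levels,
# and pairs each "real" name with its storage field_name
print(table.schema.pandas_metadata["index_columns"])  # ['__index_level_0__']
for col in table.schema.pandas_metadata["columns"]:
    print(col["name"], "->", col["field_name"])        # e.g. None -> '__index_level_0__'
```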
def apply_filters(parts, statistics, filters):
    """ Apply filters onto parts/statistics pairs
(copied from fastparquet; not sure where in Dask this is documented)
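For anyone reading along: filters are typically given as a list of (column, op, value) tuples that are AND-ed together, e.g. [("x", ">", 5)], and the per-partition min/max statistics are used to drop partitions that cannot contain matching rows. A rough, simplified sketch of the idea (the statistics layout with "columns"/"name"/"min"/"max" keys is an assumption here, not necessarily the exact structure used in this PR):

```python
def apply_filters(parts, statistics, filters):
    """Drop (part, stats) pairs whose min/max statistics rule out every filter."""
    out_parts, out_stats = [], []
    for part, stats in zip(parts, statistics):
        keep = True
        for col, op, val in filters:
            cstats = next((c for c in stats["columns"] if c["name"] == col), None)
            if cstats is None or cstats.get("min") is None:
                continue  # no statistics for this column -> cannot prune
            lo, hi = cstats["min"], cstats["max"]
            if (
                (op == "==" and not (lo <= val <= hi))
                or (op == "<" and lo >= val)
                or (op == "<=" and lo > val)
                or (op == ">" and hi <= val)
                or (op == ">=" and hi < val)
            ):
                keep = False
                break
        if keep:
            out_parts.append(part)
            out_stats.append(stats)
    return out_parts, out_stats
```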
dask/dataframe/io/parquet/utils.py (Outdated)

    index_names = user_index
    if set(column_names).intersection(index_names):
        raise ValueError(
            "Specified index and column names must not " "intersect"
Same type of comment and thoughts: which copy gets triggered? Probably not both.
    ]

    # Need to reconcile storage and real names. These will differ for
    # pyarrow, which uses __index_level_d__ for the storage name of indexes.
This should also work for fastparquet - to be tested
Hmm, the new ...
Possibly, but all tests are passing in #5064, which is also using the same version. Your help diagnosing would be greatly appreciated.
Actually, if #5064 is to be merged soon, then the best action would be to merge/rebase from it (which did affect the old parquet code slightly).
@martindurant This sounds good to me - merging in your ...
Sorry, merge wasn't quite clean (probably because it was merged)
No problem - I can clean this up tonight.
I think the new failures are fixed in #5056
That is what I'm hoping :) - I am assuming I should wait for that to go through before bothering to rebase/merge here again?
I expect yes. @TomAugspurger ?
A hearty +1 from me, as soon as fixes are merged in
#5056 was merged
Travis tests are all passing, but that pesky partd problem is still there in appveyor :/
... here we go
Thanks @martindurant!
This PR is intended to supersede PR#4336, which was originally driven mostly by @mrocklin and @martindurant (I'm opening a PR just to refresh/reorganize the discussion a bit). There are still 18 tests failing in `test_parquet.py` (all for fastparquet), but there should be enough progress for healthy discussion/feedback.

The primary goal here is to refactor the parquet interface into distinct core and engine code that will be easier to maintain in the future. The idea is to pull all engine-agnostic code into `dask/dask/dataframe/io/parquet/core.py`, and then isolate engine-specific code in `dask/dask/dataframe/io/parquet/arrow.py` and `dask/dask/dataframe/io/parquet/fastparquet.py` (for `pyarrow` and `fastparquet`, respectively).

Engine calls during the write phase (`core.to_parquet`):

- `engine.initialize_write`: Preparation for writing/appending
- `engine.write_partition`: Write operation for each partition
- `engine.write_metadata`: Metadata-write operation

Engine calls during the read phase (`core.read_parquet`):

- `engine.read_metadata`: Read metadata and file statistics
- `engine.read_partition`: Read operation for each partition

Another goal of PR#4336 was to limit the responsibilities of engine-specific code. I tried to do this without losing any significant/useful features. For example, I decided to retain index detection in files with pandas metadata. For simplicity, the index is always preserved by resetting it before writing (unless `write_index` is explicitly set to `False`). The name(s) of the original index column(s) are then passed to the engines, which are expected to use pandas metadata to store the column index (and read it back during `read_metadata`).

Any and all feedback is welcome :)
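To make the proposed split concrete, here is a rough sketch of what the engine interface described above could look like. The method names follow this description, but the signatures, arguments, and return values are illustrative assumptions rather than the final API:

```python
class Engine:
    """Abstract parquet engine; pyarrow and fastparquet provide concrete subclasses."""

    # --- read path (core.read_parquet) ---
    @classmethod
    def read_metadata(cls, fs, paths, categories=None, index=None,
                      gather_statistics=None, **kwargs):
        """Return (meta, statistics, parts): an empty pandas DataFrame describing
        the schema, per-piece statistics, and the list of pieces to read."""
        raise NotImplementedError

    @classmethod
    def read_partition(cls, fs, piece, columns, index, **kwargs):
        """Read one piece (file or row-group) into a pandas DataFrame."""
        raise NotImplementedError

    # --- write path (core.to_parquet) ---
    @classmethod
    def initialize_write(cls, df, fs, path, append=False, **kwargs):
        """Prepare for writing/appending (validate schema, create directories)."""
        raise NotImplementedError

    @classmethod
    def write_partition(cls, df, path, fs, filename, partition_on=None, **kwargs):
        """Write a single partition; optionally return metadata for write_metadata."""
        raise NotImplementedError

    @classmethod
    def write_metadata(cls, parts, meta, fs, path, append=False, **kwargs):
        """Write the shared _metadata/_common_metadata once all partitions are written."""
        raise NotImplementedError
```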