[WIP] Add "real" read_parquet logic #1
Conversation
First pull request! Fun!
I added a few comments. It's interesting to see someone else interact with this library.
Any other thoughts/concerns you have about the feel of the library would be welcome. What could we do to improve the ease and understandability of things?
dask_match/core.py
Outdated
elif key in self.columns:
    return self[key]
else:
    return object.__getattribute__(self, key)
I'm curious, how did this come about?
Also, this seems unnecessarily nested. I suspect that if this is necessary then this would be better with something like the following:
if key == "__name__":
    return object.__getattribute__(self, key)
else:
    ...
I'll probably be pretty allergic to anything that increases nesting
Like much of this PR, this change is just a temporary hack to get the existing tests running :)
The problem I ran into was that calling self.columns often requires us to actually parse parquet metadata to calculate self._meta, which doesn't work when "myfile.parquet" is not a real parquet dataset. Although this shouldn't "break" when the dataset is real, it also feels unnecessary to parse metadata when you are retrieving an attribute that doesn't require you to do so.
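For context, a minimal runnable sketch of the guard being discussed (the class and names here are illustrative stand-ins, not the PR's actual code): attribute lookups that can never be column names should fail fast instead of forcing metadata parsing.

```python
import pandas as pd

class Frame:
    # Toy stand-in for the expression class, for illustration only.
    def __init__(self, df):
        self._df = df

    @property
    def columns(self):
        # In dask-match this is where parquet metadata would be parsed
        # to build _meta; a plain pandas lookup stands in for it here.
        return list(self._df.columns)

    def __getitem__(self, key):
        return Frame(self._df[[key]])

    def __getattr__(self, key):
        # Only invoked when normal attribute lookup fails.
        if key.startswith("_"):
            # Dunder/private lookups (e.g. copy/pickle hooks) should
            # raise AttributeError fast, not trigger metadata parsing.
            return object.__getattribute__(self, key)
        if key in self.columns:  # potentially expensive metadata access
            return self[key]
        return object.__getattribute__(self, key)

f = Frame(pd.DataFrame({"x": [1, 2]}))
f.x  # resolves to a column without special-casing
```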
    parquet_file_extension,
    filesystem,
    kwargs,
), {}
I'm thinking of removing this method and instead relying on def read_parquet and similar functions. The classes won't do any normalization, and we'll depend on the user-facing API to handle it. Thoughts?
I think that would be fine. We definitely need a space to "normalize" user inputs, but don't need to do this within the API subclasses themselves.
Coming a bit late to the party, but strong +1 for doing normalization/validation of inputs far up in the stack.
This is done in main now. I've removed the normalize function. So far I'm still comfortable with Exprs not having __init__ methods.
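A sketch of the division of labor being settled on here (all names illustrative): the user-facing function normalizes inputs once at the top of the stack, while the expression class records its operands verbatim and needs no __init__ logic of its own.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ReadParquet:
    # Illustrative stand-in: stores operands exactly as given,
    # with no normalization of its own.
    path: str
    columns: Optional[tuple] = None
    kwargs: dict = field(default_factory=dict)

def read_parquet(path, columns=None, **kwargs):
    # All user-input normalization happens here, far up in the stack.
    if isinstance(columns, str):
        columns = (columns,)          # scalar -> tuple
    elif columns is not None:
        columns = tuple(columns)
    return ReadParquet(path, columns, kwargs)

expr = read_parquet("myfile.parquet", columns="x")
```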
dask_match/tests/test_core.py
Outdated
assert_eq(func(ddf), func(df))
assert_eq(func(ddf.x), func(df.x))
assert_eq(func(ddf).compute(), func(df))
assert_eq(func(ddf.x).compute(), func(df.x))
Ah, you probably needed this diff applied to dask/dask. Sorry
diff --git a/dask/dataframe/utils.py b/dask/dataframe/utils.py
index a52a53619..43b06b41b 100644
--- a/dask/dataframe/utils.py
+++ b/dask/dataframe/utils.py
@@ -507,9 +507,6 @@ def _check_dask(dsk, check_names=True, check_dtypes=True, result=None, scheduler
             )
             if check_dtypes:
                 assert_dask_dtypes(dsk, result)
-        else:
-            msg = f"Unsupported dask instance {type(dsk)} found"
-            raise AssertionError(msg)
         return result
     return dsk
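The removed else branch is what raised for collections other than dask.dataframe's own classes; with it gone, assert_eq can compute dask-match collections itself. Roughly (reusing the names from the test snippet above):

```python
# Before the patch: compute manually so assert_eq only sees pandas.
assert_eq(func(ddf).compute(), func(df))

# After the patch: pass the dask-match collection directly.
assert_eq(func(ddf), func(df))
```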
dask_match/tests/test_core.py
Outdated
@pytest.mark.xfail(reason="TODO: Debug this")
def test_predicate_pushdown(tmpdir):
    from dask_match.io.parquet import ReadParquet as ReadPq
Why this renaming? It seems like it might make more sense to just use the ReadParquet class in io.parquet. Having two class names like this seems like it complicates things to me.
Oh, maybe this comes from having a function and a class with the same name. I suspect this will become simpler if we name the function read_parquet, like in dask.dataframe.
Yes, exactly. This is another temporary thing. You were too quick :)
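A small illustration of the collision (hypothetical names): a module-level function and class bound to the same name shadow each other, which is what forced the ReadPq import alias.

```python
class ReadParquet:        # the expression node
    pass

def ReadParquet(path):    # rebinds the name: the class above is now
    ...                   # unreachable, so isinstance checks break

# dask.dataframe-style naming sidesteps this entirely:
#   class ReadParquet: ...       # expression node
#   def read_parquet(path): ...  # user-facing function
```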
cc also @jrbourbeau and @fjetter in case they're interested
Looks clean to me. I'm hitting the "Ready for review" button. I had one final request, mostly to make sure that we cover the actual task graph code.
Co-authored-by: Matthew Rocklin <mrocklin@gmail.com>
dask-expr v0.1.4
Updates ReadParquet to use metadata-parsing and IO logic from dask.dataframe.io.parquet. Requires dask/dask#9637 (only because my environment was using that PR when I put this together).