
[WIP] Add "real" read_parquet logic #1

Merged · 6 commits merged into dask:main · Feb 22, 2023
Conversation

rjzamora (Member)

Updates ReadParquet to use metadata-parsing and IO logic from dask.dataframe.io.parquet.

Requires dask/dask#9637 (only because my environment was using that PR when I put this together).

@mrocklin (Member) left a comment

First pull request! Fun!

I added a few comments. It's interesting to see someone else interact with this library.

Any other thoughts/concerns you have about the feel of the library would be welcome. What could we do to improve the ease and understandability of things?

elif key in self.columns:
return self[key]
else:
return object.__getattribute__(self, key)
Member

I'm curious, how did this come about?

Also, this seems unnecessarily nested. I suspect that if this is necessary then this would be better with something like the following:

if key == "__name__":
    return object.__getattribute__(self, key)
else:
    ...

I'll probably be pretty allergic to anything that increases nesting.

Member Author

Like much of this PR, this change is just a temporary hack to get the existing tests running :)

The problem I ran into was that calling self.columns often requires us to actually parse parquet metadata to calculate self._meta, which doesn't work when "myfile.parquet" is not a real parquet dataset. Although this shouldn't "break" when the dataset is real, it also feels unnecessary to parse metadata when you are retrieving an attribute that doesn't require you to do so.
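The pattern under discussion can be sketched generically. The following is a minimal, hypothetical illustration (the class, attribute names, and structure are stand-ins, not the actual dask-expr code) of short-circuiting attribute lookups that can never be column names, so they don't trigger expensive metadata parsing:

```python
class LazyFrame:
    """Toy stand-in for a lazy dataframe whose .columns is expensive."""

    def __init__(self, columns):
        self._columns = list(columns)
        self.meta_parses = 0  # counts simulated metadata parses

    @property
    def columns(self):
        # In the real library this may require parsing parquet metadata.
        self.meta_parses += 1
        return self._columns

    def __getattr__(self, key):
        # Only reached when normal attribute lookup fails.
        # Dunder names can never be columns, so resolve them without
        # touching (potentially expensive) metadata.
        if key.startswith("__") and key.endswith("__"):
            return object.__getattribute__(self, key)
        if key in self.columns:
            return f"<column {key}>"
        return object.__getattribute__(self, key)
```

With this ordering, something like `getattr(df, "__name__")` fails fast with an `AttributeError` without ever computing `df.columns`, while column access still works as before.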

dask_match/core.py (resolved)
parquet_file_extension,
filesystem,
kwargs,
), {}
Member

I'm thinking of removing this method and instead relying on the def read_parquet and similar functions. The classes won't do any normalization, and we'll depend on the user-facing API to handle it. Thoughts?

Member Author

I think that would be fine. We definitely need a space to "normalize" user inputs, but don't need to do this within the API subclasses themselves.

Member

Coming a bit late to the party, but strong +1 for doing normalization/validation of inputs far up in the stack.

Member

This is done in main now. I've removed the normalize function. So far I'm still comfortable with Exprs not having __init__ methods.
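The division of labor settled on above can be sketched as follows. This is a hypothetical illustration, not the actual dask-expr code: the user-facing read_parquet function does all input normalization, while the expression class stays a thin record of already-clean operands with no hand-written __init__ logic:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadParquet:
    # Thin expression node: stores already-normalized operands and
    # performs no validation of its own.
    path: str
    columns: tuple


def read_parquet(path, columns=None):
    # User-facing entry point: all normalization/validation of user
    # inputs happens here, far up in the stack.
    if not isinstance(path, str):
        raise TypeError(f"path must be a str, got {type(path).__name__}")
    if isinstance(columns, str):
        columns = [columns]
    return ReadParquet(path, tuple(columns or ()))
```

Keeping normalization out of the class also means expression nodes can be constructed internally (e.g. during optimization rewrites) without re-running user-input checks.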

dask_match/tests/test_core.py (resolved)
assert_eq(func(ddf), func(df))
assert_eq(func(ddf.x), func(df.x))
assert_eq(func(ddf).compute(), func(df))
assert_eq(func(ddf.x).compute(), func(df.x))
Member

Ah, you probably needed this diff applied to dask/dask. Sorry

diff --git a/dask/dataframe/utils.py b/dask/dataframe/utils.py
index a52a53619..43b06b41b 100644
--- a/dask/dataframe/utils.py
+++ b/dask/dataframe/utils.py
@@ -507,9 +507,6 @@ def _check_dask(dsk, check_names=True, check_dtypes=True, result=None, scheduler
             )
             if check_dtypes:
                 assert_dask_dtypes(dsk, result)
-        else:
-            msg = f"Unsupported dask instance {type(dsk)} found"
-            raise AssertionError(msg)
         return result
     return dsk
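The behavior that the diff removes can be illustrated with a toy stand-in (hypothetical names below, not dask's actual assert_eq/_check_dask): a strict checker rejects any object that is not an already-materialized, known type, so calling .compute() on the new expression type first is the workaround used in the test lines above.

```python
class TinyExpr:
    """Toy lazy wrapper standing in for a new expression type."""

    def __init__(self, data):
        self._data = list(data)

    def compute(self):
        # Materialize into a plain, well-known container.
        return list(self._data)


def assert_eq_strict(a, b):
    # Mimics the removed branch: raise for any instance that is not
    # a recognized, materialized type.
    for obj in (a, b):
        if not isinstance(obj, list):
            raise AssertionError(f"Unsupported dask instance {type(obj)} found")
    assert a == b


data = [1, 2, 3]
lazy = TinyExpr(data)

# assert_eq_strict(lazy, data) would raise; computing first passes.
assert_eq_strict(lazy.compute(), data)
```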


@pytest.mark.xfail(reason="TODO: Debug this")
def test_predicate_pushdown(tmpdir):
from dask_match.io.parquet import ReadParquet as ReadPq
Member

Why this renaming? It seems like it might make more sense to just use the ReadParquet class in io.parquet. Having two class names like this seems like it complicates things to me.

Member

Oh, maybe this is coming from the difference of having a function and class with the same name. I suspect that this will become simpler if we name the function read_parquet like in dask.dataframe.

Member Author

Yes, exactly. This is another temporary thing; you were too quick :)

@mrocklin (Member)

cc also @jrbourbeau and @fjetter in case they're interested

@mrocklin (Member)

Looks clean to me. I'm hitting the "Ready for review" button. I had one final request, mostly to make sure that we cover the actual task graph code.

@mrocklin mrocklin marked this pull request as ready for review February 22, 2023 01:11
Co-authored-by: Matthew Rocklin <mrocklin@gmail.com>
@mrocklin mrocklin merged commit 2453921 into dask:main Feb 22, 2023
@rjzamora rjzamora deleted the read-parquet branch February 22, 2023 15:05
fjetter pushed a commit to fjetter/dask-expr that referenced this pull request Mar 14, 2024
3 participants