[WIP] Add "real" read_parquet logic #1
Conversation
First pull request! Fun!
I added a few comments. It's interesting to see someone else interact with this library.
Any other thoughts/concerns you have about the feel of the library would be welcome. What could we do to improve the ease and understandability of things?
dask_match/core.py
Outdated
elif key in self.columns:
    return self[key]
else:
    return object.__getattribute__(self, key)
I'm curious, how did this come about?
Also, this seems unnecessarily nested. I suspect that if this is necessary then this would be better with something like the following:
if key == "__name__":
    return object.__getattribute__(self, key)
else:
    ...
I'll probably be pretty allergic to anything that increases nesting
Like much of this PR, this change is just a temporary hack to get the existing tests running :)
The problem I ran into was that calling self.columns often requires us to actually parse parquet metadata to calculate self._meta, which doesn't work when "myfile.parquet" is not a real parquet dataset. Although this shouldn't "break" when the dataset is real, it also feels unnecessary to parse metadata when you are retrieving an attribute that doesn't require you to do so.
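For context, a minimal runnable sketch of the guard being discussed (the class and names here are illustrative stand-ins, not the PR's actual code): attribute lookups that can never be column names should fail fast instead of forcing metadata parsing.

```python
import pandas as pd

class Frame:
    # Toy stand-in for the expression class, for illustration only.
    def __init__(self, df):
        self._df = df

    @property
    def columns(self):
        # In dask-match this is where parquet metadata would be parsed
        # to build _meta; a plain pandas lookup stands in for it here.
        return list(self._df.columns)

    def __getitem__(self, key):
        return Frame(self._df[[key]])

    def __getattr__(self, key):
        # Only invoked when normal attribute lookup fails.
        if key.startswith("_"):
            # Dunder/private lookups (e.g. copy/pickle hooks) should
            # raise AttributeError fast, not trigger metadata parsing.
            return object.__getattribute__(self, key)
        if key in self.columns:  # potentially expensive metadata access
            return self[key]
        return object.__getattribute__(self, key)

f = Frame(pd.DataFrame({"x": [1, 2]}))
f.x  # resolves to a column without special-casing
```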
    parquet_file_extension,
    filesystem,
    kwargs,
), {}
I'm thinking of removing this method and instead relying on def read_parquet and similar functions. The classes won't do any normalization, and we'll depend on the user-facing API to handle it. Thoughts?
I think that would be fine. We definitely need a space to "normalize" user inputs, but don't need to do this within the API subclasses themselves.
Coming a bit late to the party, but strong +1 for doing normalization/validation of inputs far up in the stack.
This is done in main now. I've removed the normalize function. So far I'm still comfortable with Exprs not having __init__ methods.
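A sketch of the division of labor being settled on here (all names illustrative): the user-facing function normalizes inputs once at the top of the stack, while the expression class records its operands verbatim and needs no __init__ logic of its own.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ReadParquet:
    # Illustrative stand-in: stores operands exactly as given,
    # with no normalization of its own.
    path: str
    columns: Optional[tuple] = None
    kwargs: dict = field(default_factory=dict)

def read_parquet(path, columns=None, **kwargs):
    # All user-input normalization happens here, far up in the stack.
    if isinstance(columns, str):
        columns = (columns,)          # scalar -> tuple
    elif columns is not None:
        columns = tuple(columns)
    return ReadParquet(path, columns, kwargs)

expr = read_parquet("myfile.parquet", columns="x")
```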
dask_match/tests/test_core.py
Outdated
assert_eq(func(ddf), func(df))
assert_eq(func(ddf.x), func(df.x))
assert_eq(func(ddf).compute(), func(df))
assert_eq(func(ddf.x).compute(), func(df.x))
Ah, you probably needed this diff applied to dask/dask. Sorry
diff --git a/dask/dataframe/utils.py b/dask/dataframe/utils.py
index a52a53619..43b06b41b 100644
--- a/dask/dataframe/utils.py
+++ b/dask/dataframe/utils.py
@@ -507,9 +507,6 @@ def _check_dask(dsk, check_names=True, check_dtypes=True, result=None, scheduler
             )
             if check_dtypes:
                 assert_dask_dtypes(dsk, result)
-        else:
-            msg = f"Unsupported dask instance {type(dsk)} found"
-            raise AssertionError(msg)
         return result
     return dsk
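The removed else branch is what raised for collections other than dask.dataframe's own classes; with it gone, assert_eq can compute dask-match collections itself. Roughly (reusing the names from the test snippet above):

```python
# Before the patch: compute manually so assert_eq only sees pandas.
assert_eq(func(ddf).compute(), func(df))

# After the patch: pass the dask-match collection directly.
assert_eq(func(ddf), func(df))
```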
dask_match/tests/test_core.py
Outdated
@pytest.mark.xfail(reason="TODO: Debug this")
def test_predicate_pushdown(tmpdir):
    from dask_match.io.parquet import ReadParquet as ReadPq
Why this renaming? It seems like it might make more sense to just use the ReadParquet class in io.parquet. Having two class names like this seems like it complicates things to me.
Oh, maybe this comes from having a function and a class with the same name. I suspect this will become simpler if we name the function read_parquet, like in dask.dataframe.
Yes, exactly. This is another temporary thing. You were too quick :)
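A small illustration of the collision (hypothetical names): a module-level function and class bound to the same name shadow each other, which is what forced the ReadPq import alias.

```python
class ReadParquet:        # the expression node
    pass

def ReadParquet(path):    # rebinds the name: the class above is now
    ...                   # unreachable, so isinstance checks break

# dask.dataframe-style naming sidesteps this entirely:
#   class ReadParquet: ...       # expression node
#   def read_parquet(path): ...  # user-facing function
```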
cc also @jrbourbeau and @fjetter in case they're interested
Looks clean to me. I'm hitting the "Ready for review" button. I had one final request, mostly to make sure that we cover the actual task graph code.
Co-authored-by: Matthew Rocklin <mrocklin@gmail.com>
dask-expr v0.1.4
Updates ReadParquet to use metadata-parsing and IO logic from dask.dataframe.io.parquet. Requires dask/dask#9637 (only because my environment was using that PR when I put this together).