
Fix predicate-pushdown compute #2

Closed
wants to merge 5 commits into from

Conversation

rjzamora
Member

@rjzamora rjzamora commented Feb 22, 2023

Updates ReadParquet and test_predicate_pushdown to get the correct result after compute.

Future TODO: Currently, the read_parquet engine is not required to perform row-wise filtering to satisfy the filters argument. For the pyarrow engine, you will get row-wise filtering, but the fastparquet and cudf engines don't do this yet. Since the replacement rules remove the explicit Filter operation, we need to make sure that the partition-wise IO function ensures row-wise filtering has been applied.
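To make the concern concrete, here is a minimal sketch (hypothetical helper, not dask-expr code) of what "row-wise filtering inside the IO function" could look like: after an engine reads a partition, a DNF-style `filters` list is applied as a boolean mask, so the result is correct even if the engine only filtered at the row-group level. The name `apply_filters_rowwise` is illustrative.

```python
import operator

import pandas as pd

# Hypothetical helper (not part of dask-expr): map filter operator
# strings onto pandas-compatible comparison functions.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def apply_filters_rowwise(df: pd.DataFrame, filters) -> pd.DataFrame:
    """Apply DNF filters [[(col, op, value), ...], ...] row-wise.

    Inner lists are AND-ed conjunctions; outer lists are OR-ed together,
    mirroring the usual parquet ``filters`` convention.
    """
    if filters and isinstance(filters[0], tuple):
        filters = [filters]  # normalize a single conjunction
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:
        sub = pd.Series(True, index=df.index)
        for col, op, value in conjunction:
            sub &= _OPS[op](df[col], value)
        mask |= sub
    return df[mask]


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": list("aabb")})
out = apply_filters_rowwise(df, [("x", ">", 1), ("y", "==", "b")])
```

An engine that already filters row-wise (like pyarrow) would make this a cheap no-op; for fastparquet or cudf it would restore correctness after the Filter expression is optimized away.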

@mrocklin
Member

Cool. Things here seem generally fine to me.

As a heads-up, I'll be mostly unresponsive at least tomorrow and maybe Friday as well.

@mrocklin
Member

but the fastparquet and cudf engines don't do this yet

FWIW if this actually becomes a real library then my plan was to use it as an opportunity to reset a lot of things like using fastparquet, and depending on pandas < 2. If cudf could grow support for filtering that would be nice.

I actually removed the engine keyword from your code with this in mind. I hadn't thought of cudf though.

Since the replacement rules result in the removal of the explicit Filter operation

If you wanted to play with replacement rules you could add back in the engine keyword, and then add new replacement rules that match against ReadParquet(filename, columns=a, filters=b, engine="pyarrow") and ReadParquet(filename, columns=a, filters=b, engine="cudf"), or just leave off the cudf replacement rules for now and force engine="pyarrow". That way you'll know that these rules won't apply when engine is anything else.

Note that if you do do this then you'll need to leave self.engine as a string, and not as an Engine object.

I mention this not because I necessarily want you to do this work (there are probably more important things to sort out) but because I think it might give you a feel for replacement rules, which could be interesting / fun / educational.
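As a toy illustration of the idea above (this is a standalone sketch, not dask-expr's actual rule machinery; all class and function names are hypothetical): a rewrite only fires when the engine string is one known to perform row-wise filtering, and otherwise the explicit Filter stays in the graph.

```python
from dataclasses import dataclass, replace


# Hypothetical, simplified expression nodes (not dask-expr classes).
@dataclass(frozen=True)
class ReadParquet:
    filename: str
    columns: tuple = None
    filters: tuple = None
    engine: str = "pyarrow"  # kept as a string, per the note above


@dataclass(frozen=True)
class Filter:
    frame: object
    predicate: tuple


# Engines known to honor ``filters`` row-wise; "cudf" could be added
# once it grows support for row-wise filtering.
ROWWISE_ENGINES = {"pyarrow"}


def pushdown_filter(expr):
    """Rewrite Filter(ReadParquet) -> ReadParquet(filters=...) when safe."""
    if (
        isinstance(expr, Filter)
        and isinstance(expr.frame, ReadParquet)
        and expr.frame.engine in ROWWISE_ENGINES
    ):
        return replace(expr.frame, filters=(expr.predicate,))
    return expr  # engine can't guarantee row-wise filtering: keep Filter


optimized = pushdown_filter(Filter(ReadParquet("data.parquet"), ("x", ">", 1)))
kept = pushdown_filter(
    Filter(ReadParquet("data.parquet", engine="fastparquet"), ("x", ">", 1))
)
```

The pyarrow read collapses into a single ReadParquet carrying the predicate, while the fastparquet expression is left untouched, which matches the "rules won't apply when engine is anything else" guarantee.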

@rjzamora
Member Author

FWIW if this actually becomes a real library then my plan was to use it as an opportunity to reset a lot of things like using fastparquet, and depending on pandas < 2. If cudf could grow support for filtering that would be nice.

I actually removed the engine keyword from your code with this in mind. I hadn't thought of cudf though.

I mostly agree with this, but will obviously insist on making it as easy as possible to use a cudf backend as early as possible.

Note that if you do do this then you'll need to leave self.engine as a string, and not as an Engine object.

Ah, good point

I mention this not because I necessarily want you to do this work (there are probably more important things to sort out) but because I think it might give you a feel for replacement rules, which could be interesting / fun / educational.

Thanks for sharing these thoughts. It is likely that I will end up playing with rules like this. However, the more-immediate issue is probably that tests unrelated to parquet/predicate_pushdown are still failing in assert_eq. So we probably want to iron out the best way to validate and compare an expression to an expected pandas result.

@mrocklin
Member

Fair point. Maybe we should pull over assert_eq into this library and modify it to our own needs?

@rjzamora
Member Author

Fair point. Maybe we should pull over assert_eq into this library and modify it to our own needs?

I'm open to this

@mrocklin
Member

My eyes are back in a decent state. Is this ready for review?

@rjzamora
Member Author

rjzamora commented Feb 28, 2023

My eyes are back in a decent state. Is this ready for review?

I came down with something yesterday, so I haven't pushed on this since last week. You are welcome to review and/or make changes. If I remember correctly, this PR is probably trying to do too many things at once (and non-pushdown tests are not passing).

@rjzamora rjzamora marked this pull request as ready for review March 7, 2023 21:27
Comment on lines +113 to +115
@property
def index(self):
return Index(self)
Member Author

I'm wondering if we want various base classes, like FrameExpr, DataFrameExpr, SeriesExpr, and ScalarExpr to make sure we only expose attributes like index when it makes sense.
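A minimal sketch of what that hierarchy could look like (all class names here are the hypothetical ones floated above, not existing dask-expr classes): only frame-like expressions expose `index`, so a scalar expression simply doesn't have the attribute.

```python
# Hypothetical base-class layout: attributes like ``index`` live only on
# the expression types where they make sense.
class Expr:
    """Common base for all expressions."""


class ScalarExpr(Expr):
    """Scalar result: no index, no columns."""


class FrameExpr(Expr):
    """Anything index-backed (DataFrame, Series, Index)."""

    @property
    def index(self):
        return Index(self)


class DataFrameExpr(FrameExpr):
    pass


class SeriesExpr(FrameExpr):
    pass


class Index(FrameExpr):
    def __init__(self, frame):
        self.frame = frame


frame_index = DataFrameExpr().index  # fine: frame-like expressions have .index
```

With this split, `ScalarExpr().index` raises AttributeError at attribute-lookup time instead of producing a nonsensical Index expression.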

Comment on lines 74 to 85
@property
def input_columns(self):
return self.operands[self._parameters.index("columns")]

@property
def columns(self):
if self.input_columns is None:
return self._meta.columns
else:
import pandas as pd

return pd.Index(_list_columns(self.input_columns))
Member Author

Note that overlapping parameters and properties can be a bit confusing, and I don't really like the solution used here.

We may want to require each class to define a set of reserved names, and forbid any parameter name from intersecting with that set. This means the user-facing read_parquet API, for example, would need to translate arguments like columns into a different name (like column_projection), so that the ReadParquet implementation wouldn't need to worry about implementing anything like the input_columns workarounds used here.
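The reserved-name check proposed above could be enforced at class-definition time; here is one possible sketch (hypothetical API, not dask-expr code) using `__init_subclass__` so a clashing parameter list fails immediately rather than surfacing later as an `input_columns`-style workaround.

```python
# Hypothetical enforcement of reserved names: any Expr subclass whose
# _parameters intersect the reserved set is rejected when defined.
RESERVED_NAMES = {"columns", "index", "dtypes"}


class Expr:
    _parameters = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        clash = RESERVED_NAMES.intersection(cls._parameters)
        if clash:
            raise TypeError(
                f"{cls.__name__}: reserved parameter name(s) {sorted(clash)}"
            )


class ReadParquet(Expr):
    # The user-facing read_parquet API would translate ``columns`` into a
    # different name (e.g. ``column_projection``) before reaching here.
    _parameters = ["filename", "column_projection", "filters"]


# Defining a class with a clashing parameter fails up front:
try:

    class BadReadParquet(Expr):
        _parameters = ["filename", "columns"]  # clashes with RESERVED_NAMES

    definition_failed = False
except TypeError:
    definition_failed = True
```

This moves the collision from a subtle runtime attribute-resolution bug to a loud error at import time.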

Comment on lines 100 to +111
def __getattr__(self, key):
if key == "__name__":
return object.__getattribute__(self, key)
elif key in type(self)._parameters:
idx = type(self)._parameters.index(key)
return self.operands[idx]
elif key in dir(type(self)):
return object.__getattribute__(self, key)
elif is_dataframe_like(self._meta) and key in self._meta.columns:
return self[key]
else:
return object.__getattribute__(self, key)

def operand(self, key):
return self.operands[type(self)._parameters.index(key)]
Member Author

Note that commit 6f191e4 demonstrates one possible approach to the proposal in #4 (adding a distinct operand API for accessing parameters used to create the expression). Although this prohibits us from using self.<operand> syntax, I feel that sacrificing this short-hand makes it much easier to avoid recursion traps and unexpected attribute/column-name collisions.


def _layer(self):
return {
(self._name, i): (getattr, (self.operand("frame")._name, i), "index")
Member

Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with"

Member Author

Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with"

I'm totally on board with the "pleasant to work with" goal, so the operand proposal may not be the best solution. Overall, I found the current practice of accessing parameters/operands as attributes (e.g. self.<operand>) to be surprisingly problematic, and eventually came around to the idea of forbidding the practice altogether. I realize that the self.<operand> pattern can be supported just fine, but I'm not quite convinced that it is worth the potential for pain (though my mind is open).
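A toy illustration of the collision hazard being discussed (standalone sketch, not dask-expr code): when `__getattr__` falls back to column selection, a column that happens to share a name with an operand silently shadows it, while an explicit `operand()` lookup stays unambiguous.

```python
class Expr:
    _parameters = ["frame"]
    # Imagine the underlying data happens to have a column named "frame":
    _columns = ["frame", "x"]

    def __init__(self, *operands):
        self.operands = list(operands)

    def operand(self, key):
        """Unambiguous lookup: always resolves to the operand."""
        return self.operands[type(self)._parameters.index(key)]

    def __getattr__(self, key):
        # Attribute-style sugar: column names win, so an operand named
        # like a column is silently shadowed by column selection.
        if key in type(self)._columns:
            return ("column", key)
        return self.operand(key)


expr = Expr("some-input-frame")
shadowed = expr.frame              # resolves to the *column*, not the operand
actual = expr.operand("frame")     # resolves to the operand
```

This is the kind of attribute/column-name collision that the explicit `operand` API sidesteps, at the cost of the `self.<operand>` shorthand.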

@mrocklin
Member

Anything further to do here @rjzamora ?

@rjzamora
Member Author

These changes were addressed in #6

@rjzamora rjzamora closed this Mar 29, 2023
@rjzamora rjzamora deleted the fix-predicate-pushdown-compute branch March 29, 2023 18:55
hendrikmakait added a commit to hendrikmakait/dask-expr that referenced this pull request Oct 25, 2023