Refactor `Expr` class #158

phofl · 2023-06-16T09:13:52Z

This is a first suggestion. Some follow-up work remains (e.g. adapting new_collection), but this PR will be a pain to keep in sync, so I want to keep the scope as small as possible.

closes Remove DataFrame assumptions from Expr #142

# Conflicts: # dask_expr/expr.py # dask_expr/io/parquet.py # dask_expr/reductions.py

mrocklin

Some comments

mrocklin · 2023-06-16T11:47:29Z

dask_expr/expr.py

@@ -33,8 +19,7 @@
 class Expr:
    """Primary class for all Expressions

-    This mostly includes Dask protocols and various Pandas-like method
-    definitions to make us look more like a DataFrame.
+    This mostly includes Dask protocols.
    """

    commutative = False


Unrelated to this PR, but we can remove commutative / associative

These are left over from the matchpy days

mrocklin · 2023-06-16T11:48:35Z

dask_expr/expr.py

+
+    @property
+    def npartitions(self):
+        raise NotImplementedError


Maybe npartitions should move over? Not a huge deal I guess if collections like scalar or delayed return 1 though.

I checked this against the dask/dask implementation. For collections like bag, npartitions is passed in through init, which is why I ordered it like this.
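As a rough illustration of the two options discussed here, the sketch below keeps `npartitions` abstract on the base class and lets subclasses decide how to supply it. `ScalarExpr` and `BagLikeExpr` are hypothetical names invented for this example; they are not classes from dask-expr.

```python
class Expr:
    """Minimal stand-in for the generic base class."""

    @property
    def npartitions(self):
        # The base class makes no assumption about partitioning.
        raise NotImplementedError


class ScalarExpr(Expr):
    """Hypothetical single-partition expression (scalar or delayed)."""

    @property
    def npartitions(self):
        return 1


class BagLikeExpr(Expr):
    """Hypothetical expression whose partition count is passed via init."""

    def __init__(self, npartitions):
        self._npartitions = npartitions

    @property
    def npartitions(self):
        return self._npartitions
```

Under this assumption, generic code can call `expr.npartitions` on any expression and only the collection-specific subclasses need to know where the value comes from.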

mrocklin · 2023-06-16T11:49:15Z

dask_expr/expr.py

-
-    def _task(self, index: int):
-        assert index == 0
-        return self.value


Maybe Literal stays? I wonder if we can remove divisions/meta and still have things work.

I can take a look as a follow-up if ok?

mrocklin · 2023-06-16T11:50:09Z

dask_expr/frameexpr.py

@@ -0,0 +1,1375 @@
+from __future__ import annotations


I think I'm not a fan of the name frameexpr.py. Maybe frame.py, maybe dataframe/expr.py? Thoughts?

I am ok with frame.py and Frame(Expr)

mrocklin · 2023-06-16T11:51:10Z

dask_expr/frameexpr.py

+    """
+
+    def __hash__(self):
+        return hash(self._name)


I'm curious about your thoughts on where hash should live. Is this because _name may not be on all expressions? (maybe it should?)

This is very weird: a lot of tests fail if I add this only to Expr. It was hard to investigate alongside all the other changes, so I put it here as well for now.

You might need to move over __eq__ as well. Python has weird rules around hash and eq together.
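The rule referred to here is that a class which defines `__eq__` without also defining `__hash__` gets `__hash__` set to `None`, even if a parent class provides one. The sketch below shows that behavior in isolation; `Base`, `Child`, and `FixedChild` are hypothetical names, and whether this is what caused the test failures in this PR is only a guess.

```python
class Base:
    """Hypothetical parent that defines __hash__ only."""

    def __init__(self, name):
        self._name = name

    def __hash__(self):
        return hash(self._name)


class Child(Base):
    # Defining __eq__ without __hash__ in the same class body makes
    # Python set Child.__hash__ to None, hiding Base.__hash__.
    def __eq__(self, other):
        return isinstance(other, Child) and self._name == other._name


class FixedChild(Base):
    def __eq__(self, other):
        return isinstance(other, FixedChild) and self._name == other._name

    # Re-declare __hash__ next to __eq__ so instances stay hashable.
    __hash__ = Base.__hash__
```

Keeping `__hash__` and `__eq__` in the same class, as in `FixedChild`, avoids the surprise.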

mrocklin · 2023-06-16T11:52:15Z

dask_expr/expr.py

-    def __getattr__(self, key):
-        try:
-            return object.__getattribute__(self, key)
-        except AttributeError as err:
-            # Allow operands to be accessed as attributes
-            # as long as the keys are not already reserved
-            # by existing methods/properties
-            _parameters = type(self)._parameters
-            if key in _parameters:
-                idx = _parameters.index(key)
-                return self.operands[idx]


Probably we want this top bit in Expr

Oh good point, moved up
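For readers unfamiliar with the pattern, here is a minimal sketch of operand access through `__getattr__`, assuming an expression stores its arguments in `operands` and names them in `_parameters` as in the diff above. The `Head` subclass and its parameter names are hypothetical, and the real method in dask-expr may handle more cases.

```python
class Expr:
    """Minimal stand-in that models only operand lookup."""

    _parameters = []

    def __init__(self, *operands):
        self.operands = list(operands)

    def __getattr__(self, key):
        # __getattr__ runs only after normal lookup fails, so real
        # methods and properties always take precedence over operands.
        _parameters = type(self)._parameters
        if key in _parameters:
            return self.operands[_parameters.index(key)]
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {key!r}"
        )


class Head(Expr):
    # Hypothetical expression with two named operands.
    _parameters = ["frame", "n"]


head = Head("some_frame", 5)
```

Because this logic depends only on `_parameters` and `operands`, it has no dataframe-specific parts, which is consistent with placing it on the base class.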

mrocklin · 2023-06-16T11:52:48Z

dask_expr/frameexpr.py

+    def _simplify_down(self):
+        return
+
+    def _simplify_up(self, parent):
+        return
+
+    def optimize(self, **kwargs):
+        return optimize(self, **kwargs)


These should maybe be in Expr

Oh, I see, optimize is dataframe specific. The simplify methods then at least should probably be moved up.

The simplify steps live in both actually, missed deleting them here

mrocklin · 2023-06-16T11:53:17Z

dask_expr/frameexpr.py

+@normalize_token.register(FrameExpr)
+def normalize_expression(expr):
+    return expr._name


This should probably be in expr

mrocklin · 2023-06-16T11:56:13Z

dask_expr/frameexpr.py

+no_default = "__no_default__"
+
+
+class FrameExpr(Expr):


Maybe?

class Frame(Expr)

Or is that weird for scalars? (maybe scalars shouldn't be Frames?)

In general I dislike this pattern:

class FooBar(Bar)

It always feels needlessly wordy to me. This is subjective though and I'm happy to be overruled.

mrocklin · 2023-06-16T11:57:59Z

I'm curious if we can make Scalar inherit from Expr rather than FrameExpr. This will likely stress things like Add and simplify and meta checks. I think that that'll open up a lot of interesting questions though.

Maybe that's follow-up work though if we want to keep things low-friction.

mrocklin · 2023-06-16T12:10:37Z

OK, so let's imagine that we make it so that scalars don't inherit from FrameExpr, but just Expr. And then we want to add a scalar to a series:

scalar + series

First, we check Scalar.__add__, which probably uses expr.Add normally, but instead notices that the other object has its own Add class and uses that. I guess we have lots of different Adds?

class Add(Expr):
    ...

class Add(FrameExpr, expr.Add):
    ...

This is starting to get a little weird. Maybe it's ok though.
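To make the two-`Add` layout concrete, the sketch below builds the hierarchy in a single file and inspects the resulting method resolution order. The frame-aware class is renamed `FrameAdd` here only because both classes cannot share a name in one module; all class bodies are placeholders, not the dask-expr implementation.

```python
class Expr:
    """Placeholder for the generic expression base."""


class FrameExpr(Expr):
    """Placeholder for the dataframe-specific base."""

    def frame_only(self):
        return "frame method"


class Add(Expr):
    """Generic Add that records only the operation and its operands."""

    def __init__(self, left, right):
        self.left, self.right = left, right


class FrameAdd(FrameExpr, Add):
    """Frame-aware Add that combines both hierarchies."""


# Python linearizes the diamond: frame behavior first, then generic Add.
mro_names = [cls.__name__ for cls in FrameAdd.__mro__]
```

Every operation that needs frame-specific behavior would require a second class like `FrameAdd`, which is the growth in class count being questioned.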

Part of me thinks "we should just use one Add class; expressions are designed to show what the user typed in, and not to capture user object types" (we get into terrible many-class multiple inheritance as above if we try to capture types as well as operations). But then we have to figure out lots of things dynamically, like "which optimization functions should I use in this expression?" and "should this class have frame methods like map_partitions?", and maybe more. Historically, when we've run into these situations we've said "ok, expressions should not have dataframe/series/type-sensitive methods. Those should live on the collection". If we now make expressions generic to the point where they may or may not be dataframes, then we probably lose all of those shared methods.

I bring this up because it's possible that, while this path seems generally pretty good now, it leads us down some odd multi-class hell. This is some small motivation not to merge this until we've looked a bit farther down this path (not pushing hard for that, though).

@rjzamora when you wake up you might want to take a look at this PR.

Closes dask#142 Supercedes dask#158

phofl added 3 commits June 16, 2023 10:56

Refactor Expr class

6ce7591

Merge remote-tracking branch 'upstream/main' into refactor_expr

b34d3dc

# Conflicts: # dask_expr/expr.py # dask_expr/io/parquet.py # dask_expr/reductions.py

Fix

c6360d1

mrocklin reviewed Jun 16, 2023

phofl added 6 commits June 16, 2023 14:03

Remove simplify

a34f04e

Remove simplify

8ad220d

Refactor getattr

8f0a0d2

Refactor getattr

376192f

Rename file

0d79ed1

Rename class

faccb15

phofl added 2 commits June 16, 2023 14:11

Move normalize

937e04c

Remove class arguments

5318941

mrocklin added a commit to mrocklin/dask-expr that referenced this pull request Dec 5, 2023

Split out graph Expr code from Dataframe Expr code

0e411ae

Closes dask#142 Supercedes dask#158

mrocklin mentioned this pull request Dec 5, 2023

Split out graph Expr code from Dataframe Expr code #470

Merged
