Refactor Expr class #158

Open
wants to merge 11 commits into
base: main

Collaborator

@phofl phofl commented Jun 16, 2023

This is a first suggestion. There is still follow-up work to do (e.g. adapting new_collection), but this will be a pain to keep in sync, so I want to keep the scope as small as possible.

# Conflicts:
#	dask_expr/expr.py
#	dask_expr/io/parquet.py
#	dask_expr/reductions.py
Member

@mrocklin mrocklin left a comment

Some comments

@@ -33,8 +19,7 @@
 class Expr:
     """Primary class for all Expressions

-    This mostly includes Dask protocols and various Pandas-like method
-    definitions to make us look more like a DataFrame.
+    This mostly includes Dask protocols.
     """

     commutative = False
Member

Unrelated to this PR, but we can remove commutative / associative

Member

These are left over from the matchpy days


    @property
    def npartitions(self):
        raise NotImplementedError
Member

Maybe npartitions should move over? Not a huge deal I guess if collections like scalar or delayed return 1 though.

Collaborator Author

I checked this against the dask/dask implementation. For collections like bag, npartitions is given through init, which is why I ordered it like this.


    def _task(self, index: int):
        assert index == 0
        return self.value
Member

Maybe Literal stays? I wonder if we can remove divisions/meta and still have things work.

Collaborator Author

I can take a look as a follow-up if ok?

@@ -0,0 +1,1375 @@
from __future__ import annotations
Member

I think I'm not a fan of the name frameexpr.py. Maybe frame.py, maybe dataframe/expr.py? Thoughts?

Collaborator Author

I am ok with frame.py and Frame(Expr)

    """

    def __hash__(self):
        return hash(self._name)
Member

I'm curious about your thoughts on where hash should live. Is this because _name may not be on all expressions? (maybe it should?)

Collaborator Author

This is very weird: a lot of tests fail if I add this only to Expr, but it was hard to investigate alongside all the other changes, so I put it here as well for now.

Member

You might need to move over __eq__ as well. Python has weird rules around hash and eq together.
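
The rule in question: a class that defines __eq__ without also defining __hash__ gets its __hash__ implicitly set to None, even when a parent class defines one. A minimal sketch of that behaviour, using hypothetical stand-in classes (Expr, BrokenFrame, Frame, and is_hashable are illustrative names, not the actual dask-expr code):

```python
class Expr:
    """Hypothetical stand-in for the base class; defines __hash__ only."""

    def __init__(self, name):
        self._name = name

    def __hash__(self):
        return hash(self._name)


class BrokenFrame(Expr):
    """Overrides __eq__ only, so Python sets __hash__ to None on this class."""

    def __eq__(self, other):
        return type(self) is type(other) and self._name == other._name


class Frame(Expr):
    """Defines __eq__ and __hash__ together, so instances stay hashable."""

    def __eq__(self, other):
        return type(self) is type(other) and self._name == other._name

    def __hash__(self):
        return hash(self._name)


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

If the subclass keeps __eq__ while __hash__ lives only on the base class, its instances become unhashable, which could plausibly explain failures of the kind described above.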

Comment on lines -135 to -145
    def __getattr__(self, key):
        try:
            return object.__getattribute__(self, key)
        except AttributeError as err:
            # Allow operands to be accessed as attributes
            # as long as the keys are not already reserved
            # by existing methods/properties
            _parameters = type(self)._parameters
            if key in _parameters:
                idx = _parameters.index(key)
                return self.operands[idx]
Member

Probably we want this top bit in Expr

Collaborator Author

Oh good point, moved up
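
For readers unfamiliar with the pattern, here is a rough, self-contained sketch of operand access through __getattr__ on a generic base class. It is a simplification under assumptions, not the code in this PR: the Head subclass and its parameter names are hypothetical, and the fallback raises AttributeError explicitly.

```python
class Expr:
    """Hypothetical minimal base class; not the actual dask-expr code."""

    _parameters = []

    def __init__(self, *operands):
        self.operands = list(operands)

    def __getattr__(self, key):
        # __getattr__ only runs after normal attribute lookup has failed,
        # so existing methods and properties always take precedence.
        _parameters = type(self)._parameters
        if key in _parameters:
            return self.operands[_parameters.index(key)]
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {key!r}"
        )


class Head(Expr):
    # Hypothetical subclass: operand names are declared positionally.
    _parameters = ["frame", "n"]


head = Head("my_frame", 5)
print(head.frame, head.n)
```

Because the lookup depends only on _parameters and operands, nothing in it is dataframe specific, which fits the suggestion to keep it on the base class.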

Comment on lines 90 to 97
    def _simplify_down(self):
        return

    def _simplify_up(self, parent):
        return

    def optimize(self, **kwargs):
        return optimize(self, **kwargs)
Member

These should maybe be in Expr

Member

Oh, I see, optimize is dataframe specific. The simplify methods then at least should probably be moved up.

Collaborator Author

The simplify steps live in both actually, missed deleting them here

Comment on lines 1124 to 1126
@normalize_token.register(FrameExpr)
def normalize_expression(expr):
    return expr._name
Member

This should probably be in expr

Collaborator Author

Moved

no_default = "__no_default__"


class FrameExpr(Expr):
Member

Maybe?

class Frame(Expr)

Or is that weird for scalars? (maybe scalars shouldn't be Frames?)

In general I dislike

class FooBar(Bar)

It always feels needlessly wordy to me. This is subjective though and I'm happy to be overruled.

Collaborator Author

Renamed

@mrocklin
Member

I'm curious if we can make Scalar inherit from Expr rather than FrameExpr. This will likely stress things like Add and simplify and meta checks. I think that that'll open up a lot of interesting questions though.

Maybe that's follow-up work though if we want to keep things low-friction.

@mrocklin
Member

OK, so let's imagine that we make it so that scalars don't inherit from FrameExpr, but just Expr. And then we want to add a scalar to a series:

scalar + series

First, we check Scalar.__add__, which probably uses expr.Add normally, but instead notices that the other object has its own Add class and uses that. I guess we have lots of different Adds?

class Add(Expr):
    ...

class Add(FrameExpr, expr.Add):
    ...

This is starting to get a little weird. Maybe it's ok though.
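
To make the scenario concrete, here is a rough single-file sketch of how that dispatch might look. Everything in it is hypothetical: the _add_cls attribute is an invented hook, the frame-specific Add is renamed FrameAdd so both classes can live in one file, and Scalar and Series are bare placeholders.

```python
class Expr:
    """Hypothetical generic expression base."""

    # Invented hook: each family of expressions names its preferred Add class.
    _add_cls = None

    def __init__(self, *operands):
        self.operands = list(operands)

    def __add__(self, other):
        left_cls = self._add_cls
        other_cls = getattr(other, "_add_cls", left_cls)
        # Use the other operand's Add class when it is the more specific one.
        add_cls = other_cls if issubclass(other_cls, left_cls) else left_cls
        return add_cls(self, other)


class FrameExpr(Expr):
    """Hypothetical dataframe-aware base carrying frame-only methods."""

    def map_partitions(self, func):
        raise NotImplementedError  # placeholder for frame-specific behaviour


class Add(Expr):
    """Generic addition."""


class FrameAdd(FrameExpr, Add):
    """Addition where at least one operand is frame-like."""


# Wire up the hook once both Add classes exist.
Expr._add_cls = Add
FrameExpr._add_cls = FrameAdd


class Scalar(Expr):
    pass


class Series(FrameExpr):
    pass


scalar, series = Scalar(1), Series("x")
print(type(scalar + series).__name__)
print(type(scalar + scalar).__name__)
```

Under this design every binary operation would need a similar generic/frame pair, which is the class proliferation being questioned here.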

Part of me thinks "we should just use one Add class; expressions are designed to show what the user typed in, and not to capture user object types" (we get into terrible many-class multiple inheritance, as above, if we try to capture types as well as operations). But then we have to figure out lots of things dynamically, like "which optimization functions should I use in this expression?" and "should this class have frame methods like map_partitions, and maybe more?" Historically, when we've run into these situations we've said "ok, expressions should not have dataframe/series/type-sensitive methods. Those should live on the collection". If we now make expressions generic to the point where they may or may not be dataframes, then we probably lose all of those shared methods.

I bring this up because it's possible that, while this path seems generally pretty good now, it leads us down some odd multi-class hell. This is some small motivation to not merge this yet until we've looked a bit farther down this path (not pushing for that hard though).

@rjzamora when you wake up you might want to take a look at this PR.

mrocklin added a commit to mrocklin/dask-expr that referenced this pull request Dec 5, 2023
Successfully merging this pull request may close these issues.

Remove DataFrame assumptions from Expr
2 participants