
Fix predicate-pushdown compute #2

Closed
wants to merge 5 commits into from

Conversation

rjzamora
Member

@rjzamora rjzamora commented Feb 22, 2023

Updates ReadParquet and test_predicate_pushdown to get the correct result after compute.

Future TODO: Currently, the read_parquet engine is not required to perform row-wise filtering to satisfy the filters argument. For the pyarrow engine, you will get row-wise filtering, but the fastparquet and cudf engines don't do this yet. Since the replacement rules remove the explicit Filter operation, we need to make sure that the partition-wise IO function ensures row-wise filtering has been applied.
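To make the concern concrete, here is a minimal sketch (hypothetical helper, not dask-expr code) of what "row-wise filtering inside the IO function" could look like: after an engine reads a partition, a DNF-style `filters` list is applied as a boolean mask, so the result is correct even if the engine only filtered at the row-group level. The name `apply_filters_rowwise` is illustrative.

```python
import operator

import pandas as pd

# Hypothetical helper (not part of dask-expr): map filter operator
# strings onto pandas-compatible comparison functions.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def apply_filters_rowwise(df: pd.DataFrame, filters) -> pd.DataFrame:
    """Apply DNF filters [[(col, op, value), ...], ...] row-wise.

    Inner lists are AND-ed conjunctions; outer lists are OR-ed together,
    mirroring the usual parquet ``filters`` convention.
    """
    if filters and isinstance(filters[0], tuple):
        filters = [filters]  # normalize a single conjunction
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:
        sub = pd.Series(True, index=df.index)
        for col, op, value in conjunction:
            sub &= _OPS[op](df[col], value)
        mask |= sub
    return df[mask]


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": list("aabb")})
out = apply_filters_rowwise(df, [("x", ">", 1), ("y", "==", "b")])
```

An engine that already filters row-wise (like pyarrow) would make this a cheap no-op; for fastparquet or cudf it would restore correctness after the Filter expression is optimized away.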

@mrocklin
Member

Cool. Things here seem generally fine to me.

As a heads-up, I'll be mostly unresponsive at least tomorrow and maybe Friday as well.

@mrocklin
Member

but the fastparquet and cudf engines don't do this yet

FWIW if this actually becomes a real library then my plan was to use it as an opportunity to reset a lot of things like using fastparquet, and depending on pandas < 2. If cudf could grow support for filtering that would be nice.

I actually removed the engine keyword from your code with this in mind. I hadn't thought of cudf though.

Since the replacement rules result in the removal of the explicit Filter operation

If you wanted to play with replacement rules you could add back in the engine keyword, and then add new replacement rules that match against ReadParquet(filename, columns=a, filters=b, engine="pyarrow") and ReadParquet(filename, columns=a, filters=b, engine="cudf"), or just leave off the cudf replacement rules for now and force engine="pyarrow". That way you'll know that these rules won't apply when engine is anything else.

Note that if you do do this then you'll need to leave self.engine as a string, and not as an Engine object.

I mention this not because I necessarily want you to do this work (there are probably more important things to sort out) but because I think it might give you a feel for replacement rules, which could be interesting / fun / educational.
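As a toy illustration of the idea above (this is a standalone sketch, not dask-expr's actual rule machinery; all class and function names are hypothetical): a rewrite only fires when the engine string is one known to perform row-wise filtering, and otherwise the explicit Filter stays in the graph.

```python
from dataclasses import dataclass, replace


# Hypothetical, simplified expression nodes (not dask-expr classes).
@dataclass(frozen=True)
class ReadParquet:
    filename: str
    columns: tuple = None
    filters: tuple = None
    engine: str = "pyarrow"  # kept as a string, per the note above


@dataclass(frozen=True)
class Filter:
    frame: object
    predicate: tuple


# Engines known to honor ``filters`` row-wise; "cudf" could be added
# once it grows support for row-wise filtering.
ROWWISE_ENGINES = {"pyarrow"}


def pushdown_filter(expr):
    """Rewrite Filter(ReadParquet) -> ReadParquet(filters=...) when safe."""
    if (
        isinstance(expr, Filter)
        and isinstance(expr.frame, ReadParquet)
        and expr.frame.engine in ROWWISE_ENGINES
    ):
        return replace(expr.frame, filters=(expr.predicate,))
    return expr  # engine can't guarantee row-wise filtering: keep Filter


optimized = pushdown_filter(Filter(ReadParquet("data.parquet"), ("x", ">", 1)))
kept = pushdown_filter(
    Filter(ReadParquet("data.parquet", engine="fastparquet"), ("x", ">", 1))
)
```

The pyarrow read collapses into a single ReadParquet carrying the predicate, while the fastparquet expression is left untouched, which matches the "rules won't apply when engine is anything else" guarantee.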

@rjzamora
Member Author

FWIW if this actually becomes a real library then my plan was to use it as an opportunity to reset a lot of things like using fastparquet, and depending on pandas < 2. If cudf could grow support for filtering that would be nice.

I actually removed the engine keyword from your code with this in mind. I hadn't thought of cudf though.

I mostly agree with this, but will obviously insist on making it as easy as possible to use a cudf backend as early as possible.

Note that if you do do this then you'll need to leave self.engine as a string, and not as an Engine object.

Ah, good point

I mention this not because I necessarily want you to do this work (there are probably more important things to sort out) but because I think it might give you a feel for replacement rules, which could be interesting / fun / educational.

Thanks for sharing these thoughts. It is likely that I will end up playing with rules like this. However, the more-immediate issue is probably that tests unrelated to parquet/predicate_pushdown are still failing in assert_eq. So we probably want to iron out the best way to validate and compare an expression to an expected pandas result.

@mrocklin
Member

Fair point. Maybe we should pull over assert_eq into this library and modify it to our own needs?

@rjzamora
Member Author

Fair point. Maybe we should pull over assert_eq into this library and modify it to our own needs?

I'm open to this

@mrocklin
Member

My eyes are back in a decent state. Is this ready for review?

@rjzamora
Member Author

rjzamora commented Feb 28, 2023

My eyes are back in a decent state. Is this ready for review?

I came down with something yesterday, so I haven't pushed on this since last week. You are welcome to review and/or make changes. If I remember correctly, this PR is probably trying to do too many things at once (and non-pushdown tests are not passing).

@rjzamora rjzamora marked this pull request as ready for review March 7, 2023 21:27
Comment on lines +113 to +115
@property
def index(self):
return Index(self)
Member Author

I'm wondering if we want various base classes, like FrameExpr, DataFrameExpr, SeriesExpr, and ScalarExpr to make sure we only expose attributes like index when it makes sense.
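A minimal sketch of what that hierarchy could look like (all class names here are the hypothetical ones floated above, not existing dask-expr classes): only frame-like expressions expose `index`, so a scalar expression simply doesn't have the attribute.

```python
# Hypothetical base-class layout: attributes like ``index`` live only on
# the expression types where they make sense.
class Expr:
    """Common base for all expressions."""


class ScalarExpr(Expr):
    """Scalar result: no index, no columns."""


class FrameExpr(Expr):
    """Anything index-backed (DataFrame, Series, Index)."""

    @property
    def index(self):
        return Index(self)


class DataFrameExpr(FrameExpr):
    pass


class SeriesExpr(FrameExpr):
    pass


class Index(FrameExpr):
    def __init__(self, frame):
        self.frame = frame


frame_index = DataFrameExpr().index  # fine: frame-like expressions have .index
```

With this split, `ScalarExpr().index` raises AttributeError at attribute-lookup time instead of producing a nonsensical Index expression.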

Comment on lines 74 to 85
@property
def input_columns(self):
return self.operands[self._parameters.index("columns")]

@property
def columns(self):
if self.input_columns is None:
return self._meta.columns
else:
import pandas as pd

return pd.Index(_list_columns(self.input_columns))
Member Author

Note that overlapping parameters and properties can be a bit confusing, and I don't really like the solution used here.

We may want to require each class to define a set of reserved names, and forbid any parameter name from intersecting with that set. This means the user-facing read_parquet API, for example, would need to translate arguments like columns into a different name (like column_projection), so that the ReadParquet implementation wouldn't need to worry about implementing anything like the input_columns workarounds used here.
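The reserved-name check proposed above could be enforced at class-definition time; here is one possible sketch (hypothetical API, not dask-expr code) using `__init_subclass__` so a clashing parameter list fails immediately rather than surfacing later as an `input_columns`-style workaround.

```python
# Hypothetical enforcement of reserved names: any Expr subclass whose
# _parameters intersect the reserved set is rejected when defined.
RESERVED_NAMES = {"columns", "index", "dtypes"}


class Expr:
    _parameters = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        clash = RESERVED_NAMES.intersection(cls._parameters)
        if clash:
            raise TypeError(
                f"{cls.__name__}: reserved parameter name(s) {sorted(clash)}"
            )


class ReadParquet(Expr):
    # The user-facing read_parquet API would translate ``columns`` into a
    # different name (e.g. ``column_projection``) before reaching here.
    _parameters = ["filename", "column_projection", "filters"]


# Defining a class with a clashing parameter fails up front:
try:

    class BadReadParquet(Expr):
        _parameters = ["filename", "columns"]  # clashes with RESERVED_NAMES

    definition_failed = False
except TypeError:
    definition_failed = True
```

This moves the collision from a subtle runtime attribute-resolution bug to a loud error at import time.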

Comment on lines 100 to +111
def __getattr__(self, key):
if key == "__name__":
return object.__getattribute__(self, key)
elif key in type(self)._parameters:
idx = type(self)._parameters.index(key)
return self.operands[idx]
elif key in dir(type(self)):
return object.__getattribute__(self, key)
elif is_dataframe_like(self._meta) and key in self._meta.columns:
return self[key]
else:
return object.__getattribute__(self, key)

def operand(self, key):
return self.operands[type(self)._parameters.index(key)]
Member Author

Note that commit 6f191e4 demonstrates one possible approach to the proposal in #4 (adding a distinct operand API for accessing parameters used to create the expression). Although this prohibits us from using self.<operand> syntax, I feel that sacrificing this short-hand makes it much easier to avoid recursion traps and unexpected attribute/column-name collisions.


def _layer(self):
return {
(self._name, i): (getattr, (self.operand("frame")._name, i), "index")
Member

Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with"

Member Author

Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with"

I'm totally on board with the "pleasant to work with" goal, so the operand proposal may not be the best solution. Overall, I found the current practice of accessing parameters/operands as attributes (e.g. self.<operand>) to be surprisingly problematic, and eventually came around to the idea of forbidding the practice altogether. I realize that the self.<operand> pattern can be supported just fine, but I'm not quite convinced that it is worth the potential for pain (though my mind is open).
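A toy illustration of the collision hazard being discussed (standalone sketch, not dask-expr code): when `__getattr__` falls back to column selection, a column that happens to share a name with an operand silently shadows it, while an explicit `operand()` lookup stays unambiguous.

```python
class Expr:
    _parameters = ["frame"]
    # Imagine the underlying data happens to have a column named "frame":
    _columns = ["frame", "x"]

    def __init__(self, *operands):
        self.operands = list(operands)

    def operand(self, key):
        """Unambiguous lookup: always resolves to the operand."""
        return self.operands[type(self)._parameters.index(key)]

    def __getattr__(self, key):
        # Attribute-style sugar: column names win, so an operand named
        # like a column is silently shadowed by column selection.
        if key in type(self)._columns:
            return ("column", key)
        return self.operand(key)


expr = Expr("some-input-frame")
shadowed = expr.frame              # resolves to the *column*, not the operand
actual = expr.operand("frame")     # resolves to the operand
```

This is the kind of attribute/column-name collision that the explicit `operand` API sidesteps, at the cost of the `self.<operand>` shorthand.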

@mrocklin
Member

Anything further to do here @rjzamora ?

@rjzamora
Member Author

These changes were addressed in #6

@rjzamora rjzamora closed this Mar 29, 2023
@rjzamora rjzamora deleted the fix-predicate-pushdown-compute branch March 29, 2023 18:55
hendrikmakait added a commit to hendrikmakait/dask-expr that referenced this pull request Oct 25, 2023