Fix predicate-pushdown compute #2
Conversation
Cool. Things here seem generally fine to me. As a heads-up, I'll be mostly unresponsive at least tomorrow and maybe Friday as well.
FWIW, if this actually becomes a real library then my plan was to use it as an opportunity to reset a lot of things, like using fastparquet and depending on pandas < 2. If cudf could grow support for filtering, that would be nice. I actually removed the engine keyword.

If you wanted to play with replacement rules you could add back in the engine keyword, and then add in new replacement rules that matched against it. Note that if you do do this then you'll need to leave … I mention this not because I necessarily want you to do this work (there are probably more important things to sort out), but because I think it might give you a feel for replacement rules, which could be interesting / fun / educational.
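[Editor's sketch] To give a feel for the shape of a replacement rule, here is a toy, self-contained example assuming a matchpy-style rewrite system (which the "replacement rules" terminology suggests). The ReadParquet/Filter operations and the rewrite itself are illustrative stand-ins, not this library's actual expression classes or rules:

```python
# Toy sketch of a replacement rule; ReadParquet/Filter here are
# made-up stand-ins, not dask_match's real expression classes.
from matchpy import (
    Arity, Operation, Pattern, ReplacementRule, Symbol, Wildcard, replace_all
)

ReadParquet = Operation.new("ReadParquet", Arity.unary)
Filter = Operation.new("Filter", Arity.binary)

path = Wildcard.dot("path")
predicate = Wildcard.dot("predicate")

# Rewrite Filter(ReadParquet(path), predicate) into a bare ReadParquet;
# a real rule would push the predicate into the read instead of
# dropping it, as the PR description below discusses.
rule = ReplacementRule(
    Pattern(Filter(ReadParquet(path), predicate)),
    lambda path, predicate: ReadParquet(path),
)

expr = Filter(ReadParquet(Symbol("myfile.parquet")), Symbol("x > 4"))
print(replace_all(expr, [rule]))  # -> ReadParquet(myfile.parquet)
```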
I mostly agree with this, but will obviously insist on making it as easy as possible to use a cudf backend as early as possible.
Ah, good point
Thanks for sharing these thoughts. It is likely that I will end up playing with rules like this. However, the more-immediate issue is probably that tests unrelated to parquet/predicate_pushdown are still failing in this PR.

Fair point. Maybe we should pull over assert_eq into this library and modify it to our own needs?

I'm open to this.

My eyes are back in a decent state. Is this ready for review?

I came down with something yesterday, so haven't pushed on this since last week; you are welcome to review and/or make changes. If I remember correctly, this PR is probably trying to do too many things at once (and non-pushdown tests are not passing).
```python
@property
def index(self):
    return Index(self)
```
I'm wondering if we want various base classes, like FrameExpr, DataFrameExpr, SeriesExpr, and ScalarExpr, to make sure we only expose attributes like index when it makes sense.
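[Editor's sketch] A rough picture of the hierarchy being suggested; the class layout and stub bodies here are illustrative guesses, not the library's actual API:

```python
class Index:
    """Stub standing in for the real Index expression."""
    def __init__(self, frame):
        self.frame = frame

class Expr:
    """Base for all expression nodes; exposes nothing frame-specific."""
    _meta = None  # the real class would compute pandas metadata

class FrameExpr(Expr):
    """Expressions evaluating to a DataFrame- or Series-like collection."""
    @property
    def index(self):
        # Only collection-like expressions expose .index
        return Index(self)

class DataFrameExpr(FrameExpr):
    @property
    def columns(self):
        # A column axis only makes sense for DataFrame-like results
        return self._meta.columns

class SeriesExpr(FrameExpr):
    @property
    def name(self):
        return self._meta.name

class ScalarExpr(Expr):
    """Scalars expose neither .index nor .columns."""
```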
dask_match/io/parquet.py
```python
@property
def input_columns(self):
    return self.operands[self._parameters.index("columns")]

@property
def columns(self):
    if self.input_columns is None:
        return self._meta.columns
    else:
        import pandas as pd

        return pd.Index(_list_columns(self.input_columns))
```
Note that overlapping parameters and properties can be a bit confusing, and I don't really like the solution used here. We may want to require each class to define a set of reserved names, and forbid any parameter name from intersecting with that set. This means the user-facing read_parquet API, for example, would need to translate arguments like columns into a different name (like column_projection), so that the ReadParquet implementation wouldn't need to worry about implementing anything like the input_columns workarounds used here.
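[Editor's sketch] One way the reserved-name check could be enforced once, at class-definition time. This is only a sketch; the names _reserved_names and column_projection are hypothetical:

```python
class Expr:
    _parameters = ()
    # Attribute names that parameter names may never shadow (hypothetical)
    _reserved_names = {"columns", "index", "dtypes"}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        clashes = set(cls._parameters) & cls._reserved_names
        if clashes:
            raise TypeError(
                f"{cls.__name__} parameters {sorted(clashes)} collide "
                "with reserved attribute names"
            )

class ReadParquet(Expr):
    # The user-facing read_parquet() would translate its columns=
    # argument into the internal column_projection name.
    _parameters = ("path", "column_projection", "filters")

# This, by contrast, would raise TypeError at definition time:
# class BadReadParquet(Expr):
#     _parameters = ("path", "columns", "filters")
```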
```python
def __getattr__(self, key):
    if key == "__name__":
        return object.__getattribute__(self, key)
    elif key in type(self)._parameters:
        idx = type(self)._parameters.index(key)
        return self.operands[idx]
    elif key in dir(type(self)):
        return object.__getattribute__(self, key)
    elif is_dataframe_like(self._meta) and key in self._meta.columns:
        return self[key]
    else:
        return object.__getattribute__(self, key)

def operand(self, key):
    return self.operands[type(self)._parameters.index(key)]
```
Note that commit 6f191e4 demonstrates one possible approach to the proposal in #4 (adding a distinct operand API for accessing the parameters used to create the expression). Although this prohibits us from using self.<operand> syntax, I feel that sacrificing this shorthand makes it much easier to avoid recursion traps and unexpected attribute/column-name collisions.
```python
def _layer(self):
    return {
        (self._name, i): (getattr, (self.operand("frame")._name, i), "index")
        # The diff excerpt was truncated here; presumably the dict
        # comprehension continues with one task per partition, e.g.:
        for i in range(self.npartitions)
    }
```
Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with".
> Why was .operand necessary here? This seems unpleasant to do for all operands. I'm optimizing pretty hard here for "pleasant to work with"
I'm totally on board with the "pleasant to work with" goal, and so the operand proposal may not be the best solution. Overall, I found the current practice of accessing parameters/operands as attributes (e.g. self.<operand>) to be surprisingly problematic, and eventually rationalized the idea to forbid the practice altogether. I realize that the self.<operand> pattern can be supported just fine, but I'm not quite convinced that it is worth the potential for pain (though my mind is open).
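[Editor's sketch] A toy illustration of the collision being described (not dask_match code): once attribute lookup falls back to both parameter names and column names, the two namespaces can silently shadow one another:

```python
import pandas as pd

class Node:
    _parameters = ("frame", "columns")

    def __init__(self, *operands):
        self.operands = list(operands)

    def __getattr__(self, key):
        # Fall back to parameters by name, like the __getattr__ above
        if key in type(self)._parameters:
            return self.operands[type(self)._parameters.index(key)]
        raise AttributeError(key)

df = pd.DataFrame({"columns": [1, 2]})  # a column literally named "columns"
node = Node(df, None)

# Is node.columns the "columns" parameter, the column named "columns",
# or pandas' DataFrame.columns? Here it silently resolves to the
# parameter; an explicit node.operand("columns") leaves no ambiguity.
print(node.columns)  # -> None
```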
Anything further to do here @rjzamora?
These changes were addressed in #6.
Updates ReadParquet and test_predicate_pushdown to get the correct result after compute.

Future TODO: Currently, the read_parquet engine is not required to perform row-wise filtering to satisfy the filters argument. For the pyarrow engine you will get row-wise filtering, but the fastparquet and cudf engines don't do this yet. Since the replacement rules result in the removal of the explicit Filter operation, we need to make sure that the partition-wise IO function ensures that row-wise filtering has been applied.
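[Editor's sketch] A minimal picture of what "ensure row-wise filtering in the partition-wise IO function" could look like. The helper names and the simple conjunctive (column, op, value) filter format are assumptions, not the actual dask_match implementation:

```python
import operator
import pandas as pd

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

def _apply_filters(df: pd.DataFrame, filters) -> pd.DataFrame:
    # Treat `filters` as a conjunction of (column, op, value) tuples,
    # e.g. [("x", ">", 4)], and apply it row-wise after the read.
    for column, op, value in filters:
        df = df[_OPS[op](df[column], value)]
    return df

def read_partition(path, columns=None, filters=None):
    # The engine may only use `filters` to prune row groups, so we
    # re-apply the predicate here to guarantee row-wise filtering.
    df = pd.read_parquet(path, columns=columns)
    if filters:
        df = _apply_filters(df, filters)
    return df
```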