New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BoundPredicateVisitor
trait
#320
Conversation
05e6a54
to
8c6d461
Compare
7fbf2a3
to
1b95080
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for this pr! I've left some small suggestions to improve.
1b95080
to
77653b1
Compare
fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>; | ||
fn not(&mut self, inner: Self::T) -> Result<Self::T>; | ||
|
||
fn is_null(&mut self, reference: &BoundReference) -> Result<Self::T>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this approach is that when we add an operator, we need to modify the trait. We can still keep it compatible with default function, but the actual behavior may break when user upgrades library. cc @viirya @Xuanwo @marvinlanhenke What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add default implementations that return something like Err(OperatorNotImplemented)
, then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?
Yes, I agree with that.
Another option is to use following method:
trait BoundPredicateVisitor {
fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {}
}
This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.
I think the two approaches have different pros and cons, let's wait to see others's comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to use following method:
trait BoundPredicateVisitor { fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {} }
I think I'm in favor of keeping the trait as generic as possible - so we don't have to modify it, if we have to support a new operator.
When implementing this trait (no matter which variant) I'd most likely have to overwrite the default behaviour anyway? So I wouldn't bother providing any in the first place and keep the trait as generic and future proof as possible? If that makes any sense to you?
This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.
...wouldn't I have match op
anyway, so having a default match arm in the implementation would solve the issue for the downstream user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?
I think this is the advantage of the op approach, it's left to user to decide what to do. If they want new op to be ignored, they could use a default match arm, otherwise they could choose to do pattern match against each operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this will work with just visitor_op
.
The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference)
would work for a unary operator.
A binary operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literal: &Datum)
, and a set operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &FnvHashSet<Datum>)
.
To resolve this, you'd need the visitor to have fn unary_op(...)
, fn binary_op(...)
, and fn set_op(...)
.
Is that what we want?
Personally I think this is another example of where the current type hierarchy feels awkward - there are other places where we bump into these type hierarchy mismatches, such as needing a default leg in a match on operator to handle operators that will never show up, such as binary ops in a unary expression.
Something more like this feels better to me:
pub enum Predicate {
// By not having Unary / Binary / Set here, and removing
// that layer in favour of having the ops themselves
// as part of this enum, there is no need for code that
// raises errors when the wrong type of `Op` is used, since those
// issues are prevented by the type system at compile time
// Unary ops now only contain one param, the term, and so can be
// implemented more simply
IsNull(Reference),
// ... other Unary ops here,
// Binary / Set ops ditch the `BinaryExpression`, with the literal
// and reference being available directly
GreaterThan {
literal: Datum,
reference: Reference,
},
// ... other Binary ops here,
In {
literals: HashSet<Datum>,
reference: Reference,
},
// ... other Set ops here
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the idea of going further and having an Expression
enum that contains the logical operators plus Predicate
above, with Reference
from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.
pub enum Expression {
// The top level would be Expression rather than Predicate.
// The logical operators would live at this level.
// Since logical operators such as And / Or / Not
// are common between bound and unbound, some visitors
// such as rewrite_not can be applied to both bound and
// unbound expressions, preventing duplication and increasing
// flexibility
/// An expression always evaluates to true.
AlwaysTrue,
/// An expression always evaluates to false.
AlwaysFalse,
// Here we use a struct variant to eliminate the need
// for a generic `LogicalExpression`, which did not
// add much, also allowing us to refer to lhs and rhs
// by name
/// An expression combined by `AND`, for example, `a > 10 AND b < 20`.
And {
lhs: Box<Expression>,
rhs: Box<Expression>
},
/// An expression combined by `OR`, for example, `a > 10 OR b < 20`.
Or {
lhs: Box<Expression>,
rhs: Box<Expression>
},
/// An expression combined by `NOT`, for example, `NOT (a > 10)`.
Not(Box<Expression>),
/// A Predicate
Predicate(Predicate),
}
pub enum Reference {
// Only at the leaf level do we distinguish between
// bound and unbound. This allows things like bind
// and unbind to be implemented as a transforming visitor,
// and avoids duplication of logic compared to having
// Bound / Unbound being distinguished at a higher level
// in the type hierarchy
Bound(BoundReference),
Unbound(UnboundReference),
}
pub struct UnboundReference {
name: String,
}
pub struct BoundReference {
column_name: String,
field: NestedFieldRef,
accessor: StructAccessor,
}
We could use type alises such as type BoundExpression = Expression
and have bind
return a BoundExpression
to prevent bound expressions being used where unbound ones are required and vice versa. But I'm digressing very far from the original comment now, sorry! 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.
I think following signature would be applied to all operators:
fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &[Datum])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.
Hi, @sdd Would you mind to open an issue to raise a discussion about this idea? I think it looks promsing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still wouldn't work without collecting the set into a vec first for set ops. I've updated the PR with an alternative based on an enum, see what you think.
9ae2e80
to
9c4bef4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sdd for this PR. I took a look and left some minor comments.
Regarding your design idea about Predicate
and Expression
-
I also think it looks promising, but I still have some doubts about removing the 'layer' of Unary, Binary, SetExpression
and have the PredicateOperator
directly inside the Predicate
.
In #264 for example, I projected the transform based on the 'group of ops' e.g. Unary, Binary or Set. If we remove this layer I'd have to match each op and handle them individually. The extra layer of indirection was helping me in this use-case. Also what about adding, removing changing ops? With an extra layer we should have an easier time modifying the layer below without affecting the downstream consumers, that otherwise match the ops directly and could be breaking?
However, I still like your approach and I think an extra issue where we can discuss this would be very useful. I think more and different views will definitely help (since I have a very limited view and experience on the codebase after all).
9b491ea
to
3716d7b
Compare
I've added a test for Also, based on our discussion in the comments, it seemed prudent to add the |
a295f83
to
033994a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @sdd 🙌 Would be great to get this in, since it is blocking a few follow-up PRs. Thanks for breaking them out!
|
||
visitor.not(inner_result) | ||
} | ||
BoundPredicate::Unary(expr) => visitor.op(expr.op(), expr.term(), None, predicate), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PyIceberg we've split out the operator in a per-operator basis: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L256
The InclusiveMetricsEvaluator is already quite a beefy visitor, and by having it on this level, 95% will fall in a the op
method: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L1137
I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set}
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.
If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.
With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.
With the approach in the PR, if a new Op gets added, you have three options available when implementing BoundPredicateVisitor
for a struct (if I remove the #[non_exhaustive]
macro from PredicateOperator
):
- Exhaustively specify a match arm for every Op. If another Op gets added, your BoundPredicateVisitor will no longer compile
- Specify a default match arm, and call
panic!()
in it. When a new Op gets added, your code still compiles but you get a runtime crash if the visitor encounters the new operator - Specify a default match arm, and
return Err(NotImplemented)
in it. When a new Op gets added, your code still compiles but you will get a handleable error if you encounter the new operator - potentially failing an invocation but not taking the service down.
That provides the capability that you mention from Python, as well as other options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdd Sorry for late reply, I think we are quite close to the final version. I found another way to write op method with generics, what do you think? Others LGTM, thanks!
/// or SetPredicate. Passes the predicate's operator in all cases, | ||
/// as well as the term and literals in the case of binary and aet | ||
/// predicates. | ||
fn op( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Renjie! Thanks for the comment. I'm not totally convinced with your suggestion - we're trying to get the best performance, but by passing Set literals through as an iterator, rather than passing a reference to the HashSet, it requires either collecting the iterator, resulting in an allocation, or iterating over potentially the whole iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generating an iterator for hashset will not introduce allocation, but you are right that user may need to iterating the whole iterator to implement their logic. Method in this pr seems kind of overdesign and counterintutive for user to implement. In this case I think your original design seems more intuitive to me, as @Fokko said adding operator maybe not too frequent. Sorry for the back and forth, I didn't expect this approach to be so complicated and introduce so much overhead.
cc @Fokko @marvinlanhenke @Xuanwo What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is the original design, which is (confusingly) in the "alternative" PRs 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sold on the iterator idea either, it looks very fancy, but iterating over a set feels awkward to me.
If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.
Well, I'm not the one to decide here, just weighing in my opinion :D I have a strong preference for the original approach as suggested in #334 I think this leverages the visitor as much as possible, and also makes sure that all the operations are covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also +1 for the original approach after comparing these two approaches.
033994a
to
c86c768
Compare
Close by #334 |
This trait is used for
BoundPredicate
visitors that evaluate the predicate, and associated tests. It has been broken out of the #241 PR as that was getting too large.