Add `BoundPredicateVisitor` trait #320

sdd · 2024-04-04T20:13:50Z

This trait is used for BoundPredicate visitors that evaluate the predicate, and associated tests. It has been broken out of the #241 PR as that was getting too large.

liurenjie1024

Thanks @sdd for this pr! I've left some small suggestions to improve.

crates/iceberg/src/expr/predicate.rs

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

liurenjie1024 · 2024-04-05T09:04:58Z

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

+    fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>;
+    fn not(&mut self, inner: Self::T) -> Result<Self::T>;
+
+    fn is_null(&mut self, reference: &BoundReference) -> Result<Self::T>;


The problem with this approach is that when we add an operator, we need to modify the trait. We can still keep it compatible with default function, but the actual behavior may break when user upgrades library. cc @viirya @Xuanwo @marvinlanhenke What do you think?

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

Yes, I agree with that.

Another option is to use following method:

trait BoundPredicateVisitor { fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {} }

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

I think the two approaches have different pros and cons, let's wait to see others's comments.

Another option is to use following method:

trait BoundPredicateVisitor { fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {} }

I think I'm in favor of keeping the trait as generic as possible - so we don't have to modify it, if we have to support a new operator.

When implementing this trait (no matter which variant) I'd most likely have to overwrite the default behaviour anyway? So I wouldn't bother providing any in the first place and keep the trait as generic and future proof as possible? If that makes any sense to you?

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

I think this is the advantage of the op approach, it's left to user to decide what to do. If they want new op to be ignored, they could use a default match arm, otherwise they could choose to do pattern match against each operator.

I don't see how this will work with just visitor_op.

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

A binary operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literal: &Datum), and a set operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &FnvHashSet<Datum>).

To resolve this, you'd need the visitor to have fn unary_op(...), fn binary_op(...), and fn set_op(...).

Is that what we want?

Personally I think this is another example of where the current type hierarchy feels awkward - there are other places where we bump into these type hierarchy mismatches, such as needing a default leg in a match on operator to handle operators that will never show up, such as binary ops in a unary expression.

Something more like this feels better to me:

pub enum Predicate { // By not having Unary / Binary / Set here, and removing // that layer in favour of having the ops themselves // as part of this enum, there is no need for code that // raises errors when the wrong type of `Op` is used, since those // issues are prevented by the type system at compile time // Unary ops now only contain one param, the term, and so can be // implemented more simply IsNull(Reference), // ... other Unary ops here, // Binary / Set ops ditch the `BinaryExpression`, with the literal // and reference being available directly GreaterThan { literal: Datum, reference: Reference, }, // ... other Binary ops here, In { literals: HashSet<Datum>, reference: Reference, }, // ... other Set ops here }

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

pub enum Expression { // The top level would be Expression rather than Predicate. // The logical operators would live at this level. // Since logical operators such as And / Or / Not // are common between bound and unbound, some visitors // such as rewrite_not can be applied to both bound and // unbound expressions, preventing duplication and increasing // flexibility /// An expression always evaluates to true. AlwaysTrue, /// An expression always evaluates to false. AlwaysFalse, // Here we use a struct variant to eliminate the need // for a generic `LogicalExpression`, which did not // add much, also allowing us to refer to lhs and rhs // by name /// An expression combined by `AND`, for example, `a > 10 AND b < 20`. And { lhs: Box<Expression>, rhs: Box<Expression> }, /// An expression combined by `OR`, for example, `a > 10 OR b < 20`. Or { lhs: Box<Expression>, rhs: Box<Expression> }, /// An expression combined by `NOT`, for example, `NOT (a > 10)`. Not(Box<Expression>), /// A Predicate Predicate(Predicate), } pub enum Reference { // Only at the leaf level do we distinguish between // bound and unbound. This allows things like bind // and unbind to be implemented as a transforming visitor, // and avoids duplication of logic compared to having // Bound / Unbound being distinguished at a higher level // in the type hierarchy Bound(BoundReference), Unbound(UnboundReference), } pub struct UnboundReference { name: String, } pub struct BoundReference { column_name: String, field: NestedFieldRef, accessor: StructAccessor, }

We could use type alises such as type BoundExpression = Expression and have bind return a BoundExpression to prevent bound expressions being used where unbound ones are required and vice versa. But I'm digressing very far from the original comment now, sorry! 😆

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

I think following signature would be applied to all operators:

fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &[Datum])

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

Hi, @sdd Would you mind to open an issue to raise a discussion about this idea? I think it looks promsing to me.

That still wouldn't work without collecting the set into a vec first for set ops. I've updated the PR with an alternative based on an enum, see what you think.

marvinlanhenke

thanks @sdd for this PR. I took a look and left some minor comments.

Regarding your design idea about Predicate and Expression -
I also think it looks promising, but I still have some doubts about removing the 'layer' of Unary, Binary, SetExpression and have the PredicateOperator directly inside the Predicate.

In #264 for example, I projected the transform based on the 'group of ops' e.g. Unary, Binary or Set. If we remove this layer I'd have to match each op and handle them individually. The extra layer of indirection was helping me in this use-case. Also what about adding, removing changing ops? With an extra layer we should have an easier time modifying the layer below without affecting the downstream consumers, that otherwise match the ops directly and could be breaking?

However, I still like your approach and I think an extra issue where we can discuss this would be very useful. I think more and different views will definitely help (since I have a very limited view and experience on the codebase after all).

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

sdd · 2024-04-07T09:53:58Z

I've added a test for visit_op, as well as some doc comments.

Also, based on our discussion in the comments, it seemed prudent to add the non_exhaustive marker to PredicateOperator? PTAL @liurenjie1024, @marvinlanhenke, @ZENOTME as I think this is pretty much ready now

Fokko

Nice work @sdd 🙌 Would be great to get this in, since it is blocking a few follow-up PRs. Thanks for breaking them out!

Fokko · 2024-04-15T11:45:29Z

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

+
+            visitor.not(inner_result)
+        }
+        BoundPredicate::Unary(expr) => visitor.op(expr.op(), expr.term(), None, predicate),


In PyIceberg we've split out the operator in a per-operator basis: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L256

The InclusiveMetricsEvaluator is already quite a beefy visitor, and by having it on this level, 95% will fall in a the op method: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L1137
I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

With the approach in the PR, if a new Op gets added, you have three options available when implementing BoundPredicateVisitor for a struct (if I remove the #[non_exhaustive] macro from PredicateOperator):

Exhaustively specify a match arm for every Op. If another Op gets added, your BoundPredicateVisitor will no longer compile

Specify a default match arm, and call panic!() in it. When a new Op gets added, your code still compiles but you get a runtime crash if the visitor encounters the new operator

Specify a default match arm, and return Err(NotImplemented) in it. When a new Op gets added, your code still compiles but you will get a handleable error if you encounter the new operator - potentially failing an invocation but not taking the service down.

That provides the capability that you mention from Python, as well as other options.

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

crates/iceberg/src/expr/predicate.rs

liurenjie1024

@sdd Sorry for late reply, I think we are quite close to the final version. I found another way to write op method with generics, what do you think? Others LGTM, thanks!

liurenjie1024 · 2024-04-15T13:17:16Z

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs

+    /// or SetPredicate. Passes the predicate's operator in all cases,
+    /// as well as the term and literals in the case of binary and aet
+    /// predicates.
+    fn op(


Hi, @sdd How about we try rewrite it like this

Hi Renjie! Thanks for the comment. I'm not totally convinced with your suggestion - we're trying to get the best performance, but by passing Set literals through as an iterator, rather than passing a reference to the HashSet, it requires either collecting the iterator, resulting in an allocation, or iterating over potentially the whole iterator.

I think generating an iterator for hashset will not introduce allocation, but you are right that user may need to iterating the whole iterator to implement their logic. Method in this pr seems kind of overdesign and counterintutive for user to implement. In this case I think your original design seems more intuitive to me, as @Fokko said adding operator maybe not too frequent. Sorry for the back and forth, I didn't expect this approach to be so complicated and introduce so much overhead.

cc @Fokko @marvinlanhenke @Xuanwo What do you think?

Ok - here's another PR with the original design. I'll let you guys choose which one to merge :-) #334.

Here's how the inclusive projection would end up looking with this design too: #335

My preference is the original design, which is (confusingly) in the "alternative" PRs 😅

I'm not sold on the iterator idea either, it looks very fancy, but iterating over a set feels awkward to me.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

Well, I'm not the one to decide here, just weighing in my opinion :D I have a strong preference for the original approach as suggested in #334 I think this leverages the visitor as much as possible, and also makes sure that all the operations are covered.

I'm also +1 for the original approach after comparing these two approaches.

…edicate

liurenjie1024 · 2024-04-16T12:54:18Z

Close by #334

This was referenced Apr 4, 2024

add InclusiveProjection Visitor #321

Closed

Add ManifestEvaluator, used to filter manifests in table scans #322

Merged

sdd force-pushed the add-bound-predicate-evaluator branch 4 times, most recently from 05e6a54 to 8c6d461 Compare April 4, 2024 22:31

sdd mentioned this pull request Apr 4, 2024

Add AlwaysTrue and AlwaysFalse to Predicate #319

Closed

sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 7fbf2a3 to 1b95080 Compare April 4, 2024 23:07

sdd changed the title ~~Add BoundPredicateEvaluator trait~~ Add BoundPredicateVtor tisirait Apr 4, 2024

sdd changed the title ~~Add BoundPredicateVtor tisirait~~ Add BoundPredicateVisitor trait Apr 4, 2024

liurenjie1024 reviewed Apr 5, 2024

View reviewed changes

crates/iceberg/src/expr/predicate.rs Show resolved Hide resolved

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs Outdated Show resolved Hide resolved

sdd force-pushed the add-bound-predicate-evaluator branch from 1b95080 to 77653b1 Compare April 5, 2024 07:42

liurenjie1024 reviewed Apr 5, 2024

View reviewed changes

sdd force-pushed the add-bound-predicate-evaluator branch 2 times, most recently from 9ae2e80 to 9c4bef4 Compare April 6, 2024 11:18

marvinlanhenke reviewed Apr 6, 2024

View reviewed changes

crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs Outdated Show resolved Hide resolved

sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 9b491ea to 3716d7b Compare April 7, 2024 09:48

sdd force-pushed the add-bound-predicate-evaluator branch 2 times, most recently from a295f83 to 033994a Compare April 8, 2024 06:47

sdd mentioned this pull request Apr 9, 2024

[WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) #241

Closed

Fokko reviewed Apr 15, 2024

View reviewed changes

liurenjie1024 reviewed Apr 15, 2024

View reviewed changes

feat: add BoundPredicateVisitor. Add AlwaysTrue and AlwaysFalse to Pr…

c86c768

…edicate

sdd force-pushed the add-bound-predicate-evaluator branch from 033994a to c86c768 Compare April 16, 2024 06:45

sdd mentioned this pull request Apr 16, 2024

Add BoundPredicateVisitor (alternate version) #334

Merged

liurenjie1024 closed this Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `BoundPredicateVisitor` trait #320

Add `BoundPredicateVisitor` trait #320

sdd commented Apr 4, 2024 •

edited

liurenjie1024 left a comment

liurenjie1024 Apr 5, 2024

sdd Apr 5, 2024

liurenjie1024 Apr 5, 2024

marvinlanhenke Apr 5, 2024 •

edited

liurenjie1024 Apr 5, 2024

sdd Apr 5, 2024

sdd Apr 6, 2024

liurenjie1024 Apr 6, 2024

liurenjie1024 Apr 6, 2024

sdd Apr 6, 2024

marvinlanhenke left a comment

sdd commented Apr 7, 2024

Fokko left a comment •

edited

Fokko Apr 15, 2024

Fokko Apr 15, 2024

sdd Apr 15, 2024

liurenjie1024 left a comment

liurenjie1024 Apr 15, 2024

sdd Apr 15, 2024

liurenjie1024 Apr 16, 2024

sdd Apr 16, 2024

sdd Apr 16, 2024

Fokko Apr 16, 2024

liurenjie1024 Apr 16, 2024

liurenjie1024 commented Apr 16, 2024

Add BoundPredicateVisitor trait #320

Add BoundPredicateVisitor trait #320

Conversation

sdd commented Apr 4, 2024 • edited

liurenjie1024 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marvinlanhenke Apr 5, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marvinlanhenke left a comment

Choose a reason for hiding this comment

sdd commented Apr 7, 2024

Fokko left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liurenjie1024 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liurenjie1024 commented Apr 16, 2024

Add `BoundPredicateVisitor` trait #320

Add `BoundPredicateVisitor` trait #320

sdd commented Apr 4, 2024 •

edited

marvinlanhenke Apr 5, 2024 •

edited

Fokko left a comment •

edited