Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BoundPredicateVisitor trait #320

Closed
wants to merge 1 commit into from

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Apr 4, 2024

This trait is used for BoundPredicate visitors that evaluate the predicate, and associated tests. It has been broken out of the #241 PR as that was getting too large.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 4 times, most recently from 05e6a54 to 8c6d461 Compare April 4, 2024 22:31
@sdd sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 7fbf2a3 to 1b95080 Compare April 4, 2024 23:07
@sdd sdd changed the title Add BoundPredicateEvaluator trait Add BoundPredicateVtor tisirait Apr 4, 2024
@sdd sdd changed the title Add BoundPredicateVtor tisirait Add BoundPredicateVisitor trait Apr 4, 2024
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd for this pr! I've left some small suggestions to improve.

crates/iceberg/src/expr/predicate.rs Show resolved Hide resolved
@sdd sdd force-pushed the add-bound-predicate-evaluator branch from 1b95080 to 77653b1 Compare April 5, 2024 07:42
fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>;
fn not(&mut self, inner: Self::T) -> Result<Self::T>;

fn is_null(&mut self, reference: &BoundReference) -> Result<Self::T>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this approach is that when we add an operator, we need to modify the trait. We can still keep it compatible with default function, but the actual behavior may break when user upgrades library. cc @viirya @Xuanwo @marvinlanhenke What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

Yes, I agree with that.

Another option is to use following method:

trait BoundPredicateVisitor {
  fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {}
}

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

I think the two approaches have different pros and cons, let's wait to see others's comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use following method:

trait BoundPredicateVisitor {
  fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {}
}

I think I'm in favor of keeping the trait as generic as possible - so we don't have to modify it, if we have to support a new operator.

When implementing this trait (no matter which variant) I'd most likely have to overwrite the default behaviour anyway? So I wouldn't bother providing any in the first place and keep the trait as generic and future proof as possible? If that makes any sense to you?

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

I think this is the advantage of the op approach, it's left to user to decide what to do. If they want new op to be ignored, they could use a default match arm, otherwise they could choose to do pattern match against each operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this will work with just visitor_op.

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

A binary operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literal: &Datum), and a set operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &FnvHashSet<Datum>).

To resolve this, you'd need the visitor to have fn unary_op(...), fn binary_op(...), and fn set_op(...).

Is that what we want?

Personally I think this is another example of where the current type hierarchy feels awkward - there are other places where we bump into these type hierarchy mismatches, such as needing a default leg in a match on operator to handle operators that will never show up, such as binary ops in a unary expression.

Something more like this feels better to me:

pub enum Predicate {
    // By not having Unary / Binary / Set here, and removing
    // that layer in favour of having the ops themselves
    // as part of this enum, there is no need for code that
    // raises errors when the wrong type of `Op` is used, since those
    // issues are prevented by the type system at compile time

    // Unary ops now only contain one param, the term, and so can be
    // implemented more simply
    IsNull(Reference),
    // ... other Unary ops here,

    // Binary / Set ops ditch the `BinaryExpression`, with the literal
    // and reference being available directly
    GreaterThan {
        literal: Datum,
        reference: Reference,
    },
    // ... other Binary ops here,
    
    In {
        literals: HashSet<Datum>,
        reference: Reference,
    },
    // ... other Set ops here
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

pub enum Expression {
    // The top level would be Expression rather than Predicate.
    // The logical operators would live at this level.

    // Since logical operators such as And / Or / Not
    // are common between bound and unbound, some visitors
    // such as rewrite_not can be applied to both bound and
    // unbound expressions, preventing duplication and increasing
    // flexibility

    /// An expression always evaluates to true.
    AlwaysTrue,

    /// An expression always evaluates to false.
    AlwaysFalse,

    // Here we use a struct variant to eliminate the need
    // for a generic `LogicalExpression`, which did not
    // add much, also allowing us to refer to lhs and rhs
    // by name

    /// An expression combined by `AND`, for example, `a > 10 AND b < 20`.
    And {
        lhs: Box<Expression>,
        rhs: Box<Expression>
    },

    /// An expression combined by `OR`, for example, `a > 10 OR b < 20`.
    Or {
        lhs: Box<Expression>,
        rhs: Box<Expression>
    },

    /// An expression combined by `NOT`, for example, `NOT (a > 10)`.
    Not(Box<Expression>),

    /// A Predicate
    Predicate(Predicate),
}

pub enum Reference {
    // Only at the leaf level do we distinguish between
    // bound and unbound. This allows things like bind
    // and unbind to be implemented as a transforming visitor,
    // and avoids duplication of logic compared to having
    // Bound / Unbound being distinguished at a higher level
    // in the type hierarchy

    Bound(BoundReference),
    Unbound(UnboundReference),
}

pub struct UnboundReference {
    name: String,
}

pub struct BoundReference {
    column_name: String,
    field: NestedFieldRef,
    accessor: StructAccessor,
}

We could use type alises such as type BoundExpression = Expression and have bind return a BoundExpression to prevent bound expressions being used where unbound ones are required and vice versa. But I'm digressing very far from the original comment now, sorry! 😆

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

I think following signature would be applied to all operators:

 fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &[Datum])

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

Hi, @sdd Would you mind to open an issue to raise a discussion about this idea? I think it looks promsing to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still wouldn't work without collecting the set into a vec first for set ops. I've updated the PR with an alternative based on an enum, see what you think.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 2 times, most recently from 9ae2e80 to 9c4bef4 Compare April 6, 2024 11:18
Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sdd for this PR. I took a look and left some minor comments.

Regarding your design idea about Predicate and Expression -
I also think it looks promising, but I still have some doubts about removing the 'layer' of Unary, Binary, SetExpression and have the PredicateOperator directly inside the Predicate.

In #264 for example, I projected the transform based on the 'group of ops' e.g. Unary, Binary or Set. If we remove this layer I'd have to match each op and handle them individually. The extra layer of indirection was helping me in this use-case. Also what about adding, removing changing ops? With an extra layer we should have an easier time modifying the layer below without affecting the downstream consumers, that otherwise match the ops directly and could be breaking?

However, I still like your approach and I think an extra issue where we can discuss this would be very useful. I think more and different views will definitely help (since I have a very limited view and experience on the codebase after all).

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 9b491ea to 3716d7b Compare April 7, 2024 09:48
@sdd
Copy link
Contributor Author

sdd commented Apr 7, 2024

I've added a test for visit_op, as well as some doc comments.

Also, based on our discussion in the comments, it seemed prudent to add the non_exhaustive marker to PredicateOperator? PTAL @liurenjie1024, @marvinlanhenke, @ZENOTME as I think this is pretty much ready now

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @sdd 🙌 Would be great to get this in, since it is blocking a few follow-up PRs. Thanks for breaking them out!


visitor.not(inner_result)
}
BoundPredicate::Unary(expr) => visitor.op(expr.op(), expr.term(), None, predicate),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PyIceberg we've split out the operator in a per-operator basis: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L256

The InclusiveMetricsEvaluator is already quite a beefy visitor, and by having it on this level, 95% will fall in a the op method: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L1137
I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

With the approach in the PR, if a new Op gets added, you have three options available when implementing BoundPredicateVisitor for a struct (if I remove the #[non_exhaustive] macro from PredicateOperator):

  1. Exhaustively specify a match arm for every Op. If another Op gets added, your BoundPredicateVisitor will no longer compile
  2. Specify a default match arm, and call panic!() in it. When a new Op gets added, your code still compiles but you get a runtime crash if the visitor encounters the new operator
  3. Specify a default match arm, and return Err(NotImplemented) in it. When a new Op gets added, your code still compiles but you will get a handleable error if you encounter the new operator - potentially failing an invocation but not taking the service down.

That provides the capability that you mention from Python, as well as other options.

crates/iceberg/src/expr/predicate.rs Show resolved Hide resolved
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd Sorry for late reply, I think we are quite close to the final version. I found another way to write op method with generics, what do you think? Others LGTM, thanks!

/// or SetPredicate. Passes the predicate's operator in all cases,
/// as well as the term and literals in the case of binary and aet
/// predicates.
fn op(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @sdd How about we try rewrite it like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Renjie! Thanks for the comment. I'm not totally convinced with your suggestion - we're trying to get the best performance, but by passing Set literals through as an iterator, rather than passing a reference to the HashSet, it requires either collecting the iterator, resulting in an allocation, or iterating over potentially the whole iterator.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generating an iterator for hashset will not introduce allocation, but you are right that user may need to iterating the whole iterator to implement their logic. Method in this pr seems kind of overdesign and counterintutive for user to implement. In this case I think your original design seems more intuitive to me, as @Fokko said adding operator maybe not too frequent. Sorry for the back and forth, I didn't expect this approach to be so complicated and introduce so much overhead.

cc @Fokko @marvinlanhenke @Xuanwo What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - here's another PR with the original design. I'll let you guys choose which one to merge :-) #334.

Here's how the inclusive projection would end up looking with this design too: #335

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is the original design, which is (confusingly) in the "alternative" PRs 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sold on the iterator idea either, it looks very fancy, but iterating over a set feels awkward to me.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

Well, I'm not the one to decide here, just weighing in my opinion :D I have a strong preference for the original approach as suggested in #334 I think this leverages the visitor as much as possible, and also makes sure that all the operations are covered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also +1 for the original approach after comparing these two approaches.

@liurenjie1024
Copy link
Collaborator

Close by #334

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants