-
Notifications
You must be signed in to change notification settings - Fork 1.8k
#17838 Rewrite regexp_like calls as ~ and *~ operator expressions when possible
#17839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c72802e to
2213ced
Compare
2213ced to
dcea3eb
Compare
…s as operator expressions when possible
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pepijnve
| // is more optimised. | ||
| let Some((st, op, re)) = (match args.as_slice() { | ||
| [string, regexp] => { | ||
| Some((string.clone(), Operator::RegexMatch, regexp.clone())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to avoid these clone() if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check if that's possible
Let me see if my Rust skills are good enough already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to eliminate the clone calls. I don't know if there's a better way to pull the values out of the Vec than swap_remove.
| ---- | ||
| logical_plan | ||
| 01)Projection: regexp_like(test.column1_utf8view, Utf8("^https?://(?:www\.)?([^/]+)/.*$")) AS k | ||
| 01)Projection: test.column1_utf8view ~ Utf8View("^https?://(?:www\.)?([^/]+)/.*$") AS k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ~ is faster than regexp_like can we simply change the implementation to use the same underlying implementation of ~ (why only rewrite in some cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #17838 (comment)
The operator logic is in physical_expr, while regexp_like lives in functions. We would probably have to move the common logic to a separate crate. This PR was intended as a stopgap solution for common cases.
We can only rewrite in some cases because of the optional flags argument. With the operators all you have is the case sensitivity (i.e. the iflag).
The reason for the operator being more efficient is that it will make use of the regexp_is_match_scalar kernel if it can, while regexp_like always uses regexp_is_match. regexp_is_match does maintain a cache of compiled regexes so at least the pattern isn't compiled over and over again, but it's still quite a bit more code compared to regexp_is_match_scalar.
Additionally there's a regular expression simplification rule that only operates on BinaryExpr with one of the regex matching operators. The transformation here enables that optimisation for regexp_like calls as well.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks great to me -- thank you @pepijnve
Can you add some negative test coverage too like for case when the types don't match, and when for the three argument case?
Specifically add an explain .slt that shows the function isn't rewritten -- perhaps in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/regexp/regexp_like.slt ?
| mut args: Vec<Expr>, | ||
| info: &dyn SimplifyInfo, | ||
| ) -> Result<ExprSimplifyResult> { | ||
| // Try to simplify regexp_like usage to one of the builtin operators since those have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you -- these comments really help
|
I took the liberty of merging this PR up from main and adding some more tests (and in so doing reviewing it more). I'll plan to merge it when the CI completes Thanks again @pepijnve |
regexp_like calls as operator expressions when possibleregexp_like calls as ~ and *~ operator expressions when possible
|
I'd like to see the different calls use the same implementation, having 2 implementations for this seems problematic. I'll file a followup issue is no one else does that references this ticket to create a common implementation. |
Good call, I filed |
|
Urgh, that is annoying. Setting up my new linux machine and used the wrong account. |
Which issue does this PR close?
regexp_like,~, and~*implementations #17838.Rationale for this change
Avoid performance delta between
regexp_likeand~where possible.What changes are included in this PR?
Adds a simplification rule that rewrites simple
regexp_likeinvocations as~or~*operator expressions.Are these changes tested?
Covered by regexp_like SQL logic tests
Are there any user-facing changes?
No