Conversation

@pepijnve
Contributor

Which issue does this PR close?

Rationale for this change

Avoid performance delta between regexp_like and ~ where possible.

What changes are included in this PR?

Adds a simplification rule that rewrites simple regexp_like invocations as ~ or ~* operator expressions.

Are these changes tested?

Covered by regexp_like SQL logic tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the functions Changes to functions implementation label Sep 30, 2025
@pepijnve pepijnve force-pushed the issue_17838 branch 2 times, most recently from c72802e to 2213ced Compare September 30, 2025 18:43
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 30, 2025
Contributor

@alamb alamb left a comment

Thanks @pepijnve

// is more optimised.
let Some((st, op, re)) = (match args.as_slice() {
[string, regexp] => {
Some((string.clone(), Operator::RegexMatch, regexp.clone()))
Contributor

it would be nice to avoid these clone() if possible

Contributor Author

@pepijnve pepijnve Oct 3, 2025

I will check if that's possible

Let me see if my Rust skills are good enough already.

Contributor Author

I was able to eliminate the clone calls. I don't know if there's a better way to pull the values out of the Vec than swap_remove.
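
For readers following along, here is a minimal sketch of moving two values out of a `Vec` without cloning. It is not the code from this PR: `take_pair` and `take_pair_array` are hypothetical helper names, and both hand the untouched `Vec` back on a length mismatch, on the assumption that a caller would want to return the original arguments when no rewrite applies.

```rust
use std::convert::TryInto;

/// Hypothetical helper: moves both elements out of a two-element Vec.
/// Returns the Vec unchanged in `Err` if it does not hold exactly two items.
fn take_pair<T>(mut args: Vec<T>) -> Result<(T, T), Vec<T>> {
    if args.len() != 2 {
        return Err(args);
    }
    // Removing index 1 first pops the last element, so nothing is shifted.
    let second = args.swap_remove(1);
    let first = args.swap_remove(0);
    Ok((first, second))
}

/// Hypothetical alternative: convert the Vec into a fixed-size array and
/// destructure it. The conversion fails with the original Vec on a length
/// mismatch, so `?` returns it to the caller as-is.
fn take_pair_array<T>(args: Vec<T>) -> Result<(T, T), Vec<T>> {
    let [first, second]: [T; 2] = args.try_into()?;
    Ok((first, second))
}
```

Both variants move the elements rather than copying them. The array conversion avoids index arithmetic, while `swap_remove` also works when the `Vec` is only borrowed mutably.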

----
logical_plan
-01)Projection: regexp_like(test.column1_utf8view, Utf8("^https?://(?:www\.)?([^/]+)/.*$")) AS k
+01)Projection: test.column1_utf8view ~ Utf8View("^https?://(?:www\.)?([^/]+)/.*$") AS k
Contributor

If ~ is faster than regexp_like, can we simply change regexp_like to use the same underlying implementation as ~? (Why only rewrite in some cases?)

Contributor Author

@pepijnve pepijnve Oct 3, 2025

See #17838 (comment)

The operator logic is in physical_expr, while regexp_like lives in functions. We would probably have to move the common logic to a separate crate. This PR was intended as a stopgap solution for common cases.

We can only rewrite in some cases because of the optional flags argument. With the operators, all you can control is case sensitivity (i.e. the i flag).
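
The decision described above can be sketched with a toy expression model. This is an illustration only, not DataFusion's API: the `Expr` and `Operator` enums and `simplify_regexp_like` are simplified stand-ins defined here, and the sketch assumes that a literal `"i"` flags argument is the only three-argument form that maps to `~*`. Type checks on the arguments are left out.

```rust
/// Toy stand-in for a logical expression (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    /// A call to regexp_like with its raw argument list.
    RegexpLike(Vec<Expr>),
    Binary { left: Box<Expr>, op: Operator, right: Box<Expr> },
}

/// Toy stand-in for the two regex match operators `~` and `~*`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operator {
    RegexMatch,
    RegexIMatch,
}

/// Rewrites regexp_like(string, pattern[, flags]) as an operator expression
/// when the flags can be expressed by an operator; otherwise keeps the call.
fn simplify_regexp_like(mut args: Vec<Expr>) -> Expr {
    let op = match args.as_slice() {
        // No flags: plain case-sensitive match, i.e. `~`.
        [_, _] => Some(Operator::RegexMatch),
        // Literal "i" flag: case-insensitive match, i.e. `~*`.
        [_, _, Expr::Literal(flags)] if flags.as_str() == "i" => {
            Some(Operator::RegexIMatch)
        }
        // Any other flags (or a non-literal flags argument) have no operator form.
        _ => None,
    };
    match op {
        Some(op) => {
            args.truncate(2); // drop the flags argument, if present
            let right = args.swap_remove(1);
            let left = args.swap_remove(0);
            Expr::Binary { left: Box::new(left), op, right: Box::new(right) }
        }
        None => Expr::RegexpLike(args),
    }
}
```

A call with flags such as `"s"` or `"im"` falls through to the last arm and stays a function call, which is the "not rewritten" case.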

The reason for the operator being more efficient is that it will make use of the regexp_is_match_scalar kernel if it can, while regexp_like always uses regexp_is_match. regexp_is_match does maintain a cache of compiled regexes so at least the pattern isn't compiled over and over again, but it's still quite a bit more code compared to regexp_is_match_scalar.
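
To make the difference concrete, here is a rough sketch of the two evaluation styles. It does not use the real arrow kernels: `Compiled` is a hypothetical stand-in that does substring matching instead of regex matching, and `match_per_row` and `match_scalar` only mimic the shape of a per-row-pattern kernel with a cache versus a single-pattern kernel. The `compilations` counter exists only to make the work visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled regex (plain substring matching here).
struct Compiled {
    needle: String,
}

impl Compiled {
    /// "Compiles" a pattern and records that a compilation happened.
    fn compile(pattern: &str, compilations: &mut usize) -> Self {
        *compilations += 1;
        Compiled { needle: pattern.to_string() }
    }

    fn is_match(&self, value: &str) -> bool {
        value.contains(&self.needle)
    }
}

/// Per-row style: every row carries its own pattern, so each row needs a
/// cache lookup even when all the patterns are identical.
fn match_per_row(values: &[&str], patterns: &[&str], compilations: &mut usize) -> Vec<bool> {
    let mut cache: HashMap<String, Compiled> = HashMap::new();
    let mut out = Vec::with_capacity(values.len());
    for (value, pattern) in values.iter().zip(patterns.iter()) {
        if !cache.contains_key(*pattern) {
            cache.insert(pattern.to_string(), Compiled::compile(pattern, compilations));
        }
        out.push(cache[*pattern].is_match(value));
    }
    out
}

/// Scalar style: one pattern for the whole column, compiled once up front,
/// with no per-row lookup.
fn match_scalar(values: &[&str], pattern: &str, compilations: &mut usize) -> Vec<bool> {
    let compiled = Compiled::compile(pattern, compilations);
    values.iter().map(|v| compiled.is_match(v)).collect()
}
```

In both styles a repeated pattern is compiled only once; the per-row version still pays for a hash lookup and a key comparison on every row, which is the extra work referred to above.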

Additionally there's a regular expression simplification rule that only operates on BinaryExpr with one of the regex matching operators. The transformation here enables that optimisation for regexp_like calls as well.

Contributor

@alamb alamb left a comment

The code looks great to me -- thank you @pepijnve

Can you add some negative test coverage too, such as for the case when the types don't match, and for the three-argument case?

Specifically add an explain .slt that shows the function isn't rewritten -- perhaps in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/regexp/regexp_like.slt ?

mut args: Vec<Expr>,
info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
// Try to simplify regexp_like usage to one of the builtin operators since those have
Contributor

thank you -- these comments really help

@alamb alamb added the performance Make DataFusion faster label Oct 3, 2025
@alamb
Contributor

alamb commented Oct 6, 2025

I took the liberty of merging this PR up from main and adding some more tests (and in so doing reviewing it more). I'll plan to merge it when the CI completes

Thanks again @pepijnve

@alamb alamb changed the title from "#17838 Rewrite regexp_like calls as operator expressions when possible" to "#17838 Rewrite regexp_like calls as ~ and *~ operator expressions when possible" Oct 6, 2025
@ghost

ghost commented Oct 6, 2025

I'd like to see the different calls use the same implementation; having two implementations for this seems problematic. If no one else does, I'll file a follow-up issue that references this ticket to create a common implementation.

@alamb
Contributor

alamb commented Oct 6, 2025

I'd like to see the different calls use the same implementation; having two implementations for this seems problematic. If no one else does, I'll file a follow-up issue that references this ticket to create a common implementation.

Good call, I filed

@alamb alamb added this pull request to the merge queue Oct 6, 2025
@alamb
Contributor

alamb commented Oct 6, 2025

Thanks @pepijnve and @Omega1 (what happened to @Omega359 🤔 )

Merged via the queue into apache:main with commit 8e73844 Oct 6, 2025
29 checks passed
@Omega359
Contributor

Omega359 commented Oct 6, 2025

Urgh, that is annoying. I was setting up my new Linux machine and used the wrong account.

@pepijnve pepijnve deleted the issue_17838 branch November 3, 2025 16:24
Labels

functions Changes to functions implementation performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Align regexp_like, ~, and ~* implementations

3 participants