-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: enable "substring" as a UDF in addition to "substr" #11277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Substrait uses the name "substring", and it already exists in DF SQL The setup here is a bit weird; I'd have added substring as an alias for substr, but then we have here this "substring" version being created as udf already and exported through the export_functions, with slightly different args than substr (even though in reality the underlying function for both is the same substr impl). I think this PR should work, but if you have suggestions on how to make the situation here cleaner, I'd be happy to!
…ubstrait producer
I think it needs to be updated by hand at this point It would be great eventually to figure out how to automatically generate the docs from the code, as I think is described in #9173 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me @Blizzara -- I have a question about the tests, but otherwise this looks good to go to me. 🙏
| #[tokio::test] | ||
| async fn simple_scalar_function_substr() -> Result<()> { | ||
| roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await | ||
| roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to change the test?
Maybe we could add this particular query as an additioanl query (to show the existing behavior is not changed). Something like
| roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await | |
| roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await | |
| roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to change the test?
Yes - the original query gets optimized by DF into SELECT * FROM data WHERE a = "dat" before being converted into Substrait, i.e. the whole SUBSTRING call is optimized away:
Filter: CAST(data.a AS Utf8) = Utf8("da")
TableScan: data projection=[a, b, c, d, e, f, g], partial_filters=[CAST(data.a AS Utf8) = Utf8("da")]
Maybe we could add this particular query as an additioanl query (to show the existing behavior is not changed).
I can add it back, but it doesn't really test what it tries to test 😅 given that would you still like to have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation
No need to change the PR
The issue seems to be that DataFusion partially evaluates the expression
|
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As always, thank you for pushing substrait along @Blizzara
cc @Lordworms
|
Seems like a reasonable change to me, Thank you @Blizzara |
* feat: enable "substring" as a UDF in addition to "substr" Substrait uses the name "substring", and it already exists in DF SQL The setup here is a bit weird; I'd have added substring as an alias for substr, but then we have here this "substring" version being created as udf already and exported through the export_functions, with slightly different args than substr (even though in reality the underlying function for both is the same substr impl). I think this PR should work, but if you have suggestions on how to make the situation here cleaner, I'd be happy to! * okay redo everything: add an alias instead, and add renaming in the substrait producer * add alias into scalar_functions.md
Which issue does this PR close?
Closes #.
Rationale for this change
Substrait uses the name "substring", and it already exists in DF SQL
What changes are included in this PR?
Adds "substring" alias for "substr" UDF so that it can be found from Substrait consumer. Also adds a mapping into Substrait producer to rewrite "substr" into "substring".
And fixes the substr roundtrip test.
The setup here is a bit weird; in unicode.mod there is a "substring" udf being created so that it can be used in
export_functionswith different args than the "substr" version, even if both really end up using the samesubstrimpl. What's the use for theexport_functionsversions? Anyways I think that's unrelated to this PR.Are these changes tested?
Yes - fixed (*) the Substrait roundtrip test for SUBSTR and confirmed it uses the correct alias:
ExtensionFunction { extension_uri_reference: 4294967295, function_anchor: 0, name: "substring" }DF doesn't validate the functions against the Substrait extensions so we don't automatically check that the name is correct, but this should be good enough.
*: The previous version would optimize the call away already during planning, so the Substrait actually didn't contain the substr call, just an equals check. Now it does. There are a bunch more functions where that happens, I didn't want to fix everything here but may do it as a followup!
Are there any user-facing changes?
Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?