feat: enable "substring" as a UDF in addition to "substr" #11277

Blizzara · 2024-07-05T09:10:17Z

Which issue does this PR close?

Closes #.

Rationale for this change

Substrait uses the name "substring", and it already exists in DF SQL.

What changes are included in this PR?

Adds "substring" alias for "substr" UDF so that it can be found from Substrait consumer. Also adds a mapping into Substrait producer to rewrite "substr" into "substring".
And fixes the substr roundtrip test.
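As a rough illustration of the two halves of this change, the sketch below models an alias lookup on the consumer side and a name rewrite on the producer side. It is not the DataFusion code: `build_registry`, `resolve_function`, and `substrait_function_name` are hypothetical names invented for this example, and the registry is a plain `HashMap` rather than DataFusion's function registry.

```rust
use std::collections::HashMap;

/// Hypothetical registry mapping every registered name (primary name or
/// alias) to the primary UDF name. Stands in for a real function registry.
fn build_registry() -> HashMap<&'static str, &'static str> {
    let mut registry = HashMap::new();
    registry.insert("substr", "substr"); // primary name
    registry.insert("substring", "substr"); // alias added for Substrait
    registry
}

/// Consumer side: resolve a function name found in a Substrait plan.
/// Returns the primary UDF name, or `None` if the name is unknown.
fn resolve_function(
    registry: &HashMap<&'static str, &'static str>,
    name: &str,
) -> Option<&'static str> {
    registry.get(name).copied()
}

/// Producer side: rewrite the DataFusion name into the name Substrait uses.
/// Names without a mapping pass through unchanged.
fn substrait_function_name(df_name: &str) -> &str {
    match df_name {
        "substr" => "substring",
        other => other,
    }
}
```

With this shape, a plan produced with `substrait_function_name("substr")` carries the name "substring", and `resolve_function` maps that name back to the same underlying function when the plan is consumed.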

The setup here is a bit weird; in unicode.mod there is a "substring" udf being created so that it can be used in export_functions with different args than the "substr" version, even if both really end up using the same substr impl. What's the use for the export_functions versions? Anyways I think that's unrelated to this PR.

Are these changes tested?

Yes - fixed (*) the Substrait roundtrip test for SUBSTR and confirmed it uses the correct alias:
ExtensionFunction { extension_uri_reference: 4294967295, function_anchor: 0, name: "substring" }

DF doesn't validate the functions against the Substrait extensions so we don't automatically check that the name is correct, but this should be good enough.

*: The previous version would optimize the call away already during planning, so the Substrait actually didn't contain the substr call, just an equals check. Now it does. There are a bunch more functions where that happens, I didn't want to fix everything here but may do it as a followup!

Are there any user-facing changes?

Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?


alamb · 2024-07-05T13:08:14Z

> Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?

I think it needs to be updated by hand at this point

It would be great eventually to figure out how to automatically generate the docs from the code, as I think is described in #9173

alamb

Makes sense to me @Blizzara -- I have a question about the tests, but otherwise this looks good to go to me. 🙏

alamb · 2024-07-05T13:09:38Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

 #[tokio::test]
 async fn simple_scalar_function_substr() -> Result<()> {
-    roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await
+    roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await


is there a reason to change the test?

Maybe we could add this particular query as an additional query (to show the existing behavior is not changed). Something like

Suggested change

-    roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await
+    roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await
+    roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await

> is there a reason to change the test?

Yes - the original query gets optimized by DF into SELECT * FROM data WHERE a = "da" before being converted into Substrait, i.e. the whole SUBSTR call is optimized away:

Filter: CAST(data.a AS Utf8) = Utf8("da")
  TableScan: data projection=[a, b, c, d, e, f, g], partial_filters=[CAST(data.a AS Utf8) = Utf8("da")]
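The literal "da" in the plan follows from SQL's 1-based string positions: a start of 0 with a count of 3 covers positions 0, 1 and 2, and only positions 1 and 2 exist. A minimal sketch of that rule is below. `sql_substr` is a hypothetical helper written for this example, not DataFusion's implementation, and it simply returns an empty string for a non-positive count.

```rust
/// Simplified model of SQL `SUBSTR(s, start, count)` with a 1-based `start`.
/// Assumption for this sketch: a non-positive `count` yields an empty string.
fn sql_substr(s: &str, start: i64, count: i64) -> String {
    if count <= 0 {
        return String::new();
    }
    // The requested window covers 1-based positions [start, end).
    let end = start.saturating_add(count);
    // Positions below 1 do not exist, so clamp the window's lower bound.
    let first = start.max(1);
    if end <= first {
        return String::new();
    }
    s.chars()
        .skip((first - 1) as usize)
        .take((end - first) as usize)
        .collect()
}
```

Under this model `sql_substr("datafusion", 0, 3)` yields "da", while `sql_substr("datafusion", 1, 3)` yields "dat", which is why the new test uses a start of 1.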

> Maybe we could add this particular query as an additional query (to show the existing behavior is not changed).

I can add it back, but it doesn't really test what it tries to test 😅 Given that, would you still like to have it?

Thank you for the explanation

No need to change the PR

The issue seems to be that DataFusion partially evaluates the expression.
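A toy model of that partial evaluation is sketched below: a call whose input is a literal is folded into a literal during planning, while a call on a column has to survive into the plan. The `Expr` enum and `fold_constants` are hypothetical and much simpler than DataFusion's expression simplifier; the sketch also assumes `start >= 1`.

```rust
/// Toy expression tree, invented for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(String),
    Column(String),
    /// SUBSTR(input, start, count) with a 1-based start (assumed >= 1).
    Substr { input: Box<Expr>, start: usize, count: usize },
}

/// Fold calls whose input is a literal; leave everything else untouched.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Substr { input, start, count } => match fold_constants(*input) {
            // All inputs are known at planning time: evaluate the call now.
            Expr::Literal(s) => {
                let folded: String = s
                    .chars()
                    .skip(start.saturating_sub(1))
                    .take(count)
                    .collect();
                Expr::Literal(folded)
            }
            // The input depends on row data: the call must stay in the plan.
            other => Expr::Substr { input: Box::new(other), start, count },
        },
        other => other,
    }
}
```

In this model `SUBSTR('datafusion', 1, 3)` folds to the literal "dat" and never reaches a serializer, whereas `SUBSTR(f, 1, 3)` on a column `f` is left as a call, which is the property the updated roundtrip test relies on.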

Blizzara · 2024-07-05T15:13:05Z

> > Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?
>
> I think it needs to be updated by hand at this point

Cool, done in ~~70ac8f9~~ 174b8ae

alamb

As always, thank you for pushing substrait along @Blizzara

cc @Lordworms

Lordworms · 2024-07-05T17:57:39Z

Seems like a reasonable change to me. Thank you @Blizzara

* feat: enable "substring" as a UDF in addition to "substr"

  Substrait uses the name "substring", and it already exists in DF SQL

  The setup here is a bit weird; I'd have added substring as an alias for substr, but then we have here this "substring" version being created as udf already and exported through the export_functions, with slightly different args than substr (even though in reality the underlying function for both is the same substr impl). I think this PR should work, but if you have suggestions on how to make the situation here cleaner, I'd be happy to!

* okay redo everything: add an alias instead, and add renaming in the substrait producer

* add alias into scalar_functions.md

Arttu Voutilainen added 2 commits July 5, 2024 10:21

okay redo everything: add an alias instead, and add renaming in the s…

f42dbfb

…ubstrait producer

github-actions bot added the substrait Changes to the substrait crate label Jul 5, 2024

Blizzara marked this pull request as ready for review July 5, 2024 09:14

alamb reviewed Jul 5, 2024

Merge remote-tracking branch 'upstream/main' into avo/substring

2376c7d

add alias into scalar_functions.md

174b8ae

Blizzara force-pushed the avo/substring branch from 70ac8f9 to 174b8ae Compare July 5, 2024 15:16

alamb approved these changes Jul 5, 2024

alamb merged commit 6f86bfa into apache:main Jul 6, 2024

Blizzara deleted the avo/substring branch July 8, 2024 08:05
