Skip to content

Conversation

@Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jul 5, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Substrait uses the name "substring", and it already exists in DF SQL

What changes are included in this PR?

Adds "substring" alias for "substr" UDF so that it can be found from Substrait consumer. Also adds a mapping into Substrait producer to rewrite "substr" into "substring".
And fixes the substr roundtrip test.

The setup here is a bit weird; in unicode.mod there is a "substring" udf being created so that it can be used in export_functions with different args than the "substr" version, even if both really end up using the same substr impl. What's the use for the export_functions versions? Anyways I think that's unrelated to this PR.

Are these changes tested?

Yes - fixed (*) the Substrait roundtrip test for SUBSTR and confirmed it uses the correct alias:
ExtensionFunction { extension_uri_reference: 4294967295, function_anchor: 0, name: "substring" }

DF doesn't validate the functions against the Substrait extensions so we don't automatically check that the name is correct, but this should be good enough.

*: The previous version would optimize the call away already during planning, so the Substrait actually didn't contain the substr call, just an equals check. Now it does. There are a bunch more functions where that happens, I didn't want to fix everything here but may do it as a followup!

Are there any user-facing changes?

Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?

Arttu Voutilainen added 2 commits July 5, 2024 10:21
Substrait uses the name "substring", and it already exists in DF SQL

The setup here is a bit weird; I'd have added substring as an alias for substr, but then we have here this "substring" version being created as udf already and exported through the export_functions, with slightly different args than substr
(even though in reality the underlying function for both is the same substr impl).

I think this PR should work, but if you have suggestions on how to make the situation here cleaner, I'd be happy to!
@github-actions github-actions bot added the substrait Changes to the substrait crate label Jul 5, 2024
@Blizzara Blizzara marked this pull request as ready for review July 5, 2024 09:14
@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?

I think it needs to be updated by hand at this point

It would be great eventually to figure out how to automatically generate the docs from the code, as I think is described in #9173

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me @Blizzara -- I have a question about the tests, but otherwise this looks good to go to me. 🙏

#[tokio::test]
async fn simple_scalar_function_substr() -> Result<()> {
roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await
roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to change the test?

Maybe we could add this particular query as an additioanl query (to show the existing behavior is not changed). Something like

Suggested change
roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await
roundtrip("SELECT * FROM data WHERE a = SUBSTR('datafusion', 0, 3)").await
roundtrip("SELECT SUBSTR(f, 1, 3) FROM data").await

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to change the test?

Yes - the original query gets optimized by DF into SELECT * FROM data WHERE a = "dat" before being converted into Substrait, i.e. the whole SUBSTRING call is optimized away:

Filter: CAST(data.a AS Utf8) = Utf8("da")
  TableScan: data projection=[a, b, c, d, e, f, g], partial_filters=[CAST(data.a AS Utf8) = Utf8("da")]

Maybe we could add this particular query as an additioanl query (to show the existing behavior is not changed).

I can add it back, but it doesn't really test what it tries to test 😅 given that would you still like to have it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation

No need to change the PR

The issue seems to be that DataFusion partially evaluates the expression

@Blizzara
Copy link
Contributor Author

Blizzara commented Jul 5, 2024

Adding a "substring" alias for "substr" func. That should be added into https://datafusion.apache.org/user-guide/sql/scalar_functions.html#substr - is there some automatic way to generate the file or should I just do it by hand?

I think it needs to be updated by hand at this point

Cool, done in 70ac8f9 174b8ae

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always, thank you for pushing substrait along @Blizzara

cc @Lordworms

@Lordworms
Copy link
Contributor

Seems like a reasonable change to me, Thank you @Blizzara

@alamb alamb merged commit 6f86bfa into apache:main Jul 6, 2024
@Blizzara Blizzara deleted the avo/substring branch July 8, 2024 08:05
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: enable "substring" as a UDF in addition to "substr"

Substrait uses the name "substring", and it already exists in DF SQL

The setup here is a bit weird; I'd have added substring as an alias for substr, but then we have here this "substring" version being created as udf already and exported through the export_functions, with slightly different args than substr
(even though in reality the underlying function for both is the same substr impl).

I think this PR should work, but if you have suggestions on how to make the situation here cleaner, I'd be happy to!

* okay redo everything: add an alias instead, and add renaming in the substrait producer

* add alias into scalar_functions.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants