Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark-compatible CAST operation #11201

Open
Blizzara opened this issue Jul 1, 2024 · 4 comments
Open

Spark-compatible CAST operation #11201

Blizzara opened this issue Jul 1, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@Blizzara
Copy link
Contributor

Blizzara commented Jul 1, 2024

Is your feature request related to a problem or challenge?

We're looking to use DataFusion as a replacement for Spark for some workflows, through Spark -> Substrait -> DataFusion conversions. Lot of the functionality already works, and it's been nice to see DataFusion and Spark agree on most behavior we've tested so far. However, one place where we're seeing more differences is the CAST expression, for example on casting complex types into strings (, or casting strings into numbers (where Spark is more lenient).

One option I've considered is to use Comet's cast, which I'd expect to be closer aligned with Spark (or at least on the way there). However, is there a way for me to replace/redirect the inbuilt Cast expression into using the Comet implementation?

Or would there be any other alternatives I could try?

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@Blizzara Blizzara added the enhancement New feature or request label Jul 1, 2024
@alamb
Copy link
Contributor

alamb commented Jul 1, 2024

cc @viirya and @andygrove and @Omega359

You could always create a user defined function that does the cast 🤔 bit I am not sure about how much in DataFusion special cases the CAST

@Omega359
Copy link
Contributor

Omega359 commented Jul 1, 2024

I'm surprised cast is the thing that is catching you .. for me it was the differences in functions and their behaviour between the two systems. I wouldn't have a clue atm how to make cast in datafusion replaceable - I suspect it wouldn't be an easy task.

@Blizzara
Copy link
Contributor Author

Blizzara commented Jul 3, 2024

There's definitely difference in some functions as well, but those have been mostly minor so far (I haven't gotten to regexps or date/time stuff yet), like returning null vs throwing on overflow.

I did confirm today I can use Comet's scalar functions by overriding the existing ScalarUDFs in DF, that's quite nice!

For Expr::Cast that doesn't work as it's part of the Expr enum rather than being a ScalarUDF - is that by choice, or just pending migration? But I plan to try if I can use FunctionRewrite to change the Expr::Cast into a ScalarUDF that uses the Comet cast.

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

For Expr::Cast that doesn't work as it's part of the Expr enum rather than being a ScalarUDF - is that by choice, or just pending migration? But I plan to try if I can use FunctionRewrite to change the Expr::Cast into a ScalarUDF that uses the Comet cast.

I think it is largely historical, but realistically there is enough code that handles Cast specially in the core that switching it to use a UDF is likely a susbtantial project (we could certainly do it, but I wanted to offer my opinion on the scope)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants