-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-48769][SQL] Support constant folding for ScalaUDF #47164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48769][SQL] Support constant folding for ScalaUDF #47164
Conversation
b237c81 to
4b55bf2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem we made ScalaUDF as deterministic by default so far, and we will have a lot of problems in existing workloads by executing the UDFs in driver side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we ever made them non-determinstic by default, it would have been much easier to support foldability ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem we made ScalaUDF as deterministic by default so far, and we will have a lot of problems in existing workloads by executing the UDFs in driver side.
To me deterministic seems like the correct default, I would expect most udfs to be deterministic. So chainging the default seems like a more invasive change.
The concerns you raise are around the belief that people are using the existing API incorrectly. Do we have some more concrete evidence that this is the case? Because this seems like an argument that could be used against a lot of potential improvements and to me it does not seems like a rule we would like to follow in the long run.
My understanding it also that having a UDF with incorrect value for deterministic would already open you up to correctness issues due to other optimizer rules? So maybe that leads people to already specifying the correct nullability for their udfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user defines a UDF that requires something in executors, that'd be a problem now as the UDF runs in driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, if the UDF runs something very heavy, it will stop the Catalyst optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, if the UDF runs something very heavy, it will stop the Catalyst optimizer.
To me this argument sounds a bit backwards. The heavier the UDF is the more gain it is from executing it once during planning instead of executing it multiple times during execution (worst case we are going to execute it once per row).
If user defines a UDF that requires something in executors, that'd be a problem now as the UDF runs in driver.
Ok, this does sound like somthing that could cause breakage. Not sure how common that would be in practice b though.
I changed the PR to make it opt in per UDF to make it foldable. I guess that should address most/all of the concerns you brought up with regards of it being a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon any chance we could modify this change to make it acceptable or should it just be abandoned?
I also notice that we will attempt to execute expressions and therefore UDFs as part of EliminateOuterJoins. There errors are ignored. We could extend the ConstantFolding rule to the same, ignore errors from expressions containing UDFs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found that we will also execute udfs of part of ConvertToLocalRelation
bb0725b to
086e48b
Compare
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This patch makes ii possible to mark scala UDFs as foldable. UDFs marked as foldable will be constant folded when when it is deterministic and all the children are foldable.
Why are the changes needed?
By allowing the expression to be folded we can allow for more optimizations. Zero argument UDFs are commonly used for literals of custom types.
Does this PR introduce any user-facing change?
Yes, it allows scala UDFs to be constant folded during analysis.
How was this patch tested?
New tests in ConstantFoldingSuite.
Was this patch authored or co-authored using generative AI tooling?
No.