-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals #24593
Conversation
ok to test |
Test build #105367 has finished for PR 24593 at commit
|
Test build #105369 has finished for PR 24593 at commit
|
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
Outdated
Show resolved
Hide resolved
Instead of defining a new override def foldable: Boolean = deterministic && children.forall(_.foldable) on |
This makes sense in theory, but could it be a problem in practice in case users are (incorrectly but implicitly) relying on existing behavior (e.g. to perform side-effects)? I'm not saying that we shouldn't consider this, but just wanted to consider whether there's any user workloads that are likely to be broken by this (and whether this merits a mention in a migration guide / release notes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does a UDF deterministic
means (implicitly or explicitly) it is side-effect free?
@viirya it should, since otherwise the optimizer can already make it to be executed more than once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am against the changes for automatically folding the ScalaUDF. Based on my experience, this heuristic is risky. Although I added a deterministic flag before, most users are not aware of it. We even should discuss whether all UDFs must be deterministic or non-deterministic by default.
@gatorsmile I agree on this and actually I'd be +1 in making them Actually, I think this PR could be revisited and more meaningful after such a change, since if the user consciously sets a UDF to deterministic, it makes sense to optimize it as proposed here. |
@gatorsmile @mgaido91 Could another option be to make a ScalaUDF foldable under a config which defaults to false ? |
@dilipbiswal IIUC what you mean, I'd consider that more a hack than a solution, as there may be several UDFs in a query and they may need to be treated differently. Hence a global config isn't really a solution IMO. |
Hmmn.. i am wondering why ? The way i look at it, when user sets this config, he is opting in consciously. The documentation of this config should describe what this "opting in really means". I agree that a global config does not necessary give the granularity per UDF. But i believe we have many such examples where we set things at a global level. Also i was thinking about the proposal to make UDFs non-deterministic by default. |
As far as the config is regarded for @gatorsmile 's proposal, I agree having a switch for getting back to the previous value, but I think the default should be the new behavior: we are going to release 3.0 so a behavior change is fine IMHO, especially if the new behavior is what a user expects when he/she writes the query. |
IMHO, just to be clear, what we do with this config is to make users aware of this optimization. In your example, if users have several udfs (some deterministic and some not), they can always set the correct deterministic flag for each of their UDFs.. and then can opt-in with the config. The config is to simply make sure the default behaviour is retained thats all.. |
I see @dilipbiswal , but I would argue that then we just need better documentation. The config would be there for documentation purposes. My point is that most of the users when they use a UDF, they assume that it is executed once, and that the optimizer doesn't change the number of times it is invoked (I myself don't read every bit of documentation for everything I use and a detail like this can be easily missed and/or misinterpreted for a non-expert user). So, since this is the more intuitive behavior, I'd argue that this should be the default behavior. We can then have a config for being consistent with previous behavior, I agree with that. Just to be clear, what I'd prefer is to have a behavior which a basic user can easily understand and that a more expert user can tune and improve leveraging all the flexibility the framework provides. |
Thanks everyone for your comments. (@BryanCutler , @JoshRosen , @viirya, @gatorsmile , @mgaido91 , @dilipbiswal ) I wanted to summarize my understanding of the discussion so far and my comments. In general, the optimization mentioned here to optimize the deterministic UDF for literals makes sense. Optimization:
Having this as a separate optimizer rule enables user to use the existing framework's property spark.sql.optimizer.excludedRules to exclude it. Concerns:
To address the concerns:
Documentation: Questions: Please let me know. Thanks. |
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
Outdated
Show resolved
Hide resolved
@gatorsmile , Thanks for the comment. I have opened SPARK-27761 to followup on this. |
Since SPARK-27761 is created, can we reduce the focus back to the optimizer again? Actually, |
@skambha . If we provide a configuration by default |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
Test build #105576 has finished for PR 24593 at commit
|
Test build #105583 has finished for PR 24593 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
@dongjoon-hyun , I have pushed the changes to address the renames ee5fa4e you suggested. Please take a look. Thanks. |
Test build #105927 has finished for PR 24593 at commit
|
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala
Outdated
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/DeterministicLiteralUDF.scala
Outdated
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala
Outdated
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala
Outdated
Show resolved
Hide resolved
Test build #105983 has finished for PR 24593 at commit
|
} | ||
assert(exception.message.startsWith("Detected implicit cartesian product")) | ||
|
||
withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This udf optimization rule is as part of the operator optimization batch. One other option we could consider is to move it after the 'Check Cartesian Products' batch.
Test build #106447 has finished for PR 24593 at commit
|
Why don't we consider #24593 (comment)? Otherwise, this rule looks too specific to one case. |
Hm, why don't we block (and close) this one for now by SPARK-27761? |
Hi @HyukjinKwon , Thanks for your comments. This has the summary of the discussion so far:
@gatorsmile says “I am against the changes for automatically folding the ScalaUDF” is against doing this in this comment #24593 (review)
Actually the rule will work for different cases in combination with the constant folding rule for nested expressions that are foldable etc. |
SPARK-27761 is for the broader discussion of whether we want ScalaUDF to be non deterministic by default or not.
Given that a) the changes are fairly minimal, a few lines of code for the optimization itself and b) the risk is minimal as it is under a feature flag, it would be good to get the changes in. WDYT? Please share your comments/concerns on the best way forward. Thanks. |
The risk stuff or flag is okay. Honestly, I don't like that we need to add a flag for every single optimization. My point is that the approach looks hacky. This has to be done at |
I think we do have cases where we have flags for rules and there is precedence for it already today. @HyukjinKwon would like to use the foldable property and @gatorsmile is against using the foldable property. One other approach would be to add a global flag in the ScalaUDF when calculating the foldable. Would that be something that would satisfy both of your concerns? Personally, I dont particularly like a global flag in ScalaUDF, but I wanted to mention this just in case. Let me know. Thanks. |
Can one of the admins verify this patch? |
Seems we're deadlocked then. Closing it is a feasible option. |
@HyukjinKwon @gatorsmile @skambha It seems there is general agreement that this optimization makes sense and it brings benefit for some use cases. I agree with the assessment of @skambha that the change is small and the risk is small. It seems there is only a minor difference in opinion on the best approach to take. While @HyukjinKwon is concerned that the approach may look hacky, this is just for the short term until other issues are resolved e.g. Scala UDFs become non deterministic by default, when the flag can be removed. Could this be an acceptable solution ? It seems to make more sense than closing the PR. |
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
Description:
Deterministic UDF is a udf for which the following is true: Given a specific input, the output of the udf will be the same no matter how many times you execute the udf.
When your inputs to the UDF are all literal and UDF is deterministic, we can optimize this to evaluate the udf once and use the output instead of evaluating the UDF each time for every row in the query.
This is valid only if the UDF is deterministic and inputs are literal. Otherwise we should not and cannot apply this optimization.
Changes:
Testing:
Credits:
Thanks to Guy Khazma from the IBM Haifa Research Team for the idea and the original implementation.