[SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals #24593

skambha · 2019-05-13T18:58:47Z

Description:
Deterministic UDF is a udf for which the following is true: Given a specific input, the output of the udf will be the same no matter how many times you execute the udf.

When your inputs to the UDF are all literal and UDF is deterministic, we can optimize this to evaluate the udf once and use the output instead of evaluating the UDF each time for every row in the query.

This is valid only if the UDF is deterministic and inputs are literal. Otherwise we should not and cannot apply this optimization.

Changes:

Add a new optimizer rule to evaluate the ScalaUDF once if it is deterministic and the inputs are literals.

Testing:

Added new unit tests

Credits:
Thanks to Guy Khazma from the IBM Haifa Research Team for the idea and the original implementation.

BryanCutler · 2019-05-13T20:10:35Z

ok to test

SparkQA · 2019-05-13T20:34:39Z

Test build #105367 has finished for PR 24593 at commit 1d0f995.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-13T22:29:46Z

Test build #105369 has finished for PR 24593 at commit e05c816.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

JoshRosen · 2019-05-14T03:29:45Z

Instead of defining a new DeterministicLiteralUDF rule, could we instead

override def foldable: Boolean = deterministic && children.forall(_.foldable)

on ScalaUDF? That would be a slight generalization of the optimization proposed here.

JoshRosen · 2019-05-14T03:33:49Z

This makes sense in theory, but could it be a problem in practice in case users are (incorrectly but implicitly) relying on existing behavior (e.g. to perform side-effects)? I'm not saying that we shouldn't consider this, but just wanted to consider whether there's any user workloads that are likely to be broken by this (and whether this merits a mention in a migration guide / release notes).

viirya

Does a UDF deterministic means (implicitly or explicitly) it is side-effect free?

mgaido91 · 2019-05-14T11:50:46Z

@viirya it should, since otherwise the optimizer can already make it to be executed more than once.

gatorsmile

I am against the changes for automatically folding the ScalaUDF. Based on my experience, this heuristic is risky. Although I added a deterministic flag before, most users are not aware of it. We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

mgaido91 · 2019-05-14T16:32:44Z

We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

@gatorsmile I agree on this and actually I'd be +1 in making them nonDetermistic by default. I have seen basically no user aware of the flag you introduced and all were pretty surprised by the current behavior, as it is quite counter-intuitive.

Actually, I think this PR could be revisited and more meaningful after such a change, since if the user consciously sets a UDF to deterministic, it makes sense to optimize it as proposed here.

dilipbiswal · 2019-05-14T17:51:55Z

@gatorsmile @mgaido91 Could another option be to make a ScalaUDF foldable under a config which defaults to false ?

mgaido91 · 2019-05-14T18:08:19Z

@dilipbiswal IIUC what you mean, I'd consider that more a hack than a solution, as there may be several UDFs in a query and they may need to be treated differently. Hence a global config isn't really a solution IMO.

dilipbiswal · 2019-05-14T18:44:51Z

@mgaido91

IIUC what you mean, I'd consider that more a hack than a solution

Hmmn.. i am wondering why ? The way i look at it, when user sets this config, he is opting in consciously. The documentation of this config should describe what this "opting in really means". I agree that a global config does not necessary give the granularity per UDF. But i believe we have many such examples where we set things at a global level.

Also i was thinking about the proposal to make UDFs non-deterministic by default.
I am thinking this option may also require a config switch as this may be behaviour breaking change. WDYT ?

mgaido91 · 2019-05-14T19:20:12Z

i am wondering why ?
Because it is a global config and most of the times (at least what I saw so far) users have very few queries and potentially many UDFs in it. I can witness a use case I saw few days ago, with complex queries having several UDFs where one had to be as non-deterministic while the others no. So I consider anything which doesn't have the granularity of a single UDF not a real solution to the problem.

As far as the config is regarded for @gatorsmile 's proposal, I agree having a switch for getting back to the previous value, but I think the default should be the new behavior: we are going to release 3.0 so a behavior change is fine IMHO, especially if the new behavior is what a user expects when he/she writes the query.

dilipbiswal · 2019-05-14T19:30:09Z

@mgaido91

Because it is a global config and most of the times (at least what I saw so far) users have very few queries and potentially many UDFs in it. I can witness a use case I saw few days ago, with complex queries having several UDFs where one had to be as non-deterministic while the others no. So I consider anything which doesn't have the granularity of a single UDF not a real solution to the problem.

IMHO, just to be clear, what we do with this config is to make users aware of this optimization. In your example, if users have several udfs (some deterministic and some not), they can always set the correct deterministic flag for each of their UDFs.. and then can opt-in with the config. The config is to simply make sure the default behaviour is retained thats all..

mgaido91 · 2019-05-14T20:06:40Z

IMHO, just to be clear, what we do with this config is to make users aware of this optimization.

I see @dilipbiswal , but I would argue that then we just need better documentation. The config would be there for documentation purposes. My point is that most of the users when they use a UDF, they assume that it is executed once, and that the optimizer doesn't change the number of times it is invoked (I myself don't read every bit of documentation for everything I use and a detail like this can be easily missed and/or misinterpreted for a non-expert user). So, since this is the more intuitive behavior, I'd argue that this should be the default behavior. We can then have a config for being consistent with previous behavior, I agree with that. Just to be clear, what I'd prefer is to have a behavior which a basic user can easily understand and that a more expert user can tune and improve leveraging all the flexibility the framework provides.

skambha · 2019-05-14T23:35:43Z

Thanks everyone for your comments. (@BryanCutler , @JoshRosen , @viirya, @gatorsmile , @mgaido91 , @dilipbiswal )

I wanted to summarize my understanding of the discussion so far and my comments.

In general, the optimization mentioned here to optimize the deterministic UDF for literals makes sense.

Optimization:
There are two ways to do the optimization that has been discussed:

New Optimization Rule ‘DeterministicLiteralUDF’. ie Changes are in the PR. This can be disabled using the optimizer property. spark.sql.optimizer.excludedRules
Implement ScalaUDF foldable (As mentioned by JoshRosen) that will be consumed by the existing ConstantFolding optimizer rule. We can also use a config property to enable or disable it.

Having this as a separate optimizer rule enables user to use the existing framework's property spark.sql.optimizer.excludedRules to exclude it.

Concerns:

Today, in Spark the UDFs by default are deterministic. The users may not be aware of this as it is implicitly set by default.
This optimization is based on the ‘deterministic’ property of the UDF and the inputs are literals. Users may not be aware of the deterministic property since it is set by default to true. We do not want to do optimization because we are not sure if user fully understands the implication of the udf being deterministic

To address the concerns:

For 1: I can open a new issue to discuss whether the udf should be marked nondeterministic by default and user consciously marks it deterministic.
For 2: For this optimization based on deterministic property, we create a new rule & have it under a config flag and by default make it not use this optimization. This would address the concern 2.

Documentation:
In general, there is agreement that we should document in the migration guide to ensure that users are aware of this optimization and the implications behind it.

Questions:
Q1: Can we move forward with this PR under a flag that will disable the optimization by default?
Q2: Do we want to hold off on this optimization till UDFs are marked nondeterministic by default?
———

Please let me know. Thanks.

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

skambha · 2019-05-17T18:17:43Z

We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

@gatorsmile , Thanks for the comment. I have opened SPARK-27761 to followup on this.

dongjoon-hyun · 2019-05-18T20:27:41Z

Since SPARK-27761 is created, can we reduce the focus back to the optimizer again? Actually, override def foldable will be the simplest code changes, but if we need to provide this optionally, a separate optimizer proposed in this PR will be the most feasible approach (without touching ScalaUDF and the existing optimizers like ConstantFolding.)

dongjoon-hyun · 2019-05-18T20:32:10Z

@skambha . If we provide a configuration by default false, we don't need to change the existing test cases. Then, the reviewers feel more comfortable about that. Although we didn't choose the direction yet, could you proceed in that way for the next round reviews?

sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala

SparkQA · 2019-05-20T21:28:22Z

Test build #105576 has finished for PR 24593 at commit 99542e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-21T02:04:00Z

Test build #105583 has finished for PR 24593 at commit 25099ed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

skambha · 2019-05-29T18:30:27Z

@dongjoon-hyun , I have pushed the changes to address the renames ee5fa4e you suggested. Please take a look. Thanks.

SparkQA · 2019-05-29T22:03:38Z

Test build #105927 has finished for PR 24593 at commit ee5fa4e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala

sql/core/src/test/scala/org/apache/spark/sql/DeterministicLiteralUDF.scala

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DeterministicLiteralUDF.scala

SparkQA · 2019-05-31T02:11:19Z

Test build #105983 has finished for PR 24593 at commit 7b98cad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

skambha · 2019-06-12T22:29:12Z

sql/core/src/test/scala/org/apache/spark/sql/DeterministicLiteralUDFFoldingSuite.scala

+        }
+        assert(exception.message.startsWith("Detected implicit cartesian product"))
+
+        withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") {


This udf optimization rule is as part of the operator optimization batch. One other option we could consider is to move it after the 'Check Cartesian Products' batch.

SparkQA · 2019-06-13T01:18:57Z

Test build #106447 has finished for PR 24593 at commit 93241b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-06-20T04:57:23Z

Why don't we consider #24593 (comment)? Otherwise, this rule looks too specific to one case.

HyukjinKwon · 2019-06-20T05:01:41Z

Hm, why don't we block (and close) this one for now by SPARK-27761?

skambha · 2019-06-20T17:37:11Z

Hi @HyukjinKwon , Thanks for your comments.

This has the summary of the discussion so far:
#24593 (comment) and #24593 (comment)

Why don't we consider #24593 (comment)?

@gatorsmile says “I am against the changes for automatically folding the ScalaUDF” is against doing this in this comment #24593 (review)
Although technically we can approach it this way, but per this comment, IIUC, we do not want to go down this path.

Otherwise, this rule looks too specific to one case.

Actually the rule will work for different cases in combination with the constant folding rule for nested expressions that are foldable etc.
If you have a specific scenario that you would like to be covered, please let me know.

skambha · 2019-06-20T17:44:36Z

Hm, why don't we block (and close) this one for now by SPARK-27761?

SPARK-27761 is for the broader discussion of whether we want ScalaUDF to be non deterministic by default or not.

From the time we have Scala UDF support in Spark, it has always been deterministic by default and we make decisions in the SQL engine already based on that.
The optimization and the idea in this PR is relevant irrespective of what we decide to be the default for the Scala UDF.
The optimization in this PR is helpful for many scenarios and we use them internally and it boosts performance for relevant queries ( as expected).
The PR now also places this optimization under a flag which defaults to false to minimize any risk for users who may not want to make use of this optimization by default.

Given that a) the changes are fairly minimal, a few lines of code for the optimization itself and b) the risk is minimal as it is under a feature flag, it would be good to get the changes in.

WDYT? Please share your comments/concerns on the best way forward. Thanks.

HyukjinKwon · 2019-07-02T14:13:29Z

The risk stuff or flag is okay. Honestly, I don't like that we need to add a flag for every single optimization. My point is that the approach looks hacky. This has to be done at ConstantFolding via foldable property.

skambha · 2019-07-02T18:46:17Z

I think we do have cases where we have flags for rules and there is precedence for it already today.

@HyukjinKwon would like to use the foldable property and @gatorsmile is against using the foldable property.

One other approach would be to add a global flag in the ScalaUDF when calculating the foldable. Would that be something that would satisfy both of your concerns?

Personally, I dont particularly like a global flag in ScalaUDF, but I wanted to mention this just in case.

Let me know. Thanks.

AmplabJenkins · 2019-09-16T18:13:04Z

Can one of the admins verify this patch?

HyukjinKwon · 2019-09-17T00:05:53Z

Seems we're deadlocked then. Closing it is a feasible option.

paulata · 2019-09-22T12:06:41Z

@HyukjinKwon @gatorsmile @skambha It seems there is general agreement that this optimization makes sense and it brings benefit for some use cases. I agree with the assessment of @skambha that the change is small and the risk is small. It seems there is only a minor difference in opinion on the best approach to take. While @HyukjinKwon is concerned that the approach may look hacky, this is just for the short term until other issues are resolved e.g. Scala UDFs become non deterministic by default, when the flag can be removed. Could this be an acceptable solution ? It seems to make more sense than closing the PR.

github-actions · 2020-01-01T00:05:52Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

BryanCutler reviewed May 13, 2019

View reviewed changes

external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala Outdated Show resolved Hide resolved

viirya reviewed May 14, 2019

View reviewed changes

gatorsmile requested changes May 14, 2019

View reviewed changes

kiszk reviewed May 16, 2019

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Outdated Show resolved Hide resolved

skambha force-pushed the udfopt branch from e05c816 to 99542e4 Compare May 20, 2019 18:16

pattern reviewed May 20, 2019

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed May 20, 2019

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed May 20, 2019

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed May 26, 2019

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed May 26, 2019

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed May 26, 2019

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

skambha added 3 commits May 28, 2019 14:41

small changes to config name and remove diffs to other tests

b348cd5

Move rule and test to new file respectively - review comment

7d54787

rename

ee5fa4e

skambha force-pushed the udfopt branch from 25099ed to ee5fa4e Compare May 29, 2019 18:28