Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals #24593

Closed
wants to merge 12 commits into from

Conversation

skambha
Copy link
Contributor

@skambha skambha commented May 13, 2019

Description:
Deterministic UDF is a udf for which the following is true: Given a specific input, the output of the udf will be the same no matter how many times you execute the udf.

When your inputs to the UDF are all literal and UDF is deterministic, we can optimize this to evaluate the udf once and use the output instead of evaluating the UDF each time for every row in the query.

This is valid only if the UDF is deterministic and inputs are literal. Otherwise we should not and cannot apply this optimization.

Changes:

  • Add a new optimizer rule to evaluate the ScalaUDF once if it is deterministic and the inputs are literals.

Testing:

  • Added new unit tests

Credits:
Thanks to Guy Khazma from the IBM Haifa Research Team for the idea and the original implementation.

@BryanCutler
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 13, 2019

Test build #105367 has finished for PR 24593 at commit 1d0f995.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 13, 2019

Test build #105369 has finished for PR 24593 at commit e05c816.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

Instead of defining a new DeterministicLiteralUDF rule, could we instead

override def foldable: Boolean = deterministic && children.forall(_.foldable)

on ScalaUDF? That would be a slight generalization of the optimization proposed here.

@JoshRosen
Copy link
Contributor

This makes sense in theory, but could it be a problem in practice in case users are (incorrectly but implicitly) relying on existing behavior (e.g. to perform side-effects)? I'm not saying that we shouldn't consider this, but just wanted to consider whether there's any user workloads that are likely to be broken by this (and whether this merits a mention in a migration guide / release notes).

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a UDF deterministic means (implicitly or explicitly) it is side-effect free?

@mgaido91
Copy link
Contributor

@viirya it should, since otherwise the optimizer can already make it to be executed more than once.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am against the changes for automatically folding the ScalaUDF. Based on my experience, this heuristic is risky. Although I added a deterministic flag before, most users are not aware of it. We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

@mgaido91
Copy link
Contributor

We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

@gatorsmile I agree on this and actually I'd be +1 in making them nonDetermistic by default. I have seen basically no user aware of the flag you introduced and all were pretty surprised by the current behavior, as it is quite counter-intuitive.

Actually, I think this PR could be revisited and more meaningful after such a change, since if the user consciously sets a UDF to deterministic, it makes sense to optimize it as proposed here.

@dilipbiswal
Copy link
Contributor

@gatorsmile @mgaido91 Could another option be to make a ScalaUDF foldable under a config which defaults to false ?

@mgaido91
Copy link
Contributor

@dilipbiswal IIUC what you mean, I'd consider that more a hack than a solution, as there may be several UDFs in a query and they may need to be treated differently. Hence a global config isn't really a solution IMO.

@dilipbiswal
Copy link
Contributor

@mgaido91

IIUC what you mean, I'd consider that more a hack than a solution

Hmmn.. i am wondering why ? The way i look at it, when user sets this config, he is opting in consciously. The documentation of this config should describe what this "opting in really means". I agree that a global config does not necessary give the granularity per UDF. But i believe we have many such examples where we set things at a global level.

Also i was thinking about the proposal to make UDFs non-deterministic by default.
I am thinking this option may also require a config switch as this may be behaviour breaking change. WDYT ?

@mgaido91
Copy link
Contributor

i am wondering why ?
Because it is a global config and most of the times (at least what I saw so far) users have very few queries and potentially many UDFs in it. I can witness a use case I saw few days ago, with complex queries having several UDFs where one had to be as non-deterministic while the others no. So I consider anything which doesn't have the granularity of a single UDF not a real solution to the problem.

As far as the config is regarded for @gatorsmile 's proposal, I agree having a switch for getting back to the previous value, but I think the default should be the new behavior: we are going to release 3.0 so a behavior change is fine IMHO, especially if the new behavior is what a user expects when he/she writes the query.

@dilipbiswal
Copy link
Contributor

@mgaido91

Because it is a global config and most of the times (at least what I saw so far) users have very few queries and potentially many UDFs in it. I can witness a use case I saw few days ago, with complex queries having several UDFs where one had to be as non-deterministic while the others no. So I consider anything which doesn't have the granularity of a single UDF not a real solution to the problem.

IMHO, just to be clear, what we do with this config is to make users aware of this optimization. In your example, if users have several udfs (some deterministic and some not), they can always set the correct deterministic flag for each of their UDFs.. and then can opt-in with the config. The config is to simply make sure the default behaviour is retained thats all..

@mgaido91
Copy link
Contributor

mgaido91 commented May 14, 2019

IMHO, just to be clear, what we do with this config is to make users aware of this optimization.

I see @dilipbiswal , but I would argue that then we just need better documentation. The config would be there for documentation purposes. My point is that most of the users when they use a UDF, they assume that it is executed once, and that the optimizer doesn't change the number of times it is invoked (I myself don't read every bit of documentation for everything I use and a detail like this can be easily missed and/or misinterpreted for a non-expert user). So, since this is the more intuitive behavior, I'd argue that this should be the default behavior. We can then have a config for being consistent with previous behavior, I agree with that. Just to be clear, what I'd prefer is to have a behavior which a basic user can easily understand and that a more expert user can tune and improve leveraging all the flexibility the framework provides.

@skambha
Copy link
Contributor Author

skambha commented May 14, 2019

Thanks everyone for your comments. (@BryanCutler , @JoshRosen , @viirya, @gatorsmile , @mgaido91 , @dilipbiswal )

I wanted to summarize my understanding of the discussion so far and my comments.

In general, the optimization mentioned here to optimize the deterministic UDF for literals makes sense.

Optimization:
There are two ways to do the optimization that has been discussed:

  1. New Optimization Rule ‘DeterministicLiteralUDF’. ie Changes are in the PR. This can be disabled using the optimizer property. spark.sql.optimizer.excludedRules
  2. Implement ScalaUDF foldable (As mentioned by JoshRosen) that will be consumed by the existing ConstantFolding optimizer rule. We can also use a config property to enable or disable it.

Having this as a separate optimizer rule enables user to use the existing framework's property spark.sql.optimizer.excludedRules to exclude it.

Concerns:

  1. Today, in Spark the UDFs by default are deterministic. The users may not be aware of this as it is implicitly set by default.
  2. This optimization is based on the ‘deterministic’ property of the UDF and the inputs are literals. Users may not be aware of the deterministic property since it is set by default to true. We do not want to do optimization because we are not sure if user fully understands the implication of the udf being deterministic

To address the concerns:

  • For 1: I can open a new issue to discuss whether the udf should be marked nondeterministic by default and user consciously marks it deterministic.

  • For 2: For this optimization based on deterministic property, we create a new rule & have it under a config flag and by default make it not use this optimization. This would address the concern 2.

Documentation:
In general, there is agreement that we should document in the migration guide to ensure that users are aware of this optimization and the implications behind it.

Questions:
Q1: Can we move forward with this PR under a flag that will disable the optimization by default?
Q2: Do we want to hold off on this optimization till UDFs are marked nondeterministic by default?
———

Please let me know. Thanks.

@skambha
Copy link
Contributor Author

skambha commented May 17, 2019

We even should discuss whether all UDFs must be deterministic or non-deterministic by default.

@gatorsmile , Thanks for the comment. I have opened SPARK-27761 to followup on this.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 18, 2019

Since SPARK-27761 is created, can we reduce the focus back to the optimizer again? Actually, override def foldable will be the simplest code changes, but if we need to provide this optionally, a separate optimizer proposed in this PR will be the most feasible approach (without touching ScalaUDF and the existing optimizers like ConstantFolding.)

@dongjoon-hyun
Copy link
Member

@skambha . If we provide a configuration by default false, we don't need to change the existing test cases. Then, the reviewers feel more comfortable about that. Although we didn't choose the direction yet, could you proceed in that way for the next round reviews?

@SparkQA
Copy link

SparkQA commented May 20, 2019

Test build #105576 has finished for PR 24593 at commit 99542e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105583 has finished for PR 24593 at commit 25099ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skambha
Copy link
Contributor Author

skambha commented May 29, 2019

@dongjoon-hyun , I have pushed the changes to address the renames ee5fa4e you suggested. Please take a look. Thanks.

@SparkQA
Copy link

SparkQA commented May 29, 2019

Test build #105927 has finished for PR 24593 at commit ee5fa4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 31, 2019

Test build #105983 has finished for PR 24593 at commit 7b98cad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
assert(exception.message.startsWith("Detected implicit cartesian product"))

withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This udf optimization rule is as part of the operator optimization batch. One other option we could consider is to move it after the 'Check Cartesian Products' batch.

@SparkQA
Copy link

SparkQA commented Jun 13, 2019

Test build #106447 has finished for PR 24593 at commit 93241b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Why don't we consider #24593 (comment)? Otherwise, this rule looks too specific to one case.

@HyukjinKwon
Copy link
Member

Hm, why don't we block (and close) this one for now by SPARK-27761?

@skambha
Copy link
Contributor Author

skambha commented Jun 20, 2019

Hi @HyukjinKwon , Thanks for your comments.

This has the summary of the discussion so far:
#24593 (comment) and #24593 (comment)

Why don't we consider #24593 (comment)?

@gatorsmile says “I am against the changes for automatically folding the ScalaUDF” is against doing this in this comment #24593 (review)
Although technically we can approach it this way, but per this comment, IIUC, we do not want to go down this path.

Otherwise, this rule looks too specific to one case.

Actually the rule will work for different cases in combination with the constant folding rule for nested expressions that are foldable etc.
If you have a specific scenario that you would like to be covered, please let me know.

@skambha
Copy link
Contributor Author

skambha commented Jun 20, 2019

Hm, why don't we block (and close) this one for now by SPARK-27761?

SPARK-27761 is for the broader discussion of whether we want ScalaUDF to be non deterministic by default or not.

  1. From the time we have Scala UDF support in Spark, it has always been deterministic by default and we make decisions in the SQL engine already based on that.

  2. The optimization and the idea in this PR is relevant irrespective of what we decide to be the default for the Scala UDF.

  3. The optimization in this PR is helpful for many scenarios and we use them internally and it boosts performance for relevant queries ( as expected).

  4. The PR now also places this optimization under a flag which defaults to false to minimize any risk for users who may not want to make use of this optimization by default.

Given that a) the changes are fairly minimal, a few lines of code for the optimization itself and b) the risk is minimal as it is under a feature flag, it would be good to get the changes in.

WDYT? Please share your comments/concerns on the best way forward. Thanks.

@HyukjinKwon
Copy link
Member

The risk stuff or flag is okay. Honestly, I don't like that we need to add a flag for every single optimization. My point is that the approach looks hacky. This has to be done at ConstantFolding via foldable property.

@skambha
Copy link
Contributor Author

skambha commented Jul 2, 2019

I think we do have cases where we have flags for rules and there is precedence for it already today.

@HyukjinKwon would like to use the foldable property and @gatorsmile is against using the foldable property.

One other approach would be to add a global flag in the ScalaUDF when calculating the foldable. Would that be something that would satisfy both of your concerns?

Personally, I dont particularly like a global flag in ScalaUDF, but I wanted to mention this just in case.

Let me know. Thanks.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

Seems we're deadlocked then. Closing it is a feasible option.

@paulata
Copy link

paulata commented Sep 22, 2019

@HyukjinKwon @gatorsmile @skambha It seems there is general agreement that this optimization makes sense and it brings benefit for some use cases. I agree with the assessment of @skambha that the change is small and the risk is small. It seems there is only a minor difference in opinion on the best approach to take. While @HyukjinKwon is concerned that the approach may look hacky, this is just for the short term until other issues are resolved e.g. Scala UDFs become non deterministic by default, when the flag can be removed. Could this be an acceptable solution ? It seems to make more sense than closing the PR.

@github-actions
Copy link

github-actions bot commented Jan 1, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.