[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions #13087
Conversation
Test build #58530 has finished for PR 13087 at commit
Test build #58531 has finished for PR 13087 at commit
There are several cases which assume the UDF is deterministic. It would be a big change to users. I'll revert the change on ScalaUDF, and update this PR to change the optimizer not to duplicate the UDF expression.
The reported error scenario was the following.

scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", "b1"))).toDF("old", "old1")
scala> val udfFunc = udf((s: String) => { println(s"running udf($s)"); s })
scala> val newDF = df.withColumn("new", udfFunc(df("old")))
scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
scala> filteredOnNewColumnDF.show
running udf(a)
running udf(a)
running udf(a1)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+

The result of this PR is like the following.

scala> filteredOnNewColumnDF.show
running udf(a1)
running udf(a)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+
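The extra invocation can be reproduced outside Spark with a toy model of the rewrite. The following is a hypothetical sketch in plain Python (not Spark's optimizer): when the predicate is pushed below the projection, the projected UDF expression is substituted into the filter condition, so rows that survive the filter evaluate the UDF a second time in the projection.

```python
# Toy model of predicate pushdown over a projection; all names are made up.
calls = {"n": 0}

def udf(s):                      # stand-in for the user-defined function
    calls["n"] += 1
    return s

rows = [("a", "b"), ("a1", "b1")]

# Plan before pushdown: project first, then filter on the projected column.
# Each input row runs the UDF exactly once.
calls["n"] = 0
projected = [(old, old1, udf(old)) for old, old1 in rows]
result = [r for r in projected if r[2] != "a1"]
before = calls["n"]              # one call per input row

# Plan after pushdown: the predicate becomes udf(old) != 'a1' and runs below
# the projection, so each surviving row runs the UDF again when projected.
calls["n"] = 0
filtered = [(old, old1) for old, old1 in rows if udf(old) != "a1"]
result2 = [(old, old1, udf(old)) for old, old1 in filtered]
after = calls["n"]               # filter calls + one more per surviving row

print(before, after)
```

This mirrors the transcript above: two rows yield two calls before the rewrite, but three calls after it (two in the filter, one more for the surviving row).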
Test build #58535 has finished for PR 13087 at commit
Hi, @liancheng and @cloud-fan.
Test build #58647 has finished for PR 13087 at commit
Hi, @rxin.
Test build #58874 has finished for PR 13087 at commit
Hi, @marmbrus
cc @cloud-fan
@@ -1025,7 +1025,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
     // state and all the input rows processed before. In another word, the order of input rows
     // matters for non-deterministic expressions, while pushing down predicates changes the order.
     case filter @ Filter(condition, project @ Project(fields, grandChild))
-      if fields.forall(_.deterministic) =>
+      if fields.forall(_.deterministic) &&
+        fields.forall(_.find(_.isInstanceOf[ScalaUDF]).isEmpty) =>
I'm not sure if I understand this correctly. Do you mean ScalaUDF can be non-deterministic and we should always treat it as a non-deterministic expression? If so, I think a better idea is to just override deterministic in ScalaUDF and always return false.
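The two approaches discussed here can be contrasted with a toy expression tree. This is an illustrative sketch in plain Python, not Spark internals; the class and function names are invented for the example.

```python
# Toy expression tree modeling the two proposed fixes (names are made up).
class Expr:
    deterministic = True
    children = ()

    def find(self, pred):
        # Return the first node in this subtree matching pred, else None.
        if pred(self):
            return self
        for c in self.children:
            hit = c.find(pred)
            if hit is not None:
                return hit
        return None

class Column(Expr):
    pass

class ToyUDF(Expr):
    # Fix 1 (the suggestion above): treat every UDF as non-deterministic, so
    # the existing `fields.forall(_.deterministic)` guard blocks the pushdown.
    deterministic = False

    def __init__(self, child):
        self.children = (child,)

def can_push_down(fields):
    # Fix 2 (this PR's interim approach): keep UDFs "deterministic" but add an
    # explicit guard so the rule skips any projection containing a UDF.
    return all(f.deterministic for f in fields) and all(
        f.find(lambda e: isinstance(e, ToyUDF)) is None for f in fields)

print(can_push_down([Column(), ToyUDF(Column())]))  # blocked either way
```

Either fix stops the rewrite for projections containing a UDF; the difference is that Fix 1 changes the UDF's contract globally, which is what broke other test suites.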
Thank you, @cloud-fan, again!
Yep, exactly. That is what I really wanted to do, so I made it my first commit for this PR. But you can see the result above.
- the first commit: 85fa040
- Jenkins result: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58531/consoleFull
- My decision: There are several cases which assume the UDF is deterministic. It would be a big change to users. I'll revert the change on ScalaUDF, and update this PR to change the optimizer not to duplicate the UDF expression.

I still think that is the correct solution; I totally agree with you. But, as you can see, it requires changing other test suites, so I thought I needed a committer's decision to do that.
Can you look into where we made this assumption? For now I prefer to override deterministic in ScalaUDF, if it's not a lot of effort to fix this problem.
Great! No problem. I will try to fix the other test suites correctly.
According to @cloud-fan's advice, the goal of this PR is now making
Test build #58969 has finished for PR 13087 at commit
Hmm, @cloud-fan. A nondeterministic UDF has more limitations than I expected.

override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema)
  // Register a UDF for DataFrame, and then
  // create a new column named map(predictionCol) by running the predict UDF.
  val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    if (userFeatures != null && itemFeatures != null) {
      blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
    } else {
      Float.NaN
    }
  }
  dataset
    .join(userFactors,
      checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
    .join(itemFactors,
      checkedCast(dataset($(itemCol)).cast(DoubleType)) === itemFactors("id"), "left")
    .select(dataset("*"),
      predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
}

According to the Jenkins test failure log, this is the last hurdle. However, it proves how prevalent UDF usage is; Spark users might depend on this risky feature much more than expected.
I updated
Test build #58976 has finished for PR 13087 at commit
The PySpark failure is fixed as a HOTFIX.
I think we can have a way to mark a UDF as non-deterministic, but that is too large of a change to make it the default. Also, is this an actual performance problem, or does it just look like one (and common subexpression elimination is fixing it)?
@marmbrus +1
Thank you for the review, @marmbrus and @markhamstra! Actually, it's a huge change. Although I'm not aware of the real background, the reported case can be handled by just preventing the predicate pushdown. I think we can keep the common subexpression elimination without any change. Anyway, I will revert the current investigation back to my second commit.
@marmbrus +1 I suggest changing the documentation of UDFs instead, to underline the expectation that they be deterministic, for the time being.
Thank you, @thunterdb.
@marmbrus @markhamstra @thunterdb. For
Test build #59006 has finished for PR 13087 at commit
Test build #59009 has finished for PR 13087 at commit
My main question remains: does this actually make a difference in runtime? Or is execution already smart enough to do this optimization (even if, to the user, it looks like it's getting called more than once)? Also, @dongjoon-hyun, would you have time to audit the places where UDFs are documented and add the expectation that they are deterministic?
@dongjoon-hyun I don't think we support UDFs this way in SparkR?
Thank you, @felixcheung! Then, this PR is enough for the current master branch. :)
Test build #59113 has finished for PR 13087 at commit
@@ -1756,6 +1756,7 @@ def __call__(self, *cols):
 @since(1.3)
 def udf(f, returnType=StringType()):
     """Creates a :class:`Column` expression representing a user defined function (UDF).
+    Note that the user-defined functions should be deterministic.
Thanks for doing this! I would say "must be deterministic". We might even want to say that "due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query".
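The contract described here — the engine may call a UDF fewer or more times than it appears in the query — suggests a practical pattern for functions that are expensive or side-effecting. The following is a hypothetical workaround sketch in plain Python (not a Spark API): memoizing the function by its input makes repeated invocations observably harmless, which is what the "must be deterministic" wording effectively requires.

```python
# Sketch: make an expensive/side-effecting function safe to re-invoke by
# caching results per input. (In a real cluster the cache would be
# per-executor, so this limits, not eliminates, duplicate work.)
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)
def expensive(s):
    calls.append(s)          # the side effect happens at most once per input
    return s.upper()

# The optimizer may invoke the function more times than it appears in the
# query; with memoization the observable result stays the same.
results = [expensive(x) for x in ("a", "a", "a1")]
print(results, len(calls))
```

The wrapped function still must be a pure mapping from input to output; caching only shields callers from duplicate evaluation, it cannot make a genuinely non-deterministic function safe.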
Thank you. I see. I'll fix it like that.
Hi, @marmbrus.
LGTM pending tests. @linbojin, we should also handle your use case, though maybe that should be its own JIRA. Perhaps you could open one with the information you posted here?
Test build #59147 has finished for PR 13087 at commit
Hi, @marmbrus.
Sure, that's fine.
Merging to master and 2.0.
Oops. I didn't change the title yet.
I'm not sure what happened. I'll remove this PR information from @linbojin's JIRA issue anyway.
[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions

## What changes were proposed in this pull request?
Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.

## How was this patch tested?
It's only about docs.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13087 from dongjoon-hyun/SPARK-15282.

(cherry picked from commit 37c617e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Good thing I got distracted :) I would have just changed the title while merging, though.
Thank you so much!
Thank you all for reviewing and helping with this PR!
@marmbrus @dongjoon-hyun I will add the detailed description to the old SPARK-15282 JIRA issue.
What changes were proposed in this pull request?
Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
How was this patch tested?
It's only about docs.