
[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions #13087

Closed
wants to merge 5 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 13, 2016

What changes were proposed in this pull request?

Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.

How was this patch tested?

It's only about docs.

@SparkQA

SparkQA commented May 13, 2016

Test build #58530 has finished for PR 13087 at commit 43794e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 13, 2016

Test build #58531 has finished for PR 13087 at commit 610a5b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

There are several places that assume a UDF is deterministic, so this would be a big change for users. I'll revert the change on ScalaUDF, and update this PR to change the optimizer so that it does not duplicate the UDF expression.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15282][SQL] UDF funtion is not always deterministic and need to be evaluated once. [SPARK-15282][SQL] PushDownPredicate should not duplicate UDF function expressions May 13, 2016
@dongjoon-hyun
Member Author

dongjoon-hyun commented May 13, 2016

The reported error scenario was the following.

scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", "b1"))).toDF("old","old1")
scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
scala> val newDF = df.withColumn("new", udfFunc(df("old")))
scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
scala> filteredOnNewColumnDF.show
running udf(a)
running udf(a)
running udf(a1)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+

The result of this PR is like the following.

scala> filteredOnNewColumnDF.show
running udf(a1)
running udf(a)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+
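The extra invocation comes from PushDownPredicate substituting the projected UDF expression into the pushed-down filter. A minimal Python toy model (illustrative only; `my_udf` and the helper names are made up, and this is not Spark's Catalyst code) reproduces the call counts from the session above:

```python
# Toy model of PushDownPredicate pushing a filter below a projection
# (illustrative Python only, not Spark's Catalyst code; names are made up).
calls = []

def my_udf(s):
    calls.append(s)  # track invocations, like the println in the REPL session
    return s

rows = [("a", "b"), ("a1", "b1")]

# After pushdown, the filter's reference to the projected column `new` is
# replaced by the projected expression udf(old), so the UDF is evaluated
# once in the pushed filter and again in the projection.
def project(row):
    old, old1 = row
    return (old, old1, my_udf(old))

def pushed_filter(row):
    old, _ = row
    return my_udf(old) != "a1"  # duplicated UDF expression

result = [project(r) for r in rows if pushed_filter(r)]
print(result)  # [('a', 'b', 'a')]
print(calls)   # ['a', 'a', 'a1'] -- three invocations for two input rows
```

This matches the reported session: `udf(a)` runs twice (once in the filter, once in the project) and `udf(a1)` runs once (filter only).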

@SparkQA

SparkQA commented May 13, 2016

Test build #58535 has finished for PR 13087 at commit ab5bc4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @liancheng and @cloud-fan .
This PR is similar to your commit, [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s).
This PR prevents pushing a predicate through a project that contains a UDF expression.
Could you review this PR?

@SparkQA

SparkQA commented May 16, 2016

Test build #58647 has finished for PR 13087 at commit afe5216.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin .
Could you review this PR?

@SparkQA

SparkQA commented May 19, 2016

Test build #58874 has finished for PR 13087 at commit 294042d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@linbojin
Contributor

Hi, @marmbrus
Could you review this PR?

@rxin
Contributor

rxin commented May 20, 2016

cc @cloud-fan

@@ -1025,7 +1025,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
// state and all the input rows processed before. In another word, the order of input rows
// matters for non-deterministic expressions, while pushing down predicates changes the order.
case filter @ Filter(condition, project @ Project(fields, grandChild))
if fields.forall(_.deterministic) =>
if fields.forall(_.deterministic) &&
fields.forall(_.find(_.isInstanceOf[ScalaUDF]).isEmpty) =>
Contributor

I'm not sure if I understand this correctly, do you mean ScalaUDF can be nondeterministic and we should always treat it as nondeterministic expression? If so, I think a better idea is just override deterministic in ScalaUDF and always return false.
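The guard in the diff above, and the suggestion of relying on a `deterministic` flag, can be sketched with a toy model (Python for illustration only; Catalyst's real `Expression` and rule classes differ, and the names here are made up):

```python
# Toy sketch of gating predicate pushdown on a `deterministic` flag
# (illustrative only; not Catalyst's actual Expression or Rule classes).
from dataclasses import dataclass, field

@dataclass
class Expr:
    name: str
    deterministic: bool = True
    children: list = field(default_factory=list)

    def find(self, pred):
        """Return the first subexpression matching pred, loosely mirroring
        Catalyst's Expression.find used in the diff above."""
        if pred(self):
            return self
        for c in self.children:
            hit = c.find(pred)
            if hit is not None:
                return hit
        return None

def can_push_through_project(fields):
    # Mirrors the guard: only push the filter below the projection if every
    # projected field is deterministic. If ScalaUDF overrode deterministic
    # to return false, this existing check would catch it with no extra rule.
    return all(f.deterministic for f in fields)

plain_col = Expr("old")
udf_expr = Expr("UDF(old)", deterministic=False)

print(can_push_through_project([plain_col]))            # True
print(can_push_through_project([plain_col, udf_expr]))  # False
```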

Member Author

Thank you, @cloud-fan , again!
Yep, exactly. That is what I really wanted to do, and it is what my initial commit for this PR did.

But you can see the result above.

I still think that is the correct solution; I totally agree with you. But, as you can see, it requires changing other test suites, so I thought I needed the committers' decision to do that.

Contributor

@cloud-fan cloud-fan May 20, 2016

Can you look into it to see where we made this assumption? For now I prefer to override deterministic in ScalaUDF, if it's not a lot of effort to fix this problem.

Member Author

Great! No problem. I will try to fix the other test suites correctly.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15282][SQL] PushDownPredicate should not duplicate UDF function expressions [SPARK-15282][SQL] Make ScalaUDF nondeterministic May 20, 2016
@dongjoon-hyun
Member Author

Following @cloud-fan 's advice, the goal of this PR is now to make ScalaUDF a non-deterministic expression. Although this is the correct fix, one noticeable drawback is that we can no longer use ScalaUDF in GROUP BY. I updated the description of this PR, too.

@SparkQA

SparkQA commented May 20, 2016

Test build #58969 has finished for PR 13087 at commit 94ecdd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented May 20, 2016

Hmm, @cloud-fan, there is bad news: ALS.scala uses a UDF in a LeftOuter join, so there are 7 failures in ALSSuite.scala.

override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    // Register a UDF for DataFrame, and then
    // create a new column named map(predictionCol) by running the predict UDF.
    val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
      if (userFeatures != null && itemFeatures != null) {
        blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
      } else {
        Float.NaN
      }
    }
    dataset
      .join(userFactors,
        checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
      .join(itemFactors,
        checkedCast(dataset($(itemCol)).cast(DoubleType)) === itemFactors("id"), "left")
      .select(dataset("*"),
        predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
}

Nondeterministic UDFs have more limitations than I expected.

nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
 ((IF((CAST(`user` AS DOUBLE) IS NULL), CAST(NULL AS INT), UDF(cast(user as double)))) = `id`)
in operator Join LeftOuter, Some((if (isnull(cast(user#33 as double))) null else UDF(cast(user#33 as double)) = id#19))

According to the Jenkins failure log, this is the last hurdle. However, it shows how prevalent UDF usage is; Spark users may depend on this risky behavior even more than expected.
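The analyzer restriction quoted above can be modeled roughly as follows (a Python toy sketch, not the actual CheckAnalysis code; the function name and error text are illustrative):

```python
# Toy model of the analyzer check quoted above (illustrative Python; the
# real check lives in Spark's CheckAnalysis): nondeterministic expressions
# are only allowed in Project, Filter, Aggregate, or Window operators.
ALLOWED_OPERATORS = {"Project", "Filter", "Aggregate", "Window"}

def check_nondeterministic(operator, deterministic_flags):
    """Raise if `operator` hosts a nondeterministic expression but may not."""
    if operator not in ALLOWED_OPERATORS and not all(deterministic_flags):
        raise ValueError(
            "nondeterministic expressions are only allowed in "
            "Project, Filter, Aggregate or Window, found one in " + operator
        )

check_nondeterministic("Filter", [False])    # allowed
try:
    check_nondeterministic("Join", [False])  # rejected, as in the ALS failure
except ValueError as e:
    print("rejected:", e)
```

This is why making ScalaUDF unconditionally non-deterministic breaks the ALS code path: its UDF-based predicate sits in a LeftOuter join condition.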

@dongjoon-hyun
Member Author

I updated ALS and ALSSuite just to pass Jenkins, so we can continue the discussion.

@SparkQA

SparkQA commented May 20, 2016

Test build #58976 has finished for PR 13087 at commit 1bf8b43.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The PySpark failure has been fixed as a HOTFIX.

@marmbrus
Contributor

I think we can have a way to mark a UDF as non-deterministic, but that is too large of a change to make it the default.

Also, is this an actual performance problem, or does it just look like one (and common subexpression elimination is fixing it)?

@markhamstra
Contributor

@marmbrus +1

@dongjoon-hyun
Member Author

Thank you for review, @marmbrus and @markhamstra !

Actually, it's a huge change. Although I'm not aware of the full background, the reported case can be handled simply by preventing PushDownPredicate from duplicating UDF expressions.

I think we can keep common subexpression elimination without any change.

Anyway, I will revert back to my second commit.

@thunterdb
Contributor

@marmbrus +1

I suggest changing the documentation of UDFs instead, to underline the expectation that they be deterministic for the time being.

@dongjoon-hyun
Member Author

Thank you, @thunterdb .
By the way, do you mean we should resolve the JIRA issue as WONTFIX?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15282][SQL] Make ScalaUDF nondeterministic [SPARK-15282][SQL] PushDownPredicate should not duplicate UDF function expressions May 20, 2016
@dongjoon-hyun
Member Author

dongjoon-hyun commented May 20, 2016

@marmbrus @markhamstra @thunterdb .
The code and description of this PR have now been rolled back to my second commit from 7 days ago.

As for common subexpression elimination, I think it's fine if an optimizer reduces the number of UDF calls; that is to be expected. What do you think?

@SparkQA

SparkQA commented May 20, 2016

Test build #59006 has finished for PR 13087 at commit bb164ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2016

Test build #59009 has finished for PR 13087 at commit 9b81251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

My main question remains: does this actually make a difference at runtime? Or is execution already smart enough to do this optimization (even if, to the user, it looks like the UDF is getting called more than once)?

Also, @dongjoon-hyun would you have time to audit places where UDFs are documented and add the expectation that they are deterministic?

@felixcheung
Member

@dongjoon-hyun I don't think we support UDF this way in SparkR?
Maybe something we could add to dapply? @sun-rui

@dongjoon-hyun
Member Author

Thank you, @felixcheung ! Then this PR is enough for the current master branch. :)
If SparkR adds something related to this in the future, we can add a note at that time.

@SparkQA

SparkQA commented May 22, 2016

Test build #59113 has finished for PR 13087 at commit 7c11e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1756,6 +1756,7 @@ def __call__(self, *cols):
@since(1.3)
def udf(f, returnType=StringType()):
"""Creates a :class:`Column` expression representing a user defined function (UDF).
Note that the user-defined functions should be deterministic.
Contributor

Thanks for doing this! I would say must be deterministic. We might even want to say that "due to optimization duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query".

Member Author

Thank you, I see. I'll fix it like that.

@dongjoon-hyun
Member Author

Hi, @marmbrus .
I replaced 'should' with 'must', and added the detailed description to functions.py, SQLContext.scala, SparkSession.scala, and UserDefinedFunction.scala, too.

@marmbrus
Contributor

LGTM pending tests.

@linbojin, we should also handle your use case, though maybe that should be its own JIRA. Perhaps you could open one with the information you posted here?

@SparkQA

SparkQA commented May 23, 2016

Test build #59147 has finished for PR 13087 at commit c1f92c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @marmbrus .
Instead of creating a new JIRA, I think we had better change the title of this PR to [MINOR][SQL][DOC] ....
Initially, I tried to handle @linbojin 's SPARK-15282, but this PR no longer does.
What do you think about that? Of course, @linbojin can add the detailed description to that JIRA issue.

@marmbrus
Contributor

Sure, that's fine.

@marmbrus
Contributor

merging to master and 2.0

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15282][SQL][DOCS] Add notes of the deterministic assumption on UDF functions [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions May 23, 2016
@dongjoon-hyun
Member Author

Oops. I didn't change the title yet.

@dongjoon-hyun
Member Author

I'm not sure what happened. I'll remove this PR information from @linbojin 's JIRA issue anyway.

asfgit pushed a commit that referenced this pull request May 23, 2016
…unctions

## What changes were proposed in this pull request?

Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.

## How was this patch tested?

It's only about docs.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13087 from dongjoon-hyun/SPARK-15282.

(cherry picked from commit 37c617e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus
Copy link
Contributor

Good thing I got distracted :) I would have just changed the title while merging though.

@dongjoon-hyun
Member Author

Thank you so much!

@asfgit asfgit closed this in 37c617e May 23, 2016
@dongjoon-hyun
Member Author

Thank you all for reviewing and helping this PR!

@linbojin
Contributor

@marmbrus @dongjoon-hyun I will add the detail description to the old SPARK-15282 JIRA issue.
