
[SPARK-18884][SQL] Throw an exception in compile time if Array[_] used in ScalaUDF #16605

Closed (8 commits)

Conversation

maropu
Member

@maropu maropu commented Jan 16, 2017

What changes were proposed in this pull request?

This PR modifies the code to throw an exception at compile time if Array[_] is used in the arguments of ScalaUDF. Currently, the query below throws an exception at runtime if we use that type in a ScalaUDF:

scala> import org.apache.spark.sql.execution.debug._
scala> Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar")).write.mode("overwrite").parquet("/Users/maropu/Desktop/data/")
scala> val df = spark.read.load("/Users/maropu/Desktop/data/")
scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))
scala> val testArrayUdf = udf { (ar: Array[Int]) => ar.sum }
scala> df.select(testArrayUdf($"ar")).show

Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
  at $anonfun$1.apply(<console>:23)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
  ... 99 more
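The failure above can be reproduced without Spark. Catalyst hands the UDF body a Seq-like wrapper rather than a JVM primitive array (`[I`), so the generated cast fails. A minimal, Spark-free sketch (the plain `Seq` here stands in for the wrapper Catalyst actually produces):

```scala
// A Seq value standing in for the wrapper Catalyst hands to the UDF body.
// It is not a JVM int array ([I), so the cast below must fail at runtime.
val fromCatalyst: Any = Seq(0, 1)

val castFailed =
  try { fromCatalyst.asInstanceOf[Array[Int]].sum; false }
  catch { case _: ClassCastException => true }

println(castFailed) // true: the same exception class as in the stack trace
```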

How was this patch tested?

Added tests in DataFrameSuite.

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71449 has finished for PR 16605 at commit f2cf910.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 16, 2017

I'm looking for another approach that does not break backward compatibility...

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71454 has finished for PR 16605 at commit eb7162a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71456 has finished for PR 16605 at commit 1a840eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71468 has finished for PR 16605 at commit 581c7fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 17, 2017

@dongjoon-hyun Could you take time to review this before committers do? Thanks!

@dongjoon-hyun
Member

Sure, @maropu . I'll do that tomorrow morning (PST).

@maropu
Member Author

maropu commented Jan 17, 2017

many thanks!


@dongjoon-hyun dongjoon-hyun left a comment


Hi, @maropu. This part consists of generated code, so could you also update the template in the comment, and make the newly added code follow the same generated syntax?

udf { (ar1: Seq[Int], ar2: Seq[Int], ar3: Seq[Int], ar4: Seq[Int], ar5: Seq[Int], ar6: Seq[Int], ar7: Seq[Int], ar8: Seq[Int], ar9: Seq[Int], ar10: Seq[Int]) => (ar1 ++ ar2 ++ ar3 ++ ar4 ++ ar5 ++ ar6 ++ ar7 ++ ar8 ++ ar9 ++ ar10).sum }
)
).map { case (udf1, udf2, udf3, udf4, udf5, udf6, udf7, udf8, udf9, udf10) =>
val arVal = functions.array(lit(1), lit(1))
Member


Could you change this to access the column value instead of Literal?

Member Author


You mean something like this?


val testUdf = udf { (ar: Array[Long]) => ar.sum }
val df = spark.range(10).select(array('id, 'id).as("arVal"))
checkAnswer(df.select(testUdf($"arVal")), Row(2) :: Nil)

Member


+1. Yes.

Member Author


okay!

val inputConverters = Try(
ScalaReflection.scalaConverterFor(typeTag[A1]) ::
Nil
).toOption
Member


Please update the template in the comment, and make val inputConverters a single line, like val inputTypes on line 3075.

Member Author


okay, I'll update

@@ -137,7 +137,11 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
def register[RT: TypeTag, A1: TypeTag](name: String, func: Function1[A1, RT]): UserDefinedFunction = {
val dataType = ScalaReflection.schemaFor[RT].dataType
val inputTypes = Try(ScalaReflection.schemaFor[A1].dataType :: Nil).toOption
def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil))
val inputConverters = Try(
Member


Please insert inputConverters into the template comment, and make inputConverters a single line, like line 139.

@@ -84,7 +86,9 @@ case class ScalaUDF(
case 1 =>
val func = function.asInstanceOf[(Any) => Any]
val child0 = children(0)
-      lazy val converter0 = CatalystTypeConverters.createToScalaConverter(child0.dataType)
+      lazy val converter0 = inputConverters.map {
Member


Also, please update the template comment and follow the similar syntax.

Member


Hi, I think you missed this comment.

Member Author


oh, sorry. I'll do it soon.

Member Author


okay, fixed!

@maropu
Member Author

maropu commented Jan 18, 2017

@dongjoon-hyun Okay, I applied your comments to this PR. Could you check again whether it matches your intent?

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71568 has finished for PR 16605 at commit 22fb9d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71570 has finished for PR 16605 at commit c5d8070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 18, 2017

Hi, @maropu .

First of all, I generally agree with you on the purpose of this PR.

However, for your failure example, we can simply do the following in the current master. I'm wondering what you think about this.

scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))

scala> val testArrayUdf = udf { (ar: scala.collection.mutable.WrappedArray[Int]) => ar.sum }
testArrayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))

scala> df.select(testArrayUdf($"ar")).show
+-------+
|UDF(ar)|
+-------+
|      1|
+-------+

PS. For ScalaUDF.scala, please see my above comment again.

@maropu
Member Author

maropu commented Jan 19, 2017

Oh, yeah. I didn't know about that, and I think it's a good point.
IMO, WrappedArray is used internally for implicit conversions, so users do not use WrappedArray directly in UDFs in most cases.

Anyway, thanks a lot for your reviews!
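The point about implicit conversions can be seen in plain Scala, independent of Spark: an Array passed where a Seq is expected is wrapped automatically, which is why users almost never name the wrapper type themselves. A minimal sketch:

```scala
// An Array passed where a Seq is expected is implicitly wrapped,
// so the wrapper type never appears in user code.
def total(xs: Seq[Int]): Int = xs.sum

val arr = Array(1, 2, 3)
val result = total(arr) // implicit Array -> Seq wrapping happens here
println(result) // 6
```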

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 19, 2017

Sure, @maropu. WrappedArray is not well documented at the moment.

Hi, @gatorsmile and @cloud-fan .
Could you review this PR?

@cloud-fan
Contributor

Well, it would be good if we could support Array in ScalaUDF, but it's not a big deal, as users can easily do udf { (seq: Seq[Int]) => val a = seq.toArray; // do anything you like with the array }.

Considering the size of this PR, I don't think it's worth it.
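Stripped of Spark so it runs standalone, this workaround amounts to declaring the parameter as Seq and converting inside the body. The function below is what would be passed to udf { ... }; the name is illustrative:

```scala
// Declare the input as Seq (which Spark can supply), and convert to
// Array inside the body when array-specific code is needed.
val sumViaArray: Seq[Int] => Int = { seq =>
  val a: Array[Int] = seq.toArray // explicit, cheap copy
  a.sum
}

println(sumViaArray(Seq(1, 2, 3))) // 6
```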

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71631 has finished for PR 16605 at commit 35715a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 19, 2017

Okay. But once this issue is finished, I'm planning to take SPARK-12823 in a similar way.
Do you also think it's not worth trying for struct? cc: @cloud-fan @gatorsmile

@maropu
Member Author

maropu commented Jan 19, 2017

The workaround @cloud-fan described works for me, though IMO the most critical issue here is that this cast exception happens not in the analysis phase but at runtime. So, at minimum, I think we should modify the code to throw an exception in the analysis phase, with a message like "you should use Seq[T] instead of Array[T]". I think we could do this with much less code. Thoughts?
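A sketch of this direction, reduced to plain Scala reflection: checkUdfInput below is a hypothetical helper, not Spark's API; the real change would live where udf/register capture the TypeTags. It rejects Array[_] parameter types when the UDF is defined, so the error surfaces before any query runs:

```scala
import scala.reflect.runtime.universe._

// Hypothetical helper: reject Array[_] parameter types at UDF definition
// time, so the error surfaces before query execution.
def checkUdfInput[A: TypeTag](): Unit = {
  val tpe = typeOf[A]
  if (tpe.typeSymbol.fullName == "scala.Array") {
    throw new UnsupportedOperationException(
      s"Array[_] is not supported as a ScalaUDF input; use Seq[_] instead (got $tpe)")
  }
}

val rejected =
  try { checkUdfInput[Array[Int]](); false }
  catch { case _: UnsupportedOperationException => true }

checkUdfInput[Seq[Int]]() // accepted silently
println(rejected) // true
```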

@cloud-fan
Contributor

SGTM

@maropu
Member Author

maropu commented Jan 20, 2017

okay, I'll update this pr in that way, thanks!

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71729 has finished for PR 16605 at commit bc40736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71934 has finished for PR 16605 at commit c16b121.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71935 has finished for PR 16605 at commit f20de2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the SPARK-18884 branch 2 times, most recently from a738158 to 94902ce on January 24, 2017 16:43
@SparkQA

SparkQA commented Jan 24, 2017

Test build #71939 has finished for PR 16605 at commit a738158.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71940 has finished for PR 16605 at commit 94902ce.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71952 has finished for PR 16605 at commit bd1773b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-18884][SQL] Support Array[_] in ScalaUDF [SPARK-18884][SQL] Throw an exception in compile time if Array[_] used in ScalaUDF Jan 25, 2017
@SparkQA

SparkQA commented Feb 4, 2017

Test build #72350 has finished for PR 16605 at commit 4424be8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72354 has finished for PR 16605 at commit f1fcfc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72366 has finished for PR 16605 at commit 89a98a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72369 has finished for PR 16605 at commit ada9237.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 13, 2017

@cloud-fan Could you give me more insight on this?

@maropu
Member Author

maropu commented Mar 2, 2017

@cloud-fan ping

1 similar comment
@maropu
Member Author

maropu commented Mar 21, 2017

@cloud-fan ping

@maropu
Member Author

maropu commented Jul 18, 2018

I'll close this for now.

@maropu maropu closed this Jul 18, 2018