
[SPARK-18884][SQL] Throw an exception in compile time if Array[_] used in ScalaUDF #16605

Closed (8 commits)

Conversation

maropu
Member

@maropu maropu commented Jan 16, 2017

What changes were proposed in this pull request?

This PR modifies the code to throw an exception at compile time if Array[_] is used in the arguments of ScalaUDF. Currently, the query below throws an exception at runtime if we use that type in a ScalaUDF:

scala> import org.apache.spark.sql.execution.debug._
scala> Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar")).write.mode("overwrite").parquet("/Users/maropu/Desktop/data/")
scala> val df = spark.read.load("/Users/maropu/Desktop/data/")
scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))
scala> val testArrayUdf = udf { (ar: Array[Int]) => ar.sum }
scala> df.select(testArrayUdf($"ar")).show

Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
  at $anonfun$1.apply(<console>:23)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
  ... 99 more
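The failure above can be reproduced without Spark. Catalyst hands the UDF body a Seq-like wrapper rather than a JVM primitive array (`[I`), so the generated cast fails. A minimal, Spark-free sketch (the plain `Seq` here stands in for the wrapper Catalyst actually produces):

```scala
// A Seq value standing in for the wrapper Catalyst hands to the UDF body.
// It is not a JVM int array ([I), so the cast below must fail at runtime.
val fromCatalyst: Any = Seq(0, 1)

val castFailed =
  try { fromCatalyst.asInstanceOf[Array[Int]].sum; false }
  catch { case _: ClassCastException => true }

println(castFailed) // true: the same exception class as in the stack trace
```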

How was this patch tested?

Added tests in DataFrameSuite.

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71449 has finished for PR 16605 at commit f2cf910.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 16, 2017

I'm looking for another approach that does not break backward compatibility...

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71454 has finished for PR 16605 at commit eb7162a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71456 has finished for PR 16605 at commit 1a840eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71468 has finished for PR 16605 at commit 581c7fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 17, 2017

@dongjoon-hyun Could you take time to review this before committers do? Thanks!

@dongjoon-hyun
Member

Sure, @maropu . I'll do that tomorrow morning (PST).

@maropu
Member Author

maropu commented Jan 17, 2017

many thanks!


@dongjoon-hyun dongjoon-hyun left a comment


Hi, @maropu. This part consists of generated code, so could you also update the template in the comment, and make the newly added code follow the same generated syntax?

udf { (ar1: Seq[Int], ar2: Seq[Int], ar3: Seq[Int], ar4: Seq[Int], ar5: Seq[Int], ar6: Seq[Int], ar7: Seq[Int], ar8: Seq[Int], ar9: Seq[Int], ar10: Seq[Int]) => (ar1 ++ ar2 ++ ar3 ++ ar4 ++ ar5 ++ ar6 ++ ar7 ++ ar8 ++ ar9 ++ ar10).sum }
)
).map { case (udf1, udf2, udf3, udf4, udf5, udf6, udf7, udf8, udf9, udf10) =>
val arVal = functions.array(lit(1), lit(1))
Member


Could you change this to access the column value instead of Literal?

Member Author


You mean something like this?


val testUdf = udf { (ar: Array[Long]) => ar.sum }
val df = spark.range(10).select(array('id, 'id).as("arVal"))
checkAnswer(df.select(testUdf($"arVal")), Row(2) :: Nil)

Member


+1. Yes.

Member Author


okay!

val inputConverters = Try(
ScalaReflection.scalaConverterFor(typeTag[A1]) ::
Nil
).toOption
Member


Please update the template in the comment, and make val inputConverters a single line, like val inputTypes on line 3075.

Member Author


okay, I'll update

@@ -137,7 +137,11 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
def register[RT: TypeTag, A1: TypeTag](name: String, func: Function1[A1, RT]): UserDefinedFunction = {
val dataType = ScalaReflection.schemaFor[RT].dataType
val inputTypes = Try(ScalaReflection.schemaFor[A1].dataType :: Nil).toOption
def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil))
val inputConverters = Try(
Member


Please insert inputConverters into the template comment, and make inputConverters a single line, like line 139.

@@ -84,7 +86,9 @@ case class ScalaUDF(
case 1 =>
val func = function.asInstanceOf[(Any) => Any]
val child0 = children(0)
-      lazy val converter0 = CatalystTypeConverters.createToScalaConverter(child0.dataType)
+      lazy val converter0 = inputConverters.map {
Member


Also, please update the template comment and follow the similar syntax.

Member


Hi, I think you missed this comment.

Member Author


oh, sorry. I'll do it soon.

Member Author


okay, fixed!

@maropu
Member Author

maropu commented Jan 18, 2017

@dongjoon-hyun Okay, I applied your comments to this PR. Could you check again whether it matches your intent?

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71568 has finished for PR 16605 at commit 22fb9d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71570 has finished for PR 16605 at commit c5d8070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 18, 2017

Hi, @maropu .

First of all, I generally agree with you on the purpose of this PR.

However, for your failure example, we can simply do the following in the current master. I'm wondering what you think about this.

scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))

scala> val testArrayUdf = udf { (ar: scala.collection.mutable.WrappedArray[Int]) => ar.sum }
testArrayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))

scala> df.select(testArrayUdf($"ar")).show
+-------+
|UDF(ar)|
+-------+
|      1|
+-------+

PS. For ScalaUDF.scala, please see my above comment again.

@maropu
Member Author

maropu commented Jan 19, 2017

Oh, yeah. I didn't know about that, and I think it's a good point.
IMO, WrappedArray is used internally for implicit conversions, so users do not use WrappedArray directly in UDFs in most cases.

Anyway, thanks a lot for your reviews!
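The point about implicit conversions can be seen in plain Scala, independent of Spark: an Array passed where a Seq is expected is wrapped automatically, which is why users almost never name the wrapper type themselves. A minimal sketch:

```scala
// An Array passed where a Seq is expected is implicitly wrapped,
// so the wrapper type never appears in user code.
def total(xs: Seq[Int]): Int = xs.sum

val arr = Array(1, 2, 3)
val result = total(arr) // implicit Array -> Seq wrapping happens here
println(result) // 6
```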

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 19, 2017

Sure, @maropu. WrappedArray is not well documented at the moment.

Hi, @gatorsmile and @cloud-fan .
Could you review this PR?

@cloud-fan
Contributor

Well, it would be good if we could support Array in ScalaUDF, but it's not a big deal, as users can easily do udf { (seq: Seq[Int]) => val a = seq.toArray; // do anything you like with the array }.

Considering the size of this PR, I don't think it's worth it.
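Stripped of Spark so it runs standalone, this workaround amounts to declaring the parameter as Seq and converting inside the body. The function below is what would be passed to udf { ... }; the name is illustrative:

```scala
// Declare the input as Seq (which Spark can supply), and convert to
// Array inside the body when array-specific code is needed.
val sumViaArray: Seq[Int] => Int = { seq =>
  val a: Array[Int] = seq.toArray // explicit, cheap copy
  a.sum
}

println(sumViaArray(Seq(1, 2, 3))) // 6
```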

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71631 has finished for PR 16605 at commit 35715a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 19, 2017

Okay. But once this issue is finished, I'm planning to take SPARK-12823 in a similar way.
Do you also think it's not worth trying for struct? cc: @cloud-fan @gatorsmile

@maropu
Member Author

maropu commented Jan 19, 2017

The workaround @cloud-fan described works for me, though IMO the most critical issue here is that this cast exception happens not in the analysis phase but at runtime. So, at minimum, I think we should modify the code to throw an exception in the analysis phase, with a message like "you should use Seq[T] instead of Array[T]". I think we could do this with much less code. Thoughts?
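A sketch of this direction, reduced to plain Scala reflection: checkUdfInput below is a hypothetical helper, not Spark's API; the real change would live where udf/register capture the TypeTags. It rejects Array[_] parameter types when the UDF is defined, so the error surfaces before any query runs:

```scala
import scala.reflect.runtime.universe._

// Hypothetical helper: reject Array[_] parameter types at UDF definition
// time, so the error surfaces before query execution.
def checkUdfInput[A: TypeTag](): Unit = {
  val tpe = typeOf[A]
  if (tpe.typeSymbol.fullName == "scala.Array") {
    throw new UnsupportedOperationException(
      s"Array[_] is not supported as a ScalaUDF input; use Seq[_] instead (got $tpe)")
  }
}

val rejected =
  try { checkUdfInput[Array[Int]](); false }
  catch { case _: UnsupportedOperationException => true }

checkUdfInput[Seq[Int]]() // accepted silently
println(rejected) // true
```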

@cloud-fan
Contributor

SGTM

@maropu
Member Author

maropu commented Jan 20, 2017

okay, I'll update this pr in that way, thanks!

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71729 has finished for PR 16605 at commit bc40736.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71934 has finished for PR 16605 at commit c16b121.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71935 has finished for PR 16605 at commit f20de2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the SPARK-18884 branch 2 times, most recently from a738158 to 94902ce on January 24, 2017 16:43
@SparkQA

SparkQA commented Jan 24, 2017

Test build #71939 has finished for PR 16605 at commit a738158.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71940 has finished for PR 16605 at commit 94902ce.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71952 has finished for PR 16605 at commit bd1773b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-18884][SQL] Support Array[_] in ScalaUDF [SPARK-18884][SQL] Throw an exception in compile time if Array[_] used in ScalaUDF Jan 25, 2017
@SparkQA

SparkQA commented Feb 4, 2017

Test build #72350 has finished for PR 16605 at commit 4424be8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72354 has finished for PR 16605 at commit f1fcfc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72366 has finished for PR 16605 at commit 89a98a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72369 has finished for PR 16605 at commit ada9237.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 13, 2017

@cloud-fan Could you give me more insight on this?

@maropu
Member Author

maropu commented Mar 2, 2017

@cloud-fan ping

1 similar comment
@maropu
Member Author

maropu commented Mar 21, 2017

@cloud-fan ping

@maropu
Member Author

maropu commented Jul 18, 2018

I'll close this for now.

@maropu maropu closed this Jul 18, 2018