
[SPARK-24148][SQL] Overloading array function to support typed empty arrays #21215

Closed

Conversation

@mn-mikke (Contributor) commented May 2, 2018

What changes were proposed in this pull request?

The PR proposes to overload the array function and allow users to specify the element type of empty arrays. Currently, empty arrays produced by the array function have StringType elements and there is no way to cast them to a different type.

A perfect example of the use case is when(cond, trueExp).otherwise(falseExp), which expects trueExp and falseExp to be of the same type. In a scenario where one of these branches needs to produce an empty array, there is currently no other way than creating a UDF.
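
A minimal sketch of the current behaviour and of the UDF workaround mentioned above (the object name and the emptyIntArray UDF are purely illustrative, not part of this PR):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, udf}

object TypedEmptyArraySketch extends App {
  val spark = SparkSession.builder().appName("").master("local[*]").getOrCreate()

  // array() is always typed as array<string>, regardless of the surrounding expression.
  spark.range(1).select(array()).printSchema()

  // Current workaround: a UDF returning an empty Seq of the desired element type,
  // which can then serve as the "empty array" branch of when(...).otherwise(...).
  val emptyIntArray = udf(() => Seq.empty[Int])
  spark.range(1).select(emptyIntArray().as("xs")).printSchema()  // xs: array<int>
}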

How was this patch tested?

Added test cases to DataFrameComplexTypeSuite.

Note

Eventually, I will add a wrapper for PySpark, but would like to discuss the idea first.

@mn-mikke (Contributor, Author) commented May 2, 2018

@gatorsmile (Member)

ok to test

@SparkQA commented May 2, 2018

Test build #90073 has finished for PR 21215 at commit 44b1852.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      case class CreateArray(children: Seq[Expression], defaultElementType: DataType = StringType)

@SparkQA commented May 3, 2018

Test build #90079 has finished for PR 21215 at commit 9c12457.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 3, 2018

retest this please

@maropu (Member) commented May 3, 2018

Do you wanna do this?

scala> sql("select array()").printSchema
root
 |-- array(): array (nullable = false)
 |    |-- element: string (containsNull = false)


scala> sql("select CAST(array() AS ARRAY<INT>) c").printSchema
root
 |-- c: array (nullable = false)
 |    |-- element: integer (containsNull = true)


scala> sql("select CAST(array() AS ARRAY<INT>) c").show
+---+
|  c|
+---+
| []|
+---+

@SparkQA commented May 3, 2018

Test build #90086 has finished for PR 21215 at commit 9c12457.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lokm01 commented May 3, 2018

Hey @maropu,

So we've encountered a number of issues with casting:

  1. Casting an empty array to an array of primitive types caused an exception on 2.2.1, but works on 2.3.0+, so that's sorted.

  2. We're still facing an issue on 2.3.0 when we try to cast an empty array to an array of complex types. See the following example:

import org.apache.spark.sql.SparkSession

case class Outer(a: List[Inner])
case class Inner(b: Int, c: String)

object App4 extends App {
  val spark = SparkSession.builder().appName("").master("local[*]").getOrCreate()

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Empty DataFrame used only to obtain the schema of the complex array column "a".
  val df = spark.createDataFrame(Seq[Outer]())

  // Casting an empty array() to that complex array type fails analysis.
  val r = spark.range(100).select(array().cast(df.schema("a").dataType))

  r.printSchema()
  r.show
}

This code produces

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array to array<struct<b:int,c:string>>;;

@maropu (Member) commented May 3, 2018

How about this?


scala> val df = Seq(Outer(Seq.empty[Inner]), Outer(Seq.empty[Inner])).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<struct<b:int,c:string>>]

scala> df.printSchema
root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = false)
 |    |    |-- c: string (nullable = true)

scala> df.show
+---+
|  a|
+---+
| []|
| []|
+---+


scala> val df = Seq(1, 2, 3).toDF("a").withColumn("b", typedLit(Seq.empty[Inner]))
df: org.apache.spark.sql.DataFrame = [a: int, b: array<struct<b:int,c:string>>]

scala> df.printSchema
root
 |-- a: integer (nullable = false)
 |-- b: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = false)
 |    |    |-- c: string (nullable = true)

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1| []|
|  2| []|
|  3| []|
+---+---+

@lokm01 commented May 3, 2018

@maropu That would work if you had Scala case classes for all the types. In our case, we're working on a generic framework where we only have Spark schemas (and I'd rather not generate case classes at runtime).

Can you suggest an existing way to do this using Spark's DataType, please?

@maropu (Member) commented May 3, 2018

Like this?

scala> val structTy = StructType.fromDDL("a ARRAY<STRUCT<b: INT, c: STRING>>")
structTy: org.apache.spark.sql.types.StructType = StructType(StructField(a,ArrayType(StructType(StructField(b,IntegerType,true), StructField(c,StringType,true)),true),true))

scala> val newCol = new Column(Literal.create(Seq.empty[Inner], structTy.head.dataType))
newCol: org.apache.spark.sql.Column = []

scala> val df = Seq(1, 2, 3).toDF("a").withColumn("b", newCol)
df: org.apache.spark.sql.DataFrame = [a: int, b: array<struct<b:int,c:string>>]

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1| []|
|  2| []|
|  3| []|
+---+---+


scala> df.printSchema
root
 |-- a: integer (nullable = false)
 |-- b: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- c: string (nullable = true)
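
(A possible variant of the above, not shown in the thread: since the Seq is empty, no element of the Scala value is ever converted, so the same approach should also work without defining the Inner case class at all; only the DataType taken from the schema matters. Sketch:)

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StructType

val structTy = StructType.fromDDL("a ARRAY<STRUCT<b: INT, c: STRING>>")
// Untyped empty Seq: there are no elements to convert, so no case class is required.
val emptyCol = new Column(Literal.create(Seq.empty, structTy.head.dataType))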

@lokm01 commented May 3, 2018

@maropu Thanks! Didn't know about creating a literal this way.

Don't you feel that the suggested change is way more elegant?

@mn-mikke (Contributor, Author) commented May 3, 2018

@maropu Really nice idea to create typed empty arrays via a Literal expression! On the other hand, I feel that end users shouldn't have to work with classes from Catalyst internals if we consider the creation of typed empty arrays to be an elementary operation.

I've tailored the solution according to your suggestion, but I still think a dedicated function should be introduced. What do you think?

@SparkQA commented May 3, 2018

Test build #90137 has finished for PR 21215 at commit e151ab7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 15, 2019

Test build #104578 has finished for PR 21215 at commit 24a98e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Apr 15, 2019

What about Map types too -- same issue? I just wonder if it's worth a whole new API method, given that there's already a way to express this if really needed, or possibly a way to simply rewrite the code to avoid it?

@mn-mikke (Contributor, Author)

@srowen Yep, Map types suffer from the same problem. Users could eventually create a column of empty typed maps by combining this function with map_from_arrays, but having a separate method would be better.

This feature is definitely just a nice-to-have since users can use Literal.create directly, so I wouldn't mind if you wanted me to close the PR without merging. On the other hand, IMHO users shouldn't be forced to call methods from org.apache.spark.sql.catalyst.expressions; that doesn't sound like a public API.
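
(For reference, the typedLit workaround shown earlier in the thread also covers empty typed maps; a sketch, assuming an active SparkSession with spark.implicits._ in scope and illustrative column names:)

import org.apache.spark.sql.functions.typedLit

// typedLit derives the Catalyst type from the Scala type,
// so the empty Map literal comes out as map<int,string>.
val withEmptyMap = Seq(1, 2, 3).toDF("a").withColumn("m", typedLit(Map.empty[Int, String]))
withEmptyMap.printSchema()   // m: map (key: integer, value: string)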

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

I think we should rather have array<null>, which could be type-coerced to any other array type. Since the workarounds were provided, shall we close this PR? I don't think we'll have to expose another API.

@srowen closed this Nov 30, 2019