
[SPARK-24148][SQL] Overloading array function to support typed empty arrays #21215

Closed

Conversation

@mn-mikke (Contributor) commented May 2, 2018

What changes were proposed in this pull request?

The PR proposes to overload the array function and allow users to specify the element type of empty arrays. Currently, empty arrays produced by the array function have StringType elements and there is no way to cast them to a different type.

A perfect example of the use case is when(cond, trueExp).otherwise(falseExp), which expects trueExp and falseExp to be of the same type. In a scenario where one of these branches needs to produce an empty array, there is currently no other way than creating a UDF.
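
A minimal sketch of the current behaviour and of the UDF workaround mentioned above (the object name and the emptyIntArray UDF are purely illustrative, not part of this PR):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, udf}

object TypedEmptyArraySketch extends App {
  val spark = SparkSession.builder().appName("").master("local[*]").getOrCreate()

  // array() is always typed as array<string>, regardless of the surrounding expression.
  spark.range(1).select(array()).printSchema()

  // Current workaround: a UDF returning an empty Seq of the desired element type,
  // which can then serve as the "empty array" branch of when(...).otherwise(...).
  val emptyIntArray = udf(() => Seq.empty[Int])
  spark.range(1).select(emptyIntArray().as("xs")).printSchema()  // xs: array<int>
}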

How was this patch tested?

Added test cases to DataFrameComplexTypeSuite.

Note

Eventually, I will add a wrapper for PySpark, but would like to discuss the idea first.

@mn-mikke (Contributor, Author) commented May 2, 2018

@gatorsmile (Member)

ok to test

@SparkQA commented May 2, 2018

Test build #90073 has finished for PR 21215 at commit 44b1852.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      case class CreateArray(children: Seq[Expression], defaultElementType: DataType = StringType)

@SparkQA commented May 3, 2018

Test build #90079 has finished for PR 21215 at commit 9c12457.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 3, 2018

retest this please

@maropu (Member) commented May 3, 2018

Do you wanna do this?

scala> sql("select array()").printSchema
root
 |-- array(): array (nullable = false)
 |    |-- element: string (containsNull = false)


scala> sql("select CAST(array() AS ARRAY<INT>) c").printSchema
root
 |-- c: array (nullable = false)
 |    |-- element: integer (containsNull = true)


scala> sql("select CAST(array() AS ARRAY<INT>) c").show
+---+
|  c|
+---+
| []|
+---+

@SparkQA commented May 3, 2018

Test build #90086 has finished for PR 21215 at commit 9c12457.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lokm01 commented May 3, 2018

Hey @maropu,

So we've encountered a number of issues with casting:

  1. Casting an empty array to an array of primitive types caused an exception on 2.2.1, but works on 2.3.0+, so that's sorted.

  2. We're still facing an issue on 2.3.0 when we try to cast an empty array to an array of complex types. See the following example:

import org.apache.spark.sql.SparkSession

case class Outer(a: List[Inner])
case class Inner(b: Int, c: String)

object App4 extends App {
  val spark = SparkSession.builder().appName("").master("local[*]").getOrCreate()

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Empty DataFrame used only to obtain the schema of the complex array column "a".
  val df = spark.createDataFrame(Seq[Outer]())

  // Casting an empty array() to that complex array type fails analysis.
  val r = spark.range(100).select(array().cast(df.schema("a").dataType))

  r.printSchema()
  r.show
}

This code produces

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array to array<struct<b:int,c:string>>;;

@maropu (Member) commented May 3, 2018

How about this?


scala> val df = Seq(Outer(Seq.empty[Inner]), Outer(Seq.empty[Inner])).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<struct<b:int,c:string>>]

scala> df.printSchema
root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = false)
 |    |    |-- c: string (nullable = true)

scala> df.show
+---+
|  a|
+---+
| []|
| []|
+---+


scala> val df = Seq(1, 2, 3).toDF("a").withColumn("b", typedLit(Seq.empty[Inner]))
df: org.apache.spark.sql.DataFrame = [a: int, b: array<struct<b:int,c:string>>]

scala> df.printSchema
root
 |-- a: integer (nullable = false)
 |-- b: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = false)
 |    |    |-- c: string (nullable = true)

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1| []|
|  2| []|
|  3| []|
+---+---+

@lokm01 commented May 3, 2018

@maropu That would work if you had Scala case classes for all the types. In our case, we're working on a generic framework where we only have Spark schemas (and I'd rather not generate case classes at runtime).

Can you suggest an existing way to do this using Spark's DataType, please?

@maropu (Member) commented May 3, 2018

Like this?

scala> val structTy = StructType.fromDDL("a ARRAY<STRUCT<b: INT, c: STRING>>")
structTy: org.apache.spark.sql.types.StructType = StructType(StructField(a,ArrayType(StructType(StructField(b,IntegerType,true), StructField(c,StringType,true)),true),true))

scala> val newCol = new Column(Literal.create(Seq.empty[Inner], structTy.head.dataType))
newCol: org.apache.spark.sql.Column = []

scala> val df = Seq(1, 2, 3).toDF("a").withColumn("b", newCol)
df: org.apache.spark.sql.DataFrame = [a: int, b: array<struct<b:int,c:string>>]

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1| []|
|  2| []|
|  3| []|
+---+---+


scala> df.printSchema
root
 |-- a: integer (nullable = false)
 |-- b: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- c: string (nullable = true)
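
(A possible variant of the above, not shown in the thread: since the Seq is empty, no element of the Scala value is ever converted, so the same approach should also work without defining the Inner case class at all; only the DataType taken from the schema matters. Sketch:)

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StructType

val structTy = StructType.fromDDL("a ARRAY<STRUCT<b: INT, c: STRING>>")
// Untyped empty Seq: there are no elements to convert, so no case class is required.
val emptyCol = new Column(Literal.create(Seq.empty, structTy.head.dataType))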

@lokm01 commented May 3, 2018

@maropu Thanks! Didn't know about creating a literal this way.

Don't you feel that the suggested change is way more elegant?

@mn-mikke (Contributor, Author) commented May 3, 2018

@maropu Really nice idea to create typed empty arrays via a Literal expression! On the other hand, I feel that end users shouldn't have to work with classes from Catalyst internals if we consider the creation of typed empty arrays to be an elementary operation.

I've tailored the solution according to your suggestion, but I still think a dedicated function should be introduced. What do you think?

@SparkQA commented May 3, 2018

Test build #90137 has finished for PR 21215 at commit e151ab7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 15, 2019

Test build #104578 has finished for PR 21215 at commit 24a98e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Apr 15, 2019

What about Map types too -- same issue? I just wonder if it's worth a whole new API method, given that there's already a way to express this if really needed, or possibly a way to simply rewrite the code to avoid it?

@mn-mikke (Contributor, Author)

@srowen Yep, Map types suffer from the same problem. Users could eventually create a column of empty typed maps by combining this function with map_from_arrays, but having a separate method would be better.

This feature is definitely just a nice-to-have since users can use Literal.create directly, so I wouldn't mind if you wanted me to close the PR without merging. On the other hand, IMHO users shouldn't be forced to call methods from org.apache.spark.sql.catalyst.expressions; that doesn't sound like a public API.
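
(For reference, the typedLit workaround shown earlier in the thread also covers empty typed maps; a sketch, assuming an active SparkSession with spark.implicits._ in scope and illustrative column names:)

import org.apache.spark.sql.functions.typedLit

// typedLit derives the Catalyst type from the Scala type,
// so the empty Map literal comes out as map<int,string>.
val withEmptyMap = Seq(1, 2, 3).toDF("a").withColumn("m", typedLit(Map.empty[Int, String]))
withEmptyMap.printSchema()   // m: map (key: integer, value: string)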

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

I think we should rather have array<null>, which could be type-coerced to any other array type. Since the workarounds were provided, shall we close this PR? I don't think we'll have to expose another API.

@srowen closed this Nov 30, 2019