[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java #22316
Conversation
   *
   * {{{
   * df
   *   .groupBy($"year")
I would make this line up with the one above.
Yup, I prefer it this way.
Test build #95590 has finished for PR 22316 at commit
Test build #95592 has finished for PR 22316 at commit
@@ -406,6 +407,14 @@ class RelationalGroupedDataset protected[sql](
   * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
   * }}}
   *
   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
Since the documentation states this is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type, shall we move this content to that method?
Also, I would document this, for instance:

> From Spark 2.4.0, values can be literal columns, for instance, `struct`. For pivoting by multiple columns, use the `struct` function to combine the columns and values.
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"))
  .agg(sum($"sales.earnings"))
  .collect()
Don't we need this `.collect()` to catch the `RuntimeException`? btw, IMHO `AnalysisException` is better than `RuntimeException` in this case. Can't we use it?
My changes don't throw the exception. It is thrown in the `collect()`:

spark/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala, line 385 in 41c2227:
.collect()
@maropu Do you propose to catch `RuntimeException` and replace it with `AnalysisException`?
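For illustration, such rewrapping could look roughly like the sketch below. This is a hypothetical stand-in, not Spark's actual code: `AnalysisException` here is a local stub, and `toLiteral` only mimics the shape of `Literal.apply` failing on an unsupported type.

```scala
import scala.util.control.NonFatal

// Local stand-in for Spark's AnalysisException (illustration only).
class AnalysisException(message: String, cause: Throwable)
  extends Exception(message, cause)

// Hypothetical literal builder: handles a couple of simple types and
// throws RuntimeException for anything else, mimicking Literal.apply.
def toLiteral(v: Any): String = v match {
  case i: Int    => s"lit($i)"
  case s: String => s"""lit("$s")"""
  case other     =>
    throw new RuntimeException(s"Unsupported literal type ${other.getClass}")
}

// The rewrapping under discussion: catch the RuntimeException and
// rethrow it as an AnalysisException with a clearer message.
def toLiteralChecked(v: Any): String =
  try toLiteral(v)
  catch {
    case NonFatal(e) =>
      throw new AnalysisException(s"Invalid pivot value '$v'", e)
  }
```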
I tried this in your branch:
scala> df.show
+--------+--------------------+
|training| sales|
+--------+--------------------+
| Experts|[dotNET, 2012, 10...|
| Experts|[JAVA, 2012, 2000...|
| Dummies|[dotNet, 2012, 50...|
| Experts|[dotNET, 2013, 48...|
| Dummies|[Java, 2013, 3000...|
+--------+--------------------+
scala> df.groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training")).agg(sum($"sales.earnings"))
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [dotnet,Dummies]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
Am I missing something?
> Am I missing something?

No, you aren't. The exception is certainly thrown inside of `lit`, because `collect()` returns a complex value which cannot be "wrapped" by `lit`. This is exactly what is checked in the test I added to show the existing behavior.
> btw, IMHO AnalysisException is better than RuntimeException in this case?

@maropu Could you explain, please, why you think `AnalysisException` is better for an error that occurs at run time?
Just in case: in this PR, I don't aim to change the behavior of the existing method `def pivot(pivotColumn: Column): RelationalGroupedDataset`. I believe changing user-visible behavior should be discussed separately. The PR aims to improve `def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset` to allow users to specify `struct` literals in particular. Please see the description.
I think invalid queries basically throw `AnalysisException`. But, yeah, indeed, we'd better keep the current behaviour. Thanks!
Test build #95631 has finished for PR 22316 at commit
@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))
What do you think about `map(lit).map(_.expr)` instead?
I don't see any advantage to this. It is longer and slower.
@HyukjinKwon May I ask you to look at the PR? Is there anything that blocks it for now?
Looks good, but I wonder whether everyone involved in the previous PR is happy with this.
At least @gatorsmile and @cloud-fan, WDYT?
The branch is cut. Let's target 3.0.0.
@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
   * df.groupBy("year").pivot("course").sum("earnings")
   * }}}
   *
   * From Spark 2.4.0, values can be literal columns, for instance, struct. For pivoting by
Let's target 3.0.0 @MaxGekk.
Test build #95829 has finished for PR 22316 at commit
Seems fine to me.
@gatorsmile Do you have any objections to this approach?
@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))
@MaxGekk, just to be doubly sure, shall we use `Try(...).getOrElse(lit(...).expr)`? It looks like there's at least one case of a potential behaviour change around scale and precision.
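The shape of the suggested fallback can be sketched with plain stand-in functions (hypothetical; Spark's real `lit`/`Literal` machinery is not used here — `newPath` artificially rejects `BigDecimal` to simulate a case where the new conversion fails):

```scala
import scala.util.Try

// Hypothetical stand-ins for the two code paths being discussed:
// the new path (lit, backed by Literal.create) and the old one
// (Literal.apply). newPath rejects BigDecimal to simulate a case
// where the new conversion throws.
def newPath(v: Any): String = v match {
  case _: java.math.BigDecimal => throw new RuntimeException("differs")
  case other                   => s"create($other)"
}
def oldPath(v: Any): String = s"apply($v)"

// Try the new conversion first, falling back to the old one on failure,
// which is the shape of Try(...).getOrElse(lit(...).expr).
def toExpr(v: Any): String = Try(newPath(v)).getOrElse(oldPath(v))
```

With this shape, values the new path can handle keep the new behavior, while anything it rejects silently falls back to the old conversion.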
> Looks at least there's one case of a potential behaviour change about scale and precision.

Could you explain, please? Why do you expect a behavior change?
Now we eventually call `Literal.create` instead of `Literal.apply`. I'm not sure if there is a behavior change though.
From a quick look, it seems `Literal.create` is more powerful and should not cause regressions.
That's true in general, but specifically, is the decimal precision more correct?
LGTM otherwise.
@HyukjinKwon @maropu @jaceklaskowski Please take a look at this PR one more time.
From a cursory look, I checked that the decimal precision and scale could be different. For instance:

spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala, lines 100 to 128 in d749d03:
/**
 * Creates a [[Column]] of literal value.
 *
 * The passed in object is returned directly if it is already a [[Column]].
 * If the object is a Scala Symbol, it is converted into a [[Column]] also.
 * Otherwise, a new [[Column]] is created to represent the literal value.
 *
 * @group normal_funcs
 * @since 1.3.0
 */
def lit(literal: Any): Column = typedLit(literal)

/**
 * Creates a [[Column]] of literal value.
 *
 * The passed in object is returned directly if it is already a [[Column]].
 * If the object is a Scala Symbol, it is converted into a [[Column]] also.
 * Otherwise, a new [[Column]] is created to represent the literal value.
 * The difference between this function and [[lit]] is that this function
 * can handle parameterized scala types e.g.: List, Seq and Map.
 *
 * @group normal_funcs
 * @since 2.2.0
 */
def typedLit[T : TypeTag](literal: T): Column = literal match {
  case c: Column => c
  case s: Symbol => new ColumnName(s.name)
  case _ => Column(Literal.create(literal))
}
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala, lines 165 to 171 in f96a8bf:

def create[T : TypeTag](v: T): Literal = Try {
  val ScalaReflection.Schema(dataType, _) = ScalaReflection.schemaFor[T]
  val convert = CatalystTypeConverters.createToCatalystConverter(dataType)
  Literal(convert(v), dataType)
}.getOrElse {
  Literal(v)
}
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala, lines 756 to 763 in 1fd59c1:

case t if t <:< localTypeOf[BigDecimal] => Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)
case t if t <:< localTypeOf[java.math.BigDecimal] =>
  Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)
case t if t <:< localTypeOf[java.math.BigInteger] =>
  Schema(DecimalType.BigIntDecimal, nullable = true)
case t if t <:< localTypeOf[scala.math.BigInt] =>
  Schema(DecimalType.BigIntDecimal, nullable = true)
case t if t <:< localTypeOf[Decimal] => Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)
vs
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala, lines 62 to 65 in f96a8bf:

case d: BigDecimal => Literal(Decimal(d), DecimalType.fromBigDecimal(d))
case d: JavaBigDecimal =>
  Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
case d: Decimal => Literal(d, DecimalType(Math.max(d.precision, d.scale), d.scale))
Would you mind double-checking this one, please?
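The difference the snippets above point at can be illustrated with plain `java.math.BigDecimal`, no Spark required. The `(38, 18)` pair below is assumed from `DecimalType.SYSTEM_DEFAULT` as quoted above; the second pair applies `Literal.apply`'s own rule, `DecimalType(max(precision, scale), scale)`.

```scala
import java.math.{BigDecimal => JBigDecimal}

val d = new JBigDecimal("123.45")

// ScalaReflection.schemaFor maps BigDecimal to DecimalType.SYSTEM_DEFAULT,
// i.e. a fixed (precision, scale) of (38, 18) regardless of the value.
val viaCreate = (38, 18)

// Literal.apply derives the type from the value itself:
// DecimalType(max(precision, scale), scale).
val viaApply = (math.max(d.precision, d.scale), d.scale)

println(viaCreate) // (38,18)
println(viaApply)  // (5,2)
```

So the same value `123.45` gets type `(38, 18)` on one path and `(5, 2)` on the other, which is exactly the kind of precision/scale divergence being discussed.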
@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
   * df.groupBy("year").pivot("course").sum("earnings")
   * }}}
   *
   * From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by
3.0.0 => 2.5.0
Test build #96404 has finished for PR 22316 at commit
retest this please
Test build #96409 has finished for PR 22316 at commit
jenkins, retest this, please
Test build #96420 has finished for PR 22316 at commit
LGTM if the decimal precision concern from @HyukjinKwon is addressed.
@HyukjinKwon Do you expect special tests for decimals?
Can you just investigate whether there's a behaviour change in decimal precision? If there is, and it's a better behaviour, can you add a simple test? If it's not a better behaviour, let's try-catch for now.
Test build #96520 has finished for PR 22316 at commit
One safe change is to not use the
@cloud-fan Thank you for the suggestion. I did it that way.
Test build #96759 has finished for PR 22316 at commit
   * multiple columns, use the `struct` function to combine the columns and values:
   *
   * {{{
   * df.groupBy($"year")
nit: `$"year"` -> `"year"`
Why can't we group by the `Column` type?
We can. It's just to match the examples above, apart from the difference being shown. Really not a big deal at all.
LGTM except one nit.
I'm merging this. The last change is a comment change, and the lint / unidoc checks passed.
Merged to master.
## What changes were proposed in this pull request?

In the PR, I propose to extend the implementation of the existing method:

```
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
```

to support values of the struct type. This allows pivoting by multiple columns combined by `struct`:

```
trainingSales
  .groupBy($"sales.year")
  .pivot(
    pivotColumn = struct(lower($"sales.course"), $"training"),
    values = Seq(
      struct(lit("dotnet"), lit("Experts")),
      struct(lit("java"), lit("Dummies"))))
  .agg(sum($"sales.earnings"))
```

## How was this patch tested?

Added a test for values specified via `struct` in Java and Scala.

Closes apache#22316 from MaxGekk/pivoting-by-multiple-columns2.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
Test build #96800 has finished for PR 22316 at commit
Hi, is there a way to pivot multiple columns using PySpark as well?
You can try `df.groupby(...).pivot(..., values=[F.struct(F.lit("..."))._jc])` for now.