
[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java #22316

Closed
wants to merge 11 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Sep 2, 2018

What changes were proposed in this pull request?

In the PR, I propose to extend the implementation of the existing method:

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

to support values of the struct type. This allows pivoting by multiple columns combined by struct:

trainingSales
      .groupBy($"sales.year")
      .pivot(
        pivotColumn = struct(lower($"sales.course"), $"training"),
        values = Seq(
          struct(lit("dotnet"), lit("Experts")),
          struct(lit("java"), lit("Dummies")))
      ).agg(sum($"sales.earnings"))

How was this patch tested?

Added a test for values specified via struct in Java and Scala.
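For context, here is a minimal, self-contained sketch of the new capability. It is not part of the patch: the `trainingSales` dataset, its schema, and the figures are assumed from the example above and from the `df.show` output later in the thread, and a spark-shell style session (`spark`, `$` from `spark.implicits._`) is assumed.

```scala
import org.apache.spark.sql.functions.{lit, lower, struct, sum}
import spark.implicits._

// Hypothetical dataset: a `training` column plus a nested `sales` struct
// with `course`, `year` and `earnings` fields (figures made up for illustration).
case class Sales(course: String, year: Int, earnings: Double)
val trainingSales = Seq(
  ("Experts", Sales("dotNET", 2012, 10000.0)),
  ("Dummies", Sales("dotNet", 2012, 5000.0)),
  ("Experts", Sales("JAVA", 2012, 20000.0)),
  ("Experts", Sales("dotNET", 2013, 48000.0)),
  ("Dummies", Sales("Java", 2013, 30000.0))
).toDF("training", "sales")

// Pivot by two columns at once by combining them into a single struct column,
// and list the wanted (course, training) pairs as struct values.
trainingSales
  .groupBy($"sales.year")
  .pivot(
    pivotColumn = struct(lower($"sales.course"), $"training"),
    values = Seq(
      struct(lit("dotnet"), lit("Experts")),
      struct(lit("java"), lit("Dummies")))
  ).agg(sum($"sales.earnings"))
  .show()
// One row per year, one column per listed (course, training) pair.
```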

*
* {{{
* df
* .groupBy($"year")
Member

I would make this line up with the line above.

@HyukjinKwon
Member

Yup I prefer this way

@SparkQA

SparkQA commented Sep 2, 2018

Test build #95590 has finished for PR 22316 at commit a097b29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2018

Test build #95592 has finished for PR 22316 at commit ef8e22a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -406,6 +407,14 @@ class RelationalGroupedDataset protected[sql](
* df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
* }}}
*
* For pivoting by multiple columns, use the `struct` function to combine the columns and values:
Member

Since the documentation states it's an overloaded version of the `pivot` method with `pivotColumn` of the `String` type, shall we move this content to that method?

Also, I would document this, for instance,

From Spark 2.4.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values.

.groupBy($"sales.year")
.pivot(struct(lower($"sales.course"), $"training"))
.agg(sum($"sales.earnings"))
.collect()
Member

@maropu maropu Sep 3, 2018

Don't we need this `.collect()` to catch the `RuntimeException`? Btw, IMHO `AnalysisException` is better than `RuntimeException` in this case. Can't we use it?

Member Author

My changes don't throw the exception; it is thrown inside the `collect()`:

@maropu Do you propose to catch the `RuntimeException` and replace it with an `AnalysisException`?

Member

I tried it on your branch:

scala> df.show
+--------+--------------------+
|training|               sales|
+--------+--------------------+
| Experts|[dotNET, 2012, 10...|
| Experts|[JAVA, 2012, 2000...|
| Dummies|[dotNet, 2012, 50...|
| Experts|[dotNET, 2013, 48...|
| Dummies|[Java, 2013, 3000...|
+--------+--------------------+

scala> df.groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training")).agg(sum($"sales.earnings"))
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [dotnet,Dummies]
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)

Am I missing something?

Member Author

Am I missing something?

No, you aren't. The exception is indeed thrown inside `lit`, because `collect()` returns a complex value which cannot be "wrapped" by `lit`. This is exactly what the test I added checks, to document the existing behavior.

btw, IMHO AnalysisException is better than RuntimeException in this case?

@maropu Could you explain, please, why you think `AnalysisException` is better for an error that occurs at run time?

Just in case: in this PR, I don't aim to change the behavior of the existing method `def pivot(pivotColumn: Column): RelationalGroupedDataset`. I believe that should be discussed separately, with regard to the need for changing user-visible behavior. The PR aims to improve `def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset` to allow users to specify struct literals in particular. Please see the description.
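For illustration, a minimal sketch (not from the PR) of the failure mode described above: `lit` cannot wrap the `Row` objects that `collect()` returns for struct values, and it ends up throwing the `RuntimeException` shown in the stack trace above. The exact class name in the message depends on the concrete `Row` implementation.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit

// A collected struct value comes back as a Row; Literal has no conversion for it.
lit(Row("dotnet", "Dummies"))
// java.lang.RuntimeException: Unsupported literal type class ... [dotnet,Dummies]
```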

Member

I think invalid queries basically throw `AnalysisException`. But, yeah, indeed, we'd better keep the current behaviour. Thanks!

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95631 has finished for PR 22316 at commit 673ef00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
new RelationalGroupedDataset(
df,
groupingExprs,
RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))
Contributor

What do you think about `map(lit).map(_.expr)` instead?

Member Author

I don't see any advantage of this; it is longer and slower.

@MaxGekk
Member Author

MaxGekk commented Sep 6, 2018

@HyukjinKwon May I ask you to take a look at the PR? Is there anything blocking it for now?

@HyukjinKwon
Member

Looks good, but I wonder if everyone involved in the previous PR is happy with it.

@HyukjinKwon
Member

At least @gatorsmile and @cloud-fan, WDYT?

@HyukjinKwon
Member

The branch is cut. Let's target 3.0.0.

@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
* df.groupBy("year").pivot("course").sum("earnings")
* }}}
*
* From Spark 2.4.0, values can be literal columns, for instance, struct. For pivoting by
Member

Let's target 3.0.0 @MaxGekk.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95829 has finished for PR 22316 at commit 8ccf845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Seems fine to me.

@MaxGekk
Member Author

MaxGekk commented Sep 12, 2018

@gatorsmile Do you have any objections for this approach?

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
new RelationalGroupedDataset(
df,
groupingExprs,
RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))
Member

@MaxGekk, just to be doubly sure, shall we do `Try(...).getOrElse(lit(...).expr)`? It looks like there's at least one case of a potential behaviour change around scale and precision.

Member Author

It looks like there's at least one case of a potential behaviour change around scale and precision.

Could you explain, please, why you expect a behavior change?

Contributor

Now we eventually call `Literal.create` instead of `Literal.apply`. I'm not sure if there is a behavior change, though.

Contributor

From a quick look, it seems `Literal.create` is more powerful and should not cause regressions.

Member

That's true in general, but specifically, is the decimal precision more correct?

@HyukjinKwon
Member

LGTM otherwise

@MaxGekk
Member Author

MaxGekk commented Sep 17, 2018

@HyukjinKwon @maropu @jaceklaskowski Please, take a look at this PR one more time.

Member

@HyukjinKwon HyukjinKwon left a comment

I checked; from a cursory look, the decimal precision and scale could be different. For instance:

/**
* Creates a [[Column]] of literal value.
*
* The passed in object is returned directly if it is already a [[Column]].
* If the object is a Scala Symbol, it is converted into a [[Column]] also.
* Otherwise, a new [[Column]] is created to represent the literal value.
*
* @group normal_funcs
* @since 1.3.0
*/
def lit(literal: Any): Column = typedLit(literal)
/**
* Creates a [[Column]] of literal value.
*
* The passed in object is returned directly if it is already a [[Column]].
* If the object is a Scala Symbol, it is converted into a [[Column]] also.
* Otherwise, a new [[Column]] is created to represent the literal value.
* The difference between this function and [[lit]] is that this function
* can handle parameterized scala types e.g.: List, Seq and Map.
*
* @group normal_funcs
* @since 2.2.0
*/
def typedLit[T : TypeTag](literal: T): Column = literal match {
case c: Column => c
case s: Symbol => new ColumnName(s.name)
case _ => Column(Literal.create(literal))
}

def create[T : TypeTag](v: T): Literal = Try {
val ScalaReflection.Schema(dataType, _) = ScalaReflection.schemaFor[T]
val convert = CatalystTypeConverters.createToCatalystConverter(dataType)
Literal(convert(v), dataType)
}.getOrElse {
Literal(v)
}

case t if t <:< localTypeOf[BigDecimal] => Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)
case t if t <:< localTypeOf[java.math.BigDecimal] =>
Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)
case t if t <:< localTypeOf[java.math.BigInteger] =>
Schema(DecimalType.BigIntDecimal, nullable = true)
case t if t <:< localTypeOf[scala.math.BigInt] =>
Schema(DecimalType.BigIntDecimal, nullable = true)
case t if t <:< localTypeOf[Decimal] => Schema(DecimalType.SYSTEM_DEFAULT, nullable = true)

vs

case d: BigDecimal => Literal(Decimal(d), DecimalType.fromBigDecimal(d))
case d: JavaBigDecimal =>
Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
case d: Decimal => Literal(d, DecimalType(Math.max(d.precision, d.scale), d.scale))

Would you mind double-checking this one, please?
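To make the concern concrete, here is a minimal sketch (not from the PR, and only as reliable as the code quoted above) comparing the two paths on a `BigDecimal`:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

val d = BigDecimal("1.23")

// Literal.apply derives the type from the value via DecimalType.fromBigDecimal.
Literal(d).dataType        // DecimalType(3,2)

// Literal.create goes through ScalaReflection.schemaFor, which maps BigDecimal
// to DecimalType.SYSTEM_DEFAULT.
Literal.create(d).dataType // DecimalType(38,18)
```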

@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
* df.groupBy("year").pivot("course").sum("earnings")
* }}}
*
* From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by
Contributor

3.0.0 => 2.5.0

@SparkQA

SparkQA commented Sep 21, 2018

Test build #96404 has finished for PR 22316 at commit 382640b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Sep 21, 2018

Test build #96409 has finished for PR 22316 at commit 382640b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Sep 21, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Sep 21, 2018

Test build #96420 has finished for PR 22316 at commit 382640b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM if the decimal precision concern from @HyukjinKwon is addressed.

@MaxGekk
Member Author

MaxGekk commented Sep 24, 2018

LGTM if the decimal precision concern from @HyukjinKwon is addressed.

@HyukjinKwon Do you expect special tests for decimals?

@HyukjinKwon
Member

Can you just investigate whether there's a behaviour change around decimal precision? If there is and it's a better behaviour, can you add a simple test for it? If it's not a better behaviour, let's try-catch for now.

@SparkQA

SparkQA commented Sep 25, 2018

Test build #96520 has finished for PR 22316 at commit 49b47fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

One safe change is to not use the `lit` function, but to do a manual pattern match and still use `Literal.apply`. We can investigate `Literal.create` in a follow-up.
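For instance, a minimal sketch of such a pattern match (an assumption about the shape of the change, not a quote of the merged patch): only explicit `Column` values such as `struct(...)` are unwrapped, and everything else keeps going through `Literal.apply` so plain values get exactly the same literal types as before.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.functions.{lit, struct}

// Hypothetical `values` as a caller might pass them: plain values mixed with struct Columns.
val values: Seq[Any] = Seq("dotNET", struct(lit("java"), lit("Dummies")))

val valueExprs: Seq[Expression] = values.map {
  case c: Column => c.expr           // unwrap Columns such as struct(...)
  case v         => Literal.apply(v) // unchanged path for plain values
}
```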

@MaxGekk
Member Author

MaxGekk commented Sep 28, 2018

@cloud-fan Thank you for the suggestion. I did it this way.

@SparkQA

SparkQA commented Sep 28, 2018

Test build #96759 has finished for PR 22316 at commit d645d06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* multiple columns, use the `struct` function to combine the columns and values:
*
* {{{
* df.groupBy($"year")
Member

nit: $"year" -> "year"

Member Author

Why can't the grouping be by the `Column` type?

Member

We can; it's just to match the examples above, apart from the difference being shown. Really not a big deal at all.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM except one nit

@HyukjinKwon
Member

I'm merging this. The last change is a comment-only change, and the lint / unidoc checks passed.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 623c2ec Sep 29, 2018
daspalrahul pushed a commit to daspalrahul/spark that referenced this pull request Sep 29, 2018
Closes apache#22316 from MaxGekk/pivoting-by-multiple-columns2.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@SparkQA

SparkQA commented Sep 29, 2018

Test build #96800 has finished for PR 22316 at commit 43972ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@MaxGekk MaxGekk deleted the pivoting-by-multiple-columns2 branch August 17, 2019 13:35
@Hoeze

Hoeze commented Dec 17, 2019

Hi, is there a way to pivot multiple columns using PySpark as well?

@HyukjinKwon
Member

You can try:

df.groupby(...).pivot(..., values=[F.struct(F.lit("..."))._jc])

for now.
