[SPARK-19882][SQL] Pivot with null as the distinct pivot value throws NPE #17224

HyukjinKwon · 2017-03-09T12:28:23Z

What changes were proposed in this pull request?

This PR proposes to fix two problems as below:

Use previous code path to handle null in distinct pivot values

An optimisation was later introduced to avoid evaluating each input on every aggregate, which seems to make pivot slow when there are many pivotValues. That optimised path appears to assume strictly that a distinct pivot value cannot be null.

I could not find a clean and easy workaround to support null there, and I wonder whether it is worth the effort. Please guide me if anyone knows a clean and short way to fix it in the two-phase aggregation path.
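For illustration only, the lookup at the heart of a single-pass pivot can be modelled in plain Scala so that null is a legal key. This is a sketch under assumptions, not Spark's `PivotFirst` code: `PivotIndexSketch`, `buildIndex` and `pivotFirst` are hypothetical names, and wrapping each value in `Option` is just one possible way to let a null pivot value find its output slot without dereferencing null.

```scala
object PivotIndexSketch {
  // Hypothetical helper: map each distinct pivot value to its output slot.
  // Option(null) is None, so a null pivot value becomes an ordinary key.
  def buildIndex(pivotValues: Seq[Any]): Map[Option[Any], Int] =
    pivotValues.map(v => Option(v)).zipWithIndex.toMap

  // Single pass over (pivotValue, measure) rows: each row touches only the
  // slot of its own pivot value instead of being tested against every value.
  // Keeps the first measure seen per slot.
  def pivotFirst(index: Map[Option[Any], Int], rows: Seq[(Any, Int)]): Vector[Option[Int]] = {
    val out = Array.fill[Option[Int]](index.size)(None)
    rows.foreach { case (pivot, measure) =>
      index.get(Option(pivot)).foreach { slot =>
        if (out(slot).isEmpty) out(slot) = Some(measure)
      }
    }
    out.toVector
  }
}
```

With pivot values `Seq[Any](null, 1)`, a row whose pivot column is null lands in slot 0 rather than failing on the lookup.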

Fix to count null

Seq(Tuple1(None), Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()

Before (Spark 1.6),

+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   0|  0|
|   1|   0|  1|
+----+----+---+

Before (current master; this is currently a regression),

java.lang.NullPointerException was thrown.
java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
  at scala.collection.immutable.List.map(List.scala:273)
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.<init>(PivotFirst.scala:143)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)

After,

+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   1|  0|
|   1|   0|  1|
+----+----+---+

It seems we should count null, given that a plain groupBy followed by count does count it:

Seq(Tuple1(None), Tuple1(Some(1))).toDF("a").groupBy($"a").count().show()

+----+-----+
|   a|count|
+----+-----+
|null|    1|
|   1|    1|
+----+-----+

How was this patch tested?

Unit tests in DataFramePivotSuite.

SparkQA · 2017-03-09T14:38:50Z

Test build #74269 has finished for PR 17224 at commit 0476565.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-03-10T04:05:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

                case First(expr, _) =>
                  First(ifExpr(expr), Literal(true))
                case Last(expr, _) =>
                  Last(ifExpr(expr), Literal(true))
+                case c: Count =>
+                  // In case of count, `null` should be counted.
+                  c.withNewChildren(c.children.map(ifNullSafeExpr))


Let me update this path as soon as we decide what we want in another PR for this JIRA.
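To see why the comparison matters for count, here is a plain-Scala model (not Spark code) of the per-pivot-value count within one group. It assumes that `ifExpr` compares with ordinary SQL equality and that `ifNullSafeExpr` compares with null-safe equality (`<=>`); the helper bodies are not shown in this thread, so both are assumptions, and every name below is hypothetical.

```scala
object NullSafePivotCount {
  // SQL `=`: the result is unknown (None) if either side is null.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: null <=> null is true, and the result is never unknown.
  def sqlNullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  // Models count(if(pivotColumn matches pivotValue, 1, null)) over one group:
  // a row is counted only when the condition is definitely true.
  def pivotCount(groupRows: Seq[Option[Int]], pivotValue: Option[Int], nullSafe: Boolean): Int =
    groupRows.count { row =>
      if (nullSafe) sqlNullSafeEq(row, pivotValue)
      else sqlEq(row, pivotValue).getOrElse(false)
    }

  def main(args: Array[String]): Unit = {
    val nullGroup = Seq[Option[Int]](None) // the group where a is null
    println(pivotCount(nullGroup, None, nullSafe = false)) // prints 0
    println(pivotCount(nullGroup, None, nullSafe = true))  // prints 1
  }
}
```

Under these assumptions, ordinary equality reproduces the 0 in the Spark 1.6 table for the null row and null column, while null-safe equality reproduces the 1 in the "After" table.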

HyukjinKwon · 2017-03-10T07:26:28Z

I am closing this per #17226 (comment)

## What changes were proposed in this pull request?

Allows null values of the pivot column to be included in the pivot values list without throwing NPE.

Note this PR was made as an alternative to apache#17224 but preserves the two phase aggregate operation that is needed for good performance.

## How was this patch tested?

Additional unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes apache#17226 from aray/pivot-null.

Pivot with null as the pivot value throws NPE

0476565

HyukjinKwon changed the title ~~[SPARK-19882][SQL] Pivot with null as the pivot value throws NPE~~ [SPARK-19882][SQL] Pivot with null as the distinct pivot value throws NPE Mar 9, 2017

aray mentioned this pull request Mar 9, 2017

[SPARK-19882][SQL] Pivot with null as a distinct pivot value throws NPE #17226

Closed

HyukjinKwon closed this Mar 10, 2017

HyukjinKwon deleted the SPARK-19882 branch January 2, 2018 03:44
