Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19882][SQL] Pivot with null as the dictinct pivot value throws NPE #17224

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 9, 2017

What changes were proposed in this pull request?

This PR proposes to fix two problems as below:

Use previous code path to handle null in distinct pivot values

An optimisation to this was introduced to prevent each input gets evaluated on every aggregate which seems making it slow when pivotValues are too many. It seems this tightly assumes that this distinct pivot value can't be null.

I could not find a clean and easy workaround to support this and wonder if it is worth. Please guide me if anyone knows a clean and short way to fix it in this two aggregation path.

Fix to count null

Seq(Tuple1(None), Tuple1(Some(1))).toDF("a").groupBy($"a").count().show()

Before (Spark 1.6),

+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   0|  0|
|   1|   0|  1|
+----+----+---+

Before (current master), <- this is currently a regression

java.lang.NullPointerException was thrown.
java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
  at scala.collection.immutable.List.map(List.scala:273)
  at org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.<init>(PivotFirst.scala:143)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)

After,

+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   1|  0|
|   1|   0|  1|
+----+----+---+

It seems we should count null given

Seq(Tuple1(None), Tuple1(Some(1))).toDF("a").groupBy($"a").count().show()
+----+-----+
|   a|count|
+----+-----+
|null|    1|
|   1|    1|
+----+-----+

How was this patch tested?

Unit tests in DataFramePivotSuite.

@HyukjinKwon HyukjinKwon changed the title [SPARK-19882][SQL] Pivot with null as the pivot value throws NPE [SPARK-19882][SQL] Pivot with null as the dictinct pivot value throws NPE Mar 9, 2017
@SparkQA
Copy link

SparkQA commented Mar 9, 2017

Test build #74269 has finished for PR 17224 at commit 0476565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case First(expr, _) =>
First(ifExpr(expr), Literal(true))
case Last(expr, _) =>
Last(ifExpr(expr), Literal(true))
case c: Count =>
// In case of count, `null` should be counted.
c.withNewChildren(c.children.map(ifNullSafeExpr))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update this path as soon as we decide what we want in another PR for this JIRA.

@HyukjinKwon
Copy link
Member Author

I am closing this per #17226 (comment)

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 17, 2017
## What changes were proposed in this pull request?

Allows null values of the pivot column to be included in the pivot values list without throwing NPE

Note this PR was made as an alternative to apache#17224 but preserves the two phase aggregate operation that is needed for good performance.

## How was this patch tested?

Additional unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes apache#17226 from aray/pivot-null.
@HyukjinKwon HyukjinKwon deleted the SPARK-19882 branch January 2, 2018 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants