
[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile #16971

Closed
wants to merge 12 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Feb 17, 2017

What changes were proposed in this pull request?

update StatFunctions.multipleApproxQuantiles to handle NaN/null

How was this patch tested?

existing tests and added tests

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73033 has finished for PR 16971 at commit d5e79a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@gatorsmile @jkbradley

@@ -54,6 +54,8 @@ object StatFunctions extends Logging {
 * Note that values greater than 1 are accepted but give the same result as 1.
 *
 * @return for each column, returns the requested approximations
+ *
+ * @note null and NaN values will be removed from the numerical column before calculation.
Contributor

I think "will be ignored" is more accurate than "will be removed"

@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
 * Note that values greater than 1 are accepted but give the same result as 1.
 * @return the approximate quantiles at the given probabilities of each column
 *
- * @note Rows containing any null or NaN values will be removed before calculation. If
- * the dataframe is empty or all rows contain null or NaN, null is returned.
+ * @note null and NaN values will be removed from the numerical column before calculation. If
Contributor

Again, "ignored" is slightly better than "removed from"

try {
-  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
probabilities, relativeError).map(_.toArray).toArray
} catch {
case e: NoSuchElementException => null
Contributor

This went in for the other PR but I still question whether we should be returning null here. Is this standard in SparkSQL? What about returning an empty Array? cc @gatorsmile

Contributor

@thunterdb thunterdb Feb 22, 2017

+1. I tend to think that the result should be NaN (following the IEEE convention) or null (following scala Option convention). But pending a resolution, I would be fine with throwing an exception because it is the most conservative behavior (stopping computations). Returning null usually causes some issues in a functional context such as Spark.

Member

In Spark SQL, all the other built-in functions will not throw an exception if the input data set is empty. An empty input data set is pretty normal. Returning either null or an empty Array looks ok to me.

assert(resNaN1(0) === resNaNAll(0)(0))
assert(resNaN1(1) === resNaNAll(0)(1))
assert(resNaN2(0) === resNaNAll(1)(0))
assert(resNaN2(1) === resNaNAll(1)(1))
Contributor

Do we need a test for one column all nulls (that it returns null)?

Contributor Author

Yes, I created a new column containing only NaN/null.

@@ -78,7 +80,13 @@ object StatFunctions extends Logging {
def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
var i = 0
while (i < summaries.length) {
-      summaries(i) = summaries(i).insert(row.getDouble(i))
+      val item = row(i)
Contributor

This works, though perhaps we can do:

if (!row.isNullAt(i)) {
  val v = row.getDouble(i)
  if (!v.isNaN) {
    summaries(i) = summaries(i).insert(v)
  }
}

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73211 has finished for PR 16971 at commit 31346c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
var i = 0
while (i < summaries.length) {
-      summaries(i) = summaries(i).insert(row.getDouble(i))
+      if (!row.isNullAt(i)) {
Contributor

Thank you for fixing this issue, it was an oversight in my original implementation.

The current exception being thrown depends on an implementation detail (calling sampled.head). Can you modify the function def query below to explicitly throw an exception if sampled is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class.

As @MLnick was mentioning above, it would be preferable to eventually return an Option, null, or NaN, but this can wait for more consensus.
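A minimal sketch of the explicit empty-check being requested, in plain Scala (the class name and structure here are assumptions for illustration, not Spark's actual QuantileSummaries):

```scala
// Toy stand-in for QuantileSummaries; names and structure are
// assumptions for illustration, not Spark's implementation.
class TinySummary(sampled: Vector[Double]) {
  def query(quantile: Double): Double = {
    require(quantile >= 0 && quantile <= 1.0,
      "quantile should be in the range [0.0, 1.0]")
    // Fail fast with a documented error instead of relying on an
    // accidental NoSuchElementException from sampled.head.
    if (sampled.isEmpty) {
      throw new IllegalStateException("Cannot query an empty summary")
    }
    val sorted = sampled.sorted
    sorted(math.min((quantile * sorted.length).toInt, sorted.length - 1))
  }
}
```

The point is only that the empty case is checked and documented up front, so the thrown exception is part of the function's contract rather than a side effect of how the sample buffer is accessed.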

@thunterdb
Contributor

@zhengruifeng thanks for looking into this issue. I have one comment above.

@zhengruifeng
Contributor Author

@thunterdb Good point. I will check sampled in def query.

@MLnick @gatorsmile I prefer an empty array as the result for an empty dataset or for columns that contain only NaN/null.
Also, in the case where only some columns contain only NaN/null, the current implementation returns null, so the results for all columns cannot be obtained. I think the results for the valid columns should still be accessible.

val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0, Double.NaN),
+      Row(1.0, -1.0, null), Row(-1.0, Double.NaN, null), Row(Double.NaN, Double.NaN, null),
+      Row(null, null, Double.NaN), Row(null, 1.0, null), Row(-1.0, null, Double.NaN),
+      Row(Double.NaN, null, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
+      StructField("input2", DoubleType, nullable = true),
+      StructField("input3", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2", "input3"),
       Array(q1, q2), epsilon)

In the returned array, the results for columns input1 and input2 should be ok, and the result for input3 is empty: Array(Array(num1, num2), Array(num3, num4), Array())
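The per-column semantics proposed above can be sketched in plain Scala (a toy model with a hypothetical name, not the PR's implementation): NaN/null values are ignored per column, and an all-NA column yields an empty array instead of nulling out the whole result.

```scala
// Toy per-column quantile sketch; None models SQL null.
def approxQuantilesSketch(
    columns: Seq[Seq[Option[Double]]],
    probabilities: Seq[Double]): Seq[Array[Double]] = {
  columns.map { col =>
    // Ignore nulls (None) and NaN values in this column only.
    val valid = col.flatten.filterNot(_.isNaN).sorted
    if (valid.isEmpty) {
      Array.empty[Double] // all-NA column -> empty result, not null
    } else {
      probabilities.map { p =>
        valid(math.min((p * valid.length).toInt, valid.length - 1))
      }.toArray
    }
  }
}
```

With the dfNaN example above, input1 and input2 would each produce a two-element array while input3 produces an empty one, so callers can still use the valid columns.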

@MLnick
Contributor

MLnick commented Feb 23, 2017 via email

@SparkQA

SparkQA commented Feb 23, 2017

Test build #73338 has finished for PR 16971 at commit bdf3bf0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2017

Test build #73344 has finished for PR 16971 at commit e0365b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @MLnick @gatorsmile @thunterdb

try {
probabilities.map(summary.query)
} catch {
case e: SparkException => Seq.empty[Double]
Member

Please do not use exception handling for this purpose. Instead, you can return None.
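For illustration, an Option-returning query lets the caller drop the try/catch entirely (toy code with assumed names, not the actual Spark API):

```scala
// query returns None for an empty summary instead of throwing.
def query(sampled: Vector[Double], quantile: Double): Option[Double] =
  if (sampled.isEmpty) None
  else {
    val sorted = sampled.sorted
    Some(sorted(math.min((quantile * sorted.length).toInt, sorted.length - 1)))
  }

val probabilities = Seq(0.25, 0.5, 0.75)
// Empty summary: flatMap yields an empty Seq, no exception handling needed.
val emptyResult = probabilities.flatMap(p => query(Vector.empty, p))
// Non-empty summary: results come back as plain Doubles.
val filled = probabilities.flatMap(p => query(Vector(1.0, 2.0, 3.0), p))
```

Flattening the Options with flatMap makes the empty case fall out naturally, which is the idiomatic alternative to catching an exception for control flow.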

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73674 has finished for PR 16971 at commit 8d5941b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73676 has finished for PR 16971 at commit 402deb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case e: NoSuchElementException => null
}
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
probabilities, relativeError).map(_.toArray).toArray
Member

Nit: style issue

    StatFunctions.multipleApproxQuantiles(
      df.select(cols.map(col): _*),
      cols,
      probabilities,
      relativeError).map(_.toArray).toArray

require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
require(headSampled.isEmpty,
"Cannot operate on an uncompressed summary, call compress() first")

if (sampled.isEmpty) {
return None
}
Member

Nit:

if (sampled.isEmpty) return None

val v = row.getDouble(i)
if (!v.isNaN) {
summaries(i) = summaries(i).insert(v)
}
Member

Nit:

if (!v.isNaN) summaries(i) = summaries(i).insert(v)

-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      probabilities.flatMap(summary.query)
+    }
Member

Nit:

summaries.map { summary => probabilities.flatMap(summary.query) }

@@ -245,7 +245,7 @@ object ApproximatePercentile {
val result = new Array[Double](percentages.length)
var i = 0
while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
Member

Is it possible for this to return None? If so, you will get a strange exception here. Could you also add a test case for getPercentiles in ApproximatePercentileQuerySuite?

Contributor Author

It looks like it is impossible to return None here: since summaries.count != 0, summaries.sampled should not be empty, so None will not be returned. @thunterdb Is this correct?

Member

Yes. I think it is impossible to hit it. We need some test cases to ensure it. See my next comment.

Member

cc @thunterdb to ensure it.

Contributor

Yes, this is correct. But you should leave a comment, since it is not obvious.

Member

Thank you!

Contributor Author

Thanks all. I will add a comment here.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73739 has started for PR 16971 at commit 2071aae.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73744 has finished for PR 16971 at commit 2071aae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
}

private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
Member

Add a test case with summary.count == 0 and improve this helper function to cover it?

Contributor Author

Ok, I will add a test on empty data here.
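Such an empty-data test could look roughly like this (a toy summary standing in for QuantileSummaries; the names are assumptions):

```scala
// Minimal model of a summary whose query returns Option.
final case class ToySummary(sampled: Vector[Double]) {
  def count: Long = sampled.length.toLong
  def query(q: Double): Option[Double] =
    if (sampled.isEmpty) None
    else {
      val sorted = sampled.sorted
      Some(sorted(math.min((q * sorted.length).toInt, sorted.length - 1)))
    }
}

val emptySummary = ToySummary(Vector.empty)
// count == 0: querying must yield None, never an exception.
assert(emptySummary.count == 0L)
assert(emptySummary.query(0.5).isEmpty)
```

Branching on the Option (or asserting isEmpty) covers the count == 0 case the helper currently skips past with .get.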

@gatorsmile
Member

ping @zhengruifeng

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74703 has finished for PR 16971 at commit 42c7b25.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74705 has finished for PR 16971 at commit 7bf7db3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74717 has finished for PR 16971 at commit 00d67f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74724 has started for PR 16971 at commit ed6dacd.

@gatorsmile
Member

retest this please

@gatorsmile
Member

LGTM pending Jenkins

cc @thunterdb @MLnick

@SparkQA

SparkQA commented Mar 19, 2017

Test build #74793 has finished for PR 16971 at commit ed6dacd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74837 has finished for PR 16971 at commit b1125fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74908 has finished for PR 16971 at commit b1125fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Mar 21, 2017

Since it is close to code freeze, I am first merging this PR. If any more comments, we can resolve them in the follow-up PR.

Thanks! Merging to master.

@asfgit asfgit closed this in 10691d3 Mar 21, 2017
@zhengruifeng zhengruifeng deleted the quantiles_nan branch March 21, 2017 01:31