
[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile #16971

Closed
wants to merge 12 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Feb 17, 2017

What changes were proposed in this pull request?

update StatFunctions.multipleApproxQuantiles to handle NaN/null

How was this patch tested?

existing tests and added tests

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73033 has finished for PR 16971 at commit d5e79a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@gatorsmile @jkbradley

@@ -54,6 +54,8 @@ object StatFunctions extends Logging {
 * Note that values greater than 1 are accepted but give the same result as 1.
 *
 * @return for each column, returns the requested approximations
+ *
+ * @note null and NaN values will be removed from the numerical column before calculation.
Contributor

I think "will be ignored" is more accurate than "will be removed"

@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
 * Note that values greater than 1 are accepted but give the same result as 1.
 * @return the approximate quantiles at the given probabilities of each column
 *
- * @note Rows containing any null or NaN values will be removed before calculation. If
- * the dataframe is empty or all rows contain null or NaN, null is returned.
+ * @note null and NaN values will be removed from the numerical column before calculation. If
Contributor

Again, "ignored" is slightly better than "removed from"

try {
-  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
probabilities, relativeError).map(_.toArray).toArray
} catch {
case e: NoSuchElementException => null
Contributor

This went in for the other PR but I still question whether we should be returning null here. Is this standard in SparkSQL? What about returning an empty Array? cc @gatorsmile

Contributor

@thunterdb thunterdb Feb 22, 2017

+1. I tend to think that the result should be NaN (following the IEEE convention) or null (following scala Option convention). But pending a resolution, I would be fine with throwing an exception because it is the most conservative behavior (stopping computations). Returning null usually causes some issues in a functional context such as Spark.

Member

In Spark SQL, all the other built-in functions will not throw an exception if the input data set is empty. An empty input data set is pretty normal. Returning either null or an empty Array looks ok to me.

assert(resNaN1(0) === resNaNAll(0)(0))
assert(resNaN1(1) === resNaNAll(0)(1))
assert(resNaN2(0) === resNaNAll(1)(0))
assert(resNaN2(1) === resNaNAll(1)(1))
Contributor

Do we need a test for one column all nulls (that it returns null)?

Contributor Author

Yes, I created a new column containing only NaN/null.

@@ -78,7 +80,13 @@ object StatFunctions extends Logging {
def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
var i = 0
while (i < summaries.length) {
-      summaries(i) = summaries(i).insert(row.getDouble(i))
+      val item = row(i)
Contributor

This works, though perhaps we can do:

if (!row.isNullAt(i)) {
  val v = row.getDouble(i)
  if (!v.isNaN) {
    summaries(i) = summaries(i).insert(v)
  }
}

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73211 has finished for PR 16971 at commit 31346c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
var i = 0
while (i < summaries.length) {
-      summaries(i) = summaries(i).insert(row.getDouble(i))
+      if (!row.isNullAt(i)) {
Contributor

Thank you for fixing this issue, it was an oversight in my original implementation.

The current exception being thrown depends on an implementation detail (calling sampled.head). Can you modify the function def query below to explicitly throw an exception if sampled is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class.

As @MLnick was mentioning above, it would be preferable to eventually return an Option, null, or NaN, but this can wait for more consensus.
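A minimal sketch of the explicit empty-check being requested, in plain Scala (the class name and structure here are assumptions for illustration, not Spark's actual QuantileSummaries):

```scala
// Toy stand-in for QuantileSummaries; names and structure are
// assumptions for illustration, not Spark's implementation.
class TinySummary(sampled: Vector[Double]) {
  def query(quantile: Double): Double = {
    require(quantile >= 0 && quantile <= 1.0,
      "quantile should be in the range [0.0, 1.0]")
    // Fail fast with a documented error instead of relying on an
    // accidental NoSuchElementException from sampled.head.
    if (sampled.isEmpty) {
      throw new IllegalStateException("Cannot query an empty summary")
    }
    val sorted = sampled.sorted
    sorted(math.min((quantile * sorted.length).toInt, sorted.length - 1))
  }
}
```

The point is only that the empty case is checked and documented up front, so the thrown exception is part of the function's contract rather than a side effect of how the sample buffer is accessed.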

@thunterdb
Contributor

@zhengruifeng thanks for looking into this issue. I have one comment above.

@zhengruifeng
Contributor Author

@thunterdb Good point. I will check sampled in def query.

@MLnick @gatorsmile I prefer an empty array as the result for an empty dataset or for columns that contain only NaN/null.
Also, in the case where only some columns contain only NaN/null, the current implementation returns null, so the results for all columns cannot be obtained. I think the results for the valid columns should still be accessible.

val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0, Double.NaN),
+      Row(1.0, -1.0, null), Row(-1.0, Double.NaN, null), Row(Double.NaN, Double.NaN, null),
+      Row(null, null, Double.NaN), Row(null, 1.0, null), Row(-1.0, null, Double.NaN),
+      Row(Double.NaN, null, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
+      StructField("input2", DoubleType, nullable = true),
+      StructField("input3", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2", "input3"),
       Array(q1, q2), epsilon)

In the returned array, the results for columns input1 and input2 should be ok, and the result for input3 is empty: Array(Array(num1, num2), Array(num3, num4), Array())
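The per-column semantics proposed above can be sketched in plain Scala (a toy model with a hypothetical name, not the PR's implementation): NaN/null values are ignored per column, and an all-NA column yields an empty array instead of nulling out the whole result.

```scala
// Toy per-column quantile sketch; None models SQL null.
def approxQuantilesSketch(
    columns: Seq[Seq[Option[Double]]],
    probabilities: Seq[Double]): Seq[Array[Double]] = {
  columns.map { col =>
    // Ignore nulls (None) and NaN values in this column only.
    val valid = col.flatten.filterNot(_.isNaN).sorted
    if (valid.isEmpty) {
      Array.empty[Double] // all-NA column -> empty result, not null
    } else {
      probabilities.map { p =>
        valid(math.min((p * valid.length).toInt, valid.length - 1))
      }.toArray
    }
  }
}
```

With the dfNaN example above, input1 and input2 would each produce a two-element array while input3 produces an empty one, so callers can still use the valid columns.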

@MLnick
Contributor

MLnick commented Feb 23, 2017 via email

@SparkQA

SparkQA commented Feb 23, 2017

Test build #73338 has finished for PR 16971 at commit bdf3bf0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2017

Test build #73344 has finished for PR 16971 at commit e0365b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @MLnick @gatorsmile @thunterdb

try {
probabilities.map(summary.query)
} catch {
case e: SparkException => Seq.empty[Double]
Member

Please do not use exception handling for this purpose. Instead, you can return None.
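For illustration, an Option-returning query lets the caller drop the try/catch entirely (toy code with assumed names, not the actual Spark API):

```scala
// query returns None for an empty summary instead of throwing.
def query(sampled: Vector[Double], quantile: Double): Option[Double] =
  if (sampled.isEmpty) None
  else {
    val sorted = sampled.sorted
    Some(sorted(math.min((quantile * sorted.length).toInt, sorted.length - 1)))
  }

val probabilities = Seq(0.25, 0.5, 0.75)
// Empty summary: flatMap yields an empty Seq, no exception handling needed.
val emptyResult = probabilities.flatMap(p => query(Vector.empty, p))
// Non-empty summary: results come back as plain Doubles.
val filled = probabilities.flatMap(p => query(Vector(1.0, 2.0, 3.0), p))
```

Flattening the Options with flatMap makes the empty case fall out naturally, which is the idiomatic alternative to catching an exception for control flow.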

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73674 has finished for PR 16971 at commit 8d5941b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73676 has finished for PR 16971 at commit 402deb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case e: NoSuchElementException => null
}
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
probabilities, relativeError).map(_.toArray).toArray
Member

Nit: style issue

    StatFunctions.multipleApproxQuantiles(
      df.select(cols.map(col): _*),
      cols,
      probabilities,
      relativeError).map(_.toArray).toArray

require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
require(headSampled.isEmpty,
"Cannot operate on an uncompressed summary, call compress() first")

if (sampled.isEmpty) {
return None
}
Member

Nit:

if (sampled.isEmpty) return None

val v = row.getDouble(i)
if (!v.isNaN) {
summaries(i) = summaries(i).insert(v)
}
Member

Nit:

if (!v.isNaN) summaries(i) = summaries(i).insert(v)

-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      probabilities.flatMap(summary.query)
+    }
Member

Nit:

summaries.map { summary => probabilities.flatMap(summary.query) }

@@ -245,7 +245,7 @@ object ApproximatePercentile {
val result = new Array[Double](percentages.length)
var i = 0
while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
Member

Is it possible for this to return None? If so, you will get a strange exception here. Could you also add a test case for getPercentiles in ApproximatePercentileQuerySuite?

Contributor Author

It looks like it is impossible to return None here: since summaries.count != 0, summaries.sampled should not be empty, so None will not be returned. @thunterdb Is this correct?

Member

Yes. I think it is impossible to hit it. We need some test cases to ensure it. See my next comment.

Member

cc @thunterdb to ensure it.

Contributor

Yes, this is correct. But you should leave a comment, since it is not obvious.

Member

Thank you!

Contributor Author

Thanks all. I will add a comment here.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73739 has started for PR 16971 at commit 2071aae.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73744 has finished for PR 16971 at commit 2071aae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
}

private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
Member

Add a test case with summary.count == 0 and improve this helper function to cover it?

Contributor Author

Ok, I will add a test on empty data here.
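Such an empty-data test could look roughly like this (a toy summary standing in for QuantileSummaries; the names are assumptions):

```scala
// Minimal model of a summary whose query returns Option.
final case class ToySummary(sampled: Vector[Double]) {
  def count: Long = sampled.length.toLong
  def query(q: Double): Option[Double] =
    if (sampled.isEmpty) None
    else {
      val sorted = sampled.sorted
      Some(sorted(math.min((q * sorted.length).toInt, sorted.length - 1)))
    }
}

val emptySummary = ToySummary(Vector.empty)
// count == 0: querying must yield None, never an exception.
assert(emptySummary.count == 0L)
assert(emptySummary.query(0.5).isEmpty)
```

Branching on the Option (or asserting isEmpty) covers the count == 0 case the helper currently skips past with .get.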

@gatorsmile
Member

ping @zhengruifeng

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74703 has finished for PR 16971 at commit 42c7b25.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74705 has finished for PR 16971 at commit 7bf7db3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74717 has finished for PR 16971 at commit 00d67f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2017

Test build #74724 has started for PR 16971 at commit ed6dacd.

@gatorsmile
Member

retest this please

@gatorsmile
Member

LGTM pending Jenkins

cc @thunterdb @MLnick

@SparkQA

SparkQA commented Mar 19, 2017

Test build #74793 has finished for PR 16971 at commit ed6dacd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74837 has finished for PR 16971 at commit b1125fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74908 has finished for PR 16971 at commit b1125fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Mar 21, 2017

Since it is close to code freeze, I am first merging this PR. If any more comments, we can resolve them in the follow-up PR.

Thanks! Merging to master.

@asfgit asfgit closed this in 10691d3 Mar 21, 2017
@zhengruifeng zhengruifeng deleted the quantiles_nan branch March 21, 2017 01:31