[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile #16971
Conversation
Test build #73033 has finished for PR 16971 at commit
@@ -54,6 +54,8 @@ object StatFunctions extends Logging {
   * Note that values greater than 1 are accepted but give the same result as 1.
   *
   * @return for each column, returns the requested approximations
+  *
+  * @note null and NaN values will be removed from the numerical column before calculation.
I think "will be ignored" is more accurate than "will be removed"
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
   * Note that values greater than 1 are accepted but give the same result as 1.
   * @return the approximate quantiles at the given probabilities of each column
   *
-  * @note Rows containing any null or NaN values will be removed before calculation. If
-  * the dataframe is empty or all rows contain null or NaN, null is returned.
+  * @note null and NaN values will be removed from the numerical column before calculation. If
Again, "ignored" is slightly better than "removed from"
    try {
-     StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+     StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
        probabilities, relativeError).map(_.toArray).toArray
    } catch {
      case e: NoSuchElementException => null
This went in for the other PR, but I still question whether we should be returning null here. Is this standard in Spark SQL? What about returning an empty Array? cc @gatorsmile
+1. I tend to think that the result should be NaN (following the IEEE convention) or null (following the Scala Option convention). But pending a resolution, I would be fine with throwing an exception because it is the most conservative behavior (stopping computation). Returning null usually causes some issues in a functional context such as Spark.
In Spark SQL, all the other built-in functions will not throw an exception if the input data set is empty. An empty input data set is pretty normal. Returning either null or an empty Array looks ok to me.
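The per-column convention under discussion can be sketched outside Spark. This is a hypothetical, exact-quantile stand-in (the name quantilesPerColumn and its signature are illustrative, not Spark API): null and NaN values are ignored, and a column left with no valid values yields an empty array rather than null or an exception.

```scala
// Hypothetical sketch (not Spark API) of the convention discussed above.
def quantilesPerColumn(
    columns: Seq[Seq[java.lang.Double]],
    probabilities: Seq[Double]): Seq[Array[Double]] = {
  columns.map { col =>
    // Ignore nulls and NaNs before any computation.
    val valid = col.filter(v => v != null && !v.isNaN).map(_.doubleValue).sorted
    if (valid.isEmpty) {
      Array.empty[Double] // empty dataset, or an all-null/NaN column
    } else {
      // An exact quantile lookup stands in for the approximate algorithm.
      probabilities.map { p =>
        valid(math.min((p * valid.length).toInt, valid.length - 1))
      }.toArray
    }
  }
}
```

With this shape, a caller asking for quantiles over several columns always gets one array per column and can test `result(i).isEmpty` instead of null-checking.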
    assert(resNaN1(0) === resNaNAll(0)(0))
    assert(resNaN1(1) === resNaNAll(0)(1))
    assert(resNaN2(0) === resNaNAll(1)(0))
    assert(resNaN2(1) === resNaNAll(1)(1))
Do we need a test for one column all nulls (that it returns null)?
Yes, I created a new column containing only NaN/null.
@@ -78,7 +80,13 @@ object StatFunctions extends Logging {
    def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
      var i = 0
      while (i < summaries.length) {
-       summaries(i) = summaries(i).insert(row.getDouble(i))
+       val item = row(i)
This works, though perhaps we can do:
    if (!row.isNullAt(i)) {
      val v = row.getDouble(i)
      if (!v.isNaN) {
        summaries(i) = summaries(i).insert(v)
      }
    }
Force-pushed from d5e79a8 to 31346c3
Test build #73211 has finished for PR 16971 at commit
@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
    def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
      var i = 0
      while (i < summaries.length) {
-       summaries(i) = summaries(i).insert(row.getDouble(i))
+       if (!row.isNullAt(i)) {
Thank you for fixing this issue, it was an oversight in my original implementation.
The current exception being thrown depends on an implementation detail (calling sampled.head). Can you modify the function def query below to explicitly throw an exception if sampled is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class.
As @MLnick was mentioning above, it would be preferable to eventually return an Option, null, or NaN, but this can wait for more consensus.
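A minimal sketch of what the requested change to query could look like, assuming a sampled buffer as in the PR's QuantileSummaries (the actual lookup is elided and the final line is only a placeholder, not the merged code):

```scala
/**
 * Runs a query for a given quantile.
 * @return None if no values have been inserted (i.e. sampled is empty).
 */
def query(quantile: Double): Option[Double] = {
  require(quantile >= 0 && quantile <= 1.0,
    "quantile should be in the range [0.0, 1.0]")
  // Make the empty case explicit and documented, instead of relying on
  // sampled.head throwing from an empty buffer.
  if (sampled.isEmpty) return None
  // ... normal approximate lookup over sampled follows ...
  Some(sampled.head.value) // placeholder for the real lookup
}
```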
@zhengruifeng thanks for looking into this issue. I have one comment above.
@thunterdb Good point. I will check the sampled in def query. @MLnick @gatorsmile I prefer an empty array as the result for an empty dataset or for columns that contain only na. In the returned array, results for the other columns should still be available.
Yes, my point was returning null is not very idiomatic in Scala. Better to return an Option or an empty collection. Option doesn't work for Java compat, so an empty Array is best in this case, I believe.
+1 for empty Array, and if we can return the quantiles for the non-empty / non-NaN cols as per your suggestion, that is ideal.
…On Thu, 23 Feb 2017 at 08:12, Ruifeng Zheng ***@***.***> wrote:
@thunterdb <https://github.com/thunterdb> Good point. I will check the
sampled in def query.
@MLnick <https://github.com/MLnick> @gatorsmile
<https://github.com/gatorsmile> I prefer an empty array as the result for
an empty dataset or for columns that contain only na.
And in the case that only some columns contain only na, the current
implementation will return null, so the results for all columns cannot be
obtained. I think the results for the ordinary columns should be accessible.
    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0, Double.NaN),
      Row(1.0, -1.0, null), Row(-1.0, Double.NaN, null), Row(Double.NaN, Double.NaN, null),
      Row(null, null, Double.NaN), Row(null, 1.0, null), Row(-1.0, null, Double.NaN),
      Row(Double.NaN, null, null)))
    val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
      StructField("input2", DoubleType, nullable = true),
      StructField("input3", DoubleType, nullable = true)))
    val dfNaN = spark.createDataFrame(rows, schema)
    val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2", "input3"),
      Array(q1, q2), epsilon)
In the returned array, the results for columns input1 and input2 should be ok,
and the result for input3 is empty: Array(Array(num1, num2), Array(num3,
num4), Array())
Force-pushed from 31346c3 to bdf3bf0
Test build #73338 has finished for PR 16971 at commit
Test build #73344 has finished for PR 16971 at commit
ping @MLnick @gatorsmile @thunterdb
    try {
      probabilities.map(summary.query)
    } catch {
      case e: SparkException => Seq.empty[Double]
Please do not use exception handling for this purpose. Instead, you can return None.
Force-pushed from e0365b9 to 8d5941b
Test build #73674 has finished for PR 16971 at commit
Test build #73676 has finished for PR 16971 at commit
      case e: NoSuchElementException => null
    }
    StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
      probabilities, relativeError).map(_.toArray).toArray
Nit: style issue
    StatFunctions.multipleApproxQuantiles(
      df.select(cols.map(col): _*),
      cols,
      probabilities,
      relativeError).map(_.toArray).toArray
    require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
    require(headSampled.isEmpty,
      "Cannot operate on an uncompressed summary, call compress() first")

    if (sampled.isEmpty) {
      return None
    }
Nit:
if (sampled.isEmpty) return None
      val v = row.getDouble(i)
      if (!v.isNaN) {
        summaries(i) = summaries(i).insert(v)
      }
Nit:
if (!v.isNaN) summaries(i) = summaries(i).insert(v)
-   summaries.map { summary => probabilities.map(summary.query) }
+   summaries.map { summary =>
+     probabilities.flatMap(summary.query)
+   }
Nit:
summaries.map { summary => probabilities.flatMap(summary.query) }
@@ -245,7 +245,7 @@ object ApproximatePercentile {
    val result = new Array[Double](percentages.length)
    var i = 0
    while (i < percentages.length) {
-     result(i) = summaries.query(percentages(i))
+     result(i) = summaries.query(percentages(i)).get
Is it possible for this to return None? If so, you will get a strange exception. Could you also add a test case for getPercentiles in ApproximatePercentileQuerySuite?
It looks like it is impossible to return None here: since summaries.count != 0, summaries.sampled should not be empty, so None will not be returned. @thunterdb Is this correct?
Yes. I think it is impossible to hit it. We need some test cases to ensure it. See my next comment.
cc @thunterdb to ensure it.
Yes, this is correct. But you should leave a comment, since it is not obvious.
Thank you!
Thanks all. I will add a comment here.
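The invariant agreed on above could be recorded as a short comment at the call site. This is a sketch of the kind of comment requested, not the exact text that was merged:

```scala
// getPercentiles is only reached when summaries.count != 0, i.e. at least
// one value was inserted, so sampled is non-empty and query always returns
// Some(_); the .get below therefore cannot throw.
while (i < percentages.length) {
  result(i) = summaries.query(percentages(i)).get
  i += 1
}
```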
Force-pushed from 402deb0 to 2071aae
Test build #73739 has started for PR 16971 at commit
Jenkins, retest this please
Test build #73744 has finished for PR 16971 at commit
@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
  }

  private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-   val approx = summary.query(quant)
+   val approx = summary.query(quant).get
Add a test case with summary.count == 0 and improve this helper function to cover it?
Ok, I will add a test on empty data here.
ping @zhengruifeng
Force-pushed from 2071aae to 42c7b25
Test build #74703 has finished for PR 16971 at commit
Test build #74705 has finished for PR 16971 at commit
Test build #74717 has finished for PR 16971 at commit
Test build #74724 has started for PR 16971 at commit
retest this please
LGTM pending Jenkins
Test build #74793 has finished for PR 16971 at commit
Test build #74837 has finished for PR 16971 at commit
retest this please
Test build #74908 has finished for PR 16971 at commit
Since it is close to code freeze, I am first merging this PR. If there are any more comments, we can resolve them in a follow-up PR. Thanks! Merging to master.
What changes were proposed in this pull request?
Update StatFunctions.multipleApproxQuantiles to handle NaN/null.
How was this patch tested?
Existing tests and added tests.