SPARK-11420 Updating Stddev support via Imperative Aggregate #9380

JihongMA · 2015-10-30T17:22:44Z

Switched stddev support from DeclarativeAggregate to ImperativeAggregate.

sethah · 2015-11-02T18:32:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala

@@ -1135,7 +992,76 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w
      moments(4) = buffer.getDouble(fourthMomentOffset)
    }

-    getStatistic(n, mean, moments)
+    if (n == 0.0) null
+    else if (n == 1.0) 0.0


I don't believe we want this behavior, since these edge cases should be handled in the getStatistic implementation. In the previous PR we established that Skewness and Kurtosis should yield Double.NaN when n == 1.0, while other functions like VariancePop should yield 0.0.
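
A minimal sketch of the point above, written in Python for brevity and not taken from the Spark source: the shared evaluation step stays free of n == 1 special cases, and each statistic decides its own edge-case result. The class and method names mirror the ones in the diff but are hypothetical stand-ins, `moments[k]` is assumed to hold the k-th central moment sum, and the NaN returned for n == 0 is only a placeholder.

```python
import math
from abc import ABC, abstractmethod


class CentralMomentAgg(ABC):
    """Hypothetical stand-in for the aggregate's final evaluation step."""

    def eval(self, n, mean, moments):
        # No special-casing of n == 1 here; each statistic decides for itself.
        return self.get_statistic(n, mean, moments)

    @abstractmethod
    def get_statistic(self, n, mean, moments):
        ...


class VariancePop(CentralMomentAgg):
    def get_statistic(self, n, mean, moments):
        if n == 0.0:
            return math.nan  # placeholder choice for the empty case
        # With a single row the second central moment sum is 0, giving 0.0.
        return moments[2] / n


class Skewness(CentralMomentAgg):
    def get_statistic(self, n, mean, moments):
        # Undefined when the second moment sum is 0, which includes n == 1.
        if n == 0.0 or moments[2] == 0.0:
            return math.nan
        return math.sqrt(n) * moments[3] / math.sqrt(moments[2] ** 3)
```

With this layout, a blanket `else if (n == 1.0) 0.0` in the shared step would override the NaN that Skewness is meant to produce.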

mengxr · 2015-11-03T16:46:51Z

@JihongMA Could you address @sethah's comment and rebase on master? There are some merge conflicts.

JihongMA · 2015-11-03T18:49:05Z

So for skewness and kurtosis, when count = 1, we want to return null instead of 0. I can address it, but instead of returning Double.NaN, should we return null for stddev/variance when count = 0? null would be in line with all other stats functions, like min, max...

JihongMA · 2015-11-03T18:55:34Z

I propose returning null in all cases where Double.NaN is currently returned, and changing getStatistic() to return Any instead of Double.

yu-iskw · 2015-11-03T19:29:58Z

@JihongMA I'm not sure about that. I don't think we should return null instead of Double.NaN. Why do we need to change the return type?

JihongMA · 2015-11-03T20:19:30Z

getStatistic() will continue to return a Double value for normal cases; it would return null only for edge cases. Is there a strong reason to return Double.NaN? When count = 0, all the other stats functions (min, max, avg, ...) return null.

JihongMA · 2015-11-04T16:03:59Z

@mengxr Please take another look.

JihongMA · 2015-11-04T20:06:12Z

@mengxr Rebased onto the changes from @rxin [SPARK-11490], which map stddev / variance to the corresponding sample stddev / variance. I checked that Hive doesn't support this mapping, but I found that other MPP databases such as Presto do the same alias mapping.
https://prestodb.io/docs/current/functions/aggregate.html

mengxr · 2015-11-05T19:28:20Z

add to whitelist

mengxr · 2015-11-05T19:28:23Z

ok to test

yu-iskw · 2015-11-05T19:29:36Z

@JihongMA I don't know if there are any strong reasons in terms of catalyst. However, personally I think we should keep the return-type and null changes separate from this issue, and focus on refactoring stddev in this PR. It seems that at least Skewness and Kurtosis have nothing to do with the issue. If we need to discuss them, it would be great to do so in another issue. Thanks.

mengxr · 2015-11-05T19:33:15Z

+1 on @yu-iskw's suggestion. Let's keep the changes in this PR minimal. Just replace StdDev implementation by imperative agg. I would prefer returning NaN when n == 1 to null, but we can create a new JIRA to discuss it.

rxin · 2015-11-05T19:41:04Z

BTW with #9480, we might not need to replace it with imperative aggregate anymore.

SparkQA · 2015-11-05T22:00:37Z

Test build #45135 has finished for PR 9380 at commit e3417aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class StddevPop(child: Expression,
 * case class StddevSamp(child: Expression,

mengxr · 2015-11-06T19:29:50Z

@rxin We still need to run some benchmarks. The formulation is simple and fixed, and hence codegen won't bring much performance gain.

@JihongMA Could you address the comments on NaN vs null?

SparkQA · 2015-11-06T23:46:43Z

Test build #45256 has finished for PR 9380 at commit b69d1e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class StddevPop(child: Expression,
 * case class StddevSamp(child: Expression,

marmbrus · 2015-11-07T01:08:42Z

Even though it's simple, I think this implementation is boxing the result, which could result in slower performance on real workloads (but is harder to see in micro benchmarks).

mengxr · 2015-11-09T22:54:28Z

Which part is boxing the result? I tested the following on master with changes from #9480:

import org.apache.spark.sql.functions.{stddev_samp, var_samp}

val df = sqlContext.range(100000000)
df.select(var_samp("id")).show()    // ~7.5s
df.select(stddev_samp("id")).show() // ~10s

Both show low GC activity.

marmbrus · 2015-11-09T23:03:18Z

The eval call is boxing, which you aren't going to see without a groupBy.

mengxr · 2015-11-09T23:10:52Z

But eval only happens once per group.

mengxr · 2015-11-11T01:57:03Z

@JihongMA Could you merge the current master? There are some merge conflicts.

For NaN vs. null, we had some discussion in https://issues.apache.org/jira/browse/SPARK-9079. The design is to return NaN if there exist NaN values in the aggregation. I think we should return NaN here, which is consistent with R and Python:

> mean(c())
[1] NA
> var(c(1))
[1] NA

> np.mean([])
Out[1] = nan
> np.var([1], ddof=1)
Out[2] = nan

@marmbrus I think we can move the implementation from imperative to declarative in 1.7. This PR is to re-use the CentralMomentAgg for stddev. It removes 70 lines of code, which is a good sign:)
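
To make the reuse argument concrete, here is a rough Python sketch, not the Spark implementation, of how a single buffer holding the count, mean and second central moment sum can serve both variance and stddev. It assumes Welford's single-pass update and the usual pairwise merge for partial buffers; `MomentBuffer`, `var_samp` and `stddev_samp` are illustrative names for this sketch only. The sample statistics return NaN for n == 0 and n == 1, matching the R and NumPy outputs quoted in the comment above.

```python
import math


class MomentBuffer:
    """Count, mean and M2 (sum of squared deviations); hypothetical name."""

    def __init__(self):
        self.n = 0.0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # Welford's single-pass update for one new value.
        self.n += 1.0
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial buffers, e.g. from different partitions.
        if other.n == 0.0:
            return
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n


def var_samp(buf):
    # NaN for n == 0 and n == 1: the sample variance is undefined there.
    if buf.n <= 1.0:
        return math.nan
    return buf.m2 / (buf.n - 1.0)


def stddev_samp(buf):
    # sqrt of NaN is NaN, so the edge cases propagate without extra branches.
    return math.sqrt(var_samp(buf))
```

Because stddev is only a square root applied to the variance result, no separate buffer or update logic is needed for it, which is in the spirit of the code removal mentioned above.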

SparkQA · 2015-11-12T00:24:04Z

Test build #45678 has finished for PR 9380 at commit dc0558b.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class StddevSamp(child: Expression,
 * case class StddevPop(

felixcheung · 2015-11-12T01:43:54Z

SparkR support has just been added, so this change breaks its tests:

1. Failure (at test_sparkSQL.R#1010): group by, agg functions ------------------
0 not equal to df3_local[df3_local$name == "Andy", ][1, 2]
NaN - 0 == NaN

2. Failure (at test_sparkSQL.R#1041): group by, agg functions ------------------
0 not equal to df7_local[df7_local$name == "ID2", ][1, 2]
NaN - 0 == NaN

JihongMA · 2015-11-12T03:47:15Z

@felixcheung Thank you! This is the change I have made to make the R tests pass. I am not familiar with R.

df3 <- agg(gd, age = "stddev")
expect_is(df3, "DataFrame")
df3_local <- collect(df3)
expect_true(is.nan(df3_local[df3_local$name == "Andy",][1, 2]))

felixcheung · 2015-11-12T07:33:10Z

@JihongMA Yep, that should fix them.

JihongMA · 2015-11-12T17:15:42Z

@AmplabJenkins please retest the change.

yu-iskw · 2015-11-12T17:33:35Z

Jenkins, test this please.

yu-iskw · 2015-11-12T17:45:28Z

@JihongMA Thanks for the update! Could you revert Skewness.scala and Kurtosis.scala? I don't think those changes relate to the issue. I know this is a minor thing, but we shouldn't change them in this PR.

SparkQA · 2015-11-12T18:19:02Z

Test build #45749 has finished for PR 9380 at commit 7a239ec.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class StddevSamp(child: Expression,
 * case class StddevPop(

mengxr · 2015-11-12T18:46:13Z

test this please

SparkQA · 2015-11-12T21:09:29Z

Test build #45755 has finished for PR 9380 at commit 7a239ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * case class StddevSamp(child: Expression,
 * case class StddevPop(

mengxr · 2015-11-12T21:54:27Z

LGTM. Merged into master and branch-1.6. Thanks! Btw, there is a minor style issue I marked inline. @JihongMA Could you submit another PR to change the output of mean([]) to NaN from null?

switched stddev support from DeclarativeAggregate to ImperativeAggregate. Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.

JihongMA · 2015-11-12T21:59:02Z

@mengxr Sure, I will take care of mean in a separate PR.

JihongMA · 2015-11-12T22:11:35Z

@mengxr do we want to change the behavior for min, max as well?

marmbrus · 2015-11-12T23:03:27Z

No, min and max can be used on strings and other types, so they should not return NaN. We should follow the SQL standard here.
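
A small Python illustration of the distinction drawn here, using None to model SQL NULL. The function names are hypothetical and this is not how Spark implements these aggregates; the NaN result for an empty mean reflects the follow-up change requested earlier in the thread, not something this PR does.

```python
import math


def sql_min(values):
    # Works for any ordered type (numbers, strings, ...), so the only
    # type-independent "no result" value is None, standing in for NULL.
    present = [v for v in values if v is not None]
    return min(present) if present else None


def mean_or_nan(values):
    # Numeric only, so NaN is a representable "undefined" result.
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else math.nan
```

A string column has no NaN value to return, which is why an empty min has to stay NULL regardless of what the numeric-only aggregates do.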

mengxr · 2015-11-12T23:09:56Z

Sounds good.

JihongMA added 2 commits October 29, 2015 23:46

SPARK-11420: stddev via Imperative Aggregate

5094db0

handle null

0113626

JihongMA changed the title ~~[Spark 11420] Updating Stddev support via Imperative Aggregate~~ [SPARK-11420] Updating Stddev support via Imperative Aggregate Oct 30, 2015

JihongMA changed the title ~~[SPARK-11420] Updating Stddev support via Imperative Aggregate~~ SPARK-11420 Updating Stddev support via Imperative Aggregate Oct 30, 2015

sethah reviewed Nov 2, 2015

JihongMA added 4 commits November 3, 2015 13:15

address comment

24195fe

rebase upstream

c360f11

minor fix

4ca8b19

rebase with upstream

747a911

JihongMA added 2 commits November 4, 2015 11:51

rebase with upstream to revert stddev as alias of stddev_samp

57eeeed

style fix

402971c

rebase with upstream

e3417aa

address comment

b69d1e6

JihongMA added 2 commits November 11, 2015 14:07

rebase with upstream & handle NaN

eea699a

fix tests

dc0558b

JihongMA added 2 commits November 11, 2015 19:22

fix R test

ca407bc

fix test_sparkSQL.R

7a239ec

asfgit closed this in d292f74 Nov 12, 2015

JihongMA deleted the SPARK-11420 branch March 14, 2017 00:42
