Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-11420 Updating Stddev support via Imperative Aggregate #9380

Closed
wants to merge 14 commits into from
Closed

SPARK-11420 Updating Stddev support via Imperative Aggregate #9380

wants to merge 14 commits into from

Conversation

JihongMA
Copy link
Contributor

switched stddev support from DeclarativeAggregate to ImperativeAggregate.

@JihongMA JihongMA changed the title [Spark 11420] Updating Stddev support via Imperative Aggregate [SPARK-11420] Updating Stddev support via Imperative Aggregate Oct 30, 2015
@JihongMA JihongMA changed the title [SPARK-11420] Updating Stddev support via Imperative Aggregate SPARK-11420 Updating Stddev support via Imperative Aggregate Oct 30, 2015
@@ -1135,7 +992,76 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w
moments(4) = buffer.getDouble(fourthMomentOffset)
}

getStatistic(n, mean, moments)
if (n == 0.0) null
else if (n == 1.0) 0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we want this behavior, since these edge cases should be handled in the getStatistic implementation. If you see previous PR we established that Skewness and Kurtosis should yield Double.NaN when n == 1.0 but other functions like VariancePop should yield 0.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@mengxr
Copy link
Contributor

mengxr commented Nov 3, 2015

@JihongMA Could you address @sethah 's comment and rebase master? There are some merge conflicts.

@JihongMA
Copy link
Contributor Author

JihongMA commented Nov 3, 2015

so for skewness and kurtosis in case of count =1, we want to return null instead of 0. I can address it, but instead of returning Double.NaN, should we return null for stddev/variance when count = 0, null will be in line with all other stats functions, like mix, max...

@JihongMA
Copy link
Contributor Author

JihongMA commented Nov 3, 2015

I propose to return null for all cases which currently Double.NaN is returned. and change getStatistics() to return Any instead of Double.

@yu-iskw
Copy link
Contributor

yu-iskw commented Nov 3, 2015

@JihongMA I'm not sure about that. I don't think we should return null, instead of Double.NaN. Why do we need to change the return type?

@JihongMA
Copy link
Contributor Author

JihongMA commented Nov 3, 2015

getStatistics() will continue to return Double value for normal cases, changing it to return null only for edge cases. is there a strong reason to return Double.NaN? when count = 0, all other stats function, min, max, avg.. all return null.

@JihongMA
Copy link
Contributor Author

JihongMA commented Nov 4, 2015

@mengxr Please take another look.

@JihongMA
Copy link
Contributor Author

JihongMA commented Nov 4, 2015

@mengxr rebased with the changes @rxin [SPARK-11490], stddev / variance mapped to the corresponding sample stddev / variance. I checked Hive doesn't support this mapping, but I found other MPP database like Presto did the same alias mapping.
https://prestodb.io/docs/current/functions/aggregate.html

@mengxr
Copy link
Contributor

mengxr commented Nov 5, 2015

add to whitelist

@mengxr
Copy link
Contributor

mengxr commented Nov 5, 2015

ok to test

@yu-iskw
Copy link
Contributor

yu-iskw commented Nov 5, 2015

@JihongMA I don't know if there are any strong reasons in terms of catalyst. However, personally I think we should separate changing the return type and null from the issue. So, we should focus on refactoring stddev in this PR. It seems that at least Skewness and Kurtosis have nothing to do with the issue. If we need to discuss them, it would be great to do in another issue. Thanks.

@mengxr
Copy link
Contributor

mengxr commented Nov 5, 2015

+1 on @yu-iskw 's suggestion. Let's keep the changes in this PR minimal. Just replace StdDev implementation by imperative agg. I would prefer returning NaN when n == 1 to null, but we can create a new JIRA to discuss it.

@rxin
Copy link
Contributor

rxin commented Nov 5, 2015

BTW with #9480, we might not need to replace it with imperative aggregate anymore.

@SparkQA
Copy link

SparkQA commented Nov 5, 2015

Test build #45135 has finished for PR 9380 at commit e3417aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class StddevPop(child: Expression,\n * case class StddevSamp(child: Expression,\n

@mengxr
Copy link
Contributor

mengxr commented Nov 6, 2015

@rxin We still need to run some benchmark. The formulation is simple and fixed, and hence codegen won't bring much performance gain.

@JihongMA Could you address the comments on NaN vs null?

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45256 has finished for PR 9380 at commit b69d1e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class StddevPop(child: Expression,\n * case class StddevSamp(child: Expression,\n

@marmbrus
Copy link
Contributor

marmbrus commented Nov 7, 2015

Even though its simple, I think this implementation is boxing the result, which could result in slower performance on real workloads (but is harder to see in micro benchmarks)

@mengxr
Copy link
Contributor

mengxr commented Nov 9, 2015

Which part is boxing the result? I tested the following on master with changes from #9480:

val df = sqlContext.range(100000000)
df.select(var_samp("id")).show(); // ~7.5s
df.select(stddev_samp("id")).show() // ~10s

Both have low GC activities.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 9, 2015

The eval call is boxing which you aren't going to see without a groupby.

@mengxr
Copy link
Contributor

mengxr commented Nov 9, 2015

But eval only happens once per group.

@mengxr
Copy link
Contributor

mengxr commented Nov 11, 2015

@JihongMA Could you merge the current master? There are some merge conflicts.

For NaN vs. null, we had some discussion in https://issues.apache.org/jira/browse/SPARK-9079. The design is to return NaN if there exist NaN values in the aggregation. I think we should return NaN here, which is consistent with R and Python:

> mean(c())
[1] NA
> var(c(1))
[1] NA
> np.mean([])
Out[1] = na
> np.var([1], ddof=1)
Out[2] = nan

@marmbrus I think we can move the implementation from imperative to declarative in 1.7. This PR is to re-use the CentralMomentAgg for stddev. It removes 70 lines of code, which is a good sign:)

@SparkQA
Copy link

SparkQA commented Nov 12, 2015

Test build #45678 has finished for PR 9380 at commit dc0558b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class StddevSamp(child: Expression,\n * case class StddevPop(\n

@felixcheung
Copy link
Member

SparkR support has just been added so this change breaks tests

1. Failure (at test_sparkSQL.R#1010): group by, agg functions ------------------
0 not equal to df3_local[df3_local$name == "Andy", ][1, 2]
NaN - 0 == NaN

2. Failure (at test_sparkSQL.R#1041): group by, agg functions ------------------
0 not equal to df7_local[df7_local$name == "ID2", ][1, 2]
NaN - 0 == NaN

@JihongMA
Copy link
Contributor Author

@felixcheung Thank you! this is the change I have made to make it pass for R. I am not familiar with R .

df3 <- agg(gd, age = "stddev")
expect_is(df3, "DataFrame")
df3_local <- collect(df3)
expect_true(is.nan(df3_local[df3_local$name == "Andy",][1, 2]))

@felixcheung
Copy link
Member

@JihongMA yap that should fix them

@JihongMA
Copy link
Contributor Author

@AmplabJenkins please retest the change.

@yu-iskw
Copy link
Contributor

yu-iskw commented Nov 12, 2015

Jenkins, test this please.

@yu-iskw
Copy link
Contributor

yu-iskw commented Nov 12, 2015

@JihongMA thanks for the update! Could you revert Skewness.scala and Kurtosis.scala. Since I don't think the change relates to the issue. I know this is a minor thing, but we shouldn't change them in this PR.

@SparkQA
Copy link

SparkQA commented Nov 12, 2015

Test build #45749 has finished for PR 9380 at commit 7a239ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class StddevSamp(child: Expression,\n * case class StddevPop(\n

@mengxr
Copy link
Contributor

mengxr commented Nov 12, 2015

test this please

@SparkQA
Copy link

SparkQA commented Nov 12, 2015

Test build #45755 has finished for PR 9380 at commit 7a239ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class StddevSamp(child: Expression,\n * case class StddevPop(\n

@asfgit asfgit closed this in d292f74 Nov 12, 2015
@mengxr
Copy link
Contributor

mengxr commented Nov 12, 2015

LGTM. Merged into master and branch-1.6. Thanks! Btw, there is a minor style issue I marked inline. @JihongMA Could you submit another PR to change the output of mean([]) to NaN from null?

asfgit pushed a commit that referenced this pull request Nov 12, 2015
switched stddev support from DeclarativeAggregate to ImperativeAggregate.

Author: JihongMa <linlin200605@gmail.com>

Closes #9380 from JihongMA/SPARK-11420.
@JihongMA
Copy link
Contributor Author

@mengxr sure, will take care mean via seperate PR.

@JihongMA
Copy link
Contributor Author

@mengxr do we want to change the behavior for min, max as well?

@marmbrus
Copy link
Contributor

No, min and max can be used on string and other types so should not return NaN. We should follow the SQL standard here.

@mengxr
Copy link
Contributor

mengxr commented Nov 12, 2015

Sounds good.

dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015
switched stddev support from DeclarativeAggregate to ImperativeAggregate.

Author: JihongMa <linlin200605@gmail.com>

Closes apache#9380 from JihongMA/SPARK-11420.
@JihongMA JihongMA deleted the SPARK-11420 branch March 14, 2017 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants