Support variance and standard deviation #2525

navis · 2016-02-23T12:40:15Z

Adds an aggregator for variance. The algorithm is copied from the Hive UDAF.

Includes VarianceAggregatorFactory (variance), VarianceFoldingAggregatorFactory (varianceFold), and StandardDeviationPostAggregator (stddev).

Introduced some changes to IncrementalIndex to fix an NPE that can be thrown when classOfObject() of ObjectColumnSelector is called for float or long (non-complex) columns, making it consistent with QueryableIndexStorageAdapter.

fjy · 2016-02-23T19:04:08Z

@navis this is awesome thanks. Are there any docs?

himanshug · 2016-02-23T19:23:19Z

@navis this should be useful, but can you move it to an extension? Also, it should be possible to test it without changing druid core tests; please see the datasketches extension tests for reference.

fjy · 2016-02-24T21:21:18Z

@himanshug why an extension?

himanshug · 2016-02-24T22:02:51Z

@fjy because it is totally possible to do it in extension and takes bloat away from druid-core, if a new person wants to understand druid, he has less to bother about.
anyways, besides the philosophy, i think it was discussed in a dev-syncup and concluded that whatever can be done in a core extension should be done in a core extension.
@drcrallen @cheddar

drcrallen · 2016-02-24T22:13:13Z

I agree with @himanshug on this point.

navis · 2016-02-25T02:18:31Z

Moved to extension

drcrallen · 2016-02-25T17:20:19Z

extensions/variance/pom.xml

@@ -0,0 +1,60 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Druid - a distributed column store.


I know a lot of the pom's use this notice, but can you use the one present in the java files?

drcrallen · 2016-03-04T00:20:37Z

@navis is this dependent on @himanshug 's arithmetic PR? If so can you comment as such in the master comment?

drcrallen · 2016-03-04T00:24:28Z

@navis in trying to hit the "count == 0" case, can you think of a code path that would hit that case?

drcrallen · 2016-03-04T00:49:18Z

...ons/variance/src/main/java/io/druid/query/aggregation/variance/VarianceBufferAggregator.java

+    buf.putLong(position, count);
+    buf.putDouble(position + SUM_OFFSET, sum);
+    if (count > 1) {
+      double t = count * v - sum;


What algorithm are you using for streaming variance here?

I forgot the most important thing. The code is copied as-is from Apache Hive (GenericUDAFVariance), which cites its source as:

Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in
"Algorithms for computing the sample variance: analysis and recommendations"
The American Statistician, 37 (1983) pp. 242--247.

Added that.

http://www.cs.yale.edu/publications/techreports/tr222.pdf

And supposedly the original algorithm comes from here: http://doi.org/10.1080/00401706.1971.10488826 which is stuck behind a paywall

Thanks for references.
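
For readers following this thread, the streaming update in the diff above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the PR's actual class: `StreamingVariance` and the field name `nvariance` (the running sum of squared deviations from the mean) are hypothetical, and the accumulation line is written to match the `t = count * v - sum` step shown in the diff.

```java
/** Hypothetical holder for a one-pass variance computation; not the PR's class. */
final class StreamingVariance {
    long count;        // number of values seen so far
    double sum;        // running sum of the values
    double nvariance;  // running sum of squared deviations from the mean

    /** Folds one value into the running state. */
    void add(double v) {
        count++;
        sum += v;
        if (count > 1) {
            // t / (count - 1) equals v minus the mean of the previous values,
            // so t * t / (count * (count - 1)) is the increase in the
            // sum of squared deviations caused by v.
            double t = count * v - sum;
            nvariance += (t * t) / ((double) count * (count - 1));
        }
    }
}
```

Dividing `nvariance` by `count` (or by `count - 1`) afterwards gives the population (or sample) variance; the update itself never needs to revisit earlier values.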

navis · 2016-03-04T02:16:11Z

@drcrallen No, it's not using @himanshug's arithmetic PR. And in current Druid, it doesn't seem possible to hit the count == 0 case in a real environment. Should we throw an exception or something?

drcrallen · 2016-03-04T18:16:51Z

@navis since there isn't an obviously good solution for how to handle variance with count == 0, how about throwing an ISE and mentioning in the code that it shouldn't happen, and that the correct behavior is not terribly obvious?
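
A minimal sketch of that suggestion, under stated assumptions: the helper `VarianceResult.populationVariance` and its parameter names are hypothetical, and `java.lang.IllegalStateException` is used as a stand-in for Druid's `ISE` so the example compiles without Druid on the classpath.

```java
/** Hypothetical helper showing the "fail loudly on an empty holder" behavior. */
final class VarianceResult {
    /**
     * @param count     number of aggregated values
     * @param nvariance sum of squared deviations from the mean
     */
    static double populationVariance(long count, double nvariance) {
        if (count == 0) {
            // Not expected to happen in practice; there is no obviously
            // correct value to return (null? NaN? 0?), so fail instead.
            throw new IllegalStateException("variance requested for an empty holder");
        }
        return nvariance / count;
    }
}
```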

drcrallen · 2016-03-04T18:47:46Z

extensions/variance/src/main/java/io/druid/query/aggregation/variance/VarianceHolder.java

+    VarianceHolder holder2 = (VarianceHolder) rhs;
+
+    final double ratio = holder1.count / (double) holder2.count;
+    final double t = ratio * holder1.sum - holder2.sum;


Orig paper has this as holder1.sum/ratio - holder2.sum

You are right. It could be a huge mistake. There should be one more test with different m and n.
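
A sketch of the merge step with the corrected term, plus the kind of check with different m and n mentioned above. Everything here is an assumption for illustration: `VarianceMerge`, `Partial`, `of`, and `merge` are hypothetical names, not the PR's `VarianceHolder` API. The formula is the pairwise combination from the Chan, Golub, and LeVeque report, with `ratio = m / n`.

```java
/** Hypothetical merge of two partial variance aggregates; not the PR's VarianceHolder. */
final class VarianceMerge {
    /** Immutable partial aggregate: count, sum, and sum of squared deviations. */
    static final class Partial {
        final long count;
        final double sum;
        final double nvariance;

        Partial(long count, double sum, double nvariance) {
            this.count = count;
            this.sum = sum;
            this.nvariance = nvariance;
        }
    }

    /** Two-pass reference computation, used only to check the merge. */
    static Partial of(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = values.length == 0 ? 0 : sum / values.length;
        double nvariance = 0;
        for (double v : values) {
            nvariance += (v - mean) * (v - mean);
        }
        return new Partial(values.length, sum, nvariance);
    }

    /** Combines two partial aggregates without revisiting the raw values. */
    static Partial merge(Partial a, Partial b) {
        if (a.count == 0) {
            return b;
        }
        if (b.count == 0) {
            return a;
        }
        final double ratio = a.count / (double) b.count;  // m / n
        final double t = a.sum / ratio - b.sum;           // (n / m) * T_a - T_b
        final double nvariance =
            a.nvariance + b.nvariance + ratio / (a.count + b.count) * t * t;
        return new Partial(a.count + b.count, a.sum + b.sum, nvariance);
    }
}
```

Merging `of(1.0, 2.0, 3.0)` with `of(10.0, 20.0)` should agree with `of(1.0, 2.0, 3.0, 10.0, 20.0)` up to rounding. When both sides have the same count, `ratio` is 1 and `ratio * a.sum` equals `a.sum / ratio`, so a test with m == n cannot tell the two forms apart.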

drcrallen · 2016-03-04T18:59:21Z

@navis can you refactor it a bit somehow so that the code for the sum, count, variance streaming update is in one place, and the code for merging streamed updates is in one place? There should hopefully only need to be two methods instead of having a few of them scattered about.

himanshug · 2016-03-05T16:41:59Z

extensions/variance/src/main/java/io/druid/query/aggregation/variance/VarianceDruidModule.java

+  @Override
+  public void configure(Binder binder)
+  {
+    if (ComplexMetrics.getSerdeForType("variance") == null) {


I see that you are using the names "variance" and "varianceValue". These names live in Druid's global namespace and would collide among different aggregator implementations, so it would be great to name them after the algorithm used, e.g. "xxxVariance", so that more variance implementations with different algorithms can be added in the future.

How about hiveVariance instead of ChanGolubLeVequeVariance?

navis · 2016-03-07T02:34:50Z

@drcrallen Changed to throw ISE instead of returning null or NaN for the count == 0 case.

drcrallen · 2016-03-07T18:13:33Z

RE: naming convention

If the description by Chan, et al. is mostly just copying from Youngs & Cramer, and they post the original implementation of the algorithm, then the scientifically correct name would be YoungsCramerVariance.

navis · 2016-05-19T00:48:08Z

@fjy As commented in #2525 (comment), I really don't want to do instanceof checks on all of the inputs.

fjy · 2016-05-19T18:05:22Z

@navis can you explain a bit more about needing to support floats and longs? Why not just have doubles be the default of storing variance? Is it to save storage space?

fjy · 2016-06-15T22:39:57Z

ping @navis

fjy · 2016-06-22T22:10:09Z

@navis

navis · 2016-07-11T04:39:29Z

squashed commits to ease rebase.
@fjy no special handling, but I think Long.valueOf(String) and Double.valueOf(String) differ somewhat in cost.

drcrallen · 2016-07-13T16:48:20Z

Running io.druid.server.coordinator.DruidCoordinatorTest
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 64.099 sec <<< FAILURE! - in io.druid.server.coordinator.DruidCoordinatorTest
testCoordinatorRun(io.druid.server.coordinator.DruidCoordinatorTest)  Time elapsed: 61.525 sec  <<< ERROR!
java.lang.Exception: test timed out after 60000 milliseconds
    at java.lang.Thread.sleep(Native Method)
    at io.druid.server.coordinator.DruidCoordinatorTest.testCoordinatorRun(DruidCoordinatorTest.java:376)

https://travis-ci.org/druid-io/druid/jobs/143795144

Looks unrelated

drcrallen · 2016-07-13T16:52:38Z

docs/content/development/extensions-core/stats.md

+The ingestion aggregator can only apply to numeric values. If you use "variance"
+then any input rows missing the value will be considered to have a value of 0.
+
+User can specify expected input type as one of "float", "long", "variance" for ingestion, which is by default "float".


weird, github seems to have reverted to comic-sans for me, that looks like a capital L but is indeed a l

navis · 2016-07-18T02:36:05Z

Introduced changes to IncrementalIndex to fix an NPE that can be thrown when classOfObject() of ObjectColumnSelector is called for float or long type columns, making it consistent with QueryableIndexStorageAdapter.

navis · 2016-07-20T02:18:42Z

Seeing #3226, I think it's a natural trend to apply schema on input row. Just a comment.

fjy · 2016-08-02T18:40:34Z

@navis can we revert the getMetricClass change? I think several folks have pointed out it is unnecessary

navis · 2016-08-03T00:15:15Z

@fjy I think it's needed, as commented above; I updated the master comment of this PR. (@drcrallen sorry, I had not understood what "master comment" meant)

fjy · 2016-08-03T00:26:40Z

@navis okay

👍 for me

drcrallen · 2016-08-05T00:32:56Z

👍

fjy added this to the 0.9.1 milestone Feb 23, 2016

drcrallen added the Feature label Feb 24, 2016

navis force-pushed the support-variance-aggregator branch from a84bae1 to 7076147 Compare February 25, 2016 02:18

drcrallen mentioned this pull request Feb 25, 2016

add "function" field to long/double Sum aggs and "sqrt" to arithmetic post agg #1965

Closed

drcrallen reviewed Feb 25, 2016

drcrallen reviewed Mar 4, 2016

navis force-pushed the support-variance-aggregator branch from 7076147 to acdbbc9 Compare March 4, 2016 02:19

drcrallen reviewed Mar 4, 2016

himanshug reviewed Mar 5, 2016

navis force-pushed the support-variance-aggregator branch from acdbbc9 to 0343c0e Compare March 7, 2016 02:39

navis force-pushed the support-variance-aggregator branch from ce0aa38 to c83e32c Compare July 11, 2016 02:54

drcrallen closed this Jul 13, 2016

drcrallen reopened this Jul 13, 2016

drcrallen reviewed Jul 13, 2016

navis force-pushed the support-variance-aggregator branch from c83e32c to 1d710e0 Compare July 18, 2016 02:37

navis added 2 commits July 22, 2016 16:53

Support variance and standard deviation

77505aa

addressed comments

5b56311

navis force-pushed the support-variance-aggregator branch from 1d710e0 to 5b56311 Compare July 22, 2016 07:53

drcrallen merged commit 5b3f0cc into apache:master Aug 5, 2016

jon-wei mentioned this pull request Aug 5, 2016

Add variance aggregator from hive to NOTICE #3327

Merged

gianm mentioned this pull request Sep 23, 2016

Druid 0.9.2 release notes #3503

Closed

kaijianding mentioned this pull request Jan 18, 2017

average aggregator in both ingestion phase and query phase #3859

Closed

drcrallen mentioned this pull request Apr 16, 2018

Fix cache bug in stats module #5650

Merged


jihoonson mentioned this pull request Aug 14, 2019

add copyright info back to NOTICE and NOTICE.BINARY #8298

Merged

seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020

apache#2525 Add optional "earlyMessageRejectionPeriod" config

fecd4e0
