Support variance and standard deviation #2525

Merged
merged 2 commits into from
Aug 5, 2016

navis
Contributor

@navis navis commented Feb 23, 2016

Adds an aggregator for variance. The algorithm is copied from the Hive UDAF.

Includes VarianceAggregatorFactory(variance), VarianceFoldingAggregatorFactory(varianceFold) and StandardDeviationPostAggregator(stddev)

Introduced some changes to IncrementalIndex to fix an NPE that can be thrown when classOfObject() of ObjectColumnSelector is called for float or long type (non-complex) columns, making its behavior consistent with that of QueryableIndexStorageAdapter.

@fjy
Contributor

fjy commented Feb 23, 2016

@navis this is awesome thanks. Are there any docs?

@fjy fjy added this to the 0.9.1 milestone Feb 23, 2016
@himanshug
Contributor

@navis this should be useful. Can you move it to an extension, though? Also, it should be possible to test it without changing Druid core tests; please see the datasketches extension tests for reference.

@fjy
Contributor

fjy commented Feb 24, 2016

@himanshug why an extension?

@himanshug
Contributor

@fjy because it is entirely possible to do it in an extension, and that takes bloat away from druid-core: a new person who wants to understand Druid has less to bother about.
Anyway, besides the philosophy, I think it was discussed in a dev sync-up and concluded that whatever can be done in a core extension should be done in a core extension.
@drcrallen @cheddar

@drcrallen
Contributor

I agree with @himanshug on this point.

@navis navis force-pushed the support-variance-aggregator branch from a84bae1 to 7076147 Compare February 25, 2016 02:18
@navis
Contributor Author

navis commented Feb 25, 2016

Moved to extension

@@ -0,0 +1,60 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Druid - a distributed column store.
Contributor

I know a lot of the pom's use this notice, but can you use the one present in the java files?

Contributor Author

done

@drcrallen
Contributor

@navis is this dependent on @himanshug's arithmetic PR? If so, can you note that in the master comment?

@drcrallen
Contributor

@navis in trying to hit the "count == 0" case, can you think of a code path that would hit that case?

buf.putLong(position, count);
buf.putDouble(position + SUM_OFFSET, sum);
if (count > 1) {
double t = count * v - sum;
Contributor

What algorithm are you using for streaming variance here?

Contributor Author

I forgot the most important thing. The code is copied as-is from Apache Hive (GenericUDAFVariance), which cites its source as:

Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in
"Algorithms for computing the sample variance: analysis and recommendations"
The American Statistician, 37 (1983) pp. 242--247.

Added that.

Contributor

And supposedly the original algorithm comes from here: http://doi.org/10.1080/00401706.1971.10488826 which is stuck behind a paywall

Contributor Author

Thanks for the references.
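
For readers following this thread, the single-pass update in the snippet under review can be sketched roughly as below. This is a minimal illustration, not the PR's code: the class `StreamingVariance` and its field and method names are hypothetical, and only the `count * v - sum` step is taken from the diff. The accumulator keeps a running sum of squared deviations from the mean (called `nvariance` here, an assumed name), from which variance and standard deviation are derived at the end.

```java
/**
 * Sketch of a single-pass variance accumulator.
 * All names are hypothetical; only the update step mirrors the reviewed snippet.
 */
final class StreamingVariance {
    private long count;        // number of values seen so far
    private double sum;        // running sum of the values
    private double nvariance;  // running sum of squared deviations from the mean

    /** Folds one value into the running state. */
    void add(double v) {
        count++;
        sum += v;
        if (count > 1) {
            // t equals count * (v - mean), with the mean taken after adding v
            double t = count * v - sum;
            nvariance += (t * t) / ((double) count * (count - 1));
        }
    }

    /** Population variance; evaluates to NaN when no values were added. */
    double populationVariance() {
        return nvariance / count;
    }

    /** Standard deviation as the square root of the population variance. */
    double populationStdDev() {
        return Math.sqrt(populationVariance());
    }

    public static void main(String[] args) {
        StreamingVariance acc = new StreamingVariance();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            acc.add(v);
        }
        System.out.println(acc.populationVariance());
        System.out.println(acc.populationStdDev());
    }
}
```

For the eight values in `main`, the mean is 5 and the squared deviations sum to 32, so the population variance comes out at about 4 and the standard deviation at about 2.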

@navis
Contributor Author

navis commented Mar 4, 2016

@drcrallen No, it's not using @himanshug's arithmetic PR. And in current Druid, it doesn't seem possible to hit the count == 0 case in a real environment. Should we throw an exception or something?

@navis navis force-pushed the support-variance-aggregator branch from 7076147 to acdbbc9 Compare March 4, 2016 02:19
@drcrallen
Contributor

@navis since there isn't an obviously good solution for how to handle variance with count == 0, how about throwing an ISE and mentioning in the code that it shouldn't happen, and the correct behavior is not terribly obvious.

VarianceHolder holder2 = (VarianceHolder) rhs;

final double ratio = holder1.count / (double) holder2.count;
final double t = ratio * holder1.sum - holder2.sum;
Contributor

The original paper has this as holder1.sum/ratio - holder2.sum

Contributor Author

You are right. It could be a huge mistake. There should be one more test with different m and n.
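
A test with different m and n could look something like the sketch below. It is an illustration under assumptions: `VarianceState`, `of`, and `merge` are hypothetical names rather than the PR's `VarianceHolder` API, and the merge uses the `holder1.sum / ratio - holder2.sum` form pointed out in the review. The merged state is compared against a direct computation over all values.

```java
/**
 * Sketch of merging two partial variance states.
 * All names are hypothetical stand-ins, not the PR's classes.
 */
final class VarianceState {
    final long count;        // number of values
    final double sum;        // sum of the values
    final double nvariance;  // sum of squared deviations from the mean

    VarianceState(long count, double sum, double nvariance) {
        this.count = count;
        this.sum = sum;
        this.nvariance = nvariance;
    }

    /** Builds a state directly with a two-pass computation, used as the reference. */
    static VarianceState of(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = values.length == 0 ? 0 : sum / values.length;
        double nvariance = 0;
        for (double v : values) {
            nvariance += (v - mean) * (v - mean);
        }
        return new VarianceState(values.length, sum, nvariance);
    }

    /** Combines two partial states without revisiting the underlying values. */
    static VarianceState merge(VarianceState a, VarianceState b) {
        if (a.count == 0) {
            return b;
        }
        if (b.count == 0) {
            return a;
        }
        double ratio = a.count / (double) b.count;
        // Form noted in the review: a.sum / ratio, not ratio * a.sum
        double t = a.sum / ratio - b.sum;
        double nvariance = a.nvariance + b.nvariance + ratio / (a.count + b.count) * t * t;
        return new VarianceState(a.count + b.count, a.sum + b.sum, nvariance);
    }

    public static void main(String[] args) {
        VarianceState a = of(1.0, 2.0, 3.0);                    // 3 values
        VarianceState b = of(10.0, 20.0, 30.0, 40.0, 50.0);     // 5 values
        VarianceState merged = merge(a, b);
        VarianceState direct = of(1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0, 50.0);
        System.out.println(merged.nvariance + " vs " + direct.nvariance);
    }
}
```

When both sides have the same count, `ratio` is 1 and the two forms of `t` coincide, which is why a test with equal-sized inputs cannot tell them apart.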

@drcrallen
Contributor

@navis can you refactor it a bit somehow so that the code for the sum, count, variance streaming update is in one place, and the code for merging streamed updates is in one place? There should hopefully only need to be two methods instead of having a few of them scattered about.

@Override
public void configure(Binder binder)
{
if (ComplexMetrics.getSerdeForType("variance") == null) {
Contributor

I see that you are using the names "variance" and "varianceValue". These names live in Druid's global namespace and would collide among different aggregator implementations. It would be great to make them specific to the algorithm used, e.g. "xxxVariance", so that more variance implementations with different algorithms can be incorporated in the future.

Contributor Author

How about hiveVariance instead of ChanGolubLeVequeVariance?

@navis
Contributor Author

navis commented Mar 7, 2016

@drcrallen Changed to throw ISE instead of returning null or NaN for the count == 0 case.
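
The guard described here might look like the following sketch. The class and method names are hypothetical, the JDK's `IllegalStateException` stands in for Druid's `ISE`, and the message text is an assumption.

```java
/**
 * Sketch of rejecting a variance request on an empty state.
 * Names are hypothetical; IllegalStateException stands in for Druid's ISE.
 */
final class VarianceSnapshot {
    private final long count;        // number of values aggregated
    private final double nvariance;  // sum of squared deviations from the mean

    VarianceSnapshot(long count, double nvariance) {
        this.count = count;
        this.nvariance = nvariance;
    }

    /** Returns the population variance, or throws if nothing was aggregated. */
    double getVariance() {
        if (count == 0) {
            // Not expected to happen in practice, and there is no obviously
            // correct value to return for an empty input.
            throw new IllegalStateException("variance requested with count == 0");
        }
        return nvariance / count;
    }

    public static void main(String[] args) {
        System.out.println(new VarianceSnapshot(8, 32.0).getVariance());
        try {
            new VarianceSnapshot(0, 0.0).getVariance();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Throwing makes an unexpected empty state visible immediately, whereas a null or NaN result could propagate silently into later computations.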

@navis navis force-pushed the support-variance-aggregator branch from acdbbc9 to 0343c0e Compare March 7, 2016 02:39
@drcrallen
Contributor

RE: naming convention

If the description by Chan et al. is mostly just copied from Youngs & Cramer, and they posted the original implementation of the algorithm, then the scientifically correct name would be YoungsCramerVariance.

@navis
Contributor Author

navis commented May 19, 2016

@fjy As commented in #2525 (comment), I really don't want to do instanceof checks for all of the inputs.

@fjy
Contributor

fjy commented May 19, 2016

@navis can you explain a bit more about needing to support floats and longs? Why not just have doubles be the default of storing variance? Is it to save storage space?

@fjy
Contributor

fjy commented Jun 15, 2016

ping @navis

@fjy
Contributor

fjy commented Jun 22, 2016

@navis

@navis navis force-pushed the support-variance-aggregator branch from ce0aa38 to c83e32c Compare July 11, 2016 02:54
@navis
Contributor Author

navis commented Jul 11, 2016

squashed commits to ease rebase.
@fjy no special handling, but I think Long.valueOf(String) and Double.valueOf(String) differ somewhat in cost.

@drcrallen
Contributor

Running io.druid.server.coordinator.DruidCoordinatorTest
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 64.099 sec <<< FAILURE! - in io.druid.server.coordinator.DruidCoordinatorTest
testCoordinatorRun(io.druid.server.coordinator.DruidCoordinatorTest)  Time elapsed: 61.525 sec  <<< ERROR!
java.lang.Exception: test timed out after 60000 milliseconds
    at java.lang.Thread.sleep(Native Method)
    at io.druid.server.coordinator.DruidCoordinatorTest.testCoordinatorRun(DruidCoordinatorTest.java:376)

https://travis-ci.org/druid-io/druid/jobs/143795144

Looks unrelated

@drcrallen drcrallen closed this Jul 13, 2016
@drcrallen drcrallen reopened this Jul 13, 2016
The ingestion aggregator can only apply to numeric values. If you use "variance"
then any input rows missing the value will be considered to have a value of 0.

User can specify expected input type as one of "float", "long", "variance" for ingestion, which is by default "float".
Contributor

weird, github seems to have reverted to comic-sans for me, that looks like a capital L but is indeed a l

@navis
Contributor Author

navis commented Jul 18, 2016

Introduced changes to IncrementalIndex to fix an NPE that can be thrown when classOfObject() of ObjectColumnSelector is called for float or long type columns, making its behavior consistent with that of QueryableIndexStorageAdapter.

@navis navis force-pushed the support-variance-aggregator branch from c83e32c to 1d710e0 Compare July 18, 2016 02:37
@navis
Contributor Author

navis commented Jul 20, 2016

Seeing #3226, I think it's a natural trend to apply schema on input row. Just a comment.

@navis navis force-pushed the support-variance-aggregator branch from 1d710e0 to 5b56311 Compare July 22, 2016 07:53
@fjy
Contributor

fjy commented Aug 2, 2016

@navis can we revert the getMetricClass change? I think several folks have pointed out it is unnecessary

@navis
Contributor Author

navis commented Aug 3, 2016

@fjy I think it's needed as commented above, updated master comment of this PR. (@drcrallen sorry, I had not understood what was "master comment")

@fjy
Contributor

fjy commented Aug 3, 2016

@navis okay

👍 for me

@drcrallen
Contributor

👍

8 participants