-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support variance and standard deviation #2525
Conversation
@navis this is awesome thanks. Are there any docs? |
@navis this should be useful, can you move it to an extension though. also it should be possible to test it without changing druid core tests, please see datasketches extension tests for reference. |
@himanshug why an extension? |
@fjy because it is totally possible to do it in extension and takes bloat away from druid-core, if a new person wants to understand druid, he has less to bother about. |
I agree with @himanshug on this point. |
a84bae1
to
7076147
Compare
Moved to extension |
@@ -0,0 +1,60 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> | |||
<!-- | |||
~ Druid - a distributed column store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know a lot of the pom's use this notice, but can you use the one present in the java files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@navis is this dependent on @himanshug 's arithmetic PR? If so can you comment as such in the master comment? |
@navis in trying to hit the "count == 0" case, can you think of a code path that would hit that case? |
buf.putLong(position, count); | ||
buf.putDouble(position + SUM_OFFSET, sum); | ||
if (count > 1) { | ||
double t = count * v - sum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What algorithm are you using for streaming variance here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've forgot the most important thing. The code is copied as-is from apache hive (GenericUDAFVariance) and it says it's from,
Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in
"Algorithms for computing the sample variance: analysis and recommendations"
The American Statistician, 37 (1983) pp. 242--247.
Added that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And supposedly the original algorithm comes from here: http://doi.org/10.1080/00401706.1971.10488826 which is stuck behind a pay wall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for references.
@drcrallen No, It's not using arithmetic PR of @himanshug. And in current druid, it seemed not possible to make count == 0 case in real environment. Should we throw exception or something? |
7076147
to
acdbbc9
Compare
@navis since there isn't an obviously good solution for how to handle variance with count == 0, how about throwing an ISE and mentioning in the code that it shouldn't happen, and the correct behavior is not terribly obvious. |
VarianceHolder holder2 = (VarianceHolder) rhs; | ||
|
||
final double ratio = holder1.count / (double) holder2.count; | ||
final double t = ratio * holder1.sum - holder2.sum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orig paper has this as holder1.sum/ratio - holder2.sum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. It could be a huge mistake. There should be one more test with different m and n.
@navis can you refactor it a bit somehow so that the code for the sum, count, variance streaming update is in one place, and the code for merging streamed updates is in one place? There should hopefully only need to be two methods instead of having a few of them scattered about. |
@Override | ||
public void configure(Binder binder) | ||
{ | ||
if (ComplexMetrics.getSerdeForType("variance") == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you are using name "variance" and "varianceValue" for the things. Since these names exist in the global namespace of druid and would collide among different aggregator implementations. It would be great to name these specific to the algorithm used e.g. "xxxVariance" instead so that in future more implementations for variance with different algorithms can be incorporated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about hiveVariance
instead of ChanGolubLeVequeVariance
?
@drcrallen Changed to throw ISE instead of returning null or NaN for count ==0 case. |
acdbbc9
to
0343c0e
Compare
RE: naming convention If the description by Chan, et al. is mostly just copying from Youngs & Cramer, an they post the original implementation of the algorithm, then the scientifically correct name would be YoungsCramerVariance. |
@fjy As commented in #2525 (comment), I really don't want doing |
@navis can you explain a bit more about needing to support floats and longs? Why not just have doubles be the default of storing variance? Is it to save storage space? |
ping @navis |
ce0aa38
to
c83e32c
Compare
squashed commits to ease rebase. |
https://travis-ci.org/druid-io/druid/jobs/143795144 Looks unrelated |
The ingestion aggregator can only apply to numeric values. If you use "variance" | ||
then any input rows missing the value will be considered to have a value of 0. | ||
|
||
User can specify expected input type as one of "float", "long", "variance" for ingestion, which is by default "float". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird, github seems to have reverted to comic-sans for me, that looks like a capital L
but is indeed a l
Introduced changes of IncrementalIndex to fix NPE which can be thrown when classOfObject() of ObjectColumnSelector for float or long type columns is called, conforming it with that of QueryableIndexStorageAdapter. |
c83e32c
to
1d710e0
Compare
Seeing #3226, I think it's a natural trend to apply schema on input row. Just a comment. |
1d710e0
to
5b56311
Compare
@navis can we revert the getMetricClass change? I think several folks have pointed out it is unnecessary |
@fjy I think it's needed as commented above, updated master comment of this PR. (@drcrallen sorry, I had not understood what was "master comment") |
@navis okay 👍 for me |
👍 |
Aggregator for variance. Algorithm is copied from hive UDAF.
Includes
VarianceAggregatorFactory(variance)
,VarianceFoldingAggregatorFactory(varianceFold)
andStandardDeviationPostAggregator(stddev)
Introduced some changes of
IncrementalIndex
to fix NPE which can be thrown when classOfObject() of ObjectColumnSelector for float or long type(non complex type) columns is called, conforming it with that ofQueryableIndexStorageAdapter
.