implement special distinctcount #2602
Conversation
@binlijin how fast is the exact distinct count?
@binlijin can you add a README to the module? Something to help explain how to do the distinct count and also what to watch out for. If you guys use this in production, can you also note it is used in production in the README?
@fjy, the exact distinct count is faster than thetaSketch (which uses size 16384).
@binlijin there's some merge conflicts
@binlijin we'll need some docs on how to use this
@@ -101,6 +101,7 @@
<module>extensions/cloudfiles-extensions</module>
<module>extensions/datasketches</module>
<module>extensions/avro-extensions</module>
<module>extensions/distinctcount</module>
let's put this in extensions-contrib for now
done
Performance: the author posted "take time" measurements for several DistinctCount queries here; the figures were attachments and are not preserved in this extract.
@@ -0,0 +1,51 @@
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First use https://github.com/druid-io/druid/pull/2570 to partition data by a dimension, for example visitor_id.
instead of the 2 links above, can you please document the process of how this extension could be used.
also please indicate the assumption that a value has to be present in only one segment, or this might overcount.
ok
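The overcounting concern raised above can be made concrete with a small sketch. This is not Druid code; `sumOfPerSegmentDistincts` is a hypothetical stand-in for how per-segment exact distinct counts get merged by summation at query time:

```java
import java.util.HashSet;
import java.util.List;

public class OvercountDemo {
    // Hypothetical merge: each segment counts its own distinct values,
    // and the query result is the sum of those local counts.
    static int sumOfPerSegmentDistincts(List<List<String>> segments) {
        int total = 0;
        for (List<String> segment : segments) {
            total += new HashSet<>(segment).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Data partitioned by visitor_id: each visitor lives in exactly one segment.
        List<List<String>> partitioned = List.of(
            List.of("v1", "v1", "v2"),
            List.of("v3"));
        // Unpartitioned: "v1" appears in both segments and is counted twice.
        List<List<String>> unpartitioned = List.of(
            List.of("v1", "v2"),
            List.of("v1", "v3"));
        System.out.println(sumOfPerSegmentDistincts(partitioned));   // 3 (exact)
        System.out.println(sumOfPerSegmentDistincts(unpartitioned)); // 4 (true answer is 3)
    }
}
```

This is why the README's step (1), partitioning by the dimension being counted, is a correctness requirement rather than an optimization.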
{
  byte[] fieldNameBytes = StringUtils.toUtf8(fieldName);
  return ByteBuffer.allocate(1 + fieldNameBytes.length).put(CACHE_TYPE_ID).put(fieldNameBytes).array();
}
i think this might need to encode the name and bitmap type too.
ok
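A hedged sketch of what a cache key covering the name and bitmap type might look like, per the review comment above. The `CACHE_TYPE_ID` value, the separator byte, and the method shape are illustrative assumptions, not the extension's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CacheKeyDemo {
    // Illustrative stand-in for the extension's CACHE_TYPE_ID; not the real value.
    static final byte CACHE_TYPE_ID = 0x10;

    // Cache key that also encodes the output name and the bitmap type, so
    // two aggregators differing only in bitmap type cannot share cached results.
    static byte[] cacheKey(String name, String fieldName, byte bitmapTypeId) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        byte[] fieldNameBytes = fieldName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(3 + nameBytes.length + fieldNameBytes.length)
            .put(CACHE_TYPE_ID)
            .put(bitmapTypeId)
            .put(nameBytes)
            .put((byte) 0xFF) // separator: keeps ("ab","c") and ("a","bc") distinct
            .put(fieldNameBytes)
            .array();
    }

    public static void main(String[] args) {
        byte[] roaring = cacheKey("uv", "visitor_id", (byte) 0);
        byte[] concise = cacheKey("uv", "visitor_id", (byte) 1);
        System.out.println(Arrays.equals(roaring, concise)); // false: the keys now differ
    }
}
```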
Do you mind adding an entry for this new extension under and instead of the README.md, create a new file under ? This will make it much easier to version extensions and list them on the Druid webpage
@binlijin also please squash commits @himanshug any more comments?
Also, wait for #2698 to be merged before making doc changes :)
@Override
public BufferAggregator factorizeBuffered(ColumnSelectorFactory columnFactory)
{
  return new DistinctCountBufferAggregator(makeDimensionSelector(columnFactory));
you might want to take care of the case where one of the segments does not have the fieldName column
see https://github.com/druid-io/druid/blob/master/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SketchAggregatorFactory.java#L77
done, thanks
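The suggested guard, following the pattern in the `SketchAggregatorFactory` code linked above, can be sketched as below. The interfaces and the stub aggregator are simplified stand-ins for Druid's real classes, not the actual API:

```java
// Simplified stand-ins for Druid's interfaces.
interface DimensionSelector {}

interface BufferAggregator {
    long get();
}

public class NullColumnDemo {
    // Stub for the real aggregator; the actual implementation tracks
    // distinct values in a bitmap.
    static class DistinctCountBufferAggregator implements BufferAggregator {
        DistinctCountBufferAggregator(DimensionSelector selector) {}
        public long get() { return 1L; } // placeholder value
    }

    // If the segment has no such column, the selector is null: return a
    // no-op aggregator that reports zero instead of failing at query time.
    static BufferAggregator factorizeBuffered(DimensionSelector selector) {
        if (selector == null) {
            return new BufferAggregator() {
                public long get() { return 0L; } // nothing to count
            };
        }
        return new DistinctCountBufferAggregator(selector);
    }

    public static void main(String[] args) {
        System.out.println(factorizeBuffered(null).get()); // 0
    }
}
```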
👍 after #2602 (comment) and #2602 (comment) are resolved and commits are squashed.
@fjy i find the link in docs/content/development/extensions.md is wrong, for example:
@binlijin the html file is generated when the docs are ported over to the webpage, you shouldn't need to fix anything
@binlijin for example, under docs/content, you can run
@fjy, ok, so i need to make a minor change to this patch.
@fjy @himanshug i think all comments have been resolved.
👍 for me
(2) Second use distinctCount to calculate the exact distinct count. Make sure segmentGranularity is an exact multiple of queryGranularity, or else the result will be wrong.
There are some limitations. When used with groupBy, the number of groupBy keys in each segment must not exceed maxIntermediateRows, or the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctCount will use a lot of memory and can take the JVM out of service.

This has been used in production.
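The maxIntermediateRows caveat can be illustrated with a deliberately simplified model. This is not how Druid's groupBy engine actually behaves internally, and every name here is hypothetical; it only shows why a hard cap on per-segment intermediate rows can silently lose groupBy keys:

```java
import java.util.HashMap;
import java.util.Map;

public class IntermediateRowsDemo {
    // Toy per-segment groupBy with a hard cap on intermediate rows: once
    // the buffer is full, previously unseen keys are silently dropped.
    static Map<String, Integer> groupBySegment(String[] rows, int maxIntermediateRows) {
        Map<String, Integer> intermediate = new HashMap<>();
        for (String key : rows) {
            if (!intermediate.containsKey(key) && intermediate.size() >= maxIntermediateRows) {
                continue; // buffer full: this key never reaches the result
            }
            intermediate.merge(key, 1, Integer::sum);
        }
        return intermediate;
    }

    public static void main(String[] args) {
        String[] rows = {"a", "b", "c", "a"};
        System.out.println(groupBySegment(rows, 10).size()); // 3: every key fits
        System.out.println(groupBySegment(rows, 2).size());  // 2: key "c" was lost
    }
}
```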
I don't think this line is needed.
i think it is useful to know this
Approach looks good. If you are wanting JUST counts, this can be sped up by implementing this functionality at the query level and simply returning the cardinality of the resulting filter.
Strongly suggest fixing #2602 (comment) but the rest is at the author's discretion. 👍
@binlijin implementation of
@binlijin When I used the distinctCount aggregation type, I found some bugs. For example, when the dimensions size is two, the result will be wrong. I don't know if it is a bug.
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First use #2570 to partition data by visitor_id.
(2) Second use distinctcount to calculate exact UV.
There are some limitations. When used with groupBy, the number of groupBy keys in each segment must not exceed maxIntermediateRows, or the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctcount will use a lot of memory and can take the JVM out of service.