implement special distinctcount #2602
Conversation
@binlijin how fast is the exact distinct count?
@binlijin can you add a README to the module? Something to help explain how to do the distinct count and also what to watch out for. If you guys use this in production, can you also note it is used in production in the README?
@fjy, the exact distinct count is faster than thetaSketch (which uses size 16384).
@binlijin there's some merge conflicts
@binlijin we'll need some docs on how to use this
@@ -101,6 +101,7 @@
<module>extensions/cloudfiles-extensions</module>
<module>extensions/datasketches</module>
<module>extensions/avro-extensions</module>
<module>extensions/distinctcount</module>
let's put this in extensions-contrib for now
done
Performance: the author posted "take time" measurements for several DistinctCount queries here; the figures were attachments and are not preserved in this extract.
@@ -0,0 +1,51 @@
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First use https://github.com/druid-io/druid/pull/2570 to partition data by a dimension, for example visitor_id.
instead of the 2 links above, can you please document the process of how this extension could be used.
also please indicate the assumption that a value has to be present in only one segment, or this might overcount.
ok
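The overcounting concern raised above can be made concrete with a small sketch. This is not Druid code; `sumOfPerSegmentDistincts` is a hypothetical stand-in for how per-segment exact distinct counts get merged by summation at query time:

```java
import java.util.HashSet;
import java.util.List;

public class OvercountDemo {
    // Hypothetical merge: each segment counts its own distinct values,
    // and the query result is the sum of those local counts.
    static int sumOfPerSegmentDistincts(List<List<String>> segments) {
        int total = 0;
        for (List<String> segment : segments) {
            total += new HashSet<>(segment).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Data partitioned by visitor_id: each visitor lives in exactly one segment.
        List<List<String>> partitioned = List.of(
            List.of("v1", "v1", "v2"),
            List.of("v3"));
        // Unpartitioned: "v1" appears in both segments and is counted twice.
        List<List<String>> unpartitioned = List.of(
            List.of("v1", "v2"),
            List.of("v1", "v3"));
        System.out.println(sumOfPerSegmentDistincts(partitioned));   // 3 (exact)
        System.out.println(sumOfPerSegmentDistincts(unpartitioned)); // 4 (true answer is 3)
    }
}
```

This is why the README's step (1), partitioning by the dimension being counted, is a correctness requirement rather than an optimization.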
{
  byte[] fieldNameBytes = StringUtils.toUtf8(fieldName);
  return ByteBuffer.allocate(1 + fieldNameBytes.length).put(CACHE_TYPE_ID).put(fieldNameBytes).array();
}
i think this might need to encode the name and bitmap type too.
ok
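A hedged sketch of what a cache key covering the name and bitmap type might look like, per the review comment above. The `CACHE_TYPE_ID` value, the separator byte, and the method shape are illustrative assumptions, not the extension's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CacheKeyDemo {
    // Illustrative stand-in for the extension's CACHE_TYPE_ID; not the real value.
    static final byte CACHE_TYPE_ID = 0x10;

    // Cache key that also encodes the output name and the bitmap type, so
    // two aggregators differing only in bitmap type cannot share cached results.
    static byte[] cacheKey(String name, String fieldName, byte bitmapTypeId) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        byte[] fieldNameBytes = fieldName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(3 + nameBytes.length + fieldNameBytes.length)
            .put(CACHE_TYPE_ID)
            .put(bitmapTypeId)
            .put(nameBytes)
            .put((byte) 0xFF) // separator: keeps ("ab","c") and ("a","bc") distinct
            .put(fieldNameBytes)
            .array();
    }

    public static void main(String[] args) {
        byte[] roaring = cacheKey("uv", "visitor_id", (byte) 0);
        byte[] concise = cacheKey("uv", "visitor_id", (byte) 1);
        System.out.println(Arrays.equals(roaring, concise)); // false: the keys now differ
    }
}
```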
Do you mind adding an entry for this new extension under and instead of the README.md, create a new file under ? This will make it much easier to version extensions and list them on the Druid webpage
@binlijin also please squash commits @himanshug any more comments?
Also, wait for #2698 to be merged before making doc changes :)
@Override
public BufferAggregator factorizeBuffered(ColumnSelectorFactory columnFactory)
{
  return new DistinctCountBufferAggregator(makeDimensionSelector(columnFactory));
you might want to take care of the case where one of the segments does not have the fieldName column
see https://github.com/druid-io/druid/blob/master/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SketchAggregatorFactory.java#L77
done, thanks
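The suggested guard, following the pattern in the `SketchAggregatorFactory` code linked above, can be sketched as below. The interfaces and the stub aggregator are simplified stand-ins for Druid's real classes, not the actual API:

```java
// Simplified stand-ins for Druid's interfaces.
interface DimensionSelector {}

interface BufferAggregator {
    long get();
}

public class NullColumnDemo {
    // Stub for the real aggregator; the actual implementation tracks
    // distinct values in a bitmap.
    static class DistinctCountBufferAggregator implements BufferAggregator {
        DistinctCountBufferAggregator(DimensionSelector selector) {}
        public long get() { return 1L; } // placeholder value
    }

    // If the segment has no such column, the selector is null: return a
    // no-op aggregator that reports zero instead of failing at query time.
    static BufferAggregator factorizeBuffered(DimensionSelector selector) {
        if (selector == null) {
            return new BufferAggregator() {
                public long get() { return 0L; } // nothing to count
            };
        }
        return new DistinctCountBufferAggregator(selector);
    }

    public static void main(String[] args) {
        System.out.println(factorizeBuffered(null).get()); // 0
    }
}
```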
👍 after #2602 (comment) and #2602 (comment) are resolved and commits are squashed.
@fjy i find the link in docs/content/development/extensions.md is wrong, for example:
@binlijin the html file is generated when the docs are ported over to the webpage, you shouldn't need to fix anything
@binlijin for example, under docs/content, you can run
@fjy, ok, so i need to make a minor change to this patch.
@fjy @himanshug i think all comments have been resolved.
👍 for me
(2) Second use distinctCount to calculate the exact distinct count. Make sure segmentGranularity is an exact multiple of queryGranularity, or else the result will be wrong.
There are some limitations. When used with groupBy, the number of groupBy keys in each segment must not exceed maxIntermediateRows, or the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctCount will use a lot of memory and can take the JVM out of service.

This has been used in production.
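The maxIntermediateRows caveat can be illustrated with a deliberately simplified model. This is not how Druid's groupBy engine actually behaves internally, and every name here is hypothetical; it only shows why a hard cap on per-segment intermediate rows can silently lose groupBy keys:

```java
import java.util.HashMap;
import java.util.Map;

public class IntermediateRowsDemo {
    // Toy per-segment groupBy with a hard cap on intermediate rows: once
    // the buffer is full, previously unseen keys are silently dropped.
    static Map<String, Integer> groupBySegment(String[] rows, int maxIntermediateRows) {
        Map<String, Integer> intermediate = new HashMap<>();
        for (String key : rows) {
            if (!intermediate.containsKey(key) && intermediate.size() >= maxIntermediateRows) {
                continue; // buffer full: this key never reaches the result
            }
            intermediate.merge(key, 1, Integer::sum);
        }
        return intermediate;
    }

    public static void main(String[] args) {
        String[] rows = {"a", "b", "c", "a"};
        System.out.println(groupBySegment(rows, 10).size()); // 3: every key fits
        System.out.println(groupBySegment(rows, 2).size());  // 2: key "c" was lost
    }
}
```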
I don't think this line is needed.
i think it is useful to know this
Approach looks good. If you are wanting JUST counts, this can be sped up by implementing this functionality at the query level and simply returning the cardinality of the resulting filter.
Strongly suggest fixing #2602 (comment) but the rest is at the author's discretion. 👍
@binlijin implementation of
@binlijin When I used the distinctCount aggregation type, I found some bugs. For example, when the dimensions size is two, the result will be wrong. I don't know if it is a bug.
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First use #2570 to partition data by visitor_id.
(2) Second use distinctcount to calculate exact UV.
There are some limitations. When used with groupBy, the number of groupBy keys in each segment must not exceed maxIntermediateRows, or the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctcount will use a lot of memory and can take the JVM out of service.