implement special distinctcount #2602

Merged
merged 1 commit into from
Mar 29, 2016

Conversation

binlijin
Contributor

@binlijin binlijin commented Mar 7, 2016

Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First, use #2570 to partition the data by visitor_id.
(2) Second, use distinctcount to calculate the exact UV.
There are some limitations. When used with groupBy, the number of groupBy keys should not exceed maxIntermediateRows in any segment; if it does, the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctcount will use a lot of memory and can take the JVM out of service.
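
A minimal sketch of why the two steps above can give an exact count, assuming a hash of visitor_id stands in for the partitioning from #2570. The class, the `segmentFor` helper, and the sample data are hypothetical and are not the extension's code. Once every occurrence of an id lands in the same segment, the per-segment distinct counts can simply be added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartitionedDistinctCount {
    // Hypothetical stand-in for partitioning by visitor_id: every occurrence
    // of the same id is routed to the same segment.
    public static int segmentFor(String visitorId, int numSegments) {
        return Math.floorMod(visitorId.hashCode(), numSegments);
    }

    // Adds up per-segment distinct counts. The sum is exact only because
    // no id can appear in more than one segment.
    public static long exactDistinct(List<String> visitorIds, int numSegments) {
        List<Set<String>> segments = new ArrayList<>();
        for (int i = 0; i < numSegments; i++) {
            segments.add(new HashSet<>());
        }
        for (String id : visitorIds) {
            segments.get(segmentFor(id, numSegments)).add(id);
        }
        long total = 0;
        for (Set<String> segment : segments) {
            total += segment.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> visits = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(exactDistinct(visits, 3)); // prints 3
    }
}
```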

@binlijin binlijin added this to the 0.9.2 milestone Mar 7, 2016
@binlijin binlijin closed this Mar 7, 2016
@binlijin binlijin reopened this Mar 7, 2016
@fjy
Contributor

fjy commented Mar 7, 2016

@binlijin how fast is the exact distinct count?

@fjy
Contributor

fjy commented Mar 7, 2016

@binlijin can you add a README to the module? Something to help explain how to do the distinct count and also what to watch out for. If you guys use this in production, can you also note it is used in production in the README?

@binlijin
Contributor Author

binlijin commented Mar 8, 2016

@fjy, the exact distinct count is faster than thetaSketch (which uses size 16384).
We have run performance tests in our perf cluster but have not used it in production yet; we will use it in production later, once we have finished all the tests and prepared all the data.
This PR is up for early review in case anyone is interested in it.

@fjy
Contributor

fjy commented Mar 14, 2016

@binlijin there are some merge conflicts

@fjy
Contributor

fjy commented Mar 14, 2016

@binlijin we'll need some docs on how to use this

@@ -101,6 +101,7 @@
<module>extensions/cloudfiles-extensions</module>
<module>extensions/datasketches</module>
<module>extensions/avro-extensions</module>
<module>extensions/distinctcount</module>
Contributor

let's put this in extensions-contrib for now

Contributor Author

done

@binlijin binlijin closed this Mar 18, 2016
@binlijin binlijin reopened this Mar 18, 2016
@binlijin binlijin changed the title from "implement distinctcount" to "implement special distinctcount" Mar 18, 2016
@binlijin
Contributor Author

Performance:
Sketch query

{
  "queryType":"timeseries",
  "dataSource":"someDataSource",
  "granularity":"day",
  "intervals":["2016-03-06T00:00:00/2016-03-06T23:59:59"],
  "aggregations":[
    {
      "fieldName":"visitor_id_sketch",
      "name":"uv",
      "type":"thetaSketch"
    }
  ],
  "context" :  {
    "useCache":"false",
    "populateCache":"false"
  }
}

Time taken:

real    0m3.286s
user    0m0.001s
sys     0m0.002s

@binlijin
Contributor Author

DistinctCount query:
Use java bitmap

{
  "queryType":"timeseries",
  "dataSource":"someDataSource",
  "granularity":"day",
  "intervals":["2016-03-06T00:00:00/2016-03-06T23:59:59"],
  "aggregations":[
    {
      "fieldName":"visitor_id",
      "name":"uv",
      "bitmap": {"type" : "java"},
      "type":"distinctCount"
    }
  ],
  "context" :  {
    "useCache":"false",
    "populateCache":"false"
  }
}

Time taken:

real    0m0.973s
user    0m0.001s
sys     0m0.002s

@binlijin
Contributor Author

DistinctCount query:
Use roaring bitmap
"bitmap": {"type" : "roaring"},
Time taken:

real    0m2.348s
user    0m0.001s
sys     0m0.002s

@binlijin
Contributor Author

DistinctCount query:
Use concise bitmap
"bitmap": {"type" : "concise"},
Time taken:

real    3m40.176s
user    0m0.000s
sys     0m0.003s

@binlijin binlijin closed this Mar 18, 2016
@binlijin binlijin reopened this Mar 18, 2016
@binlijin binlijin closed this Mar 18, 2016
@binlijin binlijin reopened this Mar 18, 2016
@binlijin binlijin closed this Mar 18, 2016
@binlijin binlijin reopened this Mar 18, 2016
@@ -0,0 +1,51 @@
Discuss at https://groups.google.com/forum/#!topic/druid-development/GXRpXfBzfJs
(1) First, use https://github.com/druid-io/druid/pull/2570 to partition the data by a dimension, for example visitor_id.
Contributor

Instead of the two links above, can you please document how this extension should be used?

Contributor

Also, please state the assumption that each value must be present in only one segment; otherwise this might overcount.

Contributor Author

ok
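
To illustrate the assumption raised above, here is a small sketch comparing a sum of per-segment distinct counts with a true distinct count when one value appears in two segments. The class name and the data are hypothetical and are not taken from the extension.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OvercountDemo {
    // Sums per-segment distinct counts, the way a count-only merge would.
    public static long sumOfSegmentCounts(List<List<String>> segments) {
        long total = 0;
        for (List<String> segment : segments) {
            total += new HashSet<>(segment).size();
        }
        return total;
    }

    // Ground truth: distinct count over the union of all segments.
    public static long trueDistinct(List<List<String>> segments) {
        Set<String> all = new HashSet<>();
        for (List<String> segment : segments) {
            all.addAll(segment);
        }
        return all.size();
    }

    public static void main(String[] args) {
        // "a" appears in both segments, violating the one-segment assumption.
        List<List<String>> overlapping = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "c"));
        System.out.println(sumOfSegmentCounts(overlapping)); // prints 4
        System.out.println(trueDistinct(overlapping));       // prints 3
    }
}
```

When the segments hold disjoint values, the two methods return the same number, which is the case the partitioning step is meant to guarantee.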

{
  byte[] fieldNameBytes = StringUtils.toUtf8(fieldName);
  return ByteBuffer.allocate(1 + fieldNameBytes.length).put(CACHE_TYPE_ID).put(fieldNameBytes).array();
}
Contributor

I think this might need to encode the name and bitmap type too.

Contributor Author

ok
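
One way to follow this suggestion might look like the sketch below. It is not the code that was merged: the `CACHE_TYPE_ID` value, the separator byte, and the method signature are assumptions made for illustration. The point is only that two aggregators differing in name or bitmap type should not share a cache key.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CacheKeySketch {
    // Hypothetical type id; the real constant is not shown in the thread.
    public static final byte CACHE_TYPE_ID = 0x20;
    // 0xFF never occurs in valid UTF-8, so it separates the parts
    // unambiguously: ("ab", "c") and ("a", "bc") produce different keys.
    public static final byte SEPARATOR = (byte) 0xFF;

    public static byte[] cacheKey(String name, String fieldName, String bitmapType) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        byte[] fieldNameBytes = fieldName.getBytes(StandardCharsets.UTF_8);
        byte[] bitmapTypeBytes = bitmapType.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer
            .allocate(3 + nameBytes.length + fieldNameBytes.length + bitmapTypeBytes.length)
            .put(CACHE_TYPE_ID)
            .put(nameBytes)
            .put(SEPARATOR)
            .put(fieldNameBytes)
            .put(SEPARATOR)
            .put(bitmapTypeBytes)
            .array();
    }

    public static void main(String[] args) {
        byte[] javaKey = cacheKey("uv", "visitor_id", "java");
        byte[] roaringKey = cacheKey("uv", "visitor_id", "roaring");
        System.out.println(Arrays.equals(javaKey, roaringKey)); // prints false
    }
}
```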

@fjy
Contributor

fjy commented Mar 23, 2016

@binlijin

Do you mind adding an entry for this new extension under
docs/content/development/extensions.md

and instead of the README.md, create a new file under
docs/content/development/extensions-contrib

?

This will make it much easier to version extensions and list them on the Druid webpage

@fjy
Contributor

fjy commented Mar 23, 2016

@binlijin also please squash commits

@himanshug any more comments?

@fjy
Contributor

fjy commented Mar 23, 2016

Also, wait for #2698 to be merged before making doc changes :)

@binlijin
Contributor Author

@fjy, ok, I will change it and squash the commits after #2698 is merged.

@binlijin binlijin modified the milestones: 0.9.1, 0.9.2 Mar 23, 2016
@fjy
Contributor

fjy commented Mar 23, 2016

@binlijin merged #2698

@Override
public BufferAggregator factorizeBuffered(ColumnSelectorFactory columnFactory)
{
  return new DistinctCountBufferAggregator(makeDimensionSelector(columnFactory));
Contributor

Contributor Author

done, thanks

@himanshug
Contributor

👍 after #2602 (comment) and #2602 (comment) are resolved and commits are squashed.

@binlijin binlijin closed this Mar 24, 2016
@binlijin binlijin reopened this Mar 24, 2016
@binlijin
Contributor Author

@fjy I think the links in docs/content/development/extensions.md are wrong. For example:
"|druid-rocketmq|RocketMQ firehose.| [link] (../development/extensions-contrib/rocketmq.html)|"
There is no rocketmq.html in ../development/extensions-contrib/, only rocketmq.md. A separate PR should be opened to fix this.

@fjy
Contributor

fjy commented Mar 24, 2016

@binlijin the html file is generated when the docs are ported over to the webpage, so you shouldn't need to fix anything

@fjy
Contributor

fjy commented Mar 24, 2016

@binlijin for example, under docs/content, you can run jekyll serve to render the docs and the html page should be rendered

@binlijin
Contributor Author

@fjy, ok, so I need to make a minor change to this patch.

@binlijin binlijin closed this Mar 24, 2016
@binlijin binlijin reopened this Mar 24, 2016
@binlijin
Contributor Author

@fjy @himanshug I think all comments have been resolved.

@himanshug
Contributor

👍 for me

(2) Second, use distinctCount to calculate the exact distinct count. Make sure queryGranularity is divided exactly by segmentGranularity, or else the result will be wrong.
There are some limitations. When used with groupBy, the number of groupBy keys should not exceed maxIntermediateRows in any segment; if it does, the result will be wrong. When used with topN, numValuesPerPass should not be too large; if it is, distinctCount will use a lot of memory and can take the JVM out of service.

This has been used in production.
Contributor

I don't think this line is needed.

Contributor

i think it is useful to know this

@drcrallen
Contributor

Approach looks good.

If you are wanting JUST counts, this can be sped up by implementing this functionality at a query level, and simply returning the cardinality of the resulting filter.

@drcrallen
Contributor

Strongly suggest fixing #2602 (comment) but the rest is at the author's discretion. 👍

@fjy fjy merged commit 62c1dc7 into apache:master Mar 29, 2016
@leventov
Member

leventov commented Aug 9, 2017

@binlijin implementation of DistinctCountAggregatorFactory.combine() doesn't really combine distinct counts

@PayneZx

PayneZx commented Jan 3, 2020

@binlijin When I used the distinctCount aggregation type, I found some problems. For example, when there are two dimensions, the result is wrong. I don't know if it is a bug.
