[Proposal]general exactly count distinct support in Druid #6716

Open
pzhdfy opened this issue Dec 6, 2018 · 4 comments

@pzhdfy
Contributor

@pzhdfy pzhdfy commented Dec 6, 2018

The PR against the master branch is here: #7594
The PR against the 0.12.x branch is here: #7582

Motivation

In many cases we need an exact count distinct, for example for charging (billing).
Druid currently does not handle this well.

Currently there are three methods that work on dimensions:

  1. cardinality aggregator: uses HLL, so it is not exact.
  2. nested groupBy: exact, but needs huge resources.
  3. http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html : exact, but with many limits:
    1) only one dimension
    2) cannot be used across intervals
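
To make the "across intervals" limit concrete, here is a minimal sketch in plain Python (not Druid code). The data and function names are hypothetical. It only shows that exact per-interval counts cannot be added together, because a value seen in several intervals is counted more than once.

```python
# Hypothetical data: user IDs seen in two query intervals (e.g. two days).
day1 = ["alice", "bob", "carol"]
day2 = ["bob", "carol", "dave"]


def per_interval_counts(*intervals):
    """Exact distinct count inside each interval, computed independently."""
    return [len(set(values)) for values in intervals]


def naive_total(*intervals):
    """Summing per-interval counts double-counts values seen in several intervals."""
    return sum(per_interval_counts(*intervals))


def exact_total(*intervals):
    """An exact total needs the union of the underlying value sets."""
    seen = set()
    for values in intervals:
        seen.update(values)
    return len(seen)


print(naive_total(day1, day2))  # 6
print(exact_total(day1, day2))  # 4
```

An exact result therefore needs a mergeable representation of the values themselves, not just a count per interval.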

And one method that works on metrics:
1. HyperUnique/DataSketches aggregator: uses HLL; faster than the cardinality aggregator because it is pre-calculated at ingestion, but not exact.

We need a more general method that supports:
1. more than one dimension
2. queries across all intervals

Design target
1. Because exact count distinct is rarely needed in real time, we will first design and implement it for offline ingestion.
2. It is implemented only for metrics, so that we can pre-calculate at ingestion. If a count distinct over a dimension is needed, a metric can be added for it instead.
3. It will be a pluggable extension.
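
As a rough illustration of target 2, the sketch below pre-aggregates a set-valued metric at ingestion and merges it at query time. It is plain Python, not Druid code; the rows, the `roll_up` and `count_distinct` helpers, and the use of a Python `set` as the metric are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw events: (hour bucket, country dimension, user_id to count).
rows = [
    ("2018-12-06T00", "US", "u1"),
    ("2018-12-06T00", "US", "u2"),
    ("2018-12-06T00", "US", "u1"),
    ("2018-12-06T00", "CN", "u3"),
    ("2018-12-06T01", "US", "u2"),
]


def roll_up(events):
    """Ingest-time rollup: one stored row per (hour, country) with a set-valued metric."""
    rolled = defaultdict(set)
    for hour, country, user_id in events:
        rolled[(hour, country)].add(user_id)
    return dict(rolled)


def count_distinct(rolled, country):
    """Query time: merge the pre-aggregated metric over every matching row."""
    merged = set()
    for (_, row_country), users in rolled.items():
        if row_country == country:
            merged |= users
    return len(merged)


segment = roll_up(rows)
print(len(segment))                   # 3 rolled-up rows instead of 5 raw rows
print(count_distinct(segment, "US"))  # 2
```

The query only touches the rolled-up rows, which is why the work can be moved to ingestion time.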

The detailed design is divided into two parts.
1. The first is how to convert each distinct dimension value (string/double/long) to a distinct, globally unique int ID. The same dimension value in different intervals is converted to the same ID, so rollup can be done across all intervals. This requires a global dictionary, which is a trieTreeDict per column.
Because the dictionary may be large, we put it on HDFS and build it with MapReduce.
Because two or more tasks may build the same dictionary, we need a ZooKeeper distributed lock to keep it consistent.
For the trieTreeDict, we can simply reuse the data structure from Apache Kylin.
2. The second is how to store the unique int IDs in Druid metrics with the least storage, which requires a new metric type backed by a bitmap.
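
The two parts can be sketched together in a few lines of plain Python. This is a toy model under stated assumptions, not the extension's implementation: `GlobalDict` is a hypothetical in-memory stand-in for the trie-based dictionary on HDFS, a Python `int` stands in for a compressed bitmap, and the MapReduce build and the ZooKeeper lock are left out because everything runs in one process.

```python
class GlobalDict:
    """Toy stand-in for the per-column global dictionary (value -> int ID)."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # Assign the next free int ID on first sight; reuse it afterwards,
        # so the same value gets the same ID in every interval.
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]


def to_bitmap(ids):
    """Pack int IDs into a bitmap; bit i is set when ID i is present."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def cardinality(bitmap):
    """Number of set bits, i.e. the exact distinct count."""
    return bin(bitmap).count("1")


user_dict = GlobalDict()
# Hypothetical values ingested in two different intervals.
day1 = to_bitmap(user_dict.encode(v) for v in ["alice", "bob", "carol"])
day2 = to_bitmap(user_dict.encode(v) for v in ["bob", "carol", "dave"])

print(cardinality(day1), cardinality(day2))  # 3 3
print(cardinality(day1 | day2))              # 4
```

Because the IDs are global, merging intervals is a bitwise OR of their bitmaps, and the result stays exact.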

The overall design looks like this:
image

The build process:
image

Ingestion and query schema:
image

@kaijianding
Contributor

@kaijianding kaijianding commented Dec 6, 2018

@pzhdfy it sounds different from http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html which is only for batch ingestion and only for one dimension.
Could you explain more about how count distinct is done across segments?
Is the global unique ID global across all segments, or only within a period such as one day (the period covered by one offline batch ingestion)? Is the global unique ID generated on a central server such as the coordinator, or by an algorithm that ensures IDs are always different across segments?

@pzhdfy pzhdfy changed the title [Proposal] exactly count distinct support in Druid [Proposal]more general exactly count distinct support in Druid Dec 6, 2018
@pzhdfy
Contributor Author

@pzhdfy pzhdfy commented Dec 6, 2018

> @pzhdfy it sounds different from http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html which is only for batch ingestion and only for one dimension.
> Could you explain more about how count distinct is done across segments?
> Is the global unique ID global across all segments, or only within a period such as one day (the period covered by one offline batch ingestion)? Is the global unique ID generated on a central server such as the coordinator, or by an algorithm that ensures IDs are always different across segments?

I have updated the description; we will add more details later.

@pzhdfy pzhdfy changed the title [Proposal]more general exactly count distinct support in Druid [Proposal]general exactly count distinct support in Druid Apr 30, 2019
@wangxiaobaidu11
Contributor

@wangxiaobaidu11 wangxiaobaidu11 commented Oct 25, 2019

@pzhdfy I am using the 'unique' exact deduplication method with a native groupBy query, but the result is not accurate. groupBy does not support unique, right?

@xiaoDjun

@xiaoDjun xiaoDjun commented Aug 25, 2021

@pzhdfy hi, does SQL support the uniq query now?
For example, theta sketches can be used in SQL with APPROX_COUNT_DISTINCT_DS_THETA.
