auto sharding strategy for theta sketch #9437

Open
patelprateek opened this issue Sep 20, 2022 · 3 comments

Comments

@patelprateek

I was going through PR #5316.
Can you please point me to how and where this is implemented? How is the high-cardinality threshold defined?

I am running into cases where the sets have very different cardinalities and the error is high, so I would like some insight into how to tune the theta sketch parameters during my indexing phase. What is a reasonable threshold for deciding that a set is high cardinality?

@Jackie-Jiang
Contributor

Here is the doc for the function: https://docs.pinot.apache.org/configuration-reference/functions/distinctcountthetasketch
You may also learn more about theta sketch here: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

There is one parameter that can be passed to the function: nominalEntries. It defaults to 4096; you can try a higher value to get better accuracy, at the cost of worse performance.
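As a rough guide to what a higher nominalEntries buys, the sketch below uses the common rule of thumb that the relative standard error (RSE) of a single sketch's cardinality estimate is about 1/sqrt(k) for k nominal entries. That rule of thumb and the helper name `approx_rse` are assumptions, not something stated in this thread, and the rule only covers a plain distinct count; set intersections can be considerably less accurate.

```python
import math


def approx_rse(nominal_entries: int) -> float:
    """Rule-of-thumb RSE (assumed ~1/sqrt(k)) for a sketch in estimation mode."""
    return 1.0 / math.sqrt(nominal_entries)


# Compare the default mentioned above (4096) with two larger settings.
for k in (4096, 16384, 65536):
    print(f"nominalEntries={k:>6}  approx RSE={approx_rse(k):.2%}")
```

Under this approximation, quadrupling nominalEntries halves the error, while the sketch size grows fourfold.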

@patelprateek
Author

Maybe my question wasn't clear. I understand what theta sketches are; what I am trying to understand is how auto sharding is built for high-cardinality segments when constructing a theta sketch. What is considered high cardinality, and what thresholds are used?
IIUC, intersection(theta_sketch(a), theta_sketch(b)) can have a high error rate when the Jaccard similarity is low or when the cardinalities of sets A and B differ greatly, so you also shard the bigger set to make it smaller. I am trying to better understand how this sharding is implemented.
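The effect described here can be reproduced with a toy simulation. This is not the DataSketches or Pinot code: `h`, `sketch` and `intersect` are hypothetical helpers that implement a simplified KMV-style theta sketch, and the set sizes and k are arbitrary. The point it illustrates is that the intersection inherits the smaller theta, so when one set is far larger than the other, very few hashes survive to support the estimate.

```python
import hashlib


def h(item: str) -> float:
    """Map an item to a deterministic pseudo-uniform value in [0, 1)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def sketch(items, k):
    """Return (theta, retained): the k smallest hashes, theta = next hash."""
    hashes = sorted({h(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # exact mode: nothing was discarded
    return hashes[k], set(hashes[:k])  # estimation mode


def intersect(a, b):
    """Intersect two sketches; return (estimate, number of surviving hashes)."""
    theta = min(a[0], b[0])
    common = {v for v in a[1] & b[1] if v < theta}
    return len(common) / theta, len(common)


k = 1024
big = [f"user-{i}" for i in range(100_000)]
small = [f"user-{i}" for i in range(2_000)]        # subset of big
peer = [f"user-{i}" for i in range(1_000, 3_000)]  # same size as small

# Skewed sizes: true intersection is 2000, but theta is driven down by `big`.
skewed_est, skewed_n = intersect(sketch(big, k), sketch(small, k))
# Balanced sizes: true intersection is 1000, theta stays comparatively large.
balanced_est, balanced_n = intersect(sketch(peer, k), sketch(small, k))

print(f"skewed:   estimate={skewed_est:.0f} from {skewed_n} surviving hashes")
print(f"balanced: estimate={balanced_est:.0f} from {balanced_n} surviving hashes")
```

The skewed case ends up estimating from only a handful of surviving hashes, which is why its relative error is much larger than the nominal error of either input sketch.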

@Jackie-Jiang
Contributor

The implementation for this support is in the DistinctCountThetaSketchAggregationFunction class.
With the current implementation, we don't shard the set. I think this could be a good optimization, but we need some research to decide the cardinality threshold at which to shard the set. We can also consider exposing this threshold as a parameter to the function.
Do you want to help contribute this feature?
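For anyone picking this up, one possible shape of the idea is sketched below. It is purely hypothetical and not how Pinot works today (the set is not sharded, as noted above): `shard_of`, `partition` and `sharded_intersection` are made-up names, the sketch itself is a simplified KMV-style stand-in for a real theta sketch, and the shard count of 8 is arbitrary. Both sets are split by an independent hash of the item, each shard gets its own sketch, and the per-shard intersection estimates are summed.

```python
import hashlib


def _digest(text: str) -> int:
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")


def h(item: str) -> float:
    """Sketch hash: deterministic pseudo-uniform value in [0, 1)."""
    return (_digest(item) >> 11) / 2**53


def shard_of(item: str, num_shards: int) -> int:
    """Shard routing uses a differently salted hash than the sketch hash."""
    return _digest("shard:" + item) % num_shards


def sketch(items, k):
    hashes = sorted({h(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # exact mode
    return hashes[k], set(hashes[:k])  # estimation mode


def intersect(a, b):
    theta = min(a[0], b[0])
    common = {v for v in a[1] & b[1] if v < theta}
    return len(common) / theta, len(common)


def partition(items, num_shards):
    shards = [[] for _ in range(num_shards)]
    for x in items:
        shards[shard_of(x, num_shards)].append(x)
    return shards


def sharded_intersection(a_items, b_items, k, num_shards):
    """Sum per-shard intersection estimates; shards are disjoint by item."""
    total_est, total_n = 0.0, 0
    for a_s, b_s in zip(partition(a_items, num_shards), partition(b_items, num_shards)):
        est, n = intersect(sketch(a_s, k), sketch(b_s, k))
        total_est += est
        total_n += n
    return total_est, total_n


k = 1024
big = [f"user-{i}" for i in range(100_000)]
small = [f"user-{i}" for i in range(2_000)]  # true intersection is 2000

unsharded_est, unsharded_n = intersect(sketch(big, k), sketch(small, k))
sharded_est, sharded_n = sharded_intersection(big, small, k, num_shards=8)
print(f"unsharded: {unsharded_est:.0f} from {unsharded_n} surviving hashes")
print(f"sharded:   {sharded_est:.0f} from {sharded_n} surviving hashes")
```

In this toy setup, sharding keeps more hashes alive for the intersection, but it also stores num_shards times as many entries in total, so whether it beats simply raising nominalEntries, and at what cardinality it should kick in, would be part of the research mentioned above.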
