auto sharding strategy for theta sketch #9437

patelprateek · 2022-09-20T09:50:59Z

I was going through PR #5316.
Can you please point me to how or where this is implemented? How do we define the high-cardinality threshold?

I am running into issues where different sets have very different cardinalities and the error is high, so I wanted some insight into how to tune the theta params during my indexing phase. What is a reasonable theta threshold for deciding that a set is high cardinality?

Jackie-Jiang · 2022-09-21T20:08:50Z

Here is the doc for the function: https://docs.pinot.apache.org/configuration-reference/functions/distinctcountthetasketch
You may also learn more about theta sketch here: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

There is one parameter that can be passed to the function: nominalEntries. By default it is set to 4096, and you may try a higher value to get better accuracy (at the cost of performance).
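
To give a rough feel for the accuracy side of that trade-off, here is a small sketch. It assumes the common rule of thumb that a theta sketch's relative standard error is approximately 1/sqrt(nominalEntries) for a single sketch in estimation mode. The helper name `approx_rse` is made up for illustration and is not part of Pinot or DataSketches, and the exact error bounds of the real library may differ slightly.

```python
import math

def approx_rse(nominal_entries):
    """Rule-of-thumb relative standard error for one theta sketch.

    Assumption: RSE is roughly 1 / sqrt(nominalEntries). This is an
    approximation for illustration, not the library's exact bound.
    """
    return 1.0 / math.sqrt(nominal_entries)

# Compare the default (4096) with two larger settings.
for k in (4096, 16384, 65536):
    print(f"nominalEntries={k}: ~{approx_rse(k):.2%} relative standard error")
```

Under this approximation, quadrupling nominalEntries only halves the error, while the sketch size grows about fourfold.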

patelprateek · 2022-09-21T23:17:48Z

Maybe my question wasn't clear. I understand what theta sketches are, but I am trying to understand how you build auto sharding for high-cardinality segments when constructing the theta sketch: what is considered high cardinality, and what are the thresholds?
IIUC, intersection(theta_sketch(a), theta_sketch(b)) can have a high error rate when the Jaccard similarity is low or the difference between the cardinalities of sets A and B is big, so you also shard the bigger set to make it smaller. I am trying to better understand how this sharding is implemented.
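
To illustrate the concern about intersecting a big set with a small one, here is a toy reproduction. It is a minimal KMV-style sketch written from scratch in plain Python, not the DataSketches or Pinot implementation. All names (`hash01`, `build_sketch`, `estimate`, `intersect`) and the set sizes are hypothetical choices for the example.

```python
import hashlib

def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def build_sketch(items, k):
    """Return (theta, retained): the retained hashes are all below theta."""
    hashes = sorted({hash01(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # small set: kept exactly, theta stays 1.0
    # The (k+1)-th smallest hash becomes theta; the k hashes below it are kept.
    return hashes[k], set(hashes[:k])

def estimate(theta, retained):
    """Estimated cardinality: retained count scaled up by 1/theta."""
    return len(retained) / theta

def intersect(a, b):
    """Intersect two sketches: take the smaller theta, keep common hashes below it."""
    theta = min(a[0], b[0])
    retained = {h for h in a[0:2][1] & b[1] if h < theta}
    return theta, retained

k = 4096
big = build_sketch(range(200_000), k)             # 200,000 distinct items
small = build_sketch(range(199_000, 200_500), k)  # 1,500 items, 1,000 shared with big
both = intersect(big, small)

print("big estimate:", estimate(*big))
print("small estimate:", estimate(*small))
print("intersection estimate (true value 1000):", estimate(*both))
print("hashes backing the intersection estimate:", len(both[1]))
```

The intersection inherits the big set's theta (roughly 4096 / 200,000, about 0.02), so only around 20 of the 1,000 shared items are expected to survive below it. The intersection estimate therefore rests on a couple of dozen samples instead of 4096, which is why its relative error is much larger than that of either input sketch.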

Jackie-Jiang · 2022-09-23T18:48:11Z

The implementation of this function is in the DistinctCountThetaSketchAggregationFunction class.
With the current implementation, we don't shard the set. I think this can be a good optimization, and we need some research to decide a cardinality threshold to shard the set. We can also consider providing this threshold as a parameter to the function.
Do you want to help contribute this feature?
