For DISTINCT_COUNT, automatically convert Set to HyperLogLog when cardinality is too high#8074
For DISTINCT_COUNT, automatically convert Set to HyperLogLog when cardinality is too high#8074Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Conversation
…dinality is too high
richardstartin
left a comment
There was a problem hiding this comment.
This changes the semantics of the function so that it is no longer a distinct count. I think it would be better to raise an error if the threshold is breached, which would educate the user to use distinctcounthll instead (this would also be documented clearly).
Agreed. I also see some users actually reports the error is about 3% when the cardinality is around 200M comparing |
Good point. We can definitely reject the query if it goes over the threshold, and make the threshold configurable. |
|
can we list out all the options we have
The reason why I don't prefer second option where we return error and ask users to use distinctcounthll
My preference is to go with option 1 but start with -1 as the default value which makes the feature off by default but have the ability to override it using per query option or server config. This might still mean that they will see OOM when the distinct set does not fit in memory but they have the option to fix it via server config (no need of code change) or queryoption (needs change in app code) |
|
I think there is a good argument for introducing a new function which degrades to hll at a threshold (which the user could even supply). My expectation of distinct count is that it is exact if it produces a result, degrading to an approximate result without being explicit about it makes query results hard to reason about. Degradation to an approximate result at an arbitrary threshold is arguably worse than always producing an approximate result. |
|
The basic problem here is needing to choose between being wrong and not producing a result when the cardinality is high, given the definition of distinct count. In my opinion, OOM risk should be mitigated explicitly by resource controls/circuit breakers and not by relaxing semantics. If this PR is merged as is, it’s a statement that producing a result is prioritised over being correct, but one of those options has to be chosen. |
|
This PR should be good to go if the default threshold is changed to infinity or -1 right? That keeps the existing behavior and does not change anything but it allows the both operators and users to control the behavior. |
|
-1 on changing the default behavior. As @richardstartin mentioned, we should not change the default behavior because the clients do expect the exact result when they run Meanwhile, if we keep the default behavior as it is, the user now needs to be educated enough to know how to configure hll threshold |
|
@snleee what is your take on the behavior I suggested?
I also wanted to call out again that asking users to use distinctcounthll is not a feasible solution because distinctcounthll will always use hyperloglog and return approximate results. What this PR is trying to add is a way to automatically switch to hyperloglog if the distinct set exceeds a threshold and this is dependent on the data and query. |
|
How about introducing a new aggregation function which automatically degrade to hll, then we can add a config on the broker to rewrite
|
Good idea 👍 |
|
@kishoreg Oh, so we want the new behavior where we return the correct result until the threshold and return the approximate after threshold. In that case, using @Jackie-Jiang How will you deal with the new input |
@snleee If user doesn't explicitly call the new function, distinctcount will be rewritten to the default one (log2m=12, conversionThreshold=10k) |
|
@richardstartin @xiangfu0 @kishoreg @snleee Per the discussion, I've moved the logic into a different function |
Description
For
DISTINCT_COUNTandDISTINCT_COUNT_MVaggregation function, currently we useSetto store all the values, which can cause memory issues and potentially exhaust the memory for Servers or Brokers. This PR adds the support to automatically convert theSettoHyperLogLogif the set size grows too big to protect the servers. This conversion only applies to aggregation only queries, but not the group-by queries.By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:
hllLog2m: log2m of the converted HyperLogLog (default 12)hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)Example query:
SELECT DISTINCTCOUNT(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTableRelease Notes
Add second argument (literal) to
DISTINCT_COUNTandDISTINCT_COUNT_MVaggregation function for optional parameters:hllLog2m: log2m of the converted HyperLogLog (default 12)hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)For
DISTINCT_COUNTandDISTINCT_COUNT_MVaggregation only queries, if the result is over 100K, the query will useHyperLogLogand return approximate result by default. To get back to the 100% accurate behavior, sethllConversionThresholdto a non-positive value.