Skip to content

For DISTINCT_COUNT, automatically convert Set to HyperLogLog when cardinality is too high#8074

Closed
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:distinct_count_to_hll
Closed

For DISTINCT_COUNT, automatically convert Set to HyperLogLog when cardinality is too high#8074
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:distinct_count_to_hll

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

Description

For DISTINCT_COUNT and DISTINCT_COUNT_MV aggregation function, currently we use Set to store all the values, which can cause memory issues and potentially exhaust the memory for Servers or Brokers. This PR adds the support to automatically convert the Set to HyperLogLog if the set size grows too big to protect the servers. This conversion only applies to aggregation only queries, but not the group-by queries.

By default, when the set size exceeds 100K, it will be converted to a HyperLogLog with log2m of 12.
The log2m and threshold can be configured using the second argument (literal) of the function:

  • hllLog2m: log2m of the converted HyperLogLog (default 12)
  • hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)

Example query:
SELECT DISTINCTCOUNT(myCol, 'hllLog2m=8;hllConversionThreshold=10') FROM myTable

Release Notes

Add second argument (literal) to DISTINCT_COUNT and DISTINCT_COUNT_MV aggregation function for optional parameters:

  • hllLog2m: log2m of the converted HyperLogLog (default 12)
  • hllConversionThreshold: set size threshold to trigger the conversion, non-positive means never convert (default 100K)

For DISTINCT_COUNT and DISTINCT_COUNT_MV aggregation only queries, if the result is over 100K, the query will use HyperLogLog and return approximate result by default. To get back to the 100% accurate behavior, set hllConversionThreshold to a non-positive value.

Copy link
Copy Markdown
Member

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the semantics of the function so that it is no longer a distinct count. I think it would be better to raise an error if the threshold is breached, which would educate the user to use distinctcounthll instead (this would also be documented clearly).

@xiangfu0
Copy link
Copy Markdown
Contributor

This changes the semantics of the function so that it is no longer a distinct count. I think it would be better to raise an error if the threshold is breached, which would educate the user to use distinctcounthll instead (this would also be documented clearly).

Agreed. I also see some users actually reports the error is about 3% when the cardinality is around 200M comparing DISTINCT_COUNT and DISTINCT_COUNT_BITMAP

@Jackie-Jiang
Copy link
Copy Markdown
Contributor Author

This changes the semantics of the function so that it is no longer a distinct count. I think it would be better to raise an error if the threshold is breached, which would educate the user to use distinctcounthll instead (this would also be documented clearly).

Good point. We can definitely reject the query if it goes over the threshold, and make the threshold configurable.
@kishoreg What's your opinion on this behavior? Reject or silently convert?

@kishoreg
Copy link
Copy Markdown
Member

kishoreg commented Jan 26, 2022

can we list out all the options we have

  1. Automatically convert set to hyperloglog after a threshold
    a. Threshold is set to something 100K by default
    b. threshold is set to -1 which means feature is off and folks can change it
    c. user has the ability to control the threshold through query option (enable_approx_distinct_threshold=100,000)

  2. Return error if the threshold is reached
    a. user then uses disctintcounthll

The reason why I don't prefer second option where we return error and ask users to use distinctcounthll

  • the users cannot change to distinctcountsql because this will always return approximate even when it does not hit the threshold.
  • most of them will not hit this error in testing and will directly see this in production which is too late.
  • Pinot is mostly accessed programmatically via apps and the app user cannot really do much when the app returns error.
  • distinctcounthll is not really a standard sql and wont work with other standard tools like tableau, superset etc

My preference is to go with option 1 but start with -1 as the default value which makes the feature off by default but have the ability to override it using per query option or server config. This might still mean that they will see OOM when the distinct set does not fit in memory but they have the option to fix it via server config (no need of code change) or queryoption (needs change in app code)

@richardstartin
Copy link
Copy Markdown
Member

I think there is a good argument for introducing a new function which degrades to hll at a threshold (which the user could even supply). My expectation of distinct count is that it is exact if it produces a result, degrading to an approximate result without being explicit about it makes query results hard to reason about. Degradation to an approximate result at an arbitrary threshold is arguably worse than always producing an approximate result.

@richardstartin
Copy link
Copy Markdown
Member

richardstartin commented Jan 27, 2022

The basic problem here is needing to choose between being wrong and not producing a result when the cardinality is high, given the definition of distinct count. In my opinion, OOM risk should be mitigated explicitly by resource controls/circuit breakers and not by relaxing semantics. If this PR is merged as is, it’s a statement that producing a result is prioritised over being correct, but one of those options has to be chosen.

@kishoreg
Copy link
Copy Markdown
Member

This PR should be good to go if the default threshold is changed to infinity or -1 right? That keeps the existing behavior and does not change anything but it allows the both operators and users to control the behavior.

@snleee
Copy link
Copy Markdown
Contributor

snleee commented Jan 27, 2022

-1 on changing the default behavior. As @richardstartin mentioned, we should not change the default behavior because the clients do expect the exact result when they run distinctcount.

Meanwhile, if we keep the default behavior as it is, the user now needs to be educated enough to know how to configure hll threshold DISTINCTCOUNT(myCol, 'hllLog2m=8;hllConversionThreshold=10') to control the behavior when firing the query. Then, why not ask them to use distinctcounthll directly?

@kishoreg
Copy link
Copy Markdown
Member

@snleee what is your take on the behavior I suggested?

  • default behavior remains the same by having threshold = -1.
  • Operators can explicitly set a threshold to change default behavior
  • Users can still override the threshold on a per query basis either using query option or new argument to distinctCount function.

I also wanted to call out again that asking users to use distinctcounthll is not a feasible solution because distinctcounthll will always use hyperloglog and return approximate results. What this PR is trying to add is a way to automatically switch to hyperloglog if the distinct set exceeds a threshold and this is dependent on the data and query.

@Jackie-Jiang
Copy link
Copy Markdown
Contributor Author

How about introducing a new aggregation function which automatically degrade to hll, then we can add a config on the broker to rewrite distinctCount to this new function? This way we get the following benefits:

  • No performance hit to distinctCount
  • Can be enabled by admin to always fall-back to the new function
  • User can explicitly choose the new function

@richardstartin
Copy link
Copy Markdown
Member

How about introducing a new aggregation function which automatically degrade to hll, then we can add a config on the broker to rewrite distinctCount to this new function? This way we get the following benefits:

  • No performance hit to distinctCount
  • Can be enabled by admin to always fall-back to the new function
  • User can explicitly choose the new function

Good idea 👍

@snleee
Copy link
Copy Markdown
Contributor

snleee commented Jan 27, 2022

@kishoreg Oh, so we want the new behavior where we return the correct result until the threshold and return the approximate after threshold. In that case, using distinctcounthll will not work and I would +1 on Jackie's approach. That will achieve both. (1. keep the default behavior 2. add the ability to override the default behavior for advanced admin)

@Jackie-Jiang How will you deal with the new input hllLog2m=8;hllConversionThreshold=10 when we fall back to the new function with config? The original distinctcount function currently doesn't get the second parameter.

@Jackie-Jiang
Copy link
Copy Markdown
Contributor Author

@Jackie-Jiang How will you deal with the new input hllLog2m=8;hllConversionThreshold=10 when we fall back to the new function with config? The original distinctcount function currently doesn't get the second parameter.

@snleee If user doesn't explicitly call the new function, distinctcount will be rewritten to the default one (log2m=12, conversionThreshold=10k)

@Jackie-Jiang
Copy link
Copy Markdown
Contributor Author

@richardstartin @xiangfu0 @kishoreg @snleee Per the discussion, I've moved the logic into a different function DistinctCountSmartHLLAggregationFunction in #8189. Could you please take a look?

@Jackie-Jiang Jackie-Jiang deleted the distinct_count_to_hll branch February 11, 2022 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants