Add `approx_top_k` aggregate based on the (Filtered) Space-Saving algorithm, and use it in histogram #12653
This PR implements the `approx_top_k` function using the Space-Saving algorithm, specifically the Filtered Space-Saving variant. This algorithm tries to find the `k` most frequently occurring values without having to keep all distinct elements in memory (which is required to find the most frequently occurring elements in a data stream exactly).

Generally, when the differences between the counts of frequently occurring values are large (i.e. for skewed data), this algorithm is guaranteed to find the most frequent values. When they are small (i.e. for uniform data), different values might be returned. The main idea is that finding the most frequently occurring values is only really interesting in skewed data sets anyway.
Syntax:
Algorithm
It is described in more detail in the paper, but essentially the algorithm maintains an exact count for a subset of the values (called the "monitored values"). The current implementation monitors up to `k * 3` values. The higher the monitor count, the more memory is used, but the more accurate the result.

When a new value is seen, there are essentially two paths:
- The value is already monitored: its count is incremented.
- The value is not monitored: it replaces the monitored value with the lowest count, and its count is set to `lowest_count + 1`.
The idea is that frequent values will bubble up and be monitored, while less frequent values will swap around in the lower slots of the counters.
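The update rule described above can be illustrated with a short Python sketch. This is not the PR's code: the class name `SpaceSavingSketch` and its methods are hypothetical, the linear scan for the lowest count is a simplification chosen for brevity, and starting a new value at a count of 1 while free slots remain is an assumption taken from the usual formulation of Space-Saving.

```python
class SpaceSavingSketch:
    """Toy Space-Saving counter (illustrative only, not the PR's implementation)."""

    def __init__(self, k):
        self.k = k
        self.capacity = k * 3  # monitor up to k * 3 values, as the PR describes
        self.counts = {}       # monitored value -> estimated count

    def add(self, value):
        if value in self.counts:
            # Path 1: the value is already monitored, so bump its count.
            self.counts[value] += 1
            return
        if len(self.counts) < self.capacity:
            # Assumption: while free slots remain, start monitoring at count 1.
            self.counts[value] = 1
            return
        # Path 2: evict the monitored value with the lowest count and give
        # the newcomer lowest_count + 1.
        victim = min(self.counts, key=self.counts.get)
        lowest_count = self.counts.pop(victim)
        self.counts[value] = lowest_count + 1

    def top_k(self):
        # Return the k monitored values with the highest estimated counts.
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [value for value, _ in ranked[: self.k]]


sketch = SpaceSavingSketch(k=1)
for v in [1, 1, 1, 2, 3, 4, 1]:
    sketch.add(v)
print(sketch.top_k())
```

In this small run the value `4` evicts `2` and inherits a count of `lowest_count + 1`, which overestimates its true frequency. That overestimate is why rarely seen values keep swapping around in the lower slots while a frequent value such as `1` stays monitored.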
The filtered variant of the algorithm adds another step, where we keep a list of approximate counts based on the hash of the values. We then avoid swapping in a value if the approximate count for the hash of the value is lower than `lowest_count`. This improves performance because swapping a monitored value involves hash table operations (erasing/inserting a value).

Implementation
We currently only provide two implementations: the string implementation and the fallback implementation. As a result, while this works for all values, it is slower than it needs to be for integers/numerics since we don't have special code generated for fixed-width types.
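Neither implementation is reproduced here, but the filter check from the Algorithm section is small enough to sketch in isolation. The Python below is a hedged illustration only: the function name `passes_filter`, the fixed-size list of counters, and the choice to increment the counter before comparing are all assumptions, not details taken from the PR.

```python
def passes_filter(filter_counts, value, lowest_count):
    """Record one sighting of an unmonitored value in the hash-indexed filter.

    Returns True when the approximate count for the value's hash slot has
    reached lowest_count, i.e. when swapping the value in is worthwhile.
    """
    slot = hash(value) % len(filter_counts)
    filter_counts[slot] += 1  # assumption: count the sighting first
    return filter_counts[slot] >= lowest_count


filter_counts = [0] * 8  # hypothetical filter size
decisions = [passes_filter(filter_counts, 42, lowest_count=3) for _ in range(3)]
print(decisions)
```

Only the third sighting of the unmonitored value reaches `lowest_count`, so the first two sightings cost one array update each and skip the hash table erase/insert entirely.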
Histogram
`approx_top_k` is used to select bins for the `sample` technique of the `histogram` function, which is selected by default for non-numeric/non-datetime types. Combined with `histogram_exact`, this provides more informative histograms for other types, e.g.: