Add `approx_top_k` aggregate based on the (Filtered) Space-Saving algorithm, and use it in histogram #12653

Mytherin · 2024-06-23T07:55:20Z

This PR implements the approx_top_k function using the Space-Saving algorithm, specifically the Filtered Space-Saving variant. The algorithm tries to find the k most frequently occurring values without keeping all distinct elements in memory, which would be required to find the most frequent elements of a data stream exactly.

Generally, when the differences in frequency between the most common values are large (i.e. for skewed data), this algorithm is guaranteed to find the most frequent values. When the differences are small (i.e. for uniform data), different values might be returned. The main idea is that finding the most frequently occurring values is only really interesting in skewed data sets anyway.

Syntax:

approx_top_k(column, k)

Algorithm

The paper describes it in more detail, but essentially the algorithm maintains an exact count of a subset of the values (called the "monitored values"). The current implementation monitors up to k * 3 values. The higher the monitor count, the more memory is used, but the more accurate the result.

When a new value is seen, there are essentially two paths:

If the value is monitored, increment the exact count of that value
If the value is not monitored, replace the monitored entry that has the lowest count with the new value and set its count to lowest_count + 1

The idea is that frequent values will bubble up and be monitored, while less frequent values will swap around in the lower slots of the counters.
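The two paths above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in this PR: the class name `SpaceSaving`, the `monitor_factor` parameter and the linear scan for the lowest count are hypothetical simplifications, and a real implementation would keep the entries ordered to avoid that scan.

```python
class SpaceSaving:
    """Illustrative Space-Saving sketch (hypothetical names, not the PR's code)."""

    def __init__(self, k, monitor_factor=3):
        self.k = k
        # The PR description says up to k * 3 values are monitored.
        self.capacity = k * monitor_factor
        self.counts = {}  # monitored value -> count

    def add(self, value):
        if value in self.counts:
            # Path 1: the value is monitored, so bump its count.
            self.counts[value] += 1
        elif len(self.counts) < self.capacity:
            # Free slot available: start monitoring the value.
            self.counts[value] = 1
        else:
            # Path 2: replace the entry with the lowest count and
            # give the newcomer lowest_count + 1.
            victim = min(self.counts, key=self.counts.get)
            lowest = self.counts.pop(victim)
            self.counts[value] = lowest + 1

    def top_k(self):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [value for value, _ in ranked[: self.k]]


# Skewed stream: "a" and "b" are frequent, every "noise" value occurs once.
sketch = SpaceSaving(k=2)
for i in range(100):
    sketch.add("a")
    if i % 2 == 0:
        sketch.add("b")
    sketch.add(f"noise{i}")
print(sketch.top_k())
```

On this stream "a" and "b" are never the lowest-count entry, so they stay monitored with exact counts while the one-off values rotate through the remaining four slots.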

The filtered variant of the algorithm adds another step - where we keep a list of approximate counts based on the hash of the values. We then avoid swapping in a value if the approximate count for the hash of the value is lower than lowest_count. This improves performance because swapping a monitored value involves hash table operations (erasing/inserting a value).
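A rough Python sketch of the filter step follows. It is an illustration only: the class name `FilteredSpaceSaving`, the `filter_size` default, the use of `zlib.crc32` as the hash, and the `swaps` counter are assumptions made for this example, and the way the filter is updated is a simplification rather than what this PR or the paper does in detail.

```python
import zlib


class FilteredSpaceSaving:
    """Illustrative Filtered Space-Saving sketch (hypothetical names)."""

    def __init__(self, k, monitor_factor=3, filter_size=1024):
        self.k = k
        self.capacity = k * monitor_factor
        self.counts = {}                 # monitored value -> count
        self.filter = [0] * filter_size  # approximate counts, indexed by hash
        self.swaps = 0                   # replacements performed (for illustration)

    def _slot(self, value):
        # Assumption: any stable hash works; crc32 keeps runs reproducible.
        return zlib.crc32(repr(value).encode("utf-8")) % len(self.filter)

    def add(self, value):
        if value in self.counts:
            self.counts[value] += 1
            return
        if len(self.counts) < self.capacity:
            self.counts[value] = 1
            return
        # Unmonitored value: update its approximate count first.
        slot = self._slot(value)
        self.filter[slot] += 1
        victim = min(self.counts, key=self.counts.get)
        lowest = self.counts[victim]
        if self.filter[slot] < lowest:
            # Approximate count is below lowest_count: skip the swap
            # and leave the monitored entries untouched.
            return
        del self.counts[victim]
        self.counts[value] = lowest + 1
        self.swaps += 1

    def top_k(self):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [value for value, _ in ranked[: self.k]]


fss = FilteredSpaceSaving(k=2)
for i in range(100):
    fss.add("a")
    if i % 2 == 0:
        fss.add("b")
    fss.add(f"noise{i}")
print(fss.top_k(), fss.swaps)
```

In this sketch a value that occurs only once stops triggering replacements as soon as lowest_count exceeds its approximate count, which is the saving in hash table operations that the paragraph above describes.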

Implementation

We currently only provide two implementations: the string implementation and the fallback implementation. As a result, while this works for all values, it is slower than it needs to be for integers/numerics since we don't have special code generated for fixed-width types.

Histogram

approx_top_k is used to select bins for the sample technique of the histogram function, which is selected by default for non-numeric/non-datetime types. Combined with histogram_exact, this provides more informative histograms for those types, e.g.:

D select * from histogram(ontime, uniquecarrier, bin_count := 8);
┌────────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────┐
│      bin       │ count  │                                       bar                                        │
│    varchar     │ uint64 │                                     varchar                                      │
├────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ AA             │ 677215 │ ██████████████████████████████████████████████████████▏                          │
│ AS             │ 139971 │ ███████████▏                                                                     │
│ B6             │ 225718 │ ██████████████████                                                               │
│ DL             │ 696931 │ ███████████████████████████████████████████████████████▊                         │
│ EV             │ 274565 │ █████████████████████▉                                                           │
│ OO             │ 521956 │ █████████████████████████████████████████▊                                       │
│ UA             │ 435757 │ ██████████████████████████████████▉                                              │
│ WN             │ 999114 │ ████████████████████████████████████████████████████████████████████████████████ │
│ (other values) │ 305230 │ ████████████████████████▍                                                        │
└────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────┘
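The shape of this output (a handful of frequent values as bins, plus one row for everything else) can be imitated with a short Python sketch. Everything here is an assumption made for illustration: the function name `sample_histogram` is hypothetical, `Counter.most_common` stands in for approx_top_k, the bars use whole block characters only, and sorting the bins by label is simply inferred from the output above.

```python
from collections import Counter


def sample_histogram(values, bin_count, bar_width=80):
    """Return (label, count, bar) rows; a sketch, not DuckDB's histogram()."""
    exact = Counter(values)
    if not exact:
        return []
    # Stand-in for approx_top_k: pick the most frequent values as bins.
    bins = sorted(value for value, _ in exact.most_common(bin_count))
    # Count the selected bins exactly and lump the remainder together.
    rows = [(label, exact[label]) for label in bins]
    other = sum(exact.values()) - sum(count for _, count in rows)
    if other > 0:
        rows.append(("(other values)", other))
    # Scale each bar against the largest row.
    peak = max(count for _, count in rows)
    return [
        (label, count, "█" * round(bar_width * count / peak))
        for label, count in rows
    ]


carriers = ["WN"] * 4 + ["AA"] * 3 + ["DL"] * 2 + ["XX"]
for label, count, bar in sample_histogram(carriers, bin_count=2, bar_width=20):
    print(f"{label:<15} {count:>3} {bar}")
```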

Mytherin added 10 commits June 21, 2024 16:30

Initial implementation of approx top k

81010c1

Use arena allocator for allocating strings

c255273

Rework code to use references stored adjacent in an array instead

9415396

Fix assert

08e826d

Format + special case for varchar

ff3e999

Add more tests, swap order of values from small -> large to large -> …

27b2d5f

…small so we can lazily allocate values

Use ApproxTopKString that stores the hash instead of re-computing the…

24970ab

… hash repeatedly

Add the filter extension based on Filtered Space-Saving algorithm to …

a14722b

…speed up approx top k

Format

4321f90

Format fix

a04b977

Mytherin added the Needs Documentation label Jun 24, 2024

Mytherin merged commit 1826262 into duckdb:main Jun 24, 2024
39 of 40 checks passed

duckdblabs-bot mentioned this pull request Jun 24, 2024

[duckdb/#12653] - Add approx_top_k aggregate based on the (Filtered) Space-Saving algorithm, and use it in histogram needs documentation duckdb/duckdb-web#3148

Open

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jun 24, 2024

chore: Update vendored sources to duckdb/duckdb@1826262

d259c7e

Merge pull request duckdb/duckdb#12653 from Mytherin/approxtopk2

Mytherin deleted the approxtopk2 branch June 27, 2024 13:52
