Add approx_top_k aggregate based on the (Filtered) Space-Saving algorithm, and use it in histogram #12653

Merged
merged 10 commits into from
Jun 24, 2024

Mytherin
Collaborator

This PR implements the approx_top_k function using the Space-Saving algorithm, specifically the Filtered Space-Saving variant. This algorithm tries to find the k most frequently occurring values without having to keep all distinct elements in memory (which is required to find the most frequently occurring elements in a data stream exactly).

Generally, when the differences between the counts of the frequently occurring values are large (i.e. for skewed data), this algorithm is guaranteed to find the most frequent values. When they are small (i.e. for uniform data), different values might be returned. The main idea is that finding the most frequently occurring values is only really interesting in skewed data sets anyway.

Syntax:

approx_top_k(column, k)

Algorithm

It is described in the paper in more detail, but essentially the algorithm maintains an exact count of a subset of the values (called the "monitored values"). In the current implementation we monitor up to k * 3 values. The higher the monitor count, the more memory is used, but the more accurate the result.

When a new value is seen, there are essentially two paths:

  • If the value is monitored, increment the exact count of that value
  • If the value is not monitored, replace the monitored entry that has the lowest count with the new value and set its count to lowest_count + 1

The idea is that frequent values will bubble up and be monitored, while less frequent values will swap around in the lower slots of the counters.
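
The two paths above can be sketched in a few lines of Python. This is only an illustration of plain Space-Saving, not the C++ code in this PR: the function name, the monitor_factor parameter, and the linear scan for the lowest count are assumptions made to keep the example short.

```python
def space_saving_top_k(stream, k, monitor_factor=3):
    """Approximate top-k using plain Space-Saving (illustrative sketch)."""
    capacity = k * monitor_factor  # the PR monitors up to k * 3 values
    counts = {}  # monitored value -> count

    for value in stream:
        if value in counts:
            # Path 1: the value is monitored, so increment its count.
            counts[value] += 1
        elif len(counts) < capacity:
            # There is still a free slot, so start monitoring the value.
            counts[value] = 1
        else:
            # Path 2: replace the monitored entry with the lowest count.
            # A linear scan is used here for brevity only.
            victim = min(counts, key=counts.get)
            lowest_count = counts.pop(victim)
            counts[value] = lowest_count + 1

    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Skewed input: two heavy hitters followed by many values seen once.
stream = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(20)]
print(space_saving_top_k(stream, k=2))
```

In this skewed example the rare values only ever displace each other in the low slots, so the two heavy hitters are never evicted and keep their counts.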

The filtered variant of the algorithm adds another step - where we keep a list of approximate counts based on the hash of the values. We then avoid swapping in a value if the approximate count for the hash of the value is lower than lowest_count. This improves performance because swapping a monitored value involves hash table operations (erasing/inserting a value).
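
A rough Python sketch of the filtering step follows. Several details are assumptions rather than what this PR does: the filter size of 64, the use of zlib.crc32 as the hash, bumping the approximate count on every unmonitored sighting, and giving a swapped-in value lowest_count + 1. The paper and the actual implementation should be consulted for the exact bookkeeping.

```python
import zlib


def filtered_space_saving_top_k(stream, k, monitor_factor=3, filter_size=64):
    """Approximate top-k with a hash-based filter (illustrative sketch)."""
    capacity = k * monitor_factor
    counts = {}                  # monitored value -> count
    approx = [0] * filter_size   # approximate counts, indexed by hash slot

    for value in stream:
        if value in counts:
            counts[value] += 1
            continue
        if len(counts) < capacity:
            counts[value] = 1
            continue

        # Unmonitored value: bump the approximate count for its hash slot.
        # crc32 is a stand-in hash chosen so the example is deterministic.
        slot = zlib.crc32(repr(value).encode("utf-8")) % filter_size
        approx[slot] += 1

        victim = min(counts, key=counts.get)
        lowest_count = counts[victim]
        if approx[slot] < lowest_count:
            # Not frequent enough yet: skip the swap and leave the
            # monitored entries untouched.
            continue

        # Frequent enough: swap the value in for the lowest entry.
        del counts[victim]
        counts[value] = lowest_count + 1

    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


stream = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(20)]
print(filtered_space_saving_top_k(stream, k=2))
```

Once the lowest monitored count rises above 1, a value seen for the first time no longer triggers a swap, which is where the saving in hash table operations comes from.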

Implementation

We currently only provide two implementations: the string implementation and the fallback implementation. As a result, while this works for all values, it is slower than it needs to be for integers/numerics since we don't have special code generated for fixed-width types.

Histogram

approx_top_k is used to select the bins for the sample technique of the histogram function, which is the technique selected by default for non-numeric/non-datetime types. Combined with histogram_exact, this provides more informative histograms for other types, e.g.:

D select * from histogram(ontime, uniquecarrier, bin_count := 8);
┌────────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────┐
│      bin       │ count  │                                       bar                                        │
│    varchar     │ uint64 │                                     varchar                                      │
├────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ AA             │ 677215 │ ██████████████████████████████████████████████████████▏                          │
│ AS             │ 139971 │ ███████████▏                                                                     │
│ B6             │ 225718 │ ██████████████████                                                               │
│ DL             │ 696931 │ ███████████████████████████████████████████████████████▊                         │
│ EV             │ 274565 │ █████████████████████▉                                                           │
│ OO             │ 521956 │ █████████████████████████████████████████▊                                       │
│ UA             │ 435757 │ ██████████████████████████████████▉                                              │
│ WN             │ 999114 │ ████████████████████████████████████████████████████████████████████████████████ │
│ (other values) │ 305230 │ ████████████████████████▍                                                        │
└────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────┘
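
One way to read the output above: once a set of bins has been chosen, every row is counted exactly into either its own bin or a catch-all bucket. The sketch below illustrates that idea in Python under this assumption. The function name and the sample data are hypothetical, and it is not a description of how DuckDB computes the histogram internally.

```python
def count_into_bins(values, bins, other_label="(other values)"):
    """Count values into the chosen bins, lumping the rest together."""
    counts = {bin_value: 0 for bin_value in sorted(bins)}
    other = 0
    for value in values:
        if value in counts:
            counts[value] += 1
        else:
            other += 1  # anything outside the chosen bins

    rows = list(counts.items())
    if other:
        rows.append((other_label, other))
    return rows


# Hypothetical data: three bins chosen up front, two leftover carriers.
carriers = ["WN"] * 5 + ["DL"] * 3 + ["AA"] * 2 + ["F9", "HA"]
for bin_value, count in count_into_bins(carriers, ["WN", "DL", "AA"]):
    print(bin_value, count)
```

Only the chosen bins plus one extra counter are kept in memory, regardless of how many distinct values the column contains.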

@Mytherin Mytherin added the Needs Documentation Use for issues or PRs that require changes in the documentation label Jun 24, 2024
@Mytherin Mytherin merged commit 1826262 into duckdb:main Jun 24, 2024
39 of 40 checks passed
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jun 24, 2024
@Mytherin Mytherin deleted the approxtopk2 branch June 27, 2024 13:52