While this began just as a kind of adaptive index/hash table library, it has grown into more of a collection in the theme of database/big-data related data structures & algorithms. { Let's say the "ad" in "adix" now stands for "ADvanced" | "AscenDant" as well as "adaptive" ;-) } Most of these are à la carte and I hope you find them useful. I try to keep the source code short & to the point. In particular, as an overview/index, here be:
- The original associative lookup modules (e.g., lptabz).
- Basic Sorting: nsort, a radix sort only by NEEDED BITS; often 5-10X faster than
  algorithm.sort if you sort a lot of meso-scale data (merge sorts always win for
  HUGE data; very few have it). (Could buffer writes to ensure full cache-line
  pokes.) (A sketch of the needed-bits idea follows this list.)
- Basic Sketches (abbreviated/approximate stores; aka "Digests") for:
  - Membership: bltab (bit-level table; like the more successfully marketed
    Bloom|Cuckoo filters, but lower latency & slightly bigger). (A toy
    fingerprint-table sketch follows this list.)
  - Count Distinct: uniqce, aka count unique or cardinality estimation
  - Approx Most Often: amoft (aka approximate top-K most frequent | heavy-hitters)
  - Quantiles: lghisto or tdigest (for slower, more accurate tail quantiles)
  - An amalgam: mvstat, which works like std/stats but supports del, i.e.
    sliding/moving windows over data streams (like a moving average) as well as
    moving or running quantiles via lghisto. Also includes bulk array stats that
    in some compile modes get fully SIMD-vectorized inner loops. (A tiny
    sliding-window illustration follows this list.)
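
The nsort item above is about sorting by only the bits the data actually uses.
Here is a minimal, self-contained sketch of that idea (a hypothetical proc name,
not the adix/nsort implementation, which handles general key bit-fields): an LSD
radix sort whose byte passes stop at the highest byte any key needs, so
narrow-range data takes fewer passes.

```nim
# Sketch only: LSD radix sort over 8-bit digits, stopping at the highest byte
# that the maximum key actually uses ("needed bits").
proc radixSortNeeded(a: var seq[uint32]) =
  if a.len < 2: return
  var maxK = a[0]
  for x in a:
    if x > maxK: maxK = x
  var tmp = newSeq[uint32](a.len)
  var shift = 0
  while (maxK shr shift) > 0:                  # only as many passes as needed
    var count: array[256, int]                 # zero-initialized each pass
    for x in a: inc count[int((x shr shift) and 0xff)]
    var pos = 0
    for d in 0 .. 255:                         # exclusive prefix sum -> offsets
      let c = count[d]
      count[d] = pos
      pos += c
    for x in a:                                # stable scatter by this digit
      let d = int((x shr shift) and 0xff)
      tmp[count[d]] = x
      inc count[d]
    swap(a, tmp)
    shift += 8

when isMainModule:
  var xs = @[42'u32, 7, 300, 5, 1_000_000, 0, 77]
  radixSortNeeded(xs)
  echo xs          # @[0, 5, 7, 42, 77, 300, 1000000]
```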
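For the bltab/membership item, here is a toy fingerprint table conveying the
general trade-off (an illustration of the idea only, not the adix/bltab API):
one linearly probed array of small nonzero fingerprints, so a query usually
touches one cache line, at the cost of more space per element than a Bloom
filter and a small false-positive rate from fingerprint collisions.

```nim
# Toy fingerprint membership set: 0 marks an empty slot; matching a fingerprint
# reports (probable) membership.  This toy never grows, so keep the load modest.
import std/hashes

type FpSet = object
  slots: seq[uint16]
  mask: int

proc initFpSet(lgSize: int): FpSet =
  FpSet(slots: newSeq[uint16](1 shl lgSize), mask: (1 shl lgSize) - 1)

proc fp(h: Hash): uint16 = uint16((h shr 16) and 0xffff) or 1'u16  # never 0

proc incl(s: var FpSet, key: string) =
  let h = hash(key)
  let f = fp(h)
  var i = h and s.mask
  while s.slots[i] != 0 and s.slots[i] != f:   # probe past other fingerprints
    i = (i + 1) and s.mask
  s.slots[i] = f

proc contains(s: FpSet, key: string): bool =
  let h = hash(key)
  let f = fp(h)
  var i = h and s.mask
  while s.slots[i] != 0:
    if s.slots[i] == f: return true
    i = (i + 1) and s.mask

when isMainModule:
  var s = initFpSet(10)                   # 1024 slots
  s.incl("apple"); s.incl("pear")
  echo("apple" in s, " ", "grape" in s)   # true, (almost surely) false
```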
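And for the mvstat item, a tiny illustration of what "works like std/stats but
supports del" buys you (just a sketch of the sliding-window idea with a
hypothetical WinStat type, not the adix/mvstat API): running sums make both
adding the newest and deleting the oldest sample O(1), so windowed means and
variances over a stream are cheap.

```nim
# Sliding-window mean/variance via running sum & sum of squares.  The variance
# uses the naive formula, so it is not numerically robust for huge values.
import std/deques, std/math

type WinStat = object
  q: Deque[float]          # the current window's samples
  s, s2: float             # running sum and sum of squares
  cap: int                 # window size

proc initWinStat(cap: int): WinStat = WinStat(q: initDeque[float](), cap: cap)

proc push(w: var WinStat, x: float) =
  w.q.addLast(x)
  w.s += x
  w.s2 += x * x
  if w.q.len > w.cap:                  # slide: delete the oldest sample
    let old = w.q.popFirst()
    w.s -= old
    w.s2 -= old * old

proc mean(w: WinStat): float = w.s / w.q.len.float
proc variance(w: WinStat): float = w.s2 / w.q.len.float - w.mean ^ 2

when isMainModule:
  var w = initWinStat(3)
  for x in [1.0, 2, 3, 4, 5]: w.push(x)
  echo w.mean, " ", w.variance         # over the last 3 samples: 4.0 0.666...
```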
And some utility modules:
- althash: salt-able alternate hash functions for lptabz
- xlang: Definitions for C refugees for bist
- sequint: a fixed stride "bit matrix" using "batch"/number ops.
- memutil: memory shifting utilities
- cumsum: parallel prefix sum using Intel SIMD for nsort
- bitop: re-implements some std/bitops operations to be more CT (compile-time) friendly
- topk: spends 2X the (small) space of a std/heapqueue-based top-k stream algo
  to scale O(lg k) better. (The heap-based baseline is sketched below.)
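
For scale, the std/heapqueue-based baseline that the topk item compares against
looks roughly like this (the baseline itself, not adix/topk): keep a k-element
min-heap of the largest values seen so far; each replacement costs O(lg k), so
a stream of n items costs O(n lg k).

```nim
# Standard heap-based top-k over a stream, using std/heapqueue's min-heap.
import std/heapqueue

proc topK[T](stream: openArray[T], k: int): seq[T] =
  var h = initHeapQueue[T]()
  for x in stream:
    if h.len < k:
      h.push(x)
    elif x > h[0]:           # beats the current k-th largest
      discard h.pop()        # drop the smallest of the kept k
      h.push(x)
  while h.len > 0:
    result.add h.pop()       # emits the top k in ascending order

when isMainModule:
  echo topK([5, 1, 9, 3, 7, 8, 2], 3)   # @[7, 8, 9]
```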
While sketches are popular, like merge sort (vs. radix sort), they often need huge data to pay off. Essentially, probabilistic analysis ("Oh wow, I can do that?!") distracts from space-time trade-offs. This distraction is worsened by there being space-time-accuracy "trade-off pyramids".
So, I can say here "spending a bit more space can yield major speed-ups", and it sounds blatantly obvious to even the most casual observer. Yet, I've also seen it neglected in this context countless times. The academic literature does not help, often being "blood sport" for more compressed data | accuracy with no regard to speed.
So, e.g., on my primary 32 GiB RAM dev box with bu/zipf, I cannot make exact
lfreq slower than Approximately Most Often sketches (bu/oft). tests/bl.nim shows
another example (also written up here) in a Bloom filter / membership
approximation context where spending 2-4X what a Bloom takes space-wise can buy
a 7-10X latency shrink. (Histograms & UCE are both pretty good deals, though, if
errors are acceptable, and adix/bltab with fingerprint keys is arguably just "a
better 'sketch'".)
As a brief guide I would start with NOTES.md and then look at the top part of
lptabz.nim. TODO.md
also has a lot of notes in it. My overarching vision
is to allow "the fast way" most of the time, especially for developers that know
how to provide a good hash
, but to also have auto fall backs to "safer ways"
with optional messages to let devs know they may need to intervene by changing
some defaults at table construction time (or else let users/sysadmins know that
some input may be violating the assumptions of some code sensitive to inputs).
Commercial database systems may have done this for decades, but it hasn't really
percolated into commonly available runtime libs. (A depth-based growth trigger is
likely the simplest example of Profile-Guided Optimization for data structures;
A. Dain Samples' 1993 PhD thesis has some more.)
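
To make the depth-based growth trigger concrete, here is a hedged sketch of the
idea (a toy, not how lptabz is actually written): a linear-probing string set
that grows, and could optionally warn, whenever an insert's probe chain exceeds
a limit, so the observed workload rather than a fixed load factor decides when
to spend memory.

```nim
# Toy depth-triggered hash set.  "" marks an empty slot, so "" cannot be stored.
import std/hashes

type DepthTab = object
  keys: seq[string]
  mask, depthLim: int

proc initDepthTab(lgSize = 4, depthLim = 8): DepthTab =
  DepthTab(keys: newSeq[string](1 shl lgSize),
           mask: (1 shl lgSize) - 1, depthLim: depthLim)

proc put(t: var DepthTab, key: string) =   # unconditional insert, for rehashing
  var i = hash(key) and t.mask
  while t.keys[i].len > 0:
    if t.keys[i] == key: return
    i = (i + 1) and t.mask
  t.keys[i] = key

proc tryIncl(t: var DepthTab, key: string): bool =
  ## Insert unless the probe chain gets deeper than depthLim.
  var i = hash(key) and t.mask
  var depth = 0
  while t.keys[i].len > 0:
    if t.keys[i] == key: return true
    i = (i + 1) and t.mask
    inc depth
    if depth > t.depthLim: return false    # observed profile says "too deep"
  t.keys[i] = key
  return true

proc grow(t: var DepthTab) =               # double & rehash
  var bigger = DepthTab(keys: newSeq[string](2 * t.keys.len),
                        mask: 2 * t.keys.len - 1, depthLim: t.depthLim)
  for k in t.keys:
    if k.len > 0: bigger.put(k)
  t = bigger

proc incl(t: var DepthTab, key: string) =
  while not t.tryIncl(key):
    # A real library might emit an optional hint here that the hash may need
    # salting or that inputs look adversarial, then fall back to growing.
    stderr.writeLine "long probe chain; growing table"
    t.grow()

when isMainModule:
  var t = initDepthTab()
  for w in ["apple", "pear", "plum", "fig", "kiwi", "yuzu", "date", "lime"]:
    t.incl(w)
```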