# HllSet Relational Algebra

The **HyperLogLog (HLL)** algorithm estimates the number of unique elements in a set of uniformly distributed random numbers (in practice, hashes of the original elements) by looking at the maximum number of trailing zeros in the binary form of each number. If **n** is the maximum number of trailing zeros observed, then the set contains roughly **2^n** distinct elements.
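As a rough illustration of this idea, the sketch below hashes each item with Julia's built-in `hash` and keeps the longest run of trailing zeros it sees. The helper name `single_register_estimate` is hypothetical and is not part of the HllSets module used later in this notebook.

```julia
# Sketch only: a single-register estimate, assuming Base.hash behaves like a
# uniform random 64-bit number for distinct inputs.
function single_register_estimate(items)
    max_zeros = 0
    for item in items
        h = hash(item)                                # UInt hash of the item
        h == 0 && continue                            # an all-zero hash has no usable run
        max_zeros = max(max_zeros, trailing_zeros(h)) # longest run of trailing zeros so far
    end
    return 2^max_zeros                                # guess: about 2^n distinct elements
end

estimate = single_register_estimate("item$i" for i in 1:1000)
println("estimate for 1000 distinct items: ", estimate)
```

The result is always a power of two, and repeated items never change it, which is why a single register gives only a coarse guess.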

However, a single estimate of this kind has high variance. To improve accuracy, the HLL algorithm splits the multiset into several smaller subsets, finds the maximum number of trailing zeros for each subset, and combines these values with a harmonic mean to produce a better overall estimate of the total number of unique elements.
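The splitting and averaging step might look like the following sketch. It assumes 32 subsets chosen by the low 5 bits of Julia's built-in `hash`, and it uses the bias constant (0.697 for 32 registers) and the small-range correction from the original HyperLogLog paper. The name `bucketed_estimate` is hypothetical; this is not the HllSets implementation.

```julia
const P = 5            # index bits (assumed here to give 32 subsets)
const M = 1 << P       # number of subsets

# Hypothetical helper: HyperLogLog-style estimate over M subsets.
function bucketed_estimate(items)
    registers = zeros(Int, M)
    for item in items
        h = hash(item)
        bucket = Int(h & UInt64(M - 1)) + 1          # low P bits pick the subset
        rank = trailing_zeros(h >> P) + 1            # zeros in the remaining bits, plus one
        registers[bucket] = max(registers[bucket], rank)
    end
    alpha = 0.697                                    # bias constant for 32 registers
    raw = alpha * M^2 / sum(2.0^(-r) for r in registers)   # normalized harmonic mean
    empty = count(==(0), registers)
    # Linear-counting correction for small cardinalities
    return (raw <= 2.5 * M && empty > 0) ? M * log(M / empty) : raw
end

items = ["item$i" for i in 1:1000]
println(bucketed_estimate(items))
```

With 32 subsets the expected relative error is on the order of 1.04 / sqrt(32), or roughly 18%, so the printed value should land near 1000 but not exactly on it.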

The HLL data structure is represented as a **k-tuple \( t = (n_1, n_2, \ldots, n_i, \ldots, n_k) \)**, where each **\( n_i \)** is the maximum number of trailing zeros for the i-th subset. This representation lets you merge multiple HLLs without losing information: the merged HLL gives the same count of unique elements as an HLL computed over the combined original datasets.
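The merge property can be checked directly. The sketch below builds the k-tuple for two overlapping sets, merges the tuples with an element-wise maximum, and compares the result with the tuple built from the combined data. `max_registers` is a hypothetical helper based on Julia's built-in `hash`, with k = 32 as an assumption.

```julia
const P = 5
const M = 1 << P

# Hypothetical helper: one maximum per subset, i.e. the k-tuple t with k = 32.
function max_registers(items)
    registers = zeros(Int, M)
    for item in items
        h = hash(item)
        bucket = Int(h & UInt64(M - 1)) + 1
        rank = trailing_zeros(h >> P) + 1   # +1 keeps an empty register (0) distinct
        registers[bucket] = max(registers[bucket], rank)
    end
    return registers
end

x = Set("a$i" for i in 1:50)
y = Set("a$i" for i in 30:90)

merged = max.(max_registers(x), max_registers(y))   # merge = element-wise maximum
println(merged == max_registers(union(x, y)))       # true: same registers as the combined data
```

Because each register is a maximum, taking the maximum of two registers is the same as taking the maximum over all items from both sets, so the equality holds for any pair of inputs.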

However, HLLs have limitations: they do not support other set operations such as **intersection**. To fix this, you can replace the **maximum number** of zeros in each position of the tuple t with a **bit-vector** that records every trailing-zero count observed for that subset. This improved structure, which we call **HllSets** (HyperLogLog Sets), **allows you to perform all set operations**.
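A minimal sketch of the bit-vector idea follows, assuming one `UInt32` per subset and Julia's built-in `hash`. The helper `bit_registers` is a hypothetical stand-in and is not the HllSets module itself. Set operations then become bitwise operations on the registers.

```julia
const P = 5

# Hypothetical stand-in for an HllSet: one UInt32 bit-vector per subset.
function bit_registers(items)
    registers = zeros(UInt32, 1 << P)
    for item in items
        h = hash(item)
        bucket = Int(h & UInt64((1 << P) - 1)) + 1
        zero_run = min(trailing_zeros(h >> P), 31)    # clamp to the 32 bits available
        registers[bucket] |= UInt32(1) << zero_run    # remember every run length seen
    end
    return registers
end

x = Set("a$i" for i in 1:50)
y = Set("a$i" for i in 30:90)
a, b = bit_registers(x), bit_registers(y)

uni       = a .| b              # union
inter     = a .& b              # intersection
sym       = xor.(a, b)          # symmetric difference
a_minus_b = a .& map(~, b)      # complement of b relative to a

println(uni == bit_registers(union(x, y)))   # true: OR reproduces the union exactly
```

In this sketch the union is exact at the register level, while the AND result always contains the registers of the true intersection but may carry extra bits set by different elements, so counts derived from it are estimates.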


# References
1. https://en.wikipedia.org/wiki/HyperLogLog
2. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
3. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
4. https://redis.io/docs/data-types/probabilistic/hyperloglogs/
5. https://github.com/ascv/HyperLogLog/blob/master/README.md
6. https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle
7. https://en.wikipedia.org/wiki/Algebra_of_sets
8. https://dl.acm.org/doi/10.1145/358396.358400

In [1]:
# using Pkg
# Pkg.activate(".")
# Pkg.instantiate()
# Pkg.add("CSV")
# Pkg.add("Arrow")
# Pkg.add("Tables")
# Pkg.add("JSON3")

## Creating HllSets and applying basic operations

In [2]:
using Random
using FilePathsBase: extension, Path

include("src/sets32.jl")

import .HllSets as set

# Initialize test HllSets
hll1 = set.HllSet{5}(); hll1_seeded = set.HllSet{5}()
hll2 = set.HllSet{5}(); hll2_seeded = set.HllSet{5}()
hll3 = set.HllSet{5}(); hll3_seeded = set.HllSet{5}()
hll4 = set.HllSet{5}(); hll4_seeded = set.HllSet{5}()
hll5 = set.HllSet{5}(); hll5_seeded = set.HllSet{5}()

# Generate datasets from random strings
s1 = Set(randstring(7) for _ in 1:10)
s2 = Set(randstring(7) for _ in 1:15)
s3 = Set(randstring(7) for _ in 1:100)
s4 = Set(randstring(7) for _ in 1:20)
s5 = Set(randstring(7) for _ in 1:130)

# Add datasets to HllSets
set.add!(hll1, s1); set.add!(hll1_seeded, s1, seed=123)
set.add!(hll2, s2); set.add!(hll2_seeded, s2, seed=123)
set.add!(hll3, s3); set.add!(hll3_seeded, s3, seed=123)
set.add!(hll4, s4); set.add!(hll4_seeded, s4, seed=123)
set.add!(hll5, s5); set.add!(hll5_seeded, s5, seed=123)

println(hll1.counts, "\n", count(hll1))
println(hll1_seeded.counts, "\n", count(hll1_seeded))

println("Size of hll1: ", set.sizeof(hll1), "; \nSize of hll1_seeded: ", set.sizeof(hll1_seeded))

UInt32[0x00000002, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000004, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000004, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000001, 0x00000000, 0x00000000, 0x00000001, 0x00000000, 0x00000000, 0x00000002, 0x00000001, 0x00000000, 0x00000000, 0x00000000]
10
UInt32[0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000004, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000002, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000000, 0x00000000, 0x00000004, 0x00000020, 0x00000000, 0x00000002, 0x00000002, 0x00000004, 0x00000000, 0x00000000, 0x00000002, 0x00000000, 0x00000000]
12
Size of hll1: 32; 
Size of hll1_seeded: 32


In [3]:
# Print cardinality of datasets and HllSets side by side
println(length(s1), " : ", count(hll1))
println(length(s2), " : ", count(hll2))
println(length(s3), " : ", count(hll3))
println(length(s4), " : ", count(hll4))
println(length(s5), " : ", count(hll5))

# union
println("\nunion:\n", length(s1 ∪ s2 ∪ s3 ∪ s4 ∪ s5), " : ", count(hll1 ∪ hll2 ∪ hll3 ∪ hll4 ∪ hll5), "\n")

# intersection
println("intersection (standard HllSet with seeded):\n", count(hll1 ∩ hll1_seeded))


10 : 10
15 : 13
100 : 105
20 : 24
130 : 121

union:
275 : 295

intersection (standard HllSet with seeded):
0


### HllSet Universes

In [4]:
A = set.HllSet{5}(); A_123 = set.HllSet{5}()
B = set.HllSet{5}(); B_123 = set.HllSet{5}()
C = set.HllSet{5}(); C_123 = set.HllSet{5}()

items_t1 = Set(["string0", "string1", "string2", "string3", "string4", "string5", "string6", "string7", "string8", "string9", "string10"])
items_t2 = Set(["string3", "string4", "string5", "string6", "string7", "string8", "string9", "string10", "string11"])
items_t3 = Set(["string5", "string6", "string7", "string8", "string9", "string10", "string11"])

set.add!(A, items_t1); set.add!(A_123, items_t1, seed=123)
set.add!(B, items_t2); set.add!(B_123, items_t2, seed=123)
set.add!(C, items_t3); set.add!(C_123, items_t3, seed=123)

In [5]:

# Default and seeded HllSet Universes
U = A ∪ B ∪ C; U_123 = A_123 ∪ B_123 ∪ C_123

# The intersection of the two universes is almost empty
println("U ∩ U_123: ", count(U ∩ U_123), "\n")

println("A: ", count(A)); println("A_123: ", count(A_123))
println("B: ", count(B)); println("B_123: ", count(B_123))
println("C: ", count(C)); println("C_123: ", count(C_123))
println("U: ", count(U)); println("U_123: ", count(U_123))

println("AB = A ∩ B: ", count(A ∩ B)); println("AB_123 = A_123 ∩ B_123: ", count(A_123 ∩ B_123))
println("AC = A ∩ C: ", count(A ∩ C)); println("AC_123 = A_123 ∩ C_123: ", count(A_123 ∩ C_123))
println("BC = B ∩ C: ", count(B ∩ C)); println("BC_123 = B_123 ∩ C_123: ", count(B_123 ∩ C_123))

U ∩ U_123: 2

A: 11
A_123: 12
B: 10
B_123: 11
C: 7
C_123: 8
U: 12
U_123: 14
AB = A ∩ B: 9
AB_123 = A_123 ∩ B_123: 10
AC = A ∩ C: 6
AC_123 = A_123 ∩ C_123: 7
BC = B ∩ C: 7
BC_123 = B_123 ∩ C_123: 8
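The estimates printed above can be compared with exact counts, since the three input sets are small enough to handle directly with Julia's built-in `Set`. The sketch below rebuilds the same items; the names `t1`, `t2`, `t3` and `u` are local to this example.

```julia
# Same items as items_t1, items_t2 and items_t3, rebuilt with generators
t1 = Set("string$i" for i in 0:10)
t2 = Set("string$i" for i in 3:11)
t3 = Set("string$i" for i in 5:11)
u = union(t1, t2, t3)

println("|A| = ", length(t1), ", |B| = ", length(t2), ", |C| = ", length(t3), ", |U| = ", length(u))
println("|A ∩ B| = ", length(intersect(t1, t2)))
println("|A ∩ C| = ", length(intersect(t1, t3)))
println("|B ∩ C| = ", length(intersect(t2, t3)))
```

The exact values are |A| = 11, |B| = 9, |C| = 7, |U| = 12, |A ∩ B| = 8, |A ∩ C| = 6 and |B ∩ C| = 7, so the default-seed estimates are off by at most one and the seeded estimates by at most two.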


### Defining Probabilities and Conditional Probabilities with HllSets

In [6]:
# Probabilities
println("P(A) = |A| / |U|: ", count(A) / count(U)); println("P(A_123) = |A_123| / |U_123|: ", count(A_123) / count(U_123), "\n")
println("P(B) = |B| / |U|: ", count(B) / count(U)); println("P(B_123) = |B_123| / |U_123|: ", count(B_123) / count(U_123), "\n")
println("P(C) = |C| / |U|: ", count(C) / count(U)); println("P(C_123) = |C_123| / |U_123|: ", count(C_123) / count(U_123), "\n", "\n")

# Conditional Probabilities
println("P(A | B) = |AB| / |B|: ", count(A ∩ B) / count(B)); println("P(A_123 | B_123) = |AB_123| / |B_123|: ", count(A_123 ∩ B_123) / count(B_123), "\n")
println("P(B | A) = |AB| / |A|: ", count(A ∩ B) / count(A)); println("P(B_123 | A_123) = |AB_123| / |A_123|: ", count(A_123 ∩ B_123) / count(A_123), "\n")
println("P(A | C) = |AC| / |C|: ", count(A ∩ C) / count(C)); println("P(A_123 | C_123) = |AC_123| / |C_123|: ", count(A_123 ∩ C_123) / count(C_123), "\n")
println("P(C | A) = |AC| / |A|: ", count(A ∩ C) / count(A)); println("P(C_123 | A_123) = |AC_123| / |A_123|: ", count(A_123 ∩ C_123) / count(A_123), "\n", "\n")

println("P(B | C) = |BC| / |C|: ", count(B ∩ C) / count(C))
println("P(C | B) = |BC| / |B|: ", count(B ∩ C) / count(B), "\n")

P(A) = |A| / |U|: 0.9166666666666666
P(A_123) = |A_123| / |U_123|: 0.8571428571428571

P(B) = |B| / |U|: 0.8333333333333334
P(B_123) = |B_123| / |U_123|: 0.7857142857142857

P(C) = |C| / |U|: 0.5833333333333334
P(C_123) = |C_123| / |U_123|: 0.5714285714285714


P(A | B) = |AB| / |B|: 0.9
P(A_123 | B_123) = |AB_123| / |B_123|: 0.9090909090909091

P(B | A) = |AB| / |A|: 0.8181818181818182
P(B_123 | A_123) = |AB_123| / |A_123|: 0.8333333333333334

P(A | C) = |AC| / |C|: 0.8571428571428571
P(A_123 | C_123) = |AC_123| / |C_123|: 0.875

P(C | A) = |AC| / |A|: 0.5454545454545454
P(C_123 | A_123) = |AC_123| / |A_123|: 0.5833333333333334


P(B | C) = |BC| / |C|: 1.0
P(C | B) = |BC| / |B|: 0.7



### **How Many Universes Can We Have?**

The answer depends on the platform we are using, specifically the width of the integers available for seeds. For instance, in a 64-bit environment, there are 2^64 - 1 (over 18 quintillion, or approximately 1.8 × 10^19) distinct values available to be used as seeds for generating HllSets.
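The arithmetic is easy to confirm in Julia. The second half of the sketch uses the optional second argument of `Base.hash` as a stand-in for a seed; whether `set.add!(..., seed=123)` works this way internally is an assumption, not something this notebook shows.

```julia
max_seed = typemax(UInt64)
println(big(2)^64 - 1 == max_seed)     # true: the largest 64-bit unsigned value is 2^64 - 1
println(Float64(max_seed))             # about 1.8e19

# Base.hash accepts a second UInt argument that is mixed into the result;
# it stands in here for the seed of an HllSet (an assumption).
h_default = hash("string0")
h_seeded  = hash("string0", UInt(123))
println(h_default == h_seeded)
```

Each seed effectively selects a different hash function, so the same item lands in different registers in each universe.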

If these universes are built from the same collection of original datasets but use different seed values for the hash function, they will be structurally very similar, if not nearly identical. The code above shows this for two universes, and the same holds for three or more universes.

We refer to this phenomenon as **The Entanglement of HllSets**.

**Within the framework of SGS, entanglement signifies that when identical data is fed into HllSets defined by different hash functions, and potentially different precision parameters (P), the resulting structures tend to be remarkably similar, if not identical.**

Uncovering hidden structures requires considerable effort, especially when working with very large datasets. **HllSet Entanglement provides an opportunity to "teleport" insights discovered in one SGS to another SGS that has been fed with the same or similar data**. This transfer of knowledge can occur without the need to move any data or repeat the discovery process.

### Some other HllSet operations

In [7]:
hll_diff = set.set_xor(A, B)
println("HLL xor: ", count(hll_diff))

hll_int = intersect(A, B)

println("hll_int: ", count(hll_int))

println()
println("=====================================")
hll_comp_1 = set.set_comp(A, B)
println("Comp 1: ", count(hll_comp_1))
println("A: ", count(A))

println()
println("=====================================")
hll_comp_2 = set.set_comp(B, A)
println("Comp 2: ", count(hll_comp_2))
println("B: ", count(B))

HLL xor: 4
hll_int: 9

Comp 1: 3
A: 11

Comp 2: 1
B: 10
