# HllSet Relational Algebra

The HyperLogLog (HLL) algorithm [2] is based on the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated from the maximum number of trailing zeros in the binary representation of each number in the set. If the maximum number of trailing zeros observed is $n$, then the number of distinct elements in the set is estimated as $2^n$ [1].
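
The observation above can be tried out in a few lines of Julia. This is a minimal sketch, not the HllSets implementation: `naive_estimate` is a hypothetical helper, and Julia's built-in `hash` stands in for a source of uniformly distributed numbers.

```julia
# Hypothetical helper (not part of HllSets): single-register estimate.
function naive_estimate(items)
    n = 0
    for x in items
        h = hash(x)                       # UInt hash from Base
        h == 0 && continue                # an all-zero hash has no 1-bit to stop at
        n = max(n, trailing_zeros(h))     # longest run of trailing zeros so far
    end
    return 2.0^n                          # the estimate is 2^n
end

words = ["item$i" for i in 1:1000]
est = naive_estimate(words)
println("estimate: ", est)

# Duplicates do not change the estimate, since only the maximum is kept.
println(naive_estimate(vcat(words, words)) == est)   # true
```

The result is always a power of two, so a single estimate of this kind can be far from the true count.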

However, the variance of this estimation can be significant. To address this issue, the HLL algorithm divides the multiset into multiple subsets. It calculates the maximum number of trailing zeros in the numbers within each subset and then combines these estimates using a harmonic mean to provide an overall estimate of the cardinality of the entire set.
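
The splitting and averaging step can be sketched as follows. The raw estimator from [2] is $E = \alpha_m m^2 / \sum_{j=1}^{m} 2^{-M_j}$, where $m$ is the number of subsets and $M_j$ is the rank stored for subset $j$ (the number of zeros plus one). The function names and the use of the low `P` hash bits to pick a subset are assumptions made for this illustration; the small-range and large-range corrections of the full algorithm are omitted.

```julia
# Illustrative sketch only; `hll_registers` and `hll_estimate` are hypothetical names.
function hll_registers(items; P::Int = 5)
    m = 1 << P                                   # number of subsets (registers)
    registers = zeros(Int, m)
    for x in items
        h = hash(x)
        bucket = Int(h & UInt(m - 1)) + 1        # low P bits select the subset
        rest = h >> P                            # remaining bits feed the zero count
        # rank = trailing zeros + 1, following the convention in [2]
        rank = rest == 0 ? (8 * sizeof(h) - P + 1) : trailing_zeros(rest) + 1
        registers[bucket] = max(registers[bucket], rank)
    end
    return registers
end

function hll_estimate(registers)
    m = length(registers)
    # Bias-correction constants from [2]
    alpha = m == 16 ? 0.673 : m == 32 ? 0.697 : m == 64 ? 0.709 :
            0.7213 / (1 + 1.079 / m)
    # Normalized harmonic mean of 2^M_j over all registers
    return alpha * m^2 / sum(2.0^(-r) for r in registers)
end

items = ["user$i" for i in 1:100_000]
est = hll_estimate(hll_registers(items; P = 10))
println("true: ", length(items), "  estimated: ", round(Int, est))
```

With more registers the relative error shrinks, at the cost of a larger structure.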

The HLL data structure is represented as a k-tuple $t = (n_1, n_2, \ldots, n_i, \ldots, n_k)$, where $n_i$ is the maximum number of trailing zeros observed in the i-th subset of the multiset. This structure allows two or more HLLs to be merged losslessly by taking the element-wise maximum: the resulting HLL is identical to the one computed on the union of the original datasets and therefore yields the same cardinality estimate.
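
The merge property can be checked directly. In this sketch, `max_zero_registers` and `merge_registers` are hypothetical helpers written for the illustration (they are not taken from `src/sets32.jl`), and the merge is assumed to be the element-wise maximum of the two tuples.

```julia
# Hypothetical helper: one register per subset holding the maximum trailing-zero count.
function max_zero_registers(items; P::Int = 5)
    regs = zeros(Int, 1 << P)
    for x in items
        h = hash(x)
        i = Int(h & UInt((1 << P) - 1)) + 1      # low P bits select the subset
        regs[i] = max(regs[i], trailing_zeros(h >> P))
    end
    return regs
end

# Merging two tuples: keep the larger value in every position.
merge_registers(t1, t2) = max.(t1, t2)

d1 = ["a$i" for i in 1:500]
d2 = ["b$i" for i in 1:700]

merged   = merge_registers(max_zero_registers(d1), max_zero_registers(d2))
combined = max_zero_registers(vcat(d1, d2))      # registers built from the union

println(merged == combined)   # true
```

Because the two tuples are identical, any estimator applied to them returns the same cardinality.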

The HLL structure does not support other set operations, such as intersection. It can be extended, however, by replacing the maximum number of zeros in each position of the tuple t with a bit-vector that records every trailing-zero count observed in the corresponding subset. This extended structure, which we call HllSets (HyperLogLog Sets), enables the implementation of all set operations.
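
The bit-vector idea can be sketched as follows. This is an illustration under stated assumptions rather than the code in `src/sets32.jl`: each register is a `UInt64` mask, bit `z` is set when an element with `z` trailing zeros falls into that subset, and all function names are hypothetical.

```julia
# Illustrative only: one UInt64 bitmask per subset.
function bit_registers(items; P::Int = 5)
    regs = zeros(UInt64, 1 << P)
    for x in items
        h = hash(x)
        i = Int(h & UInt((1 << P) - 1)) + 1
        z = min(trailing_zeros(h >> P), 63)      # clamp so the shift stays in range
        regs[i] |= UInt64(1) << z                # record this zero count
    end
    return regs
end

# Set operations become bitwise operations on matching registers.
bit_union(a, b)     = a .| b
bit_intersect(a, b) = a .& b
bit_diff(a, b)      = map((x, y) -> x & ~y, a, b)

# The classic HLL tuple is recoverable: position of the highest set bit (0 if empty).
max_zeros(regs) = [r == 0 ? 0 : 63 - leading_zeros(r) for r in regs]

xs = ["string$i" for i in 0:10]
ys = ["string$i" for i in 3:11]
a, b = bit_registers(xs), bit_registers(ys)

println(bit_union(a, b) == bit_registers(vcat(xs, ys)))               # true
println(max_zeros(bit_union(a, b)) == max.(max_zeros(a), max_zeros(b)))  # true
```

The union is exact at the register level. The bitwise intersection always contains every bit contributed by the shared elements, but it may also keep bits that two different elements happened to set, so it can overestimate the true intersection.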

# References
1. https://en.wikipedia.org/wiki/HyperLogLog
2. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
3. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
4. https://redis.io/docs/data-types/probabilistic/hyperloglogs/
5. https://github.com/ascv/HyperLogLog/blob/master/README.md
6. https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle
7. https://en.wikipedia.org/wiki/Algebra_of_sets
8. https://github.com/alexmy21/lisa_meta/blob/main/lisa.ipynb


In [None]:
# using Pkg
# Pkg.activate(".")
# Pkg.instantiate()
# Pkg.add("CSV")
# Pkg.add("Arrow")
# Pkg.add("Tables")
# Pkg.add("JSON3")

# Creating HllSets and applying basic operations

In [None]:
using Random
using FilePathsBase: extension, Path

include("src/sets32.jl")

import .HllSets as set

# Initialize test HllSets
hll1 = set.HllSet{5}(); hll1_seeded = set.HllSet{5}()
hll2 = set.HllSet{5}(); hll2_seeded = set.HllSet{5}()
hll3 = set.HllSet{5}(); hll3_seeded = set.HllSet{5}()
hll4 = set.HllSet{5}(); hll4_seeded = set.HllSet{5}()
hll5 = set.HllSet{5}(); hll5_seeded = set.HllSet{5}()

# Generate datasets from random strings
s1 = Set(randstring(7) for _ in 1:10)
s2 = Set(randstring(7) for _ in 1:15)
s3 = Set(randstring(7) for _ in 1:100)
s4 = Set(randstring(7) for _ in 1:20)
s5 = Set(randstring(7) for _ in 1:130)

# Add datasets to HllSets
set.add!(hll1, s1); set.add!(hll1_seeded, s1, seed=123)
set.add!(hll2, s2); set.add!(hll2_seeded, s2, seed=123)
set.add!(hll3, s3); set.add!(hll3_seeded, s3, seed=123)
set.add!(hll4, s4); set.add!(hll4_seeded, s4, seed=123)
set.add!(hll5, s5); set.add!(hll5_seeded, s5, seed=123)

println(hll1.counts, "\n", count(hll1))
println(hll1_seeded.counts, "\n", count(hll1_seeded))

println("Size of hll1: ", set.sizeof(hll1), "; \nSize of hll1_seeded: ", set.sizeof(hll1_seeded))

In [None]:
# Print cardinality of datasets and HllSets side by side
println(length(s1), " : ", count(hll1))
println(length(s2), " : ", count(hll2))
println(length(s3), " : ", count(hll3))
println(length(s4), " : ", count(hll4))
println(length(s5), " : ", count(hll5))

# union
println("\nunion:\n", length(s1 ∪ s2 ∪ s3 ∪ s4 ∪ s5), " : ", count(hll1 ∪ hll2 ∪ hll3 ∪ hll4 ∪ hll5), "\n")

# intersection
println("intersection (standard HllSet with seeded):\n", count(hll1 ∩ hll1_seeded))


# HllSet Universes

In [None]:
A = set.HllSet{5}(); A_seeded = set.HllSet{5}()
B = set.HllSet{5}(); B_seeded = set.HllSet{5}()
C = set.HllSet{5}(); C_seeded = set.HllSet{5}()

items_t1 = Set(["string0", "string1", "string2", "string3", "string4", "string5", "string6", "string7", "string8", "string9", "string10"])
items_t2 = Set(["string3", "string4", "string5", "string6", "string7", "string8", "string9", "string10", "string11"])
items_t3 = Set(["string5", "string6", "string7", "string8", "string9", "string10", "string11"])

set.add!(A, items_t1); set.add!(A_seeded, items_t1, seed=123)
set.add!(B, items_t2); set.add!(B_seeded, items_t2, seed=123)
set.add!(C, items_t3); set.add!(C_seeded, items_t3, seed=123)

In [None]:

# Default and seeded HllSet Universes
U = A ∪ B ∪ C; U_seeded = A_seeded ∪ B_seeded ∪ C_seeded

println("A: ", count(A)); println("A_seeded: ", count(A_seeded))
println("B: ", count(B)); println("B_seeded: ", count(B_seeded))
println("C: ", count(C)); println("C_seeded: ", count(C_seeded))
println("U: ", count(U)); println("U_seeded: ", count(U_seeded))

println("AB = A ∩ B: ", count(A ∩ B)); println("AB_seeded = A_seeded ∩ B_seeded: ", count(A_seeded ∩ B_seeded))
println("AC = A ∩ C: ", count(A ∩ C)); println("AC_seeded = A_seeded ∩ C_seeded: ", count(A_seeded ∩ C_seeded))
println("BC = B ∩ C: ", count(B ∩ C)); println("BC_seeded = B_seeded ∩ C_seeded: ", count(B_seeded ∩ C_seeded))

In [None]:
# Probabilities
println("P(A) = |A| / |U|: ", count(A) / count(U)); println("P(A_seeded) = |A_seeded| / |U_seeded|: ", count(A_seeded) / count(U_seeded))
println("P(B) = |B| / |U|: ", count(B) / count(U)); println("P(B_seeded) = |B_seeded| / |U_seeded|: ", count(B_seeded) / count(U_seeded))
println("P(C) = |C| / |U|: ", count(C) / count(U)); println("P(C_seeded) = |C_seeded| / |U_seeded|: ", count(C_seeded) / count(U_seeded), "\n")

# Conditional Probabilities
println("P(A | B) = |AB| / |B|: ", count(A ∩ B) / count(B)); println("P(A_seeded | B_seeded) = |AB_seeded| / |B_seeded|: ", count(A_seeded ∩ B_seeded) / count(B_seeded))
println("P(B | A) = |AB| / |A|: ", count(A ∩ B) / count(A)); println("P(B_seeded | A_seeded) = |AB_seeded| / |A_seeded|: ", count(A_seeded ∩ B_seeded) / count(A_seeded))
println("P(A | C) = |AC| / |C|: ", count(A ∩ C) / count(C)); println("P(A_seeded | C_seeded) = |AC_seeded| / |C_seeded|: ", count(A_seeded ∩ C_seeded) / count(C_seeded))
println("P(C | A) = |AC| / |A|: ", count(A ∩ C) / count(A)); println("P(C_seeded | A_seeded) = |AC_seeded| / |A_seeded|: ", count(A_seeded ∩ C_seeded) / count(A_seeded), "\n")

println("P(B | C) = |BC| / |C|: ", count(B ∩ C) / count(C))
println("P(C | B) = |BC| / |B|: ", count(B ∩ C) / count(B), "\n")

# Symmetric difference (XOR) of A and B
hll_diff = set.set_xor(A, B)
println("HLL xor: ", count(hll_diff))

# Intersection of A and B (same as A ∩ B)
hll_int = intersect(A, B)
println("hll_int: ", count(hll_int))

println()
println("=====================================")
hll_comp_1 = set.set_comp(A, B)
println("Comp 1: ", count(hll_comp_1))
println("A: ", count(A))

println()
println("=====================================")
hll_comp_2 = set.set_comp(B, A)
println("Comp 2: ", count(hll_comp_2))
println("B: ", count(B))
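
For reference, the exact values that the estimates above approximate can be computed with Julia's built-in `Set`. The sketch below rebuilds the three item sets under new names (`A_exact`, `B_exact`, `C_exact`, chosen here for illustration) and uses only `Base` functions. Reading `set_xor` as a symmetric difference and `set_comp` as a relative complement is an assumption based on the function names.

```julia
# Exact counterparts of items_t1, items_t2, items_t3
A_exact = Set("string$i" for i in 0:10)
B_exact = Set("string$i" for i in 3:11)
C_exact = Set("string$i" for i in 5:11)
U_exact = A_exact ∪ B_exact ∪ C_exact

println("|U| = ", length(U_exact))                                    # 12
println("|A ∩ B| = ", length(A_exact ∩ B_exact))                      # 8
println("P(A | B) = ", length(A_exact ∩ B_exact) // length(B_exact))  # 8//9
println("P(B | A) = ", length(A_exact ∩ B_exact) // length(A_exact))  # 8//11
println("P(A | C) = ", length(A_exact ∩ C_exact) // length(C_exact))  # 6//7
println("P(B | C) = ", length(B_exact ∩ C_exact) // length(C_exact))  # 1//1

# Assumed counterparts of set_xor and set_comp
println("|A xor B| = ", length(symdiff(A_exact, B_exact)))            # 4
println("|A \\ B| = ", length(setdiff(A_exact, B_exact)))             # 3
println("|B \\ A| = ", length(setdiff(B_exact, A_exact)))             # 1
```

With only 32 registers (`HllSet{5}`) and about ten items per set, the HllSet counts can differ noticeably from these exact values.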