# HyperLogLog

> Count billions of items in a data stream while using only a few kilobytes of memory

HyperLogLog is an algorithm that estimates the number of unique items in a data stream by counting the leading zeroes in a 64-bit hash of each element. A user-defined number $p$, the number of leading hash bits used to select a counting bucket, sets the precision of the estimate.

In [38]:
import(
  "fmt"
  "flag"
  "errors"
  "os"
  "log"
  "bufio"
  "strings"
  "math"
  _ "bytes"
  _ "encoding/binary"
)

In [24]:
import(
    "github.com/spaolacci/murmur3"
)

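// string2hash hashes a string into a 64-bit value.
func string2hash(data string) uint64 {
    return hasher([]byte(data))
}

// hasher returns the first 64 bits of the 128-bit MurmurHash3 digest of data.
func hasher(data []byte) uint64 {
    h := murmur3.New128() // named h so it does not shadow the function itself
    h.Write(data)
    h1, _ := h.Sum128()
    return h1
}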

The first $p$ bits of the hash determine which counting bucket is updated, so we maintain $2^p$ buckets in memory and use those bits as the bucket index. When a new item arrives in the data stream, its hash is shifted left by $p$ bits, and the number of leading zeroes in the remaining $64 - p$ bits, plus one, becomes the candidate value for that bucket. A bucket is only updated when the candidate is higher than the value it already holds.

For example, imagine the following structure when $p = 4$:
```
0001: 1
0011: 2
0010: 1
0111: 5
...
```
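The split between bucket index and leading-zero count can be sketched as a standalone program. `splitHash` is a hypothetical helper introduced only for this illustration (it is not part of the notebook's `HLL` type), and the input hash is hand-built rather than produced by MurmurHash3.

```go
package main

import (
	"fmt"
	"math/bits"
)

// splitHash is a hypothetical helper for illustration only.
// It returns the bucket index taken from the top p bits of hash, and the
// rank: the number of leading zeros in the remaining bits, plus one.
func splitHash(hash uint64, p uint) (index uint64, rank uint8) {
	index = hash >> (64 - p)
	tail := hash << p // shift out the index bits; the low p bits become zero
	rank = uint8(bits.LeadingZeros64(tail)) + 1
	return index, rank
}

func main() {
	// Hand-built hash whose top byte is 0111 0001 and whose other bits are zero.
	// With p = 4 the index bits are 0111 and the tail starts with 0001.
	hash := uint64(0x71) << 56
	index, rank := splitHash(hash, 4)
	fmt.Printf("bucket %04b, rank %d\n", index, rank) // prints "bucket 0111, rank 4"
}
```

Three leading zeros in the tail give a rank of 4, so bucket `0111` would be raised to 4 only if it currently holds a smaller value.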

Taking the harmonic mean of $2^{c}$ over the bucket values $c$, then scaling it by the number of buckets and a bias-correction constant, gives us an estimate of the total number of unique items in the stream.
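Written out, the estimate takes the following form. This is the raw HyperLogLog estimator as it is usually stated, without the small-range and large-range corrections; $M[j]$ is a symbol introduced here for the value stored in bucket $j$, $m$ is the number of buckets, and $\alpha_m$ is the bias-correction constant.

$$
E = \alpha_m \, m^2 \left( \sum_{j=0}^{m-1} 2^{-M[j]} \right)^{-1}, \qquad m = 2^p
$$

The sum runs over every bucket, so a bucket that was never updated ($M[j] = 0$) still contributes $2^{0} = 1$.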

In [47]:
import (
    "math/bits"
)
// m = number of buckets (registers), always 2^p
// p = number of leading hash bits used to select a bucket
// leading zeros in the remaining bits are used to estimate cardinality
// buckets = registers, a slice of length m
type HLL struct {
  m uint32
  p uint
  // each bucket holds the maximum (leading zeros + 1) observed in the 64-bit hash
  // once the 4..16 index bits are shifted out, so uint8 cannot overflow
  buckets []uint8
  alpha float64

  cardinality uint64
}

// InitHLL creates a new HLL instance.
// The user should choose between 4 and 16 significant bits.
// Each additional bit doubles the number of buckets:
// 16 = 65536, 15 = 32768, 14 = 16384, etc.
// x << y == x * 2**y
func InitHLL(precision uint) (h *HLL, err error) {
    if precision < 4 || precision > 16 {
        return nil, errors.New("precision must be integer between 4 and 16")
    }
    h = &HLL{
        p:  precision,
        m:  1 << precision,
    }
    h.buckets = make([]uint8, h.m)
    // Bias-correction constant alpha, tabulated for small m
    switch h.m {
    case 16:
        h.alpha = 0.673
    case 32:
        h.alpha = 0.697
    case 64:
        h.alpha = 0.709
    default:
        h.alpha = 0.7213 / ( 1 + 1.079 / float64(h.m) )
    }
    return h, nil
}

func (h *HLL) AddString(item string) *HLL {
    return h.Add(string2hash(item))
}

// Add adds a previously hashed item into the HLL registers.
// count (often called rho) is the number of leading zeros in the bits
// that remain after the index bits are shifted out, plus one.
func (h *HLL) Add(hash uint64) *HLL {
    diff := 64 - h.p
    index := hash >> diff
    tail := hash << h.p
    count := uint8(bits.LeadingZeros64(tail)) + 1

    if count > h.buckets[index] {
        h.buckets[index] = count
    }
    return h
}

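// Count computes the raw estimate and stores it in h.cardinality.
// Empty buckets contribute 2^0 = 1 to the sum, so they are not skipped;
// skipping them would also divide by zero on an empty HLL.
func (h *HLL) Count() *HLL {
    Z := float64(0)
    for _, c := range h.buckets {
        Z += 1 / math.Pow(2, float64(c))
    }
    Z = 1 / Z
    count := h.alpha * math.Pow(float64(h.m), 2) * Z
    h.cardinality = uint64(math.Floor(count))
    return h
}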

In [48]:
import(
    "os"
)
src := "/home/jovyan/data/SDM/logs/apache.log"
precision := 10 // not named "bits", which would shadow the math/bits package
hll, _ := InitHLL(uint(precision))

var split []string
words := make(map[string]uint64)
distinct := uint64(0)

file, err := os.Open(src)
if err != nil {
    log.Fatal(err)
}

scanner := bufio.NewScanner(file)
for scanner.Scan() {
    split = strings.Split(scanner.Text(), " ")
    for _, word := range split {
        hll.AddString(word)

        // traditional counting for comparison
        if _, ok := words[word]; ok {
            words[word]++
        } else {
            words[word] = 1
            distinct++
        }
    }
}
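file.Close()
// a scanner stops silently on read errors, so check before trusting the counts
if err := scanner.Err(); err != nil {
    log.Fatal(err)
}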

hll.Count()
mistake := 100 - (( float64( hll.cardinality ) / float64( distinct ) ) * 100)

In [49]:
fmt.Println("Actual distinct:", distinct)
fmt.Println("Estimated distinct:", hll.cardinality)
fmt.Println("Calculated error:", mistake, "%")
fmt.Println("HLL with ", hll.m, "8bit counters takes about", float64( hll.m )/1024, "kilobytes")

Actual distinct: 28680
Estimated distinct: 28463
Calculated error: 0.7566248256624704 %
HLL with  1024 8bit counters takes about 1 kilobytes
