<a href="https://colab.research.google.com/github/dan-manolescu/data-structures-fun/blob/main/C12_Bloom_Filters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
from typing import Any, List
from collections.abc import Callable

import hashlib

A simple implementation of a Bloom filter.
bins = list of 0/1 values representing the filter's bit array
size = number of bins (the length of the bins list)
h = list of hash functions applied to every key added to the filter
k = number of hash functions (the length of h)

Each hash function takes two arguments, the key to hash and the size of the filter, and must return an index in the range 0 to size - 1.
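
As an illustration of that contract, here is a minimal sketch of one conforming hash function. The name `sha256_index` is hypothetical and not part of the notebook; it reduces a SHA-256 digest modulo the filter size so the result can be used directly as a bin index.

```python
import hashlib
from typing import Any


def sha256_index(key: Any, size: int) -> int:
    """Map `key` to a bin index in the range 0 .. size - 1 (hypothetical helper)."""
    digest = hashlib.sha256(str(key).encode('utf-8')).hexdigest()
    return int(digest, 16) % size


# The same key always maps to the same bin, and the index never leaves the bins list.
index = sha256_index('hahahaha', 10)
assert 0 <= index < 10
assert index == sha256_index('hahahaha', 10)
```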

In [16]:
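class BloomFilter:
    """A Bloom filter backed by a list of 0/1 bins."""

    def __init__(self, size: int, h: List[Callable[[Any, int], int]]):
        self.size = size        # number of bins
        self.k = len(h)         # number of hash functions
        self.bins = [0] * size  # all bins start empty
        self.h = h

    def BloomFilterInsertKey(self, key: Any) -> None:
        """Set the bin selected by each hash function to 1."""
        for hash_fn in self.h:
            self.bins[hash_fn(key, self.size)] = 1

    def BloomFilterLookup(self, key: Any) -> bool:
        """Return False if key was definitely never inserted, True if it may have been."""
        for hash_fn in self.h:
            # A single empty bin proves the key was never inserted.
            if self.bins[hash_fn(key, self.size)] == 0:
                return False
        return True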


A sample list of hash functions to use with our Bloom filter class.

In [17]:
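size = 10  # assumption: `size` was left over from an earlier cell; 10 matches the filter built below

# Caution: each lambda reads `i` when it is called, after the comprehension has
# finished (i == size - 1), so every entry computes sha256(key) % size. The list
# therefore behaves like one hash function repeated, not `size` distinct ones.
hash_list = [lambda key, size: int(hashlib.sha256(str(key).encode('utf-8')).hexdigest(), 16) % (i + 1) for i in range(size)]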

bf = BloomFilter(10, hash_list)
bf.BloomFilterInsertKey('hahahaha')
print(bf.BloomFilterLookup('cc'))
print(bf.BloomFilterLookup('hahahaha'))
bf.BloomFilterInsertKey('bau')
bf.BloomFilterInsertKey('cau')
bf.BloomFilterInsertKey('cau cau')
print(bf.bins)
print(bf.BloomFilterLookup('cau cau c'))

True
True
[0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
False
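
Because a lambda reads a loop variable when it is called rather than when it is created, hash functions built in a comprehension need that variable bound explicitly if they are to differ from one another. The sketch below shows one way to do that. The name `make_hash_list` and the salting scheme (prefixing the key with the function's number) are assumptions for illustration, not the notebook's method.

```python
import hashlib
from typing import Any, Callable, List


def make_hash_list(k: int) -> List[Callable[[Any, int], int]]:
    """Build k salted SHA-256 hash functions (hypothetical helper)."""
    def make_one(salt: int) -> Callable[[Any, int], int]:
        # `salt` is a parameter of make_one, so each returned function keeps its own value.
        def hash_fn(key: Any, size: int) -> int:
            data = f'{salt}:{key}'.encode('utf-8')
            return int(hashlib.sha256(data).hexdigest(), 16) % size
        return hash_fn
    return [make_one(salt) for salt in range(k)]


hash_fns = make_hash_list(3)
indices = [fn('hahahaha', 1024) for fn in hash_fns]
print(indices)  # three indices in 0..1023, one per salted function
```

These functions can be passed to BloomFilter in place of hash_list; the printed results will then differ from the output shown above.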


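Let's define three more hash functions. The names below are the identifiers used in the code; the first two are stand-ins rather than the algorithms they are named after:

* MurmurHash3: named after MurmurHash3, a popular non-cryptographic hash function known for its speed. The implementation below does not compute MurmurHash3; it reduces a SHA3-512 digest modulo the filter size.
* JenkinsHash3: named after Bob Jenkins's family of fast hash functions. The implementation below is a simple shift-and-XOR mixing loop, not one of the published Jenkins functions.
* crc32: the CRC-32 cyclic redundancy check, commonly used for error detection, computed with binascii.crc32.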

In [18]:
import binascii

def MurmurHash3(value: Any, size: int = 1024) -> int:
    """Stand-in for MurmurHash3: reduces the SHA3-512 digest of the value modulo size."""
    digest = hashlib.sha3_512(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % size

def JenkinsHash3(value: Any, size: int = 1024) -> int:
    """Simple shift-and-XOR mixing hash of the value, reduced modulo size."""
    hash_value = 0
    for byte in str(value).encode('utf-8'):
        hash_value = (hash_value << 8) ^ byte
        # Start at 1: a shift of 0 would XOR the value with itself and reset it to 0.
        for i in range(1, 7):
            hash_value ^= hash_value >> i
    return hash_value % size

def crc32(value: Any, size: int = 1024) -> int:
    """Calculates the CRC-32 checksum of the value, reduced modulo size."""
    return binascii.crc32(str(value).encode('utf-8')) % size
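
For comparison with JenkinsHash3 above, the sketch below implements Bob Jenkins's one-at-a-time hash, one of the published Jenkins functions, using the same (value, size) signature. The name `jenkins_one_at_a_time` is hypothetical and the function is not used elsewhere in the notebook; intermediate values are masked to 32 bits to mimic unsigned integer arithmetic.

```python
from typing import Any

MASK32 = 0xFFFFFFFF  # keep intermediate values within 32 bits


def jenkins_one_at_a_time(value: Any, size: int = 1024) -> int:
    """Jenkins one-at-a-time hash of str(value), reduced modulo size."""
    h = 0
    for byte in str(value).encode('utf-8'):
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche steps applied once, after all bytes are consumed.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h % size


# With size = 2**32 the reduction is a no-op, which exposes the raw 32-bit hash.
print(hex(jenkins_one_at_a_time('a', 2**32)))  # 0xca2e9442
```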

Example of using the three hash functions in another bloom filter.

In [21]:
bf2 = BloomFilter(1024, [MurmurHash3, JenkinsHash3, crc32])
bf2.BloomFilterInsertKey('Test string 1')
print(bf2.BloomFilterLookup('Test string 2'))
print(bf2.BloomFilterLookup('Test string 1'))
bf2.BloomFilterInsertKey('bau')
bf2.BloomFilterInsertKey('cau')
bf2.BloomFilterInsertKey('cau cau')
print(bf2.BloomFilterLookup('bau'))
print(bf2.BloomFilterLookup('cau cauu'))

False
True
True
False
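
How often a lookup for an absent key returns True depends on the filter size, the number of hash functions, and the number of inserted keys. A common estimate, assuming the hash functions are independent and uniformly distributed (which the functions above are not guaranteed to be), is (1 - e^(-kn/m))^k for m bins, k hash functions, and n inserted keys. The helper name `estimated_false_positive_rate` is hypothetical.

```python
import math


def estimated_false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false-positive probability for m bins, k hashes, n inserted keys."""
    return (1.0 - math.exp(-k * n / m)) ** k


# bf2 above: 1024 bins, 3 hash functions, 4 inserted keys.
print(estimated_false_positive_rate(1024, 3, 4))
```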
