# Data Stream Mining

Find three examples in Python of data stream mining techniques (Bloom filters,
etc.) and compare their efficiency.

For the data stream, we will generate 100 random integers between 1 and 100.

In [24]:
import random


def getStream():
    # Return 100 random integers drawn uniformly from 1..100 (inclusive).
    return [random.randint(1, 100) for _ in range(100)]

## Reservoir Sampling

Source: [GeeksforGeeks: Reservoir Sampling](https://www.geeksforgeeks.org/reservoir-sampling)

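Reservoir sampling is a technique for handling very large data sets that
typically arrive over a long period of time and are too large to fit into
computer memory. The idea is to keep a fixed-size sample (the reservoir) of _k_
items while reading the _n_ items of the stream one at a time: the first _k_
items fill the reservoir, and each later item replaces a randomly chosen
reservoir entry with a probability that shrinks as more items are seen, so that
every item ends up in the sample with the same probability _k_/_n_.

Reservoir sampling is efficient (a single pass over the stream, and memory for
only _k_ items) and crude, but it serves its purpose of maintaining a
representative sample of the stream.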
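Because the stream may be too large to hold in memory, it can help to see the same idea applied to an arbitrary iterable whose length is not known in advance. The following is a minimal sketch, not the GeeksforGeeks implementation; the function name `reservoir_sample` and its `rng` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 values from a generator that is never stored as a list.
sample = reservoir_sample((x * x for x in range(10_000)), 5)
print(sample)
```

Only the reservoir is kept in memory, so the cost is one pass over the stream and storage for _k_ items. If the stream holds fewer than _k_ items, this sketch simply returns all of them.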

In [25]:


# A utility function to print an array.
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()

# A function to randomly select k items from stream[0..n-1].
def selectKItems(stream, n, k):
    # reservoir[] is the output array. Initialize it with
    # the first k elements from stream[].
    reservoir = [0] * k
    for i in range(k):
        reservoir[i] = stream[i]

    # Iterate from the (k+1)th element to the nth element.
    # Start at index k, the first element not already in the reservoir.
    for i in range(k, n):
        # Pick a random index from 0 to i.
        j = random.randrange(i + 1)

        # If the randomly picked index is smaller than k,
        # then replace the element present at that index
        # with the new element from the stream.
        if j < k:
            reservoir[j] = stream[i]

    print("Following are 'k' - randomly selected items:")
    printArray(reservoir, k)

stream = getStream()
n = len(stream)
k = 5
selectKItems(stream, n, k)

Following are 'k' - randomly selected items:
17 62 13 44 62 


## Bloom Filter

Source: [Devopedia: Bloom Filter](https://devopedia.org/bloom-filter)

A Bloom filter compresses a set of data records into a fixed-size bit array by
hashing each record to several bit positions. It takes much less space than the
original items, and checking whether an item is in the set is much faster, but
the answer is probabilistic: the filter can report an item as present when it
was never added (a false positive), while it never reports an added item as
absent. This makes it usable for sampling large data streams, where 100%
precision is not required.
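How far from 100% the precision falls depends on the filter size, the number of hash functions, and the number of stored items. A rough sketch of that relationship, using the standard approximation for the false-positive probability, is given here. The function names are assumptions made for illustration, and the numbers (100 bits, 10 hash functions, 19 items) mirror the example filter used later in this section.

```python
import math


def false_positive_rate(size_bits, hash_count, item_count):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hash_count * item_count / size_bits)) ** hash_count


def optimal_hash_count(size_bits, item_count):
    """Hash count that minimizes the approximation above: (m / n) * ln 2."""
    return max(1, round(size_bits / item_count * math.log(2)))


m, n = 100, 19
print(f"k=10: {false_positive_rate(m, 10, n):.3f}")

best_k = optimal_hash_count(m, n)
print(f"k={best_k}: {false_positive_rate(m, best_k, n):.3f}")
```

Under this approximation, 10 hash functions on such a small filter give a false-positive rate of roughly 0.20, while about 4 hash functions would bring it down to roughly 0.08. More hash functions are not always better, because each one sets additional bits and fills the array faster.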

The Bloom filter has a slightly different purpose than reservoir sampling.
The two are sometimes used together (sampling to maintain an up-to-date sample,
and the Bloom filter to check that the sample does not contain repeated items).

The Bloom filter is very space-efficient. From mere bits of data it can detect,
with very high probability, whether an entry is present in a sample. It is
especially useful in data layers and IoT, where costly data requests can be
avoided by a fast and inexpensive check of the bit string for whether such a
request is necessary. Sometimes a plain boolean response is enough (in some
preliminary security check inside an IoT device, for example).
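The idea of guarding a costly request with a cheap bit-string check can be sketched with the standard library alone. Everything in this block is hypothetical and for illustration only: the `TinyBloom` class, the `fetch_if_possible` helper, the `sensor-*` keys, and the choice of salted SHA-256 as the hash are assumptions, not part of the sources cited in this section.

```python
import hashlib


class TinyBloom:
    """Hypothetical stdlib-only Bloom filter, for illustration."""

    def __init__(self, size_bits, hash_count):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _indexes(self, item):
        # Derive hash_count bit positions by salting SHA-256 with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # True only if every bit position for this item is set.
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item)
        )


def fetch_if_possible(key, known_keys, costly_lookup):
    # Skip the costly request when the filter says the key is definitely absent.
    if key not in known_keys:
        return None
    return costly_lookup(key)


# Usage: 1024 bits (128 bytes) stand in for the full set of known keys.
known = TinyBloom(size_bits=1024, hash_count=4)
for key in ("sensor-1", "sensor-2"):
    known.add(key)

calls = []


def costly_lookup(key):
    calls.append(key)  # Record that the expensive path was taken.
    return key.upper()


print(fetch_if_possible("sensor-1", known, costly_lookup))
```

A key that was added always reaches `costly_lookup`. A key that was never added is usually rejected by the filter without any request, although a false positive can occasionally let one through.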

In [26]:
# Source: https://www.kdnuggets.com/2016/08/gentle-introduction-bloom-filter.html
# Accessed 2020-05-11

from bitarray import bitarray
import mmh3

class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1

        return self

    def __contains__(self, item):
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False

        return out

def main():
    bloom = BloomFilter(100, 10)
    animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
               'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
               'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
    # First insertion of animals into the bloom filter
    for animal in animals:
        bloom.add(animal)

    # Membership existence for already inserted animals
    # There should not be any false negatives
    for animal in animals:
        if animal in bloom:
            print('{} is in bloom filter as expected'.format(animal))
        else:
            print('Something went terribly wrong for {}'.format(animal))
            print('FALSE NEGATIVE!')

    # Membership existence for not inserted animals
    # There could be false positives
    other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                     'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                     'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                     'hawk' ]
    for other_animal in other_animals:
        if other_animal in bloom:
            print('{} is not in the bloom, but a false positive'.format(other_animal))
        else:
            print('{} is not in the bloom filter as expected'.format(other_animal))

main()

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom, but a false positive
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the blo