# Cuckoo Filter 

<img src="images/03.svg">

### Introduction and Post Summary

Probabilistic data structures store data compactly, with a low memory footprint, and provide approximate answers to queries about the stored data. They are designed to answer queries in a space-efficient manner, which can mean sacrificing accuracy. However, they typically provide guarantees and bounds on the error rates, depending on the specifications of the data structure in question. Because of their low memory footprint, probabilistic data structures are particularly useful in streaming and low-power settings. They can also be extremely useful in big data situations such as counting views on a video or maintaining a list of unique tweets seen in the past. At Fast Forward Labs, probabilistic data structures were explored in the report "Probabilistic Methods for Real-time Streams." In this post, we provide an update: we take a look at Cuckoo filters, a [recently](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) introduced probabilistic data structure. The Cuckoo filter was designed as an improvement upon the standard Bloom filter. We provide a Python implementation of the Cuckoo filter and compare it to a counting Bloom filter, a Bloom filter variant.

### Bloom filter

[Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) are a popular probabilistic data structure that allows space-efficient testing of set membership. For example, when monitoring a real-time stream of tweets, a Bloom filter can test whether a tweet has been encountered before. Bloom filters use hash functions to compactly encode each item as a set of integers, which serve as indices into a bit array; inserting an item sets the bits at those indices. To test whether an item has been seen before, the item is hashed to produce its set of indices, and each index is checked to see if it is set. Since multiple items can hash to the same indices, a membership test returns either "no" or "maybe"; hence, Bloom filters can report false positives but never false negatives.
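
To make the set-and-test procedure concrete, here is a minimal sketch of a Bloom filter. It is illustrative only and is not the implementation used in this post: the class name `SimpleBloomFilter`, the default sizes, and the use of salted SHA-256 from the standard library (in place of a faster non-cryptographic hash) are all assumptions made for the example.

```python
import hashlib


class SimpleBloomFilter:
    """Illustrative Bloom filter; names and default sizes are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indices(self, item):
        # Derive num_hashes indices by salting a SHA-256 hash with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], byteorder="big") % self.num_bits

    def add(self, item):
        # Inserting an item sets the bit at each of its indices.
        for index in self._indices(item):
            self.bits[index] = 1

    def __contains__(self, item):
        # True means "maybe present"; False means "definitely not present".
        return all(self.bits[index] for index in self._indices(item))


seen_tweets = SimpleBloomFilter()
seen_tweets.add("hello world")
print("hello world" in seen_tweets)  # True: an added item is always reported
```

Note that there is no way to unset a bit safely here, because another item may rely on the same bit.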
 
 
Traditional Bloom filters do not support deletions, since hashing is lossy and irreversible; deleting an item requires rebuilding the entire filter. However, one often needs to delete items seen in the past. In the tweet streaming case mentioned above, for example, one might want to delete certain tweets that were seen earlier. The counting Bloom filter was introduced to support deletions. It extends the buckets of a traditional Bloom filter from single bits to n-bit counters: on insertion, the counters at an item's indices are incremented rather than set, and on deletion they are decremented.
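
The counter-based variant can be sketched in the same spirit. Again, this is a hedged illustration rather than the `CountingBloomFilter` used later in this post: the class name, the sizes, and the salted SHA-256 hashing are assumptions, and plain Python integers stand in for fixed-width n-bit counters (which would saturate in a real implementation).

```python
import hashlib


class SimpleCountingBloomFilter:
    """Illustrative counting Bloom filter; names and sizes are assumptions."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _indices(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], byteorder="big") % self.num_counters

    def add(self, item):
        # Increment a counter instead of setting a bit.
        for index in self._indices(item):
            self.counters[index] += 1

    def remove(self, item):
        # Only decrement when the item appears present, so counters stay non-negative.
        if item not in self:
            return False
        for index in self._indices(item):
            self.counters[index] -= 1
        return True

    def __contains__(self, item):
        return all(self.counters[index] > 0 for index in self._indices(item))


tweets = SimpleCountingBloomFilter()
tweets.add("first tweet")
tweets.add("second tweet")
tweets.remove("first tweet")
print("second tweet" in tweets)  # True: its counters are still positive
```

Removing an item that was never added can still corrupt the filter if that item happens to be a false positive, which is one reason deletions should only be issued for items known to have been inserted.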

<img src="images/14.svg">

### Cuckoo filter

The cuckoo filter is an alternative to the Bloom filter when one requires support for deletions. Cuckoo filters were introduced in [2014](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) by Fan et al. They provide insert, delete, and lookup capabilities, similar to the counting Bloom filter. However, cuckoo filters are backed by a different underlying data structure, and the insertion procedure is different as well. Here, we provide a quick overview of the cuckoo filter.

The cuckoo filter consists of a [cuckoo hash table](http://web.stanford.edu/class/archive/cs/cs166/cs166.1146/lectures/13/Small13.pdf) that stores the 'fingerprints' of inserted items. The fingerprint of an item is a bit string derived from the hash of that item. A cuckoo hash table consists of an array of buckets, where an item to be inserted is mapped to two possible buckets based on two hash functions. Each bucket can be configured to store a variable number of fingerprints. Typically, a cuckoo filter is identified by its fingerprint size and bucket size. For example, a (2,4) cuckoo filter stores 2-bit fingerprints, and each bucket in the cuckoo hash table can store up to 4 fingerprints. Following the above [paper](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), we implemented the cuckoo filter in Python [link to final repo]. We also implemented a counting Bloom filter in order to make performance comparisons. Below, we initialize both data structures and test simple inserts and deletions.

In [None]:
from cuckoofilter import CuckooFilter
c_filter = CuckooFilter(10000, 2) #specify capacity and fingerprint size

In [None]:
c_filter.insert("James")
print("James" in c_filter)

c_filter.remove("James")
print("James" in c_filter)

In [None]:
from cuckoofilter import CountingBloomFilter
b_filter = CountingBloomFilter(10000) #specify the capacity of a counting bloom filter

b_filter.add("James")
print("James" in b_filter)

b_filter.remove("James")
print("James" in b_filter)

### Inserting into a cuckoo filter

The cuckoo filter supports three key operations: insert, delete, and lookup. Of these, the insert operation is the most involved. To insert an item into the cuckoo filter, one derives two indices from the item by hashing the item and its fingerprint. One then inserts the item's fingerprint into one of the two corresponding buckets. As the cuckoo hash table begins to fill up, one can encounter a situation where both buckets available to an item are already full. In this case, fingerprints currently in the cuckoo hash table are moved to their alternative buckets to free up space for the new item. Because every fingerprint always lives in one of its two candidate buckets, one can easily delete an item from the table by looking up its fingerprint in those two buckets and deleting it if present. To make the insertion procedure more concrete, we provide code below implementing the insertion procedure.

In [None]:
#example function to demonstrate how to insert into a cuckoo filter. 

import mmh3

def obtain_indices_from_item(item_to_insert, fingerprint_size, capacity):
    
    #hash the string item (mmh3.hash_bytes returns a 16-byte digest)
    hash_value = mmh3.hash_bytes(item_to_insert)
    
    #keep the first fingerprint_size bytes of the hash as the fingerprint
    fingerprint = hash_value[:fingerprint_size]
    
    #derive the first index from the full hash
    index_1 = int.from_bytes(hash_value, byteorder="big")
    index_1 = index_1 % capacity
    
    #derive an index from the hash of the fingerprint
    hashed_fingerprint = mmh3.hash_bytes(fingerprint)
    finger_print_index = int.from_bytes(hashed_fingerprint, byteorder="big")
    finger_print_index = finger_print_index % capacity
    
    #second index -> first_index xor index derived from hash(fingerprint)
    index_2 = index_1 ^ finger_print_index
    index_2 = index_2 % capacity
    
    return index_1, index_2, fingerprint

def insert_into_table(table, index_1, index_2, fingerprint, bucket_capacity):
    #try the first candidate bucket
    if len(table[index_1]) < bucket_capacity:
        table[index_1].append(fingerprint)
        return table, index_1

    #otherwise try the second candidate bucket
    if len(table[index_2]) < bucket_capacity:
        table[index_2].append(fingerprint)
        return table, index_2

    #Both buckets are full. A complete implementation would now move existing
    #fingerprints to their alternative buckets; this crude example gives up instead.
    return table, None

In [None]:
#let's create a crude cuckoo hashtable
capacity = 10 #number of buckets in our cuckoo hashtable
bucket_capacity = 4 #maximum number of fingerprints per bucket
table = [[] for _ in range(capacity)]

In [None]:
#obtain possible indices
index_1, index_2, fp = obtain_indices_from_item("James", 2, capacity)

In [None]:
table, _ = insert_into_table(table, index_1, index_2, fp, bucket_capacity)

In [None]:
table
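
The crude example above leaves out the eviction step, where resident fingerprints are moved to their alternative buckets. A minimal sketch of how that step, together with lookup and deletion, could look is given below. It is not the `CuckooFilter` class from this post: the name `TinyCuckooFilter`, the `max_kicks` limit, and the use of SHA-256 instead of `mmh3` are assumptions made to keep the example self-contained. It also assumes the number of buckets is a power of two, so that XOR-ing an index with the hashed fingerprint always gives a valid index and applying it twice returns the original bucket.

```python
import hashlib
import random


class TinyCuckooFilter:
    """Illustrative cuckoo filter with eviction; all names are assumptions."""

    def __init__(self, num_buckets=16, bucket_capacity=4, fingerprint_size=2, max_kicks=500):
        # Power-of-two bucket count keeps the XOR trick reversible.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        self.fingerprint_size = fingerprint_size  # in bytes
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint_and_index(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        fingerprint = digest[:self.fingerprint_size]
        index = int.from_bytes(digest[8:16], byteorder="big") % self.num_buckets
        return fingerprint, index

    def _alternate_index(self, index, fingerprint):
        hashed = int.from_bytes(hashlib.sha256(fingerprint).digest()[:8], byteorder="big")
        return (index ^ hashed) % self.num_buckets

    def insert(self, item):
        fingerprint, index_1 = self._fingerprint_and_index(item)
        index_2 = self._alternate_index(index_1, fingerprint)
        for index in (index_1, index_2):
            if len(self.buckets[index]) < self.bucket_capacity:
                self.buckets[index].append(fingerprint)
                return True
        # Both buckets are full: swap with a random resident and relocate it.
        index = random.choice((index_1, index_2))
        for _ in range(self.max_kicks):
            slot = random.randrange(len(self.buckets[index]))
            fingerprint, self.buckets[index][slot] = self.buckets[index][slot], fingerprint
            index = self._alternate_index(index, fingerprint)
            if len(self.buckets[index]) < self.bucket_capacity:
                self.buckets[index].append(fingerprint)
                return True
        # Give up: the filter is treated as full and the fingerprint in hand is not stored.
        return False

    def __contains__(self, item):
        fingerprint, index_1 = self._fingerprint_and_index(item)
        index_2 = self._alternate_index(index_1, fingerprint)
        return fingerprint in self.buckets[index_1] or fingerprint in self.buckets[index_2]

    def remove(self, item):
        fingerprint, index_1 = self._fingerprint_and_index(item)
        index_2 = self._alternate_index(index_1, fingerprint)
        for index in (index_1, index_2):
            if fingerprint in self.buckets[index]:
                self.buckets[index].remove(fingerprint)
                return True
        return False


tiny = TinyCuckooFilter()
tiny.insert("James")
print("James" in tiny)  # True
tiny.remove("James")
print(sum(len(bucket) for bucket in tiny.buckets))  # 0
```

The alternative bucket is computed from the current bucket and the fingerprint alone, which is what allows a resident fingerprint to be relocated without access to the original item.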

## Benchmarking against the counting Bloom filter

Now we compare the Cuckoo filter to the counting Bloom filter. A critical metric for probabilistic data structures like the Bloom and Cuckoo filters is the false positive rate. As the insertion section suggests, comparing the two can be tricky given the difference in their internal workings. To compare false positive rates fairly, we fix the space at 0.1MB for both structures. We then vary the capacities of both structures and observe the false positive rate. Below we show a graph of the false positive rate versus the capacity for both structures.
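
The benchmark code itself is not reproduced in this post. As a rough sketch of how a false positive rate can be measured empirically, one can insert a set of known members and then probe with items that were never inserted; every positive answer is then a false positive. The helper name `empirical_false_positive_rate` and the two stand-in structures below are assumptions for illustration, not the benchmark used to produce the graph.

```python
import hashlib


def empirical_false_positive_rate(insert, contains, num_inserted, num_probes):
    """Insert known members, then probe with items that were never inserted."""
    for i in range(num_inserted):
        insert(f"member-{i}")
    # Every probe is a non-member, so each positive answer is a false positive.
    false_positives = sum(1 for i in range(num_probes) if contains(f"non-member-{i}"))
    return false_positives / num_probes


# Sanity check with an exact structure: a Python set never reports a false positive.
exact = set()
exact_rate = empirical_false_positive_rate(exact.add, exact.__contains__, 1000, 1000)
print(exact_rate)  # 0.0


# Hypothetical lossy stand-in: remember only the first byte of each item's hash.
def one_byte_fingerprint(item):
    return hashlib.sha256(item.encode("utf-8")).digest()[0]


lossy = set()
lossy_rate = empirical_false_positive_rate(
    lambda item: lossy.add(one_byte_fingerprint(item)),
    lambda item: one_byte_fingerprint(item) in lossy,
    num_inserted=50,
    num_probes=10000,
)
print(lossy_rate)
```

The same helper could presumably be pointed at the filters from this post by passing, for example, `c_filter.insert` and `c_filter.__contains__`, since `c_filter` supports both `insert` and the `in` operator.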

<img src="images/false_positive.png">

As seen in the graph, a key advantage of the cuckoo filter is that, with fixed space, it provides much lower false positive rates at smaller capacities. Of note is that the Cuckoo filter here is a straightforward implementation without any space optimizations. This further suggests that a Cuckoo filter provides better performance without any tuning compared to optimized Bloom filters. See [Notebook] for other performance benchmarks comparing the counting Bloom filter and the Cuckoo filter.

### Real world application

While probabilistic data structures can seem esoteric, they are actually quite useful in practice. Consider the issue of engagement and growth that is typically of concern to large-scale internet applications. For example, one common criticism of Twitter is that it is difficult for new users to become engaged. To tackle this issue, members of the growth and engagement team at Twitter can develop marketing campaigns targeted at new and unengaged users in order to get them using Twitter more often. To aid such work, every new user can be added to a cuckoo filter on registration. Once a user becomes active (whatever the definition of active is), the user can be removed from the cuckoo filter. Consequently, the team can target growth campaigns to individuals currently in the cuckoo filter. From time to time, currently active users can lapse and become inactive. These users can be added to and removed from the cuckoo filter depending on their activity level. Given the ease with which a cuckoo filter can be implemented, it is particularly amenable to such a use case. For an application with hundreds of millions of users, even the simplest implementation of the cuckoo filter can provide a low memory footprint coupled with low false positive rates.

## Conclusion

Bloom filters and their variants have proven useful in streaming applications and others where membership testing is critical. In this post, we have shown how a Cuckoo filter, which can be implemented simply, provides better practical performance than counting Bloom filters at lower capacities, under certain circumstances, out of the box and without tuning. Ultimately, Cuckoo filters can serve as alternatives in scenarios where a counting Bloom filter would normally be used.