# Cuckoo Filter 

### Introduction and Post Summary
Probabilistic data structures are useful in streaming settings and other applications with significant memory requirements. In streaming scenarios, one often wants to test whether an item has been encountered before. For example, imagine an application that consumes tweets from the Twitter API. A natural query is whether a given tweet has been seen before. With hundreds of millions of tweets published daily, storing every tweet in a set [hyperlink] requires significant memory. Probabilistic data structures like the bloom filter are designed to answer such queries in a space-efficient manner, at the cost of occasional false positives. Here, we explore a recently introduced probabilistic data structure, the cuckoo filter, which was designed as an improvement upon the standard bloom filter. In this post, we provide a Python implementation of the cuckoo filter and compare it to a counting bloom filter, a bloom filter variant that allows for dynamic deletions.

### Bloom filter
At Fast Forward Labs, bloom filters and other probabilistic data structures were explored in detail in the report "Probabilistic Methods for Real-time Streams." Bloom filters [hyperlink] use hash functions to encode items compactly in an array. Hashing an item to be inserted yields several values, which are used as indices into the array. The positions at those indices are then set to indicate that the item has been inserted into the filter. Because two items can be hashed to the same indices, membership queries can return false positives. Traditional bloom filters do not support deletions: hashing is lossy and irreversible, so a set position cannot be attributed to a single item, and removing an item requires rebuilding the entire filter. Variants of the bloom filter that allow deletions have been introduced, however. One popular variant is the counting bloom filter [link], which replaces each bit with a small counter that is incremented on insertion and decremented on deletion.
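
To make the mechanism concrete, here is a minimal sketch of a counting bloom filter. It is an illustration only, not the implementation benchmarked in this post: the class name, the default sizes, the method names, and the choice of salted SHA-256 as the hash family are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting bloom filter: each slot holds a counter, not a bit."""

    def __init__(self, num_slots=1024, num_hashes=3):
        # Both defaults are arbitrary values chosen for the example.
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots

    def _indices(self, item):
        # Derive num_hashes array indices by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def insert(self, item):
        for i in self._indices(item):
            self.counters[i] += 1

    def contains(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.counters[i] > 0 for i in self._indices(item))

    def delete(self, item):
        # Only meaningful for items that were actually inserted; deleting a
        # false positive would corrupt the counters of other items.
        if not self.contains(item):
            return False
        for i in self._indices(item):
            self.counters[i] -= 1
        return True


cbf = CountingBloomFilter()
cbf.insert("tweet-1")
print(cbf.contains("tweet-1"))  # True
cbf.delete("tweet-1")
print(cbf.contains("tweet-1"))  # False
```

The counters are what make deletion possible: a slot shared by two items stays non-zero after one of them is removed, so the other item is still reported as present.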


### Overview of cuckoo filter
The cuckoo filter is designed to provide capabilities similar to those of the counting bloom filter. It consists of a cuckoo hash table [hyperlink] that stores the 'fingerprints' of inserted items. The fingerprint of an item is a short bit string derived from the hash of that item. A cuckoo hash table "consists of an array of buckets" where an item to be inserted is mapped to two candidate buckets by two hash functions. Each bucket stores a configurable number of fingerprints. Typically, a cuckoo filter is identified by its fingerprint size and bucket size. For example, a (2,4) cuckoo filter stores 2-bit fingerprints, and each bucket in the cuckoo hash table can store up to 4 fingerprints. Following the paper that introduced the cuckoo filter [link], we implemented it in Python. Below we initialize an example cuckoo filter and test simple inserts and deletions. We also implement a counting bloom filter in order to make performance comparisons.

In [1]:
from cuckoofilter import CuckooFilter

c_filter = CuckooFilter(100, 2)  # specify capacity and fingerprint size

In [None]:
c_filter.insert("James")

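# Query the item just inserted (assumes the filter exposes a `contains` method)
print(c_filter.contains("James"))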
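
For readers who want to see the mechanics behind these calls, here is a minimal sketch of a cuckoo filter. It is not the `cuckoofilter` module used in this post: the class name `SketchCuckooFilter`, its parameters, the SHA-256 based hashing, and the `contains` and `delete` method names are all assumptions made for illustration. It uses partial-key cuckoo hashing, where the second bucket is derived from the first bucket index and the fingerprint, so a stored fingerprint can be relocated without access to the original item.

```python
import hashlib
import random


def _hash(data: bytes) -> int:
    # 64-bit integer taken from a SHA-256 digest (an arbitrary choice of hash).
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class SketchCuckooFilter:
    """Illustrative cuckoo filter; not the implementation used in the post."""

    def __init__(self, num_buckets=128, bucket_size=4, fingerprint_bits=8,
                 max_kicks=500):
        # A power-of-two bucket count keeps the XOR step below reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fingerprint_bits = fingerprint_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        return _hash(b"fp:" + item.encode("utf-8")) % (1 << self.fingerprint_bits)

    def _index(self, item: str) -> int:
        return _hash(item.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Applying this twice with the same fingerprint returns the first index.
        return (index ^ _hash(fp.to_bytes(8, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # the filter is treated as full

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)  # removes a single matching copy
                return True
        return False


sketch = SketchCuckooFilter()
sketch.insert("James")
print(sketch.contains("James"))  # True
sketch.delete("James")
print(sketch.contains("James"))  # False
```

Because only fingerprints are stored, two different items that share a fingerprint and a bucket are indistinguishable, which is where false positives come from. Deletion is safe only for items that were actually inserted.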

### Benchmarking against the counting bloom filter
In this section, we compare the cuckoo filter to the counting bloom filter on four measures: insertion time, deletion time, space requirements, and false positive rate.

#### Insertion Time
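
One way insertion time could be measured is sketched below. The helper `time_inserts` and the `SetBaseline` class are hypothetical names introduced for this example; the only assumption about a filter is that it exposes an `insert(item)` method, as `c_filter` does earlier. The exact `set` baseline is there to keep the snippet self-contained, and the number of items is arbitrary.

```python
import time


def time_inserts(make_filter, items):
    """Return the seconds taken to insert every item into a fresh filter."""
    f = make_filter()
    start = time.perf_counter()
    for item in items:
        f.insert(item)
    return time.perf_counter() - start


class SetBaseline:
    """Exact baseline: a plain Python set wrapped to expose insert()."""

    def __init__(self):
        self._seen = set()

    def insert(self, item):
        self._seen.add(item)


items = [f"tweet-{n}" for n in range(10_000)]
elapsed = time_inserts(SetBaseline, items)
print(f"set baseline: {elapsed:.4f} s for {len(items)} inserts")
```

Passing a factory rather than an instance means every measurement starts from an empty filter, so the same call can be repeated for each filter type and each input size.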


As shown in the graphs above, the cuckoo filter has several advantages over the standard bloom filter. Cuckoo filters are particularly suited to applications that need a low false positive rate and a smaller memory footprint than the bloom filter provides.
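
The link between fingerprint size and false positive rate can be made explicit. A lookup compares the query fingerprint against at most 2b stored fingerprints (two buckets of b entries each), and each comparison against an unrelated f-bit fingerprint matches with probability 1/2^f, which gives the upper bound 1 - (1 - 1/2^f)^(2b) reported in the cuckoo filter paper. The function name below is an assumption made for this sketch.

```python
def false_positive_upper_bound(fingerprint_bits, bucket_size):
    """Upper bound on the false positive probability of a cuckoo filter."""
    # A lookup probes two buckets, so at most 2 * bucket_size fingerprints
    # are compared; each matches by chance with probability 2**-fingerprint_bits.
    return 1 - (1 - 2 ** -fingerprint_bits) ** (2 * bucket_size)


for bits in (2, 8, 12, 16):
    bound = false_positive_upper_bound(bits, 4)
    print(f"({bits},4) cuckoo filter: bound = {bound:.6f}")
```

For the (2,4) configuration used as an example earlier, the bound is roughly 0.9, so 2-bit fingerprints are useful for illustration only. Practical configurations use longer fingerprints.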

In [None]:
### Graph of sizes for different inserts

### False positive rate for different sizes, including the extreme case when the filter is almost full

### total size benchmark graph