In [25]:
import csv
import mmh3
from bitarray import bitarray
import math
import time

We use MurmurHash3 (through the mmh3 package) to generate the hash functions and bitarray to store the Bloom filter.

In this application we have a dataset of 20,000 malicious URLs from which we build the Bloom filter.

Checking every URL that the user enters in the browser against the full malicious URL database would make the browser slow.

Therefore each URL is first checked against a local Bloom filter. Only if the Bloom filter returns a positive result is a full check of the URL performed (and the user warned, if that check is positive too).
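
The two-stage check described above can be sketched as follows. This is a minimal illustration, not the browser's actual logic: is_malicious, might_contain, full_lookup and the example URLs are all hypothetical names, and the Bloom filter is replaced by a stand-in that deliberately reports one false positive.

```python
def is_malicious(url, might_contain, full_lookup):
    """Two-stage check: cheap local filter first, expensive lookup second."""
    if not might_contain(url):
        return False           # definite negative, no full lookup needed
    return full_lookup(url)    # filter said "maybe": confirm against the database


# Hypothetical stand-ins, for illustration only.
known_bad = {"http://bad.example/a", "http://bad.example/b"}
lookups = []                   # records every URL that reached the full check


def might_contain(url):
    # Stand-in for a Bloom filter query: never misses a known URL,
    # but reports one clean URL as a (false) positive.
    return url in known_bad or url == "http://fine.example/x"


def full_lookup(url):
    # Stand-in for the expensive database check.
    lookups.append(url)
    return url in known_bad


for url in ["http://bad.example/a", "http://fine.example/x", "http://ok.example/"]:
    print(url, is_malicious(url, might_contain, full_lookup))
print("full lookups performed:", len(lookups))
```

Only URLs that pass the filter reach full_lookup, so a false positive costs one extra lookup but never produces a wrong warning.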

In [26]:
# size = 100000
# bit_array = bitarray(size)

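The false positive rate of a Bloom filter depends on the size of the bit array, the number of hash functions used and the number of elements inserted.
We set the number of elements to be inserted to 20000 and the target false positive rate to 0.05, and derive the optimal size and number of hash functions from these.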

In [27]:
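n = 20000   # number of elements to insert
p = 0.05    # target false positive rate
m = -(n * math.log(p)) / (math.log(2) ** 2)   # optimal number of bits
k = (m / n) * math.log(2)                     # optimal number of hash functions
size = int(m)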

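m denotes the size of the Bloom filter in bits.
k denotes the number of hash functions used. The optimal value is not an integer, so the code below uses 4 hash functions.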
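
Because k has to be an integer in practice, it can help to see what false positive rate to expect for the nearest integers. The sketch below wraps the two formulas above in a helper and evaluates the standard approximation (1 - e^(-kn/m))^k. The function names bloom_parameters and expected_fp_rate are introduced here for illustration only.

```python
import math


def bloom_parameters(n, p):
    """Return the optimal (m, k) for n elements and target false positive rate p."""
    m = -(n * math.log(p)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return m, k


def expected_fp_rate(n, m, k):
    """Standard approximation of the false positive rate after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(20000, 0.05)
size = int(m)
print("size:", size, "optimal k:", k)

# Compare the two integers on either side of the optimal k.
for hashes in (math.floor(k), math.ceil(k)):
    print(hashes, "hash functions ->", expected_fp_rate(20000, size, hashes))
```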

In [28]:
size

124704

In [29]:
k

4.321928094887363

In [30]:
bit_array = bitarray(size)

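bitarray(size) allocates a bit array of the given size. Depending on the bitarray version, its initial contents may not be zeroed, so we explicitly set all entries to 0 with setall(0) before inserting any URL.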

In [60]:
start_time = time.time()
bit_array.setall(0)

# Insert a URL: set the bits selected by four seeded MurmurHash3 hashes.
def mapf(url):
    b1 = mmh3.hash(url, 1) % size
    bit_array[b1]=1

    b2 = mmh3.hash(url, 2) % size
    bit_array[b2]=1

    b3 = mmh3.hash(url, 3) % size
    bit_array[b3]=1

    b4 = mmh3.hash(url, 4) % size
    bit_array[b4]=1

#     b5 = mmh3.hash(url, 5) % size
#     bit_array[b5]=1

#     b6 = mmh3.hash(url, 6) % size
#     bit_array[b6]=1

#     b7 = mmh3.hash(url, 7) % size
#     bit_array[b7]=1


# Column 1 of each row in new_data.csv holds the malicious URL.
with open("new_data.csv") as f:
    for row in csv.reader(f):
        mapf(row[1])
print("Time for build --- %s seconds ---" % (time.time() - start_time))

Time for build --- 0.07630491256713867 seconds ---


The mapf() function sets the Bloom filter bits for the URL passed as its parameter. Each URL is passed to the hash functions, and the indexes returned by the hash functions are set to 1 in the Bloom filter bit array.
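
The same insert and lookup steps can also be written with a loop over the seeds, so that changing the number of hash functions is a one-parameter change rather than commenting lines in and out. The sketch below is not the notebook's code. To stay runnable with only the standard library it swaps mmh3 for a salted hashlib.blake2b and swaps bitarray for a bytearray with one byte per slot. The names indexes, add and might_contain are hypothetical.

```python
import hashlib


def indexes(url, num_hashes, size):
    """Yield one slot index per seeded hash of the URL."""
    for seed in range(1, num_hashes + 1):
        # Assumption: salted BLAKE2b stands in for mmh3.hash(url, seed).
        digest = hashlib.blake2b(
            url.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        yield int.from_bytes(digest, "little") % size


def add(bits, url, num_hashes):
    """Set every slot selected by the hash functions."""
    for i in indexes(url, num_hashes, len(bits)):
        bits[i] = 1


def might_contain(bits, url, num_hashes):
    """True if every selected slot is set (possible member), else False."""
    return all(bits[i] for i in indexes(url, num_hashes, len(bits)))


bits = bytearray(1000)   # zero-initialised; one byte per slot for simplicity
add(bits, "http://bad.example/a", num_hashes=4)
print(might_contain(bits, "http://bad.example/a", num_hashes=4))
```

A URL that was added is always reported as a possible member. A URL that was never added is usually rejected, but can be accepted if all of its slots happen to be set by other URLs.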

In [61]:
start_time = time.time()
disk_accesses = 0

ba = bit_array

def check(url):
    global disk_accesses
    b1 = mmh3.hash(url, 1) % size
    b2 = mmh3.hash(url, 2) % size
    b3 = mmh3.hash(url, 3) % size
    b4 = mmh3.hash(url, 4) % size
#     b5 = mmh3.hash(url, 5) % size
#     b6 = mmh3.hash(url, 6) % size
#     b7 = mmh3.hash(url, 7) % size
    if(ba[b1]==True 
    and ba[b2]==True 
    and ba[b3]==True 
    and ba[b4]==True 
#     and ba[b5]==True 
#     and ba[b6]==True 
#     and ba[b7]==True
      ):
        disk_accesses=disk_accesses+1



# Column 1 of each row in top2m.csv holds the URL to test.
with open("top2m.csv") as f:
    for row in csv.reader(f):
        check(row[1])
print(disk_accesses)

print("Time for check --- %s seconds ---" % (time.time() - start_time))

245
Time for check --- 0.004625082015991211 seconds ---


The check(url) function uses the Bloom filter to test whether a URL may be present in the malicious URL file. Every positive result would trigger a full lookup and is counted in disk_accesses.
top2m.csv contains URLs which are not present in our initial malicious URL file.
We count the false positives.
To analyze the result, we vary the number of hash functions and the size of the Bloom filter and record the number of false positives for each combination. We then plot a graph from this data.
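
The experiment of varying the number of hash functions can be sketched as below. It is a self-contained illustration under several assumptions: the URLs are synthetic, salted hashlib.blake2b stands in for mmh3, a bytearray stands in for bitarray, and the helper names indexes and false_positive_rate are hypothetical. The measured rates will therefore not match the numbers from new_data.csv and top2m.csv.

```python
import hashlib
import math


def indexes(url, num_hashes, size):
    """Yield one slot index per seeded hash (BLAKE2b stands in for mmh3)."""
    for seed in range(1, num_hashes + 1):
        digest = hashlib.blake2b(
            url.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        yield int.from_bytes(digest, "little") % size


def false_positive_rate(members, probes, size, num_hashes):
    """Build a filter from members, then measure how many probes it accepts."""
    bits = bytearray(size)
    for url in members:
        for i in indexes(url, num_hashes, size):
            bits[i] = 1
    hits = sum(
        all(bits[i] for i in indexes(url, num_hashes, size)) for url in probes
    )
    return hits / len(probes)


# Synthetic data: no probe URL is a member, so every hit is a false positive.
members = [f"http://bad.example/{i}" for i in range(2000)]
probes = [f"http://good.example/{i}" for i in range(2000)]
size = int(-(len(members) * math.log(0.05)) / (math.log(2) ** 2))

rates = {k: false_positive_rate(members, probes, size, k) for k in range(1, 8)}
for k, rate in rates.items():
    print(f"k={k}: measured false positive rate {rate:.4f}")
```

The same loop can be repeated over several values of size to collect the (size, k, false positive rate) triples needed for a plot.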

In [58]:
start_time = time.time()
link = csv.reader(open("new_data.csv"))
data=[]
for item in link:
    data.append(item[1])
test = csv.reader(open("top2m.csv"))
count=0
for number,item in test:
    # Linear scan of the full list for every URL (no Bloom filter in front)
    if item in data:
        count+=1
print(count)
print("Time for check --- %s seconds ---" % (time.time() - start_time))

200
Time for check --- 0.2752988338470459 seconds ---


Here we measure the time actually taken when, for every URL that needs to be checked, we search the full list of malicious URLs loaded from the file instead of consulting the Bloom filter first.