## Bloom filter

In [1]:
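import numpy as np
import murmurhash

class FiltroBloom:
    def __init__(self, n, m, k):
        self.n = n  # size of the bit array
        self.m = m  # number of elements to register (stored for reference only)
        self.k = k  # number of hash functions
        # Built-in bool replaces the np.bool alias deprecated in NumPy 1.20
        self.arrbit = np.zeros(n, dtype=bool)

    def registra(self, s):
        # Set the k positions obtained by hashing s with seeds 0..k-1
        for i in range(self.k):
            hv = murmurhash.mrmr.hash(s, i) % self.n
            self.arrbit[hv] = True

    def verifica(self, s):
        # s can be in the set only if all k of its positions are set
        bits = np.zeros(self.k, dtype=bool)
        for i in range(self.k):
            hv = murmurhash.mrmr.hash(s, i) % self.n
            bits[i] = self.arrbit[hv]
        return np.all(bits)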

To test our filter we will use a list of popular URLs.

In [2]:
!wget https://gist.githubusercontent.com/demersdesigns/4442cd84c1cc6c5ccda9b19eac1ba52b/raw/cf06109a805b661dd12133f9aa4473435e478569/craft-popular-urls

--2024-05-04 09:22:31--  https://gist.githubusercontent.com/demersdesigns/4442cd84c1cc6c5ccda9b19eac1ba52b/raw/cf06109a805b661dd12133f9aa4473435e478569/craft-popular-urls
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)[185.199.109.133]:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2254 (2.2K) [text/plain]
Saving to: “craft-popular-urls.3”

2024-05-04 09:22:32 (14.0 MB/s) - “craft-popular-urls.3” saved [2254/2254]

We read the list of URLs.


In [3]:
# Read the whole file and split it into one URL per line
with open('craft-popular-urls') as f:
    urls = f.read().split('\n')
print(urls)

['http://www.youtube.com', 'http://www.facebook.com', 'http://www.baidu.com', 'http://www.yahoo.com', 'http://www.amazon.com', 'http://www.wikipedia.org', 'http://www.qq.com', 'http://www.google.co.in', 'http://www.twitter.com', 'http://www.live.com', 'http://www.taobao.com', 'http://www.bing.com', 'http://www.instagram.com', 'http://www.weibo.com', 'http://www.sina.com.cn', 'http://www.linkedin.com', 'http://www.yahoo.co.jp', 'http://www.msn.com', 'http://www.vk.com', 'http://www.google.de', 'http://www.yandex.ru', 'http://www.hao123.com', 'http://www.google.co.uk', 'http://www.reddit.com', 'http://www.ebay.com', 'http://www.google.fr', 'http://www.t.co', 'http://www.tmall.com', 'http://www.google.com.br', 'http://www.360.cn', 'http://www.sohu.com', 'http://www.amazon.co.jp', 'http://www.pinterest.com', 'http://www.netflix.com', 'http://www.google.it', 'http://www.google.ru', 'http://www.microsoft.com', 'http://www.google.es', 'http://www.wordpress.com', 'http://www.gmw.cn', 'http://w

We instantiate our class and register the URLs.

In [4]:
fb = FiltroBloom(1000, len(urls), 5)

for u in urls:
    fb.registra(u)

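The call above fixes the bit-array size (1000) and the number of hash functions (5) by hand. As a rough guide for choosing them, the standard Bloom filter approximations can be sketched as follows. The function names and the item count of 100 are hypothetical and are not taken from the notebook.

```python
import math


def optimal_k(n_bits, n_items):
    """Number of hash functions that minimizes false positives: (bits / items) * ln 2."""
    return max(1, round(n_bits / n_items * math.log(2)))


def bits_for_target(n_items, p):
    """Bits needed to reach false-positive rate p when the optimal k is used."""
    return math.ceil(-n_items * math.log(p) / math.log(2) ** 2)


n_items = 100  # hypothetical item count, not the real length of the URL list
print(optimal_k(1000, n_items))        # suggested k for a 1000-bit array
print(bits_for_target(n_items, 0.01))  # bits for a 1% false-positive target
```

Both formulas assume independent, uniformly distributed hash functions, so they should be read as estimates rather than guarantees.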


We check the state of the bit array after registering all the elements.

In [5]:
print('Proportion of nonzero bits = {0}'.format(np.count_nonzero(fb.arrbit) / fb.arrbit.size))

Proportion of nonzero bits = 0.39
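The measured proportion can be compared with the usual estimate for the expected fraction of set bits, 1 - (1 - 1/n)^(k * items), which assumes uniform and independent hashes. The sketch below uses a hypothetical count of 100 items, since the actual length of the URL list is not shown in the output above.

```python
def expected_fill(n_bits, k, n_items):
    """Expected fraction of set bits after inserting n_items with k hashes each."""
    return 1.0 - (1.0 - 1.0 / n_bits) ** (k * n_items)


# 100 items is an assumption for illustration; replace it with len(urls)
print(round(expected_fill(1000, 5, 100), 4))
```

With these assumed values the estimate lands close to the 0.39 measured above.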


We check some URLs.

In [6]:
print(fb.verifica('http://www.youtube.com'))
print(fb.verifica('http://www.facebook.com'))
print(fb.verifica('http://www.yahoo.com'))
print(fb.verifica('http://www.amazon.com'))
print(fb.verifica('http://www.wikipedia.org'))
print(fb.verifica('http://www.baidu.com'))
print(fb.verifica('http://www.twitter.com'))
print(fb.verifica('http://www.unam.mx'))
print(fb.verifica('http://www.twitter.com/'))
print(fb.verifica('https://www.twitter.com'))
print(fb.verifica('https://www.twitter.com/'))

True
True
True
True
True
True
True
False
False
True
False


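Note that 'https://www.twitter.com' returns True even though only the 'http://' form appears in the visible part of the list, which suggests a false positive. A Bloom filter never reports a registered element as absent, but it can report an unregistered one as present. A minimal sketch of the usual estimate, assuming independent hashes: a query is a false positive when all k probed bits happen to be set, so the rate is roughly the fill fraction raised to the power k. The function name is hypothetical.

```python
def false_positive_estimate(fill, k):
    """Probability that k independently chosen bits are all set."""
    return fill ** k


# 0.39 is the fill fraction measured earlier; 5 is the k used for the filter
print(round(false_positive_estimate(0.39, 5), 4))
```

Under this estimate roughly one query in a hundred for an unregistered URL would come back True.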


## Exercise


- Explore different hyperparameter values
- Change the hash function used by the class
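As one possible starting point for the second item, the sketch below swaps MurmurHash for BLAKE2b from the standard library's `hashlib`, using the seed as the salt so that each seed yields a different hash. The names `blake2_index` and `BloomBlake2` are hypothetical, and the bit array is a plain `bytearray` instead of a NumPy array so the example has no third-party dependencies.

```python
import hashlib


def blake2_index(s, seed, n_bits):
    """Map string s to a position in [0, n_bits), using seed as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        s.encode('utf-8'),
        digest_size=8,
        salt=seed.to_bytes(4, 'little'),  # a different salt per seed
    ).digest()
    return int.from_bytes(digest, 'little') % n_bits


class BloomBlake2:
    """Hypothetical variant of the filter that needs only the standard library."""

    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def registra(self, s):
        for i in range(self.k):
            self.bits[blake2_index(s, i, self.n_bits)] = 1

    def verifica(self, s):
        return all(self.bits[blake2_index(s, i, self.n_bits)] for i in range(self.k))


fb2 = BloomBlake2(1000, 5)
fb2.registra('http://www.youtube.com')
print(fb2.verifica('http://www.youtube.com'))  # True: registered items are always found
```

A cryptographic hash such as BLAKE2b is slower than MurmurHash, so this trade favors portability over speed.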
