## Bloom Filter

In this notebook we implement a Bloom filter using NumPy. A Bloom filter consists of an array of \\(n\\) bits, all initialized to \\(0\\).

* Construction
  1. For each element \\(s\\) of a set of cardinality \\(m\\), compute its _hash_ values with \\(k\\) distinct functions \\(h_1(s), h_2(s), \ldots, h_k(s)\\), each reduced to a position in the bit array.
  2. Set to 1 the \\(k\\) bits at the positions given by those \\(k\\) _hash_ values.
  
* Membership check for a new element \\(\tilde{s}\\)
  1. Compute the _hash_ values for \\(\tilde{s}\\): \\(h_1(\tilde{s}), h_2(\tilde{s}), \ldots , h_k(\tilde{s})\\).
  2. If all the bits at the positions given by the \\(k\\) _hash_ values are 1, then \\(\tilde{s}\\) is reported as belonging to the set. This answer can be a false positive, because those bits may have been set by other elements. If any of the bits is 0, \\(\tilde{s}\\) definitely does not belong to the set.
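
How often the membership check errs can be estimated before building anything. Under the usual assumption that the \\(k\\) hash functions behave like independent uniform draws, the false-positive probability is approximately \\(p \approx (1 - e^{-km/n})^k\\), and the value of \\(k\\) that minimizes it is about \\((n/m)\ln 2\\). The sketch below evaluates both; the function names and the value \\(m = 100\\) are illustrative choices, not part of the original material.

```python
import math

def false_positive_rate(n, m, k):
    """Approximate false-positive probability (1 - e^{-km/n})^k.

    n: number of bits, m: number of stored elements, k: number of hash functions.
    Assumes the hash functions act like independent uniform draws.
    """
    return (1.0 - math.exp(-k * m / n)) ** k

def optimal_k(n, m):
    """Number of hash functions that minimizes the approximation above."""
    return (n / m) * math.log(2)

# m = 100 is an illustrative value, not taken from the notebook's data
p = false_positive_rate(n=1000, m=100, k=5)
k_star = optimal_k(n=1000, m=100)
print(f'approximate false-positive probability: {p:.4f}')
print(f'approximately optimal k: {k_star:.2f}')
```

In practice \\(k\\) has to be an integer, so the optimum is rounded to a nearby whole number.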
  
  
This notebook is based on material by Dr. Gibran Fuentes.

In [0]:
!pip install murmurhash



In [0]:
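import numpy as np
import murmurhash

class FiltroBloom:
  """Bloom filter with n bits and k MurmurHash functions (seeds 0..k-1)."""
  def __init__(self, n, m, k):
    self.n = n  # number of bits in the array
    self.m = m  # number of elements to store (kept for reference only)
    self.k = k  # number of hash functions
    # np.bool is deprecated since NumPy 1.20; the builtin bool works across versions
    self.arrbit = np.zeros(n, dtype=bool)

  def registra(self, s):
    # Set the k bits selected by hashing s with each seed
    for i in range(self.k):
      hv = murmurhash.mrmr.hash(s, i) % self.n
      self.arrbit[hv] = True

  def verifica(self, s):
    # Collect the k bits selected by hashing s with each seed
    bits = np.zeros(self.k, dtype=bool)
    for i in range(self.k):
      hv = murmurhash.mrmr.hash(s, i) % self.n
      bits[i] = self.arrbit[hv]

    # s can be in the set only if all k bits are set
    return np.all(bits)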

In [0]:
!wget https://gist.githubusercontent.com/demersdesigns/4442cd84c1cc6c5ccda9b19eac1ba52b/raw/cf06109a805b661dd12133f9aa4473435e478569/craft-popular-urls

--2022-04-07 13:31:33--  https://gist.githubusercontent.com/demersdesigns/4442cd84c1cc6c5ccda9b19eac1ba52b/raw/cf06109a805b661dd12133f9aa4473435e478569/craft-popular-urls
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2254 (2.2K) [text/plain]
Saving to: ‘craft-popular-urls.2’


2022-04-07 13:31:33 (32.9 MB/s) - ‘craft-popular-urls.2’ saved [2254/2254]



In [0]:
with open('craft-popular-urls') as f:
  urls = f.read().split('\n')
print(urls)

['http://www.youtube.com', 'http://www.facebook.com', 'http://www.baidu.com', 'http://www.yahoo.com', 'http://www.amazon.com', 'http://www.wikipedia.org', 'http://www.qq.com', 'http://www.google.co.in', 'http://www.twitter.com', 'http://www.live.com', 'http://www.taobao.com', 'http://www.bing.com', 'http://www.instagram.com', 'http://www.weibo.com', 'http://www.sina.com.cn', 'http://www.linkedin.com', 'http://www.yahoo.co.jp', 'http://www.msn.com', 'http://www.vk.com', 'http://www.google.de', 'http://www.yandex.ru', 'http://www.hao123.com', 'http://www.google.co.uk', 'http://www.reddit.com', 'http://www.ebay.com', 'http://www.google.fr', 'http://www.t.co', 'http://www.tmall.com', 'http://www.google.com.br', 'http://www.360.cn', 'http://www.sohu.com', 'http://www.amazon.co.jp', 'http://www.pinterest.com', 'http://www.netflix.com', 'http://www.google.it', 'http://www.google.ru', 'http://www.microsoft.com', 'http://www.google.es', 'http://www.wordpress.com', 'http://www.gmw.cn', 'http://w

In [0]:
fb = FiltroBloom(1000, len(urls), 5)
for u in urls:
  fb.registra(u)



In [0]:
print('Fraction of nonzero bits = {0}'.format(fb.arrbit.nonzero()[0].size / fb.arrbit.size))

Fraction of nonzero bits = 0.39
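
This fraction gives a quick estimate of the filter's false-positive rate. An element that was never registered is reported as present only when all \\(k\\) of its positions fall on bits that are already set. If those positions behave like independent uniform draws (an assumption), that happens with probability close to the fill fraction raised to the power \\(k\\). A minimal sketch using the 0.39 reported above and the \\(k = 5\\) used to build `fb`:

```python
fill = 0.39  # fraction of set bits reported above
k = 5        # number of hash functions used to build the filter

# Probability that k independent uniform positions all hit set bits
estimate = fill ** k
print(f'estimated false-positive rate: {estimate:.4f}')
```

So under this assumption somewhat less than 1% of queries for unregistered elements would be expected to come back `True`.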


In [0]:
print(fb.verifica('http://www.youtube.com'))
print(fb.verifica('http://www.facebook.com'))
print(fb.verifica('http://www.yahoo.com'))
print(fb.verifica('http://www.amazon.com'))
print(fb.verifica('http://www.wikipedia.org'))
print(fb.verifica('http://www.baidu.com'))
print(fb.verifica('http://www.twitter.com'))
print(fb.verifica('http://www.unam.mx'))
print(fb.verifica('http://www.twitter.com/'))
print(fb.verifica('https://www.twitter.com'))
print(fb.verifica('https://www.twitter.com/'))

True
True
True
True
True
True
True
False
False
True
False
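
In the results above, `https://www.twitter.com` comes back `True` even though only the `http://` variant was registered: that is a false positive. Registered elements, by contrast, always come back `True`, so there are no false negatives. The sketch below checks both properties empirically. To stay self-contained it does not use `FiltroBloom` or MurmurHash; it is a hypothetical standard-library stand-in (`BloomSketch`) that derives its \\(k\\) hash functions by salting BLAKE2b, and the URLs are made up for the experiment.

```python
import hashlib

class BloomSketch:
    """Hypothetical stdlib-only Bloom filter, not the notebook's FiltroBloom."""

    def __init__(self, n, k):
        self.n = n                 # number of bits
        self.k = k                 # number of hash functions
        self.bits = bytearray(n)   # one byte per bit, for simplicity

    def _positions(self, s):
        # Salted BLAKE2b stands in for seeded MurmurHash (an assumption).
        for i in range(self.k):
            digest = hashlib.blake2b(
                s.encode('utf-8'),
                digest_size=8,
                salt=i.to_bytes(8, 'little'),
            ).digest()
            yield int.from_bytes(digest, 'little') % self.n

    def add(self, s):
        for p in self._positions(s):
            self.bits[p] = 1

    def __contains__(self, s):
        return all(self.bits[p] for p in self._positions(s))

# Made-up URLs: 100 registered members and 10,000 strings never registered
members = [f'http://www.site{i}.com' for i in range(100)]
outsiders = [f'https://www.other{i}.org' for i in range(10_000)]

bf = BloomSketch(n=1000, k=5)
for u in members:
    bf.add(u)

# Every registered element must be found: no false negatives
no_false_negatives = all(u in bf for u in members)

# Count how many unregistered strings are wrongly reported as present
false_positives = sum(u in bf for u in outsiders)
fp_rate = false_positives / len(outsiders)

print(f'no false negatives: {no_false_negatives}')
print(f'observed false-positive rate: {fp_rate:.4f}')
```

The observed rate depends on the hash function and on the data, so it should be read as a rough check rather than as a property of the filter built in this notebook.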
