## Challenge

Create a `train_bloom_filter` function that reads the `displayname` of each Stack Overflow user from the `0.xml` data file. Use those names to train a Bloom filter, and save it in the `resources` directory in a file called `hot_names_bloom_filter`. Note that `train_bloom_filter` should return the trained Bloom filter instance.
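
As a reminder of what "training" a Bloom filter means, here is a minimal pure-Python sketch of the data structure. It is not how `pybloom_live` is implemented, and it does not solve the challenge. The class name `TinyBloomFilter`, the SHA-256 based hashing, and the sample names are all assumptions made for illustration.

```python
import hashlib
import math


class TinyBloomFilter:
    """Illustrative Bloom filter; NOT the pybloom_live implementation."""

    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing formulas for n expected items and a target
        # false-positive rate p: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash with the index
        # (an assumption of this sketch; real libraries use other schemes).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item):
        # "Training" is just setting the k bits for every known item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


names = ['alice', 'bob', 'carol']  # hypothetical display names
bf = TinyBloomFilter(capacity=100)
for name in names:
    bf.add(name)

print(all(name in bf for name in names))
```

Every added name is always reported as present, while a name that was never added is reported as absent only with high probability. This is why the `not in` assertions in the tests below can, in rare cases, fail on a false positive.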

In [6]:
%%bash

# -- run it in order to fetch the data
wget -P resources https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-users/0.xml

In [24]:
%%file tests.py

from unittest import TestCase
from random import sample
from string import ascii_letters

from pybloom_live import BloomFilter

# from exercise import row_to_dict, train_bloom_filter
from answer import row_to_dict, train_bloom_filter


class BloomFilterTestCase(TestCase):
    
    def setUp(self):
        names = set()
        with open('resources/0.xml', 'r') as f:
            for line in f:
                user = row_to_dict(line)
                names.add(user['displayname'])

        self.names = names
    
    def test_train_bloom_filter__directly(self):
        bf = train_bloom_filter()
        
        # sorted() turns the set into a sequence; random.sample rejects sets on Python 3.11+
        for name in sample(sorted(self.names), 10):
            assert name in bf
            
            prefix = ''.join(sample(ascii_letters, 10))
            fake_name = f'{prefix}{name}'
            assert fake_name not in bf

    def test_train_bloom_filter__from_file(self):
        with open('./resources/hot_names_bloom_filter', 'rb') as f:
            bf = BloomFilter.fromfile(f)

        # sorted() turns the set into a sequence; random.sample rejects sets on Python 3.11+
        for name in sample(sorted(self.names), 10):
            assert name in bf
            
            prefix = ''.join(sample(ascii_letters, 10))
            fake_name = f'{prefix}{name}'
            assert fake_name not in bf

Overwriting tests.py


In [25]:
!py.test -s tests.py

platform linux -- Python 3.6.5, pytest-3.8.2, py-1.6.0, pluggy-0.7.1
rootdir: /home/maciej/projects/learning-big-data, inifile: pytest.ini
plugins: pythonpath-0.7.1, mock-0.10.1, cov-2.5.1
collected 2 items

tests.py ..



## Your solution

In [9]:
%%file exercise.py

import xmltodict
from pybloom_live import BloomFilter


def row_to_dict(row):
    """
    NOTE: you can use this function to parse incoming xml rows into 
    python dicts
    
    """
    row = row.strip()
    record = dict(xmltodict.parse(row)['row'])        

    return {k.replace('@', '').lower(): v for k, v in record.items()}


def train_bloom_filter():
    pass             

Writing exercise.py
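
To see what `row_to_dict` produces without installing `xmltodict`, the sketch below does the same key normalization with the standard library only. The function name `row_to_dict_stdlib` and the sample row are hypothetical; the real attributes in `0.xml` may differ.

```python
import xml.etree.ElementTree as ET


def row_to_dict_stdlib(row):
    """Hypothetical stdlib stand-in for row_to_dict (no xmltodict needed)."""
    element = ET.fromstring(row.strip())
    # Attribute names become lowercase keys, e.g. DisplayName -> displayname.
    return {key.lower(): value for key, value in element.attrib.items()}


# Hypothetical row; the real attributes in 0.xml may differ.
sample_row = '  <row Id="1" DisplayName="Jane Doe" Reputation="101" />\n'
user = row_to_dict_stdlib(sample_row)
print(user['displayname'])
```

All values come back as strings, so `user['displayname']` can be passed straight to the Bloom filter.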


## The answer

In [2]:
# !cat answer.py