## Challenge

Create a `ReduceSideWithBloomFilterJob` MapReduce job using `mrjob`:
- It accepts two types of input, users and badges, and performs an inner join on the `userid` field.
- Each resulting record should be a dictionary with a `user` field (holding the user record) and a `badges` field (holding the list of badges belonging to that user).
- It should only consider users with a `reputation` greater than `1500`.
- It should use a Bloom filter to drop, at the mapper level, badge records for users whose `reputation` is not greater than `1500`. A Bloom filter can return false positives, so a few such badge records may still reach the reducer and must be discarded there.
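
To make the idea behind the mapper-level filter concrete, here is a minimal sketch of a Bloom filter written with the standard library only. It is an illustration, not the `pybloom_live` implementation used in the exercise: the class name `TinyBloomFilter`, its parameters, and the sample records are all hypothetical.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical; not the pybloom_live class)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


# Made-up user records: only users with reputation > 1500 are added.
users = [
    {"id": "1", "reputation": "2000"},
    {"id": "2", "reputation": "100"},
    {"id": "3", "reputation": "1501"},
]
hot_users = TinyBloomFilter()
for user in users:
    if int(user["reputation"]) > 1500:
        hot_users.add(user["id"])

# Made-up badge records: keep only those whose user may be in the filter.
badges = [
    {"userid": "1", "name": "Teacher"},
    {"userid": "2", "name": "Student"},
    {"userid": "3", "name": "Editor"},
]
kept = [badge for badge in badges if badge["userid"] in hot_users]
```

Every user added to the filter is always reported as present, so no badge of a qualifying user is lost. A non-qualifying user is rejected in most cases, but not with certainty, which is why the join itself still has to check for a matching user record.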

In [4]:
%%bash

# -- run it in order to fetch the data
wget -P resources https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-users/0.xml
mv resources/0.xml resources/0-users.xml 
wget -P resources https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-badges/0.xml
mv resources/0.xml resources/0-badges.xml 

In [26]:
%%file tests.py

# from exercise import ReduceSideWithBloomFilterJob
from answer import ReduceSideWithBloomFilterJob


def test_job():
    job = ReduceSideWithBloomFilterJob(args=[
        './resources/0-users.xml',
        './resources/0-badges.xml',        
    ])

    results = []
    with job.make_runner() as runner:
        runner.run()
        for key, value in job.parse_output(runner.cat_output()):
            results.append(value)
    
    assert len(results) == 233
    assert results[0]['user']['id'] == '2975952'
    assert len(results[0]['badges']) == 1    

Overwriting tests.py


In [27]:
!py.test -s tests.py

platform linux -- Python 3.6.5, pytest-3.8.2, py-1.6.0, pluggy-0.7.1
rootdir: /home/maciej/projects/learning-big-data, inifile: pytest.ini
plugins: pythonpath-0.7.1, mock-0.10.1, cov-2.5.1
collected 1 item

tests.py .



## Your solution

In [28]:
%%file exercise.py

import os

from mrjob.job import MRJob
import xmltodict
from pybloom_live import BloomFilter


basedir = os.path.dirname(__file__)


def row_to_dict(row):
    row = row.strip()
    record = dict(xmltodict.parse(row)['row'])        

    return {k.replace('@', '').lower(): v for k, v in record.items()}


#
# TRAIN BLOOM FILTER
#
# -- training the Bloom filter
# -- your code goes here...
bf = 'replace this with something...'

# -- write the filter to the same location that mapper_init reads it from
with open(os.path.join(basedir, 'resources/hot_user_ids.bf'), 'wb') as f:
    bf.tofile(f)
      
        
class ReduceSideWithBloomFilterJob(MRJob):

    def mapper_init(self):
        with open(os.path.join(basedir, 'resources/hot_user_ids.bf'), 'rb') as f:
            self.filter = BloomFilter.fromfile(f)
        
    def mapper(self, _, line):
        # -- your code goes here
        yield 'a', 'a'
            
    def reducer(self, userid, entities):
        # -- your code goes here
        yield 'a', 'a'

Writing exercise.py
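
As a hint for the shape of the mapper and reducer, the reduce-side inner join can be sketched in plain Python without `mrjob`. This is only an outline under stated assumptions: the helper names `map_record`, `reduce_group` and `run_join` are hypothetical, records are assumed to be already parsed into dictionaries, badges are recognised by the presence of a `userid` key, and the Bloom filter check is left out.

```python
from collections import defaultdict


def map_record(record):
    # Assumption: badge records carry 'userid'; user records carry 'id'.
    if "userid" in record:
        yield record["userid"], ("badge", record)
    elif int(record["reputation"]) > 1500:
        yield record["id"], ("user", record)


def reduce_group(userid, entities):
    # Separate the tagged records that share one join key.
    user, badges = None, []
    for kind, record in entities:
        if kind == "user":
            user = record
        else:
            badges.append(record)
    # Inner join: emit only when both sides are present.
    if user is not None and badges:
        yield userid, {"user": user, "badges": badges}


def run_join(records):
    # Stand-in for the shuffle phase: group mapper output by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_group(key, groups[key]))
    return results


# Made-up sample data, for illustration only.
sample = [
    {"id": "1", "reputation": "2000"},
    {"id": "2", "reputation": "100"},
    {"id": "3", "reputation": "3000"},
    {"userid": "1", "name": "Teacher"},
    {"userid": "1", "name": "Editor"},
    {"userid": "2", "name": "Student"},
]
joined = run_join(sample)
```

With this sample, user `2` is dropped for low reputation and user `3` is dropped for having no badges, so only user `1` appears in `joined`. In the actual job the mapper would additionally consult the Bloom filter so that most badge records of low-reputation users never reach the shuffle.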


## The answer

In [30]:
# cat answer.py