## Challenge

Create a MapReduce job (using `mrjob`) that randomly samples 1% of Stack Overflow users.

In [6]:
%%bash

# Run this cell to fetch the data
wget -P resources https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-users/0.xml

In [1]:
%%file tests.py

# Swap the two imports below to test your own solution instead of the reference answer.
# from exercise import RandomSamplingJob
from answer import RandomSamplingJob


def test_job():
    job = RandomSamplingJob(args=['./resources/0.xml'])

    results = set()
    with job.make_runner() as runner:
        runner.run()
        for key, value in job.parse_output(runner.cat_output()):
            results.add((key, value))

    # The sample should hold roughly 1,000 records, within a 5% tolerance.
    assert abs(len(results) / 10 ** 3 - 1) < 0.05

Overwriting tests.py
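
Because every record is kept or dropped independently, the sample size varies from run to run. The sketch below estimates how much, so the 5% tolerance in the test can be put in context. It assumes about 100,000 input rows, a figure inferred only from the test expecting roughly 1,000 samples at a 1% rate; `sample_size_stats` is a hypothetical helper, not part of the exercise.

```python
import math


def sample_size_stats(n_records, rate):
    """Mean and standard deviation of the number of records kept
    when each record is kept independently with probability `rate`."""
    mean = n_records * rate
    std = math.sqrt(n_records * rate * (1 - rate))
    return mean, std


# Assumption: 100,000 input rows (inferred from the test, not stated in the source).
mean, std = sample_size_stats(100_000, 0.01)
tolerance = 0.05 * 1_000  # the test accepts counts strictly between 950 and 1050

print(mean, round(std, 1), round(tolerance / std, 2))
```

Under that assumption the standard deviation is about 31 records, so the accepted band is only about 1.6 standard deviations wide on each side, and an occasional failure by chance alone is plausible.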


In [2]:
!py.test -s tests.py

platform linux -- Python 3.6.5, pytest-3.8.2, py-1.6.0, pluggy-0.7.1
rootdir: /home/maciej/projects/learning-big-data, inifile: pytest.ini
plugins: pythonpath-0.7.1, mock-0.10.1, cov-2.5.1
collected 1 item

tests.py .



## Your solution

In [7]:
%%file exercise.py

from mrjob.job import MRJob


class RandomSamplingJob(MRJob):

    def mapper(self, key, line):
        # Placeholder: replace this with your sampling logic.
        yield 'a', 'a'

Writing exercise.py


## The answer

In [10]:
%%file answer.py

import json
import random

from mrjob.job import MRJob
import xmltodict


class RandomSamplingJob(MRJob):

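    def mapper(self, key, line):
        # Each input line is expected to hold a single <row ... /> element.
        line = line.strip()
        record = dict(xmltodict.parse(line)['row'])
        # xmltodict prefixes attribute names with '@'; drop it and lowercase the keys.
        user = {k.replace('@', '').lower(): v for k, v in record.items()}

        # Keep each user independently with probability 0.01 (about 1% of the input).
        if random.random() < 0.01:
            yield (key, json.dumps(user))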

Writing answer.py
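
To try the sampling logic without starting an `mrjob` runner, the same per-line steps can be sketched as a plain function. This is an illustration, not the answer's implementation: `sample_rows` is a hypothetical helper, it swaps `xmltodict` for the standard library's `xml.etree.ElementTree`, it skips lines that do not start with `<row`, and the two input rows are made up.

```python
import random
import xml.etree.ElementTree as ET


def sample_rows(lines, rate, seed=None):
    """Keep each <row ... /> line with probability `rate`.

    Hypothetical helper, not part of answer.py.
    """
    rng = random.Random(seed)  # passing a seed makes the sample reproducible
    kept = []
    for line in lines:
        line = line.strip()
        if not line.startswith('<row'):
            continue  # skip anything that is not a row element
        attributes = ET.fromstring(line).attrib
        user = {name.lower(): value for name, value in attributes.items()}
        if rng.random() < rate:
            kept.append(user)
    return kept


# Made-up input rows, for illustration only.
rows = ['<row Id="1" DisplayName="Ann" />', '<row Id="2" DisplayName="Bob" />']
print(sample_rows(rows, rate=1.0))
```

A rate of `1.0` keeps every row and `0.0` keeps none, which gives a quick sanity check before running the job at `0.01`.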


In [None]:
# !cat answer.py