## Challenge

Create `BadgeCountJob` (using `mrjob`) that counts how many times each Stack Overflow badge appears in the analysed files. If possible, use a combiner to reduce the amount of data passed from the mappers to the reducers.
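To make the combiner idea concrete, here is a minimal pure-Python sketch that simulates the map, combine and reduce phases without `mrjob`. The two input splits and the helper names (`map_phase`, `sum_by_key`) are hypothetical and exist only for illustration; in a real job the framework decides how input is split and when a combiner runs.

```python
from collections import Counter
from itertools import chain


def map_phase(names):
    # Emit one (badge_name, 1) pair per record, as a mapper would.
    for name in names:
        yield name, 1


def sum_by_key(pairs):
    # Sum the values of each key; the same logic can serve as combiner and reducer.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Two hypothetical input splits, each handled by its own mapper.
split_a = ['Editor', 'Editor', 'Informed']
split_b = ['Editor', 'Informed', 'Informed']

# Without a combiner, every mapper pair is shuffled to the reducers.
shuffled_plain = list(chain(map_phase(split_a), map_phase(split_b)))

# With a combiner, each split is pre-aggregated before the shuffle.
shuffled_combined = sum_by_key(map_phase(split_a)) + sum_by_key(map_phase(split_b))

# The reducer produces the same totals either way, from fewer pairs.
final_counts = sum_by_key(shuffled_combined)
print(len(shuffled_plain), len(shuffled_combined), final_counts)
```

Because addition is associative and commutative, summing partial sums gives the same result as summing the raw ones, which is why the combiner and the reducer can share their logic in a counting job.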

NOTE: Before working on the MapReduce job, try to achieve the same result in pure Python, using only a single file, `0.xml` (the cell below downloads it into `resources/`).

As an extra, create `BadgeCountPerUserJob`, which counts each user's badges by badge name.
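The pure-Python warm-up mentioned in the note could look roughly like the sketch below. It assumes that every record in `0.xml` sits on its own line as a `<row ... />` element with `UserId` and `Name` attributes; that layout is an assumption, so check it against the actual file. The helper names and the sample lines are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def parse_row(line):
    # Return the attributes of a <row .../> line, or None for any other line.
    line = line.strip()
    if not line.startswith('<row'):
        return None
    return ET.fromstring(line).attrib


def count_badges(lines):
    # Tally badges overall and per user in a single pass.
    per_badge = Counter()
    per_user = defaultdict(Counter)
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        per_badge[row['Name']] += 1
        per_user[row['UserId']][row['Name']] += 1
    return per_badge, per_user


# Hypothetical sample in the assumed format. For the real data, pass an open
# file instead, e.g. open('resources/0.xml', encoding='utf-8').
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<badges>',
    '  <row Id="1" UserId="10" Name="Editor" Date="2008-09-15T08:55:03.923" />',
    '  <row Id="2" UserId="10" Name="Informed" Date="2008-09-15T08:55:04.000" />',
    '  <row Id="3" UserId="11" Name="Editor" Date="2008-09-16T10:00:00.000" />',
    '</badges>',
]
per_badge, per_user = count_badges(sample)
print(per_badge.most_common())
print({user: sorted(counts.items()) for user, counts in per_user.items()})
```

Parsing line by line mirrors how a mapper receives its input one line at a time, so `parse_row` may carry over to the `mrjob` version with little change.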

In [7]:
%%bash

# -- run this cell to fetch the data
wget -P resources https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-badges/0.xml

In [107]:
%%file tests.py

import json

# from exercise import BadgeCountJob
from answer import BadgeCountJob

# from exercise import BadgeCountPerUserJob
from answer import BadgeCountPerUserJob


def test_badge_count_job():
    job = BadgeCountJob(args=['./resources/0.xml'])

    # -- collect into a list so that the uniqueness check below is meaningful
    results = []
    with job.make_runner() as runner:
        runner.run()
        for key, value in job.parse_output(runner.cat_output()):
            results.append((key, value))

    # -- all records should be unique
    assert len(results) == len(set(results))
    
    results = sorted(results, key=lambda x: (x[1], x[0]), reverse=True)

    # -- top 5 should be ok
    assert results[:5] == [
        ('Popular Question', 14687), 
        ('Informed', 9557), 
        ('Notable Question', 7466), 
        ('Editor', 7230), 
        ('Yearling', 5903),
    ]
    # -- bottom 5 should be ok
    assert results[-5:][::-1] == [
        ('Census', 1), 
        ('activeadmin', 1), 
        ('ajax', 1), 
        ('alembic', 1), 
        ('amazon-web-services', 1)
    ]
    
    
def test_badge_per_user_count_job():
    job = BadgeCountPerUserJob(args=['./resources/0.xml'])

    results = []    
    with job.make_runner() as runner:
        runner.run()
        for user_id, name_counts in job.parse_output(runner.cat_output()):
            results.append((user_id, name_counts))

    # -- expect exactly one output record per user
    assert len(results) == 80987

    results = {user_id: name_counts for user_id, name_counts in results}
    assert results['1000090'] == [['Necromancer', 1]]
    assert set(tuple(r) for r in results['217408']) == set([
        ('Nice Answer', 11), 
        ('Enlightened', 8),         
        ('Good Answer', 6), 
        ('Announcer', 2),
        ('Great Answer', 1), 
        ('Guru', 1),        
        ('Popular Question', 1),         
        ('Revival', 1),
    ])

Overwriting tests.py


In [108]:
!py.test -s tests.py

platform linux -- Python 3.6.5, pytest-3.8.2, py-1.6.0, pluggy-0.7.1
rootdir: /home/maciej/projects/learning-big-data, inifile: pytest.ini
plugins: pythonpath-0.7.1, mock-0.10.1, cov-2.5.1
collected 2 items

tests.py ..



## Your solution

In [113]:
%%file exercise.py

from mrjob.job import MRJob
import xmltodict


class BadgeCountJob(MRJob):
    
    def mapper(self, _, line):
        yield 'a', 'a'
        
    def combiner(self, badge_name, counts):
        yield 'a', 'a'
        
    def reducer(self, badge_name, counts):
        yield 'a', 'a'   
        
        
class BadgeCountPerUserJob(MRJob):        
    def mapper(self, _, line):
        yield 'a', 'a'
        
    def combiner(self, user_id, name_counts):
        yield 'a', 'a'
        
    def reducer(self, user_id, name_counts):        
        yield 'a', 'a' 

Writing exercise.py


## Answer

In [112]:
# !cat answer.py