## TASK

Create a MapReduce job that counts each Stack Overflow badge per user. If possible, use a combiner as an optimization.

NOTE: Before writing the MapReduce job, try to get the same result in pure Python on a single file, `0.xml` (available in the current directory).
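One possible pure-Python warm-up, as a minimal sketch. It assumes every line of the file is a single `<row ... />` element, and it uses the standard-library `xml.etree.ElementTree` instead of `xmltodict` so that it runs without extra dependencies. The `SAMPLE_LINES` rows and the `count_badges_per_user` helper are hypothetical; the attribute names follow the lowercased keys used elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical rows that mimic the assumed layout of `0.xml` (one <row/> per line).
SAMPLE_LINES = [
    '<row Id="1" UserId="10" Name="Teacher" Class="3" TagBased="False" />',
    '<row Id="2" UserId="20" Name="Supporter" Class="3" TagBased="False" />',
    '<row Id="3" UserId="10" Name="Informed" Class="3" TagBased="False" />',
    '<row Id="4" UserId="10" Name="Teacher" Class="3" TagBased="False" />',
]


def count_badges_per_user(lines):
    """Return {user_id: Counter({badge_name: count})} for an iterable of XML rows."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line.startswith('<row'):
            continue  # skip blank lines and anything that is not a row element
        # Lowercase the attribute names so the casing in the file does not matter.
        attrs = {k.lower(): v for k, v in ET.fromstring(line).attrib.items()}
        counts[attrs['userid']][attrs['name']] += 1
    return counts


badge_counts = count_badges_per_user(SAMPLE_LINES)
print(dict(badge_counts['10']))
```

To run it on the real file, pass an open text file instead of the sample list, for example `count_badges_per_user(open('0.xml', encoding='utf-8'))`.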

In [1]:
from glob import glob
from uuid import uuid4

import requests
import xmltodict

from job import Job


get_stackoverflow_badges_uri = (
    'https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-badges/{}'.format)

## Record Reader

In [2]:
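def record_reader(line):
    # Each input line is a single `<row ... />` element; xmltodict prefixes
    # attribute names with '@', which is stripped (and lowercased) below.
    record = dict(xmltodict.parse(line.decode('utf-8'))['row'])
    yield (
        record['@Id'],
        {k.replace('@', '').lower(): v for k, v in record.items()},
    )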

In [3]:
# -- test `record_reader`
response = requests.get(get_stackoverflow_badges_uri('0.xml'), stream=True)

records = []
for line in response.iter_lines():
    if line:
        records.append(next(record_reader(line)))
        
print(records[:2])

[('26066242', {'id': '26066242', 'userid': '8125167', 'name': 'Supporter', 'date': '2017-11-28T19:34:25.047', 'class': '3', 'tagbased': 'False'}), ('26066243', {'id': '26066243', 'userid': '9006638', 'name': 'Supporter', 'date': '2017-11-28T19:34:25.047', 'class': '3', 'tagbased': 'False'})]


## Mapper

In [4]:
def mapper(key, value):
    # Re-key each record by user so that all badges of one user reach the same reducer.
    yield (value['userid'], value['name'])

In [5]:
# -- test mapper
[next(mapper(key, value)) for key, value in records[:10]]

[('8125167', 'Supporter'),
 ('9006638', 'Supporter'),
 ('4892968', 'Supporter'),
 ('3204673', 'Supporter'),
 ('1108484', 'Taxonomist'),
 ('3203282', 'Teacher'),
 ('3926187', 'Teacher'),
 ('4134228', 'Teacher'),
 ('8474041', 'Teacher'),
 ('9019981', 'Informed')]

## Reducer

In [6]:
def reducer(key, values):
    # Tally how many times each badge name occurs for this user.
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1

    yield (key, counts)

In [7]:
# -- test reducer
next(reducer('9019981', ['Teacher', 'Informed', 'Teacher']))

('9019981', {'Teacher': 2, 'Informed': 1})
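A combiner for this job could pre-aggregate badge names on the mapper side, so that less data goes through the shuffle. The sketch below is only one way to do it: `combiner` and `merging_reducer` are hypothetical names, and whether `Job` accepts a combiner argument is not shown in this notebook. Note that once a combiner emits partial count dicts, the reducer has to merge dicts instead of counting raw names.

```python
from collections import Counter


def combiner(key, values):
    """Pre-aggregate one mapper's badge names for a user into partial counts."""
    yield (key, dict(Counter(values)))


def merging_reducer(key, values):
    """Merge the partial count dicts produced by several combiners."""
    total = Counter()
    for partial in values:
        total.update(partial)  # adds counts, does not overwrite them
    yield (key, dict(total))


# Two mappers saw badges for the same user; each combiner emits partial counts.
partial_a = next(combiner('9019981', ['Teacher', 'Informed']))[1]
partial_b = next(combiner('9019981', ['Teacher']))[1]
merged = next(merging_reducer('9019981', [partial_a, partial_b]))
print(merged)
```

The merged result matches what the plain reducer returns for the same three badge names, which is the property a combiner must preserve.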

In [8]:
Job(
    input_uris=[
        get_stackoverflow_badges_uri('0.xml'),
        get_stackoverflow_badges_uri('1.xml'),
        get_stackoverflow_badges_uri('2.xml'),
    ],
    record_reader=record_reader,
    mapper=mapper,
    reducer=reducer,
).run()




JOB ID: a9aeb394-84b5-443c-8a05-25f12950db37

INPUT SIZE: 34635070

OUTPUT PATH: /home/jovyan/work/map_reduce/.outputs/a9aeb394-84b5-443c-8a05-25f12950db37

EXECUTION TIME: 16.247

MAX SHUFFLE SIZE: 11726800

FILES:
+----------------------------+--------------+
|          filename          | size (bytes) |
+----------------------------+--------------+
| mapper_0__partition_0.json | 988954       |
+----------------------------+--------------+
| mapper_0__partition_1.json | 978468       |
+----------------------------+--------------+
| mapper_0__partition_2.json | 989323       |
+----------------------------+--------------+
| mapper_0__partition_3.json | 993619       |
+----------------------------+--------------+
| mapper_1__partition_0.json | 980454       |
+----------------------------+--------------+
| mapper_1__partition_1.json | 978535       |
+----------------------------+--------------+
| mapper_1__partition_2.json | 962044       |
+----------------------------+--------------+
| mapper_1__partition_3.json | 975995       |

In [9]:
!head -n 10 ./.outputs/a9aeb394-84b5-443c-8a05-25f12950db37/reducer_1.json

{"key": "939944", "value": {"Popular Question": 1}}
{"key": "1175296", "value": {"Popular Question": 1, "Caucus": 1}}
{"key": "1130069", "value": {"Popular Question": 3}}
{"key": "2798506", "value": {"Popular Question": 1}}
{"key": "4319644", "value": {"Student": 1}}
{"key": "5545371", "value": {"Student": 1}}
{"key": "5545153", "value": {"Supporter": 1}}
{"key": "323767", "value": {"Yearling": 2}}
{"key": "1313030", "value": {"Yearling": 1}}
{"key": "888068", "value": {"Yearling": 1}}
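Each reducer file appears to hold one JSON object per line, as in the output above. A minimal sketch for merging such lines and finding the user with the most badges could look like this. The helper names are hypothetical, and the sample lines are copied from the output above rather than read from disk.

```python
import json
from collections import Counter, defaultdict


def read_lines(paths):
    """Yield every line from the given reducer output files."""
    for path in paths:
        with open(path, encoding='utf-8') as handle:
            yield from handle


def merge_reducer_lines(lines):
    """Merge JSON lines of the form {"key": user_id, "value": {badge: count}}."""
    merged = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        merged[record['key']].update(record['value'])
    return merged


def top_user(merged):
    """Return (user_id, badge_counts) for the user with the most badges overall."""
    return max(merged.items(), key=lambda item: sum(item[1].values()))


sample = [
    '{"key": "1175296", "value": {"Popular Question": 1, "Caucus": 1}}',
    '{"key": "1130069", "value": {"Popular Question": 3}}',
    '{"key": "323767", "value": {"Yearling": 2}}',
]
merged = merge_reducer_lines(sample)
user_id, badges = top_user(merged)
print(user_id, dict(badges))
```

For the real job output, the lines could come from `read_lines(glob('./.outputs/<job id>/reducer_*.json'))`; the `reducer_*.json` pattern is an assumption based on the single file name shown above.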


## SELF EXPLORATION
- Read and merge the content returned by all reducers.
- Which user has the most badges?
- Does the algorithm work correctly?
- Check whether a combiner brings any benefit, even in the local setup. What could a combiner look like?
- How well does it scale? How many mappers would we need to process 1 TB of data in, say, 1 day?
- How many records would that 1 TB contain?
- What is the speed of the homemade framework above, in `MB/s` and `records/s`?
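A starting point for the speed and scaling questions, as a back-of-the-envelope sketch. It assumes that `INPUT SIZE` is in bytes and `EXECUTION TIME` is in seconds (the job output does not state the units), that 1 MB = 10^6 bytes, and that throughput scales linearly. The record count is not reported by the job, so it stays a parameter. Whether the mappers ran in parallel is not shown either, so a "worker" here means one parallel copy of the whole job.

```python
import math

INPUT_BYTES = 34_635_070  # "INPUT SIZE" from the job output, assumed to be bytes
SECONDS = 16.247          # "EXECUTION TIME" from the job output, assumed to be seconds


def throughput_mb_per_s(input_bytes, seconds):
    """Throughput in MB/s, with 1 MB = 10**6 bytes."""
    return input_bytes / seconds / 1e6


def records_per_second(record_count, seconds):
    """Throughput in records/s; the record count has to be measured separately."""
    return record_count / seconds


def records_in(total_bytes, input_bytes, record_count):
    """Extrapolate a record count, assuming the same average record size."""
    return total_bytes / (input_bytes / record_count)


def workers_needed(total_bytes, deadline_seconds, bytes_per_second_per_worker):
    """Parallel workers needed to finish in time, assuming linear scaling."""
    return math.ceil(total_bytes / (deadline_seconds * bytes_per_second_per_worker))


mb_per_s = throughput_mb_per_s(INPUT_BYTES, SECONDS)
workers = workers_needed(1e12, 24 * 60 * 60, INPUT_BYTES / SECONDS)
print(round(mb_per_s, 2), workers)
```

Linear scaling is optimistic: network transfer, shuffle size and skewed keys would all push the real number of workers higher.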