## TASK

Create a MapReduce job that counts how many times each Stack Overflow badge appears in the analysed files. If possible, leverage the combiner optimization technique.

NOTE: Before working on the MapReduce job, try to achieve the same result in pure Python on a single file, `0.xml` (available in the current directory).
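A minimal pure-Python sketch of that warm-up could look like the following. It assumes the file holds one `<row ... />` element per line and that the badge name lives in a `Name` attribute; the helper `count_badges` and the sample rows are illustrative, not part of the exercise materials.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_badges(lines):
    """Count badge names in an iterable of `<row ... />` XML lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and anything that is not a <row> element
        # (for example an XML declaration or a wrapping tag).
        if not line.startswith('<row'):
            continue
        # Assumption: the badge name is stored in the `Name` attribute.
        counts[ET.fromstring(line).attrib['Name']] += 1
    return counts


# Hypothetical sample rows, made up for illustration only.
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<row Id="1" UserId="10" Name="Supporter" Class="3" TagBased="False" />',
    '<row Id="2" UserId="11" Name="Teacher" Class="3" TagBased="False" />',
    '<row Id="3" UserId="12" Name="Supporter" Class="3" TagBased="False" />',
    '',
]

badge_counts = count_badges(sample)
print(badge_counts.most_common())
```

For the real file, the same function can be fed an open file handle, e.g. `count_badges(open('0.xml', encoding='utf-8'))`, since iterating a text file yields its lines.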

In [1]:
from glob import glob
from uuid import uuid4

import requests
import xmltodict

from job import Job


get_stackoverflow_badges_uri = (
    'https://s3.eu-central-1.amazonaws.com/learning.big.data/stackoverflow-badges/{}'.format)

## Records Reader

In [2]:
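def record_reader(line):
    """Parse one raw `<row .../>` line into an (id, record) pair."""
    record = dict(xmltodict.parse(line.decode('utf-8'))['row'])
    yield (
        record['@Id'],
        # strip the '@' attribute prefix added by xmltodict and lowercase the keys
        {k.replace('@', '').lower(): v for k, v in record.items()},
    )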

In [3]:
response = requests.get(
    get_stackoverflow_badges_uri('0.xml'), 
    stream=True)

records = []
for line in response.iter_lines():
    if line:
        records.append(next(record_reader(line)))
        
print(records[:2])

[('26066242', {'id': '26066242', 'userid': '8125167', 'name': 'Supporter', 'date': '2017-11-28T19:34:25.047', 'class': '3', 'tagbased': 'False'}), ('26066243', {'id': '26066243', 'userid': '9006638', 'name': 'Supporter', 'date': '2017-11-28T19:34:25.047', 'class': '3', 'tagbased': 'False'})]


## Mapper

In [4]:
def mapper(key, value):
    yield (value['name'], 1)

In [5]:
# -- test mapper
[next(mapper(key, value)) for key, value in records[:10]]

[('Supporter', 1),
 ('Supporter', 1),
 ('Supporter', 1),
 ('Supporter', 1),
 ('Taxonomist', 1),
 ('Teacher', 1),
 ('Teacher', 1),
 ('Teacher', 1),
 ('Teacher', 1),
 ('Informed', 1)]

## Reducer

In [6]:
def reducer(key, values):
    yield (key, sum(values))

In [7]:
# -- test reducer
next(reducer('Supporter', [1, 3, 2, 1]))

('Supporter', 7)
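Because summation is associative and commutative, the same `reducer` can also be applied to each mapper's local output before the shuffle, which is what a combiner does. The toy single-process simulation below is only a sketch of that idea: `run_local`, `group_by_key` and the input splits are hypothetical and do not reflect how the `Job` class is implemented.

```python
from collections import defaultdict


def mapper(key, value):
    yield (value['name'], 1)


def reducer(key, values):
    yield (key, sum(values))


def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped


def run_local(splits, use_combiner):
    """Toy run; returns (final counts, number of pairs sent to the shuffle)."""
    shuffled = []
    for split in splits:  # each split plays the role of one mapper's input
        mapped = [pair for key, value in split for pair in mapper(key, value)]
        if use_combiner:
            # Pre-aggregate locally with the reducer before shuffling.
            mapped = [out
                      for k, vs in group_by_key(mapped).items()
                      for out in reducer(k, vs)]
        shuffled.extend(mapped)
    result = dict(out
                  for k, vs in group_by_key(shuffled).items()
                  for out in reducer(k, vs))
    return result, len(shuffled)


# Hypothetical input splits, made up for illustration.
splits = [
    [('1', {'name': 'Supporter'}), ('2', {'name': 'Supporter'}),
     ('3', {'name': 'Teacher'})],
    [('4', {'name': 'Teacher'}), ('5', {'name': 'Supporter'})],
]

plain, plain_shuffle = run_local(splits, use_combiner=False)
combined, combined_shuffle = run_local(splits, use_combiner=True)
print(plain == combined, plain_shuffle, combined_shuffle)
```

The final counts are identical either way; only the number of pairs crossing the shuffle shrinks. A reducer that computed, say, an average could not be reused as a combiner this directly.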

## Job

In [8]:
Job(
    input_uris=[
        get_stackoverflow_badges_uri('{}.xml'.format(i)) for i in range(8)
    ],
    record_reader=record_reader,
    mapper=mapper, 
    combiner=reducer,
    reducer=reducer,
    config={
        'num_of_mappers': 4,
        'num_of_reducers': 4,
    }).run()




JOB ID: 7701992a-1f43-40b4-8324-7754461c725e

INPUT SIZE: 92573565

OUTPUT PATH: /home/jovyan/work/map_reduce/.outputs/7701992a-1f43-40b4-8324-7754461c725e

EXECUTION TIME: 31.186

MAX SHUFFLE SIZE: 78683

FILES:
+----------------------------+--------------+
|          filename          | size (bytes) |
| mapper_0__partition_0.json | 2735         |
+----------------------------+--------------+
| mapper_0__partition_1.json | 2471         |
+----------------------------+--------------+
| mapper_0__partition_2.json | 2888         |
+----------------------------+--------------+
| mapper_0__partition_3.json | 2191         |
+----------------------------+--------------+
| mapper_1__partition_0.json | 2825         |
+----------------------------+--------------+
| mapper_1__partition_1.json | 3154         |
+----------------------------+--------------+
| mapper_1__partition_2.json | 2494         |
+----------------------------+--------------+
| mapper_1__partition_3.json | 2586         |
+-

In [9]:
!head -n 10 .outputs/7701992a-1f43-40b4-8324-7754461c725e/mapper_0__partition_0.json

{"key": "Informed", "values": [9557]}
{"key": "Tumbleweed", "values": [3461]}
{"key": "Yearling", "values": [5903]}
{"key": "Custodian", "values": [4357]}
{"key": "Autobiographer", "values": [2984]}
{"key": "Famous Question", "values": [2567]}
{"key": "Favorite Question", "values": [151]}
{"key": "Necromancer", "values": [2025]}
{"key": "Notable Question", "values": [7466]}
{"key": "Revival", "values": [1104]}


In [10]:
!head -n 10 .outputs/7701992a-1f43-40b4-8324-7754461c725e/reducer_1.json

{"key": "Supporter", "value": 28778}
{"key": "ios5", "value": 2}
{"key": "Citizen Patrol", "value": 4500}
{"key": "Editor", "value": 52131}
{"key": "Popular Question", "value": 93792}
{"key": "Student", "value": 47555}
{"key": "Good Answer", "value": 10331}
{"key": "Announcer", "value": 3551}
{"key": "Promoter", "value": 1460}
{"key": "Populist", "value": 499}
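The reducer files shown above hold one JSON object per line with `key` and `value` fields, so merging them is mostly a matter of reading every `reducer_*.json` file. The sketch below assumes that naming pattern holds for all reducers; `merge_reducer_outputs` is a hypothetical helper, and the demo data is made up rather than taken from the real job output.

```python
import json
import os
import tempfile
from glob import glob


def merge_reducer_outputs(output_dir):
    """Merge all reducer_*.json files; also report keys seen more than once."""
    merged, duplicates = {}, set()
    for path in sorted(glob(os.path.join(output_dir, 'reducer_*.json'))):
        with open(path, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                key, value = record['key'], record['value']
                if key in merged:
                    # A key in two reducer outputs means the partitioning
                    # did not route it to a single reducer.
                    duplicates.add(key)
                    merged[key] += value
                else:
                    merged[key] = value
    return merged, duplicates


# Demo on made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        'reducer_0.json': [{'key': 'A', 'value': 2}, {'key': 'B', 'value': 5}],
        'reducer_1.json': [{'key': 'C', 'value': 1}],
    }
    for name, rows in demo.items():
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            for row in rows:
                f.write(json.dumps(row) + '\n')
    merged, duplicates = merge_reducer_outputs(tmp)

print(merged, duplicates)
```

On the real run, the directory to pass would be the job's `OUTPUT PATH`; an empty `duplicates` set indicates that every key ended up in exactly one reducer.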


## SELF EXPLORATION
- read & merge the content returned by all reducers
- check whether all the keys were really distributed uniquely among the reducers
- check how even the distribution of data was
- check whether using combiners brought any benefit (even in the local setup)
- how well does it scale? how many mappers would we need to process 1 TB of data in, say, 1 day?
- how many records would we have in that 1 TB?
- what is the `MB/s` and `records/s` speed of the homemade framework above?
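As a starting point for the throughput and scaling questions, the back-of-the-envelope sketch below uses the `INPUT SIZE` and `EXECUTION TIME` reported by the job run above. It rests on several assumptions: the execution time is in seconds, 1 TB means 10^12 bytes, throughput splits evenly across the 4 mappers, and scaling is perfectly linear, which a real setup (network download, shuffle, reducers) will almost certainly not deliver.

```python
import math

INPUT_SIZE_BYTES = 92_573_565  # INPUT SIZE reported by the job run
EXECUTION_TIME_S = 31.186      # EXECUTION TIME, assumed to be in seconds
NUM_MAPPERS = 4                # 'num_of_mappers' from the job config

# Observed end-to-end throughput of the whole job, in bytes per second.
throughput = INPUT_SIZE_BYTES / EXECUTION_TIME_S
# Assumption: every mapper contributes an equal share.
per_mapper = throughput / NUM_MAPPERS

TARGET_BYTES = 10 ** 12          # assumption: 1 TB = 10^12 bytes
TARGET_SECONDS = 24 * 60 * 60    # 1 day

required = TARGET_BYTES / TARGET_SECONDS
# Assumption: linear scaling with the number of mappers.
mappers_needed = math.ceil(required / per_mapper)

print('throughput: {:.2f} MB/s'.format(throughput / 1e6))
print('required:   {:.2f} MB/s'.format(required / 1e6))
print('mappers needed (linear scaling): {}'.format(mappers_needed))
```

The `records/s` figure can be derived the same way once the total record count is known, for example by summing the values of the merged reducer output.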