# MapReduce Using `MRJob`

## A Job Posting Dataset

The sample dataset we will use (`data/job-data/job-data-2018-09-08-00-00-37.txt`) contains job postings from one of the US job search websites. Each row of the file is a JSON document representing one job posting record.

The example below shows a sample job posting from the data file. The sample record has been formatted with 4-space indentation. In the actual file, each record is stored as a JSON document on a single line.

## 1. Protocols For Input & Output

mrjob assumes that all data is newline-delimited bytes. Each job has an *input protocol*, an *output protocol*, and an *internal protocol*.

The default *input* protocol is `RawValueProtocol`, which reads each line as a `str` value with no key (the key is `None`).
The default *output* and *internal* protocols are both `JSONProtocol`, which reads and writes each record as a JSON-encoded key and a JSON-encoded value separated by a tab character.
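
To make the line formats concrete, here is a minimal sketch that imitates the two default protocols using only the standard library. It is not mrjob's implementation: the function names `raw_value_read`, `json_write`, and `json_read` are hypothetical, the sketch works on `str` rather than bytes, and the exact whitespace that mrjob emits inside the JSON may differ.

```python
import json


def raw_value_read(line):
    """Imitate a raw-value read: no key, the whole line is the value."""
    return None, line


def json_write(key, value):
    """Imitate a JSON key/value write: JSON key, a tab, then JSON value."""
    return json.dumps(key) + "\t" + json.dumps(value)


def json_read(line):
    """Imitate a JSON key/value read: split on the first tab, decode both parts."""
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


# The key and value below are made up for illustration.
encoded = json_write(42, {"city": "Boston"})
print(repr(encoded))
print(json_read(encoded))
print(raw_value_read("any text line"))
```

Splitting on the first tab only is safe in this sketch because `json.dumps` escapes tab characters inside strings, so a literal tab can only be the separator.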

The protocols can be changed by overriding the corresponding class attributes: `INPUT_PROTOCOL`, `INTERNAL_PROTOCOL`, and `OUTPUT_PROTOCOL`.

For more information, see [Protocols](https://pythonhosted.org/mrjob/guides/writing-mrjobs.html#job-protocols).

`JSONValueProtocol` encodes the value as JSON and discards the key (on input, the key is read as `None`). To load the job posting dataset, we can set `INPUT_PROTOCOL = JSONValueProtocol`, which automatically loads each input line as a Python `dict` object.
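
As a rough illustration of what this means for the mapper, the sketch below imitates the read step with the standard library. The function `json_value_read` is a hypothetical stand-in, not mrjob's code, and the sample record is invented: only the field names `jobId` and `jobLocation` come from this tutorial, while the values and the shape of the location are assumptions.

```python
import json


def json_value_read(line):
    """Imitate a JSON value-only read: the key is None, the line is decoded as JSON."""
    return None, json.loads(line)


# Invented record for illustration; real postings will have different content.
sample_line = '{"jobId": 1, "jobLocation": {"city": "Example City"}}'

key, value = json_value_read(sample_line)
print(key)                     # the discarded key
print(type(value).__name__)    # the decoded value is a dict
print(value.get("jobId"))      # fields are then available by name
```

Because the value arrives already decoded, the mapper can call `value.get(...)` directly instead of parsing the line itself.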

The example below loads the data in the `mapper` and emits key-value pairs in which the key is *jobId* (`int`) and the value is *jobLocation* (JSON). No `reducer` is provided; jobs of this type are sometimes called *map-only* jobs.

In [3]:
%%file mr-jobs/protocol-map-only.py
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol


class MRJobLocation(MRJob):
    """Map-only job that emits (jobId, jobLocation) for each job posting."""

    # Decode each input line as JSON; the key passed to the mapper is None.
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, value):
        # value is the decoded posting (a dict); missing fields yield None.
        yield value.get('jobId'), value.get('jobLocation')


if __name__ == '__main__':
    MRJobLocation.run()

Writing mr-jobs/protocol-map-only.py


In [4]:
!python3 mr-jobs/protocol-map-only.py ../data/job-data/job-data-2018-09-08-00-00-37.txt --output-dir test-out

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/protocol-map-only.hadoop.20180913.161010.645216
job output is in test-out
Removing temp directory /tmp/protocol-map-only.hadoop.20180913.161010.645216...
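
The log reports that the job output is in `test-out`. A minimal sketch for loading it back into Python follows. It assumes that the directory contains one or more `part-*` files and that each line holds a JSON key and a JSON value separated by a tab (the default output protocol described earlier). The file naming is an assumption that the log does not show, and the helper names `parse_output_line` and `read_job_output` are hypothetical.

```python
import glob
import json
import os


def parse_output_line(line):
    """Split one output line on the first tab and decode the key and the value."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def read_job_output(output_dir):
    """Collect (key, value) pairs from every part-* file in output_dir."""
    pairs = []
    # Assumption: output files are named part-00000, part-00001, and so on.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    pairs.append(parse_output_line(line))
    return pairs


if __name__ == "__main__":
    # Print the first few pairs, if the directory exists and has output.
    for job_id, location in read_job_output("test-out")[:5]:
        print(job_id, location)
```

If the directory is missing or empty, `read_job_output` returns an empty list rather than raising an error, which keeps the sketch safe to run before the job has produced anything.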
