# A more pythonic MapReduce

With Hadoop Streaming we switched MapReduce from Java to Python.

How could we improve further?

What problems do we deal with?

We have to deal with HDFS directly

* move input files
* recover outputs
* humans make mistakes

Debugging is not easy

* logs via jobtracker
* errors are Java stack traces, often unrelated to the real problem

We have to write two different files
* one for the mapper
* one for the reducer
* not a **module**

Is there a part of the Python language that can help represent a MapReduce task?

In [None]:
# A python class

class myclass(object):
    
    a_property = 42
    
    def a_method(self):
        pass
    def another_method(self, var):
        print(var)


In [None]:
# Create instance of our class
instance = myclass()
# Use it
instance.a_method()
instance.another_method("test")
instance.a_property

In [None]:
class MapReduce(object):
    """ A MapReduce class prototype """
    
    def mapper(self, line):
        pass
    def reducer(self, sorted_line):
        pass

# MRJob 
A more pythonic MapReduce library from Yelp

<img src='https://avatars1.githubusercontent.com/u/49071?v=3&s=400' width=300>

> “Easiest route to Python programs that run on Hadoop”

Install with: 
```bash
pip install mrjob
```

**Running modes**
* Test your code locally without installing Hadoop 
* or run it on a cluster of your choice!
    - Integrates with Amazon **Elastic MapReduce** (EMR)
    - same code with local, Hadoop, EMR
    - as easy to run your job in the cloud as on your laptop

### How does MrJob work?

* Python module built **on top of Hadoop Streaming**
    - HS jar opens a subprocess to your code
    - sends it input via stdin
    - gathers results via stdout
* Wraps HDFS pre- and post-processing if Hadoop is available
* Provides a consistent interface across every environment it supports
* Automatically serializes/deserializes the data flowing in and out of each task
    - e.g. JSON: json.loads() and json.dumps()
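
To make the last point concrete, the sketch below mimics how a key-value pair could travel between tasks as one line of text: key and value are each JSON-encoded and joined by a tab. The helper names `encode_pair` and `decode_line` are hypothetical, and this is not mrjob's actual code, only an illustration of the idea using `json.dumps()` and `json.loads()`.

```python
import json

def encode_pair(key, value):
    """Turn a key-value pair into one tab-separated line of JSON."""
    return json.dumps(key) + "\t" + json.dumps(value)

def decode_line(line):
    """Inverse of encode_pair: parse a line back into (key, value)."""
    # A tab inside a JSON string is escaped, so the first real tab is the separator
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

line = encode_pair("hello", 1)
print(line)
print(decode_line(line))  # ('hello', 1)
```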

## Getting hands dirty

In [None]:
from mrjob.job import MRJob

A job is defined by a class that extends `MRJob`

* It contains methods that define the steps of a Hadoop job
* A “step” consists of a mapper, a combiner, and a reducer.
* All of those are optional, though you must have at least one.


In [None]:
from mrjob.step import MRStep  # needed for steps() below

class myjob(MRJob):
    def mapper(self, _, line):
        pass
    def combiner(self, key, values):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer) ]
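
The role of the combiner is easier to see with a small simulation in plain Python, without mrjob. In this sketch the combiner pre-aggregates the pairs emitted by one mapper, so fewer pairs have to travel to the reducer. The helper `run_combiner` and the sample line are made up for illustration; mrjob and Hadoop do the grouping for you.

```python
from itertools import groupby
from operator import itemgetter

def mapper(_, line):
    for word in line.split():
        yield word.lower(), 1

def combiner(word, counts):
    # Same signature idea as in the class above: one key, an iterator of values
    yield word, sum(counts)

def run_combiner(pairs):
    """Group one mapper's output by key and feed each group to the combiner."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield from combiner(key, (value for _, value in group))

mapped = list(mapper(None, "to be or not to be"))
combined = list(run_combiner(mapped))
print(len(mapped), "pairs from the mapper")
print(len(combined), "pairs after the combiner:", combined)
```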

## WordCount

### Mapper

The mapper() method takes a key and a value as args

```
    def mapper(self, _, line):
        pass
```

* E.g. key is ignored and a single line of text input is the value

* Yields as many key-value pairs as it likes

### Yield?

**Warning**: `yield` != `return`

> A function that uses `yield` returns a generator: an object you usually consume in a loop such as `for i in generator: print(i)`.

Example
```python
def mygen():
    for i in range(1, 10):
        # THIS IS WHAT HAPPENS INSIDE THE MAPPER/REDUCER
        yield i, "value"

for key, value in mygen():
    print(key, value)
```
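
A further small sketch (the function name `countdown` is only for illustration) shows the main difference from `return`: calling a generator function does not run its body. The body runs piece by piece, each time the next value is requested.

```python
def countdown(n):
    print("body started")  # runs only when the first value is requested
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)   # nothing is printed here
first = next(gen)    # prints "body started", then gives 3
rest = list(gen)     # consumes the remaining values
print(first, rest)   # 3 [2, 1]
```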

### Reducer

The reducer() method takes a key and an iterator of values

```
    def reducer(self, key, values):
        pass
```

* Also yields as many key-value pairs as it likes
    * E.g. it sums the values for each key
* Depending on the job, these sums can represent, for example, the numbers of characters, words, and lines in the initial input
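
What happens between the two methods can be imitated locally in a few lines. The following is a simplified sketch in plain Python, not mrjob's implementation: it maps every line, sorts the pairs by key, groups them, and hands each key with an iterator of its values to the reducer. `run_local` and the sample lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def mapper(_, line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, occurrences):
    yield word, sum(occurrences)

def run_local(lines):
    # Map phase: collect every pair emitted for every line
    pairs = [pair for line in lines for pair in mapper(None, line)]
    # Shuffle/sort phase: bring equal keys next to each other
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per key, with an iterator of its values
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, (value for _, value in group))

result = dict(run_local(["the cat", "The dog and the cat"]))
print(result)
```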



## Let's write our job

In [None]:
# Little configuration

mydir = "mymrjob"
%env mydir = $mydir
myinput = "./data/txt/twolines.txt"
%env myinput $myinput
myscript = mydir + "/wordcount.py"
%env myscript $myscript

%system mkdir -p $mydir
%env myoutput $mydir/out.txt
%env mylog $mydir/out.log

Create the job file

In [None]:
%%writefile $myscript

from mrjob.job import MRJob
class MRWordCount(MRJob):
    """ Wordcount with MapReduce in a pythonic way"""

    def mapper(self, key, line):
        for word in line.split(' '):
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Note!
    
Thanks to MrJob and generators, the reducer does not have to check when the key changes: it is called once per key, with all the values for that key.
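
For comparison, a reducer written for plain Hadoop Streaming receives a flat, sorted stream of lines and has to detect key changes itself. The sketch below is a hypothetical reconstruction of that bookkeeping, written as a function over an iterable of `word<TAB>count` lines so it can be tried without Hadoop.

```python
def streaming_reducer(sorted_lines):
    """Sum counts per word from sorted 'word<TAB>count' lines."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            # The key changed: emit the finished word, then start a new one
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, 0
        current_count += int(count)
    # Do not forget the last word
    if current_word is not None:
        yield current_word, current_count

lines = ["cat\t1\n", "cat\t1\n", "dog\t1\n"]
print(list(streaming_reducer(lines)))  # [('cat', 2), ('dog', 1)]
```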

## I/O

You can pass input via stdin but be aware that mrjob will just dump it to a file first:
```bash
$ python my_job.py < input.txt
```

You can pass multiple input files, mixed with stdin (using the `-` character)
```bash
$ python my_job.py input1.txt input2.txt - < input3.txt
```

By default, output will be written to stdout.
```bash
$ python my_job.py input.txt
```


In [None]:
# Execute MrJob
! python $myscript $myinput 1> $myoutput 2> $mylog

In [None]:
%cat $myoutput

In [None]:
%cat $mylog

Here's an empty **template** that you can copy and paste

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
""" MapReduce easily with Python """

from mrjob.job import MRJob
from mrjob.step import MRStep

class job(MRJob):
    def mapper(self, _, line):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ 
            MRStep(mapper=self.mapper, reducer=self.reducer)
        ]

if __name__ == "__main__":
    job.run()
    
```

With `MRStep` you define the steps (iterations) of a job

## Exercise

Convert your vowel-counting exercise on the Divine Comedy into an MrJob class

## Running on Hadoop

By default, mrjob will run your job in your local (normal) environment, in a single Python process

You change the way the job is run with the `-r/--runner` option:

```
-r inline, -r local, -r hadoop, or -r emr
```

Also use the `--verbose` option to show all the steps

So we just need to add `-r hadoop`.

<small>Note: The `capture` *magic* is another way we could handle output.</small>

In [None]:
%%capture hadoop_out
# Execute MrJob on Hadoop and let the magic handle outputs
! python $myscript $myinput -r hadoop 2> $mylog

In [None]:
hadoop_out.show()

In [None]:
%cat $mylog

### Protocols

http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols



MRJob adds many conveniences!

Some of them may turn out to be expensive for heavy computing.

For example, the data transferred between MapReduce tasks is serialized in JSON format.

In [None]:
import mrjob.job
import mrjob.protocol

class MyMRJob(mrjob.job.MRJob):

    # these are the defaults
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol


Let's see what it means

In [None]:
%%writefile protocol.py
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, key, line):
        for word in line.split(' '):
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

In [None]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null

In [None]:
%%writefile protocol.py
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class MRWordCount(MRJob):

    # Optimization on internal protocols
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = PickleProtocol
    
    def mapper(self, key, line):
        for word in line.split(' '):
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

In [None]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null
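
The practical difference between the two formats can be previewed with the standard library alone. This is only a sketch of the underlying serializers, not of mrjob's protocol classes: JSON is readable text but turns Python tuples into lists, while pickle is a binary format that preserves Python types. The sample value is made up.

```python
import json
import pickle

value = ("word", 3)  # a tuple, as a job might yield

as_json = json.dumps(value)      # text
as_pickle = pickle.dumps(value)  # bytes

print(as_json)                   # ["word", 3]
print(json.loads(as_json))       # ['word', 3]  (the tuple became a list)
print(pickle.loads(as_pickle))   # ('word', 3)  (the tuple is preserved)
print(type(as_json).__name__, type(as_pickle).__name__)  # str bytes
```

Output written with a pickle-based protocol is therefore meant for other Python programs rather than for reading by eye.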

# Getting deeper

Let's try this together step by step

In [None]:
myfile = "./data/txt/prince.txt"

How does the wordcounter work with this file?

In [None]:
%%writefile job.py
from mrjob.job import MRJob

class SomeJob(MRJob):
    """ Counting the words """
    
    def mapper(self, key, line):
        for word in line.split():
            # Remove punctuation around each word, not only at the line ends
            word = word.strip('.;,()').lower()
            if word:
                yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    SomeJob.run()

In [None]:
! python job.py $myfile 1> out.txt 2> err.txt

In [None]:
import os

def get_output(outfile):
    """ Get Hadoop results from MrJob output file """
    
    data = {}

    if os.path.exists(outfile):
        with open(outfile) as out:
            outstring = out.read()
        for line in outstring.split('\n'):
            if line.strip() == '':
                continue
            tmp = line.split('\t')
            data[tmp[0].strip('"')] = int(tmp[1])

    return data


In [None]:
# Recover last cell output
data = get_output("out.txt")

In [None]:
data

In [None]:
def slicedict(mydict, num):
    """ Get at most 'num' (arbitrary) entries from the dict """
    count = 1
    sliced = {}
    for key, value in mydict.items():
        sliced[key] = value
        count += 1
        if count > num:
            break

    return sliced
    

In [None]:
slicedict(data, 10)

In [None]:
def sortdict(mydict):
    """ Sort in descending order by count """
    return sorted(mydict.items(), key=lambda x: x[1], reverse=True) 


In [None]:
sortdict(data)[:5]
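
As an alternative sketch, the standard library's `collections.Counter` offers the same ranking through `most_common()`. The sample counts below are invented; with the real results you would pass the `data` dict instead.

```python
from collections import Counter

sample = {"the": 12, "prince": 7, "and": 9, "fox": 2}

# most_common(n) returns the n (key, count) pairs with the highest counts
top = Counter(sample).most_common(2)
print(top)  # [('the', 12), ('and', 9)]
```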

## More than a single step

mrjob can be configured to run more than one step.

For each step you can specify which parts (mapper, combiner, reducer) have to be executed
and which methods of your class implement them.


In [None]:
def steps(self):
    return [
        MRStep(
            mapper=self.mapper_get_words,
            combiner=self.combiner_count_words,
            reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]
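
To see how the output of one step feeds the next, here is a plain-Python simulation of the two steps above (without the combiner, which is only an optimization). It is a sketch under assumptions: `run_step` is a hypothetical helper, not part of mrjob, and the function bodies are one plausible way to fill in the names used in `steps()`.

```python
from itertools import groupby
from operator import itemgetter

def mapper_get_words(_, line):
    for word in line.split():
        yield word.lower(), 1

def reducer_count_words(word, counts):
    # Use the same key (None) for every pair, so step 2 sees them all together
    yield None, (sum(counts), word)

def reducer_find_max_word(_, count_word_pairs):
    # Tuples compare by count first, so max() picks the most frequent word
    yield max(count_word_pairs)

def run_step(pairs, reducer, mapper=None):
    """Apply an optional mapper, group the pairs by key, then apply the reducer."""
    if mapper is not None:
        pairs = [out for key, value in pairs for out in mapper(key, value)]
    # Keys are all strings or all None here, so str() works as a sort key
    pairs = sorted(pairs, key=lambda kv: str(kv[0]))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, (value for _, value in group))

lines = [(None, "the cat and the dog"), (None, "the end")]
step1 = list(run_step(lines, reducer_count_words, mapper=mapper_get_words))
step2 = list(run_step(step1, reducer_find_max_word))
print(step2)  # [(3, 'the')]
```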

This could be an improvement for our mrjob extension:

- if steps method is provided, override the default written by the extension

A quick note

* With MrJob you cannot connect to a **remote** Hadoop cluster. 
    - Hadoop does not allow job submissions (class or executables) from outside.
* In contrast, EMR on Amazon is accessible from your laptop.
    - Amazon created the [boto api](http://boto.readthedocs.org/en/latest/ref/emr.html) to solve the issue.

### Runners

http://mrjob.readthedocs.org/en/latest/guides/runners.html

You cannot use the programmatic runner functionality in the same file as your job class.


In [None]:
## RUNNER: JUST AN EXAMPLE

# Load class for mapreduce
from mypackage import SomeClassWeWrote
#from job import MRcoverage
from mrjob.util import log_to_stream

if __name__ == '__main__':

    # Create the job object (named `job` so it does not shadow the mrjob package)
    job = SomeClassWeWrote(args=[
        '-r', 'inline',
        #'-r', 'local',
        # '-r', 'hadoop',
        # '--jobconf=mapreduce.job.maps=10',
        # '--jobconf=mapreduce.job.reduces=4'
        #'--jobconf=stream.recordreader.compression=bz2'
        ])

    # Run and handle output
    with job.make_runner() as runner:

        # Redirect hadoop logs to stderr
        log_to_stream("mrjob")
        # Execute the job
        runner.run()

        # Do something with stream output (e.g. file, database, etc.)
        # Note: stream_output() and parse_output_line() are the API of the
        # mrjob version used in this notebook
        for line in runner.stream_output():
            key, value = job.parse_output_line(line)
            print(key, value)

# Recap with Mrjob

You should really read the [documentation of the latest version](http://mrjob.readthedocs.org/en/latest/).

It covers every need without getting too complicated.
There are many other options and advanced behaviours to discover.

<small>
Note 1: We are using the most recent version (*release* **v.0.5.0dev**) because it's the first one to support Python 3.

Note 2: The developers are there to help you; see my case at https://github.com/Yelp/mrjob/issues/1142
</small>



**Pros**

* More documentation than any other framework or library
* Write code in a single class (per Hadoop job)
    * Map and Reduce are single methods
    * Very clean and simple
* Advanced configuration
    * Configure multiple steps
    * Handle command line options inside the python code (see docs)
* Easily wrap input/output 
    * No manual data copy to or from HDFS required
* Hadoop logs and errors go directly into the script output
* **Switch environment without changing the code...!**

**Cons**

* Doesn’t give you the same level of access to Hadoop APIs 
    - Better: Dumbo and Pydoop
    - Other libraries can be faster if you use typedbytes

### Comparison
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/features.png'>

### Performance
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/performance.png'>

source: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

*One final note*: 
> Open source is a great thing

The community listens to you:
https://github.com/Yelp/mrjob/issues/1142

# End of this Chapter