# Word Count Sorting

The output of one MapReduce job often serves as the input for another. MRJob supports this through `steps`: a job can define several steps that run in sequence, each one consuming the output of the previous step.

See the official [mrjob documentation](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#writing-your-second-job) for more information.

The template for a pipeline with two steps looks like the following:
- step1: with a mapper, a combiner and a reducer
- step2: with just a mapper and a reducer

```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class MyJob(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_one,
                   combiner=self.combiner_one,
                   reducer=self.reducer_one),
            MRStep(mapper=self.mapper_two, reducer=self.reducer_two)
        ]

    def mapper_one(self, _, line):
        raise NotImplementedError

    def combiner_one(self, key, counts):
        raise NotImplementedError

    def reducer_one(self, key, counts):
        raise NotImplementedError
    
    def mapper_two(self, key, counts):
        raise NotImplementedError
    
    def reducer_two(self, key, counts):
        raise NotImplementedError


if __name__ == '__main__':
    MyJob.run()
```
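To make the data flow between steps concrete, here is a minimal plain-Python simulation of two chained steps. It does not use mrjob, and the helper `run_step` is a hypothetical stand-in for the map, shuffle and reduce phases, not how the library is implemented. The example finds the most frequent word rather than sorting, so it is only a sketch of how the pairs yielded by the first reducer become the input of the second mapper.

```python
from collections import defaultdict


def run_step(pairs, mapper, reducer):
    """Simulate one map -> shuffle -> reduce pass over (key, value) pairs.

    Hypothetical helper for illustration only; mrjob and Hadoop do this
    in a distributed way.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # group values by key (shuffle)
    results = []
    for out_key in sorted(groups):  # reducers receive keys in sorted order
        results.extend(reducer(out_key, groups[out_key]))
    return results


def mapper_one(_, line):
    for word in line.split():
        yield word.lower(), 1


def reducer_one(word, counts):
    yield word, sum(counts)


def mapper_two(word, count):
    # Send every pair to the same key so a single reducer call sees them all.
    yield None, (count, word)


def reducer_two(_, count_word_pairs):
    # Tuples compare by their first element, so max() picks the highest count.
    yield max(count_word_pairs)


lines = [(None, "a b a"), (None, "c a b")]
step1_output = run_step(lines, mapper_one, reducer_one)
step2_output = run_step(step1_output, mapper_two, reducer_two)
print(step1_output)  # [('a', 3), ('b', 2), ('c', 1)]
print(step2_output)  # [(3, 'a')]
```

Note that the second mapper receives exactly the `(word, count)` pairs that the first reducer yielded; what a step emits as key and value determines what the next step can group and sort on.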

## Your Solution
Implement a sorted word count with two steps. The first step performs the normal word count. What should the second step do?

## Sorting Behavior in Hadoop
To control how keys are sorted before the reduce phase of a step, pass a `jobconf` dictionary to that step:
```python
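def steps(self):
    # Hadoop settings that apply to the second step only
    JOBCONF_STEP2 = {
        # compare keys with a configurable, field-based comparator
        'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
        # -n: compare numerically, -r: reverse (descending) order
        'mapred.text.key.comparator.options': '-nr',
    }
    return [
        MRStep(mapper=self.mapper,
               combiner=self.combiner,
               reducer=self.reducer),
        MRStep(jobconf=JOBCONF_STEP2,
               mapper=self.mapper_two,
               reducer=self.reducer_two)
    ]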
```

This sorts the keys numerically and in descending order (`-n` selects numeric comparison, `-r` reverses the order). Note, however, that this only works when you run MRJob with the `hadoop` runner (`-r hadoop`), not in `local` mode.
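The reason a numeric comparator matters is that keys are otherwise compared as text, so `10` sorts before `9`. The following plain-Python sketch illustrates the difference with made-up sample counts. It only mimics the ordering that `-nr` asks for and does not involve Hadoop.

```python
# Sample counts as they would appear in text form (made-up values).
counts_as_text = ["9", "10", "100", "2"]

# Text comparison: characters are compared left to right.
text_order = sorted(counts_as_text)

# Numeric comparison in descending order, analogous to the '-nr' option.
numeric_desc = sorted(counts_as_text, key=int, reverse=True)

print(text_order)    # ['10', '100', '2', '9']
print(numeric_desc)  # ['100', '10', '9', '2']
```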

In [None]:
%%writefile wc.py

#!/usr/bin/python3

# your solution

## Testing the Code

### Locally (no sorting)

In [None]:
!python wc.py /data/dataset/text/small.txt

### On Hadoop - Results on Console

In [None]:
!python wc.py -r hadoop hdfs:///dataset/text/small.txt

### On Hadoop - Results are Written to HDFS

In [None]:
!python wc.py -r hadoop hdfs:///dataset/text/small.txt --output-dir hdfs:///results/wordcount/sorted/small --no-output

## Sorted WordCount for holmes.txt

In [None]:
!python wc.py -r hadoop hdfs:///dataset/text/holmes.txt --output-dir hdfs:///results/wordcount/sorted/holmes --no-output

## Sorted WordCount for gutenberg_all.txt (Optional)

Depending on the implementation, this can run for more than 30 minutes!

In [None]:
!python wc.py -r hadoop hdfs:///dataset/text/gutenberg_all.txt --output-dir hdfs:///results/wordcount/sorted/gutenberg --no-output