# Part 1: Running a Hadoop job using only Python code. #

# Step 1: Load the dataset containing the complete works of Shakespeare #

In [1]:
# Create the HDFS directory that will hold the datasets
!hdfs dfs -mkdir -p /datasets

# Download the Shakespeare dataset to the local datasets folder
!wget -q http://www.gutenberg.org/cache/epub/100/pg100.txt \
    -O ../datasets/shakespeare_all.txt

# Copy the local text files into HDFS
!hdfs dfs -put -f ../datasets/shakespeare_all.txt /datasets/shakespeare_all.txt
!hdfs dfs -put -f ../datasets/hadoop_git_readme.txt /datasets/hadoop_git_readme.txt

# List the datasets now stored in HDFS
!hdfs dfs -ls /datasets

Found 2 items
-rw-r--r--   1 vagrant supergroup       1365 2016-10-06 18:40 /datasets/hadoop_git_readme.txt
-rw-r--r--   1 vagrant supergroup    5589889 2016-10-06 18:40 /datasets/shakespeare_all.txt


# Step 2: Create the mapper and the reducer #

## Step 2.1: Mapper

The mapper reads lines from stdin and, for each line, prints three key-value pairs: the number of characters (excluding the newline), the number of words (splitting the line on whitespace), and the number of lines (always 1).

In [2]:
# Mapper
with open('mapper_hadoop.py', 'w') as fh:
    fh.write("""#!/usr/bin/env python

import sys

for line in sys.stdin:
    print "chars", len(line.rstrip('\\n'))
    print "words", len(line.split())
    print "lines", 1
    """)
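
To see what the mapper emits for a single line, the same logic can be sketched as a plain function. This is an illustration only: `map_line` is a hypothetical helper that is not part of the notebook, and it uses Python 3 syntax, whereas the script above uses Python 2 `print` statements.

```python
def map_line(line):
    """Return the (key, value) pairs the mapper would print for one input line."""
    return [
        ("chars", len(line.rstrip("\n"))),  # characters, excluding the trailing newline
        ("words", len(line.split())),       # whitespace-separated tokens
        ("lines", 1),                       # every input line counts once
    ]


sample = "To be, or not to be\n"
for key, value in map_line(sample):
    print(key, value)
```

For this sample line the function reports 19 characters, 6 words and 1 line.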

## Step 2.2: Reducer ##

The reducer sums up the values for each key and prints the grand total.

In [3]:
# Reducer
with open('reducer_hadoop.py', 'w') as fh:
    fh.write("""#!/usr/bin/env python

import sys

counts = {"chars": 0, "words":0, "lines":0}

for line in sys.stdin:
    kv = line.rstrip().split()
    counts[kv[0]] += int(kv[1])

for k,v in counts.items():
    print k, v
    """) 
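
The summing logic can be checked in isolation on a few hand-written mapper lines. The sketch below mirrors the reducer as a function; `reduce_lines` and the sample input are hypothetical names introduced for illustration, and the code is written in Python 3.

```python
def reduce_lines(lines):
    """Sum the integer value of every 'key value' line, per key."""
    counts = {"chars": 0, "words": 0, "lines": 0}
    for line in lines:
        key, value = line.rstrip().split()
        counts[key] += int(value)
    return counts


# Mapper output for two short lines, in the order the mapper would print it
mapper_output = ["chars 19", "words 6", "lines 1", "chars 4", "words 1", "lines 1"]
print(reduce_lines(mapper_output))
```

Because the running totals live in a dictionary, this reducer does not depend on its input being sorted by key.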

# Step 3: Run MapReduce #

## Step 3.1: Running on local system ##

In [4]:
!chmod a+x *_hadoop.py

In [5]:
# The shuffle step is replaced with sort -k1,1, which sorts the lines by their first field (the key)
# Pipeline: print the dataset -> mapper -> sort (shuffle) -> reducer

!cat ../datasets/hadoop_git_readme.txt | ./mapper_hadoop.py | sort -k1,1 | ./reducer_hadoop.py

chars 1335
lines 31
words 179
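
The `sort -k1,1` step matters because it places all values of a key next to each other, as the Hadoop shuffle does. A reducer can then keep a single running total at a time. The sketch below shows that alternative style with `itertools.groupby`; it is not the notebook's reducer, and `reduce_sorted` is a hypothetical name.

```python
from itertools import groupby


def reduce_sorted(sorted_lines):
    """Yield (key, total) for 'key value' lines that are already sorted by key."""
    pairs = (line.split() for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


mapper_output = ["chars 19", "words 6", "lines 1", "chars 4", "words 1", "lines 1"]
# sorted() on the first field plays the role of `sort -k1,1`
shuffled = sorted(mapper_output, key=lambda line: line.split()[0])
print(list(reduce_sorted(shuffled)))
```

If the input were not sorted, `groupby` would start a new group every time the key changed, and the same key would be reported several times.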


## Step 3.2: Running on Hadoop ##

In [6]:
# Create dir to store results
!hdfs dfs -mkdir -p /tmp

In [7]:
# Remove the output directory of any previous run (Hadoop will not overwrite an existing output directory)
!hdfs dfs -rm -f -r /tmp/mr.out

'''
The streaming job is configured in four parts:

1. Use the Hadoop streaming jar.
2. Ship the mapper and reducer scripts, which are local files, to the nodes that run the tasks (-files).
3. Set the mapper and the reducer to use (-mapper, -reducer).
4. Define the input file (-input) and the output directory (-output).
'''
!hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar \
-files mapper_hadoop.py,reducer_hadoop.py \
-mapper mapper_hadoop.py -reducer reducer_hadoop.py \
-input /datasets/hadoop_git_readme.txt -output /tmp/mr.out

packageJobJar: [/tmp/hadoop-unjar8239820158064390153/] [] /tmp/streamjob3508329584284935653.jar tmpDir=null
16/10/06 18:54:32 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/10/06 18:54:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/10/06 18:54:33 INFO mapred.FileInputFormat: Total input paths to process : 1
16/10/06 18:54:33 INFO mapreduce.JobSubmitter: number of splits:2
16/10/06 18:54:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475771633082_0004
16/10/06 18:54:34 INFO impl.YarnClientImpl: Submitted application application_1475771633082_0004
16/10/06 18:54:34 INFO mapreduce.Job: The url to track the job: http://sparkbox:8088/proxy/application_1475771633082_0004/
16/10/06 18:54:34 INFO mapreduce.Job: Running job: job_1475771633082_0004
16/10/06 18:54:41 INFO mapreduce.Job: Job job_1475771633082_0004 running in uber mode : false
16/10/06 18:54:41 INFO mapreduce.Job:  map 0% reduce 0%
16/10/06 18:54:47 INFO mapreduce.
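
When the job has to be launched from a Python script rather than from a notebook cell, the same command line can be assembled as an argument list. The helper below is a sketch under that assumption: `streaming_command` is a hypothetical function, it only uses the options shown in the cell above, and it builds the command without running it.

```python
def streaming_command(jar, files, mapper, reducer, input_path, output_path):
    """Build the argument list for a Hadoop streaming job (does not run it)."""
    return [
        "hadoop", "jar", jar,
        "-files", ",".join(files),   # local scripts shipped with the job
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    jar="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar",
    files=["mapper_hadoop.py", "reducer_hadoop.py"],
    mapper="mapper_hadoop.py",
    reducer="reducer_hadoop.py",
    input_path="/datasets/hadoop_git_readme.txt",
    output_path="/tmp/mr.out",
)
print(" ".join(cmd))
# On a machine with Hadoop installed, the list could be passed to
# subprocess.check_call(cmd) to start the job.
```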

# Step 4: Studying the output # 

In [8]:
!hdfs dfs -ls /tmp/mr.out
'''
There are 2 files:
1. _SUCCESS, an empty marker file indicating that the MR job finished writing to the directory.
2. part-00000, which contains the actual results.
'''

Found 2 items
-rw-r--r--   1 vagrant supergroup          0 2016-10-06 18:54 /tmp/mr.out/_SUCCESS
-rw-r--r--   1 vagrant supergroup         33 2016-10-06 18:54 /tmp/mr.out/part-00000
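
The results themselves can be printed with `hdfs dfs -cat /tmp/mr.out/part-00000`. To use them from Python, the text can be parsed back into a dictionary. This is a minimal sketch, assuming each non-empty line holds a key and an integer separated by whitespace, as the reducer prints them; `parse_counts` is a hypothetical helper and the sample text reuses the totals from the local run in Step 3.1.

```python
def parse_counts(text):
    """Parse 'key value' lines into a {key: int} dictionary."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        counts[key] = int(value)
    return counts


sample = "chars 1335\nlines 31\nwords 179\n"
print(parse_counts(sample))
```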


# Part 2: Using the MrJob library to run the Hadoop job. #

# Step 1: Write the python file using the MrJob functionalities #

Mappers and reducers are wrapped in a subclass of MRJob. Inputs are not read from stdin but passed as function arguments, and outputs are not printed but yielded.

In [9]:
with open("MrJob_job1.py", "w") as fh:
    fh.write("""
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()    
    """)
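
A rough mental model of what happens to the yielded pairs: they are collected, the values are grouped by key, and the reducer is called once per key with an iterator over that key's values. The sketch below imitates this in plain Python 3 without importing mrjob. `run_job` is a hypothetical helper, and it says nothing about how mrjob is implemented internally.

```python
from collections import defaultdict


def mapper(_, line):
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    yield key, sum(values)


def run_job(lines):
    """Feed lines to the mapper, group values by key, then call the reducer per key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, iter(grouped[key])):
            results[out_key] = out_value
    return results


print(run_job(["To be, or not to be", "that is the question"]))
```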

## Step 2: Running it locally ##

In [10]:
# Executes mapper and reducer, prints the result and cleans up the temporary directory
!python MrJob_job1.py ../datasets/hadoop_git_readme.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/MrJob_job1.vagrant.20161006.190646.064219
Running step 1 of 1...
Streaming final output from /tmp/MrJob_job1.vagrant.20161006.190646.064219/output...
"chars"	1335
"lines"	31
"words"	179
Removing temp directory /tmp/MrJob_job1.vagrant.20161006.190646.064219...


## Step 3: Running on Hadoop MapReduce and HDFS ##

We run the same Python file with the `-r hadoop` option and point it at the copy of the dataset stored in HDFS.

In [11]:
!python MrJob_job1.py -r hadoop hdfs:///datasets/hadoop_git_readme.txt

No configs found; falling back on auto-configuration
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Creating temp directory /tmp/MrJob_job1.vagrant.20161006.190837.776982
Using Hadoop version 2.6.4
Copying local files to hdfs:///user/vagrant/tmp/mrjob/MrJob_job1.vagrant.20161006.190837.776982/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar5878441889199379327/] [] /tmp/streamjob7184503074016446018.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1475771633082_0005
  Submitted application application_1475771633082_0005
  The url to track the job: http://sparkbox:8088/proxy/application_1475771633082_0005/
  Run

# Part 3: Running a process that needs more than one MapReduce step #

Our objective here is to find the word Shakespeare used most often. Using MrJob, we will create a cascade of MapReduce operations in which the output of each stage is the input of the next.

# Step 1: Define the MapReducer #

In [12]:
'''
Stage 1:

Step 1: Map (mapper_get_words)
A key-value tuple is yielded for each word. The key is the lowercased word and the value is always 1.

Step 2: Reduce (reducer_count_words)
For each key (lowercased word), we sum all the values. The output tells us how many times the word appears in the text.

Stage 2:

Step 1: Map (mapper_word_count_one_key)
We flip each (word, count) tuple into (count, word) and emit it as the value of a new key-value pair.
To force a single reducer to receive all the tuples, we assign the same key, None, to every output tuple.

Step 2: Reduce (reducer_find_max_word)
We discard the only key available and extract the maximum of the values.
'''

with open("MrJob_job2.py", "w") as fh:
    fh.write("""
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\\w']+")


class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_word_count_one_key,
                   reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        # sum the occurrences of each word
        yield (word, sum(counts))
    
    def mapper_word_count_one_key(self, word, count):
        # flip to (count, word) and send every tuple to the same reducer
        yield None, (count, word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # each item of count_word_pairs is a (count, word) tuple; max() compares the counts first
        yield max(count_word_pairs)


if __name__ == '__main__':
    MRMostUsedWord.run()
""")
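
The two stages can be tried out on a few lines of text without Hadoop or mrjob. The sketch below is a plain Python 3 model of the same logic, not the job itself: `stage_one`, `stage_two` and the sample lines are hypothetical, and a `Counter` stands in for the first map and reduce steps.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def stage_one(lines):
    """Map each word to 1 and reduce by summing: word -> number of occurrences."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1
    return counts


def stage_two(word_counts):
    """Flip to (count, word) pairs, as if grouped under one key, and keep the maximum."""
    return max((count, word) for word, count in word_counts.items())


text = [
    "The quality of mercy is not strained",
    "It droppeth as the gentle rain",
    "upon the place beneath",
]
print(stage_two(stage_one(text)))
```

Since tuples are compared element by element, two words with the same count are ordered by the word itself, so a tie goes to the word that sorts last.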

# Step 2: Run the MapReducer #

In [13]:
# Run locally
!python MrJob_job2.py --quiet ../datasets/shakespeare_all.txt

27801	"the"


In [14]:
# Run on the Hadoop cluster
!python MrJob_job2.py -r hadoop --quiet hdfs:///datasets/shakespeare_all.txt

27801	"the"


# Conclusion #

The most common word used by Shakespeare is "the", which appears 27,801 times.