## **Running Hadoop MapReduce On Google Colab**
**Module 2, Section 3.1**  
**Block 9: Big Data Processing and NLP**

<a href="https://colab.research.google.com/github/datasciencepathways/hadoop_map_reduce/blob/main/tutorials/running_mapreduce_on_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
In our previous tutorial we went through the steps of installing a single-node, pseudo-distributed Hadoop cluster on Google Colab. This tutorial will show you how to run a simple MapReduce job on Hadoop. 

The tutorial should not take more than 30 minutes to complete. 

The tutorial has been written so that all commands work out of the box in Google Colab. However, if a particular command does not work or you get an unexpected error message, please post your question to the discussion forum.

The main steps for running a MapReduce Hadoop job on Google Colab are listed below.

1. [Hadoop Install](#hadoop)
2. [Running WordCount](#wordcount)
3. [Conclusion](#end)


## <a name="hadoop"></a>Hadoop Install
Since the Google Colab environment is reset each time you open a new notebook, we first need to install Hadoop in this VM instance. You can follow the steps from the previous tutorial. For convenience, the sequence of commands for Hadoop installation is given below.

In [None]:
!wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

In [None]:
!tar -xzvf hadoop-3.3.1.tar.gz

In [None]:
!ls hadoop-3.3.1/bin

In [None]:
!cp -r hadoop-3.3.1/ /usr/local/

In [None]:
!readlink -f /usr/bin/java | sed 's/bin\/java//'

Now, use the folder navigation pane on the left to browse to the file `/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh`. Double-click the file to open it for editing. Uncomment the line that begins with `export JAVA_HOME=` (it should be line 54), then add the Java path printed by the previous command after the `=`:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
```
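If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch, not part of the original tutorial: the helper name `set_java_home` is hypothetical, and it assumes the file contains a (possibly commented-out) line starting with `export JAVA_HOME=`. If no such line exists, it appends one instead.

```python
import re

def set_java_home(env_text: str, java_home: str) -> str:
    """Return env_text with the first JAVA_HOME export line uncommented and set."""
    # Match an optional leading '#', then 'export JAVA_HOME=' and the rest of the line.
    pattern = re.compile(r"^[ \t]*#?[ \t]*export JAVA_HOME=.*$", flags=re.MULTILINE)
    new_line = f"export JAVA_HOME={java_home}"
    updated, replaced = pattern.subn(lambda match: new_line, env_text, count=1)
    if replaced == 0:
        # Assumption: if no such line is present, appending one is acceptable.
        updated = env_text.rstrip("\n") + "\n" + new_line + "\n"
    return updated

# Small demonstration on an in-memory sample rather than the real file.
sample = "# The java implementation to use.\n# export JAVA_HOME=\n"
print(set_java_home(sample, "/usr/lib/jvm/java-11-openjdk-amd64/"))
```

To apply it to the real file, read `/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh` into a string, pass it through the helper, and write the result back to the same path.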

In [None]:
!/usr/local/hadoop-3.3.1/bin/hdfs namenode -format

In [None]:
!/usr/local/hadoop-3.3.1/bin/hadoop

## <a name="wordcount"></a>The Word Count Problem

The word count application counts the number of occurrences of all words appearing in a set of documents. As discussed in the previous lectures, the word count problem is a classic example of a MapReduce task. In this tutorial, we will run a simple MapReduce implementation of a word count application. This is going to serve as the [Hello World](https://en.wikipedia.org/wiki/%22Hello,_World!%22_program) of Hadoop MapReduce!  
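To make the map, shuffle, and reduce steps concrete, here is a small in-memory sketch of word count in plain Python. It is only an illustration of the pattern, not Hadoop code: the function names are hypothetical, and splitting on whitespace and lowercasing each token are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    """Run the three phases in sequence on a list of document strings."""
    return reduce_phase(shuffle(map_phase(documents)))

print(word_count(["a lot farther", "a lot harder"]))
```

In a real Hadoop job, the map and reduce phases run as separate tasks and the framework performs the grouping step for you.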

### Getting the Code 

You can find a simple Python implementation in the Block 9 git repo. To obtain the code, clone the repo to Google Colab:

In [None]:
!git clone https://github.com/apanqasem/hadoop_map_reduce.git

The mapper and reducer code can be found in the `hadoop_map_reduce/mapper.py` and `hadoop_map_reduce/reducer.py` files of the cloned repo. For convenience, let's copy them to the current directory.

In [None]:
!cp hadoop_map_reduce/mapper.py .

In [None]:
!cp hadoop_map_reduce/reducer.py .

To be able to run these scripts, we need to set the execute permission on the two files.

In [None]:
!chmod u+x mapper.py
!chmod u+x reducer.py
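The repository's scripts are not reproduced here, but a streaming mapper and reducer for word count usually follow the shape sketched below. This is an assumption about the general pattern, not the actual contents of `mapper.py` and `reducer.py`: the mapper emits one tab-separated `word 1` line per token, and the reducer sums the counts of consecutive lines that share a key, relying on its input arriving sorted by key. All function names are hypothetical.

```python
from itertools import groupby

def run_mapper(lines):
    """Mapper: emit 'word<TAB>1' for each whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def simulate_streaming(lines):
    """Mimic `cat input | mapper | sort | reducer` inside one process."""
    return list(run_reducer(sorted(run_mapper(lines))))

print(simulate_streaming(["By being a lot smarter", "By being a self-starter"]))
```

In an actual streaming script, the mapper would read its lines from `sys.stdin` and print each result, and the reducer would do the same with the sorted mapper output.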

### Input and Output 

For any Hadoop job, we need to provide the names of an `input` and an `output` directory. The input directory is the place where the program is going to look for its input data. The output directory is the location where the output is going to be written. 

These directories can be given any names. By convention, the names typically contain input/output as a suffix or a prefix. Let's create the input directory. 

In [None]:
!mkdir test_input

The output directory will be automatically created when the Hadoop job runs. 
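Note that Hadoop refuses to start a job if the output directory already exists, so an output directory left over from an earlier run has to be removed first. Later in this tutorial that is done with `rm -rf test_output`. A Python equivalent might look like the sketch below, where `clear_output_dir` is a hypothetical helper name.

```python
import shutil
import tempfile
from pathlib import Path

def clear_output_dir(path):
    """Delete an existing output directory; return True if one was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False

# Demonstration in a scratch directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as scratch:
    out = Path(scratch) / "test_output"
    out.mkdir()
    (out / "part-00000").write_text("a\t4\n")
    print(clear_output_dir(out))  # a directory existed and was removed
    print(clear_output_dir(out))  # nothing left to remove
```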

### Creating the input

Now, we need to supply the input text files for our word count program to analyze. For this very simple example, we will just create a few files with a few words in them. This will allow us to visually verify that the program is running correctly. In the next tutorial, we will learn how to run Hadoop MapReduce with real datasets.

We can create short text files using the Linux echo command. We will create five files containing lyrics from [Hamilton](https://en.wikipedia.org/wiki/Hamilton_(musical)).

In [None]:
!echo "The ten-dollar founding father without a father" > hamilton1.txt
!echo "Got a lot farther by working a lot harder" > hamilton2.txt
!echo "By being a lot smarter" > hamilton3.txt
!echo "By being a self-starter" > hamilton4.txt
!echo "By fourteen, they placed him in charge of a trading charter" > hamilton5.txt

Now, we will move these files to the input directory we just created.

In [None]:
!mv hamilton*.txt test_input

### Running the job

Now we are all set to run our first MapReduce Hadoop job. 

The general format of the command for running a MapReduce Hadoop job is as follows:

```bash 
hadoop jar hadoop-streaming.jar \
-input name_of_input_file \
-output name_of_output_directory \
-file name_of_mapper_file \
-mapper the_mapper_cmd \
-file name_of_reducer_file \
-reducer the_reducer_cmd
```

That's a pretty long command! Let's break it down. 

  * `hadoop` says that we are launching a Hadoop task.
  * `jar` is the Hadoop subcommand that runs a Java program packaged as a JAR file.
  * `hadoop-streaming.jar` is a utility that comes with Hadoop; it lets mapper and reducer code written in other programming languages run as a MapReduce job in Hadoop. In this example we are feeding it Python, but we could also have written the code in Ruby, R, etc.
  * `-input name_of_input_file` specifies the name of the input file (or directory) for the MapReduce application.
  * `-output name_of_output_directory` is the output directory.
  * `-file name_of_mapper_file` is the name of the mapper file to ship with the job.
  * `-file name_of_reducer_file` is the name of the reducer file to ship with the job.
  * `-mapper the_mapper_cmd` is the actual command that needs to be executed to run the mapper.
  * `-reducer the_reducer_cmd` is the actual command that needs to be executed to run the reducer.
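Because the command is long, it can help to assemble it piece by piece. The sketch below builds the same argument list in Python. The helper name `build_streaming_command` and its parameters are hypothetical, the paths assume the Hadoop 3.3.1 install location used in this tutorial, and the input is given as the whole `test_input` directory.

```python
import shlex

def build_streaming_command(hadoop_bin, streaming_jar, input_path, output_dir,
                            mapper_file, reducer_file, interpreter="python"):
    """Return the streaming job command as a list of arguments."""
    return [
        hadoop_bin, "jar", streaming_jar,
        "-input", input_path,
        "-output", output_dir,
        "-file", mapper_file,
        "-mapper", f"{interpreter} {mapper_file}",
        "-file", reducer_file,
        "-reducer", f"{interpreter} {reducer_file}",
    ]

cmd = build_streaming_command(
    "/usr/local/hadoop-3.3.1/bin/hadoop",
    "/usr/local/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar",
    "test_input", "test_output", "mapper.py", "reducer.py",
)
# Print a shell-quoted version that could be pasted into a notebook cell.
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)` if you prefer to launch the job from Python rather than from a `!` cell.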

Our actual command for running the job is as follows:

In [None]:
!rm -rf test_output
!/usr/local/hadoop-3.3.1/bin/hadoop jar /usr/local/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar -input test_input/hamilton*.txt -output test_output -file mapper.py  -file reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

If everything went OK, you should see a lot of output. The last line should give you the name of the output directory.

Let's check the contents of the output directory. 

In [None]:
!ls -ltr test_output

`part-00000` is the file that should contain the output of the word count program, and it should show a recent timestamp. Let's check the contents of that file.

In [None]:
!cat test_output/part-00000
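For a check that goes beyond eyeballing the output, you could compute the expected counts directly from the five lyric lines and compare them with the contents of `part-00000`. The sketch below assumes that words are split on whitespace, that case and punctuation are preserved, and that each output line is a word and a count separated by a tab. The repository's mapper may tokenize differently, in which case the numbers will not match exactly. The helper names are hypothetical.

```python
from collections import Counter

LYRICS = [
    "The ten-dollar founding father without a father",
    "Got a lot farther by working a lot harder",
    "By being a lot smarter",
    "By being a self-starter",
    "By fourteen, they placed him in charge of a trading charter",
]

def expected_counts(lines):
    """Count whitespace-separated tokens across all lines."""
    return Counter(word for line in lines for word in line.split())

def parse_part_file(text):
    """Parse 'word<TAB>count' lines into a dictionary."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts

expected = expected_counts(LYRICS)
print(expected.most_common(3))
```

To compare against the job output, read `test_output/part-00000` into a string, pass it to `parse_part_file`, and check whether the result equals `dict(expected)`.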

## <a name="end"></a>Conclusion

Does the output look correct? If so, then you have just run your first MapReduce job on a Hadoop cluster!