# Running Hadoop MapReduce On Google Colab

In our previous tutorial we went through the steps of installing a single-node, pseudo-distributed Hadoop cluster on Google Colab. This tutorial will show you how to run a simple MapReduce job on Hadoop.

The tutorial should not take more than 30 minutes. You do not have to complete it in one sitting.

The tutorial has been written so that all of the commands work out of the box in Google Colab. However, if a particular command does not work or you get an unexpected error message, please add your question to the discussion forum.

The main steps for running a Hadoop MapReduce job are listed below.

1. [Hadoop Install](#hadoop)
2. [Running WordCount](#wordcount)
3. [Conclusion](#end)


## <a name="hadoop"></a>Hadoop Install
Since the Google Colab environment is reset each time you open a new notebook, we first need to install Hadoop in this VM instance. You can follow the steps in the previous tutorial; for convenience, the sequence of installation commands is given below.

In [None]:
!wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

In [None]:
!tar -xzvf hadoop-3.3.1.tar.gz

In [None]:
!ls hadoop-3.3.1/bin

In [None]:
!cp -r hadoop-3.3.1/ /usr/local/

In [None]:
!readlink -f /usr/bin/java 

Now, use the folder navigation pane on the left to browse to the file `/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh`. Double-click on the file to open it for editing. Uncomment the line that begins with `export JAVA_HOME=` (it should be line 54). Then add the Java path after the `=`. The path is the output of the `readlink` command above with the trailing `/bin/java` removed:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
```
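If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch and not part of the original steps: the helper name `set_java_home` is hypothetical, and it assumes the file contains a (possibly commented-out) `export JAVA_HOME=` line as described above.

```python
import re

def set_java_home(env_text: str, java_home: str) -> str:
    """Return env_text with the first JAVA_HOME export line uncommented and set.

    If no such line exists, an export line is appended at the end instead.
    """
    # Matches "export JAVA_HOME=..." or "# export JAVA_HOME=..." on a single line.
    pattern = re.compile(r"^#?[ \t]*export JAVA_HOME=.*$", re.MULTILINE)
    new_line = f"export JAVA_HOME={java_home}"
    if pattern.search(env_text):
        # A function is used as the replacement so the path is inserted literally.
        return pattern.sub(lambda match: new_line, env_text, count=1)
    return env_text.rstrip("\n") + "\n" + new_line + "\n"

# Small demonstration on an in-memory sample (not the real file).
sample = "# The java implementation to use.\n# export JAVA_HOME=\n"
print(set_java_home(sample, "/usr/lib/jvm/java-11-openjdk-amd64/"))

# To apply it to the real file (path assumed from the steps above):
# from pathlib import Path
# env_file = Path("/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh")
# env_file.write_text(set_java_home(env_file.read_text(),
#                                   "/usr/lib/jvm/java-11-openjdk-amd64/"))
```

Keeping the text transformation separate from the file I/O makes it easy to check the result on a sample string before touching the real configuration.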

In [None]:
!/usr/local/hadoop-3.3.1/bin/hdfs namenode -format

In [None]:
!/usr/local/hadoop-3.3.1/bin/hadoop

## <a name="wordcount"></a>The Word Count Problem

The word count application counts the number of occurrences of each word appearing in a set of documents. As discussed in the previous lectures, the word count problem is a classic example of a MapReduce task. In this tutorial we will run a simple MapReduce implementation of a word count application.
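To make the idea concrete before involving Hadoop, here is a small in-memory sketch of the three conceptual phases (map, shuffle, reduce). It is only an illustration: the function names are hypothetical, words are split on whitespace and lowercased as a simplifying assumption, and nothing here is distributed.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the list of ones collected for each word."""
    return {word: sum(values) for word, values in groups.items()}

def word_count(documents):
    """Run map, shuffle, and reduce over a list of document strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

print(word_count(["the cat sat", "the dog sat down"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```

In a real Hadoop job the shuffle step is performed by the framework between the mapper and the reducer, so we only have to supply the map and reduce logic.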

### Getting the Code 

You can find a simple Python implementation in the course Git repo. To obtain the code, clone the repo to Google Colab:

In [None]:
!git clone https://github.com/anjalysam/Hadoop.git 

The mapper and reducer code can be found in the `Hadoop/mapper.py` and `Hadoop/reducer.py` files. For convenience, let's copy them to the current directory.

In [None]:
!cp Hadoop/mapper.py .

In [None]:
!cp Hadoop/reducer.py .
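The contents of the repository files are not reproduced here. As a rough idea of what a Hadoop Streaming mapper and reducer usually look like, here is a minimal sketch; the function names `run_mapper` and `run_reducer` are hypothetical, the tab-separated `word<TAB>count` line format is an assumption, and the actual `mapper.py` and `reducer.py` may differ.

```python
from itertools import groupby

def run_mapper(lines):
    """Yield one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    The input must already be sorted by word, which is what the
    framework guarantees for the lines a reducer receives.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local simulation of mapper -> sort -> reducer on two tiny input lines.
mapped = sorted(run_mapper(["b a", "a c a"]))
print(list(run_reducer(mapped)))
# ['a\t3', 'b\t1', 'c\t1']

# In a standalone script, the mapper would typically be wired to the
# standard streams like this:
# import sys
# for out_line in run_mapper(sys.stdin):
#     print(out_line)
```

The `sorted` call stands in for the sort that happens between the map and reduce phases; without it, the reducer would see the same word in several separate runs.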

To be able to execute these scripts, we need to set the execute permission.

In [None]:
!chmod u+x mapper.py
!chmod u+x reducer.py

### Input and Output 

For any Hadoop job, we need an input and an output directory. These directories can be given any names. By convention, the names typically contain input/output as a prefix or suffix. Let's create the input directory.

In [None]:
!mkdir ~/input

The output directory will be automatically created when the Hadoop job runs. 
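One practical consequence: Hadoop refuses to start a job whose output directory already exists, so a second run with the same output name fails until the old directory is removed. The sketch below shows one way to clean up before a rerun. The helper name `clear_output_dir` is hypothetical, and it assumes the output lives on the local filesystem, as it does in this notebook where we inspect it with `ls`.

```python
import shutil
from pathlib import Path

def clear_output_dir(path):
    """Delete a previous job's output directory.

    Returns True if a directory was removed and False if there was
    nothing to remove.
    """
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False

# Example call before rerunning a job (directory name assumed):
# clear_output_dir("output")
```

Since this permanently deletes the directory, copy any results you want to keep before calling it.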

### Obtaining the input

We will run the word count job on the 20 Newsgroups dataset. You can fetch it using the following command:

In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz

The dataset is a gzipped tar archive. Let's decompress and extract it.

In [None]:
!tar -xzvf 20news-18828.tar.gz

Now, let's copy one of the files to the input directory.

In [None]:
!cp 20news-18828/alt.atheism/49960 ~/input

### Running the job

Now we are all set to run our first MapReduce Hadoop job. 

The general format of the command for running a Hadoop MapReduce job is as follows:

```bash 
hadoop jar hadoop-streaming.jar \
-input name_of_input_file \
-output name_of_output_directory \
-file name_of_mapper_file \
-mapper the_mapper_cmd \
-file name_of_reducer_file \
-reducer the_reducer_cmd
```

That's a pretty long command. Let's break it down. 

  * `hadoop` says that we are launching a Hadoop task 
  * `jar` tells Hadoop to run a program packaged as a JAR file.
  * `hadoop-streaming.jar` is a utility that comes with Hadoop; it lets mapper and reducer code written in other programming languages run as a MapReduce job, since any executable that reads from standard input and writes to standard output can act as a mapper or reducer. In this example we are using Python, but we could also have written the code in Ruby, R, etc.
  * `-input name_of_input_file` specifies the name of the input file for the MapReduce application 
  * `-output name_of_output_directory` is the output directory 
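  * `-file name_of_mapper_file` names the mapper file, which is shipped with the job so that it is available where the tasks run
  * `-file name_of_reducer_file` names the reducer file, which is shipped with the job in the same way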
  * `-mapper the_mapper_cmd` is the actual command that needs to be executed to run the mapper 
  * `-reducer the_reducer_cmd` is the actual command that needs to be executed to run the reducer
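Because the command is long, it can help to assemble it piece by piece. The sketch below builds the same argument list in Python; the helper name `build_streaming_command` and its parameters are hypothetical, and only the options listed above are used.

```python
import os
import shlex

def build_streaming_command(hadoop_bin, streaming_jar, input_path, output_dir,
                            mapper_file, reducer_file, interpreter="python"):
    """Return the Hadoop Streaming command as a list of arguments."""
    # The shipped files are referenced by their base name in the commands.
    mapper_cmd = f"{interpreter} {os.path.basename(mapper_file)}"
    reducer_cmd = f"{interpreter} {os.path.basename(reducer_file)}"
    return [
        hadoop_bin, "jar", streaming_jar,
        "-input", input_path,
        "-output", output_dir,
        "-file", mapper_file,
        "-mapper", mapper_cmd,
        "-file", reducer_file,
        "-reducer", reducer_cmd,
    ]

cmd = build_streaming_command(
    "/usr/local/hadoop-3.3.1/bin/hadoop",
    "/usr/local/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar",
    "input/49960", "output", "mapper.py", "reducer.py",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To actually launch the job from Python:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the multi-word mapper and reducer commands.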

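Before running the job, we list the current directory to confirm that `mapper.py` and `reducer.py` are present. The actual command for running the job follows.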

In [None]:
!ls

In [None]:
!/usr/local/hadoop-3.3.1/bin/hadoop jar /usr/local/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar -input ~/input/49960 -output output -file mapper.py  -file reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

If everything went right, you should see a long stream of log output. The last line should give you the name of the output directory.

Let's check the contents of the output directory. 

In [None]:
!ls -ltr output

`part-00000` is the file that contains the output of the program. Let's check the contents of that file. 

In [None]:
!cat output/part-00000
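The raw output can be long, so it is often more useful to look at the most frequent words. Here is a minimal sketch for doing that; the helper name `top_words` is hypothetical, and it assumes each output line has the form `word<TAB>count`, which depends on what the reducer actually prints.

```python
def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    counts = []
    for line in lines:
        # Split on the last tab so the count is always the final field.
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        counts.append((word, int(count)))
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]

# Example on the job output (file name taken from the listing above):
# with open("output/part-00000") as handle:
#     for word, count in top_words(handle, n=20):
#         print(word, count)
```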

## <a name="end"></a>Conclusion

That's it! You have run your first MapReduce job on Hadoop.