## **Hadoop On Google Colab**
**Module 2.2**  
**Block 9: Big Data Processing and NLP**

<a href="https://colab.research.google.com/github/datasciencepathways/hadoop_map_reduce/blob/main/tutorials/hadoop_install_on_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
In our previous tutorial we went through the steps of installing a single-node, pseudo-distributed Hadoop cluster on a Linux system. This tutorial describes how to build and configure a single-node Hadoop system on [Google Colab](https://colab.research.google.com/). Google Colab is a free Jupyter notebook environment that runs in the cloud. 
As we will see, setting up Hadoop on Colab is much easier than setting it up on your own system. The downside is that you have to repeat the installation steps each time you launch a Colab notebook, because Google Colab gives you a new VM each time you open one. Nonetheless, Google Colab is a great sandbox for getting some practice with Hadoop and distributed processing, among many other things.  

The tutorial should not take more than 20 minutes. You do not have to complete it in one sitting, however. 

The tutorial has been written so that all of the commands work out of the box in Google Colab. However, if a particular command does not work or you get a weird error message, please add your question to the discussion forum.

The main steps for setting up Hadoop on Google Colab are listed below.

[Getting Ready](#getting_ready)  
[Download and Installation](#download)  
[Configuration](#config)   
[Running and Testing](#running)  
[Conclusion](#end)  


## <a name="getting_ready"></a> Getting Ready 
The Google Colab environment should have all the necessary software and packages to run Hadoop. 

But just as a precaution, make sure that a recent version of the Java Runtime Environment (JRE) is installed. 


In [None]:
 !java -version 
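
If you would rather check the Java version from Python than read the banner by eye, the following is a minimal sketch. The helper name `parse_java_major` is hypothetical, and the parsing assumes the usual banner formats (`"1.8.0_x"` for Java 8, `"11.0.x"` for later releases).

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as "1.8.0_x"; later releases as "11.0.x".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


java_path = shutil.which("java")
if java_path is None:
    print("java not found on PATH")
else:
    # `java -version` writes its banner to stderr, so inspect both streams.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print("Java major version:", parse_java_major(result.stderr + result.stdout))
```

If the cell prints `None`, the banner did not match the assumed format and is worth reading directly.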

## <a name="download"></a>Download and Installation 

You will need to download Hadoop from the [Apache Mirror server](https://dlcdn.apache.org/hadoop/common/). To download version 3.3.1, use the following command:


In [None]:
!wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
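
Before unpacking, it can be worth confirming that the download is intact. The sketch below computes the SHA-512 digest of the archive so you can compare it by hand with the checksum published next to the release (Apache release directories normally include a `.sha512` file, but treat that as an assumption and check the download page). The helper name `sha512_of_file` is hypothetical.

```python
import hashlib
import os


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the whole archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = "hadoop-3.3.1.tar.gz"
if os.path.exists(archive):
    print(sha512_of_file(archive))
else:
    print(archive, "not found; run the download cell first")
```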

Now untar and unzip the archive using the `tar` command. 

In [None]:
!tar -xzvf hadoop-3.3.1.tar.gz

The command should take a few seconds to run and you should see lots of output on the screen. If the untar command completed successfully, the Hadoop binaries and libraries should now be in the `hadoop-3.3.1` directory. Check to see if the binaries are indeed there. 


In [None]:
!ls hadoop-3.3.1/bin

Right now, the installation of Hadoop is in the current working directory. Although it does not really matter in Colab, it's generally a good idea to move it to a system location. On Colab, you always have `root` access on the VM that is given to you, so you can use the `cp` command to copy Hadoop to `/usr/local`.

In [None]:
!cp -r hadoop-3.3.1/ /usr/local/

## <a name="config"></a>Configuration 

Hadoop is now installed on your system. But before we can run it, we need to modify one configuration file.

Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file. Just like Linux systems, Google Colab uses a symbolic link to the Java installation. To get the _actual_ location of Java use the following command. 

In [None]:
!readlink -f /usr/bin/java 

Now, use the folder navigation pane on the left to browse to the file `/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh`. Double-click on the file to open it for editing. Uncomment the line that begins with `export JAVA_HOME=` (it should be line 54). Then add the Java path after the `=`, leaving off the trailing `bin/java` from the output of the previous command. For example:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
```
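
The same edit can be made from a notebook cell instead of the file pane. The sketch below is one possible way to do it, not the tutorial's official method: `java_home_from_binary` and `set_java_home` are hypothetical helpers, and the code assumes Hadoop was copied to `/usr/local/hadoop-3.3.1` as above. `os.path.realpath` plays the role of `readlink -f`.

```python
import os
import re

# Assumed install location, matching the copy step earlier in this tutorial.
HADOOP_ENV = "/usr/local/hadoop-3.3.1/etc/hadoop/hadoop-env.sh"


def java_home_from_binary(java_binary):
    """Strip the trailing bin/java from the resolved path of the java executable."""
    return os.path.dirname(os.path.dirname(java_binary))


def set_java_home(env_text, java_home):
    """Return env_text with the first (possibly commented) JAVA_HOME export set."""
    line = "export JAVA_HOME=" + java_home
    pattern = re.compile(r"^[ \t]*#?[ \t]*export JAVA_HOME=.*$", re.MULTILINE)
    new_text, replaced = pattern.subn(lambda _match: line, env_text, count=1)
    if replaced == 0:
        # No export line found, so append one at the end of the file.
        new_text = env_text.rstrip("\n") + "\n" + line + "\n"
    return new_text


if os.path.exists(HADOOP_ENV) and os.path.exists("/usr/bin/java"):
    java_home = java_home_from_binary(os.path.realpath("/usr/bin/java"))
    with open(HADOOP_ENV) as handle:
        updated = set_java_home(handle.read(), java_home)
    with open(HADOOP_ENV, "w") as handle:
        handle.write(updated)
    print("JAVA_HOME set to", java_home)
else:
    print("hadoop-env.sh or /usr/bin/java not found; nothing changed")
```

Because a fresh Colab VM loses manual edits, a cell like this may be easier to re-run than repeating the edit by hand each session.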

### <a name="format"></a>Formatting the `namenode` 

Hadoop is now installed and configured. One more step remains before we can run it: the `namenode` must be formatted using the `hdfs` command. This is not the same as formatting or re-formatting a hard disk. It only ensures that the `namenode` has the metadata it needs to work with the underlying storage.  


In [None]:
!/usr/local/hadoop-3.3.1/bin/hdfs namenode -format
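
To confirm that the format step did something, you can look for the metadata it writes. The sketch below assumes Hadoop's default settings, under which the `namenode` directory is `/tmp/hadoop-<user>/dfs/name` and a formatted directory contains a `current/VERSION` file. If `hadoop.tmp.dir` or `dfs.namenode.name.dir` has been overridden, the path will differ. The helper names are hypothetical, and the user is assumed to be `root`, as on Colab.

```python
import os


def default_name_dir(user):
    """Build the default namenode directory for a user (assumes stock settings)."""
    return os.path.join("/tmp", "hadoop-" + user, "dfs", "name")


def namenode_is_formatted(name_dir):
    """A formatted namenode directory contains a current/VERSION file."""
    return os.path.isfile(os.path.join(name_dir, "current", "VERSION"))


# Colab cells run as root, so the default directory is under /tmp/hadoop-root.
name_dir = default_name_dir("root")
print(name_dir, "formatted:", namenode_is_formatted(name_dir))
```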

## <a name="running"></a>Running and Testing 

At this stage, we should be all set to run Hadoop. To check the installation, invoke the `hadoop` script in the Hadoop `bin` directory. Run without arguments, it prints its usage message. 

In [None]:
!/usr/local/hadoop-3.3.1/bin/hadoop
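
For a slightly stronger test, you could run one of the example jobs bundled with the distribution. The sketch below builds the command for the `pi` estimator in the MapReduce examples jar and runs it if Hadoop is present. It assumes the install path used above and that the jar sits under `share/hadoop/mapreduce` with the version in its file name. `example_job_command` is a hypothetical helper, and the arguments `2` and `10` (number of maps and samples per map) are arbitrary small values.

```python
import os
import subprocess

# Assumed install location, matching the copy step earlier in this tutorial.
HADOOP_HOME = "/usr/local/hadoop-3.3.1"


def example_job_command(hadoop_home, version, program, *args):
    """Build the argument list for running a bundled MapReduce example."""
    jar = os.path.join(hadoop_home, "share", "hadoop", "mapreduce",
                       "hadoop-mapreduce-examples-" + version + ".jar")
    return [os.path.join(hadoop_home, "bin", "hadoop"), "jar", jar, program, *args]


command = example_job_command(HADOOP_HOME, "3.3.1", "pi", "2", "10")
if os.path.exists(command[0]):
    # Capture the output so it shows up in the notebook cell.
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)
else:
    print("Hadoop not found at", HADOOP_HOME)
```

With no further configuration, Hadoop should run this job locally in a single process rather than on a set of daemons, which is enough to confirm that the binaries and the Java path work.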

## <a name="end"></a>Conclusion

That's it! You now have a working Hadoop system in Google Colab.