<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Hadoop Lab

_Authors: Dave Yerrington (SF)_

---
### Learning Objectives
*After this lab, you will be able to:*
- Install a local virtual machine running Apache Hadoop
- Navigate the Hadoop Distributed File System (HDFS)

### Preparation
*Before this lesson, you will need to:*
- Go through the installation of the virtual machine
- Have the students download the virtual machine image on their own before the lab; see instructions below.
- Note: Data assets for this lesson are included in the virtual machine.

> Instructors: This week requires some additional preparation. *First*, you'll need to assist students with the use of the Virtual Machine to run the first few lessons and labs. *Second*, you'll need to check with your local team about buying or accessing AWS credits for your students to use EC2 and EMR services.


### Lesson Guide
- [Introduction](#intro)
- [Installing the virtual machine](#vm)
    - [Import the VM in VirtualBox](#import-vm)
- [Launch the VM](#launch)
- [Start the Bigdata tools](#start)
- [Hadoop](#hadoop)
- [YARN](#yarn)
- [Exploring HDFS from the command line](#guided-practice)
    - [Exercise 1](#ex1)
- [Exploring HDFS from the web interface](#guided-practice2)
    - [Exercise 2](#ex2)
- [Hadoop word count](#guided-practice3)
    - [Exercise 3](#ex3)
- [Hadoop streaming word count](#guided-practice4)
    - [Exercise 4](#ex4)
- [Additional resources](#resources)

<a name="intro"></a>
## Introduction
---

In this lab we will explore Hadoop, a very common implementation of the map-reduce framework. We will do this through the use of a virtual machine, i.e. a simulated computer running on a host computer (our laptops).

This lab will guide you through the installation and configuration of the virtual environment. The environment is a virtual machine that runs on your computer and that comes packaged with a lot of neat software including:

- Hadoop
- Hive
- Hue
- Spark
- Python with many useful packages

<a name="vm"></a>
## Installing the virtual machine
---

The first step in our journey is to start a local virtual machine, which we will use throughout this week.

To simplify the process, we've made this machine available as a VirtualBox file at [this Dropbox location](https://www.dropbox.com/sh/ktjhecqklpvwcce/AADZBLKS6KQJL3hUt10eQiqSa?dl=0).

From now on I will assume you have already installed [VirtualBox](https://www.virtualbox.org/) on your computer. If you have not installed it, please go ahead and do that immediately.

<a id='import-vm'></a>
### Import the VM in VirtualBox

Oracle VM VirtualBox is a free and open-source hypervisor for x86 computers from Oracle Corporation. Developed initially by Innotek GmbH, it was acquired by Sun Microsystems in 2008 which was in turn acquired by Oracle in 2010.

VirtualBox may be installed on a number of host operating systems, including: Linux, OS X, Windows, Solaris, and OpenSolaris. It supports the creation and management of guest virtual machines running versions and derivations of Windows, Linux, BSD, OS/2, Solaris, Haiku, OSx86 and others.

For some guest operating systems, a "Guest Additions" package of device drivers and system applications is available which typically improves performance, especially of graphics.

Once you have downloaded the virtual machine file, import it into VirtualBox.

![](./assets/images/virtualbox.png)

![](./assets/images/import.png)

<a id='launch'></a>
## Launch the VM
---

Launch the VM by pressing the green launch arrow. This opens a console window that prints a lot of text and finally prompts you to log in. Do not log in there. Instead, connect via SSH from a terminal window by typing:
    
    ssh vagrant@10.211.55.101
    password: vagrant

![](./assets/images/launch.png)
    


<a id='start'></a>
## Start the Bigdata tools
---

Once you're logged in, type:

    $ bigdata_start.sh

and the following services will be started:

- Hadoop
- HDFS
- YARN
- Hive server
- Hue
- Jupyter Notebook

You may be asked for a password a few times (it is "vagrant"); just type it in.

Let's have a look at some of the services available in this virtual machine.

<a id='hadoop'></a>
## Hadoop

---

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as **Hadoop Distributed File System (HDFS)**, and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster.
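
To make the idea of block splitting concrete, here is a minimal sketch in plain Python. It assumes the Hadoop 2.x default block size of 128 MB, and the round-robin placement and the node names (`node-a`, `node-b`) are purely illustrative: real HDFS also replicates each block and uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumption: the Hadoop 2.x default of 128 MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be cut into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


def assign_blocks(block_sizes, nodes):
    """Toy round-robin placement of block indices; not the real HDFS policy."""
    placement = {node: [] for node in nodes}
    for index, _size in enumerate(block_sizes):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
print(len(blocks), "blocks; last block is", blocks[-1] // (1024 * 1024), "MB")
print(assign_blocks(blocks, ["node-a", "node-b"]))  # hypothetical node names
```

A 300 MB file ends up as two full 128 MB blocks plus one 44 MB block, spread over the available nodes.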

### HDFS

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. It's the file system supporting Hadoop.


<a id='yarn'></a>
## YARN
---

YARN is a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications. The fundamental idea of YARN is to split resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.

The YARN ResourceManager offers a web interface, which is accessible on our VM at this address:

http://10.211.55.101:8088/cluster

Go ahead and type that in your browser and you should see a screen like this:

![](./assets/images/yarn.png)

This will be useful when we run a Hadoop job and want to check its progress.
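
The same status information can also be read programmatically. The sketch below assumes the ResourceManager REST endpoint `/ws/v1/cluster/apps` is reachable at the VM address used above. The function names and the sample payload are illustrative: the payload is hand-written in the shape the API returns and is not real output from the VM.

```python
import json
from urllib.request import urlopen

# Address of the ResourceManager on the lab VM (taken from the text above).
RM_URL = "http://10.211.55.101:8088"


def fetch_apps(rm_url=RM_URL):
    """Download the application list from the ResourceManager REST API."""
    with urlopen(rm_url + "/ws/v1/cluster/apps", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_apps(payload):
    """Turn the REST payload into (id, name, state, progress) tuples."""
    # "apps" is null when no application has been submitted yet.
    apps = (payload.get("apps") or {}).get("app", [])
    return [(a["id"], a["name"], a["state"], a["progress"]) for a in apps]


# Hand-written sample payload, for illustration only.
sample = {"apps": {"app": [
    {"id": "application_0000000000000_0001", "name": "word count",
     "state": "RUNNING", "progress": 50.0},
]}}
for app in summarize_apps(sample):
    print(app)
```

With the VM running, `summarize_apps(fetch_apps())` would list the jobs that the web page shows.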

<a name="guided-practice"></a>
## Exploring HDFS from the command line
---

Hadoop offers a command line interface to navigate the HDFS. The full documentation can be found here:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

We've pre-loaded the machine with a few datasets. Let's explore them by typing the command:

    $ hadoop fs -ls

<a id='ex1'></a>
### Exercise 1
Explore HDFS and describe the content of each folder it contains. You will need to use a combination of commands like:

    - ls
    - cat
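
If you prefer to explore from Python (for instance from the Jupyter Notebook running on the VM), the listing can be wrapped in a small helper. This is a sketch that assumes Python 3.5+ and that the `hadoop` command is on the `PATH`. The helper names and the sample listing are made up for illustration; only the column layout follows what `hadoop fs -ls` prints.

```python
import subprocess


def parse_ls(output):
    """Parse the text printed by `hadoop fs -ls` into a list of dicts."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 7)  # the path is the 8th field
        if len(parts) < 8:
            continue  # skips the "Found N items" header and blank lines
        permissions, _replication, owner, _group, size, _date, _time, path = parts
        entries.append({
            "path": path,
            "is_dir": permissions.startswith("d"),
            "owner": owner,
            "size": int(size),
        })
    return entries


def hdfs_ls(path=""):
    """Run `hadoop fs -ls [path]` and return the parsed entries."""
    cmd = ["hadoop", "fs", "-ls"] + ([path] if path else [])
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True,
                            universal_newlines=True)
    return parse_ls(result.stdout)


# Made-up listing in the usual `-ls` layout, for illustration only.
sample = """Found 2 items
drwxr-xr-x   - vagrant supergroup          0 2016-01-01 00:00 some_folder
-rw-r--r--   1 vagrant supergroup         23 2016-01-01 00:00 some_folder/file.txt
"""
for entry in parse_ls(sample):
    print(entry)
```

On the VM, calling `hdfs_ls()` with no argument lists the same folder as `hadoop fs -ls`.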

<a name="guided-practice2"></a>
## Exploring HDFS from the web interface
---

Hadoop also offers a web interface to navigate and manage HDFS. It can be found at this address:

http://10.211.55.101:50070

and it looks like this:

![](./assets/images/hdfsweb.png)

<a id='ex2'></a>
### Exercise 2
Find how you can navigate the HDFS from the web interface. Is the content listed similar to what you were finding with the command line?


**Answer:** No. The web interface displays the contents of the root folder (`/`), while the `hadoop fs` command defaults to the current user's HDFS home folder, which is `/user/vagrant` on this VM.

<a name="guided-practice3"></a>
## Hadoop word count

Let's create a very short file and count the number of words using Hadoop:

    $ hadoop fs -mkdir wordcount-input
    
    $ echo "hello dear world hello" | hadoop fs -put - wordcount-input/hello.txt

<a id='ex3'></a>
### Exercise 3
Run the word count with the following command:

    $ hadoop jar /usr/local/lib/hadoop-2.7.2/share/hadoop/mapreduce/hadoop*example*.jar \
                  wordcount wordcount-input wordcount-output


![](./assets/images/hdwcshell.png)

![](./assets/images/hdwcyarn.png)

Check the results by typing:

    $ hadoop fs -cat wordcount-output/part*
    
You should see:

    dear   1
    hello  2
    world  1
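
As a sanity check, the same count can be reproduced locally in a few lines of Python. This is only a single-machine sketch of what the job computes, not how Hadoop runs it; the words are printed in sorted order to mirror the job output.

```python
from collections import Counter

text = "hello dear world hello"  # the line we wrote to hello.txt
counts = Counter(text.split())

# Print the words in alphabetical order, one "word<TAB>count" pair per line.
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```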

<a name="guided-practice4"></a>
## Hadoop streaming word count
---

Hadoop also offers a streaming interface. The streaming interface processes the data as a stream, one piece at a time, and must be told what to do with each piece of data. This is similar to the map-reduce pipeline we ran from the shell in the previous class.

Let's use the same python scripts to run a hadoop streaming map-reduce. We have pre-copied those scripts to your VM home folder, so that they are easy to access.
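
As a reminder of what such scripts do, here is a minimal sketch of a word-count mapper and reducer written as Python functions. It is an assumption about their general shape, not a copy of the `mapper.py` and `reducer.py` on the VM, which may differ. In a streaming job the mapper and reducer read lines from standard input and Hadoop sorts the mapper output by key in between; here `sorted` plays that role.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _word, count in group)}"


# Local stand-in for: cat file | mapper | sort | reducer
records = sorted(mapper(["hello dear world hello"]))
for line in reducer(records):
    print(line)
```

The reducer relies on its input being sorted by word, which is why the sort step between the two stages matters.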

First of all, let's copy some data to HDFS. The `data` folder contains a folder called `project_gutenberg`. Let's copy that, along with the `scripts` folder, to HDFS:

    $ hadoop fs -copyFromLocal data/project_gutenberg project_gutenberg
    $ hadoop fs -copyFromLocal scripts scripts

Go ahead and check that it's there:

http://10.211.55.101:50070/explorer.html#/user/vagrant

Great! Now we should pipe all the data contained in that folder through our scripts with hadoop streaming.
First let's make sure that the scripts work by using the shell pipes we learned in the last lecture.

    $ cat data/project_gutenberg/pg84.txt | python scripts/mapper.py | sort -k1,1 | python scripts/reducer.py 

Great! They still work. Now let's run the same map-reduce with Hadoop streaming:

    $ export STREAMING_JAR=/usr/local/lib/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
    
    $ hadoop jar $STREAMING_JAR  \
      -file /home/vagrant/scripts/mapper.py   \
      -mapper /home/vagrant/scripts/mapper.py \
      -file /home/vagrant/scripts/reducer.py  \
      -reducer /home/vagrant/scripts/reducer.py \
      -input /user/vagrant/project_gutenberg/* \
      -output /user/vagrant/output_gutenberg


Check the status of your MR job here:

http://10.211.55.101:8088/cluster/apps

You can check your results in the HDFS explorer:

http://10.211.55.101:50070/explorer.html#/user/vagrant/output_gutenberg

<a id='ex4'></a>
### Exercise 4

You have learned how to spin up a local virtual machine running Hadoop and how to submit map-reduce jobs to it! Congratulations.

Go ahead and perform the map-reduce word count on the Project Gutenberg data using the Hadoop JAR from Exercise 3. You should get the list of words with their counts as output. You can also save that list to a file and open it in Pandas to sort the words by frequency.
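
For the last step, a minimal sketch of the sorting logic using only the standard library is shown below. It assumes the saved file contains one `word<TAB>count` pair per line, as in the word count output earlier in this lab; the function name `top_words` and the sample data are illustrative.

```python
import io


def top_words(lines, n=10):
    """Read "word<TAB>count" lines and return the n most frequent words."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        word, count = line.rsplit("\t", 1)
        rows.append((word, int(count)))
    # Highest counts first; ties keep their original order.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


# In-memory stand-in for the saved output file.
sample = io.StringIO("dear\t1\nhello\t2\nworld\t1\n")
print(top_words(sample, n=2))
```

In Pandas the same idea is a `read_csv` call with `sep="\t"` followed by `sort_values` on the count column.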

<a id='resources'></a>
## Additional resources
---

- [Hadoop](http://hadoop.apache.org/)
- [Hadoop command line](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)
- [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- [Hadoop Streaming tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
- [Hadoop Streaming doc](https://hadoop.apache.org/docs/r1.2.1/streaming.html)