<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Hadoop Lab

_Authors: Dave Yerrington (SF)_

---
### Learning Objectives
*By the end of this lab, you will be able to:*
- Install a local virtual machine running Apache Hadoop.
- Navigate the Hadoop Distributed File System (HDFS).

### Pre-Work
*Before this lesson, you will need to:*
- Review the installation of the virtual machine.
- Download the virtual machine image before the lab (see instructions below).
- Note: Data assets for this lesson are included in the virtual machine.

> Instructors: This week requires some additional preparation. *First*, you'll need to assist students with the use of the virtual machine to run the first few lessons and labs. *Second*, you'll need to check with your local team about buying or accessing AWS credits for your students to use EC2 and EMR services.


### Lesson Guide
- [Introduction](#intro)
- [Installing the Virtual Machine (VM)](#vm)
    - [Import the VM in VirtualBox](#import-vm)
- [Launch the VM](#launch)
- [Start the Big Data Tools](#start)
- [Hadoop](#hadoop)
- [YARN](#yarn)
- [Exploring HDFS From the Command Line](#guided-practice)
    - [Exercise 1](#ex1)
- [Exploring HDFS From the Web Interface](#guided-practice2)
    - [Exercise 2](#ex2)
- [Hadoop Word Count](#guided-practice3)
    - [Exercise 3](#ex3)
- [Hadoop Streaming Word Count](#guided-practice4)
    - [Exercise 4](#ex4)
- [Additional Resources](#resources)

<a name="intro"></a>
## Introduction
---

In this lab, we will explore Hadoop, a common implementation of the MapReduce framework. We'll do this using a virtual machine, that is, a simulated computer running on a host computer (in our case, our laptops).

This lab will guide you through the installation and configuration of the virtual environment. The environment includes a virtual machine that runs on your computer and comes packaged with useful software, including:

- Hadoop.
- Hive.
- Hue.
- Spark.
- Python with many useful packages.

<a name="vm"></a>
## Installing the Virtual Machine
---

The first step in our journey is to start a local virtual machine that we'll use throughout the week.

In order to simplify the process, we've made this machine available as a VirtualBox file at [this Dropbox location](https://www.dropbox.com/sh/ktjhecqklpvwcce/AADZBLKS6KQJL3hUt10eQiqSa?dl=0). 

From now on, we'll assume that you've already installed [VirtualBox](https://www.virtualbox.org/) on your computer. If you haven't, please install it immediately.

<a id='import-vm'></a>
### Import the VM in VirtualBox

Oracle VM VirtualBox is a free, open-source hypervisor for x86 computers from Oracle Corporation. It was initially developed by innotek GmbH, which was acquired by Sun Microsystems in 2008 (which was in turn acquired by Oracle in 2010).

VirtualBox may be installed on a number of host operating systems, including Linux, OS X, Windows, Solaris, and OpenSolaris. It supports the creation and management of guest virtual machines running versions and derivations of Windows, Linux, BSD, OS/2, Solaris, Haiku, OSx86, and others.

For some guest operating systems, a "Guest Additions" package of device drivers and system applications is available, which typically improves performance — especially of graphics.

Once you've downloaded the virtual machine file, import it into VirtualBox.

![](./assets/images/virtualbox.png)

![](./assets/images/import.png)

<a id='launch'></a>
## Launch the VM
---

The VM is launched by pressing the green launch arrow. This will open a terminal window where you’ll see a lot of text. Finally, you'll be prompted to log in. Do not log in here. Instead, connect via `ssh` from a terminal window by typing:
    
    ssh vagrant@10.211.55.101
    password: vagrant

![](./assets/images/launch.png)
    


<a id='start'></a>
## Start the Big Data Tools
---

Once you're logged in, type:

    $ bigdata_start.sh

This will start the following services:

- Hadoop.
- HDFS.
- YARN.
- Hive server.
- Hue.
- Jupyter Notebook.

You may be asked for the password, "vagrant," a few times — just type it in.

Let's have a look at some of the services available in this virtual machine.

<a id='hadoop'></a>
## Hadoop

---

Apache Hadoop is an open-source software framework for the distributed storage and processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the **Hadoop Distributed File System (HDFS)**, and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster.

### HDFS

HDFS is a distributed, scalable, and portable file system written in Java to support the Hadoop framework.
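To make the idea of "large blocks" concrete, here is a minimal Python sketch of how a file's size maps to a sequence of blocks. It assumes a block size of 128 MB, which is the default in Hadoop 2.x; the VM may be configured differently, and the function name `split_into_blocks` is purely illustrative.

```python
# Assumption: 128 MB is the Hadoop 2.x default block size. A cluster can
# override it, so treat this value as illustrative.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 300 MB file needs two full blocks plus one partial block.
    print([size // mb for size in split_into_blocks(300 * mb)])
```

Each of these blocks is what HDFS distributes across the nodes of the cluster. On our single-node VM, they all end up on the same machine.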


<a id='yarn'></a>
## YARN
---

YARN is a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications. The fundamental idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons. Its goal is to have a global resource manager (RM) and a per-application application master (AM).

The resource manager and node manager form the data computation framework. The resource manager is the ultimate authority that arbitrates resources among all the applications in the system. The node manager is the per-machine framework agent that's responsible for containers, monitoring resource usage (CPU, memory, disk, network), and reporting the results to the resource manager/scheduler.

The YARN resource manager offers a web interface that is accessible on our VM at this address:

http://10.211.55.101:8088/cluster

Type that in your browser and you should see a screen like this:

![](./assets/images/yarn.png)

This will be useful for checking the progress of a Hadoop job while it runs.
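The same information is also exposed by the resource manager's REST API under `/ws/v1/cluster/apps`. The sketch below shows one way to summarize that response from Python. It assumes the REST endpoint is reachable on the same host and port as the web interface; the helper names and the sample payload are made up for illustration.

```python
import json
from urllib.request import urlopen

# Address of the resource manager web interface on the lab VM.
RM_URL = "http://10.211.55.101:8088"


def summarize_apps(payload):
    """Extract (id, name, state, progress) from a cluster-apps response."""
    # When no application has run yet, "apps" may be null.
    apps = (payload.get("apps") or {}).get("app") or []
    return [(a["id"], a["name"], a["state"], a["progress"]) for a in apps]


def fetch_apps(rm_url=RM_URL):
    """Fetch the application list. Requires the VM to be running."""
    with urlopen(rm_url + "/ws/v1/cluster/apps", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Illustrative payload with the same shape as the REST response.
    sample = {"apps": {"app": [
        {"id": "application_0000000000000_0001", "name": "word count",
         "state": "RUNNING", "progress": 50.0},
    ]}}
    for app in summarize_apps(sample):
        print(app)
```

With the VM running, `summarize_apps(fetch_apps())` should list the jobs that appear in the web interface.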

<a name="guided-practice"></a>
## Exploring HDFS From the Command Line
---

Hadoop offers a command line interface for navigating the HDFS. The full documentation can be found here:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

We've pre-loaded our machine with a few data sets. Let's explore them by typing this command:

    $ hadoop fs -ls

<a id='ex1'></a>
### Exercise 1
Explore HDFS and describe the content of each folder it contains. You'll need to use a combination of `hadoop fs` commands, such as:

- `hadoop fs -ls`
- `hadoop fs -cat`

<a name="guided-practice2"></a>
## Exploring HDFS From the Web Interface
---

Hadoop also offers a web interface for navigating and managing HDFS. It can be found at this address:

http://10.211.55.101:50070

It looks like this:

![](./assets/images/hdfsweb.png)

<a id='ex2'></a>
### Exercise 2
Find out how you can navigate the HDFS from the web interface. Is the content listed similar to what you were finding with the command line?


> Answer: No. The web interface displays the content of the root folder, while the `hadoop fs` command resolves relative paths against the user's home folder, `/user/vagrant`.
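The difference comes from how relative paths are resolved. The following sketch mimics that rule in Python, assuming you are logged in as `vagrant`, so that the HDFS home folder is `/user/vagrant` (the path that appears in the explorer URLs later in this lab). The function `resolve_hdfs_path` is a hypothetical helper, not part of Hadoop.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Mimic how `hadoop fs` resolves a path for a given user.

    Relative paths are taken relative to /user/<user>;
    absolute paths are left as they are.
    """
    home = posixpath.join("/user", user)
    if not path:
        # `hadoop fs -ls` with no argument lists the home folder.
        return home
    return posixpath.normpath(posixpath.join(home, path))


if __name__ == "__main__":
    for p in ["", "wordcount-input", "/", "/user/vagrant/scripts"]:
        print(repr(p), "->", resolve_hdfs_path(p, "vagrant"))
```

So `hadoop fs -ls` shows the home folder, while `hadoop fs -ls /` shows the same root folder as the web interface.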

<a name="guided-practice3"></a>
## Hadoop Word Count

Let's create a short file and get its word count using Hadoop:

    $ hadoop fs -mkdir wordcount-input
    
    $ echo "hello dear world hello" | hadoop fs -put - wordcount-input/hello.txt

<a id='ex3'></a>
### Exercise 3

Run the word count with the following command:

    $ hadoop jar /usr/local/lib/hadoop-2.7.2/share/hadoop/mapreduce/hadoop*example*.jar \
                  wordcount wordcount-input wordcount-output


![](./assets/images/hdwcshell.png)

![](./assets/images/hdwcyarn.png)

Check the results by typing:

    $ hadoop fs -cat wordcount-output/part*
    
You should see:

    dear   1
    hello  2
    world  1
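As a sanity check, the same counts can be reproduced locally in a few lines of Python. This is only a sketch of what the job computes, not of how Hadoop computes it; the function name `word_count` is an illustrative choice.

```python
from collections import Counter


def word_count(text):
    """Count whitespace-separated words and return them sorted by word."""
    counts = Counter(text.split())
    return sorted(counts.items())


if __name__ == "__main__":
    # Same content as the hello.txt file created above.
    for word, count in word_count("hello dear world hello"):
        print("{}\t{}".format(word, count))
```

The Hadoop job produces the same result, but it spreads the counting over map and reduce tasks so that it also works on inputs too large for one machine.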

<a name="guided-practice4"></a>
## Hadoop Streaming Word Count
---

Hadoop also offers a streaming interface. This streaming interface will process the data as a stream, one piece at a time, but it must be told what to do with each piece of data. This is somewhat similar to what we did with MapReduce from the shell in a previous class.

Let's use the same Python scripts to run a Hadoop streaming MapReduce. We have pre-copied those scripts to your VM home folder so they're easy to access.
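If you don't have those scripts in front of you, the sketch below shows what a word-count mapper and reducer typically do. It is an assumption about their general shape, not a copy of the files on the VM: the mapper emits one `word<TAB>1` record per word, and the reducer sums the counts of consecutive records that share a key, which is why the records must be sorted in between.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield "{}\t1".format(word)


def reducer(sorted_records):
    """Sum counts for consecutive records sharing the same key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "{}\t{}".format(word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    text = ["hello dear world", "hello"]
    # sorted() plays the role of `sort` in a shell pipeline
    # and of the shuffle phase in Hadoop.
    for record in reducer(sorted(mapper(text))):
        print(record)
```

In a real streaming script, `mapper` would read from `sys.stdin` and each script would print its records to standard output, since Hadoop streaming talks to the scripts through those two streams.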

First, let's copy some data to HDFS. The `data` folder contains a folder called `project_gutenberg`. Let's copy it, along with the `scripts` folder, to HDFS:

    $ hadoop fs -copyFromLocal data/project_gutenberg project_gutenberg
    $ hadoop fs -copyFromLocal scripts scripts

Go ahead and check that both folders are there:

http://10.211.55.101:50070/explorer.html#/user/vagrant

Great! Next, we want to pipe all the data contained in that folder through our scripts with Hadoop streaming.
First, let's make sure that the scripts work by using the shell pipes we learned in the last lecture:

    $ cat data/project_gutenberg/pg84.txt | python scripts/mapper.py | sort -k1,1 | python scripts/reducer.py 

Great! They still work. Now, let's run a Hadoop streaming MapReduce job:

    $ export STREAMING_JAR=/usr/local/lib/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
    
    $ hadoop jar $STREAMING_JAR  \
      -file /home/vagrant/scripts/mapper.py   \
      -mapper /home/vagrant/scripts/mapper.py \
      -file /home/vagrant/scripts/reducer.py  \
      -reducer /home/vagrant/scripts/reducer.py \
      -input /user/vagrant/project_gutenberg/* \
      -output /user/vagrant/output_gutenberg
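One thing to watch for when re-running the job: MapReduce refuses to start if the output folder already exists, so it has to be removed first with `hadoop fs -rm -r`. The sketch below assembles both commands as argument lists in Python, which can be handy if you later want to launch the job from a script. The helper names are hypothetical; the paths are the ones used above.

```python
import shlex

STREAMING_JAR = ("/usr/local/lib/hadoop-2.7.2/share/hadoop/tools/lib/"
                 "hadoop-streaming-2.7.2.jar")


def streaming_command(jar, mapper, reducer, input_path, output_path):
    """Build the argument list for a Hadoop streaming job."""
    return ["hadoop", "jar", jar,
            "-file", mapper, "-mapper", mapper,
            "-file", reducer, "-reducer", reducer,
            "-input", input_path, "-output", output_path]


def cleanup_command(output_path):
    """Build the command that deletes a previous output folder."""
    return ["hadoop", "fs", "-rm", "-r", output_path]


if __name__ == "__main__":
    out = "/user/vagrant/output_gutenberg"
    commands = [
        cleanup_command(out),
        streaming_command(STREAMING_JAR,
                          "/home/vagrant/scripts/mapper.py",
                          "/home/vagrant/scripts/reducer.py",
                          "/user/vagrant/project_gutenberg/*", out),
    ]
    # Print the commands instead of running them, so this is safe anywhere.
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

On the VM, each list could be passed to `subprocess.run` to actually execute it.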


Check the status of your MapReduce job here:

http://10.211.55.101:8088/cluster/apps

Check your results in the HDFS explorer:

http://10.211.55.101:50070/explorer.html#/user/vagrant/output_gutenberg

<a id='ex4'></a>
### Exercise 4

Congratulations! You've learned how to use a local virtual machine running Hadoop and how to submit MapReduce job flows to it.

Now, perform the MapReduce word count on the Project Gutenberg data using the Hadoop jar from the last exercise. You should get a list of words with their counts as output. You can also save that list to a file and open it in pandas to sort the words by frequency.
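For the last step, here is a minimal sketch of sorting the word counts by frequency. It uses only the standard library so that it runs anywhere, and it assumes each output line has the form `word<TAB>count`; the function names are illustrative.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into (word, count) tuples."""
    counts = []
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if sep:  # skip lines without a tab separator
            counts.append((word, int(count)))
    return counts


def sort_by_frequency(counts):
    """Most frequent words first; ties are broken alphabetically."""
    return sorted(counts, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    sample = ["dear\t1\n", "hello\t2\n", "world\t1\n"]
    for word, count in sort_by_frequency(parse_counts(sample)):
        print(word, count)

# A pandas equivalent, assuming the output was saved to a local file
# named counts.tsv (a hypothetical name):
#   import pandas as pd
#   df = pd.read_csv("counts.tsv", sep="\t", names=["word", "count"])
#   df.sort_values("count", ascending=False)
```

To get the job output into a single local file, `hadoop fs -getmerge` can combine the `part*` files of an output folder.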

<a id='resources'></a>
## Additional Resources
---

- [Hadoop](http://hadoop.apache.org/)
- [Hadoop Command Line](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)
- [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- [Hadoop Streaming Tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
- [Hadoop Streaming Document](https://hadoop.apache.org/docs/r1.2.1/streaming.html)