In [None]:
%%javascript
$.getScript('http://asimjalis.github.io/ipyn-ext/js/ipyn-present.js')

<!-- 
This file was auto-generated from markdown using notedown.
Instead of modifying the ipynb modify the markdown source. 
-->

<h1 class="tocheading">HDFS</h1>
<div id="toc"></div>

Why Hadoop
==========

Big Data Problem
----------------

We have 100 TB of sales data that looks like this:

ID    |Date          |Store  |State |Product   |Amount
--    |----          |-----  |----- |-------   |------
101   |11/13/2014    |100    |WA    |331       |300.00
104   |11/18/2014    |700    |OR    |329       |450.00

What If
-------

What are some of the questions we could answer if we could process this huge data set?

- How much revenue did we make by store and by state?

- How much revenue did we make by product?

- How much revenue did we make by week, month, year?

Statistical Uses
----------------

Why are these interesting?

- These questions can help us figure out which products are selling
  in which markets, at what time of the year.

Engineering Problem
-------------------

To answer these questions we have to solve two problems:

- Store 100 TB of data

- Process 100 TB of data

Here is our starting point:

- To solve this problem we have been provided with 1000 commodity Linux servers.

- How can we organize these machines to store and process this data?

Objectives
----------

By the end of this class, we will be able to:

- Explain how HDFS splits up large files into blocks and stores them
  on a cluster.
  
- Explain how HDFS uses replication to ensure fault tolerance of the
  data.

- Explain how HDFS uses `fsimage` and `edits` files to ensure fault
  tolerance of the metadata.

<!--TODO: Sync up objectives with actual content-->

Hadoop Intro
============

Hadoop
------

Hadoop is a cluster operating system. It is made up of:

- HDFS, which coordinates storing large amounts of data on a
  cluster.

- MapReduce, which coordinates processing data across a cluster of
  machines.

Google Papers
-------------

Hadoop, HDFS, and MapReduce are open source implementations of the
ideas in these papers from Google.

- Paper #1: [2003] The Google File System     
    <http://research.google.com/archive/gfs-sosp2003.pdf>

- Paper #2: [2004] MapReduce: Simplified Data Processing on Large Clusters    
    <http://research.google.com/archive/mapreduce-osdi04.pdf>

- Paper #3: [2006] Bigtable: A Distributed Storage System for Structured Data
    <http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf>


Doug Cutting
------------

<img src="images/doug-cutting.png">

Hadoop
------

<img style="width:50%" src="images/yellow-elephant-hadoop.jpg">

Hadoop Analogy
--------------

<img src="images/devastator-transformer.jpg">

Hadoop Analogy
--------------

System     |Analogy
------     |-------
Hadoop     |Cluster Operating System
HDFS       |Cluster Disk Drive
MapReduce  |Cluster CPU

- Hadoop clusters are made up of commodity Linux machines.

- Each machine is weak and limited.

- Hadoop combines these machines.

- The Hadoop cluster is bigger and more powerful than the individual
  machines.

HDFS Daemons
------------

Daemon Name          |Role                              |Number Deployed
-----------          |-----------                       |---------------
NameNode             |Manages DataNodes and metadata    |1 per cluster
Secondary NameNode   |Compacts recovery metadata        |1 per cluster
Standby NameNode     |NameNode backup                   |1 per cluster
DataNode             |Stores and serves file blocks     |1 per worker machine

HDFS Cluster
------------

<img src="images/hdfs-cluster-arch.png">

HDFS Operation
==============

HDFS Command Line Access
------------------------

Command                                  |Meaning
-------                                  |-------
`hadoop fs -ls hdfs://nn:8020/user/jim`  |List home directory of user `jim`
`hadoop fs -ls /user/jim`                |List home directory of user `jim`
`hadoop fs -ls data`                     |List `/user/jim/data`
`hadoop fs -ls`                          |List home directory of user `jim`
`hadoop fs -mkdir dir`                   |Make new directory `/user/jim/dir`
`hadoop fs -mkdir -p a/b/c/dir`          |Make new directory and all missing parents
`hadoop fs -rm file`                     |Remove `/user/jim/file`
`hadoop fs -rm -r dir`                   |Remove `/user/jim/dir` and all its contents
`hadoop fs -rmdir dir`                   |Remove `/user/jim/dir` if it is empty
`hadoop fs -put file1 file2`             |Copy local `file1` to `/user/jim/file2`
`hadoop fs -put file1 dir/`              |Copy local `file1` to `/user/jim/dir/file1`
`echo 'hi' `&#124;` hadoop fs -put - file.txt`|Put string `hi` into `/user/jim/file.txt`
`hadoop fs -get /user/jim/file.txt`      |Copy `/user/jim/file.txt` to local `file.txt`
`hadoop fs -get /user/jim/file1 file2`   |Copy `/user/jim/file1` to local `file2`
`hadoop fs -cat /user/jim/file.txt`      |Cat `/user/jim/file.txt` to stdout
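
The first rows of the table rely on two defaults: a path without a scheme uses the configured default file system, and a relative path is resolved against the user's home directory. The sketch below mimics that rule in Python. `resolve_hdfs_path` is a hypothetical helper written for illustration, not part of Hadoop, and the defaults `jim` and `hdfs://nn:8020` are simply the values used in the table.

```python
import posixpath

def resolve_hdfs_path(path="", user="jim", default_fs="hdfs://nn:8020"):
    """Illustrative only: expand a path argument the way the table describes."""
    if "://" in path:
        # Fully qualified URI: use it unchanged.
        return path
    home = posixpath.join("/user", user)
    # An absolute path replaces the home prefix; a relative one is appended.
    absolute = posixpath.normpath(posixpath.join(home, path))
    return default_fs + absolute

for arg in ["hdfs://nn:8020/user/jim", "/user/jim", "data", ""]:
    print(repr(arg), "->", resolve_hdfs_path(arg))
```

Under these assumptions the first, second, and fourth arguments all resolve to `hdfs://nn:8020/user/jim`, while `data` resolves to `hdfs://nn:8020/user/jim/data`.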

Pop Quiz
--------

<details><summary>
Q: What is the advantage of the streaming put?
</summary>
1. The data does not have to be staged anywhere.<br>
2. You can put data into HDFS from a running program.<br>
</details>

<details><summary>
Q: Can you access HDFS files on a remote cluster?
</summary>
1. Use `hadoop fs -ls hdfs://remote-nn:8020/user/jim/path`<br>
2. Use `hadoop fs -Dfs.defaultFS=hdfs://remote-nn:8020 -ls /user/jim/path`<br>
3. Use `hadoop fs -fs hdfs://remote-nn:8020 -ls /user/jim/path`<br>
</details>

Files Blocks Replicas
---------------------

<img src="images/hdfs-files-blocks-replicas.png">

- Each file is made up of blocks.

- By default, each block has 3 copies, or *replicas*.

- The replicas are equivalent---there is no primary replica.

- The block size for a file cannot be changed once a file is created.

- The replication for a file *can* be changed dynamically.
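
To make the block and replica arithmetic concrete, here is a minimal Python sketch. It assumes a block size of 128 MB (a common default in recent Hadoop versions, but a configurable setting, so treat it as an assumption) and the replication factor of 3 mentioned above. `block_layout` is a hypothetical helper, not a Hadoop API.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size=128 * MB, replication=3):
    """Return (number of blocks, size of last block, raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    # Ceiling division: a partial final block still counts as a block.
    num_blocks = (file_size + block_size - 1) // block_size
    # The last block only holds the leftover bytes; it is not padded.
    last_block = file_size - (num_blocks - 1) * block_size
    # Every byte is stored once per replica.
    raw_bytes = file_size * replication
    return num_blocks, last_block, raw_bytes

blocks, last, raw = block_layout(300 * MB)
print(blocks, last // MB, raw // MB)  # 3 44 900
```

A 300 MB file therefore becomes two full 128 MB blocks plus one 44 MB block, and the cluster stores 900 MB in total across the three replicas.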


Custom Block Size and Replication
---------------------------------

Command                                          |Meaning
-------                                          |-------
`hadoop fs -D dfs.blocksize=67108864 -put file`  |Put `file` with block size 64MB 
`hadoop fs -D dfs.replication=1 -put file`       |Put `file` with replication 1
`hadoop fs -setrep 1 file`                       |Change replication of `file` to 1
`hadoop fs -setrep -R 1 dir`                     |Change replication of `dir` and contents to 1
`hadoop fs -setrep -w 1 file`                    |Change replication and wait until it completes
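
The block size is given in bytes, which is easy to get wrong by hand. The sketch below builds the argument list for the `-put` commands in the table from a size in megabytes. `put_command` is a hypothetical helper for illustration; it only assembles the arguments and does not run Hadoop.

```python
MB = 1024 * 1024

def put_command(local_file, block_size_mb=None, replication=None):
    """Build the argument list for `hadoop fs ... -put`; does not execute it."""
    args = ["hadoop", "fs"]
    if block_size_mb is not None:
        # dfs.blocksize is given in bytes in the table above.
        args += ["-D", f"dfs.blocksize={block_size_mb * MB}"]
    if replication is not None:
        args += ["-D", f"dfs.replication={replication}"]
    args += ["-put", local_file]
    return args

print(" ".join(put_command("file", block_size_mb=64)))
# hadoop fs -D dfs.blocksize=67108864 -put file
```

The resulting list could be handed to `subprocess.run` on a machine where the `hadoop` command is installed.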

File Security
-------------

Command                               |Meaning
-------                               |-------
`hadoop fs -chown jim file1`          |Change owner of `file1` to `jim`
`hadoop fs -chgrp staff file1`        |Change group of `file1` to `staff`
`hadoop fs -chown jim:staff file1`    |Change both owner and group
`hadoop fs -chown -R jim:staff dir1`  |Change both for `dir1` and its contents
`hadoop fs -chgrp -R staff dir1`      |Change group of `dir1` and its contents
`hadoop fs -chmod 755 file1`          |Set `file1` permissions to `rwxr-xr-x`
`hadoop fs -chmod -R 755 dir1`        |Set permissions for `dir1` and its contents
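
The octal mode packs three permission bits (read, write, execute) for each of owner, group, and others. This short Python sketch shows how `755` expands to `rwxr-xr-x`. `mode_string` is a hypothetical helper for illustration.

```python
def mode_string(octal_mode):
    """Convert an octal mode such as '755' to its 'rwxr-xr-x' form."""
    out = []
    for digit in octal_mode:           # one digit each for owner, group, others
        bits = int(digit, 8)
        out.append("".join(
            flag if bits & mask else "-"
            for flag, mask in (("r", 4), ("w", 2), ("x", 1))
        ))
    return "".join(out)

print(mode_string("755"))  # rwxr-xr-x
print(mode_string("640"))  # rw-r-----
```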

Hadoop Security
---------------

Q: What is the primary HDFS security model?

- HDFS uses the Unix file system security model.

- Unix secures access through authentication and authorization.

- Authentication: who are you?

- Authorization: who may access a given file?

- By default HDFS enforces authorization but not authentication.

Q: How can I impersonate someone to hack their files in HDFS? 

- `sudo -u jim hadoop fs -cat /user/jim/deep-dark-secrets.txt`

Q: How can I secure the system against impersonation?

- To fix this you have to enable Kerberos.
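
The gap between authorization and authentication can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not HDFS code: `FileStatus` and `can_read` are hypothetical names, and the user `eve` and the mode `0o600` are made up for the example. The check applies the owner, group, or other read bit, but it trusts whatever user name the caller supplies.

```python
from dataclasses import dataclass

@dataclass
class FileStatus:
    owner: str
    group: str
    mode: int  # Unix-style permission bits, e.g. 0o600

def can_read(user, groups, status):
    """Authorization only: choose the owner, group, or other read bit."""
    if user == status.owner:
        return bool(status.mode & 0o400)
    if status.group in groups:
        return bool(status.mode & 0o040)
    return bool(status.mode & 0o004)

secret = FileStatus(owner="jim", group="staff", mode=0o600)
print(can_read("eve", ["users"], secret))  # False
print(can_read("jim", ["staff"], secret))  # True, but nothing verified the name
```

Nothing in `can_read` proves that the caller really is `jim`, which is why running the client under another user name is enough to pass the check when authentication is not enabled.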

MapReduce
=========

MapReduce Intro
---------------

Q: What is MapReduce?

- MapReduce is a system for processing data on HDFS.

- In the *map* phase data is processed locally, with one *mapper* per
  block.

- In the *reduce* phase the results of the map phase are consolidated.
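
The two phases can be imitated in plain Python to answer the earlier question about revenue by state. This is a single-process sketch of the idea, not the Hadoop API: `map_sale`, `reduce_sum`, and `run_job` are hypothetical names, each inner list stands in for one HDFS block, and the third record is a made-up row added so that one state has more than one sale.

```python
from collections import defaultdict

BLOCKS = [
    ["101|11/13/2014|100|WA|331|300.00",
     "104|11/18/2014|700|OR|329|450.00"],
    ["105|11/20/2014|100|WA|329|50.00"],   # hypothetical row for illustration
]

def map_sale(line):
    """Map: emit (state, amount) for one sales record."""
    _id, _date, _store, state, _product, amount = line.split("|")
    yield state, float(amount)

def reduce_sum(key, values):
    """Reduce: consolidate all amounts emitted for one key."""
    return key, sum(values)

def run_job(blocks):
    grouped = defaultdict(list)
    for block in blocks:                     # one mapper per block
        for line in block:
            for key, value in map_sale(line):
                grouped[key].append(value)   # shuffle: group values by key
    return dict(reduce_sum(k, v) for k, v in sorted(grouped.items()))

print(run_job(BLOCKS))  # {'OR': 450.0, 'WA': 350.0}
```

Switching the key from `state` to `_product` or `_store` in `map_sale` would answer the other revenue questions with the same reduce step.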

Data Locality
-------------

Q: What is *data locality*?

- Data locality is the secret sauce in HDFS and MapReduce. 

- Data locality means MapReduce runs each mapper on a machine that
  holds a local replica of the HDFS block it reads.

- When these machines are busy, mappers may run on other machines and
  read the block over the network.
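
The placement rule can be sketched as a small Python function. This is a deliberate simplification that ignores rack awareness and everything else a real scheduler considers. `place_mapper` is a hypothetical name, and the host names are made up.

```python
def place_mapper(replica_hosts, free_hosts):
    """Return (host, is_local) for the mapper that reads one block."""
    # Prefer any free machine that already stores a replica of the block.
    for host in replica_hosts:
        if host in free_hosts:
            return host, True
    if not free_hosts:
        raise RuntimeError("no free host available")
    # Otherwise fall back to some other free machine (a remote read).
    return sorted(free_hosts)[0], False

replicas = ["node1", "node2", "node3"]
print(place_mapper(replicas, {"node2", "node7"}))  # ('node2', True)
print(place_mapper(replicas, {"node7", "node9"}))  # ('node7', False)
```

With three replicas per block there are three candidate machines for a local read, which makes the fallback case less likely.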


Data Locality
-------------

<img src="images/hdfs-map-reduce-data-locality.png">


MapReduce Examples
------------------

Q: How can I run MapReduce programs?

- Hadoop ships with some MapReduce example programs.

- The jar files are located at `/usr/lib/hadoop-mapreduce`.

- Here is how you can get more information about them.

Command                                                    |Result
-------                                                    |------
`hadoop jar /path/hadoop-mapreduce-examples-VER.jar`       |Lists all programs
`hadoop jar /path/hadoop-mapreduce-examples-VER.jar sleep` |List usage of `sleep`

Q: What are some of the interesting example programs?

Program                |Description
-------                |-----------
`aggregatewordcount`   |Counts words in input
`grep`                 |Counts matches of regex in input
`pi`                   |Estimates Pi using a Monte Carlo method
`sleep`                |Sleeps 
`sudoku`               |Sudoku solver
`teragen`              |Generates data for terasort
`terasort`             |Runs terasort
`teravalidate`         |Checks results of terasort
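
To show why a Monte Carlo estimate of Pi fits the map and reduce pattern, here is a minimal Python sketch. It uses plain pseudo-random sampling and may differ from what the bundled `pi` program does internally. `count_inside` and `estimate_pi` are hypothetical names. Each "mapper" throws random points at the unit square and counts how many land inside the quarter circle; the "reducer" adds up the counts.

```python
import random

def count_inside(samples, seed):
    """Map: count random points that fall inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(num_mappers=10, samples_per_mapper=10_000):
    """Reduce: combine the per-mapper counts into one estimate."""
    inside = sum(count_inside(samples_per_mapper, seed)
                 for seed in range(num_mappers))
    total = num_mappers * samples_per_mapper
    # Area of quarter circle / area of square = pi / 4.
    return 4.0 * inside / total

print(estimate_pi())
```

Because the mappers are independent, their work can be spread across a cluster, and more samples give a tighter estimate.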

Pop Quiz
--------

<details><summary>
Q: What is the advantage of running `grep` as a MapReduce program?
</summary>
`grep` on MapReduce will scan blocks in parallel and complete
faster.
</details>
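
The answer above can be illustrated with a small Python sketch that scans blocks concurrently and merges the per-block counts. It only mirrors the structure of the job: Python threads do not provide the multi-machine parallelism a cluster does. `count_matches` and `parallel_grep` are hypothetical names, and the sample text is made up.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_matches(block, pattern):
    """Map: count each regex match found in one block of text."""
    return Counter(re.findall(pattern, block))

def parallel_grep(blocks, pattern, workers=4):
    """Scan all blocks concurrently, then reduce by summing the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda b: count_matches(b, pattern), blocks):
            total.update(partial)
    return total

# Made-up sample data: each string stands in for one HDFS block.
blocks = [
    "error: disk full\nok\n",
    "ok\nerror: timeout\nerror: disk full\n",
]
print(parallel_grep(blocks, r"error: [a-z ]+"))
```

Each block is scanned independently, so adding machines shortens the scan, and the merge step only has to handle the small per-block counts.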