%%javascript
$.getScript('http://asimjalis.github.io/ipyn-ext/js/ipyn-present.js')


<!-- 
This file was auto-generated from markdown using notedown.
Instead of modifying the ipynb modify the markdown source. 
-->

<h1 class="tocheading">HDFS</h1>
<div id="toc"></div>

Why Hadoop
==========

Big Data Problem
----------------

We have 100 TB of sales data that looks like this:

ID    |Date          |Store  |State |Product   |Amount
--    |----          |-----  |----- |-------   |------
101   |11/13/2014    |100    |WA    |331       |300.00
104   |11/18/2014    |700    |OR    |329       |450.00

What If
-------

What are some of the questions we could answer if we could process this huge data set?

- How much revenue did we make by store and by state?

- How much revenue did we make by product?

- How much revenue did we make by week, month, year?

Statistical Uses
----------------

Why are these interesting?

- These questions can help us figure out which products are selling
  in which markets, at what time of the year.

Engineering Problem
-------------------

To answer these questions we have to solve two problems:

- Store 100 TB of data

- Process 100 TB of data

Here is our starting point:

- To solve this problem we have been provided with 1000 commodity Linux servers.

- How can we organize these machines to store and process this data?

Objectives
----------

By the end of this class, we will be able to:

- Explain how HDFS splits up large files into blocks and stores them
  on a cluster.
  
- Explain how HDFS uses replication to ensure fault tolerance of the
  data.

- Explain how HDFS uses `fsimage` and `edits` files to ensure fault
  tolerance of the metadata.

<!--TODO: Sync up objectives with actual content-->

Hadoop Intro
============

Hadoop
------

Hadoop is a cluster operating system. It is made up of:

- HDFS, which coordinates storing large amounts of data on a
  cluster.

- MapReduce, which coordinates processing data across a cluster of
  machines.

Google Papers
-------------

Hadoop, HDFS, and MapReduce are open source implementations of the
ideas in these papers from Google and Stanford.

- Paper #1: [2003] The Google File System     
    <http://research.google.com/archive/gfs-sosp2003.pdf>

- Paper #2: [2004] MapReduce: Simplified Data Processing on Large Clusters    
    <http://research.google.com/archive/mapreduce-osdi04.pdf>

- Paper #3: [2006] Bigtable: A Distributed Storage System for Structured Data
    <http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf>


Doug Cutting
------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/doug-cutting.png">

Hadoop
------

<img style="width:50%" src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/yellow-elephant-hadoop.jpg">

Hadoop Analogy
--------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/devastator-transformer.jpg">

Hadoop Analogy
--------------

System     |Analogy
------     |-------
Hadoop     |Cluster Operating System
HDFS       |Cluster Disk Drive
MapReduce  |Cluster CPU

- Hadoop clusters are made up of commodity Linux machines.

- Each machine is weak and limited.

- Hadoop combines these machines.

- The Hadoop cluster is bigger and more powerful than the individual
  machines.

HDFS Daemons
------------

Daemon Name          |Role                              |Number Deployed
-----------          |-----------                       |---------------
NameNode             |Manages DataNodes and metadata    |1 per cluster
Secondary NameNode   |Compacts recovery metadata        |1 per cluster
Standby NameNode     |NameNode backup                   |1 per cluster
DataNode             |Stores/processes file parts       |1 per worker machine

HDFS Cluster
------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-cluster-arch.png">

HDFS Operation
==============

HDFS Command Line Access
------------------------

Command                                  |Meaning
-------                                  |-------
`hadoop fs -ls hdfs://nn:8020/user/jim`  |List home directory of user `jim`
`hadoop fs -ls /user/jim`                |List home directory of user `jim`
`hadoop fs -ls data`                     |List `/user/jim/data` (relative paths resolve against the home directory)
`hadoop fs -ls`                          |List home directory of user `jim`
`hadoop fs -mkdir dir`                   |Make new directory `/user/jim/dir`
`hadoop fs -mkdir -p a/b/c/dir`          |Make new directory and all missing parents
`hadoop fs -rm file`                     |Remove `/user/jim/file`
`hadoop fs -rm -r dir`                   |Remove `/user/jim/dir` and all its contents
`hadoop fs -rmdir dir`                   |Remove `/user/jim/dir` if it is empty
`hadoop fs -put file1 file2`             |Copy local `file1` to `/user/jim/file2`
`hadoop fs -put file1 dir/`              |Copy local `file1` to `/user/jim/dir/file1`
`echo 'hi' `&#124;` hadoop fs -put - file.txt`|Put string `hi` into `/user/jim/file.txt`
`hadoop fs -get /user/jim/file.txt`      |Copy `/user/jim/file.txt` to local `file.txt`
`hadoop fs -get /user/jim/file1 file2`   |Copy `/user/jim/file1` to local `file2`
`hadoop fs -cat /user/jim/file.txt`      |Cat `/user/jim/file.txt` to stdout

Pop Quiz
--------

<details><summary>
Q: What is the advantage of the streaming put?
</summary>
1. The data does not have to be staged anywhere.<br>
2. You can put data into HDFS from a running program.<br>
</details>

<details><summary>
Q: Can you access HDFS files on a remote cluster?
</summary>
1. Use `hadoop fs -ls hdfs://remote-nn:8020/user/jim/path`<br>
2. Use `hadoop fs -D fs.defaultFS=hdfs://remote-nn:8020 -ls /user/jim/path`<br>
3. Use `hadoop fs -fs hdfs://remote-nn:8020 -ls /user/jim/path`<br>
</details>

Files Blocks Replicas
---------------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-files-blocks-replicas.png">

- Each file is made up of blocks.

- By default each block has 3 copies or *replicas*.

- The replicas are equivalent---there is no primary replica.

- The block size for a file cannot be changed once a file is created.

- The replication for a file *can* be changed dynamically.


Custom Block Size and Replication
---------------------------------

Command                                          |Meaning
-------                                          |-------
`hadoop fs -D dfs.blocksize=67108864 -put file`  |Put `file` with block size 64 MB
`hadoop fs -D dfs.replication=1 -put file`       |Put `file` with replication 1
`hadoop fs -setrep 1 file`                       |Change replication of `file` to 1
`hadoop fs -setrep -R 1 dir`                     |Change replication of `dir` and contents to 1
`hadoop fs -setrep -w 1 file`                    |Change replication and wait until it completes
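
The `dfs.blocksize` value is given in bytes. The following is a minimal sketch of the arithmetic behind `67108864`, and of how the block size changes the number of blocks a file occupies. The 1 GB file size and the helper names are made up for illustration.

```python
MB = 1024 * 1024  # dfs.blocksize is expressed in bytes

def blocksize_bytes(megabytes):
    """Convert a block size in MB to the byte value passed to dfs.blocksize."""
    return megabytes * MB

def block_count(file_bytes, block_bytes):
    """Number of blocks needed to hold a file; the last block may be partial."""
    return (file_bytes + block_bytes - 1) // block_bytes

file_bytes = 1024 * MB  # hypothetical 1 GB file
print(blocksize_bytes(64))                            # 67108864
print(block_count(file_bytes, blocksize_bytes(64)))   # 16
print(block_count(file_bytes, blocksize_bytes(128)))  # 8
```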

File Security
-------------

Command                               |Meaning
-------                               |-------
`hadoop fs -chown jim file1`          |Change owner of `file1` to `jim`
`hadoop fs -chgrp staff file1`        |Change group of `file1` to `staff`
`hadoop fs -chown jim:staff file1`    |Change both owner and group
`hadoop fs -chown -R jim:staff dir1`  |Change both for `dir1` and its contents
`hadoop fs -chgrp -R staff dir1`      |Change group of `dir1` and its contents
`hadoop fs -chmod 755 file1`          |Set `file1` permissions to `rwxr-xr-x`
`hadoop fs -chmod -R 755 dir1`        |Set permissions for `dir1` and its contents
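
To see how an octal mode such as `755` maps to `rwxr-xr-x`, here is a small sketch that decodes the nine permission bits. The function name `mode_to_string` is made up for illustration and is not part of Hadoop.

```python
def mode_to_string(mode):
    """Render an octal permission mode such as 0o755 as an rwx string."""
    flags = "rwx"
    out = []
    # Walk the 9 permission bits from the owner's read bit to other's execute bit.
    for bit in range(8, -1, -1):
        out.append(flags[(8 - bit) % 3] if mode & (1 << bit) else "-")
    return "".join(out)

print(mode_to_string(0o755))  # rwxr-xr-x
print(mode_to_string(0o640))  # rw-r-----
```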

Hadoop Security
---------------

Q: What is the primary HDFS security model?

- HDFS uses the Unix file system security model.

- Unix secures access through authentication and authorization.

- Authentication: Who are you.

- Authorization: Who has access to a file.

- By default HDFS enforces authorization but not authentication.

Q: How can I impersonate someone to hack their files in HDFS? 

- `sudo -u jim hadoop fs -cat /user/jim/deep-dark-secrets.txt`

Q: How can I secure the system against impersonation?

- To fix this you have to enable Kerberos.

Hadoop Configuration
====================

Configuration Parameters
------------------------

Q: Why is configuration important?

- Hadoop uses its configuration system for customizing defaults.

- Almost everything is configurable in Hadoop.

- The configuration system is like Hadoop's spinal cord.

Q: How many configuration parameters are there?

- Over 900.

Q: Where can I get a list of all the configuration parameters?

- Go to <https://hadoop.apache.org/docs/stable/>

- Look at the lower left corner for links to the default configuration settings.

Changing Defaults
-----------------

Q: How can I change the default behavior of HDFS and MapReduce?

- Hadoop uses *configuration* files for all changes to the default
  behavior.

- Hadoop configuration files are located at `/etc/hadoop/conf`.

- This can be changed by adding a different location to the
  `CLASSPATH` environment variable.

Hadoop Configuration Files
--------------------------

Q: What are the different configuration files in `/etc/hadoop/conf`?

Configuration File   |Purpose                    |Which daemons read
------------------   |-------                    |------------------
`hadoop-env.sh`      |Environment variables      |All daemons
`core-site.xml`      |Core                       |All daemons
`hdfs-site.xml`      |HDFS                       |HDFS daemons
`yarn-site.xml`      |YARN                       |YARN daemons
`mapred-site.xml`    |MapReduce                  |MapReduce daemons and processes
`log4j.properties`   |Logging                    |All daemons
`include`            |Whitelist of worker nodes  |Master daemons
`exclude`            |Blacklist of worker nodes  |Master daemons
`allocations.xml`    |Scheduling                 |Master daemons

- `include`, `exclude`, and `allocations.xml` file names can be
  changed to something else. 

Configuration Settings Per Job
------------------------------

Q: Can I change the configuration values when I run a command or
submit a MapReduce job?

- Users can override some configuration values.

Q: What is the order of precedence, from highest to lowest?

- Configuration defined in `Job` object in Hadoop app
- Command-line `-D PARAM=VALUE`
- Client `/etc/hadoop/conf/*`
- Master or worker node `/etc/hadoop/conf/*`

Q: How can I override configuration properties through the command
line?

Configuration Command Line Syntax
---------------------------------

Syntax                       |Meaning
------                       |-------
`-D PARAM=VALUE`             |Set value of parameter `PARAM` to `VALUE`
`-fs hdfs://1.2.3.4:8020`    |Set NameNode to `hdfs://1.2.3.4:8020`
`-jt 1.2.3.4:8088`           |Set JobTracker to `1.2.3.4:8088`

Locking Configuration
---------------------

Q: How can an admin lock down configuration?

- Mark the property as `final` in the configuration file. Users can
  then no longer override it.

```
<property>
<name>some.property.name</name>
<value>some.value</value>
<final>true</final>
</property>
```
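
As a rough illustration of the effect of `final`, the sketch below parses a configuration snippet and refuses to override any property marked final. It is a toy model, not Hadoop's own loader. `other.property.name` and the helper functions are made up for the example.

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<configuration>
  <property>
    <name>some.property.name</name>
    <value>some.value</value>
    <final>true</final>
  </property>
  <property>
    <name>other.property.name</name>
    <value>other.value</value>
  </property>
</configuration>
"""

def load_site(xml_text):
    """Return ({name: value}, set of final names) from a site XML string."""
    values, finals = {}, set()
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name")
        values[name] = prop.findtext("value")
        if prop.findtext("final", default="false").strip() == "true":
            finals.add(name)
    return values, finals

def apply_overrides(values, finals, overrides):
    """Apply user overrides, skipping any property locked as final."""
    merged = dict(values)
    for name, value in overrides.items():
        if name not in finals:
            merged[name] = value
    return merged

values, finals = load_site(SITE_XML)
merged = apply_overrides(values, finals, {
    "some.property.name": "user.value",   # ignored: the property is final
    "other.property.name": "user.value",  # applied
})
print(merged)
```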

Deprecated Properties
---------------------

Property           |Example Value          |Meaning
--------           |-------------          |-------
`fs.defaultFS`     |`hdfs://1.2.3.4:8020`  |NameNode address and port
`fs.default.name`  |`hdfs://1.2.3.4:8020`  |NameNode address and port (deprecated)

- Deprecated properties are still supported.

- Watch out for multiple properties affecting the same configuration
  behavior.

- After changing configuration you have to restart the affected
  daemons.

Hadoop Architecture
===================

Problem: Large Files
--------------------

Q: How would you design a system that had to store files larger than
what you can fit on a single machine?

- Split the file into 128 MB chunks. We will call them *blocks*.

- Spread them across the cluster on worker nodes called *DataNodes*.

- Keep track of where each block is located on a *master node* called
  the *NameNode*.

- To read a file we ask the NameNode where all its blocks are located.
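
The design above can be sketched in a few lines: split a file into fixed-size blocks, then record where each block lives. This is a toy model. The file name, node names, and round-robin assignment are assumptions made for illustration and do not reflect how HDFS actually picks DataNodes.

```python
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as in the text

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file; the last block may be short."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]

def assign_blocks(path, file_size, datanodes):
    """Build a toy NameNode entry: block id -> (offset, length, DataNode).

    Round-robin assignment is a simplification for illustration only.
    """
    nodes = cycle(datanodes)
    return {f"{path}#blk_{i}": (offset, length, next(nodes))
            for i, (offset, length) in enumerate(split_into_blocks(file_size))}

# Hypothetical 300 MB file on a three-node cluster.
metadata = assign_blocks("/user/jim/sales.csv", 300 * 1024 * 1024,
                         ["dn1", "dn2", "dn3"])
for block_id, (offset, length, node) in metadata.items():
    print(block_id, offset, length, node)
```

The 300 MB file yields two full 128 MB blocks and one 44 MB block. Reading the file means asking the metadata map for each block's location in order.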

Problem: DataNode Failure
-------------------------

Q: What if a machine holding a block fails?

- Let's keep 3 copies or *replicas* of each block on different
  DataNodes.

- The NameNode has the metadata.

- The NameNode knows what blocks make up a file, and where all the
  replicas of the blocks are located.
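
Because the NameNode knows where every replica lives, it can work out which blocks lost a copy when a DataNode fails. The sketch below is one way to model that bookkeeping. Block and node names are hypothetical, and the choice of copy source and target is deliberately naive.

```python
TARGET_REPLICATION = 3

# Toy NameNode metadata: block id -> set of DataNodes holding a replica.
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
    "blk_3": {"dn1", "dn3", "dn4"},
}

def handle_datanode_failure(locations, failed_node, live_nodes):
    """Drop the failed node and plan a new copy for each under-replicated block."""
    plan = {}
    for block, holders in locations.items():
        holders.discard(failed_node)
        if len(holders) < TARGET_REPLICATION:
            candidates = sorted(set(live_nodes) - holders)
            if candidates and holders:
                # Copy from a surviving replica to the first node that lacks one.
                plan[block] = (sorted(holders)[0], candidates[0])
    return plan

plan = handle_datanode_failure(block_locations, "dn1", ["dn2", "dn3", "dn4"])
for block, (source, target) in plan.items():
    print(f"{block}: copy from {source} to {target}")
```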

NameNode DataNode Communication
-------------------------------

- All DataNodes heartbeat into the NameNode every 3 seconds.

- The heartbeat says that the DataNode is alive and how busy it is.

- Every 6 hours each DataNode sends a block report to the NameNode.

- The block report lists all the blocks on the DataNode.

- The NameNode sends back any commands for the DataNode as a response.
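
A rough model of this exchange: the NameNode records the time of each heartbeat, returns any queued commands as the response, and treats a DataNode as dead once it has been silent for too long. The class below is a sketch. The 630 second timeout is an assumed value for illustration, so check your cluster's settings for the real one.

```python
HEARTBEAT_INTERVAL = 3  # seconds, from the text
DEAD_AFTER = 630        # seconds of silence before a node counts as dead (assumed)

class ToyNameNode:
    """Tracks the last heartbeat per DataNode and replies with queued commands."""

    def __init__(self):
        self.last_seen = {}
        self.pending = {}

    def queue_command(self, node, command):
        self.pending.setdefault(node, []).append(command)

    def heartbeat(self, node, now):
        """Record the heartbeat and return queued commands as the response."""
        self.last_seen[node] = now
        return self.pending.pop(node, [])

    def dead_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > DEAD_AFTER)

nn = ToyNameNode()
nn.heartbeat("dn2", now=0)  # dn2 heartbeats once, then goes silent
nn.queue_command("dn1", "replicate blk_7 to dn3")
replies = [nn.heartbeat("dn1", now=t) for t in range(0, 700, HEARTBEAT_INTERVAL)]
print(replies[0])              # ['replicate blk_7 to dn3']
print(nn.dead_nodes(now=700))  # ['dn2']
```

Note that the NameNode never calls a DataNode directly. Commands wait in the queue until the next heartbeat arrives.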

NameNode DataNode Communication
-------------------------------

- DataNodes initiate conversation.

- NameNode responds with instructions.

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-nn-dns.png">

Heartbeats
----------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-nn-dn-communication.png">

Block Reports
-------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-block-report.png">

Pop Quiz
--------

<details><summary>
Q: Does the NameNode ever initiate conversation with the DataNode? 
</summary>
No. It only communicates in response to a heartbeat or a block report.
</details>

<details><summary>
Q: Why does the NameNode not initiate conversation with the DataNode? 
</summary>
By waiting for heartbeats the NameNode ensures that it is only talking
to DataNodes that are alive.
</details>

HDFS Read
---------

Q: How does a client read a file from HDFS?

- The client requests the file from the NameNode.

- The NameNode sends back the locations of the first 10 blocks.

- The client reads the closest replica for each block.
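
One way to model "closest" is to treat the cluster as a tree of racks and nodes and count the hops between two machines. The sketch below assumes topology paths of the form `/rack/node`. The paths and function names are hypothetical.

```python
def distance(a, b):
    """Hops between two nodes given topology paths like '/rack1/node1'.

    The distance is the number of edges from each node up to their closest
    common ancestor, summed over both nodes.
    """
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def closest_replica(client, replicas):
    """Pick the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda node: distance(client, node))

client = "/rack1/node1"
replicas = ["/rack2/node5", "/rack1/node3", "/rack2/node6"]
print(distance(client, "/rack1/node1"))   # 0: same node
print(distance(client, "/rack1/node3"))   # 2: same rack
print(distance(client, "/rack2/node5"))   # 4: different rack
print(closest_replica(client, replicas))  # /rack1/node3
```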

HDFS Closeness
--------------

Q: How does an HDFS client determine closeness?

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-closeness.png">

Pop Quiz
--------

<details><summary>
Q: Can a DataNode be an HDFS client?
</summary>
1. Yes.<br>
2. Any machine that wants to interact with HDFS files is a client.<br>
3. Closeness only makes sense for clients in the cluster.<br>
</details>

HDFS Read
---------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-file-read.png">

HDFS Read Sequence
------------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-file-read-seq.png">

HDFS Write 
----------

Q: How does a client write a file to HDFS?

- Client tells NameNode it wants to put file.

- NameNode gives it list of 3 DataNodes for first block.

- Client sends first chunk of first block to DataNode 1.

- DataNode 1 sends it to DataNode 2, which sends it to DataNode 3.

- Each receiver acknowledges the chunk.

- Client repeats this for each block, getting a new list of DataNodes
  from the NameNode each time.

Q: What if one or two DataNodes go down?

- Client reconstitutes pipeline with surviving DataNodes. 

- After the block is uploaded, the NameNode tells a new DataNode to
  replicate it.
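
The write path and its recovery step can be approximated with a small simulation. This is only a sketch of the idea. The `fail_before` parameter, the chunk names, and the node names are invented for the example, and acknowledgements are not modelled.

```python
def write_block(chunks, pipeline, fail_before=None):
    """Push chunks through a DataNode pipeline, dropping nodes that fail.

    fail_before maps a DataNode to the index of the chunk it dies before
    receiving (hypothetical failure schedule for the simulation).
    """
    fail_before = fail_before or {}
    live = list(pipeline)
    stored = {dn: [] for dn in pipeline}
    for index, chunk in enumerate(chunks):
        # Rebuild the pipeline from the surviving DataNodes.
        live = [dn for dn in live if fail_before.get(dn) != index]
        if not live:
            raise RuntimeError("no DataNodes left in the pipeline")
        # The client sends to the first node; each node forwards downstream.
        for dn in live:
            stored[dn].append(chunk)
    missing = len(pipeline) - len(live)  # replicas the NameNode must restore later
    return live, stored, missing

live, stored, missing = write_block(
    ["c1", "c2", "c3"], ["dn1", "dn2", "dn3"], fail_before={"dn2": 1})
print(live)     # ['dn1', 'dn3']
print(stored)   # dn2 holds only a partial copy
print(missing)  # 1
```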

HDFS Write 
----------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-file-write.png">

HDFS Write Sequence
-------------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-file-write-seq.png">


Pop Quiz
--------

<details><summary>
Q: Can the client read a file while it is being written?
</summary>
1. Yes.<br>
2. The reader may see only part of the file and mistake it for the whole file.<br>
3. Clients should coordinate to ensure writing is done before reading.<br>
</details>


Block Placement Policy
----------------------

Q: How does the NameNode decide where to place blocks?

- If client is a DataNode then NameNode places the first replica on it.

- Otherwise it places the first replica based on which DataNode has
  heartbeated in and is least busy.

- The second replica is placed on a different rack from the first
  replica.

- The third replica is placed in the same rack as the second but on a
  different node.

- Further replicas are placed on random nodes in the cluster.
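
The policy above can be written down as a short function. The sketch below assumes at least two racks, with at least two nodes on the rack chosen for the second replica. It does not model DataNode load, so the "least busy" choice is replaced by a random pick. The topology and node names are hypothetical.

```python
import random

def place_replicas(topology, client=None, replication=3, rng=random):
    """Choose (rack, node) targets for one block.

    topology maps a rack name to its list of DataNodes.
    """
    all_nodes = [(rack, node) for rack, nodes in topology.items() for node in nodes]
    # First replica: the client's own node if it is a DataNode, else a random
    # node (standing in for "least busy", which is not modelled here).
    first = next((rn for rn in all_nodes if rn[1] == client), None) or rng.choice(all_nodes)
    # Second replica: a node on a different rack from the first.
    second = rng.choice([rn for rn in all_nodes if rn[0] != first[0]])
    # Third replica: same rack as the second, but a different node.
    third = rng.choice([rn for rn in all_nodes if rn[0] == second[0] and rn != second])
    targets = [first, second, third]
    # Further replicas: random nodes not already chosen.
    while len(targets) < replication:
        remaining = [rn for rn in all_nodes if rn not in targets]
        if not remaining:
            break
        targets.append(rng.choice(remaining))
    return targets[:replication]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
targets = place_replicas(topology, client="dn1", rng=random.Random(0))
print(targets)
```

With the client running on `dn1`, the first replica lands on `dn1` and the other two land on the two nodes of `rack2`.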

Block Placement Policy
----------------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-block-placement-policy.png">

Pop Quiz
--------

<details><summary>
Q: Suppose you notice that a particular DataNode keeps getting filled
up with block replicas. What might be the problem?
</summary>
1. It is likely that there is a program on the DataNode that is
writing files to HDFS.<br>
2. If you run a program on the DataNode make sure that you round-robin
it with all the DataNodes to ensure balancing.<br>
</details>

MapReduce
=========

MapReduce Intro
---------------

Q: What is MapReduce?

- MapReduce is a system for processing data on HDFS.

- In the *map* phase data is processed locally, with one *mapper* per
  block.

- In the *reduce* phase the results of the map phase are consolidated.
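
To make the two phases concrete, here is a minimal in-memory sketch that computes revenue by state for the sales data. It is plain Python, not the Hadoop API. The first two rows come from the sales table at the start of these notes, and the third row is made up so that one state has more than one record.

```python
from collections import defaultdict

# Two "blocks" of pipe-delimited sales records.
blocks = [
    ["101|11/13/2014|100|WA|331|300.00", "104|11/18/2014|700|OR|329|450.00"],
    ["107|11/20/2014|100|WA|329|150.00"],  # made-up row for illustration
]

def mapper(lines):
    """Map phase: emit (state, amount) for each record in one block."""
    for line in lines:
        _id, _date, _store, state, _product, amount = line.split("|")
        yield state, float(amount)

def shuffle(pairs):
    """Group mapper output by key so each reducer sees one state's amounts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(state, amounts):
    """Reduce phase: consolidate one state's amounts into a total."""
    return state, sum(amounts)

mapped = [pair for block in blocks for pair in mapper(block)]  # one mapper per block
result = dict(reducer(state, amounts) for state, amounts in shuffle(mapped).items())
print(result)  # {'WA': 450.0, 'OR': 450.0}
```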

Data Locality
-------------

Q: What is *data locality*?

- Data locality is the secret sauce in HDFS and MapReduce. 

- Data locality means MapReduce runs mappers locally, on the machines that hold the HDFS blocks.

- When these machines are busy, mappers may start on other machines.


Data Locality
-------------

<img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/hdfs-map-reduce-data-locality.png">


MapReduce Examples
------------------

Q: How can I run MapReduce programs?

- Hadoop ships with some MapReduce example programs.

- The jar files are located at `/usr/lib/hadoop-mapreduce`.

- Here is how you can get more information about them.

Command                                                    |Result
-------                                                    |------
`hadoop jar /path/hadoop-mapreduce-examples-VER.jar`       |Lists all programs
`hadoop jar /path/hadoop-mapreduce-examples-VER.jar sleep` |Lists usage of `sleep`

Q: What are some of the interesting example programs?

Program                |Description
-------                |-----------
`aggregatewordcount`   |Counts words in input
`grep`                 |Counts matches of regex in input
`pi`                   |Estimates Pi using the Monte Carlo method
`sleep`                |Sleeps
`sudoku`               |Sudoku solver
`teragen`              |Generates data for terasort
`terasort`             |Runs terasort
`teravalidate`         |Checks results of terasort

Pop Quiz
--------

<details><summary>
Q: What is the advantage of running `grep` as a MapReduce program?
</summary>
`grep` on MapReduce will scan blocks in parallel and complete
faster.
</details>