# Module 3: Hadoop and Related Tools

## Introduction
Apache Hadoop is an ecosystem of software modules and frameworks that collectively form a highly scalable system for doing analytics on massive datasets.  The Apache Hadoop project was born from the need to efficiently process and analyze data at Internet scale.  Relational databases are very nice for doing queries but are too slow for truly big (terabytes or petabytes of) data. In this module we will explore the Apache Hadoop ecosystem and related tools.

## Learning Outcomes

In this module, you will:
* Explore the projects that make up the Apache Hadoop ecosystem
* Gain an understanding of the MapReduce programming model
* Get a practical overview of Hadoop-related tools


## Readings and Resources

We invite you to further supplement this notebook with the following recommended texts.

> George, L., Kunigk, J., Wilkinson, P., and Buss, I. (2018). Architecting Modern Data Platforms. A Guide to Enterprise Hadoop at Scale. O’Reilly. http://shop.oreilly.com/product/0636920054825.do

> Holmes, A. (2014). Hadoop in Practice, 2nd Edition. Manning. https://www.oreilly.com/library/view/hadoop-in-practice/9781617292224/

> Sammer, E. (2012). Hadoop Operations.  A Guide for Developers and Administrators. O’Reilly. http://shop.oreilly.com/product/0636920025085.do

> Seidman, J., Shapira, G., Malaska, T., and Grover, M. (2015).  Hadoop Application Architectures. O’Reilly. http://shop.oreilly.com/product/0636920033196.do

> Wampler, D., Rutherglen, J., and Capriolo, E. (2012). Programming Hive. Data Warehouse and Query Language for Hadoop. O’Reilly. http://shop.oreilly.com/product/0636920023555.do

> White, T. (2015). Hadoop: The Definitive Guide 4th Edition. O’Reilly. http://shop.oreilly.com/product/0636920033448.do

## Basics of Hadoop

The Hadoop ecosystem provides a set of tools, services and systems that allows organizations to create a system that exhibits the following characteristics:
* **Scalable** - The system achieves high scale by processing large volumes of data across a cluster of servers (nodes).    Nodes can be easily added or removed from the cluster to scale the system up or down as needed.
* **Reliability/Fault Tolerance** – Data is replicated across multiple nodes.  That way if one node goes down the system can still function by shifting processing to another node that has a copy of the same data.
* **Distributed Processing** – Hadoop makes use of the **Map Reduce** programming model (described in detail below), and breaks down data processing into units of work called **jobs**.  Jobs are designed such that they can run independently from each other.  As a result, jobs can be run in parallel on separate nodes within the cluster.
* **High Availability** – Since data is replicated across multiple nodes, the data is highly available to jobs running in the cluster.
* **Flexible** – Hadoop can handle structured, semi-structured and unstructured data.
* **Ease-of-Use** – Hadoop handles the details of distributed computing by taking care of orchestrating jobs and replicating data across the cluster.  This allows organizations to concentrate on the higher level business logic without worrying about the messy details of managing concurrent processing.
* **Economical** – All the software components within the ecosystem are licensed under open-source licenses.  Furthermore, these software systems can run on low-cost hardware which makes it financially easy for organizations to adopt.
* **Data Locality** -  Hadoop makes an effort to run jobs on nodes that are physically close to the data they require.  This minimizes the network traffic across the cluster of nodes which would otherwise be a bottleneck.

### Core projects (components) of Hadoop
The Hadoop ecosystem is made of many different software projects that provide an array of software libraries, frameworks, tools and runtimes.  The following are the set of core components that all Hadoop distributed systems contain:
* **MapReduce** – a programming paradigm that is used to create programs that run on a distributed system. 
* **Hadoop Distributed File System (HDFS)** – a file system that stores replicated data in a distributed system
* **Yet Another Resource Negotiator (YARN)** – a system that manages the execution of programs in a distributed system 

### Additional Apache projects related to Hadoop
There are several other Hadoop-related open source projects that add additional capabilities such as security, easier-to-use programming models, machine learning algorithms, management tools, etc.  Organizations can mix and match these components to create a highly scalable system that meets their specific needs.  New projects are regularly being developed that add even more capabilities.  The following table shows the most commonly used Hadoop-related projects.  Later in the course we will look at another, Spark, in detail. 

<table> 
    <tr><th>Category</th><th>Open Source Projects</th></tr>
    <tr><td>Data Ingestion and Extraction</td><td><ul><li>Flume</li><li>Kafka</li><li>Sqoop</li></ul></td></tr>
    <tr><td>High-Level Data Language Models</td><td><ul><li>Pig</li><li>Hive</li></ul></td></tr>
    <tr><td>Data Storage Systems</td><td><ul><li>HBase</li><li>HCatalog</li></ul></td></tr>
    <tr><td>Big Data File Formats</td><td><ul><li>ORC</li><li>Parquet</li><li>Avro</li></ul></td></tr>
    <tr><td>Management Systems and Tools</td><td><ul><li>ZooKeeper</li><li>Ambari</li><li>Azkaban</li><li>Falcon</li><li>Sentry</li><li>Ranger</li></ul></td></tr>
    <tr><td>Data Analytics Tools and Frameworks</td><td><ul><li>Drill</li><li>Mahout</li></ul></td></tr>
</table>

## Hadoop Terminology

The world of Hadoop and cluster computing involves a lot of nomenclature, some of which may be new to you.  Here is a handy glossary to refer back to as you read the following sections.

**Application**: An **application** is a program you've written, usually in Java (or something that calls precompiled modules in Java) to do a Big Data computation for you

**Application JAR**: If you're using Java or Scala, this is essentially the compiled executable of your application

**Cluster**: A set of similar servers that are networked tightly together so they can be used for parallel computing

**Cluster Manager**: The operating system-like program that manages resources (CPU, disk, memory) on a cluster.  This is something that is set up when the cluster is first created.  Hadoop's cluster manager is called YARN (Yet Another Resource Negotiator).

**Combiner**: A mini-**reducer** task that is used to do a small aggregation step to reduce the amount of data that needs to go to the next full **map** or **reduce** task

**DataNode**: A program that runs on a Hadoop cluster **node** and manages access to data on that **node**

**Driver Program**: The main() function of your application. It usually runs on your local workstation or laptop and kicks off a massively parallel computation on a Hadoop cluster.

**HDFS**: The Hadoop Distributed File System.  This is a layer that sits on top of the host operating system (usually Linux) running on each **node**.  It enables files to be larger than the host operating system would otherwise allow and to be distributed across multiple cluster **node**s

**Job**: A parallel computation consisting of multiple **task**s that gets created by your **application**

**Key-Value Pair**: A 2-tuple where the first entry is called the key and the second is called the value e.g. ('age', 28) or as it would be written in JSON or a Python dictionary: { 'age': 28 }

**Map**: A **task** that transforms data

**Mapper**: A **map task** (or more precisely an instance of the Java object that describes it)

**Master Node**: A **node** in a Hadoop cluster that is dedicated to managing the cluster (as opposed to being used for processing data).  See **Slave/Worker Node**

**Node**: A single server out of many in a cluster.  This can be either a physical server or a virtual machine.

**MapReduce**: The **programming model** that Hadoop supports

**Metadata**: Data about the layout of data in a file, i.e. whether it is raw binary data or tabular and, if tabular, how many columns it has, what the columns' names are, etc.

**Partitioner**: A **task** that splits the data coming out of a **map** task so that the data is roughly evenly split between the **reducer**s in the next stage

**Programming Model**: A pattern or style of program code; an abstraction that simplifies programming.  For example, SQL is based on the **programming model** that all data takes the form of tables that can be manipulated with the kind of declarative statements it supports.  Data which doesn't fit this model probably can't be processed using SQL.  Hadoop assumes all the operations you will want to do on data can be expressed as either a **map** or a **reduce**.

**Reduce**: A task that aggregates data

**Reducer**: A **reduce task** (or more precisely the Java object that describes it)

**Slave/Worker Node**: A node that runs **MapReduce** tasks

**Sort/Shuffler**: A **task** that sorts data after a **mapper** phase so that all data with the same key are together allowing aggregation to be done in a single pass over the data 

**Stage**: Each job gets divided into smaller sets of tasks, called **stages**

**Task**: A unit of work on a Hadoop cluster

**Worker Node**: Any node that can run application code in the cluster

## MapReduce Programming Model
Hadoop was designed to process large quantities of data, primarily for analytics, using a specific programming model (or pattern) called MapReduce. The terms *map* and *reduce* come from functional programming.  A *map* in functional programming is a transformation of some kind i.e. a mapping of data from one value to another.  A *reduce* is some type of aggregation.  (If you're familiar with functional programming you might want to know what happened to *filter* and *fold*. *Filter* is lumped in with *map* in MapReduce and *reduce* is another word for *fold*).
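
To make the functional-programming vocabulary concrete, here is a minimal sketch using Python's built-in `map` and `filter` and `functools.reduce`. The word list is made up for illustration, and this runs on a single machine; it only shows what the three operations mean, not how Hadoop executes them.

```python
from functools import reduce

words = ["Baseball", "Football", "Golf", "Golf"]  # illustrative input

# map: transform each value independently of the others
lengths = list(map(len, words))

# filter: keep only the values that pass a test
long_words = list(filter(lambda w: len(w) > 4, words))

# reduce (fold): combine all values into a single aggregate
total_length = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, long_words, total_length)
```

Because `map` and `filter` look at one element at a time, they can be applied to different chunks of the data on different machines, which is the property MapReduce exploits.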

The data in Hadoop is usually stored in files containing key-value pairs. In essence, MapReduce is a divide-and-conquer technique that breaks down a problem into smaller independent jobs that process these key-value pair files and does it relatively quickly because it splits the work into a lot of small tasks that run in parallel on a cluster of servers.

* **Map** – responsible for parsing, transforming and filtering either raw or key-value data into a new dataset of key-value pairs.
* **Reduce** – receives input from Map jobs and is responsible for grouping and aggregating the data into a smaller set of key-value pairs.

The important takeaway is that the big data problem is attacked by breaking the data into smaller chunks that are processed using many smaller jobs rather than as one big computation.  Hadoop was invented because the data sizes were getting so large that it was impossible to process the datasets otherwise.  It solves two key problems:

* The processing of big data would otherwise just take way too long (you might wait months for a result)
* When you distribute the computing across many nodes, chances are excellent that for a long computation one or more nodes will fail during a run, so fault detection and automatic recovery is critical

### Map Reduce Application Flow
Within the MapReduce programming model, an application transitions through two main stages as illustrated below.  Each phase is composed of multiple tasks.
![MapReduceApplicationPipeline.png](attachment:MapReduceApplicationPipeline.png)
#### Phase 1: The Mapper phase
During the Mapper phase, the application executes the following three kinds of tasks:
1. **Mapper** – mapper tasks transform input data from the file system into intermediate key-value output pairs.  
2. **Combiner** – combiner tasks are optional and are also known as “Mini-Reducer” tasks.  These tasks summarize the map output pairs with the same key before sending the data to the reducer phase.  This helps the overall performance of the system by decreasing the amount of data that needs to be sent to the reducer phase.  
3. **Partitioning** – partitioners take the key-value output pairs and determine where the output is routed to.
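
The routing decision made by a partitioner can be sketched as a hash of the key taken modulo the number of reducers. Hadoop's default partitioner follows the same idea using the key's Java hash code; the sketch below substitutes `zlib.crc32` as a stable stand-in, and the function name `partition` and the sample pairs are assumptions for illustration.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer that should receive `key`."""
    # crc32 is stable across runs, unlike Python's built-in hash() for
    # strings, which is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical mapper/combiner output
pairs = [("Golf", 2), ("Hockey", 1), ("Football", 1), ("Golf", 1)]

# Route every pair to a reducer bucket
routed = {}
for key, value in pairs:
    routed.setdefault(partition(key, 2), []).append((key, value))

print(routed)
```

Since the reducer index depends only on the key, every pair with the same key ends up at the same reducer, no matter which mapper produced it.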

#### Phase 2: The Reducer phase
During the Reducer phase, the application executes the following two types of tasks:
1. **Shuffle/Sort** – during this stage the output pairs are sorted and merged.  The output of this task is a set of keys, each with its values, that is sent to the reduce tasks.  This task brings together items to be aggregated on the same nodes so that the aggregation (reduce) can run quickly.
2. **Reduce** – finally, this task produces the final resulting output that is written back to the file system.

Let’s look at an example to understand the MapReduce programming model.  In our example, we will perform a word count on a document.  That is, we will count the number of occurrences for each unique word within the document. Consider the following document:

```
Baseball Football Golf
Golf Hockey Football Golf
Baseball Hockey Football
```
To process this document, we can construct a MapReduce pipeline as shown in the diagram below.

![MapReduceExamplePipeline.png](attachment:MapReduceExamplePipeline.png)

In our pipeline, we first split the document line by line.  We can store each line into a separate file on Hadoop's Distributed File System (HDFS).  We have 3 mapper jobs that each consume (read) one of the lines.  The mapper job extracts each word from the line of text and produces a key-value pair where the key is a word and the value is 1 (representing 1 instance of the word--we will add up the ones to get a total count in the next step).  Thus, the line, “Golf Hockey Football Golf” will produce the following key-value pairs:
```
Golf: 1
Golf: 1
Hockey: 1
Football: 1
```
This output is then fed into the Combiner task, which aggregates all the data with the same key.  Thus, the output of the Combiner task is as follows:
```
Golf: 2
Hockey: 1
Football: 1
```
Note that the Combiner task takes the two "Golf" key-value pairs and collapses them into one key-value pair where the key is “Golf” and the value is the aggregated sum of “2”.

Next, the Partitioner determines which Sort/Shuffle task will process a particular key-value pair.  The Partitioner uses a **hash function** to make the determination.  In our example, we will use the key from the key-value pair as the input to the hash function to determine which Sort/Shuffle task will handle the key-value pair.  As a result, keys that are the same are routed to the same Sort/Shuffle task.

Finally, during the “Shuffle and Sort” task, the key-value pairs are processed to produce a key with a set of values (in this case a list of counts of words in a line) that will be sent to the Reducer task.  For instance, the Reducer that handles the “Football” key would receive the following input:
```
Football: 1,1,1
```
The Reducer will receive these pairs of keys with lists of counts as their values and output the aggregated key-value pair, in this case summing up the 1's in the value list.  This will produce the final word count result which is stored back to HDFS. In the case of the Reduce task that handles the "Football" key, the output would be as follows:
```
Football: 3
```
Unfortunately though, not all applications are well-suited to a Map-Reduce paradigm.  Hadoop is only for large-scale problems that can be naturally broken down into concurrent jobs like this.
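
The word-count walkthrough above can be simulated end to end in plain Python. This is a single-process sketch of the data flow only, not Hadoop code: the function names, the choice of two reducers, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import zlib
from collections import Counter, defaultdict

document = [
    "Baseball Football Golf",
    "Golf Hockey Football Golf",
    "Baseball Hockey Football",
]
NUM_REDUCERS = 2  # assumed value for illustration

def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Pre-aggregate pairs that share a key within one mapper's output."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def partition(key):
    """Pick a reducer for the key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def reducer(key, values):
    """Sum the list of counts collected for one key."""
    return key, sum(values)

# Mapper phase: one mapper (followed by a combiner) per line
combined = [combiner(mapper(line)) for line in document]

# Shuffle/sort: route each pair to a reducer and group values by key
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for pairs in combined:
    for key, value in pairs:
        reducer_inputs[partition(key)][key].append(value)

# Reducer phase: aggregate each key's list of values
word_counts = dict(
    reducer(key, values)
    for grouped in reducer_inputs
    for key, values in sorted(grouped.items())
)
print(word_counts)
```

In a real cluster each mapper, combiner and reducer call would run as a separate task, possibly on a different node; the sketch keeps them in one loop only to make the intermediate data visible.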

### Map-Reduce High-level Programming Models
As we’ve seen in the previous section, writing a MapReduce application is quite complicated and is composed of many parts.  Consider a Hadoop system that produces a business report by processing petabytes of business data.  We would need to figure out how to write a program consisting of all the necessary MapReduce functions to split, map, combine, partition, sort, group and reduce the data.  This can be a daunting task to create even a simple report.  Fortunately, the Hadoop ecosystem provides two software tools that provide higher-level programming models to hide the details of writing a MapReduce program.  
* **Hive** – provides SQL functionality to allow for data warehouse tasks
* **Pig** – provides a high-level SQL-like scripting language for creating data analysis programs

Both Hive and Pig make the process of developing MapReduce programs simpler and more understandable than writing them directly.  These programming models allow developers to concentrate on the business logic of the application rather than the details of writing MapReduce tasks.

### Hive
Hive is a software system that is used for querying and analyzing large datasets stored in the Hadoop filesystem.  Hive defines a declarative language called HiveQL that allows users to do SQL-like queries.  Because the data is stored in key-value pairs in files rather than a relational database, there are some limitations on the types of SQL queries that can be done with acceptable performance.

The following are some sample HiveQL queries that give you a sense of the syntax of the language:

A simple query that retrieves all columns and rows from table t1:
```
SELECT * FROM t1 
```
The following query returns the first 5 customers sorted by the create_date field:
```
SELECT * FROM customers ORDER BY create_date LIMIT 5
```
Look familiar? Although HiveQL is very similar to SQL, there are some important differences.  For more information on the syntax of the language refer to Hive’s Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
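
Running Hive requires a cluster, but because the second sample query is also valid standard SQL, its semantics can be tried locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not Hive, and the `customers` table contents are invented for illustration.

```python
import sqlite3

# In-memory database as a local stand-in (this is SQLite, not Hive)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, create_date TEXT)")

# Seven hypothetical customers with descending creation dates
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c%d" % i, "2020-01-%02d" % (10 - i)) for i in range(1, 8)],
)

# Same statement as the HiveQL sample above
rows = conn.execute(
    "SELECT * FROM customers ORDER BY create_date LIMIT 5"
).fetchall()

for row in rows:
    print(row)
```

On Hive, a statement like this would be compiled into jobs that run across the cluster rather than executed by a single database engine.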

#### Hive Runtime
The following diagram shows the various components of Hive and illustrates how Hive takes a query and runs the query on a Hadoop system.  
![HiveRuntime.png](attachment:HiveRuntime.png)

The flow of a Hive query is as follows: 
1. The HiveQL query is submitted by the client (either a browser app or another system) to the Hive runtime.  
2. The driver consumes (reads) the query and sends it to the HiveQL compiler.  The compiler will parse the query and construct an execution plan.  The execution plan is a set of instructions that define the various MapReduce tasks required to satisfy the query.  
3. Next, metadata information, such as the table schema, is retrieved from the metastore.  The responsibility of the **metastore** is to tell the compiler where the data is stored in Hadoop.  That is, given a set of tables, the metastore knows which data node in Hadoop contains the information needed by the query.
4. The execution plan is then submitted back to the driver.
5. The driver will then submit the plan to the execution engine.
6. The execution engine will interpret these instructions and communicate with the Hadoop cluster (Hadoop distributed system) to execute the query.  
7. The results are then received back to the execution engine.
8. The execution engine sends the result back to the driver.
9. Finally, the driver sends the result back to the client.

### Pig
Pig is a platform that defines a high-level dataflow language that looks like a cross between Python and SQL. It has an advantage over SQL in that it gives you more control over how a query is processed and so for a skilled user can return results more quickly in some cases than using Hive.  This was more of an issue in the early days of Hadoop when the Hive query planner was new and not as good as it is today. Like Hive, Pig generates a set of MapReduce programs for you that Hadoop can then process.

Appropriately, this language is called, you guessed it, Pig Latin.  This language was developed with the following design goals:
* **Ease of programming** - Complex tasks such as grouping, sorting and transformations are easy to write, understand and maintain.
* **Optimization opportunities** - The language hides the complexity of execution optimization, allowing developers to focus on the semantics of the task rather than the efficiency.
* **Extensibility** - Developers can create their own functions to do custom processing.

The Pig Latin language defines many data-processing operations like join, filter, group, etc.   For more information on the syntax of the language refer to http://pig.apache.org/docs/r0.17.0/basic.html.  To get a sense of the language, consider the following Pig script.  The script counts the number of occurrences of each word and stores the output to a file on HDFS.  Notice that we don’t need to create any MapReduce functions and we only need to worry about the business logic of counting words.  
```Pig
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
 
 -- Extract words from each line and put them into a pig bag
 -- datatype, then flatten the bag to get one word on each row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 
 -- filter out any words that are just white spaces
 filtered_words = FILTER words BY word MATCHES '\\w+';
 
 -- create a group for each word
 word_groups = GROUP filtered_words BY word;
 
 -- count the entries in each group
 word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
 
 -- order the records by count
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
```
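
For readers more familiar with Python than Pig Latin, the same dataflow can be sketched step by step on a small in-memory list. This is only an analogy for what each Pig statement does: the sample lines are invented, and `str.split()` is a simplification of Pig's `TOKENIZE`, which also splits on a few punctuation characters.

```python
import re
from collections import Counter

# Hypothetical stand-in for the lines loaded from HDFS
input_lines = ["Golf Hockey Football Golf", "Baseball -- Hockey"]

# FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row
words = [word for line in input_lines for word in line.split()]

# FILTER ... MATCHES '\\w+': keep tokens made up only of word characters
filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

# GROUP ... BY word, then COUNT the entries in each group
word_count = Counter(filtered_words)

# ORDER ... BY count DESC
ordered_word_count = sorted(
    word_count.items(), key=lambda kv: kv[1], reverse=True
)
print(ordered_word_count)
```

The Python version holds everything in memory on one machine, whereas Pig turns each of these steps into work that Hadoop distributes across the cluster.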
#### Pig Runtime
The Pig runtime consists of several components that process a Pig script and produce a MapReduce Java program optimized to run on a Hadoop distributed system.  The following illustrates the components of the Pig runtime. 

![PIGRuntime.png](https://cdn.educba.com/academy/wp-content/uploads/2020/01/pig-architecture.jpg.webp)

The Pig runtime processes the Pig Latin script as follows:
1. The **parser** reads the Pig Latin script and constructs an internal representation of it called a logical plan.
2. The logical plan is then fed into the **optimizer**.  The optimizer will modify the logical plan so that the instructions can run efficiently.  
3. The logical plan is fed into the **compiler**, where the compiler will construct an execution plan.  Just like the execution plan we discussed in the Hive runtime, the execution plan is a set of instructions that define the various MapReduce tasks that represent the query.  
4. The execution plan is sent to the **execution engine**.  
5. The Execution engine will interpret these instructions and communicate with the Hadoop cluster (**Hadoop distributed system**) to execute the execution plan.  

## Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the main storage system used to store and retrieve data across a distributed cluster of nodes when using Hadoop.  HDFS is based on the Google File System and so is also highly scalable, fault tolerant and reliable.  In addition, HDFS can run on affordable and easy to obtain commodity hardware (i.e. on cheap, mass-produced general-purpose servers).

The HDFS architecture is based on the following guiding principles:
* **Tolerance of Hardware Failure** – the file system should be able to detect hardware failures and automatically recover from these system failures.
* **Streaming Data Access** – the file system should support high throughput of data.  
* **Large Data Sets** – the file system should be able to support file sizes of gigabytes to petabytes.  Moreover, the system should be able to support tens of millions of files.
* **Simple Coherency Model** - Internet-scale analytics applications need to capture large volumes of data arriving that needs to be written to disk quickly so nothing is lost. If the data is to be used for analytics it won't need to be updated once stored i.e. it's an immutable record of history.  HDFS files can be added to at the end only, as new data comes in, but not modified later.  In order to capture data fast (a key Hadoop design goal) HDFS does not support indexes, transactions or in-place editing like relational datastores.
* **“Moving Computation is Cheaper than Moving Data”** – the file system should provide the ability to execute computational jobs so that they are physically close to the data.  This achieves data locality to minimize network congestion and improve the overall performance of the system.  Hadoop sends the (usually short) programs out to the servers to run rather than bringing the data from where it is stored to where it will be processed.  This is the reverse of the traditional client-server system design where data is stored in database servers and moves to application servers for business logic processing.
* **Portability Across Heterogeneous Hardware and Software Platforms** – to maximize widespread adoption, the file system should be supported on as many different platforms as possible. 

### HDFS Clusters
HDFS runs on a distributed system that is based on a **cluster computing** architecture.  A **cluster** is a group of one or more computers or systems referred to as **nodes** that work together as a single centralized data processing resource. A cluster being managed by Hadoop will have anywhere from a handful of nodes for an experimental setup to tens of thousands for an Internet portal like Yahoo!. Nodes within the cluster communicate with each other using a dedicated network as the speed of communication between the nodes can be a key determinant of how fast it can complete a computation.

Hadoop's benefits such as performance, scalability, and fault tolerance are a result of using this kind of cluster architecture.  For instance, to scale out the system, one can easily add nodes to an existing cluster to boost the computing power.  Also, if a node fails and goes down, other nodes in the cluster can take over and handle completing the computation.

HDFS clusters are based on a master-slave cluster model where a master node manages a set of slave nodes.   Each HDFS node is one of two types, with workers being more numerous than masters:

* **Master Node** – Master nodes keep track of where data is stored within the clusters' data-storing nodes and manage the MapReduce tasks running in the cluster. There is one specific type of master node that is critical to the operation of a Hadoop cluster called the NameNode.  The responsibility of the NameNode is to provide centralized access to keep track of all information associated with a file (i.e. the metadata).  The NameNode is also responsible for managing data replication across all its slave nodes.  
* **Slave/Worker Node** - Slave/Worker nodes store copies of the data and run the MapReduce tasks on the data stored in them.  

HDFS supports the hierarchical directory structure of a traditional file system such as folders and files.  The directory structure is managed by a special node called the NameNode. Data is stored in **blocks** that are spread across nodes.  By default, the size of a block is 128MB.  Whereas traditional file systems store a file on a single hard drive and try to keep the file blocks contiguous on disk, the contents of a file larger than 128MB in HDFS will be split into multiple data blocks, each stored in their own separate host operating system file but still appearing to the HDFS user as if it were a single file.  Hadoop manages the illusion that it is a single file so the user doesn't need to be concerned with it.  The service that runs on a node and manages data on that node is called **DataNode**.

This way HDFS can store files that are bigger than a single disk drive could otherwise hold, and blocks can be replicated so that if a disk fails the data is not lost.  By default, HDFS makes 3 copies of each data block and places two of the copies on other nodes.  The number of copies is known as the replication factor.
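
The block and replication arithmetic can be worked through with a few lines of Python. The 128 MB block size and replication factor of 3 are the defaults stated above; the 1000 MB file size and the helper names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # default block size stated above
REPLICATION_FACTOR = 3   # default replication factor stated above

def block_count(file_size_mb):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def replica_count(file_size_mb):
    """Total number of block copies stored across the cluster."""
    return block_count(file_size_mb) * REPLICATION_FACTOR

def raw_storage_mb(file_size_mb):
    """Disk space consumed once every block has been replicated."""
    return file_size_mb * REPLICATION_FACTOR

file_size_mb = 1000  # hypothetical file
print(block_count(file_size_mb))     # 8 blocks: 7 full and 1 partial
print(replica_count(file_size_mb))   # 24 block copies in total
print(raw_storage_mb(file_size_mb))  # 3000 MB of raw disk space
```

The takeaway is that a replication factor of 3 triples the raw storage a cluster needs, which is the price paid for fault tolerance.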

Consider the following diagram that shows the relationship between NameNodes, DataNodes and data blocks with a replication factor of 2.  Notice that there are 2 copies of each block spread across the DataNodes.  Also notice that the contents of the files are split into multiple data blocks.  For example, consider the file ‘/usr/myfiles/report.csv’.  This file is split into two blocks, indicated by block 1 and block 2.  Since the replication factor is set to 2, we have two copies of block 1 and block 2.  One copy of block 1 is stored on DataNode 1 while the second copy is stored on DataNode 3.  Similarly, there are two copies of block 2, one stored on DataNode 1 and one on DataNode 2.

![HDFSClusterDiagram.png](attachment:HDFSClusterDiagram.png)
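
The placement described for `report.csv` can be written down as a small lookup table, similar in spirit to the block map a NameNode keeps. The sketch below checks what happens when nodes fail; the dictionary layout and both helper functions are hypothetical and are not part of any Hadoop API.

```python
# Block placement from the example: block -> DataNodes holding a replica
placement = {
    "report.csv/block1": {"DataNode1", "DataNode3"},
    "report.csv/block2": {"DataNode1", "DataNode2"},
}

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return {b for b, nodes in placement.items() if nodes - failed_nodes}

def under_replicated(placement, failed_nodes, replication_factor=2):
    """Blocks whose number of live replicas fell below the target."""
    return {
        b for b, nodes in placement.items()
        if len(nodes - failed_nodes) < replication_factor
    }

# Losing DataNode1 leaves both blocks readable but under-replicated
print(readable_blocks(placement, {"DataNode1"}))
print(under_replicated(placement, {"DataNode1"}))
```

Losing a single node never makes the file unreadable in this layout, but losing both DataNode 1 and DataNode 3 would take away every copy of block 1.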

### HDFS Data Integrity
Data corruption can occur in any system when data is written to disk or during transit within the network, due to hardware failure or transmission errors.  HDFS detects data corruption when reading data stored on disk or receiving data from the network by verifying the data’s **checksum**.  A checksum is a small piece of additional information computed from the stored or transmitted data for the purpose of helping detect any integrity issues with the data.  When a DataNode receives data, it verifies the data’s checksum before writing the data to a data block on HDFS.  Similarly, when a data block is read from HDFS, the checksum is verified.  The DataNode keeps a log of the time when each data block was last verified.  Periodically, HDFS scans the logs to see if any of the data blocks have failed their checksum verification or if any of the disks holding copies fail to respond.  A failed checksum verification means that the data block is corrupted, and HDFS will replace the failed data block with one of the good copies from another DataNode.  Thus, HDFS automatically maintains the integrity of the data.
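
Checksum verification can be illustrated with Python's `zlib.crc32`. The sketch computes one checksum per fixed-size chunk at write time and recomputes them at read time; the 512-byte chunk size, the helper names and the test data are assumptions for illustration and do not reproduce HDFS's internal implementation.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (assumed for this sketch)

def chunk_checksums(data: bytes):
    """Compute one CRC32 checksum per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def corrupted_chunks(data: bytes, expected):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

block = bytes(range(256)) * 8       # 2048 bytes of sample data
stored = chunk_checksums(block)     # computed when the block is written

damaged = bytearray(block)
damaged[700] ^= 0x01                # flip a single bit inside chunk 1

print(corrupted_chunks(block, stored))           # no mismatches
print(corrupted_chunks(bytes(damaged), stored))  # chunk 1 is reported
```

A mismatch only reveals that a chunk is damaged; repairing it relies on fetching a good replica from another DataNode, as described above.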

### HDFS Compression
Data stored on HDFS can be compressed to save space and make the most of the available storage.  Compressing the data also means that it takes less bandwidth to transfer data across the cluster of nodes.  Hadoop supports several compression formats such as Gzip, bzip2, DEFLATE and LZO.  The compression mechanism is transparent to clients, which do not need to know that the data is compressed on HDFS.
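
The space savings can be seen with Python's standard `gzip` and `bz2` modules, which implement two of the formats named above. This is a local sketch on made-up, highly repetitive data; real ratios depend heavily on the data, and nothing here touches HDFS itself.

```python
import bz2
import gzip

# Highly repetitive sample data, so it compresses very well
raw = b"Baseball Football Golf\n" * 10_000

gz = gzip.compress(raw)
bz = bz2.compress(raw)

sizes = {"raw": len(raw), "gzip": len(gz), "bzip2": len(bz)}
print(sizes)

# Decompression restores the original bytes exactly
restored = gzip.decompress(gz)
```

Both formats are lossless, so the choice between them is a trade-off between compression ratio and the CPU time spent compressing and decompressing.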

### HDFS High Availability
HDFS achieves high availability by configuring the cluster in an **Active-Passive configuration**.  In this configuration, each active NameNode is paired with a standby NameNode that is not active (passive).  The passive NameNode acts as a backup that is ready to take over if the active NameNode fails due to network connectivity issues or if it shuts down unexpectedly.  Clients only connect to the active NameNode.  If the active NameNode fails, HDFS will automatically fail over to the passive NameNode.  The failover process takes about a minute to complete.  

### Hadoop File Formats
Hadoop HDFS allows you to store data in many different file formats such as JSON, XML or CSV.  However, these human readable file formats may not be optimal for storing big data in a distributed file system.  There are three other file formats that have been developed to efficiently store data in a Hadoop cluster.  These file formats are **Optimized Row Columnar (ORC)**, **Avro** and **Parquet**, and provide the following optimizations over traditional file formats:

* **Compression** – the file formats are machine readable binary formats that provide some level of compression.  Compression is especially important if you are processing terabytes or petabytes of data.  
* **Splitability** – the file formats can easily be split across multiple partitions while file formats like raw JSON or XML cannot be split without some level of transformation.  Large-scale processing requires the data to be broken down so that independent jobs can process smaller pieces of the data in parallel. 
* **Interoperability** – all three file formats store metadata information with the data.   Since the data is self-described, any node in the cluster will know how to process an ORC, Parquet, or Avro file.
* **Schema Evolution** – Over the lifespan of an application the underlying data may change.  For example, new fields or data types may be introduced or removed.  The file format must cope with fields being added to or deleted from the data schema.

#### Optimized Row Columnar (ORC)
ORC was developed by Hortonworks as a file format for Hive.  ORC is a column-oriented format that is optimized for read-intensive workloads such as data warehouse applications.  Column-oriented formats provide efficient compressed storage that can be easily partitioned across a cluster of nodes.  Organizations such as Facebook use ORC to store petabytes of data (Facebook, 2014): https://code.fb.com/core-data/scaling-the-facebook-data-warehouse-to-300-pb/
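
To see why a column-oriented layout suits read-intensive queries, consider the following toy comparison.  It is a sketch using plain Python lists, not the ORC format itself, and the sample rows are invented for illustration.

```python
rows = [
    {"user": "ann", "country": "CA", "clicks": 3},
    {"user": "bob", "country": "CA", "clicks": 5},
    {"user": "cy", "country": "US", "clicks": 2},
]

# Row-oriented layout: the values of one record are stored together.
row_layout = [list(r.values()) for r in rows]

# Column-oriented layout: the values of one column are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}


def sum_clicks_row(layout):
    """Sum the clicks column; every record is read in full."""
    total, touched = 0, 0
    for record in layout:
        touched += len(record)
        total += record[2]          # clicks is the third field
    return total, touched


def sum_clicks_column(layout):
    """Sum the clicks column; only that column is read."""
    column = layout["clicks"]
    return sum(column), len(column)


print(sum_clicks_row(row_layout))        # (total, values touched)
print(sum_clicks_column(column_layout))
```

Both functions return the same total, but the columnar version touches only the values of the one column the query needs.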

#### Parquet
Parquet is another column-oriented file format that was developed by Cloudera and Twitter.  Similar to ORC, the file format is optimized for read-intensive workloads and is commonly used with Impala (one of the query engines used by analysts and data scientists to perform data analytics on data stored in Hadoop).  

#### Avro
Avro is a row-oriented file format that was developed by the Hadoop working group in 2009.  The schema is stored in JSON format while the data is stored in binary format.  Since the data is stored in binary format, it makes the data compact and fast to process.  The row-oriented file format makes it optimal for write-intensive workloads since new data can simply be appended as an additional row to the end of the datastore.  One of the key features of Avro is its robust support for schema evolution so that old programs can process new schema changes while new programs can process data based on an old schema.
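
Schema evolution can be sketched in a few lines.  The code below does not use the Avro library; `resolve` and `REQUIRED` are hypothetical names that imitate the general idea of reading a record against the reader's schema: fields the reader does not know are dropped, and fields the writer did not provide are filled from the reader's defaults.

```python
REQUIRED = object()   # marker for a reader field that has no default


def resolve(record, reader_fields):
    """Project a decoded record onto the reader's schema.

    reader_fields maps field name -> default value, or REQUIRED.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not REQUIRED:
            resolved[name] = default
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved


old_reader = {"id": REQUIRED, "name": REQUIRED}
new_reader = {"id": REQUIRED, "name": REQUIRED, "email": None}

old_record = {"id": 1, "name": "ann"}                       # written before 'email' existed
new_record = {"id": 2, "name": "bob", "email": "b@x.org"}   # written with the new schema

new_reads_old = resolve(old_record, new_reader)   # 'email' filled from the default
old_reads_new = resolve(new_record, old_reader)   # 'email' ignored
print(new_reads_old, old_reads_new)
```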

The following table compares the big data file formats and ranks their support for certain features (Saurabh, S., 2018). 

| Features | Avro | Parquet | ORC |
|----------|----------|----------|----------|
| Schema Evolution Support | 1st | 3rd | 2nd |
| Compression Support | 2nd | 3rd | 1st |
| Splitability Support | 2nd | 2nd | 1st |
| Most Compatible Platforms | Kafka, Druid | Impala, Arrow, Drill, Spark | Hive, Presto |
| Row or Column Oriented | Row | Column | Column |
| Optimized for Read or Write | Write | Read | Read |

### HDFS Datastore abstractions
Applications that work directly with HDFS need to be concerned with where the data is stored, how it is stored and what format the data is stored in.  For instance, the application must determine how to partition the data across the nodes and must know whether the data is column-oriented or row-oriented.  In order to reduce the complexity for programmers, Hadoop provides two components that make Hadoop easier to work with: 
* **HBase** – A column-oriented NoSQL data store
* **HCatalog** – Metadata and table management system

#### HBase
HBase is a column-oriented NoSQL datastore developed for Hadoop that runs on top of HDFS.  HBase provides an abstraction layer that allows applications to access and store data in a NoSQL datastore instead of directly storing data in flat files on the file system.

HBase is better suited for real-time read/write applications such as operational applications, while HDFS alone is really only appropriate for batch processing.  Consider a monitoring system that consumes large streams of log data from systems throughout an organization.  Operational staff require real-time analytics to gauge the health of the systems so that they can respond quickly to system failures.  In this scenario, HBase could be a good choice for the datastore to process and analyze system metrics in real-time.  Since HBase is column-oriented, executing queries will only read the data for the columns actually required by the query.  This reduces disk I/O and network traffic. 

Although HBase uses HDFS, it does not use MapReduce.  For the right applications it can therefore provide response times in milliseconds rather than seconds or minutes.

Being a column-oriented NoSQL datastore built on HDFS, HBase inherits the following benefits:
* Fault Tolerance 
  * Data is replicated across the system
  * Automatic fail over
* Fast
  * Near real-time lookups
  * In-memory caching
* Flexible Schema
  * Can store structured, unstructured and semi-structured data
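
The way applications address data in a store like HBase can be sketched with a toy in-memory table.  `ToyTable` is a hypothetical class, not the HBase client API; it only models the idea of cells addressed by row key and `family:qualifier` column, with timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """In-memory imitation of a row-key / column / version store."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the most recent value of one cell, or None."""
        versions = self._rows.get(row_key, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, prefix):
        """Return the row keys starting with prefix, in sorted order."""
        return [key for key in sorted(self._rows) if key.startswith(prefix)]


table = ToyTable()
table.put("web01", "metrics:cpu", 0.42, timestamp=1)
table.put("web01", "metrics:cpu", 0.97, timestamp=2)
table.put("web02", "metrics:cpu", 0.10, timestamp=1)
table.put("db01", "metrics:cpu", 0.55, timestamp=1)

latest = table.get("web01", "metrics:cpu")
web_rows = table.scan("web")
print(latest, web_rows)
```

A lookup names the row and column it wants, which is what allows a monitoring application like the one described above to fetch a single metric without reading whole files.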

#### HCatalog
HCatalog provides a consistent schema and data type mechanism that can be shared across various Hadoop tools and frameworks.  In particular, HCatalog provides a relational view of HDFS so that tools can query the data without being concerned about the file format (e.g. ORC, JSON, Parquet, etc.) or where the data is stored within the cluster.  The following diagram illustrates how HCatalog hides the details of the file format from tools and programming models such as Pig, Hive and MapReduce.

![HCatalogDiagram.png](attachment:HCatalogDiagram.png)

Since HCatalog provides a relational view of data in HDFS, data is modelled as tables where records are rows and rows are made up of columns.  It was developed to provide Hive and other relational query tools with a standardized way of representing schemas. Its Data Definition Language (DDL) is based on Hive’s.  Tools can use the DDL to create, alter, or drop tables in the same way SQL is used to create, alter and drop tables in a relational database.  For more information on HCatalog's DDL refer to (https://cwiki.apache.org/confluence/display/Hive/HCatalog+CLI#HCatalogCLI-HCatalogDDL).
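
The role of a shared table catalog can be sketched as a simple registry.  This is not HCatalog's API: `TableInfo`, `create_table` and `column_names` are hypothetical names, and the table name and paths are made up.  The point is only that tools look a table up by name and see the same columns regardless of how or where the files are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    location: str      # hypothetical HDFS directory
    file_format: str   # e.g. "ORC" or "PARQUET"
    columns: tuple     # (column name, type name) pairs


catalog = {}  # table name -> TableInfo


def create_table(name, location, file_format, columns):
    catalog[name] = TableInfo(location, file_format, tuple(columns))


def column_names(name):
    """What a query tool needs: the table's columns, regardless of storage."""
    return [column for column, _ in catalog[name].columns]


create_table("web_logs", "/data/web_logs", "ORC",
             [("ts", "bigint"), ("url", "string")])
before = column_names("web_logs")

# Re-register the same table with a different storage format and location.
create_table("web_logs", "/data/v2/web_logs", "PARQUET",
             [("ts", "bigint"), ("url", "string")])
after = column_names("web_logs")
print(before == after)
```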

## Hadoop Management Systems
Given that Hadoop can be a large, expensive, complex distributed system, organizations want to make sure the system is functioning optimally, especially if the system is made up of thousands of nodes.  Furthermore, organizations are also concerned with security and confidentiality.  The Hadoop ecosystem provides several open source platforms to manage and provision the Hadoop distributed system.  
* **Ambari** – is a management platform that provisions, manages, monitors, and secures Hadoop clusters.  For more information refer to https://hortonworks.com/apache/ambari/.
* **Azkaban** – is a workflow manager to define and run Hadoop jobs.   Organizations make use of Azkaban’s web UI to visualize and define the application workflow.  Organizations can also use Azkaban to schedule the workflow to run on the Hadoop cluster.  Azkaban provides monitoring capabilities that will notify when the application succeeds or fails.  For more information refer to https://azkaban.readthedocs.io.
* **Falcon** – is a data governance tool that allows users to define, schedule, and monitor data management policies.  Organizations use Falcon to define data pipelines to consume and process large amounts of data into the Hadoop system.  The engine and supporting frameworks allow organizations to easily define, audit and monitor these pipelines.   For more information refer to https://hortonworks.com/apache/falcon/.
* **Sentry** – is a system that enforces fine-grained security policies on the data in a Hadoop cluster.  Sentry provides role-based access controls that can be applied at various system scopes.  For example, access control permissions can be defined at the system, database, table or even at the operation level (e.g. insert, select, etc.).  For more information refer to https://sentry.apache.org/.
* **Ranger** – is another security system that provides similar capabilities to those of Sentry.  For more information refer to https://hortonworks.com/apache/ranger/.
* **ZooKeeper** – is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.  Consider a complex distributed application that runs across thousands of nodes and processes terabytes of data.  Suppose further that this application must run continuously without downtime because it provides a service on the Internet.  If a bug is discovered in the system or a configuration change is needed, how does one propagate this change across the cluster in a reliable and safe manner?  ZooKeeper addresses this concern by allowing organizations to push configuration changes or bug fixes across the cluster in a coordinated fashion.  For more information refer to https://zookeeper.apache.org/.  
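
The configuration-propagation scenario described for ZooKeeper can be sketched with a toy in-memory store.  `ToyConfigStore` is a hypothetical class and is not the ZooKeeper API; the path `/app/config/batch_size` and the node names are made up.  It only shows the pattern of nodes registering interest in a value and being notified when it changes.

```python
class ToyConfigStore:
    """In-memory stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def watch(self, path, callback):
        # Register a callback to be invoked whenever 'path' is updated.
        self._watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        self._data[path] = value
        for callback in self._watchers.get(path, []):
            callback(path, value)


store = ToyConfigStore()
seen = {}  # node name -> last configuration value that node received

for node in ("node-1", "node-2", "node-3"):
    store.watch(
        "/app/config/batch_size",
        lambda path, value, node=node: seen.__setitem__(node, value),
    )

store.set("/app/config/batch_size", 500)   # one write reaches every watcher
print(seen)
```

The sketch runs in a single process and does not model sessions, ordering guarantees or replication across servers, which are the parts a real coordination service exists to provide.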

## Getting data into and out of Hadoop
One of the major challenges an organization faces is moving large amounts of data into and out of a system.  Organizations are concerned with the cost and the volume of data at scale.  Fast and reliable ingestion (importing) and exporting of data is often critical.  There are two commonly used projects that focus on ingesting data into Hadoop.
> **Flume** – provides a highly scalable and reliable system that specializes in moving large amounts of streaming or batch data into HDFS or HBase.  Data can come from different data sources such as log data, email messages, social media data, etc.  Flume also provides Extract, Transform, and Load (ETL) capabilities when processing the data.  Since reliability is resource intensive, Flume supports different levels of reliability in favor of speeding up the ingestion rate.  For more information refer to https://flume.apache.org/.

> **Kafka** – a general purpose publish-subscribe messaging platform that is also highly scalable and reliable.  Kafka achieves this by providing high throughput, built-in partitioning, replication and fault tolerance.  In Hadoop deployments it is used mainly for message ingestion and stream processing.  Messages are a small data abstraction that can be thought of as entries in a time-sorted file such as a log file.  At a high level, the publish-subscribe messaging architecture can be illustrated in the following diagram.  Kafka is newer and has largely superseded Flume.
![KafkaArchitecture.png](attachment:KafkaArchitecture.png)
> Messages are published by producers, categorized into topics and pulled for processing by consumers.  Consumers subscribe to topics they want to read messages from and then write the data to HDFS.  Notice that this architecture allows for parallel message processing by fragmenting the processing across multiple partitions.  This allows Kafka to run on a distributed cluster which provides high throughput, partitioning, reliability and fault tolerance.  For more information refer to https://kafka.apache.org/.
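
The producer, topic, partition and consumer roles can be sketched with a toy model.  `ToyTopic` and `ToyConsumer` are hypothetical classes, not the Kafka client API, and the CRC32 hash is simply a stand-in for whatever partitioning function a real deployment uses.

```python
from zlib import crc32


class ToyTopic:
    def __init__(self, num_partitions):
        # Each partition is an append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        """Append to the partition chosen by hashing the key.

        Returns (partition index, offset of the new message).
        """
        index = crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append(message)
        return index, len(self.partitions[index]) - 1


class ToyConsumer:
    def __init__(self, topic):
        self.topic = topic
        # Next offset to read, tracked separately for each partition.
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition):
        """Pull every message not yet read from one partition."""
        log = self.topic.partitions[partition]
        batch = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)
        return batch


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.publish("web01", "disk 81% full")
p2, o2 = topic.publish("web01", "disk 93% full")

consumer = ToyConsumer(topic)
first = consumer.poll(p1)    # both messages, in publish order
second = consumer.poll(p1)   # nothing new since the last poll
print(p1 == p2, first, second)
```

Messages with the same key land in the same partition and keep their order, while different partitions can be read by different consumers in parallel.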

In contrast, another project, **Sqoop** (pronounced "scoop"; the q is supposed to look like an ice cream scoop), focuses on both importing data into and exporting data out of Hadoop.
> **Sqoop** – provides the ability to import and export bulk structured (usually tabular) data between Hadoop and relational databases.  Sqoop makes use of the Hadoop cluster to create map jobs to import data in a scalable and efficient manner.  Importing data requires two steps:
1. **Gather Metadata** - Sqoop reads the database and HDFS metadata (schemas) that it needs to plan the transfer. 
2. **Submit Map-Only job** - Sqoop submits a map-only Hadoop job to the cluster, and the map tasks do the actual transfer between the database and HDFS as shown below.
![sqoopimport.jpg](attachment:sqoopimport.jpg)
Source: “Apache Sqoop - Overview” by Apache Sqoop, The Apache Software Foundation Blogging in Action., October 2011

> When Sqoop is asked to export the data, Sqoop generates map jobs to read the data in the same manner, by using the metadata for the files being read and for the database tables being copied to.

> ![sqoopexport.jpg](attachment:sqoopexport.jpg)
Source: “Apache Sqoop - Overview” by Apache Sqoop, The Apache Software Foundation Blogging in Action., October 2011

> For more information on Sqoop refer to https://sqoop.apache.org/.
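
One way to picture how the map tasks share an import is to divide the table's key range into contiguous chunks, one per mapper.  The sketch below is an illustration under that assumption; `split_ranges` is a hypothetical helper, the `orders` table and `id` column are made up, and the exact boundaries Sqoop computes may differ.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)   # spread the remainder evenly
        if size == 0:
            continue                            # more mappers than keys
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each map task would read only its own slice of the (hypothetical) table.
queries = [
    f"SELECT * FROM orders WHERE id BETWEEN {low} AND {high}"
    for low, high in split_ranges(1, 10, 4)
]
for query in queries:
    print(query)
```

Because the slices do not overlap, the map tasks can run independently and in parallel, each writing its own output to HDFS.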

## Other Hadoop Projects
We have only discussed the more commonly used projects within the Hadoop ecosystem.  Many other projects add further capabilities and contribute to the richness of the ecosystem.  You are encouraged to explore the web to read up on other Hadoop projects. 

## References

Apache Drill. (Public Domain). Architecture Introduction. Retrieved from https://drill.apache.org/docs/architecture-introduction/

Apache Flume. (Public Domain). Flume 1.9.0 User Guide. Retrieved from https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

Apache Hadoop. (Public Domain). HDFS Architecture. Retrieved from https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Apache Hadoop.  (Public Domain). HDFS Federation. Retrieved from   https://hadoop.apache.org/docs/r3.0.3/hadoop-project-dist/hadoop-hdfs/Federation.html

Hortonworks. (Public Domain). An Introduction to HDFS Federation. Retrieved from https://hortonworks.com/blog/an-introduction-to-hdfs-federation/

Apache Hadoop.  (Public Domain). MapReduce Tutorial. Retrieved from  https://hadoop.apache.org/docs/r3.0.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Apache Hive. (Public Domain). Hive Design. Retrieved from https://cwiki.apache.org/confluence/display/Hive/Design

Azkaban. (Public Domain). Azkaban documentation!. Retrieved from https://azkaban.readthedocs.io/en/latest/

BMC Software. (Public Domain). Hadoop Clusters : An Introduction.  Hadoop Guide. Retrieved from  http://www.bmcsoftware.ca/guides/hadoop-clusters.html

DeZyre. (Public Domain). Difference between Pig and Hive-The Two Key Components of Hadoop Ecosystem. Retrieved from https://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosystem/79

Facebook. (Public Domain). Scaling the Facebook data warehouse to 300 PB.  Retrieved from https://code.fb.com/core-data/scaling-the-facebook-data-warehouse-to-300-pb/

Hortonworks. (Public Domain). Ambari Overview. Retrieved from https://hortonworks.com/apache/ambari/

Hortonworks. (Public Domain). Apache Hadoop YARN – Background and an Overview.  Retrieved from https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
  
Hortonworks. (Public Domain). Apache Ranger. Retrieved from https://hortonworks.com/apache/ranger/

Hortonworks.  (Public Domain). Introduction to Apache Falcon: Data Governance for Hadoop.  Retrieved from  https://hortonworks.com/blog/introduction-apache-falcon-hadoop/

IBM Big Data & Analytics Hub. (Public Domain).  What is Hadoop? Retrieved from https://www.ibmbigdatahub.com/blog/what-hadoop

Kafka. (Public Domain). Introduction.  Retrieved from https://kafka.apache.org/intro

Mahout. (Public Domain). Mahout.  Retrieved from https://mahout.apache.org/docs/latest/

Miner, D., Radtka, Z. M. (2015). Chapter 1. Hadoop Distributed File System (HDFS), Hadoop with Python.  Boston, MA: O'Reilly Media, Inc.

Saurabh, S. (2018). An Introduction to Big Data Formats: Understanding Avro, Parquet, and ORC. Nexla.com. Retrieved from https://thinksis.com/wp-content/uploads/2018/10/Nexla_Whitepaper_Introduction-to-Big-Data-Formats-Saket-Saurabh.pdf

The Apache Software Foundation Blogging in Action. (Public Domain). Apache Sqoop - Overview. Retrieved from https://blogs.apache.org/sqoop/entry/apache_sqoop_overview

The Apache Software Foundation Blogging in Action. (Public Domain). Apache Sentry. Retrieved from https://blogs.apache.org/sentry/entry/sentry_graduates_to_a_top

Wikipedia. (Public Domain). Apache Hadoop. Retrieved from https://en.wikipedia.org/wiki/Apache_Hadoop

Wikipedia. (Public Domain). Map Reduce. Retrieved from https://en.wikipedia.org/wiki/MapReduce

Wikipedia. (Public Domain). Apache Mahout. Retrieved from https://en.wikipedia.org/wiki/Apache_Mahout

White, T. (2010). Chapter 4. Hadoop I/O, Hadoop: The Definitive Guide, 2nd Edition.  Boston, MA: O'Reilly Media, Inc.

Zookeeper. (Public Domain). Apache Zookeeper. Retrieved from https://zookeeper.apache.org/