# HDFS (Hadoop Distributed File System)

## What is HDFS?

HDFS, or the Hadoop Distributed File System, is the primary storage system of Hadoop, which is an open-source framework for processing and storing large datasets in a distributed computing environment. HDFS is designed to scale up from a single server to thousands of machines, with each offering local computation and storage.

**Key features**:
1. **Block-based Structure**: Data in HDFS is stored in blocks (128 MB by default, with 256 MB a common alternative), and these blocks are distributed across the cluster. Each block is replicated multiple times (three by default) to handle hardware failure.

2. **Fault Tolerance**: Due to its block replication mechanism, HDFS is fault-tolerant. If a block or node fails, data can be recovered from another node where the block is replicated.

3. **Write-once, Read-many Model**: HDFS is primarily designed for large data sets and supports a write-once and read-many times paradigm. This model simplifies data coherency issues.

4. **High Throughput & Scalability**: HDFS is designed to provide high throughput for data access and can easily scale out by adding more machines to the cluster.

5. **Data Locality**: One of the primary objectives of HDFS is to store data on the compute nodes so that processing tasks can run on nodes where data is locally stored. This minimizes the data transfer across the cluster and increases processing speed.

6. **Simple Coherency Model**: Once written, data/files can't be modified, only appended to. This eliminates potential issues that can arise from multiple sources trying to update a file simultaneously.

7. **Large Data Sets**: HDFS is designed to handle very large files, making it suitable for big data processing tasks.

8. **Streaming Data Access**: HDFS is optimized for streaming access of its datasets, meaning it's best suited for applications that require sequential access, rather than random access.

9. **Integration with Hadoop Ecosystem**: HDFS is deeply integrated with various components of the Hadoop ecosystem like MapReduce, YARN, Hive, Pig, and others. This allows for efficient processing and management of big data.

To sum it up, HDFS is a distributed and scalable file system that is a fundamental component of the Hadoop ecosystem and is designed specifically for storing and processing massive datasets.
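
The block and replication arithmetic described above can be sketched in a few lines. This is only an illustration, assuming a 128 MB block size and a replication factor of three; `split_into_blocks` and `raw_storage` are hypothetical helper names, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file of file_size would occupy."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


one_gib = 1024 ** 3
print(len(split_into_blocks(one_gib)))    # number of blocks for a 1 GiB file
print(raw_storage(one_gib) // one_gib)    # raw GiB consumed with replication
```

A file that is not a multiple of the block size ends with a smaller final block, which is why the helper returns a list of sizes rather than a single count.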

## HDFS architecture

<img src="images/hdfs_architecture.png" title="HDFS Architecture" width="700px"/>

The HDFS architecture is designed with a master-slave topology, where data is broken into blocks and distributed across multiple nodes in a cluster. Let's dive into the primary components and their roles:

**1. NameNode (Master Server)**:
   - **Role**: Manages and maintains the metadata of HDFS.
   - **Function**: Does not store the actual data but maintains the file system tree and the metadata for all the files and directories in the system. This metadata is kept in RAM for fast access and persisted on disk as a filesystem image (fsimage) plus an edit log.
   - **Responsibilities**: 
     - Managing the file system namespace.
     - Regulating client access to files.
     - Executing file system operations such as renaming, closing, and opening files and directories.
     - Keeping track of the block mapping to DataNodes.

**2. Secondary NameNode**:
   - **Role**: Performs housekeeping functions for the NameNode.
   - **Function**: It periodically merges the changes (edits) with the filesystem image (fsimage) and produces an updated version of fsimage. This helps in preventing the edit log on the NameNode from becoming excessively large.
   - **Note**: Many modern Hadoop deployments use the HA (High Availability) architecture, replacing the traditional Secondary NameNode with a Standby NameNode.

**3. DataNode (Slave Server)**:
   - **Role**: Stores and manages the actual data blocks of HDFS.
   - **Function**: 
     - Stores data in the local file system (like ext4, xfs).
     - Creates, deletes, and replicates blocks based on instructions from the NameNode.
     - Periodically sends a heartbeat to the NameNode to signal it's alive (every 3 seconds by default). Separately, and far less often, it sends a block report, which lists all the blocks stored on the DataNode.
   
**4. Block**:
   - **Role**: The fundamental storage unit of HDFS.
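   - **Function**: Each file is divided into blocks of a fixed, configurable size (128 MB by default; 256 MB is a common choice for large files). The last block of a file can be smaller, since it only holds the remaining bytes. These blocks are distributed across the cluster, and multiple copies (replicas) of each block are maintained to ensure fault tolerance.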
   - **Note**: While not a component in the traditional sense like NameNode or DataNode, blocks are a foundational concept in HDFS architecture.

**5. Client**:
   - **Role**: Interacts with HDFS.
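   - **Function**: When an HDFS client wants to read a file, it asks the NameNode for the block locations and then reads the blocks directly from the respective DataNodes. For a write, it asks the NameNode which DataNodes should receive each block and then sends the data to those DataNodes. File data never flows through the NameNode.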

**6. Cluster**:
   - **Role**: A collection of nodes.
   - **Function**: A cluster typically consists of a single NameNode (or two in an HA setup) and multiple DataNodes. The client applications can be run on an external machine or on nodes within the cluster.
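
The division of labour between NameNode, DataNodes and client can be illustrated with a toy in-memory model. This is a sketch for intuition only: the `NameNode` and `DataNode` classes and the `read_file` function below are hypothetical stand-ins, not the real Hadoop classes, and the block ids and node names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict[str, bytes] = field(default_factory=dict)  # block id -> bytes


@dataclass
class NameNode:
    # file path -> ordered list of block ids (metadata only, no file data)
    namespace: dict[str, list[str]] = field(default_factory=dict)
    # block id -> names of the DataNodes holding a replica
    locations: dict[str, list[str]] = field(default_factory=dict)

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        return [(b, self.locations[b]) for b in self.namespace[path]]


def read_file(path: str, namenode: NameNode, datanodes: dict[str, DataNode]) -> bytes:
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        for holder in holders:  # try replicas in the order the NameNode listed them
            node = datanodes.get(holder)
            if node is not None and block_id in node.blocks:
                data += node.blocks[block_id]
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


datanodes = {
    "dn1": DataNode("dn1", {"blk_1": b"hello "}),
    "dn2": DataNode("dn2", {"blk_1": b"hello ", "blk_2": b"world"}),
    "dn3": DataNode("dn3", {"blk_2": b"world"}),
}
namenode = NameNode(
    namespace={"/user/hadoop/file1.txt": ["blk_1", "blk_2"]},
    locations={"blk_1": ["dn1", "dn2"], "blk_2": ["dn3", "dn2"]},
)
print(read_file("/user/hadoop/file1.txt", namenode, datanodes))
```

Removing `dn1` from `datanodes` still lets the read succeed, because the client falls back to the replica on `dn2`.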

In addition to these core components, the HDFS architecture has built-in features to handle failures and ensure high data availability:

- **Replication**: Each block is replicated multiple times (default is three) across different DataNodes to ensure fault tolerance. If a block (or DataNode) fails, data can be read from another replica.

- **Fault Tolerance**: HDFS is designed to detect failures and automatically recover from them. If a DataNode fails, the system ensures that the replication factor of all blocks stored on that node is maintained by creating new replicas on other nodes.

- **High Availability (HA)**: In HA-enabled HDFS architectures, there are two NameNodes: Active and Standby (Hadoop 3 also allows additional Standby NameNodes). Only the Active NameNode serves client requests, while the Standby keeps its copy of the file system metadata in sync. If the Active NameNode fails, the Standby takes over its duties to ensure high availability.

This architecture enables HDFS to provide a robust and scalable storage solution suitable for storing vast amounts of data and serving large-scale data processing tasks.
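
To make the recovery behaviour concrete, here is a minimal sketch of how under-replicated blocks could be detected and scheduled for copying after a DataNode failure. It is a simplification under stated assumptions: `plan_re_replication` is a hypothetical function, targets are picked alphabetically, and real HDFS block placement also considers racks and node load.

```python
def plan_re_replication(block_map, live_nodes, replication=3):
    """Return a list of (block_id, source_node, target_node) copies to schedule.

    block_map: block id -> set of DataNode names that held a replica.
    live_nodes: set of DataNode names that are still sending heartbeats.
    """
    plan = []
    for block_id in sorted(block_map):
        holders = block_map[block_id] & live_nodes  # replicas that survived
        if not holders:
            continue  # every replica is gone; nothing left to copy from
        missing = max(replication - len(holders), 0)
        candidates = sorted(live_nodes - holders)   # nodes without this block
        source = min(holders)
        for target in candidates[:missing]:
            plan.append((block_id, source, target))
    return plan


block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn2", "dn4", "dn5"}  # dn3 has stopped sending heartbeats
for step in plan_re_replication(block_map, live_nodes):
    print(step)
```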

## HDFS High Availability architecture (the one used in Stratio)

<img src="images/hdfs-HA-architecture.png" title="HDFS HA Architecture" width="700px"/>

In addition to the components described above, HA architectures (like the one used in Stratio) include some extra components worth mentioning:

**1. Standby NameNode**: Replaces the traditional Secondary NameNode. It performs the same checkpointing functions as the Secondary NameNode while also acting as a backup to the Active NameNode, ready to take over without data loss or significant downtime if the Active NameNode fails.

**2. JournalNode**:

- **Role**:
JournalNodes facilitate the High Availability feature of HDFS by providing a way to synchronize metadata changes between the Active and Standby NameNodes.

- **Function**:
    - JournalNodes maintain a log (or journal) of metadata changes made by the Active NameNode. When the Active NameNode makes any change to the metadata, it records this change in its local logs and also writes it to a majority of the configured JournalNodes.

    - The Standby NameNode is continuously watching these JournalNodes and reading the metadata updates. Once it reads these updates, the Standby NameNode applies them to its own namespace, thereby ensuring that both NameNodes remain synchronized.

- **Responsibilities**:
    1. **Storing Metadata Updates**: JournalNodes receive and store updates from the Active NameNode. These updates include operations like file creation, deletion, renaming, and more.

    2. **Synchronization**: Facilitate the synchronization of the metadata changes between the Active and Standby NameNodes, ensuring the Standby NameNode can quickly take over in case of a failure.

    3. **Maintaining Write Ahead Logs (WAL)**: Just like databases use Write Ahead Logging for durability and recovery, the JournalNode stores changes in a similar manner. If a NameNode crashes, this log ensures that the state can be fully recovered.

    4. **Handling Failovers**: During a failover event (e.g., if the Active NameNode crashes), the Standby NameNode will ensure it has read all the logs from the JournalNodes before promoting itself to the Active state.

    5. **Responding to NameNode Requests**: JournalNodes cater to read requests from the Standby NameNode and write requests from the Active NameNode. They ensure the logs are available for reading and store new logs reliably.

    6. **Quorum-based Commit**: For a metadata change to be considered committed, it must be written to a majority of JournalNodes (e.g., 2 out of 3, or 3 out of 5). With N JournalNodes the system tolerates up to (N - 1) / 2 failures, rounded down: one unavailable JournalNode in a group of 3, or two in a group of 5, while still functioning and maintaining consistency.
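
The majority rule behind the quorum-based commit is simple arithmetic. The sketch below uses hypothetical helper names (`quorum_size`, `tolerated_failures`, `is_committed`) purely to show the numbers; it does not model the actual JournalNode protocol.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest number of JournalNodes that forms a strict majority."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """How many JournalNodes can be down while a majority is still reachable."""
    return journal_nodes - quorum_size(journal_nodes)


def is_committed(acks: int, journal_nodes: int) -> bool:
    """An edit counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(journal_nodes)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that an even number of JournalNodes adds no extra tolerance (4 nodes still tolerate only one failure), which is why odd counts such as 3 or 5 are the usual choice.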

**3. Apache ZooKeeper**: A distributed coordination service. It helps manage and coordinate the two NameNodes by maintaining configuration information and naming, and by providing distributed synchronization and group services. In essence, it makes sure the system can recover automatically from NameNode failures, ensuring high availability and preventing split-brain scenarios.

## HDFS useful commands

### List Files/Directories
   - Command: `hdfs dfs -ls <path>`
   - Example: `hdfs dfs -ls /user/hadoop/dir1`

### Create a Directory
   - Command: `hdfs dfs -mkdir <path>`
   - Example: `hdfs dfs -mkdir /user/hadoop/newdir`

### Delete a File/Directory
   - Command: `hdfs dfs -rm <path>`
   - Example: `hdfs dfs -rm /user/hadoop/file1.txt`
   - For directories (recursive delete): `hdfs dfs -rm -r /user/hadoop/dir1`

### Upload a File to HDFS
   - Command: `hdfs dfs -put <local-source> <hdfs-destination>`
   - Example: `hdfs dfs -put /localpath/file.txt /user/hadoop/`

### Download a File from HDFS
   - Command: `hdfs dfs -get <hdfs-source> <local-destination>`
   - Example: `hdfs dfs -get /user/hadoop/file.txt /localpath/`

### Display File Content
   - Command: `hdfs dfs -cat <path>`
   - Example: `hdfs dfs -cat /user/hadoop/file1.txt`

### Move File/Directory within HDFS
   - Command: `hdfs dfs -mv <source> <destination>`
   - Example: `hdfs dfs -mv /user/hadoop/file1.txt /user/hadoop/dir1/`

### Copy File/Directory within HDFS
   - Command: `hdfs dfs -cp <source> <destination>`
   - Example: `hdfs dfs -cp /user/hadoop/file1.txt /user/hadoop/dir1/`

### Display Disk Usage Statistics
   - Command: `hdfs dfs -du <path>`
   - Example: `hdfs dfs -du /user/hadoop/`

**Note**: Add the `-h` option for human-readable sizes (for example MB or GB instead of bytes), as in `hdfs dfs -du -h /user/hadoop/`.

### Change Owner of a File/Directory
   - Command: `hdfs dfs -chown <owner>:<group> <path>`
   - Example: `hdfs dfs -chown hadoop:admin /user/hadoop/file1.txt`

### Change Permissions of a File/Directory
   - Command: `hdfs dfs -chmod <mode> <path>`
   - Example: `hdfs dfs -chmod 755 /user/hadoop/file1.txt`
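
For readers unfamiliar with octal modes, the sketch below decodes a three-digit mode such as `755` into the usual `rwx` notation for owner, group and others. `describe_mode` is a hypothetical helper written for this page, not an HDFS utility.

```python
def describe_mode(octal_mode: str) -> str:
    """Translate a three-digit octal mode like '755' into 'rwxr-xr-x'."""
    if len(octal_mode) != 3 or any(c not in "01234567" for c in octal_mode):
        raise ValueError("expected three octal digits, e.g. '755'")
    parts = []
    for digit in octal_mode:  # owner, group, others
        bits = int(digit)
        parts.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(parts)


print(describe_mode("755"))  # owner: rwx, group: r-x, others: r-x
```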

### Get Detailed Information about a File/Directory
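   - Command: `hdfs dfs -stat [format] <path>`
   - Example: `hdfs dfs -stat /user/hadoop/file1.txt`
   - Without a format string, only the modification time is printed. A format such as `"%n %b %o %r"` prints the name, size in bytes, block size, and replication factor.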

### Read the first N lines of a file
   - Command: `hdfs dfs -cat <path> | head -n <N_LINES>`
   - Example: `hdfs dfs -cat /user/hadoop/file1.txt | head -n 10`
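
When these commands need to be scripted, they can be wrapped in a small helper. The following is a minimal sketch, assuming the `hdfs` CLI is installed and on `PATH`; `hdfs_dfs_command` and `run_hdfs_dfs` are hypothetical names chosen for this example, not part of any Hadoop client library.

```python
import shutil
import subprocess


def hdfs_dfs_command(action: str, *args: str) -> list[str]:
    """Build the argument list for `hdfs dfs -<action> <args...>`."""
    return ["hdfs", "dfs", f"-{action}", *args]


def run_hdfs_dfs(action: str, *args: str) -> str:
    """Run the command and return its standard output as text."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("the hdfs CLI was not found on PATH")
    result = subprocess.run(
        hdfs_dfs_command(action, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Building a command does not require a cluster; running it does.
print(hdfs_dfs_command("ls", "/user/hadoop/dir1"))
print(hdfs_dfs_command("rm", "-r", "/user/hadoop/dir1"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.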