
HDFS-8657. Update docs for mSNN. Contributed by Jesse Yates.
atm committed Jul 20, 2015
1 parent e4f7562 commit ed01dc7
Showing 3 changed files with 45 additions and 29 deletions.
2 changes: 2 additions & 0 deletions hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
@@ -341,6 +341,8 @@ Trunk (Unreleased)
HDFS-8627. NPE thrown if unable to fetch token from Namenode
(J.Andreina via vinayakumarb)

+ HDFS-8657. Update docs for mSNN. (Jesse Yates via atm)

Release 2.8.0 - UNRELEASED

NEW FEATURES
@@ -65,18 +65,18 @@ This impacted the total availability of the HDFS cluster in two major ways:
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.

- The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
+ The HDFS High Availability feature addresses the above problems by providing the option of running two (or more, as of Hadoop 3.0.0) redundant NameNodes in the same cluster in an Active/Passive configuration with one or more hot standbys. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

Architecture
------------

- In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an *Active* state, and the other is in a *Standby* state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
+ In a typical HA cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an *Active* state, and the others are in a *Standby* state. The Active NameNode is responsible for all client operations in the cluster, while the Standbys simply act as slaves, maintaining enough state to provide a fast failover if necessary.

- In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (eg an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
+ In order for the Standby nodes to keep their state synchronized with the Active node, the current implementation requires that the nodes have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.

- When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
+ When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby nodes constantly watch this directory for edits, and as they see the edits, they apply them to their own namespaces. In the event of a failover, a Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

- In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
+ In order to provide a fast failover, it is also necessary that the Standby nodes have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of all NameNodes, and send block location information and heartbeats to all the NameNodes.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one *fencing method* for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.
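
As a sketch of what such fencing configuration can look like (assuming the *sshfence* method that ships with Hadoop; the private-key path below is illustrative):

    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/exampleuser/.ssh/id_rsa</value>
    </property>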

@@ -87,9 +87,9 @@ In order to deploy an HA cluster, you should prepare the following:

* **NameNode machines** - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.

- * **Shared storage** - you will need to have a shared directory which both NameNode machines can have read/write access to. Typically this is a remote filer which supports NFS and is mounted on each of the NameNode machines. Currently only a single shared edits directory is supported. Thus, the availability of the system is limited by the availability of this shared edits directory, and therefore in order to remove all single points of failure there needs to be redundancy for the shared edits directory. Specifically, multiple network paths to the storage, and redundancy in the storage itself (disk, network, and power). Beacuse of this, it is recommended that the shared storage server be a high-quality dedicated NAS appliance rather than a simple Linux server.
+ * **Shared storage** - you will need to have a shared directory which the NameNode machines have read/write access to. Typically this is a remote filer which supports NFS and is mounted on each of the NameNode machines. Currently only a single shared edits directory is supported. Thus, the availability of the system is limited by the availability of this shared edits directory, and therefore in order to remove all single points of failure there needs to be redundancy for the shared edits directory. Specifically, multiple network paths to the storage, and redundancy in the storage itself (disk, network, and power). Because of this, it is recommended that the shared storage server be a high-quality dedicated NAS appliance rather than a simple Linux server.
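
As an illustrative sketch only, such a shared directory might be mounted on each NameNode machine via an /etc/fstab entry like the following (the filer hostname, export path, mount point, and mount options here are hypothetical and should be adapted to your environment):

    # hypothetical NFS mount for the shared edits directory
    filer.example.com:/export/hdfs-ha  /mnt/filer1  nfs  rw,hard,tcp  0 0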

- Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
+ Note that, in an HA cluster, the Standby NameNodes also perform checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.

Deployment
----------
@@ -124,17 +124,15 @@ The order in which you set these configurations is unimportant, but the values y

Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
and "nn2" as the individual IDs of the NameNodes, you would configure this as
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1","nn2" and "nn3" as the individual IDs of the NameNodes, you would configure this as
such:

<property>
<name>dfs.ha.namenodes.mycluster</name>
- <value>nn1,nn2</value>
+ <value>nn1,nn2,nn3</value>
</property>

- **Note:** Currently, only a maximum of two NameNodes may be configured per
- nameservice.
+ **Note:** The minimum number of NameNodes for HA is two, but you can configure more. It is suggested not to exceed five NameNodes (three is the recommendation) due to communication overheads.

* **dfs.namenode.rpc-address.[nameservice ID].[name node ID]** - the fully-qualified RPC address for each NameNode to listen on

@@ -150,6 +148,10 @@ The order in which you set these configurations is unimportant, but the values y
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
+ <property>
+ <name>dfs.namenode.rpc-address.mycluster.nn3</name>
+ <value>machine3.example.com:8020</value>
+ </property>

**Note:** You may similarly configure the "**servicerpc-address**" setting if
you so desire.
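
For instance, a sketch of one such entry (the port shown is an arbitrary illustration, not a required default):

    <property>
      <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
      <value>machine1.example.com:8040</value>
    </property>
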
@@ -167,16 +169,20 @@ The order in which you set these configurations is unimportant, but the values y
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
+ <property>
+ <name>dfs.namenode.http-address.mycluster.nn3</name>
+ <value>machine3.example.com:50070</value>
+ </property>

**Note:** If you have Hadoop's security features enabled, you should also set
the *https-address* similarly for each NameNode.
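
A sketch of one such entry (assuming 50470, the usual HTTPS port in the 2.x line):

    <property>
      <name>dfs.namenode.https-address.mycluster.nn1</name>
      <value>machine1.example.com:50470</value>
    </property>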

* **dfs.namenode.shared.edits.dir** - the location of the shared storage directory

This is where one configures the path to the remote shared edits directory
- which the Standby NameNode uses to stay up-to-date with all the file system
+ which the Standby NameNodes use to stay up-to-date with all the file system
changes the Active NameNode makes. **You should only configure one of these
- directories.** This directory should be mounted r/w on both NameNode machines.
+ directories.** This directory should be mounted r/w on the NameNode machines.
The value of this setting should be the absolute path to this directory on the
NameNode machines. For example:
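
A sketch of such an entry (the path below is illustrative):

    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
    </property>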

@@ -203,7 +209,7 @@ The order in which you set these configurations is unimportant, but the values y
It is critical for correctness of the system that only one NameNode be in the
Active state at any given time. Thus, during a failover, we first ensure that
the Active NameNode is either in the Standby state, or the process has
- terminated, before transitioning the other NameNode to the Active state. In
+ terminated, before transitioning another NameNode to the Active state. In
order to do this, you must configure at least one **fencing method.** These are
configured as a carriage-return-separated list, which will be attempted in order
until one indicates that fencing has succeeded. There are two methods which
@@ -320,7 +326,7 @@ After all of the necessary configuration options have been set, one must initial
* If you have already formatted the NameNode, or are converting a
non-HA-enabled cluster to be HA-enabled, you should now copy over the
contents of your NameNode metadata directories to the other, unformatted
- NameNode by running the command "*hdfs namenode -bootstrapStandby*" on the
+ NameNodes by running the command "*hdfs namenode -bootstrapStandby*" on the
each unformatted NameNode, as sketched below. Running this command will also ensure that the shared
edits directory (as configured by **dfs.namenode.shared.edits.dir**) contains
sufficient edits transactions to be able to start all the NameNodes.
@@ -329,7 +335,7 @@ After all of the necessary configuration options have been set, one must initial
command "*hdfs -initializeSharedEdits*", which will initialize the shared
edits directory with the edits data from the local NameNode edits directories.
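
As an illustrative walk-through of these steps (the host names are hypothetical, and *hadoop-daemon.sh* is the 2.x-era start script):

    # on the first, already-formatted NameNode
    [machine1]$ hadoop-daemon.sh start namenode
    # on each remaining, unformatted NameNode
    [machine2]$ hdfs namenode -bootstrapStandby
    [machine3]$ hdfs namenode -bootstrapStandby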

- At this point you may start both of your HA NameNodes as you normally would start a NameNode.
+ At this point you may start all of your HA NameNodes as you normally would start a NameNode.

You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active".) Whenever an HA NameNode starts, it is initially in the Standby state.
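
You can also query the state from the command line with the *hdfs haadmin* tool, using the NameNode IDs configured earlier; for example (output shown for a node in the Standby state):

    $ hdfs haadmin -getServiceState nn1
    standby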

