HDFS-16303. Improve handling of datanode lost while decommissioning #3674

KevinWikant · 2021-11-17T21:03:11Z

Description of PR

Fixes a bug in Hadoop HDFS: if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in the decommissioning state, all forward progress towards decommissioning any datanode (including healthy datanodes) is blocked.
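
To illustrate the failure mode, here is a minimal, self-contained Java sketch. It is a toy model and not the actual DatanodeAdminManager code: the class DecommissionSketch, the Node type, and the simulate method are hypothetical names, and the model assumes that a live node finishes decommissioning within a single monitor cycle. With the tracked-node limit set to 2 and three dead nodes at the head of the queue, no live node is ever picked up unless the dead nodes are re-queued.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Toy model of the tracked-node limit; NOT the real DatanodeAdminManager. */
class DecommissionSketch {

  /** Hypothetical stand-in for a datanode: a name plus a liveness flag. */
  static final class Node {
    final String name;
    final boolean alive;

    Node(String name, boolean alive) {
      this.name = name;
      this.alive = alive;
    }
  }

  /** Three dead nodes queued ahead of two live ones. */
  static List<Node> demoNodes() {
    return Arrays.asList(
        new Node("dead-1", false), new Node("dead-2", false), new Node("dead-3", false),
        new Node("live-1", true), new Node("live-2", true));
  }

  /** Runs the given number of monitor cycles; returns how many live nodes completed. */
  static int simulate(List<Node> nodes, int maxTracked, boolean requeueDead, int ticks) {
    Deque<Node> pending = new ArrayDeque<>(nodes);
    List<Node> tracked = new ArrayList<>();
    int decommissioned = 0;

    for (int t = 0; t < ticks; t++) {
      // Fill free tracking slots from the head of the pending queue.
      while (tracked.size() < maxTracked && !pending.isEmpty()) {
        tracked.add(pending.poll());
      }
      Iterator<Node> it = tracked.iterator();
      while (it.hasNext()) {
        Node n = it.next();
        if (n.alive) {
          // Assumption: a live node completes in one cycle and frees its slot.
          it.remove();
          decommissioned++;
        } else if (requeueDead && !pending.isEmpty()) {
          // Dead node: give up its slot only if other nodes are waiting.
          it.remove();
          pending.add(n);
        }
      }
    }
    return decommissioned;
  }

  public static void main(String[] args) {
    System.out.println(simulate(demoNodes(), 2, false, 10)); // prints 0
    System.out.println(simulate(demoNodes(), 2, true, 10));  // prints 2
  }
}
```

Without re-queueing, the two tracking slots stay occupied by dead nodes indefinitely; with re-queueing, the slots rotate through the queue and both live nodes complete.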

JIRA: https://issues.apache.org/jira/browse/HDFS-16303

How was this patch tested?

Unit Testing

Added new unit tests:

TestDecommission.testRequeueUnhealthyDecommissioningNodes (& TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes)
DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering
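
The queue-ordering test names suggest that the pending queue has a defined ordering. The sketch below shows what such an ordering could look like with a java.util.PriorityQueue. The ordering used here (most recent heartbeat first) is an assumption for illustration; this description does not state the actual comparator. PendingQueueOrderingSketch, Node, and lastHeartbeatMs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative ordering of a pending queue; names and ordering are assumptions. */
class PendingQueueOrderingSketch {

  /** Hypothetical stand-in for a datanode with a last-heartbeat timestamp. */
  static final class Node {
    final String name;
    final long lastHeartbeatMs;

    Node(String name, long lastHeartbeatMs) {
      this.name = name;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  /** Assumed ordering: the node with the most recent heartbeat is polled first. */
  static final Comparator<Node> MOST_RECENT_FIRST =
      (a, b) -> Long.compare(b.lastHeartbeatMs, a.lastHeartbeatMs);

  /** Returns node names in the order the queue would hand them out. */
  static List<String> drainOrder(List<Node> nodes) {
    PriorityQueue<Node> queue = new PriorityQueue<>(MOST_RECENT_FIRST);
    queue.addAll(nodes);
    List<String> order = new ArrayList<>();
    while (!queue.isEmpty()) {
      order.add(queue.poll().name);
    }
    return order;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("a", 100L), new Node("b", 300L), new Node("c", 200L));
    System.out.println(drainOrder(nodes)); // prints [b, c, a]
  }
}
```

Under this assumed ordering, the result does not depend on insertion order, which is the kind of property a forward and a reverse ordering test would check.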

All "TestDecommission", "TestDecommissionWithBackoffMonitor", & "DatanodeAdminMonitorBase" tests pass when run locally.

Note that without the "DatanodeAdminManager" changes, the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned:

> mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout Timed...

> mvn -Dtest=TestDecommissionWithBackoffMonitor#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommissionWithBackoffMonitor>TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout

Manual Testing

Create a Hadoop cluster with:
- 30 datanodes initially
- custom Namenode JAR containing this change
- hdfs-site configuration "dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"

> cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
    <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
    <value>10</value>

Reproduce the bug (https://issues.apache.org/jira/browse/HDFS-16303):
- start decommissioning over 20 datanodes
- terminate 20 datanodes while they are in the decommissioning state
- observe the Namenode logs to validate that there are 20 unhealthy datanodes stuck in "Decommission In Progress"

2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

Scale up to 25 healthy datanodes, then decommission 22 of those datanodes (all but 3):
- observe the Namenode logs to validate that those 22 healthy datanodes are decommissioned (i.e. HDFS-16303 is solved)

2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 18:00:14,487 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:00:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:44,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.

2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 12 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 8 nodes which are dead while in Decommission In Progress.

2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
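
When checking logs like the ones above, the three counts in the "There are N nodes decommissioning" warning can be extracted mechanically. The following is a small sketch that assumes the message format is exactly as shown in these log lines; DecommissionLogSketch and parse are hypothetical helper names and not part of Hadoop.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the tracked-node warning from Namenode logs. */
class DecommissionLogSketch {

  // Matches the warning text as it appears in the log excerpts above.
  private static final Pattern TRACKED_WARNING = Pattern.compile(
      "There are (\\d+) nodes decommissioning but only (\\d+) nodes will be "
          + "tracked at a time\\. (\\d+) nodes are currently queued");

  /** Returns {decommissioning, trackedLimit, queued}, or null if the line does not match. */
  static int[] parse(String line) {
    Matcher m = TRACKED_WARNING.matcher(line);
    if (!m.find()) {
      return null;
    }
    return new int[] {
        Integer.parseInt(m.group(1)),
        Integer.parseInt(m.group(2)),
        Integer.parseInt(m.group(3))
    };
  }

  public static void main(String[] args) {
    String line = "2021-11-15 18:00:14,487 WARN "
        + "org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager "
        + "(DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes "
        + "will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.";
    System.out.println(Arrays.toString(parse(line))); // prints [42, 10, 32]
  }
}
```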

For code changes:

[yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
[n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
[n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
[no] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

HDFS-16303. Improve handling of datanode lost while decommissioning

b77f043

KevinWikant closed this Nov 17, 2021
