HDFS-17218. NameNode should process time out excess redundancy blocks #6176
Conversation
@@ -1007,6 +1013,7 @@ public void updateRegInfo(DatanodeID nodeReg) {
    for(DatanodeStorageInfo storage : getStorageInfos()) {
      if (storage.getStorageType() != StorageType.PROVIDED) {
        storage.setBlockReportCount(0);
        storage.setBlockContentsStale(true);
Why set content stale here?
Thanks @zhangshuyan0 for your comment.
The modification here may not be directly related to the current problem.
The reason for this change is that if the current DN re-registers, block deletions or exceptions may have occurred on the DN during this period; because the FBR has not yet completed, the NN's in-memory record can differ from the blocks actually on the DN.
If processExtraRedundancyBlock is executed at this time, block loss may occur.
For example, a file has 2 replicas but only 1 replica is expected: when processExtraRedundancyBlock is executed, a live DN may be chosen for deletion, which would cause a missing block. Marking a re-registering DN's storages as blockContentsStale avoids this case.
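For context, a minimal sketch of the invalidation guard that stale storages provide (per HDFS-1972, quoted later in this thread); the method name hasStaleReplica is illustrative, not the exact BlockManager code:

// Sketch: skip invalidating "excess" replicas of a block while any of its
// replicas sits on a stale storage (e.g. a DN that just re-registered and
// has not yet completed an FBR), since the replica state is indeterminate.
private boolean hasStaleReplica(BlockInfo block) {
  for (DatanodeStorageInfo storage : blocksMap.getStorages(block)) {
    if (storage.areBlockContentsStale()) {
      return true;
    }
  }
  return false;
}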
Thanks @haiyang1987 for your reply. I understand what you mean. This patch removes the corresponding excess replicas from ExcessRedundancyMap on re-registration, so the NameNode no longer knows whether the replica is still on the re-registering DN.
Thanks @haiyang1987 for your report. And thanks @zhangshuyan0 for your review.
I think this modification addresses a new bug not related to this case; we need to fix that bug in a new issue.
We can reproduce this bug by the following steps:
- Assume block1 has two replicas, on dn1 and dn2
- DN1 is shut down for maintenance because of a corrupt disk
- Admin removes the corrupted disk and restarts the datanode
- DN1 tries to register itself with the NameNode through the registerDatanode RPC
- End-user tries to decrease the replication of the block from 2 to 1 through the setReplication RPC
- Block1 still has two replicas in the NameNode, but the dn1 replica no longer exists because it was stored on the corrupt disk
- NameNode selects dn2 as the redundant replica of this block to delete
- DN1 tries to report all stored blocks to the NameNode through the blockreport RPC
- NameNode removes the dn1 replica of block1 because the block report from DN1 doesn't contain block1
After these two operations (setReplication from the end-user and restart from the admin), block1 may lose all replicas.
So I think we should mark all storages as stale while the NameNode processes the registerDatanode RPC, so that this case can be fixed.
@zhangshuyan0 @haiyang1987 I'm looking forward to your good ideas, thanks.
@ZanderXu Thanks for your reply. I think this modification is not a new bug. Before this patch, the NameNode knew all excess replicas even when a DataNode re-registered, so it wouldn't delete more replicas than expected.
As for the situation you just described, we can discuss it from two aspects:
- If the NameNode knows the corrupt replicas corresponding to the corrupt disk, it will not delete the only healthy replica.
- If the NameNode knows nothing about the corrupt disk, the essence of the problem is that the admin manually removed some replicas without notifying the NameNode. Then, in the window between "the replica has been removed" and "the NN learns that the replica has been removed", there is always a chance that the only healthy replica will be deleted. So I think the key to solving this problem is to immediately notify the NN which disk is corrupt. Restarting and re-registering may not be necessary.
Actually, this situation can happen at any time, not just between "registerDataNode" and "blockReport". Why do you think the probability of the above situation increases after the DN re-registers, so that it needs to be dealt with specifically?
Yes, there are some other situations that can cause this case, and I don't think the probability of the above situation increases.
I just think we have a chance to reduce this case, so I think we should do it.
About the relationship between stale storage and ExcessRedundancyMap:
- StaleStorage is used to prevent the NameNode from deleting replicas of blocks whose replicas are indeterminate.
- ExcessRedundancyMap is used to mark the replicas of blocks that the NameNode is deleting.
@ZanderXu I think there is a misunderstanding between us. I totally agree with this change. The difference between us may be that I think it is more appropriate to merge this change with this PR instead of opening a new issue.
StaleStorage is used to prevent the namenode from deleting replicas of blocks whose replicas are indeterminate.
About StaleStorage, we can look at the comments in the code (lines 142 to 149 in 42e695d):
 * At startup or at failover, the storages in the cluster may have pending
 * block deletions from a previous incarnation of the NameNode. The block
 * contents are considered as stale until a block report is received. When a
 * storage is considered as stale, the replicas on it are also considered as
 * stale. If any block has at least one stale replica, then no invalidations
 * will be processed for this block. See HDFS-1972.
 */
private boolean blockContentsStale = true;
At startup or at failover, the storages in the cluster may have pending block deletions from a previous incarnation of the NameNode.
From this it can be seen that the design of "stale contents" is to address the indeterminacy caused by pending deletions. By the way, if the information provided by ExcessRedundancyMap is accurate, there will be no indeterminacy caused by pending deletions.
See also: https://issues.apache.org/jira/browse/HDFS-1972
Sorry for the late reply.
Thanks @ZanderXu @zhangshuyan0 for your detailed replies. They are very meaningful to me, and I learned a lot.
💔 -1 overall
This message was automatically generated.
The failed unit test seems unrelated to the change.
Hi @Hexiaoqiao @zhangshuyan0, do you have any comments or suggestions about this PR? Thanks.
@@ -1819,6 +1820,19 @@ void removeBlocksAssociatedTo(final DatanodeStorageInfo storageInfo) {
      storageInfo, node);
  }

  /** Remove the blocks to the given DatanodeDescriptor from InvalidateBlocks. */
  void removeBlocksFromInvalidateBlocks(final DatanodeDescriptor node) {
Suggest modifying the method name:
removeBlocksFromInvalidateBlocks -> removeNodeFromInvalidateBlocks
  }

  /** Remove the blocks to the given DatanodeDescriptor from excessRedundancyMap. */
  LightWeightHashSet<BlockInfo> removeBlocksFromExcessRedundancyMap(
removeBlocksFromExcessRedundancyMap -> removeNodeFromExcessRedundancyMap
💔 -1 overall
This message was automatically generated.
Back to this PR.
I totally support the solution @zhangshuyan0 mentioned here: https://issues.apache.org/jira/browse/HDFS-17218?focusedCommentId=17774766&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17774766
Thanks @Hexiaoqiao for your comment. The current PR solves the problem of ExcessRedundancyMap leakage on a case-by-case basis, and the implementation cost is relatively low. Of course, if we decide to adopt the timeout mechanism solution, I will submit a new PR. Looking forward to your feedback. Thanks.
I tried to implement this based on the timeout mechanism solution. However, there is a case where I have some questions.
The question here is how the current NN can define a reasonable timeframe to determine whether block1 corresponding to DN1 in ExcessRedundancyMap has timed out. Hi @Hexiaoqiao @ZanderXu @zhangshuyan0, do you have any suggestions for this case?
💔 -1 overall
This message was automatically generated.
I think we can determine whether a replica in ExcessRedundancyMap has timed out based on a configured timeout parameter. As for the scenario you mentioned, I think this can be done directly: the NN determines that DN1 has timed out and sends it another delete command. Will this have any adverse effects?
Thanks @zhangshuyan0 for your detailed suggestions.
Force-pushed 64dc319 to 7fa80cc.
💔 -1 overall
This message was automatically generated.
Hi @Hexiaoqiao @ayushtkn @ZanderXu @zhangshuyan0
Test failures seem unrelated.
try {
  Iterator<Map.Entry<String, LightWeightHashSet<ExcessBlockInfo>>> iter =
      excessRedundancyMap.getExcessRedundancyMap().entrySet().iterator();
  while (iter.hasNext() && processed < excessRedundancyTimeoutCheckLimit) {
If the size of excessRedundancyMap is large and there are few items that have timed out, the lock holding time of this method may be very long. It is recommended to avoid this situation, for example by increasing the value of the variable processed for every block processed, rather than only for blocks that have timed out.
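A minimal sketch of the bounded scan this suggests; excessRedundancyTimeoutCheckLimit comes from the diff above, while getTimeStamp and excessRedundancyTimeoutMs are assumed names, and lock handling is omitted:

// Sketch: count every visited block against the per-pass limit so the lock
// is released quickly even when few entries have timed out.
int processed = 0;
Iterator<Map.Entry<String, LightWeightHashSet<ExcessBlockInfo>>> iter =
    excessRedundancyMap.getExcessRedundancyMap().entrySet().iterator();
while (iter.hasNext() && processed < excessRedundancyTimeoutCheckLimit) {
  for (ExcessBlockInfo excess : iter.next().getValue()) {
    processed++; // counted for every block, not only timed-out ones
    if (Time.monotonicNow() - excess.getTimeStamp() > excessRedundancyTimeoutMs) {
      // timed out: re-issue the deletion for this replica
    }
    if (processed >= excessRedundancyTimeoutCheckLimit) {
      break; // stop early; the next pass continues the work
    }
  }
}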
Got it, I will update it later.
Thanks @zhangshuyan0 for your comment.
Update PR.
Hi @zhangshuyan0, please help me review it again when you have free time. Thanks~
Great progress here. Leave some comments inline, PFYI.
Map.Entry<String, LightWeightHashSet<ExcessBlockInfo>> entry = iter.next();
String datanodeUuid = entry.getKey();
LightWeightHashSet<ExcessBlockInfo> blocks = entry.getValue();
List<ExcessRedundancyMap.ExcessBlockInfo> sortedBlocks = new ArrayList<>(blocks);
The ExcessRedundancyMap qualifier is redundant here.
Got it, will fix.
DatanodeStorageInfo datanodeStorageInfo = iterator.next();
DatanodeDescriptor datanodeDescriptor = datanodeStorageInfo.getDatanodeDescriptor();
if (datanodeDescriptor.getDatanodeUuid().equals(datanodeUuid)) {
  if (datanodeStorageInfo.getState().equals(State.NORMAL)) {
How about combining these two conditions into one, as in if (a && b) { do something; }?
Got it, will fix.
final LightWeightHashSet<BlockInfo> set = map.get(dn.getDatanodeUuid());
return set != null && set.contains(blk);
final LightWeightHashSet<ExcessBlockInfo> set = map.get(dn.getDatanodeUuid());
return set != null && set.contains(new ExcessBlockInfo(blk));
I am concerned that it may involve more heap footprint when new is called frequently. Is it necessary here?
if (set == null) {
  return false;
}

final boolean removed = set.remove(blk);
final boolean removed = set.remove(new ExcessBlockInfo(blk));
Same concern as the last comment.
  return false;
}
ExcessBlockInfo other = (ExcessBlockInfo) obj;
return (this.blockInfo.equals(other.blockInfo));
Is it enough to compare blockInfo only? If true, we don't need to create a new instance for contains or remove, avoiding the extra heap footprint cost. Right?
Sir's suggestion is reasonable. Here it's sufficient to just compare blockInfo; I will fix it to avoid the extra heap footprint cost.
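Based on this discussion, ExcessBlockInfo could look roughly like the sketch below; the timeStamp field and the use of Time.monotonicNow() are assumptions, but equals/hashCode delegating to blockInfo is exactly the point agreed on above:

// Sketch: wraps a BlockInfo with the time it became "excess"; equality and
// hashing delegate to the block so set lookups ignore the timestamp.
static class ExcessBlockInfo {
  private final BlockInfo blockInfo;
  private final long timeStamp;

  ExcessBlockInfo(BlockInfo blockInfo) {
    this.blockInfo = blockInfo;
    this.timeStamp = Time.monotonicNow();
  }

  long getTimeStamp() {
    return timeStamp;
  }

  BlockInfo getBlockInfo() {
    return blockInfo;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof ExcessBlockInfo)) {
      return false;
    }
    return this.blockInfo.equals(((ExcessBlockInfo) obj).blockInfo);
  }

  @Override
  public int hashCode() {
    return blockInfo.hashCode(); // must stay consistent with equals
  }
}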
assertEquals(0, blockManager.getPendingDeletionBlocksCount());
assertNotNull(excessDn);

// Name node will ask datanode to delete replicas in heartbeat response.
I prefer NameNode to Name node.
Got it, will fix.
💔 -1 overall
This message was automatically generated.
Thanks @Hexiaoqiao for your comment, I will fix it later.
Update PR.
💔 -1 overall
This message was automatically generated.
Test failures seem unrelated. However, changes in the
🎊 +1 overall
This message was automatically generated.
Hi @Hexiaoqiao @ayushtkn @zhangshuyan0 @tomscut @xinglin, would you mind also taking a review of this PR when you have free time? Thank you very much~
LGTM. Leave one nit comment inline. Let's wait for @zhangshuyan0 to confirm. Thanks.
 * less than or equal to 0, the default value is used (converted to milliseconds).
 * @param timeOut The time (in seconds) to set as the excess redundancy block timeout.
 */
public void setExcessRedundancyTimeout(long timeOut) {
timeOut -> timeout
Thanks @Hexiaoqiao for your comment.
Got it, I will fix it later.
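For reference, a sketch of what the setter could do given the javadoc above; the field excessRedundancyTimeoutMs is an assumed name, and the default constant follows the rename suggested further below:

// Sketch: non-positive values fall back to the default; the stored value is
// converted from seconds to milliseconds.
public void setExcessRedundancyTimeout(long timeout) {
  if (timeout <= 0) {
    timeout = DFSConfigKeys.DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC_DEFAULT;
  }
  this.excessRedundancyTimeoutMs = timeout * 1000L;
}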
💔 -1 overall
This message was automatically generated.
The failed unit test seems unrelated to the change; the local test run is normal.
Just one small suggestion. Others LGTM.
@@ -315,6 +315,13 @@ public class DFSConfigKeys extends CommonConfigurationKeys {
  public static final int
      DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_DEFAULT = 300;

  public static final String DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC_KEY =
      "dfs.namenode.excess.redundancy.timeout-sec";
  public static final long DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC = 3600;
DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC -> DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC_DEFAULT
Thanks @zhangshuyan0 for your comment.
Already fixed!
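As a usage example, operators could tune the timeout through the configuration key from the diff above (the 1800-second value here is arbitrary):

// Sketch: treat excess replicas older than 30 minutes as timed out.
Configuration conf = new HdfsConfiguration();
conf.setLong(DFSConfigKeys.DFS_NAMENODE_EXCESS_REDUNDANCY_TIMEOUT_SEC_KEY, 1800L);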
💔 -1 overall
This message was automatically generated.
LGTM. +1. Thanks @haiyang1987.
Let's try to check in after waiting two workdays for any more comments. Thanks.
Thanks @Hexiaoqiao for your review!
Committed to trunk. Thanks @haiyang1987 for your work, and @zhangshuyan0 @ZanderXu for your reviews.
Thanks @Hexiaoqiao @zhangshuyan0 @ZanderXu for your review and merge. |
…apache#6176). Contributed by Haiyang Hu. Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Description of PR
https://issues.apache.org/jira/browse/HDFS-17218
We currently found that a DN will lose all pending DNA_INVALIDATE blocks if it restarts.
When a DN enables asynchronous deletion, it has many pending deletion blocks in memory.
When the DN restarts, these cached blocks may be lost, so some blocks in the excess map in the NameNode are leaked, and many blocks end up with more replicas than expected.
Root cause
1. block1 of dn1 is chosen as excess, added to excessRedundancyMap, and added to Invalidates.
2. dn1's heartbeat receives the Invalidates command.
3. dn1 executes async deletion when it receives the command, but before the block is actually deleted, the service stops, so block1 still exists.
4. At this time, the NN's excessRedundancyMap still holds the block for dn1.
5. The DN restarts; at this time, the NN has not yet determined that the DN is dead.
6. When the DN restarts, an FBR is executed (processFirstBlockReport will not be executed here; processReport will be). Since block1 is not a new block, the processExtraRedundancy logic will not be executed.
In HeartbeatManager#register(final DatanodeDescriptor d)
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java#L230-L238
In BlockManager#processReport, when the restarted DN runs the FBR, the DN is still alive and storageInfo.hasReceivedBlockReport() is true, so the method processReport will be called:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L2916-L2946
In BlockManager#processReport running the FBR, the current DatanodeStorageInfo already exists in the triplets of the BlockInfo corresponding to the reported block, so the block is not added to the toAdd list, and the addStoredBlock and processExtraRedundancy logic is not executed:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3044-L3085
BlockManager#processChosenExcessRedundancy adds the redundancy of the given block stored on the given datanode to the excess map:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4267-L4285
But because the DN side has not deleted the block, it will not call processIncrementalBlockReport, so the block of this DN cannot be removed from excessRedundancyMap.
Solution
Add logic to the NameNode to handle excess redundancy block timeouts to resolve the current issue: if the NN determines that an excess redundancy block on a DN has timed out, it re-adds the block to Invalidates.
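Putting the pieces together, a hedged sketch of that timeout handling; addToInvalidates and DatanodeManager#getDatanode exist around BlockManager, while getTimeStamp, getBlockInfo, and excessRedundancyTimeoutMs are assumed names from the discussion above:

// Sketch: periodically rescan the excess map; any replica that has stayed
// "excess" longer than the timeout gets its delete command re-issued, so a
// DN restart that dropped a pending DNA_INVALIDATE no longer leaks entries.
void processTimedOutExcessBlocks() {
  long now = Time.monotonicNow();
  for (Map.Entry<String, LightWeightHashSet<ExcessBlockInfo>> entry :
      excessRedundancyMap.getExcessRedundancyMap().entrySet()) {
    DatanodeDescriptor dn = datanodeManager.getDatanode(entry.getKey());
    if (dn == null) {
      continue; // node no longer registered; handled by removal paths
    }
    for (ExcessBlockInfo excess : entry.getValue()) {
      if (now - excess.getTimeStamp() > excessRedundancyTimeoutMs) {
        addToInvalidates(excess.getBlockInfo(), dn); // re-send delete
      }
    }
  }
}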