HDFS-15737. Don't remove datanodes from outOfServiceNodeBlocks while checking in DatanodeAdminManager #2562
|
💔 -1 overall
This message was automatically generated. |
|
It looks like this same logic also exists in trunk. Could you submit a trunk PR / patch, and then we can backport the change across all active branches? I am also a little confused about this problem. The map [...] E.g., from trunk DatanodeAdminDefaultMonitor.java, here [...] Note that outOfServiceNodeBlocks is modified on the first pass, and so [...] Have you seen the ConcurrentModificationException logged due to this problem? |
Thanks for looking. If there are only two datanodes in outOfServiceNodeBlocks and the first one is removed, the check loops forever on the second datanode. If there are more than two datanodes and the first one is removed, a ConcurrentModificationException is thrown. I see both cases in our production clusters very often. The issue only occurs when an entry is removed (via dnAdmin.stopMaintenance(dn);). A call to outOfServiceNodeBlocks.put(dn, blocks) on an existing key only updates the value, so the CyclicIteration is not affected. |
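To illustrate the difference between removing an entry and overwriting a value during iteration, here is a minimal, self-contained sketch. It is not the Hadoop code: it assumes a plain `TreeMap` keyed by hypothetical node names (`dn1`, `dn2`, `dn3`) and a plain for-each loop instead of `CyclicIteration`, so it only reproduces the ConcurrentModificationException case, not the endless-loop case.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

class MapMutationSketch {

  // Hypothetical stand-in for outOfServiceNodeBlocks: node name -> pending block count.
  static TreeMap<String, Integer> newTracked() {
    TreeMap<String, Integer> tracked = new TreeMap<>();
    tracked.put("dn1", 10);
    tracked.put("dn2", 20);
    tracked.put("dn3", 30);
    return tracked;
  }

  // Removes the first key from the map while iterating over it.
  // Returns true if the fail-fast iterator throws.
  static boolean removeWhileIterating() {
    TreeMap<String, Integer> tracked = newTracked();
    try {
      for (Map.Entry<String, Integer> e : tracked.entrySet()) {
        if (e.getKey().equals("dn1")) {
          // Structural modification that bypasses the iterator.
          tracked.remove("dn1");
        }
      }
      return false;
    } catch (ConcurrentModificationException ex) {
      return true;
    }
  }

  // Overwrites the value of each existing key while iterating.
  // Returns the number of entries visited.
  static int putWhileIterating() {
    TreeMap<String, Integer> tracked = newTracked();
    int visited = 0;
    for (Map.Entry<String, Integer> e : tracked.entrySet()) {
      // The key already exists, so this is a value update, not a structural change.
      tracked.put(e.getKey(), e.getValue() - 1);
      visited++;
    }
    return visited;
  }

  public static void main(String[] args) {
    System.out.println("remove during iteration threw CME: " + removeWhileIterating());
    System.out.println("entries visited with put: " + putWhileIterating());
  }
}
```

The `java.util` map documentation defines a structural modification as adding or deleting mappings; changing the value of an existing key does not count, which is why only the removal trips the iterator.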
|
Thanks for the information. This may explain why HDFS-12703 was needed: some exceptions, which were not logged at that time, caused the decommission thread to stop running until the NN was restarted. The change there was to catch the exception. The change here looks correct to me, but as the issue exists on the trunk branch, we should fix it there first, and then backport to 3.3, 3.2, 3.1 and 2.10 so the fix is in place across all branches. |
Due to HDFS-14854, the fix on trunk could be a very different one, since it doesn't make sense to add a boolean parameter to stopTrackingNode in the new interface when DatanodeAdminBackoffMonitor doesn't need it. A better fix would be to introduce a cancelledNodes collection in DatanodeAdminDefaultMonitor, just as DatanodeAdminBackoffMonitor has. Then stopTrackingNode would not remove dn from outOfServiceNodeBlocks, but add it to cancelledNodes for later processing. However, that change would be somewhat bigger. |
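A rough sketch of the deferred-removal idea described above, under stated assumptions: `DeferredRemovalSketch`, `processCancelledNodes`, `check`, and `trackedCount` are hypothetical names for illustration, node names stand in for DatanodeDescriptor, an integer stands in for the block list, and locking is omitted. It is not the actual trunk implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DeferredRemovalSketch {
  // Hypothetical stand-ins: node name -> number of blocks still pending.
  private final TreeMap<String, Integer> outOfServiceNodeBlocks = new TreeMap<>();
  private final List<String> cancelledNodes = new ArrayList<>();

  void startTrackingNode(String dn, int pendingBlocks) {
    outOfServiceNodeBlocks.put(dn, pendingBlocks);
  }

  // Only records the request; the tracked map is not touched here,
  // so calling this in the middle of an iteration is safe.
  void stopTrackingNode(String dn) {
    cancelledNodes.add(dn);
  }

  // Applies all queued removals while no iteration is in progress.
  private void processCancelledNodes() {
    for (String dn : cancelledNodes) {
      outOfServiceNodeBlocks.remove(dn);
    }
    cancelledNodes.clear();
  }

  // One monitor pass: drain the queue first, then iterate.
  // Returns the number of nodes visited in this pass.
  int check() {
    processCancelledNodes();
    int visited = 0;
    for (Map.Entry<String, Integer> e : outOfServiceNodeBlocks.entrySet()) {
      visited++;
      if (e.getValue() == 0) {
        // Node is done; queue it instead of removing it mid-iteration.
        stopTrackingNode(e.getKey());
      }
    }
    return visited;
  }

  int trackedCount() {
    return outOfServiceNodeBlocks.size();
  }
}
```

With this shape, a node that finishes during a pass stays in the map until the next pass begins, so the iterator never sees a structural change. The cost is that a cancelled node may be visited once more before it is dropped.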
|
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
https://issues.apache.org/jira/browse/HDFS-15737
NOTICE
Please create an issue in ASF JIRA before opening a pull request,
and you need to set the title of the pull request which starts with
the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute