HDFS-16438. Avoid holding read locks for a long time when scanDatanodeStorage #3928
Conversation
💔 -1 overall
This message was automatically generated.

Hi @jojochuang, I want to confirm whether this comment is related to this issue? Thanks.

Sorry. My bad. Posted to the wrong PR.
I am not sure if it is safe to drop the read lock while iterating over the blocks in a StorageInfo, as the blocks could change when you drop the lock. Looking at the code in DatanodeStorageInfo, we can add a block to the list, which always adds to the head of the list. The worst case there is that we don't consider some block, but it would be caught by the rescan at the end, so that is not a big problem.

However, we can also remove a block. Let's say we stop iterating at blockX and drop the lock. Then an IBR comes in and gets processed, removing that block from the storage. The iterator will still be holding a reference to that blockInfo object. The remove logic in BlockInfo.java basically unlinks the block from the linked list and then sets its own prev and next to null. When the iterator then advances and the current blockInfo has a null next, the iterator will get null, think it has reached the end of the list, and exit wrongly. This would be a rare bug to hit, as you would need to drop the lock and then have the next block to consume from the list removed from it, but it is one of those things that will happen from time to time and be very hard to explain.

5M blocks on a single storage is a lot - dropping and re-locking at the storage level was supposed to handle DNs with millions of blocks, but with each storage only holding a small number. Does this pause occur once per storage at the start and end of decommission for each node?
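A minimal, self-contained sketch of the hazard described above: a node-based list whose remove() nulls out the removed node's links, plus an iterator position that treats a null next as the end of the list. The Node and StorageBlockList classes here are illustrative stand-ins, not the actual BlockInfo/DatanodeStorageInfo code.

```java
import java.util.Objects;

class Node {
  final long blockId;
  Node prev, next;
  Node(long blockId) { this.blockId = blockId; }
}

class StorageBlockList {
  Node head;

  // Mirrors "we can add a block to the list, which always adds to the head".
  void addToHead(Node n) {
    n.next = head;
    if (head != null) head.prev = n;
    head = n;
  }

  // Mirrors the removal behaviour described above: unlink the node from the
  // list, then set its own prev and next to null.
  void remove(Node n) {
    if (n.prev != null) n.prev.next = n.next;
    if (n.next != null) n.next.prev = n.prev;
    if (head == n) head = n.next;
    n.prev = null;
    n.next = null;
  }
}

public class IteratorHazardDemo {
  public static void main(String[] args) {
    StorageBlockList list = new StorageBlockList();
    for (long id = 1; id <= 5; id++) {
      list.addToHead(new Node(id));
    }

    // Iteration "pauses" on this node while the read lock is dropped.
    Node parked = list.head;

    // While unlocked, an IBR removes exactly that block from the storage.
    list.remove(parked);

    // On resume, parked.next is null, so a naive iterator believes it has
    // reached the end of the list even though four blocks remain.
    System.out.println("next after resume: "
        + Objects.toString(parked.next, "null (looks like end of list)"));
  }
}
```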
Thanks @sodonnel for your comments and detailed explanations. I'm sorry, I said the wrong number earlier (5M): we have more than 10 million blocks per node and 20 disks, so more than 500,000 blocks per storage. Your explanation makes sense to me. During the scan, could we copy the blocks of each storage into a separate list first?
Looking at the scan logic: the rescan would be more expensive, as it needs to check whether each block is correctly replicated, and also whether you are putting a node into maintenance. The rescan happens at the end. The initial scan basically puts all the entries into a list, so it should already be lightweight. Do you know if the pause is happening at the end of decommission, during the re-scan? If so, we should probably refactor it so the replication check is not done while holding the lock.
Hi @sodonnel, thanks for your comments. From the stack we captured, I think this is at the end of decommission, because we can see the isReplicatedOK check in it.
Yea, it looks like it is all the isReplicatedOK check that is making it slow. What about a change like this, to move the replicated check outside of the lock:
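A rough, self-contained sketch of that idea (not the actual suggested diff): collect the blocks cheaply while holding the read lock, then run the per-block replication check after releasing it. The ReentrantReadWriteLock, the plain block IDs, and isReplicatedOk() are stand-ins for the namesystem lock, BlockInfo, and the real check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CheckOutsideLockSketch {

  private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

  // Placeholder for the real replication / maintenance check.
  static boolean isReplicatedOk(long blockId) {
    return blockId % 2 == 0;
  }

  public static void main(String[] args) {
    List<Long> candidates = new ArrayList<>();

    LOCK.readLock().lock();
    try {
      // Cheap phase: only collect references while holding the lock.
      for (long id = 1; id <= 10; id++) {
        candidates.add(id);
      }
    } finally {
      LOCK.readLock().unlock();
    }

    // Expensive phase: run the per-block check without holding the lock, so
    // writers (e.g. IBR processing) are not starved while a huge storage is
    // scanned.
    candidates.removeIf(CheckOutsideLockSketch::isReplicatedOk);

    System.out.println("Blocks still needing attention: " + candidates);
  }
}
```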
```java
if (Time.monotonicNow() - beginTime > scanDatanodeStorageLockTimeMs) {
  namesystem.readUnlock();
  try {
    Thread.sleep(1);
```
Is a 1ms sleep enough to avoid lock starvation, or should we add a bit more?
Thanks @virajjasani for your review. I am trying a new implementation now, so this code may be replaced.
Hi @sodonnel, I just want to make sure about the condition here: don't we need to take the inverse? Please correct me if I have misunderstood.
Thanks @sodonnel for your comments and suggestions, which make sense to me. Moving the replicated check outside of the lock is a good idea; I'll update the code and do some testing.
Yes, you are correct; my suggested code has it wrong. We need to remove a block from the list if it is replicated OK, whereas the original code only added a block if it was not replicated OK.
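To make the inverted condition concrete, here is a small self-contained sketch (plain block IDs and a placeholder isReplicatedOk(), not the actual patch): adding the blocks that are not replicated OK, and removing the blocks that are replicated OK from a full copy, leave the same set behind.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InvertedCheckSketch {

  // Placeholder for the real replication check.
  static boolean isReplicatedOk(long blockId) {
    return blockId % 2 == 0;
  }

  public static void main(String[] args) {
    // Original approach: build the list by ADDING blocks that are NOT replicated OK.
    List<Long> added = new ArrayList<>();
    for (long id = 1; id <= 6; id++) {
      if (!isReplicatedOk(id)) {
        added.add(id);
      }
    }

    // Suggested approach: start from the full list and REMOVE blocks that ARE
    // replicated OK, i.e. the inverse condition discussed above.
    List<Long> remaining = new ArrayList<>();
    for (long id = 1; id <= 6; id++) {
      remaining.add(id);
    }
    for (Iterator<Long> it = remaining.iterator(); it.hasNext(); ) {
      if (isReplicatedOk(it.next())) {
        it.remove();
      }
    }

    // Both end up with the same set of blocks still needing attention.
    System.out.println(added + " == " + remaining);
  }
}
```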
Thanks @sodonnel for your quick reply. I'll update the code and do some related tests later.
💔 -1 overall
This message was automatically generated.
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
JIRA: HDFS-16438.
When decommissioning, if DatanodeAdminBackoffMonitor is used, there is a heavy operation: scanDatanodeStorage. If the number of blocks on a storage is large (more than five hundred thousand) and GC performance is also poor, it may hold the read lock for a long time, so we should optimize it.
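As a general illustration of the pattern discussed in this PR, here is a self-contained sketch of a long scan that periodically releases and reacquires a read lock instead of holding it for the whole pass. The ReentrantReadWriteLock, the 50 ms time slice, and the empty loop body are assumptions for illustration, not the actual Hadoop code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TimeSlicedScanSketch {

  private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();
  private static final long MAX_LOCK_HOLD_MS = 50;   // assumed time slice, not a real config

  public static void main(String[] args) throws InterruptedException {
    long begin = System.nanoTime();
    LOCK.readLock().lock();
    try {
      for (int block = 0; block < 1_000_000; block++) {
        // ... per-block work would go here ...

        long heldMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
        if (heldMs > MAX_LOCK_HOLD_MS) {
          // Let writers (e.g. IBR processing) in, then continue the scan.
          LOCK.readLock().unlock();
          try {
            Thread.sleep(1);
          } finally {
            LOCK.readLock().lock();
            begin = System.nanoTime();
          }
        }
      }
    } finally {
      LOCK.readLock().unlock();
    }
  }
}
```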