HDFS-17372. CommandProcessingThread#queue should use LinkedBlockingDeque to prevent high priority command blocked by low priority command #6530
base: trunk
@Hexiaoqiao @zhangshuyan0 @tasanuma @tomscut Hi, sir. Could you please take a look at this problem? If needed, I will post a UT soon.
Thanks @hfutatzhanghb for your report. Is |
@ZanderXu Hi, sir. Yes, DNA_ACCESSKEYUPDATE was blocked by BlockCommand. Good idea, but there are still some problems to solve. For example, the heartbeat response may contain commands of several types, so we can't know whether it contains DNA_ACCESSKEYUPDATE unless we iterate |
💔 -1 overall
This message was automatically generated. |
I am a bit confused about this. How often do you update the block key? @hfutatzhanghb |
We use the default, 10 hours.
Has the DNA_ACCESSKEYUPDATE command not been executed for ten hours?
Yes, it was blocked by block commands.
Agree with @ZanderXu. I think we should try not to modify the namenode or the protocol. Actually, there is a trick: according to DatanodeManager#handleHeartbeat, the key update command can only appear in the last two positions of the cmds array, so we do not need to iterate over all commands on the datanode.
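The trick above can be sketched roughly as follows. This is only an illustration, assuming the ordering described in the comment holds; `TailScanSketch`, `Action`, and `hasKeyUpdateInTail` are hypothetical names, not Hadoop classes or methods.

```java
import java.util.Arrays;
import java.util.List;

class TailScanSketch {
    // Hypothetical command kinds, standing in for the real datanode command actions.
    enum Action { TRANSFER, INVALIDATE, CACHE, ACCESSKEYUPDATE }

    // Checks only the last two entries instead of scanning the whole list.
    // Assumption: a key update command, if present, sits in one of those two slots.
    static boolean hasKeyUpdateInTail(List<Action> cmds) {
        int start = Math.max(0, cmds.size() - 2);
        for (int i = start; i < cmds.size(); i++) {
            if (cmds.get(i) == Action.ACCESSKEYUPDATE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Action> cmds = Arrays.asList(
            Action.TRANSFER, Action.INVALIDATE, Action.ACCESSKEYUPDATE);
        System.out.println(hasKeyUpdateInTail(cmds));
    }
}
```

The check costs constant time per heartbeat response, but it silently misses the key update if a later change places other command types after it in the array.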
@zhangshuyan0 Sir, thanks a lot for your valuable suggestion. I agree with you and @ZanderXu. I will try to use the trick you mentioned above to modify this PR.
Hi, sir. I have one doubt: what should we do if a new CMD type is added here? The access key update cmd would then no longer be in the last two positions of the array.
9d1d14b to 1f6fe12
@zhangshuyan0 @ZanderXu @tasanuma @tomscut Hi, sir. I have updated this PR.
2b46a60 to 47d3f2e
2e2de6d to 2d39b50
…que to prevent high priority command blocked by low priority command.
c2a43cc to 53cb056
@Hexiaoqiao @zhangshuyan0 @ZanderXu Sir, this PR has been running correctly for about two months. Could you please help me push it forward? Thanks a lot.
Description of PR
Refer to HDFS-17372.
Recently, we hit a critical problem in a production cluster that holds a large number of small files. In that cluster, each datanode has almost six million blocks.
After deleting a large directory or recommissioning, the SumOfActorCommandQueueLength metric on some datanodes became very high, as the picture below shows.
After a while, we found that write block ops on some datanodes had dropped to zero, which means clients could not write to those datanodes. We found some logs on those datanodes:
That is to say, the DNA_ACCESSKEYUPDATE command was blocked in CommandProcessingThread#queue. This can be deadly.
So we should guarantee that high-priority commands are processed in time.
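The idea in the PR title can be illustrated with a minimal sketch, assuming a single consumer that always takes from the head of a `LinkedBlockingDeque`. `PriorityCommandQueueSketch`, `Command`, `enqueue`, and `drainNames` are hypothetical names for illustration only and do not reflect the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

class PriorityCommandQueueSketch {
    // Hypothetical stand-in for a queued datanode command; not a Hadoop class.
    static final class Command {
        final String name;
        final boolean highPriority;

        Command(String name, boolean highPriority) {
            this.name = name;
            this.highPriority = highPriority;
        }
    }

    private final BlockingDeque<Command> queue = new LinkedBlockingDeque<>();

    // High-priority commands jump to the head; everything else goes to the tail.
    void enqueue(Command cmd) {
        if (cmd.highPriority) {
            queue.offerFirst(cmd);
        } else {
            queue.offerLast(cmd);
        }
    }

    // What a processing thread would call: blocks until a command is available.
    Command takeNext() throws InterruptedException {
        return queue.take();
    }

    // Non-blocking drain, used here only to show the resulting order.
    List<String> drainNames() {
        List<String> names = new ArrayList<>();
        Command cmd;
        while ((cmd = queue.pollFirst()) != null) {
            names.add(cmd.name);
        }
        return names;
    }

    public static void main(String[] args) {
        PriorityCommandQueueSketch q = new PriorityCommandQueueSketch();
        q.enqueue(new Command("delete-1", false));
        q.enqueue(new Command("delete-2", false));
        q.enqueue(new Command("key-update", true));
        System.out.println(q.drainNames()); // prints [key-update, delete-1, delete-2]
    }
}
```

One consequence of this design is that several high-priority commands queued back to back are consumed in reverse arrival order, since each one is pushed in front of the previous; whether that matters depends on the command types involved.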