HDDS-8417. Cap on queue of commands at DN (#4618)
Conversation
This description isn't really accurate - it's really the maximum number of commands of each command type that can be queued on a datanode, right? In many cases the commands go from the command queue into sub-queues in the various handlers.
...c/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
For the replication supervisor, I think the limit check should be in supervisor.addTask(). Then we don't need to add this check in two places (the reconstruction and replication handlers).
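For illustration, a minimal sketch of what a single check in addTask() could look like. The class and names below are hypothetical, not the actual ReplicationSupervisor API; the point is only that the cap is enforced once, inside addTask(), rather than in every handler that feeds tasks in.

// Minimal sketch only: hypothetical class, not the real ReplicationSupervisor API.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedSupervisorSketch {

  private final int maxQueueSize;
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final AtomicLong droppedTasks = new AtomicLong();

  public BoundedSupervisorSketch(int maxQueueSize) {
    this.maxQueueSize = maxQueueSize;
  }

  /** Single place where the limit is checked for every task type. */
  public boolean addTask(Runnable task) {
    if (queue.size() >= maxQueueSize) {
      droppedTasks.incrementAndGet();
      return false;   // caller can log and drop the command
    }
    return queue.offer(task);
  }

  public long getDroppedTasks() {
    return droppedTasks.get();
  }
}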
I'm not sure about this check, given we have limit checks in other places too. E.g. we could have 5000 replications queued on its queue and then another 5000 here, so it's a little confusing.
The command queue implementation is single threaded, and processes each command in turn. These are the list of handlers it currently has:
.addHandler(new CloseContainerCommandHandler()) // Blocks
.addHandler(new DeleteBlocksCommandHandler(getContainer(), // async
conf, dnConf.getBlockDeleteThreads(),
dnConf.getBlockDeleteQueueLimit()))
.addHandler(new ReplicateContainerCommandHandler(conf, supervisor, // async
pullReplicatorWithMetrics, pushReplicatorWithMetrics))
.addHandler(reconstructECContainersCommandHandler) // async
.addHandler(new DeleteContainerCommandHandler( // async
dnConf.getContainerDeleteThreads(), clock))
.addHandler(new ClosePipelineCommandHandler()) // blocks
.addHandler(new CreatePipelineCommandHandler(conf)) // blocks
.addHandler(new SetNodeOperationalStateCommandHandler(conf)) // blocks
.addHandler(new FinalizeNewLayoutVersionCommandHandler()) // blocks
.addHandler(new RefreshVolumeUsageCommandHandler()) // blocks
So quite a few of the commands block processing of other commands on the queue while they run. Most of these are probably very fast. The RefreshVolumeUsageCommandHandler worries me though. I am not sure what it does or how often it gets called, but if it calls du it could be slow. Perhaps it should be async. Could close and create pipeline potentially be slow too, if there are some issues with the Ratis election or with the pipeline when closing it?
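To illustrate the async suggestion, here is a rough sketch (hypothetical shape, not the real Ozone CommandHandler API): the dispatcher thread just submits the work, and a dedicated executor runs the potentially slow part (e.g. a du-style volume scan), so other queued commands are not blocked behind it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCommandHandlerSketch {

  private final ExecutorService executor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "refresh-volume-usage");
        t.setDaemon(true);
        return t;
      });

  /** Called on the single command-dispatch thread; must return quickly. */
  public void handle(Runnable slowWork) {
    executor.submit(slowWork);
  }

  public void stop() {
    executor.shutdown();
  }
}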
When you saw problems with the DNs having a lot of queued commands, where were the commands queued? On the replication supervisor / delete block queues, or was the command queue also backed up with a lot of messages, indicating that its thread was not able to move the messages onto the sub-queues fast enough, or that some blocking command was taking a long time to run?
@sodonnel
In earlier versions, delete block commands were executed synchronously, causing many commands to accumulate, but this is resolved in later versions.
The probability of the command queue itself accumulating is low, as commands that take a lot of time are handed over to sub-queues, and only a few commands are executed synchronously.
This is just an upper cap on all command queues, as good practice, to avoid such an issue if it happens due to some bug or abnormal scenario.
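As a rough sketch of the idea (all names below are illustrative, not the actual HDDS-8417 code): each command type gets its own counter, and a command is dropped once the per-type cap is reached, on the assumption that SCM will re-send anything that still needs to be done.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CommandQueueCapSketch {

  private final int maxQueuedPerType;
  private final Map<String, AtomicInteger> queuedCount = new ConcurrentHashMap<>();

  public CommandQueueCapSketch(int maxQueuedPerType) {
    this.maxQueuedPerType = maxQueuedPerType;
  }

  /** Returns false (drop the command) once the cap for this type is reached. */
  public boolean tryQueue(String commandType) {
    AtomicInteger count =
        queuedCount.computeIfAbsent(commandType, k -> new AtomicInteger());
    if (count.incrementAndGet() > maxQueuedPerType) {
      count.decrementAndGet();
      return false;
    }
    return true;
  }

  /** Called once the command has been handed off to its handler. */
  public void dequeued(String commandType) {
    AtomicInteger count = queuedCount.get(commandType);
    if (count != null) {
      count.decrementAndGet();
    }
  }
}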
It's worth noting here that there have been changes in the Replication Manager which mean it will not keep adding and adding to the replication queue on the datanode, so problems where thousands of replication commands end up on the datanode should not happen any more. Also, the datanode now drops commands that are too old, for both replication and delete container commands.
I believe you saw some scenarios where the DN queue had a lot of messages, resulting in a crash. Do you have a breakdown of what commands were queued and where? E.g. replication commands on the replication queue; and what about the commandQueue where commands get placed first - was it also suffering from many messages stuck on it?
I don't see any new tests in this PR - if we are going to have logic to enforce limits we probably need a couple of simple tests to check they are working OK.
This test class is not enabled as it is not prefixed with Test...:
- Helper to check scheduling efficiency.
- This unit test is not enabled (doesn't start with Test) but can be used
- to validate changes manually.
It needs to be moved into TestReplicationSupervisor.
We should try to avoid sleeps in tests. They lead to slow tests and potentially flaky tests. Instead of a sleep, please use some sort of barrier to block execution - e.g. a CountDownLatch. This test takes just over 4 seconds to run.
In TestReplicationSupervisor there is a BlockingTask class defined that makes use of latches to block execution for a period of time, so you could probably make use of that; a rough sketch follows the outline below.
// schedule a few tasks up to the limit
// try to schedule one more and the queue size should not increase
// unblock the execution.
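A rough sketch of that outline using latches instead of sleeps. The executor and task here are stand-ins; the real test would reuse BlockingTask from TestReplicationSupervisor and assert against the supervisor under test rather than a raw executor.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueLimitTestSketch {

  static class BlockingTask implements Runnable {
    private final CountDownLatch runningLatch;
    private final CountDownLatch waitForCompleteLatch;

    BlockingTask(CountDownLatch running, CountDownLatch complete) {
      this.runningLatch = running;
      this.waitForCompleteLatch = complete;
    }

    @Override
    public void run() {
      runningLatch.countDown();        // signal that the task has started
      try {
        waitForCompleteLatch.await();  // block until the test releases it
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(1);
    CountDownLatch running = new CountDownLatch(1);
    CountDownLatch complete = new CountDownLatch(1);

    // Occupy the single worker thread; anything scheduled now stays queued.
    executor.submit(new BlockingTask(running, complete));
    running.await(5, TimeUnit.SECONDS);

    // 1. Schedule tasks up to the limit.
    // 2. Try to schedule one more and assert the queue size did not grow.
    // No Thread.sleep() is needed anywhere.

    complete.countDown();              // unblock the execution
    executor.shutdown();
  }
}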
I think here we should use getTotalInFlightReplications() instead of getQueueSize(), as it will include running tasks as well as those queued.
Suggest rewording this log and including the task class in it, e.g.:
LOG.warn("Ignored a {} command for container {} in the Replication Supervisor"
    + " as the queue has reached the max size of {}.",
    task.getClass().getSimpleName(), task.getContainerId(), maxQueueSize());
@sumitagrawl please fix checkstyle
Fixed...
new one:
sodonnel left a comment:
LGTM. Thanks for implementing the changes.
What changes were proposed in this pull request?
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-8417
How was this patch tested?
The impact is tested using existing unit tests and integration tests.