
HDDS-8494. Adjust replication queue limits for out-of-service nodes #4645

Merged: 7 commits into apache:master on May 4, 2023

Conversation

adoroszlai (Contributor)

What changes were proposed in this pull request?

When a datanode enters a decommissioning state, it increases the size of the replication supervisor thread pool; when the node returns to the IN_SERVICE state, it reverts to the lower thread pool limit.

Similarly, when scheduling commands, SCM can allocate more commands to a decommissioning host, since that host should process them more quickly due to its lower load and larger thread pool.

  • Scale the size of executor thread pool and command queue for replication in datanode if state changes between in-service and out-of-service
  • Similarly scale the limit of pending replication commands at SCM
  • Simplify TestReplicationSupervisor#testMaxQueueSize to avoid the use of thread pool (possible source of intermittent failures recently 1, 2, 3)

https://issues.apache.org/jira/browse/HDDS-8494

How was this patch tested?

Added unit test.

Tested in an ozone compose environment with 6 nodes: created RATIS and EC keys, then decommissioned and recommissioned one of the datanodes.

```
2023-05-02 16:37:20,303 [Command processor thread] INFO replication.ReplicationSupervisor: Node state updated to DECOMMISSIONING, scaling executor pool size to 20
...
2023-05-02 16:39:16,353 [Command processor thread] INFO replication.ReplicationSupervisor: Node state updated to IN_SERVICE, scaling executor pool size to 10
```
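The behaviour in the log above can be sketched with a plain `ThreadPoolExecutor`. This is a minimal illustration of the idea (base pool of 10, scale factor of 2); the class, enum, and method names below are assumptions for the sketch, not the actual Ozone code:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: grow the replication executor pool while the node is
// out of service, and shrink it back when the node returns to IN_SERVICE.
public class ReplicationExecutorScaler {
  public enum NodeState { IN_SERVICE, DECOMMISSIONING, ENTERING_MAINTENANCE }

  private final int basePoolSize;
  private final int scaleFactor;
  private final ThreadPoolExecutor executor;

  public ReplicationExecutorScaler(int basePoolSize, int scaleFactor) {
    this.basePoolSize = basePoolSize;
    this.scaleFactor = scaleFactor;
    this.executor = new ThreadPoolExecutor(basePoolSize, basePoolSize,
        60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
  }

  /** Resize the pool when the node's operational state changes. */
  public void onStateChange(NodeState newState) {
    int target = (newState == NodeState.IN_SERVICE)
        ? basePoolSize
        : basePoolSize * scaleFactor;
    if (target > executor.getMaximumPoolSize()) {
      // Growing: raise the maximum first so core <= max always holds.
      executor.setMaximumPoolSize(target);
      executor.setCorePoolSize(target);
    } else {
      // Shrinking: lower the core first for the same reason.
      executor.setCorePoolSize(target);
      executor.setMaximumPoolSize(target);
    }
  }

  public int getPoolSize() {
    return executor.getCorePoolSize();
  }
}
```

The ordering of the two setter calls matters because `ThreadPoolExecutor` rejects a core size larger than the current maximum.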

@adoroszlai adoroszlai self-assigned this May 3, 2023
@adoroszlai adoroszlai requested a review from sodonnel May 3, 2023 10:36
@adoroszlai adoroszlai marked this pull request as ready for review May 3, 2023 10:54
@sodonnel (Contributor) commented May 3, 2023

Change looks good. The only two suggestions I have:

  1. I wonder if we should make the "scaling factor" by which we increase the limit and thread pool configurable, with a default of 2? If we make it configurable, should it be a decimal rather than an integer so we can scale by 1.5, 2.5, etc.? I guess the config would need to apply on both the DN and RM, so I am not 100% sure where we should define it. It would be a shame to need two configs, as they could get out of sync.

  2. Might be good to also add a test based on testSendThrottledReplicateContainerCommand in the TestReplicationManager class to validate that the decommissioning host is picked as a target when all nodes are over the original limit. This is kind of covered in the excluded nodes test, but excluded nodes are only updated when a new command pushes them over the limit. This test would ensure decommissioning nodes are still picked if they are over the original limit but under the extended limit.
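The decimal scaling factor suggested above might look something like the following. This is a hypothetical sketch, with assumed names, of one shared helper that both the DN and RM could call so the two sides cannot drift apart:

```java
// Hypothetical illustration of a decimal scaling factor shared by the
// datanode and the Replication Manager; not the committed implementation.
public final class ReplicationScaling {
  private ReplicationScaling() { }

  /**
   * Scale a base limit (thread pool size or pending-command limit) by a
   * decimal factor, rounding up so a factor like 1.5 still grows the limit.
   */
  public static int scaledLimit(int baseLimit, double factor) {
    return (int) Math.ceil(baseLimit * factor);
  }
}
```

Rounding up rather than truncating avoids a fractional factor ever leaving a small base limit unchanged.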

@adoroszlai (Contributor, Author)

Thanks @sodonnel for the review. Added a test for testSendThrottledReplicateContainerCommand. I'll add the config in a follow-up task; I need more time to think about it.

@sodonnel (Contributor) left a comment


LGTM, thanks for adding the test and for cleaning up the repeated code in the existing tests!

@adoroszlai adoroszlai merged commit 74adc9a into apache:master May 4, 2023
27 checks passed
@adoroszlai adoroszlai deleted the HDDS-8494 branch May 4, 2023 14:33