HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive #1291

Merged: 3 commits merged into apache:master on Aug 26, 2020

Conversation

@sodonnel (Contributor) commented Aug 5, 2020

What changes were proposed in this pull request?

If we have a scenario where one rack has more nodes than the others, it is possible for all hosts in the cluster to have reached their pipeline limit while 3 nodes on the larger rack have not.

The current fallback logic will then allow a pipeline to be created which uses only the 3 nodes on the same rack, violating the rack placement policy.

There may be other ways this could happen with cluster load too, where the pipeline capacity has reached its limit on some nodes but not others.

The proposal here is that if the cluster has multiple racks AND there are healthy nodes covering at least 2 racks, where healthy is defined as a node which is registered and not stale or dead, then we should not allow "fallback" pipelines (pipelines which span only 1 rack) to be created.

This means if you have a badly configured cluster - e.g. Rack 1 = 10 nodes; Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes would be constrained by this.

This constraint is better than creating non-rack-aware pipelines, and the rule above will handle the case when the cluster degrades to 1 rack, as the healthy node definition will notice that only 1 rack is alive.
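
In code, the intent is roughly the following (a minimal sketch with illustrative names such as rejectSingleRackFallback and candidateNodes, not the exact change in this PR; it assumes getNetworkLocation() returns the rack of a DatanodeDetails):

    // A rough sketch of the proposed rule; names are illustrative, not the
    // code in this PR.
    static void rejectSingleRackFallback(List<DatanodeDetails> healthyNodes,
        List<DatanodeDetails> candidateNodes) throws SCMException {
      // Racks that have at least one healthy (registered, not stale/dead) node.
      Set<String> aliveRacks = new HashSet<>();
      for (DatanodeDetails dn : healthyNodes) {
        aliveRacks.add(dn.getNetworkLocation());
      }
      // Racks that still have a node with capacity for another pipeline.
      Set<String> usableRacks = new HashSet<>();
      for (DatanodeDetails dn : candidateNodes) {
        usableRacks.add(dn.getNetworkLocation());
      }
      if (aliveRacks.size() > 1 && usableRacks.size() < 2) {
        // Multiple racks are alive but the remaining capacity sits on a
        // single rack: refuse to create a non-rack-aware fallback pipeline.
        throw new SCMException(
            "Unable to create a rack-aware pipeline with the available nodes",
            SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
      }
    }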

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4062

How was this patch tested?

New unit test added to reproduce the issue without the patch and ensure the patch fixes it.

@sodonnel (Contributor, Author)

I wanted to add a few more thoughts on this issue as I have discussed it with others.

I believe there is a bug (which this PR is intended to fix), as on a cluster with many racks and all nodes healthy, it is currently possible for the only remaining nodes with capacity for more pipelines to be on a single rack. This means that the vast majority of pipelines will be rack aware, but to fill the cluster pipeline capacity, a few non-rack-aware pipelines can get created. That should not be allowed to happen, as it will give additional work to the replication manager for all the 'bad' pipelines.

Then consider the badly configured clusters I mentioned in the opening comment: 10 nodes on 1 rack and 1 node on the other rack. My proposal is that the pipeline capacity of the cluster is limited by the single node on the second rack. If we don't limit the pipeline capacity in this way, then all the non-rack-aware pipelines will create containers entirely on rack 1. The replication manager will then need to replicate them all to the only possible destination on rack 2. Almost every closed container will then need to be replicated, putting a lot of pressure on the cluster, and the single node on rack 2 will fill up. This really is a corner case for a small, badly configured 2 rack cluster.

Note that HDFS has similar problems to this - racks need to be fairly well balanced in terms of node count, otherwise the nodes on the smaller racks tend to fill up faster.

@sodonnel (Contributor, Author)

@ChenSammi @timmylicheng Would either of you have an opinion on this issue?

@@ -163,9 +163,42 @@ int currentPipelineCount(DatanodeDetails datanodeDetails, int nodesRequired) {
throw new SCMException(msg,
SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
}

if (!checkAllNodesAreEqual(nodeManager.getClusterNetworkTopologyMap())) {
boolean multipleRacks = multipleRacksAvailable(healthyNodes);
Contributor
Should we also consider excludedNodes here?

@fapifta (Contributor) commented Aug 12, 2020
In line 124, if there are exclusions, they are removed from the healthyNodes list; then, based on the remaining elements, healthyList is created by filtering out the nodes that cannot accept more pipelines.
At this point, with the healthyNodes list, we are checking whether there are multiple racks configured for the healthy nodes, as we need to know that we are in a multi-rack environment. (If the whole cluster has just one rack, then even though we have rack awareness logic, we cannot allocate pipelines in two racks, hence this check is needed.)
The second check is after filtering out the nodes that cannot accept more pipelines, and the intention is to throw an exception if the remaining nodes are in one single rack, as in that case we have multiple racks with healthy nodes, yet we still can't allocate a rack-aware pipeline.
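
For illustration, the flow described above could be sketched like this (simplified; multipleRacksAvailable is the helper added by the PR, but the surrounding lines are a paraphrase rather than the actual method body):

    // Simplified sketch of the described flow, not the actual method.
    private boolean multipleRacksAvailable(List<DatanodeDetails> nodes) {
      // True if the given nodes span more than one rack (network location).
      Set<String> racks = new HashSet<>();
      for (DatanodeDetails dn : nodes) {
        racks.add(dn.getNetworkLocation());
      }
      return racks.size() > 1;
    }

    // 1. On all healthy nodes: is this a multi-rack environment at all?
    boolean multipleRacks = multipleRacksAvailable(healthyNodes);

    // ... excluded nodes are removed, then nodes that cannot accept more
    // pipelines are filtered out, producing healthyList ...

    // 2. On the filtered list: if the cluster has multiple racks but all the
    // remaining candidates are on one rack, fail rather than create a
    // non-rack-aware pipeline.
    if (multipleRacks && !multipleRacksAvailable(healthyList)) {
      throw new SCMException("Unable to create a rack-aware pipeline",
          SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
    }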

@sodonnel (Contributor, Author)

In the first call to multipleRacksAvailable(...) we just use all healthy nodes and don't worry about excluded nodes. As Pifta said, excluded nodes are handled later and this first call is just to check if the cluster has multiple alive racks.

Contributor

Let's take the same example given in the description,
Rack 1 = 10 nodes; Rack 2 = 1 node

For some reason, the one node which is in Rack 2 is added to the exclude list. According to SCM the node in Rack 2 is still healthy.

We will end up creating a pipeline in the same rack (Rack 1), even though we have two racks, and we will run into the same scenario this PR is trying to address.

@sodonnel (Contributor, Author)

You are correct. Checking this again, I see at line 121/126 the excluded nodes are removed from the healthyNodes list:

    List<DatanodeDetails> healthyNodes =
        nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);
    if (excludedNodes != null) {
      healthyNodes.removeAll(excludedNodes);
    }

So I need to move the line:

boolean multipleRacks = multipleRacksAvailable(healthyNodes);

to before the excluded nodes are removed.
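
The intended ordering after that change would be roughly the following (a sketch of the fix, not the final committed code):

    // Sketch: run the multi-rack check on the full healthy list, before the
    // excluded nodes are removed, so exclusions cannot hide a live rack.
    List<DatanodeDetails> healthyNodes =
        nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);

    // Does the cluster have healthy nodes on more than one rack?
    boolean multipleRacks = multipleRacksAvailable(healthyNodes);

    if (excludedNodes != null) {
      healthyNodes.removeAll(excludedNodes);
    }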

@fapifta (Contributor) commented Aug 12, 2020

Hi @sodonnel,

the changeset looks good to me, I am +1 on committing it, though it is non-binding :)

There is one thing I would change, but as this implementation fits within the rest of the code and is broadly consistent with it, we might not need to make that change now or ever, even if I would be happier with it. Still, I would like to express my view on testing a bit and see if I can start a change :) If you have anything to add on why you chose this implementation, I would be quite happy to discuss it, as I might learn something.
So... as I see it, the test method tests 2 things; therefore the name of the method does not tell much about what is being tested, and one needs to carefully read the code and comments to understand what requirement is being tested.
Instead, if I had written the tests, I would have written two: one named something like testExceptionIsThrownWhenRackAwarePipelineCanNotBeCreated and the other named something like test3NodesInSameRackReturnedWhenOnlyOneHealthyRackIsPresent.
With that, you can easily use the built-in ExpectedException JUnit rule in the first test to check the thrown exception and assert on the exception message as well (see the sketch after this comment). I am not a fan of LambdaTestUtils; I know it is written as part of the project, but as I see it, it encourages the bad practice of testing multiple things at the same time, so instead I prefer to expect an exception via the @Test annotation or to use the ExpectedException rule.
Also, if we want to test for the exception message, this approach is pretty fragile; I would prefer a package-private static final String holding the exception message thrown in this case, and to check against that in the test. That said, I can live with the current approach, as the message likely will not change too often.
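
As an illustration of that split, the two tests could look roughly like this (a sketch only; the setup helpers, the placementPolicy field and the exact chooseDatanodes signature are placeholders, not the tests actually added in this PR):

    // Sketch of the suggested two-test split; setup helpers are hypothetical.
    @Rule
    public ExpectedException thrown = ExpectedException.none();

    @Test
    public void testExceptionIsThrownWhenRackAwarePipelineCanNotBeCreated()
        throws Exception {
      // Hypothetical setup: two racks are alive, but only one rack still has
      // nodes with spare pipeline capacity.
      setupTwoRacksWithCapacityOnOneRack();

      thrown.expect(SCMException.class);
      thrown.expectMessage("rack-aware"); // fragile if the message wording changes
      placementPolicy.chooseDatanodes(null, null, 3, 0);
    }

    @Test
    public void test3NodesInSameRackReturnedWhenOnlyOneHealthyRackIsPresent()
        throws Exception {
      // Hypothetical setup: only a single rack has healthy nodes left.
      setupSingleHealthyRack();

      List<DatanodeDetails> nodes =
          placementPolicy.chooseDatanodes(null, null, 3, 0);
      assertEquals(3, nodes.size());
    }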

@sodonnel (Contributor, Author)

Thanks for the review @fapifta - I will have a look at the test and split it into two as you said. I guess I just did it this way because other tests are like this, and there is a reasonable amount of setup code. I do see your point about each test covering only one thing, however.

@fapifta (Contributor) commented Aug 14, 2020

Hi @sodonnel,

thank you for addressing the test related ideas, and for the contribution. The changes look good to me +1 (non-binding).

@sodonnel (Contributor, Author)

@nandakumar131 Could you check the latest change? Are you happy for us to commit this PR as it is now?

@nandakumar131 nandakumar131 merged commit 8102ac7 into apache:master Aug 26, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Sep 11, 2020