HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive #1291

Merged: 3 commits merged into apache:master on Aug 26, 2020

Conversation

@sodonnel (Contributor) commented Aug 5, 2020

What changes were proposed in this pull request?

If we have a scenario where one rack has more nodes than the others, it is possible for all hosts in the cluster to have reached their pipeline limit while 3 nodes on the larger rack have not.

The current fallback logic will then allow a pipeline to be created which uses only the 3 nodes on the same rack, violating the rack placement policy.

There may be other ways this could happen with cluster load too, where the pipeline capacity has reached its limit on some nodes but not others.

The proposal here is that if the cluster has multiple racks AND there are healthy nodes covering at least 2 racks, where healthy is defined as a node which is registered and not stale or dead, then we should not allow "fallback" pipelines (pipelines which span only 1 rack) to be created.

This means if you have a badly configured cluster - e.g. Rack 1 = 10 nodes; Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes would be constrained by this.

This constraint is better than creating non-rack-aware pipelines, and the rule above will handle the case when the cluster degrades to 1 rack, as the healthy node definition will notice that only 1 rack is alive.
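
In code, the intent is roughly the following (a minimal sketch with illustrative names such as rejectSingleRackFallback and candidateNodes, not the exact change in this PR; it assumes getNetworkLocation() returns the rack of a DatanodeDetails):

    // A rough sketch of the proposed rule; names are illustrative, not the
    // code in this PR.
    static void rejectSingleRackFallback(List<DatanodeDetails> healthyNodes,
        List<DatanodeDetails> candidateNodes) throws SCMException {
      // Racks that have at least one healthy (registered, not stale/dead) node.
      Set<String> aliveRacks = new HashSet<>();
      for (DatanodeDetails dn : healthyNodes) {
        aliveRacks.add(dn.getNetworkLocation());
      }
      // Racks that still have a node with capacity for another pipeline.
      Set<String> usableRacks = new HashSet<>();
      for (DatanodeDetails dn : candidateNodes) {
        usableRacks.add(dn.getNetworkLocation());
      }
      if (aliveRacks.size() > 1 && usableRacks.size() < 2) {
        // Multiple racks are alive but the remaining capacity sits on a
        // single rack: refuse to create a non-rack-aware fallback pipeline.
        throw new SCMException(
            "Unable to create a rack-aware pipeline with the available nodes",
            SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
      }
    }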

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4062

How was this patch tested?

New unit test added to reproduce the issue without the patch and ensure the patch fixes it.

@sodonnel (Contributor, Author)

I wanted to add a few more thoughts on this issue as I have discussed it with others.

I believe there is a bug (which this PR is intended to fix), as on a cluster with many racks and all nodes healthy, it is currently possible for the only remaining nodes with capacity for more pipelines to be on a single rack. This means that the vast majority of pipelines will be rack aware, but to fill the cluster pipeline capacity, a few non-rack-aware pipelines can get created. That should not be allowed to happen, as it will give additional work to the replication manager for all the 'bad' pipelines.

Then consider the badly configured clusters I mentioned in the opening comment: 10 nodes on 1 rack and 1 node on the other rack. My proposal is that the pipeline capacity of the cluster is limited by the single node on the second rack. If we don't limit the pipeline capacity in this way, then all the non-rack-aware pipelines will create containers entirely on rack 1. The replication manager will then need to replicate them all to the only possible destination on rack 2. Almost every closed container will then need to be replicated, putting a lot of pressure on the cluster, and the single node on rack 2 will fill up. This really is a corner case for a small, badly configured 2 rack cluster.

Note that HDFS has similar problems to this - racks need to be fairly well balanced in terms of node count, otherwise the nodes on the smaller racks tend to fill up faster.

@sodonnel (Contributor, Author)

@ChenSammi @timmylicheng Would either of you have an opinion on this issue?

@@ -163,9 +163,42 @@ int currentPipelineCount(DatanodeDetails datanodeDetails, int nodesRequired) {
throw new SCMException(msg,
SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
}

if (!checkAllNodesAreEqual(nodeManager.getClusterNetworkTopologyMap())) {
boolean multipleRacks = multipleRacksAvailable(healthyNodes);
Contributor
Should we also consider excludedNodes here?

@fapifta (Contributor) commented Aug 12, 2020
In line 124, if there are exclusions, they are removed from the healthyNodes list; then, based on the remaining elements, healthyList is created by filtering out the nodes that cannot accept more pipelines.
At this point, with the healthyNodes list, we are checking whether there are multiple racks configured for the healthy nodes, as we need to know that we are in a multi-rack environment. (If the whole cluster has just one rack, then even though we have rack awareness logic, we cannot allocate pipelines in two racks, hence this check is needed.)
The second check is after filtering out the nodes that cannot accept more pipelines, and the intention is to throw an exception if the remaining nodes are in one single rack, as in that case we have multiple racks with healthy nodes, yet we still can't allocate a rack-aware pipeline.
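
For illustration, the flow described above could be sketched like this (simplified; multipleRacksAvailable is the helper added by the PR, but the surrounding lines are a paraphrase rather than the actual method body):

    // Simplified sketch of the described flow, not the actual method.
    private boolean multipleRacksAvailable(List<DatanodeDetails> nodes) {
      // True if the given nodes span more than one rack (network location).
      Set<String> racks = new HashSet<>();
      for (DatanodeDetails dn : nodes) {
        racks.add(dn.getNetworkLocation());
      }
      return racks.size() > 1;
    }

    // 1. On all healthy nodes: is this a multi-rack environment at all?
    boolean multipleRacks = multipleRacksAvailable(healthyNodes);

    // ... excluded nodes are removed, then nodes that cannot accept more
    // pipelines are filtered out, producing healthyList ...

    // 2. On the filtered list: if the cluster has multiple racks but all the
    // remaining candidates are on one rack, fail rather than create a
    // non-rack-aware pipeline.
    if (multipleRacks && !multipleRacksAvailable(healthyList)) {
      throw new SCMException("Unable to create a rack-aware pipeline",
          SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
    }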

@sodonnel (Contributor, Author)

In the first call to multipleRacksAvailable(...) we just use all healthy nodes and don't worry about excluded nodes. As Pifta said, excluded nodes are handled later and this first call is just to check if the cluster has multiple alive racks.

Contributor

Let's take the same example given in the description,
Rack 1 = 10 nodes; Rack 2 = 1 node

For some reason, the one node which is in Rack 2 is added to the exclude list. According to SCM the node in Rack 2 is still healthy.

We will end up creating a pipeline in the same rack (Rack 1), even though we have two racks, and we will run into the same scenario this PR is trying to address.

@sodonnel (Contributor, Author)

You are correct. Checking this again, I see at line 121/126 the excluded nodes are removed from the healthyNodes list:

    List<DatanodeDetails> healthyNodes =
        nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);
    if (excludedNodes != null) {
      healthyNodes.removeAll(excludedNodes);
    }

So I need to move the line:

boolean multipleRacks = multipleRacksAvailable(healthyNodes);

to before the excluded nodes are removed.
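
The intended ordering after that change would be roughly the following (a sketch of the fix, not the final committed code):

    // Sketch: run the multi-rack check on the full healthy list, before the
    // excluded nodes are removed, so exclusions cannot hide a live rack.
    List<DatanodeDetails> healthyNodes =
        nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);

    // Does the cluster have healthy nodes on more than one rack?
    boolean multipleRacks = multipleRacksAvailable(healthyNodes);

    if (excludedNodes != null) {
      healthyNodes.removeAll(excludedNodes);
    }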

@fapifta (Contributor) commented Aug 12, 2020

Hi @sodonnel,

the changeset looks good to me, I am +1 on committing it, though it is non-binding :)

There is one thing I would change, but as this implementation fits within the rest of the code and is broadly consistent with it, we might not need to make that change now or ever, even if I would be happier with it. Still, I would like to express my view on testing a bit and see if I can start a change :) If you have anything to add on why you chose this implementation, I would be quite happy to discuss it, as I might learn something.
So... as I see it, the test method tests 2 things; therefore the name of the method does not tell much about what is being tested, and one needs to carefully read the code and comments to understand what requirement is being tested.
Instead, if I had written the tests, I would have written two: one named something like testExceptionIsThrownWhenRackAwarePipelineCanNotBeCreated and the other named something like test3NodesInSameRackReturnedWhenOnlyOneHealthyRackIsPresent.
With that, you can easily use the built-in ExpectedException JUnit rule in the first test to check the thrown exception and assert on the exception message as well (see the sketch after this comment). I am not a fan of LambdaTestUtils; I know it is written as part of the project, but as I see it, it encourages the bad practice of testing multiple things at the same time, so instead I prefer to expect an exception via the @Test annotation or to use the ExpectedException rule.
Also, if we want to test for the exception message, this approach is pretty fragile; I would prefer a package-private static final String holding the exception message thrown in this case, and to check against that in the test. That said, I can live with the current approach, as the message likely will not change too often.
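
As an illustration of that split, the two tests could look roughly like this (a sketch only; the setup helpers, the placementPolicy field and the exact chooseDatanodes signature are placeholders, not the tests actually added in this PR):

    // Sketch of the suggested two-test split; setup helpers are hypothetical.
    @Rule
    public ExpectedException thrown = ExpectedException.none();

    @Test
    public void testExceptionIsThrownWhenRackAwarePipelineCanNotBeCreated()
        throws Exception {
      // Hypothetical setup: two racks are alive, but only one rack still has
      // nodes with spare pipeline capacity.
      setupTwoRacksWithCapacityOnOneRack();

      thrown.expect(SCMException.class);
      thrown.expectMessage("rack-aware"); // fragile if the message wording changes
      placementPolicy.chooseDatanodes(null, null, 3, 0);
    }

    @Test
    public void test3NodesInSameRackReturnedWhenOnlyOneHealthyRackIsPresent()
        throws Exception {
      // Hypothetical setup: only a single rack has healthy nodes left.
      setupSingleHealthyRack();

      List<DatanodeDetails> nodes =
          placementPolicy.chooseDatanodes(null, null, 3, 0);
      assertEquals(3, nodes.size());
    }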

@sodonnel (Contributor, Author)

Thanks for the review @fapifta - I will have a look at the test and split it into two as you said. I guess I just did it this way because other tests are like this, and there is a reasonable amount of setup code. I do see your point about each test covering only one thing, however.

@fapifta (Contributor) commented Aug 14, 2020

Hi @sodonnel,

thank you for addressing the test related ideas, and for the contribution. The changes look good to me +1 (non-binding).

@sodonnel (Contributor, Author)

@nandakumar131 Could you check the latest change? Are you happy for us to commit this PR as it is now?

@nandakumar131 nandakumar131 merged commit 8102ac7 into apache:master Aug 26, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Sep 11, 2020