HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive #1291
Conversation
I wanted to add a few more thoughts on this issue as I have discussed it with others. I believe there is a bug (which this PR is intended to fix): on a cluster with many racks and all nodes healthy, it is currently possible for the only remaining nodes with capacity for more pipelines to be on a single rack. This means that the vast majority of pipelines will be rack aware, but to fill the cluster pipeline capacity, a few non-rack-aware pipelines can get created. That should not be allowed to happen, as it will give additional work to the replication manager for all the 'bad' pipelines.
Then consider the badly configured cluster I mentioned in the opening comment: 10 nodes on one rack and 1 node on the other. My proposal is that the pipeline capacity of the cluster is limited by the single node on the second rack. If we don't limit the pipeline capacity in this way, then all the non-rack-aware pipelines will create containers entirely on rack 1. The replication manager will then need to replicate them all to the only possible destination on rack 2. Almost every closed container will then need to be replicated, putting a lot of pressure on the cluster, and the single node on rack 2 will fill up.
This really is a corner case for a small, badly configured 2-rack cluster. Note that HDFS has a similar problem: racks need to be fairly well balanced in terms of node count, otherwise the nodes on the smaller racks tend to fill up faster.
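To make the capacity bound concrete, here is a small worked example. It is illustrative only and not Ozone code; the 3-node Ratis pipeline size and the per-node pipeline limit of 2 are assumptions chosen just for the arithmetic:
// Illustrative only -- not Ozone code. Assumes 3-node Ratis pipelines and a
// hypothetical per-node pipeline limit of 2.
public class RackAwareCapacityExample {
  public static void main(String[] args) {
    int nodesRack1 = 10;
    int nodesRack2 = 1;
    int perNodeLimit = 2; // assumed per-node pipeline limit

    // With only two racks, every rack-aware 3-node pipeline must place at
    // least one member on the smaller rack, so that rack's slots bound the
    // whole cluster's rack-aware pipeline capacity.
    int rackAwareCapacity = nodesRack2 * perNodeLimit;

    // A naive bound that ignores rack placement would be much larger.
    int naiveCapacity = (nodesRack1 + nodesRack2) * perNodeLimit / 3;

    System.out.println("Rack-aware pipeline capacity: " + rackAwareCapacity); // 2
    System.out.println("Naive slot-based capacity:    " + naiveCapacity);     // 7
  }
}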
@ChenSammi @timmylicheng Would either of you have an opinion on this issue?
@@ -163,9 +163,42 @@ int currentPipelineCount(DatanodeDetails datanodeDetails, int nodesRequired) {
      throw new SCMException(msg,
          SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
    }

    if (!checkAllNodesAreEqual(nodeManager.getClusterNetworkTopologyMap())) {
      boolean multipleRacks = multipleRacksAvailable(healthyNodes);
Should we also consider excludedNodes here?
In line 124, if there are exclusions, they are removed from the healthyNodes list; then, based on the remaining elements, healthyList is created by filtering out the nodes that cannot accept more pipelines.
At this point, with the healthyNodes list, we are checking whether the healthy nodes span multiple racks, as we need to know that we are in a multi-rack environment. (If the whole cluster has just one rack, then even though we have rack-awareness logic, we cannot allocate pipelines across two racks, hence this check is needed.)
The second check comes after filtering out the nodes that cannot accept more pipelines, and the intention is to throw an exception if the remaining nodes are on one single rack, as in that case we have multiple racks with healthy nodes but still cannot allocate a rack-aware pipeline.
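For context, a minimal sketch of what a helper like multipleRacksAvailable could look like, assuming DatanodeDetails exposes getNetworkLocation() for its rack path (the actual implementation in this PR may differ):
// Sketch only: returns true if the given nodes span more than one rack.
private boolean multipleRacksAvailable(List<DatanodeDetails> nodes) {
  if (nodes == null || nodes.isEmpty()) {
    return false;
  }
  String firstRack = nodes.get(0).getNetworkLocation();
  for (DatanodeDetails node : nodes) {
    // Two different rack locations are enough to prove multiple racks.
    if (!java.util.Objects.equals(firstRack, node.getNetworkLocation())) {
      return true;
    }
  }
  return false;
}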
In the first call to multipleRacksAvailable(...) we just use all the healthy nodes and don't worry about excluded nodes. As Pifta said, excluded nodes are handled later, and this first call is just to check whether the cluster has multiple alive racks.
Let's take the same example given in the description:
Rack 1 = 10 nodes; Rack 2 = 1 node
For some reason, the one node which is in Rack 2 is added to the exclude list. According to SCM, the node in Rack 2 is still healthy.
We will end up creating a pipeline in the same rack (Rack 1), even though we have two racks. We will run into the same scenario which this PR is trying to address.
You are correct. Checking this again, I see that at lines 121/126 the excluded nodes are removed from the healthyNodes list:
List<DatanodeDetails> healthyNodes =
    nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);
if (excludedNodes != null) {
  healthyNodes.removeAll(excludedNodes);
}
So I need to move the line:
boolean multipleRacks = multipleRacksAvailable(healthyNodes);
to before removing the excluded nodes.
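For clarity, the reordering described above would look roughly like this (a sketch against the code quoted just above; the exact surrounding code in PipelinePlacementPolicy may differ):
List<DatanodeDetails> healthyNodes =
    nodeManager.getNodes(HddsProtos.NodeState.HEALTHY);
// Check rack diversity on the full healthy set, before exclusions are
// applied, so an excluded-but-healthy node still counts towards a live rack.
boolean multipleRacks = multipleRacksAvailable(healthyNodes);
if (excludedNodes != null) {
  healthyNodes.removeAll(excludedNodes);
}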
Hi @sodonnel, the changeset looks good to me, and I am +1 on committing it, though it is non-binding :) There is one thing I would change, but as this implementation fits within the rest of the code and is reasonably consistent with it, we may not need to make that change now or ever, even if it would make me happier. I would still like to express my view on testing a bit and see if I can start a change :) If you have anything to add on why you chose this implementation, I would be quite happy to discuss it, as I might learn something.
Thanks for the review @fapifta - I will have a look at the test and split it into two as you said. I guess I just did it this way because other tests are like this, and there is a reasonable amount of setup code. I do see your point about each test covering only one thing, however.
Hi @sodonnel, thank you for addressing the test related ideas, and for the contribution. The changes look good to me +1 (non-binding).
@nandakumar131 Could you check the latest change? Are you happy for us to commit this PR as it is now? |
… racks are alive. (apache#1291)
* master: (26 commits)
  HDDS-4167. Acceptance test logs missing if fails during cluster startup (apache#1366)
  HDDS-4121. Implement OmMetadataMangerImpl#getExpiredOpenKeys. (apache#1351)
  HDDS-3867. Extend the chunkinfo tool to display information from all nodes in the pipeline. (apache#1154)
  HDDS-4077. Incomplete OzoneFileSystem statistics (apache#1329)
  HDDS-3903. OzoneRpcClient support batch rename keys. (apache#1150)
  HDDS-4151. Skip the inputstream while offset larger than zero in s3g (apache#1354)
  HDDS-4147. Add OFS to FileSystem META-INF (apache#1352)
  HDDS-4137. Turn on the verbose mode of safe mode check on testlib (apache#1343)
  HDDS-4146. Show the ScmId and ClusterId in the scm web ui. (apache#1350)
  HDDS-4145. Bump version to 1.1.0-SNAPSHOT on master (apache#1349)
  HDDS-4109. Tests in TestOzoneFileSystem should use the existing MiniOzoneCluster (apache#1316)
  HDDS-4149. Implement OzoneFileStatus#toString (apache#1356)
  HDDS-4153. Increase default timeout in kubernetes tests (apache#1357)
  HDDS-2411. add a datanode chunk validator fo datanode chunk generator (apache#1312)
  HDDS-4140. Auto-close /pending pull requests after 21 days of inactivity (apache#1344)
  HDDS-4152. Archive container logs for kubernetes check (apache#1355)
  HDDS-4056. Convert OzoneAdmin to pluggable model (apache#1285)
  HDDS-3972. Add option to limit number of items displaying through ldb tool. (apache#1206)
  HDDS-4068. Client should not retry same OM on network connection failure (apache#1324)
  HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive. (apache#1291)
  ...
What changes were proposed in this pull request?
If we have a scenario where one rack has more nodes than others, it is possible for all hosts in the cluster to have reached their pipeline limit while 3 nodes on the larger rack have not.
The current fallback logic will then allow a pipeline to be created which uses only the 3 nodes on the same rack, violating the rack placement policy.
There may be other ways this could happen with cluster load too, where the pipeline capacity has reached its limit on some nodes but not others.
The proposal here is that if the cluster has multiple racks AND there are healthy nodes covering at least 2 racks (where healthy is defined as a node which is registered and not stale or dead), then we should not allow "fallback" pipelines (pipelines which span only 1 rack) to be created.
This means that if you have a badly configured cluster - eg Rack 1 = 10 nodes; Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes would be constrained by this.
This constraint is better than creating non-rack-aware pipelines, and the rule above will handle the case when the cluster degrades to 1 rack, as the healthy-node definition will notice that only 1 rack is alive.
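A minimal sketch of the proposed guard, reusing the names visible in the diff above (multipleRacksAvailable, healthyNodes, and a filtered healthyList); the committed code may differ in detail:
// healthyNodes: all registered nodes that are not stale or dead.
// healthyList:  healthyNodes filtered to nodes that can take another pipeline.
boolean multipleRacks = multipleRacksAvailable(healthyNodes);
if (multipleRacks && !multipleRacksAvailable(healthyList)) {
  // Several racks are alive, but every node with spare pipeline capacity is
  // on a single rack: refuse to create a non-rack-aware (fallback) pipeline.
  String msg = "Nodes with pipeline capacity span only one rack while the "
      + "cluster has multiple healthy racks";
  throw new SCMException(msg,
      SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
}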
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4062
How was this patch tested?
New unit test added to reproduce the issue without the patch and ensure the patch fixes it.