Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-5916. Datanodes stuck in leader election in Kubernetes #3186

Merged

Conversation

sokui
Copy link
Contributor

@sokui sokui commented Mar 12, 2022

What changes were proposed in this pull request?

make ozone support datanode change IPs and hostnames (as long as the uuid not change)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5916

How was this patch tested?

Tested in k8s production with kerberos enabled. Each datanode is attached to a pvc. Ozone still works well after killing any number of the datanodes (Datanodes will be rescheduled with different IPs).

@kerneltime
Copy link
Contributor

cc @sodonnel @szetszwo

@szetszwo
Copy link
Contributor

FYI, the Ratis pre-vote feature could avoid infinite leader election; see https://issues.apache.org/jira/browse/RATIS-993 .

Of course, it is good to fix the underlying problem, i.e. updating the IP address.

@adoroszlai
Copy link
Contributor

ozone-topology has some failure related to hostnames, e.g.:

Run printTopology -o                                                  | FAIL |
'State = HEALTHY
Location: /rack1
 10.5.0.6(22d8fdd5a2e3) IN_SERVICE
 10.5.0.5(8229bd2d139a) IN_SERVICE
 10.5.0.4(a5cb3f4bf7b9) IN_SERVICE
Location: /rack2
 10.5.0.9(c6c0ab11d238) IN_SERVICE
 10.5.0.7(7fef480ed13c) IN_SERVICE
 10.5.0.8(d6c42b5c0caa) IN_SERVICE' does not contain '10.5.0.7(ozone-topology_datanode_4_1.ozone-topology_net) IN_SERVICE'

Can you please check?

https://github.com/apache/ozone/runs/5572880239#step:5:557

@sokui
Copy link
Contributor Author

sokui commented Mar 30, 2022

@adoroszlai , do you know how can I run these tests locally? I am not familiar with it. Thanks

@adoroszlai
Copy link
Contributor

@sokui You can run this acceptance test (Robot tests in Docker Compose-based environment) locally by:

mvn -DskipTests clean package
cd hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone-topology
./test.sh

@sokui
Copy link
Contributor Author

sokui commented Mar 31, 2022

Thanks @adoroszlai . I believe the test is fixed now.

@adoroszlai
Copy link
Contributor

Thanks @sokui for fixing the test. There are some checkstyle problems, can you please fix those, too?

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManagerImpl.java
 378: First sentence should end with a period.
 384: Line is longer than 80 characters (found 87).
 389: Line is longer than 80 characters (found 89).
 407: Line is longer than 80 characters (found 88).
 408: Line is longer than 80 characters (found 88).

https://github.com/apache/ozone/runs/5769164574#step:5:408

@sokui
Copy link
Contributor Author

sokui commented Mar 31, 2022

@adoroszlai done.

@@ -96,7 +100,13 @@ public static UUID toDatanodeId(RaftProtos.RaftPeerProto peerId) {
}

private static String toRaftPeerAddress(DatanodeDetails id, Port.Name port) {
return id.getIpAddress() + ":" + id.getPort(port).getValue();
if (datanodeUseHostName()) {
LOG.debug("Datanode is using hostname for raft peer address");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well print the actual value calculated in the debug log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@@ -125,7 +125,7 @@ private void persistContainerDatanodeDetails() {
File idPath = new File(dataNodeIDPath);
DatanodeDetails datanodeDetails = this.context.getParent()
.getDatanodeDetails();
if (datanodeDetails != null && !idPath.exists()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for dropping this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because when the datanode got restarted in k8s, the IP will be changed. So the original info in this file is not accurate any more. This will make sure we update with the latest info.

And when we are not using k8s, I think it is not harmful to always update this file whenever the node restarts.

// The field parent in DatanodeDetails class has the circular reference
// which will result in Gson infinite recursive parsing. We need to exclude
// this field when generating json string for DatanodeDetails object
static class DatanodeDetailsGsonExclusionStrategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can be merged as a quick PR and not wait on this PR.


if (datanodeDetails.getPersistedOpState()
!= HddsProtos.NodeOperationalState.IN_SERVICE) {
decommissionManager.continueAdminForNode(datanodeDetails);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to depend on the guarantees of continueAdminForNode (need to update javadoc for continueAdminForNode) and always call that method here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continueAdminForNode implements the logic for when the dn should be monitored. Let's not replicate it.

  public synchronized void continueAdminForNode(DatanodeDetails dn)
      throws NodeNotFoundException {
    if (!scmContext.isLeader()) {
      LOG.info("follower SCM ignored continue admin for datanode {}", dn);
      return;
    }
    NodeOperationalState opState = getNodeStatus(dn).getOperationalState();
    if (opState == NodeOperationalState.DECOMMISSIONING
        || opState == NodeOperationalState.ENTERING_MAINTENANCE
        || opState == NodeOperationalState.IN_MAINTENANCE) {
      LOG.info("Continue admin for datanode {}", dn);
      monitor.startMonitoring(dn);
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

} catch (NodeNotFoundException e) {
// Should not happen, as the node has just registered to call this event
// handler.
LOG.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log as an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

@kerneltime kerneltime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not completed my review, some minor nits to improve the code. I have to go over the test code.

datanodeDetails.getUuidString(),
datanodeInfo,
datanodeDetails);
if (clusterMap.contains(datanodeInfo)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to implement clusterMap.update(datanodeDetails). This would keep the locking and concurrency issues in check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 426 to 427
removeEntryFromDnsToUuidMap(oldDnsName);
addEntryToDnsToUuidMap(dnsName, datanodeDetails.getUuidString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here better to implement a new method updateEntryInDnsToUuisMap(oldDnsName, dnsName, datanodeDetails.getUuidString)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@sokui
Copy link
Contributor Author

sokui commented Apr 1, 2022

@kerneltime Your comments are addressed

@kerneltime
Copy link
Contributor

LGTM

@adoroszlai
Copy link
Contributor

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/ratis/RatisHelper.java
 105: Line is longer than 80 characters (found 82).
 106: Line is longer than 80 characters (found 81).
 109: Line is longer than 80 characters (found 83).
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/net/NetworkTopologyImpl.java
 120: Line is longer than 80 characters (found 81).
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeIpOrHostnameUpdateHandler.java
 42: Line is longer than 80 characters (found 83).

@sokui
Copy link
Contributor Author

sokui commented May 9, 2022

For this test failure: “build-branch / integration (flaky) (pull_request)“, does it require to pass? I check the flaky annotation, which has the following definitions:

/**
 * Annotation to mark test classes or methods with some intermittent failures.
 * These are handled separately from the normal tests.  (Not required to pass,
 * may be repeated automatically, etc.)
 */

@sokui
Copy link
Contributor Author

sokui commented Jun 1, 2022

@adoroszlai addressed all the PR comments. Pls have another look. Thanks

@adoroszlai
Copy link
Contributor

Thanks @sokui for updating the patch. LGTM, but let's wait for another review.

@GeorgeJahad
Copy link
Contributor

FWIW @adoroszlai I'm very interested in the PR and will try to review it in the next few days.

@@ -64,7 +65,8 @@ public class BackgroundPipelineCreator implements SCMService {
* SCMService related variables.
* 1) after leaving safe mode, BackgroundPipelineCreator needs to
* wait for a while before really take effect.
* 2) NewNodeHandler, NonHealthyToHealthyNodeHandler, PreCheckComplete
* 2) NewNodeHandler, NodeIpOrHostnameUpdateHandler,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be "NodeAddressUpdateHandler" instead of "NodeIpOrHostnameUpdateHandler"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Let me fix this.

@@ -28,7 +28,19 @@ regenerate_resources

start_k8s_env

execute_robot_test scm-0 smoketest/basic/basic.robot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still want to run this test, in addition to those below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adoroszlai The change for this file is by cherry picking of your commits. Could you pls let me know why you delete this line before? If we need it back, I can simply put it back. Just want to know your consideration. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK either way.

basic.robot has two tests:

  • HTTP request for static web resource
  • Freon key generation/validation

The latter (Freon) is performed by the new code, too, so only the web test is missing.

* @param datanodeDetails new datanodeDetails
*/
@Override
public void closeStalePipelines(DatanodeDetails datanodeDetails) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a unit test for this. Is it not worth it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a unit test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

"Datanodes may be used up. Try to see if any pipeline is in " +
"ALLOCATED state, and then will wait for it to be OPEN",
repConfig, se);
List<Pipeline> allocatedPipelines = findPipelinesByState(repConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't appear as if the test code covers waiting for pipelines to open. Does it need to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean testing waitOnePipelineReady() method? This method involves the timer and multi thread, so I think it is not a good candidate for unit test. For integration test, I do not know how to set it up and test it through. Any suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sokui i was thinking of something like this:
sokui/ozone@HDDS-5916-support-datanode-change-ip-hostname...GeorgeJahad:gbjAllocateTest2

Feel free to ignore it if you don't like it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually do not use sleep() in unit tests, because it may cause the unit test unreliable. But for this test, we have no other better way to do it, and the sleep() here should not make the test case unstable. Let me include your commit. Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the hard work you've put into this PR. We are very interested in it.

@GeorgeJahad
Copy link
Contributor

LGTM

InetAddress dnAddress = Server.getRemoteIp();
if (dnAddress != null) {
// Mostly called inside an RPC, update ip and peer hostname
datanodeDetails.setHostName(dnAddress.getHostName());
Copy link
Contributor Author

@sokui sokui Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I delete this line, because these days, when I tested it, I found sometimes dnAddress.getHostName() returns IP instead of hostName, which makes the datanode restarting not work. Please let me know if it is OK to delete this line. @GeorgeJahad @adoroszlai

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sokui @adoroszlai I'm nervous about removing the call to setHostName().

I just took around and it seems to get used in many places. I've included some below:

Why does restart not work when it returns the IP string instead of the host string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @GeorgeJahad ,

To not change the old code path, I added the if condition: when useHostname is true, we will not setHostname, but when it is false (old code), we keep the old logic which set the hostname. There are two things I want to explain:

  1. When datanode fist register with scm, the datanodeDetails already contains hostName. Here the code we are talking about is just to reset the datanodeDetails.hostName. So the code you listed above won't return null if we remove this line of code datanodeDetails.setHostName(dnAddress.getHostName());.

  2. Why it doesn't work when we reset datanodeDetails.hostName when useHostname is true? this is because in k8s, when datanode first registered with scm, dnAddress.getHostName() may return IP instead of hostName (maybe because of k8s DNS lookup service delay, I am not exactly sure). this will result in the IP instead of hostName is used in datanode Ratis communication for Pipelines. When datanode gets restarted with different IP, then the Ratis communication with old IP throws the HostNotFoundException. But if we remove this line, then we are sure that the datanodeDetails.hostName always contains the hostname instead of the IP. So it won't have the Ratis communication problem.

This is the whole story. That's why now I keep the old code path same, but if useHostname is true, we won't do datanodeDetails.setHostName(dnAddress.getHostName()); in the register process. Please let me know if it makes sense to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When datanode fist register with scm, the datanodeDetails already contains hostName. Here the code we are talking about is just to reset the datanodeDetails.hostName.

If you are sure this is true, then I'm fine with the change.

To not change the old code path, I added the if condition: when useHostname is true, we will not setHostname, but when it is false (old code),

I'm confused about this statement. The old code path is when "(!isNodeRegistered(datanodeDetails))" is true, isn't it? not when "(!useHostname)" is true? what am I missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old codepath means current master, before this PR (or when DFS_DATANODE_USE_DN_HOSTNAME is not enabled).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion. The old path is the current master. So I made the following change:

from

if (dnAddress != null) {
        // Mostly called inside an RPC, update ip and peer hostname
        datanodeDetails.setHostName(dnAddress.getHostName());
        ...
}

To

if (dnAddress != null) {
        // Mostly called inside an RPC, update ip and peer hostname
        if (!useHostname) {
            datanodeDetails.setHostName(dnAddress.getHostName());
        }
        ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was saying is that even we delete this line datanodeDetails.setHostName(dnAddress.getHostName());, it should be still fine because when datanode register with scm, the datanodeDetails already have the hostName info. But to be conservative, I just use the above logic to make sure when DFS_DATANODE_USE_DN_HOSTNAME is not enabled, the code is the exactly same as before.

@sokui
Copy link
Contributor Author

sokui commented Jun 16, 2022

Added the unit test, and replied the comment. Please have another look. Thank you!

@sokui
Copy link
Contributor Author

sokui commented Jun 17, 2022

I am looking into the verification failures. How can I run a .robot test case locally? For example, I just want to run /opt/hadoop/smoketest/recon/recon-api.robot this test case. @adoroszlai

@adoroszlai
Copy link
Contributor

How can I run a .robot test case locally?

To run acceptance tests in a specific environment (replace ozonesecure with the one you would like to exercise):

mvn -DskipTests clean package
cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure
./test.sh

You can edit test.sh to disable/remove tests you want to skip.

@sokui
Copy link
Contributor Author

sokui commented Jun 17, 2022

@adoroszlai I feel confused. After I pushed the last change (one line change), some test failed with unrelated problem. The test error line seems not match my code. Do you know what's going on?

George Jahad added 2 commits June 17, 2022 15:09
@sokui
Copy link
Contributor Author

sokui commented Jun 19, 2022

@adoroszlai I feel confused. After I pushed the last change (one line change), some test failed with unrelated problem. The test error line seems not match my code. Do you know what's going on?

Hi @adoroszlai ,

When you have time, could you pls take a look at my above question? I think the failed tests in this PR are not relevant to my current code (the reported error lines do not match my code). If the testing relied on a wrong version of the code, could you pls re-trigger the testing? Thank you!

@adoroszlai
Copy link
Contributor

test error line seems not match my code. Do you know what's going on?

@sokui Pull requests are built and tested as if the source branch (your code) was merged into the base branch (master): https://github.com/apache/ozone/runs/6931349232#step:2:488

The compile error in TestSCMNodeManager was caused by recent change on master to use JUnit5 in some tests including this one.

@sokui
Copy link
Contributor Author

sokui commented Jun 21, 2022

test error line seems not match my code. Do you know what's going on?

@sokui Pull requests are built and tested as if the source branch (your code) was merged into the base branch (master): https://github.com/apache/ozone/runs/6931349232#step:2:488

The compile error in TestSCMNodeManager was caused by recent change on master to use JUnit5 in some tests including this one.

Nice. Seems all the tests get passed. Please let me know if we can merge it. Thank you!

@adoroszlai
Copy link
Contributor

@sodonnel @nandakumar131 would you like to take a look?

@adoroszlai adoroszlai merged commit 2ffbfff into apache:master Jun 23, 2022
@adoroszlai
Copy link
Contributor

Thanks @sokui for the patch, @GeorgeJahad, @kerneltime, @Xushaohong for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants