HDDS-5916. Datanodes stuck in leader election in Kubernetes #3186

sokui · 2022-03-12T00:36:49Z

What changes were proposed in this pull request?

Make Ozone support datanodes changing their IPs and hostnames (as long as the UUID does not change).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5916

How was this patch tested?

Tested in k8s production with Kerberos enabled. Each datanode is attached to a PVC. Ozone still works well after killing any number of the datanodes (datanodes are rescheduled with different IPs).

kerneltime · 2022-03-12T00:54:57Z

cc @sodonnel @szetszwo

szetszwo · 2022-03-14T18:48:21Z

FYI, the Ratis pre-vote feature could avoid infinite leader election; see https://issues.apache.org/jira/browse/RATIS-993 .

Of course, it is good to fix the underlying problem, i.e. updating the IP address.

adoroszlai · 2022-03-16T18:00:45Z

ozone-topology has some failure related to hostnames, e.g.:

Run printTopology -o                                                  | FAIL |
'State = HEALTHY
Location: /rack1
 10.5.0.6(22d8fdd5a2e3) IN_SERVICE
 10.5.0.5(8229bd2d139a) IN_SERVICE
 10.5.0.4(a5cb3f4bf7b9) IN_SERVICE
Location: /rack2
 10.5.0.9(c6c0ab11d238) IN_SERVICE
 10.5.0.7(7fef480ed13c) IN_SERVICE
 10.5.0.8(d6c42b5c0caa) IN_SERVICE' does not contain '10.5.0.7(ozone-topology_datanode_4_1.ozone-topology_net) IN_SERVICE'

Can you please check?

https://github.com/apache/ozone/runs/5572880239#step:5:557

sokui · 2022-03-30T17:43:19Z

@adoroszlai, do you know how I can run these tests locally? I am not familiar with them. Thanks.

adoroszlai · 2022-03-30T18:00:41Z

@sokui You can run this acceptance test (Robot tests in Docker Compose-based environment) locally by:

mvn -DskipTests clean package
cd hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone-topology
./test.sh

sokui · 2022-03-31T04:20:28Z

Thanks @adoroszlai . I believe the test is fixed now.

adoroszlai · 2022-03-31T13:46:34Z

Thanks @sokui for fixing the test. There are some checkstyle problems, can you please fix those, too?

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManagerImpl.java
 378: First sentence should end with a period.
 384: Line is longer than 80 characters (found 87).
 389: Line is longer than 80 characters (found 89).
 407: Line is longer than 80 characters (found 88).
 408: Line is longer than 80 characters (found 88).

https://github.com/apache/ozone/runs/5769164574#step:5:408

sokui · 2022-03-31T16:24:20Z

@adoroszlai done.

kerneltime · 2022-04-01T19:03:38Z

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/ratis/RatisHelper.java

@@ -96,7 +100,13 @@ public static UUID toDatanodeId(RaftProtos.RaftPeerProto peerId) {
  }

  private static String toRaftPeerAddress(DatanodeDetails id, Port.Name port) {
-    return id.getIpAddress() + ":" + id.getPort(port).getValue();
+    if (datanodeUseHostName()) {
+      LOG.debug("Datanode is using hostname for raft peer address");


Might as well print the actual value calculated in the debug log.
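
A minimal sketch of what this suggestion could look like, under simplifying assumptions: the real method takes a DatanodeDetails and a Port.Name and Ozone logs through SLF4J, whereas this self-contained version takes plain strings and uses java.util.logging. The class name RaftPeerAddressSketch and its parameters are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for RatisHelper; not the actual Ozone class.
final class RaftPeerAddressSketch {
  private static final Logger LOG =
      Logger.getLogger(RaftPeerAddressSketch.class.getName());

  private RaftPeerAddressSketch() {
  }

  // Builds "host:port", choosing the hostname or the IP address, and logs
  // the value that was actually computed rather than a fixed message.
  static String toRaftPeerAddress(String hostName, String ipAddress,
      int port, boolean useHostName) {
    String host = useHostName ? hostName : ipAddress;
    String address = host + ":" + port;
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("Raft peer address is " + address
          + " (useHostName=" + useHostName + ")");
    }
    return address;
  }

  public static void main(String[] args) {
    System.out.println(
        toRaftPeerAddress("dn-0.ozone", "10.5.0.4", 9858, true));
    System.out.println(
        toRaftPeerAddress("dn-0.ozone", "10.5.0.4", 9858, false));
  }
}
```

Logging the computed address makes it possible to tell from the debug log alone whether a peer was registered by hostname or by IP.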

kerneltime · 2022-04-01T19:11:17Z

...rc/main/java/org/apache/hadoop/ozone/container/common/states/datanode/InitDatanodeState.java

@@ -125,7 +125,7 @@ private void persistContainerDatanodeDetails() {
    File idPath = new File(dataNodeIDPath);
    DatanodeDetails datanodeDetails = this.context.getParent()
        .getDatanodeDetails();
-    if (datanodeDetails != null && !idPath.exists()) {


What's the motivation for dropping this check?

When a datanode is restarted in k8s, its IP changes, so the original info in this file is no longer accurate. Dropping the check makes sure the file is always updated with the latest info.

And when we are not using k8s, I think it is not harmful to always update this file whenever the node restarts.

kerneltime · 2022-04-01T19:33:37Z

hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/events/EventQueue.java

+  // The field parent in DatanodeDetails class has the circular reference
+  // which will result in Gson infinite recursive parsing. We need to exclude
+  // this field when generating json string for DatanodeDetails object
+  static class DatanodeDetailsGsonExclusionStrategy


This change can be merged as a quick PR and not wait on this PR.

kerneltime · 2022-04-01T20:02:53Z

.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeIpOrHostnameUpdateHandler.java

+
+      if (datanodeDetails.getPersistedOpState()
+              != HddsProtos.NodeOperationalState.IN_SERVICE) {
+        decommissionManager.continueAdminForNode(datanodeDetails);


It might be better to depend on the guarantees of continueAdminForNode (need to update javadoc for continueAdminForNode) and always call that method here.

continueAdminForNode implements the logic for when the dn should be monitored. Let's not replicate it.

public synchronized void continueAdminForNode(DatanodeDetails dn)
    throws NodeNotFoundException {
  if (!scmContext.isLeader()) {
    LOG.info("follower SCM ignored continue admin for datanode {}", dn);
    return;
  }
  NodeOperationalState opState = getNodeStatus(dn).getOperationalState();
  if (opState == NodeOperationalState.DECOMMISSIONING
      || opState == NodeOperationalState.ENTERING_MAINTENANCE
      || opState == NodeOperationalState.IN_MAINTENANCE) {
    LOG.info("Continue admin for datanode {}", dn);
    monitor.startMonitoring(dn);
  }
}

kerneltime · 2022-04-01T20:03:35Z

.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeIpOrHostnameUpdateHandler.java

+    } catch (NodeNotFoundException e) {
+      // Should not happen, as the node has just registered to call this event
+      // handler.
+      LOG.warn(


Log as an error.

kerneltime

I have not completed my review, some minor nits to improve the code. I have to go over the test code.

kerneltime · 2022-04-01T20:52:50Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java

+                  datanodeDetails.getUuidString(),
+                  datanodeInfo,
+                  datanodeDetails);
+          if (clusterMap.contains(datanodeInfo)) {


It might be better to implement clusterMap.update(datanodeDetails). This would keep the locking and concurrency issues in check.
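
A rough sketch of the idea, assuming a much simpler topology than Ozone's: a single update method that does the check and the replacement under one write lock, so callers cannot interleave a contains check with a separate remove and add. ClusterMapSketch, its UUID-to-address map, and its method signatures are hypothetical and do not reflect the real NetworkTopology API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for a cluster map keyed by node UUID.
final class ClusterMapSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> addressByUuid = new HashMap<>();

  void add(String uuid, String address) {
    lock.writeLock().lock();
    try {
      addressByUuid.put(uuid, address);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Replaces the address of a known node in one critical section.
  // Returns false, changing nothing, if the node is not in the map.
  boolean update(String uuid, String newAddress) {
    lock.writeLock().lock();
    try {
      if (!addressByUuid.containsKey(uuid)) {
        return false;
      }
      addressByUuid.put(uuid, newAddress);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  String addressOf(String uuid) {
    lock.readLock().lock();
    try {
      return addressByUuid.get(uuid);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```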

kerneltime · 2022-04-01T20:54:34Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java

+          removeEntryFromDnsToUuidMap(oldDnsName);
+          addEntryToDnsToUuidMap(dnsName, datanodeDetails.getUuidString());


Same here: better to implement a new method updateEntryInDnsToUuidMap(oldDnsName, dnsName, datanodeDetails.getUuidString()).
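
One way such a combined method could look, as a sketch only: the remove and the add happen inside a single synchronized method, so no reader sees the UUID missing from both names. The class DnsToUuidMapSketch, the map layout (DNS name or IP to a set of UUID strings), and the method names are assumptions for illustration, not the actual SCMNodeManager code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the DNS-name-to-UUID bookkeeping.
final class DnsToUuidMapSketch {
  private final Map<String, Set<String>> dnsToUuidMap = new HashMap<>();

  synchronized void addEntry(String dnsName, String uuid) {
    dnsToUuidMap.computeIfAbsent(dnsName, k -> new HashSet<>()).add(uuid);
  }

  // Moves the UUID from the old name to the new name in one step.
  synchronized void updateEntry(String oldDnsName, String newDnsName,
      String uuid) {
    Set<String> oldUuids = dnsToUuidMap.get(oldDnsName);
    if (oldUuids != null) {
      oldUuids.remove(uuid);
      if (oldUuids.isEmpty()) {
        // Drop the key so stale names do not accumulate.
        dnsToUuidMap.remove(oldDnsName);
      }
    }
    dnsToUuidMap.computeIfAbsent(newDnsName, k -> new HashSet<>()).add(uuid);
  }

  // Returns a copy so callers cannot mutate internal state.
  synchronized Set<String> getUuids(String dnsName) {
    Set<String> uuids = dnsToUuidMap.get(dnsName);
    return uuids == null ? Collections.emptySet() : new HashSet<>(uuids);
  }
}
```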

sokui · 2022-04-01T22:35:59Z

@kerneltime Your comments are addressed

kerneltime · 2022-04-19T20:51:55Z

LGTM

adoroszlai · 2022-05-07T21:15:14Z

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/ratis/RatisHelper.java
 105: Line is longer than 80 characters (found 82).
 106: Line is longer than 80 characters (found 81).
 109: Line is longer than 80 characters (found 83).
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/net/NetworkTopologyImpl.java
 120: Line is longer than 80 characters (found 81).
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeIpOrHostnameUpdateHandler.java
 42: Line is longer than 80 characters (found 83).

sokui · 2022-05-09T19:35:59Z

For the failing check "build-branch / integration (flaky) (pull_request)": is it required to pass? I checked the flaky annotation, which has the following definition:

/**
 * Annotation to mark test classes or methods with some intermittent failures.
 * These are handled separately from the normal tests.  (Not required to pass,
 * may be repeated automatically, etc.)
 */

sokui · 2022-06-01T22:07:21Z

@adoroszlai addressed all the PR comments. Pls have another look. Thanks

adoroszlai · 2022-06-02T11:13:19Z

Thanks @sokui for updating the patch. LGTM, but let's wait for another review.

GeorgeJahad · 2022-06-07T16:22:40Z

FWIW @adoroszlai I'm very interested in the PR and will try to review it in the next few days.

GeorgeJahad · 2022-06-10T23:11:33Z

.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/BackgroundPipelineCreator.java

@@ -64,7 +65,8 @@ public class BackgroundPipelineCreator implements SCMService {
   * SCMService related variables.
   * 1) after leaving safe mode, BackgroundPipelineCreator needs to
   *    wait for a while before really take effect.
-   * 2) NewNodeHandler, NonHealthyToHealthyNodeHandler, PreCheckComplete
+   * 2) NewNodeHandler, NodeIpOrHostnameUpdateHandler,


Shouldn't this be "NodeAddressUpdateHandler" instead of "NodeIpOrHostnameUpdateHandler"?

Good catch. Let me fix this.

GeorgeJahad · 2022-06-14T23:05:30Z

hadoop-ozone/dist/src/main/k8s/examples/ozone/test.sh

@@ -28,7 +28,19 @@ regenerate_resources

 start_k8s_env

-execute_robot_test scm-0 smoketest/basic/basic.robot


Don't we still want to run this test, in addition to those below?

@adoroszlai The change to this file comes from cherry-picking your commits. Could you please let me know why you deleted this line? If we need it back, I can simply put it back. I just want to understand your reasoning. Thanks

I'm OK either way.

basic.robot has two tests:

HTTP request for static web resource

Freon key generation/validation

The latter (Freon) is performed by the new code, too, so only the web test is missing.

GeorgeJahad · 2022-06-15T00:58:36Z

...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManagerImpl.java

+   * @param datanodeDetails new datanodeDetails
+   */
+  @Override
+  public void closeStalePipelines(DatanodeDetails datanodeDetails) {


I don't see a unit test for this. Is it not worth it?

Added a unit test

GeorgeJahad · 2022-06-15T01:03:29Z

...er-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/WritableRatisContainerProvider.java

+              "Datanodes may be used up. Try to see if any pipeline is in " +
+                  "ALLOCATED state, and then will wait for it to be OPEN",
+                  repConfig, se);
+          List<Pipeline> allocatedPipelines = findPipelinesByState(repConfig,


It doesn't appear as if the test code covers waiting for pipelines to open. Does it need to?

Do you mean testing the waitOnePipelineReady() method? This method involves a timer and multiple threads, so I think it is not a good candidate for a unit test. For an integration test, I do not know how to set it up and test it end to end. Any suggestions?

@sokui i was thinking of something like this:
sokui/ozone@HDDS-5916-support-datanode-change-ip-hostname...GeorgeJahad:gbjAllocateTest2

Feel free to ignore it if you don't like it.

I usually do not use sleep() in unit tests, because it can make them unreliable. But for this test we have no better way to do it, and the sleep() here should not make the test case unstable. Let me include your commit. Thank you!

Thank you for all the hard work you've put into this PR. We are very interested in it.

GeorgeJahad · 2022-06-15T01:21:52Z

LGTM

sokui · 2022-06-16T05:35:03Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java

-      InetAddress dnAddress = Server.getRemoteIp();
-      if (dnAddress != null) {
-        // Mostly called inside an RPC, update ip and peer hostname
-        datanodeDetails.setHostName(dnAddress.getHostName());


I deleted this line because, when testing over the past few days, I found that dnAddress.getHostName() sometimes returns the IP instead of the hostname, which breaks datanode restarts. Please let me know if it is OK to delete this line. @GeorgeJahad @adoroszlai

@sokui @adoroszlai I'm nervous about removing the call to setHostName().

I just looked around and it seems to get used in many places. I've included some below:

ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientManager.java

Line 259 in f57a019

key += pipeline.getClosestNode().getHostName();

ozone/hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/freon/DatanodeChunkGenerator.java

Line 171 in f57a019

if (datanodeHosts.contains(dn.getHostName())) {

ozone/hadoop-ozone/ozonefs-common/src/main/java/org/apache/hadoop/fs/ozone/BasicOzoneClientAdapterImpl.java

Line 576 in f57a019

hostList.add(dn.getHostName());

Why does restart not work when it returns the IP string instead of the host string?

Hi @GeorgeJahad ,

To avoid changing the old code path, I added an if condition: when useHostname is true, we do not call setHostName, but when it is false (old code), we keep the old logic, which sets the hostname. There are two things I want to explain:

When a datanode first registers with SCM, the datanodeDetails already contains hostName. The code we are talking about here just resets datanodeDetails.hostName. So the code you listed above won't return null if we remove the line datanodeDetails.setHostName(dnAddress.getHostName());.

Why doesn't it work when we reset datanodeDetails.hostName while useHostname is true? In k8s, when a datanode first registers with SCM, dnAddress.getHostName() may return the IP instead of the hostname (maybe because of a delay in the k8s DNS lookup service; I am not exactly sure). As a result, the IP rather than the hostname is used for datanode Ratis communication in pipelines. When the datanode is restarted with a different IP, Ratis communication with the old IP throws the HostNotFoundException. If we remove this line, we can be sure that datanodeDetails.hostName always contains the hostname instead of the IP, so the Ratis communication problem does not occur.

This is the whole story. That's why I now keep the old code path the same, but if useHostname is true, we skip datanodeDetails.setHostName(dnAddress.getHostName()); during registration. Please let me know if it makes sense to you.

When a datanode first registers with SCM, the datanodeDetails already contains hostName. The code we are talking about here just resets datanodeDetails.hostName.

If you are sure this is true, then I'm fine with the change.

To avoid changing the old code path, I added an if condition: when useHostname is true, we do not call setHostName, but when it is false (old code),

I'm confused about this statement. The old code path is when "(!isNodeRegistered(datanodeDetails))" is true, isn't it? not when "(!useHostname)" is true? what am I missing?

Old codepath means current master, before this PR (or when DFS_DATANODE_USE_DN_HOSTNAME is not enabled).

Sorry for the confusion. The old path is the current master. So I made the following change:

from

if (dnAddress != null) {
  // Mostly called inside an RPC, update ip and peer hostname
  datanodeDetails.setHostName(dnAddress.getHostName());
  ...
}

to

if (dnAddress != null) {
  // Mostly called inside an RPC, update ip and peer hostname
  if (!useHostname) {
    datanodeDetails.setHostName(dnAddress.getHostName());
  }
  ...
}

What I was saying is that even if we delete the line datanodeDetails.setHostName(dnAddress.getHostName());, it should still be fine, because when a datanode registers with SCM, the datanodeDetails already has the hostName info. But to be conservative, I use the above logic to make sure that when DFS_DATANODE_USE_DN_HOSTNAME is not enabled, the code behaves exactly as before.
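
The behavior described above can be sketched as follows, with heavy simplification. NodeDetails is a hypothetical stand-in for DatanodeDetails, the remote hostname and IP are passed in as strings instead of being read from an InetAddress, and the part elided as "..." in the snippet is assumed here to update the IP address.

```java
// Hypothetical illustration of the registration-time logic; not Ozone code.
final class RegisterHostnameSketch {

  // Minimal stand-in for DatanodeDetails.
  static final class NodeDetails {
    String hostName;
    String ipAddress;

    NodeDetails(String hostName, String ipAddress) {
      this.hostName = hostName;
      this.ipAddress = ipAddress;
    }
  }

  private RegisterHostnameSketch() {
  }

  // remoteIp == null models a call made outside an RPC (dnAddress == null).
  static void applyRemoteAddress(NodeDetails details, String remoteHostName,
      String remoteIp, boolean useHostname) {
    if (remoteIp == null) {
      return;
    }
    if (!useHostname) {
      // Old code path: trust the hostname resolved from the RPC peer.
      details.hostName = remoteHostName;
    }
    // Assumption: the elided code refreshes the IP in both modes.
    details.ipAddress = remoteIp;
  }

  public static void main(String[] args) {
    NodeDetails dn = new NodeDetails("dn-0.ozone", "10.5.0.4");
    // Reverse lookup failed, so the "hostname" seen by SCM is an IP string.
    applyRemoteAddress(dn, "10.5.0.9", "10.5.0.9", true);
    System.out.println(dn.hostName + " / " + dn.ipAddress);
  }
}
```

With useHostname enabled, the hostname reported by the datanode itself survives even when the reverse lookup on the SCM side yields an IP string.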

sokui · 2022-06-16T05:35:56Z

Added the unit test and replied to the comment. Please have another look. Thank you!

sokui · 2022-06-17T05:20:02Z

I am looking into the verification failures. How can I run a .robot test case locally? For example, I just want to run the test case /opt/hadoop/smoketest/recon/recon-api.robot. @adoroszlai

adoroszlai · 2022-06-17T05:53:49Z

How can I run a .robot test case locally?

To run acceptance tests in a specific environment (replace ozonesecure with the one you would like to exercise):

mvn -DskipTests clean package
cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure
./test.sh

You can edit test.sh to disable/remove tests you want to skip.

sokui · 2022-06-17T21:54:59Z

@adoroszlai I am confused. After I pushed the last change (a one-line change), some tests failed with unrelated problems. The test error line does not seem to match my code. Do you know what's going on?

sokui · 2022-06-19T23:45:31Z

@adoroszlai I am confused. After I pushed the last change (a one-line change), some tests failed with unrelated problems. The test error line does not seem to match my code. Do you know what's going on?

Hi @adoroszlai ,

When you have time, could you pls take a look at my above question? I think the failed tests in this PR are not relevant to my current code (the reported error lines do not match my code). If the testing relied on a wrong version of the code, could you pls re-trigger the testing? Thank you!

adoroszlai · 2022-06-20T08:25:57Z

test error line does not seem to match my code. Do you know what's going on?

@sokui Pull requests are built and tested as if the source branch (your code) was merged into the base branch (master): https://github.com/apache/ozone/runs/6931349232#step:2:488

The compile error in TestSCMNodeManager was caused by recent change on master to use JUnit5 in some tests including this one.

sokui · 2022-06-21T04:53:36Z

test error line does not seem to match my code. Do you know what's going on?

@sokui Pull requests are built and tested as if the source branch (your code) was merged into the base branch (master): https://github.com/apache/ozone/runs/6931349232#step:2:488

The compile error in TestSCMNodeManager was caused by recent change on master to use JUnit5 in some tests including this one.

Nice. It seems all the tests pass now. Please let me know if we can merge it. Thank you!

adoroszlai · 2022-06-22T07:10:06Z

@sodonnel @nandakumar131 would you like to take a look?

adoroszlai · 2022-06-23T19:03:22Z

Thanks @sokui for the patch, @GeorgeJahad, @kerneltime, @Xushaohong for the review.

HDDS-5916: DNs in pipeline raft group get stuck in infinite leader election in Kubernets env

84b56f8

avijayanhwx requested review from adoroszlai and nandakumar131 March 14, 2022 16:18

fix rat and checkstyle errors

cb4adb9

fix findbugs errors

e51aa4b

solve a edge case bug

50ded1c

fix test

5c9addf

fix check style error

a7e47b7

kerneltime reviewed Apr 1, 2022

zhiheng xie added 2 commits April 1, 2022 13:59

address PR comments

71f1b28

address PR comments

d7dbf36

resolve conflicts

9452d81

fix checkstyle error

b89eed1

fix checkstyle error

1a81703

GeorgeJahad reviewed Jun 10, 2022

fix a comment

2c5f812

GeorgeJahad reviewed Jun 14, 2022

GeorgeJahad reviewed Jun 15, 2022

zhiheng xie added 2 commits June 15, 2022 22:30

add more unit test

27f2d76

resovle conflict

7941838

sokui commented Jun 16, 2022

fix test failure

7ce0077

George Jahad added 2 commits June 17, 2022 15:09

working

e742158

cleanup

d8f4822

Merge remote-tracking branch 'origin/master' into HDDS-5916-support-datanode-change-ip-hostname

801e856

adoroszlai approved these changes Jun 22, 2022

adoroszlai merged commit 2ffbfff into apache:master Jun 23, 2022

sokui mentioned this pull request Oct 15, 2022

HDDS-7329. Extend ozone admin datanode usageinfo and list info to accept hostname parameter #3835

Merged
