Skip to content

Comments

HDDS-5437. Move dead DN out of topology#2435

Merged
sodonnel merged 17 commits intoapache:masterfrom
JacksonYao287:HDDS-5437
Aug 18, 2021
Merged

HDDS-5437. Move dead DN out of topology#2435
sodonnel merged 17 commits intoapache:masterfrom
JacksonYao287:HDDS-5437

Conversation

@JacksonYao287
Copy link
Contributor

What changes were proposed in this pull request?

Move dead DN out of topology

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5437

How was this patch tested?

unit test

@JacksonYao287 JacksonYao287 changed the title Move dead DN out of topology HDDS-5437. Move dead DN out of topology Jul 19, 2021
@JacksonYao287 JacksonYao287 force-pushed the HDDS-5437 branch 2 times, most recently from 1eb45d2 to 4ff0f0e Compare July 19, 2021 13:48
@JacksonYao287
Copy link
Contributor Author

@ChenSammi @sodonnel PTAL, thanks!

Copy link
Contributor

@dineshchitlangia dineshchitlangia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChenSammi
Copy link
Contributor

Thanks @JacksonYao287 for working on this.

When a DN registered, the DatanodeDetails instance which presents DN is added into both Topology tree and nodeStateManager. nodeStateManager is used more frequently than Topology to get a DN's info. Would you add a Precondition check that

  1. after DN is removed from topology, DatanodeDetails instance returned from nodeStateManager has no parent.
  2. after DN is added back into topology, DatanodeDetails instance returned from nodeStateManager has parent correctly set.

And would you add a new UT for it.

@JacksonYao287
Copy link
Contributor Author

thanks @ChenSammi for the review!

Precondition check that
after DN is removed from topology, DatanodeDetails instance returned from nodeStateManager has no parent.
after DN is added back into topology, DatanodeDetails instance returned from nodeStateManager has parent correctly set.
And would you add a new UT for it.

i will make this change

@JacksonYao287
Copy link
Contributor Author

@JacksonYao287 LGTM

thank @dineshchitlangia for the review!

@JacksonYao287
Copy link
Contributor Author

@ChenSammi PTAL, thanks!


// First set the node to IN_MAINTENANCE and ensure the container replicas
// are not removed on the dead event
datanode1 = nodeManager.getNodeByUuid(datanode1.getUuidString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this change test this feature? I think we need a test (or extend this test) to ensure the node is removed from the topology when the dead handler is called and another test to ensure it is put back when the healthy event is fired.

Copy link
Contributor Author

@JacksonYao287 JacksonYao287 Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sodonnel for the review!

How does this change test this feature?

      for(DatanodeInfo node : nodeStateMap.getAllDatanodeInfos()) {
              .....
        case STALE:
          // Move the node to DEAD if the last heartbeat time is less than
          // configured dead-node interval.
          updateNodeState(node, deadNodeCondition, status,
              NodeLifeCycleEvent.TIMEOUT);
             ......
      }

this is the current code in NodeStateManager#checkNodesHealth, and it is the only place where a DEAD event is fired and DeadNodeHandler is called. So we can see here when judging the state of all the nodes , we get all the DatanodeInfos from nodeStateMap directly. when a datonode registers itself to scm, it will be added to nodeStateMap. but let us see nodeStateMap#add.

  public void addNode(......)
       ...........
      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus,
          layoutInfo));
       ............
  }

we can see that a new object is created here and added to the map, so the point here is that, the registered DatanodeDetails is not the one which is stored in nodeMap and got by DeadNodeHandler when DEAD event is fired . when running test , it will lead to the failure of Preconditions.checkState in DeadNodeHandler. so i think the current test code does not notice this , and this is why the change here is made.

I think we need a test (or extend this test) to ensure the node is removed from the topology when the dead handler is called and another test to ensure it is put back when the healthy event is fired.

yea , although i have add a Preconditions.checkState in the handlers , i think it`s better off adding this . i will do it

@JacksonYao287 JacksonYao287 requested a review from sodonnel July 29, 2021 03:09
@JacksonYao287
Copy link
Contributor Author

@sodonnel @ChenSammi can you please take a look?

@JacksonYao287
Copy link
Contributor Author

seems a flaky case, not caused by this patch

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, but we need to get a green CI run before committing. Lets wait for #2546 to complete and get merged, then we can rebase this one to ensure it completes all the checks.

@JacksonYao287
Copy link
Contributor Author

thanks @sodonnel for the review

@sodonnel sodonnel merged commit b175d0b into apache:master Aug 18, 2021
@JacksonYao287 JacksonYao287 deleted the HDDS-5437 branch August 18, 2021 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants