Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-2593. DatanodeAdminMonitor should track under replicated containers and complete the admin workflow accordingly #309

Closed

Conversation

sodonnel
Copy link
Contributor

@sodonnel sodonnel commented Dec 5, 2019

What changes were proposed in this pull request?

This PR builds on HDDS-2459 and chances the DatanodeAdminMonitor to use the newly provided ContainerReplicaCount object to track the under replicated containers on a node and move the node to the next stages in the decommission / maintenance workflow when they are all sufficiently replicated.

Once the replication is completed, the node will be moved to either complete (ending the decommission process) or go into a pending state waiting on maintenance to complete.

This PR is a significant refactor of the DatanodeAdminMonitor, but previously it was incomplete so the bulk of the code is new.

There are also a few minor refactors to other classes:

  • Removed the isHealthy() method from ReplicationManager and put it into ContainerReplicaCounts.
  • Pulled SimpleMockNodeManager into its own class (previously an inner class in TestReplicationManager) and enhanced it somewhat. HDDS-2673 is raised to consider merging it with MockNodeManager
  • ContainerReplicaCount now also takes a container object inorder to implement the isHealthy method.
  • TestDatanodeAdminMonitor now mocks all dependencies to allow for simpler and faster tests, avoiding MiniCluster.

Note that these changes do not consider the new commands which must be sent to the Datanodes to change their admin state as part of HDDS-2592. Those enhancement will be made as part of that Jira.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2593

How was this patch tested?

Various new unit tests have been added.

@sodonnel
Copy link
Contributor Author

sodonnel commented Dec 5, 2019

Single failing unit test is:

[ERROR] testMultipleBlockAllocationWithClosedContainer(org.apache.hadoop.hdds.scm.block.TestBlockManager)  Time elapsed: 2.7 s  <<< ERROR!
org.apache.hadoop.hdds.scm.pipeline.InsufficientDatanodesException: Cannot create pipeline of factor 3 using 2 nodes.
	at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:153)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:58)
	at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.createPipeline(SCMPipelineManager.java:155)
	at org.apache.hadoop.hdds.scm.block.TestBlockManager.testMultipleBlockAllocationWithClosedContainer(TestBlockManager.java:306)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Which is fixed by HDDS-2646 and 6c1a9ff (Addendum HDDS-2646) on master.

@elek elek changed the title HDDS-2593 - DatanodeAdminMonitor should track under replicated containers and complete the admin workflow accordingly HDDS-2593. DatanodeAdminMonitor should track under replicated containers and complete the admin workflow accordingly Dec 6, 2019
Copy link
Member

@elek elek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks the updated @sodonnel

I like the removal of the state machine, I think it's more clean now.

Let me repeated what I mentioned during our offline discussion:

This approach manages the datanode state independent from the node manager and calculates the under/overreplication independent from the replication manager. With (bigger) refactor both can be handled in one place (node manager, replication manager).

I am happy to commit it in this form, but would be investigate further how can it be optimized to do the calculations and the state management in one place.

Can you please check the unit test failure?. It's passed earlier and without green unit test it's not tested with acceptance tests.

@sodonnel
Copy link
Contributor Author

sodonnel commented Dec 12, 2019

The unit test failure is still this one, unrelated to the changes I believe:

[ERROR] testMultipleBlockAllocationWithClosedContainer(org.apache.hadoop.hdds.scm.block.TestBlockManager)  Time elapsed: 2.248 s  <<< ERROR!
org.apache.hadoop.hdds.scm.pipeline.InsufficientDatanodesException: Cannot create pipeline of factor 3 using 2 nodes.
	at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:153)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:58)
	at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.createPipeline(SCMPipelineManager.java:155)
	at org.apache.hadoop.hdds.scm.block.TestBlockManager.testMultipleBlockAllocationWithClosedContainer(TestBlockManager.java:306)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)

It should be fixed when we next merge master into this branch, as I believe that test has been fixed on master now.

I agree there is a chance we could move the calculation into replication manager and track the metrics of under-replicated blocks in the NodeManager, but for now I think we should move forward and commit this and once we have a working decommission solution we can look to optimise and refactor it further.

If you are happy can you please go ahead and commit this change?

S O'Donnell added 4 commits December 12, 2019 21:36
… workly - waiting on containers to replicate, maintenance to end and moving the node to their end date.

Additionally refactor the tests to avoid using MiniCluster and instead use mocks to ensure faster and simpler tests.
…ler invocations in testClosePipelinesEventFiredWhenAdminStarted
…only moved to IN_SERVICE when processed by the monitor. This simplifies the logic in the monitor as it does not need to worry about unprocessed cancelled nodes
@sodonnel
Copy link
Contributor Author

That test is still failing event after merging master to the branch. I though these commits would have fixed it, but I was confusing them with something else:

Addendum HDDS-2646. Start acceptance tests only if at least one THREE pipeline is available (#282)
HDDS-2646. Start acceptance tests only if at least one THREE pipeline is available (#282)

I think this is a problem with the MockNodeManager handing the new operational states on this branch. I will look at it further.

…ainer() which has been failing on decommission branch for some time
elek pushed a commit that referenced this pull request Dec 13, 2019
…ers and complete the admin workflow accordingly

Closes #309
@adoroszlai
Copy link
Contributor

This is committed in af2efa5 to the HDDS-1880-Decom feature branch, but PR would be closed automatically (due to Closes #... commit message) by GitHub only when the commit reaches master. Can we close it manually?

@sodonnel
Copy link
Contributor Author

This is committed in af2efa5 to the HDDS-1880-Decom feature branch, but PR would be closed automatically (due to Closes #... commit message) by GitHub only when the commit reaches master. Can we close it manually?

It is fine with me to close this one. I think @elek merges the PRs manually from the CLI rather than using the button here in github, which is possibly why this PR did not get closed automatically.

@sodonnel sodonnel closed this Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants