-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-2593. DatanodeAdminMonitor should track under replicated containers and complete the admin workflow accordingly #309
Conversation
Single failing unit test is:
Which is fixed by HDDS-2646 and 6c1a9ff (Addendum HDDS-2646) on master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks the updated @sodonnel
I like the removal of the state machine, I think it's more clean now.
Let me repeated what I mentioned during our offline discussion:
This approach manages the datanode state independent from the node manager and calculates the under/overreplication independent from the replication manager. With (bigger) refactor both can be handled in one place (node manager, replication manager).
I am happy to commit it in this form, but would be investigate further how can it be optimized to do the calculations and the state management in one place.
Can you please check the unit test failure?. It's passed earlier and without green unit test it's not tested with acceptance tests.
The unit test failure is still this one, unrelated to the changes I believe:
It should be fixed when we next merge master into this branch, as I believe that test has been fixed on master now. I agree there is a chance we could move the calculation into replication manager and track the metrics of under-replicated blocks in the NodeManager, but for now I think we should move forward and commit this and once we have a working decommission solution we can look to optimise and refactor it further. If you are happy can you please go ahead and commit this change? |
… workly - waiting on containers to replicate, maintenance to end and moving the node to their end date. Additionally refactor the tests to avoid using MiniCluster and instead use mocks to ensure faster and simpler tests.
…ler invocations in testClosePipelinesEventFiredWhenAdminStarted
…only moved to IN_SERVICE when processed by the monitor. This simplifies the logic in the monitor as it does not need to worry about unprocessed cancelled nodes
110c457
to
0ca851a
Compare
That test is still failing event after merging master to the branch. I though these commits would have fixed it, but I was confusing them with something else:
I think this is a problem with the MockNodeManager handing the new operational states on this branch. I will look at it further. |
…ainer() which has been failing on decommission branch for some time
…ers and complete the admin workflow accordingly Closes #309
This is committed in af2efa5 to the |
It is fine with me to close this one. I think @elek merges the PRs manually from the CLI rather than using the button here in github, which is possibly why this PR did not get closed automatically. |
What changes were proposed in this pull request?
This PR builds on HDDS-2459 and chances the DatanodeAdminMonitor to use the newly provided ContainerReplicaCount object to track the under replicated containers on a node and move the node to the next stages in the decommission / maintenance workflow when they are all sufficiently replicated.
Once the replication is completed, the node will be moved to either complete (ending the decommission process) or go into a pending state waiting on maintenance to complete.
This PR is a significant refactor of the DatanodeAdminMonitor, but previously it was incomplete so the bulk of the code is new.
There are also a few minor refactors to other classes:
Note that these changes do not consider the new commands which must be sent to the Datanodes to change their admin state as part of HDDS-2592. Those enhancement will be made as part of that Jira.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2593
How was this patch tested?
Various new unit tests have been added.