-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-2329 Destroy pipelines on any decommission or maintenance nodes #86
Conversation
… moved through the decommission workflow. With this commit, a node can be added to the monitor for decommission or maintenace and it will have its pipelines closed. It will not make any further progress after the pipelines have been closed and further commits will address the next states.
29575af
to
d83074f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am +1 on this patch. @nandakumar131 , @xiaoyuyao, @elek Just flagging since this is an important patch. I will commit this on Monday evening PST if there are no further comments. Thanks
* the following happens: | ||
* | ||
* 1. First an event is fired to close any pipelines on the node, which will | ||
* also close any contaners. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: contaners -> containers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
* into maintenance. | ||
*/ | ||
public enum States { | ||
CLOSE_PIPELINES, GET_CONTAINERS, REPLICATE_CONTAINERS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add a integer in this enum, so you can actually annotate which enum is first, second etc.
At this point it is implicit I suppose ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we might want to add a detailed Ascii based state machine diagram for the future readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added sequenceNumber to indicate the ordering. ASCII diagram is a good idea, but I will leave it out until we are sure there are no more states or this flow may change before decommission is finished.
while(!cancelledNodes.isEmpty()) { | ||
DatanodeAdminNodeDetails dn = cancelledNodes.poll(); | ||
trackedNodes.remove(dn); | ||
// TODO - fire event to bring node back into service? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think is needed since if the DN starts HB-ing we should be ok is what I think. @nandakumar131 , any comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave this TODO in place for now, and we can review the "bring back into service" in more detail later as we progress with future patches.
@@ -2501,4 +2501,15 @@ | |||
The number of Recon Tasks that are waiting on updates from OM. | |||
</description> | |||
</property> | |||
<property> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to commit this, but we have different way of writing configs. I missed it earlier. I will file a follow up JIRA to clean this up.
What changes were proposed in this pull request?
When a node is marked for decommission or maintenance, the first step in taking the node out of service is to destroy any pipelines the node is involved in and confirm they have been destroyed before getting the container list for the node.
This commit added a new class called the DatanodeAdminMonitor, which is responsible for tracking nodes as they go through the decommission workflow.
When a node is marked for decommission, it gets added a to a queue in this monitor. The monitor runs periodically (30 seconds by default) and process any queued nodes. After processing they are tracked inside the monitor as they decommission workflow progresses (closing pipelines, getting the container list, replicating the container, etc).
With this commit, a node can be added to the monitor for decommission or maintenace and it will have its pipelines closed.
It will not make any further progress after the pipelines have been closed and further commits will address the next states.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2329
How was this patch tested?
Some manual tests and new unit tests have been added.