
HDDS-2329 Destroy pipelines on any decommission or maintenance nodes #86

Closed
Wants to merge 5 commits

Conversation

sodonnel
Contributor

What changes were proposed in this pull request?

When a node is marked for decommission or maintenance, the first step in taking the node out of service is to destroy any pipelines the node is involved in and confirm they have been destroyed before getting the container list for the node.

This commit adds a new class, DatanodeAdminMonitor, which is responsible for tracking nodes as they go through the decommission workflow.

When a node is marked for decommission, it is added to a queue in this monitor. The monitor runs periodically (every 30 seconds by default) and processes any queued nodes. After processing, they are tracked inside the monitor as the decommission workflow progresses (closing pipelines, getting the container list, replicating the containers, etc).

With this commit, a node can be added to the monitor for decommission or maintenance and it will have its pipelines closed.

It will not make any further progress after the pipelines have been closed; further commits will address the next states.
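The queue-and-track flow described above could be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual patch: the class shape, method names (startMonitoring, closePipelines, getTrackedNodeCount) and the use of plain strings for datanodes are assumptions; the real monitor works with richer datanode detail objects and fires SCM events.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified sketch of the monitor described above; the real
// class in the patch tracks datanode detail objects and fires SCM events.
public class DatanodeAdminMonitorSketch implements Runnable {

  private final Queue<String> pendingNodes = new ArrayDeque<>();
  private final Set<String> trackedNodes = new HashSet<>();

  // Called when an admin marks a node for decommission or maintenance.
  public synchronized void startMonitoring(String datanode) {
    pendingNodes.add(datanode);
  }

  // Invoked periodically (every 30 seconds by default, per the description).
  @Override
  public synchronized void run() {
    // Move newly queued nodes into the tracked set.
    while (!pendingNodes.isEmpty()) {
      trackedNodes.add(pendingNodes.poll());
    }
    for (String dn : trackedNodes) {
      // First workflow step implemented by this PR: close any pipelines
      // the node participates in. Later steps (getting the container list,
      // replicating containers) come in follow-up commits.
      closePipelines(dn);
    }
  }

  private void closePipelines(String dn) {
    // Placeholder for the event that closes the node's pipelines.
  }

  public synchronized int getTrackedNodeCount() {
    return trackedNodes.size();
  }
}
```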

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-2329

How was this patch tested?

Some manual tests and new unit tests have been added.

S O'Donnell added 2 commits October 25, 2019 19:06
… moved through the decommission workflow.

With this commit, a node can be added to the monitor for decommission or maintenance and it will have its pipelines closed.

It will not make any further progress after the pipelines have been closed and further commits will address the next states.
Contributor

@anuengineer anuengineer left a comment


I am +1 on this patch. @nandakumar131 , @xiaoyuyao, @elek Just flagging since this is an important patch. I will commit this on Monday evening PST if there are no further comments. Thanks

* the following happens:
*
* 1. First an event is fired to close any pipelines on the node, which will
* also close any contaners.
Contributor


typo: contaners -> containers

Contributor Author


Fixed.

* into maintenance.
*/
public enum States {
CLOSE_PIPELINES, GET_CONTAINERS, REPLICATE_CONTAINERS,
Contributor


Do you want to add an integer to this enum, so you can actually annotate which enum value is first, second, etc.?
At this point it is implicit, I suppose?

Contributor


Also we might want to add a detailed ASCII-based state machine diagram for future readers.

Contributor Author


I added sequenceNumber to indicate the ordering. ASCII diagram is a good idea, but I will leave it out until we are sure there are no more states or this flow may change before decommission is finished.
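The sequenceNumber addition might look something like the sketch below. The field name follows the comment above, but the constructor shape and the accessor name are assumptions, not the actual patch code.

```java
// Sketch of the States enum with an explicit ordering field; the
// sequenceNumber field name comes from the discussion above, the rest
// is an assumption.
public enum States {
  CLOSE_PIPELINES(1),
  GET_CONTAINERS(2),
  REPLICATE_CONTAINERS(3);

  private final int sequenceNumber;

  States(int sequenceNumber) {
    this.sequenceNumber = sequenceNumber;
  }

  // Makes the workflow ordering explicit rather than relying on the
  // implicit declaration order of the constants.
  public int getSequenceNumber() {
    return sequenceNumber;
  }
}
```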

while(!cancelledNodes.isEmpty()) {
DatanodeAdminNodeDetails dn = cancelledNodes.poll();
trackedNodes.remove(dn);
// TODO - fire event to bring node back into service?
Contributor


I don't think this is needed, since if the DN starts HB-ing we should be ok, is what I think. @nandakumar131, any comments?

Contributor Author


I will leave this TODO in place for now, and we can review the "bring back into service" in more detail later as we progress with future patches.
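The cancellation path under discussion could be sketched as below. This is a hypothetical simplification: the real code uses DatanodeAdminNodeDetails objects, while plain strings stand in here, and the helper names (track, cancel, isTracked) are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Simplified sketch of the cancellation handling quoted in the diff above.
public class CancelSketch {
  private final Queue<String> cancelledNodes = new ArrayDeque<>();
  private final Set<String> trackedNodes = new HashSet<>();

  public void track(String dn) {
    trackedNodes.add(dn);
  }

  public void cancel(String dn) {
    cancelledNodes.add(dn);
  }

  // Mirrors the loop in the diff: drain the cancelled queue and stop
  // tracking each node. Whether an explicit "back into service" event is
  // needed is left as a TODO in the patch; the reviewers suggest the
  // node's own heartbeats may be enough to bring it back.
  public void processCancelledNodes() {
    while (!cancelledNodes.isEmpty()) {
      String dn = cancelledNodes.poll();
      trackedNodes.remove(dn);
      // TODO - fire event to bring node back into service?
    }
  }

  public boolean isTracked(String dn) {
    return trackedNodes.contains(dn);
  }
}
```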

@@ -2501,4 +2501,15 @@
The number of Recon Tasks that are waiting on updates from OM.
</description>
</property>
<property>
Contributor


I am going to commit this, but we have a different way of writing configs. I missed it earlier. I will file a follow-up JIRA to clean this up.
