Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-2592 Add Datanode command to allow the datanode to persist its admin state #521

Merged
merged 5 commits into from Mar 31, 2020

Conversation

sodonnel
Copy link
Contributor

@sodonnel sodonnel commented Feb 3, 2020

What changes were proposed in this pull request?

When the datanode state is change in SCMNodeManager a Datanode Command is created to transmit the new datanode state to the datanode. The datanode will then persist this state in the "datanodeID.yaml" file.

The current datanode.yaml file simply persists a DatanodeDetails object in YAML format. Therefore two new fields have been added to DatanodeDetails to support this:

persistedOpState,
persistedOpStateExpiryEpochSec

The first is the operational state and the second is the seconds from the epoch the setting should expire (for maintenance mode).

When a datanode starts up, it will read these values and supply them in the heartbeat to SCM.

If SCM notices the DN was registered before, SCM is seen as the source of truth and will update the DN state to that set in SCM if they differ.

If the DN was never registered with SCM (eg if SCM was restarted), then the DN state is seen as the source of truth and SCM will be updated accordingly.

This change also moves the state expiry time into the NodeStatus object, so it now carries operationalState, healthState and State expiry.

Note that the DN does not do anything with the persisted operational state. They are simply persisted and reported in the heartbeat. In this respect the DNs are being used as a form of distributed storage for the operational state so the decommission process can survive a restart.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2592

How was this patch tested?

Tested manually using a docker build. This change still needs unit tests added if the current approach is agreed upon.

S O'Donnell added 5 commits January 13, 2020 13:40
…state to the datanode via the existing command queue
…t the values in the datanode yaml file (datanode.id) and ensure the parameters are sent with registration and the heartbeat
Copy link
Contributor

@anuengineer anuengineer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. I have two minor comments. Feel free to commit once it is taken care off.

persistDatanodeDetails(dni);
} catch (IOException ioe) {
LOG.error("Failed to persist the datanode state", ioe);
// TODO - this should probably be raised, but it will break the command
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add this to the Dn reports if you want to propagate this back into SCM.

@sodonnel
Copy link
Contributor Author

sodonnel commented Feb 4, 2020

@anuengineer Thanks for the review on this. I probably need to add some tests around these changes. I did not do that initially, as I wanted to be sure we were happy with the approach (storing the state in the datanode details, allowing it to be included in the heartbeat and the DN not making further use of the value). Now you have reviewed and are happy with this direction, I will get some tests added and check your comments in more detail.

@elek
Copy link
Member

elek commented Feb 18, 2020

/pending "I will get some tests added and check your comments in more detail."

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

"I will get some tests added and check your comments in more detail."

@sodonnel
Copy link
Contributor Author

sodonnel commented Mar 30, 2020

We have parked the decommission work at the moment. Anu approved this change, but I feel we need some unit tests around the changes.

I have raised HDDS-3303 to log the fact we need some tests. I also raised HDDS-3304 for the "use INFINITY as a constant" suggestion.

As this is only going onto a branch which we will need to review in more detail later, I propose we commit this to the branch as is to get it out of the PR queue and leave the decommission work parked in a good state.

@sodonnel
Copy link
Contributor Author

/ready

@github-actions github-actions bot dismissed their stale review March 30, 2020 20:13

Blocking review request is removed.

@sodonnel sodonnel merged commit d902586 into apache:HDDS-1880-Decom Mar 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants