what should happen when a node disappears and appears again? #123

Open
mapuri opened this issue May 6, 2016 · 0 comments
mapuri commented May 6, 2016

Right now, when a node disappears and appears again, clusterm updates its state in the inventory; however, no configuration-related action is taken on the node. This is not always desirable, since a reappearing node may need its services set up again.

This issue explores a few options to handle these scenarios. Feel free to pitch in with comments and feedback.

Following are some scenarios in which a node may disappear, and what may happen or be desired when it comes back:

  • Scenario 1: node loses network connectivity (i.e. control-interface traffic is affected)
    • since the node is up, all its services will stay running, but peer services will most likely readjust and assume the host is down.
    • when the node appears again, if nothing has changed then it will get added back to the cluster
    • but it is not guaranteed that the service configuration has not changed in the meantime.
    • some sort of config-checksum equivalent could help here (see the sketch after this list).
    • re-setup (after a config check) of services will ensure that the node is on-boarded back with the correct config
  • Scenario 2: node is rebooted (due to power loss or admin action)
    • the services get stopped on the node
    • when the node boots up again, services need to be re-set up.
    • re-setup of services will ensure that the node is on-boarded back with the correct config
  • Scenario 3: just serf is somehow affected (due to crashes, bugs or admin action):
    • first, this is a service failure and we may be better off debugging serf itself.
    • since the node is up, all services will stay running and the node continues to be part of the cluster
    • when serf recovers, most likely no configuration action is required.
    • again, a config-checksum equivalent could help decide whether anything changed.
    • re-setup of services is most likely not needed, but it would ensure that the node is on-boarded back with the correct config.
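To make the config-checksum idea a bit more concrete, here is a minimal sketch of how a drift check could look as ansible tasks, assuming the checksum of each service's rendered config file is recorded at provisioning time. The path and the `expected_etcd_conf_sha` variable are hypothetical, not something clusterm has today:

```yaml
# Hypothetical drift check: compare the checksum of a service's current
# config file against the checksum recorded at provisioning time.
# `expected_etcd_conf_sha` and the config path are illustrative assumptions.
- name: compute checksum of current etcd config
  stat:
    path: /etc/etcd/etcd.conf
    checksum_algorithm: sha256
  register: etcd_conf

- name: flag etcd for re-setup if its config drifted
  set_fact:
    etcd_needs_resetup: "{{ not etcd_conf.stat.exists or
                            etcd_conf.stat.checksum != expected_etcd_conf_sha }}"
```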

Note: there is also a time window (especially in Scenario 3 above) when the node appears down in the monitoring system but is still reachable from a service perspective. We will ignore this case for now, as we would rather debug and fix the condition that caused it. The side effect also doesn't seem too bad, unless a service configuration change is desired during this window.

What follows is a high-level proposal for possible ansible and clusterm enhancements to address the node disappearance and reappearance scenarios:

Configuration check (ansible tasks and plays):

  • We could add per-service config-check tasks that verify the service is configured in the desired fashion (see the sketch after this list). Some examples:
    • etcd:
      • a master node could check that the etcd service is running in master mode
      • a worker node could check that the etcd service is running in proxy mode
      • etcd ports are up to date ... and so on
    • contiv_network:
      • a master node shall be running the netmaster service in addition to netplugin
      • a worker node shall be running just netplugin
      • netplugin/netmaster ports are up to date
    • if any check fails, a service re-setup shall be triggered (see next section)
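As a rough illustration, a config-check task for etcd on a worker node could look like the sketch below. The config path, the `ETCD_PROXY` key, and the `service-worker` group name are assumptions about the deployment layout, not confirmed details:

```yaml
# Hypothetical config-check tasks for etcd; paths, the ETCD_PROXY key,
# and the "service-worker" group name are illustrative assumptions.
- name: check that the etcd service is active
  command: systemctl is-active etcd
  register: etcd_active
  changed_when: false
  failed_when: false

- name: check that etcd runs in proxy mode on workers
  command: grep -q '^ETCD_PROXY=on' /etc/etcd/etcd.conf
  register: etcd_proxy_mode
  changed_when: false
  failed_when: false
  when: "'service-worker' in group_names"

- name: record the overall check result
  set_fact:
    etcd_check_passed: "{{ etcd_active.rc == 0 and
                           (etcd_proxy_mode.rc | default(0)) == 0 }}"
```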

Service Re-setup (clusterm discovered event handler):

  • a service re-setup is triggered when a configuration check fails
  • re-setup would involve running cleanup.yml followed by regular provisioning based on the node's host-group (sketched below).
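A minimal sketch of the play sequence the discovered-event handler could drive is below. cleanup.yml is the existing cleanup play; using site.yml as the regular provisioning entry point, and limiting the run to the reappearing node, are assumptions:

```yaml
# resetup.yml -- hypothetical entry point for service re-setup.
# clusterm's discovered-event handler could invoke it as, e.g.:
#   ansible-playbook resetup.yml --limit <reappearing-node>

# 1. tear down any stale service state on the node
- import_playbook: cleanup.yml

# 2. re-run regular provisioning; the plays select tasks based on the
#    node's host-group membership (the site.yml name is an assumption)
- import_playbook: site.yml
```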
mapuri added this to the 0.2 milestone May 7, 2016