Skip to content

Agentless Autohealing

DeWayne Filppi edited this page Aug 28, 2018 · 7 revisions

Agentless Autohealing

One of the Cloudify manager's distinctive features is the ability to perform automated day 2+ operational tasks. These typically boil down to analyzing an event stream produced by agents, and reacting by running workflows. Auto-healing is one such task. A metric is identified as a trigger, and when it fails to be delivered to the manager for a period of time, the heal is triggered. This post is about adding a capability to Cloudify that emulates the simpler liveness probes style functionality found elsewhere (e.g. Kubernetes). Note that this plugin only functions on Linux systems currently.

Users Guide

The agentless autohealing capability is packaged as a plugin. The plugin defines a single type, and a single relationship. The type, cloudify.nodes.Healer, mainly serves as a configuration node. The relationship does the heavy lifting. The idea is that the Healer node is connected to a compute node via the cloudify.relationships.healer_connected_to relationship.

The Healer has a few built in kinds of "probes":

  • ping - Simple ICMP liveness check to the IP of the connected compute node.
  • http - HTTP GET to a URL on the IP of the connected compute node.
  • port - TCP socket open on socket on the IP of the connected compute node.

Configuration

All probes are based on the same operational model. Given an IP address (determined by the target of the cloudify.relationships.healer_connected_to relationship), run each probe with a specified frequency. Then, when the probe fails count times, it launches the heal workflow in Cloudify for the target compute node. The heal, is handled by Cloudify, and the probe resumes.

The plugin.yaml file has more detailed description of the configuration of each type of probe. The plugin has an "examples" directory with some simple Openstack blueprint that illustrate the use of the different probe types.

Logging

During installation of the containing blueprint, the plugin starts a daemon on the Cloudify manager that runs the configured probe. The logging from that daemon cannot be pushed to a meaningful location in the Cloudify manager, because it isn't associated with any execution. So currently, the log is placed in the /tmp directory with the filename pattern of healer_<deployment_id>_<pid>.log.

A Little Deeper Dive

The plugin operates by using the Cloudify programmable relationships feature to trigger the creation of what amounts to an agent (or daemon) of sorts on the manager. The only purpose of this daemon is to run the configured probe and

Future And Limitations

One glaring limitation of the current version is the simplicity: a single probe heals a single compute node. There is no support for healing groups, or multiple relationships triggering more complex probes. This is a possible future enhancement.

Another limitation is that you can only use three types of probes. Ideally, to fully mimic a typical script based approach, would be to allow plugin users to supply their own probes. This is on the short list.

Another limitation is scale. You don't want thousands of these probes running on the manager. It is conceivable to imagine more full featured multi-threaded "probe servers", perhaps sprinkled throughout large deployments. I think that's probably a bridge too far, and if you need that kind of scale, you should probably use the modern agent event driven approach, as opposed to a polling approach.

Clone this wiki locally