cdapio/chaos-monkey

Chaos Monkey

Chaos Monkey provides a convenient way to disrupt CDAP and Hadoop services on a cluster. Disruptions can be scheduled, randomized, or issued on command.

Standalone Chaos Monkey

To start the Chaos Monkey daemon and HTTP server, set the configurations in chaos-monkey-site.xml and run ChaosMonkeyMain.

Configurations

Disruptions setup

By default, the following disruptions will be available to each service:

  • start
  • restart
  • stop
  • terminate
  • kill
  • rolling-restart

Custom disruptions can be added by extending the Disruption class and associating the subclass with a service. A custom disruption is started by calling ClusterDisruptor.disrupt(serviceName, disruptionName, actionArguments), where disruptionName is the value returned by Disruption.getName(). Each disruption receives a collection of RemoteProcess objects, selected according to actionArguments, which it can use to execute commands over SSH. To add a custom disruption to a service, set:

  • {service}.disruptions - Class paths of custom disruptions, separated by commas

Initialize a service for Chaos Monkey

Any configured service can be controlled through ClusterDisruptor or the REST endpoints. To configure a service for Chaos Monkey, provide either custom disruptions or a pid file for the default disruptions:

  • {service}.pidFile - Path to the .pid file of the service

Configurations for scheduled disruptions

These additional properties can be set for a service to schedule disruptions:

  • {service}.interval - Number of seconds between disruptions.
  • {service}.killProbability - Number between 0 and 1 representing the chance of a kill occurring each iteration.
  • {service}.stopProbability - Number between 0 and 1 representing the chance of a stop occurring each iteration.
  • {service}.restartProbability - Number between 0 and 1 representing the chance of a restart occurring each iteration.
  • {service}.minNodesPerIteration - Minimum number of nodes affected each iteration.
  • {service}.maxNodesPerIteration - Maximum number of nodes affected each iteration.
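
As an illustration, the pid file and scheduling properties for one service might be combined in chaos-monkey-site.xml roughly as follows. This is only a sketch: the Hadoop-style configuration/property layout is an assumption based on the -site.xml file name, and the service name hbase-master, the pid file path, and all values are hypothetical.

```xml
<!-- Sketch only: layout assumed to follow the Hadoop *-site.xml convention. -->
<configuration>
  <!-- "hbase-master" is a hypothetical service name; the path is a placeholder. -->
  <property>
    <name>hbase-master.pidFile</name>
    <value>/path/to/hbase-master.pid</value>
  </property>
  <!-- Run one iteration every 600 seconds. -->
  <property>
    <name>hbase-master.interval</name>
    <value>600</value>
  </property>
  <!-- Chance of each disruption per iteration (values between 0 and 1). -->
  <property>
    <name>hbase-master.killProbability</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hbase-master.restartProbability</name>
    <value>0.3</value>
  </property>
  <!-- Affect between 1 and 2 nodes per iteration. -->
  <property>
    <name>hbase-master.minNodesPerIteration</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase-master.maxNodesPerIteration</name>
    <value>2</value>
  </property>
</configuration>
```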

Cluster information collector

By default, Chaos Monkey retrieves cluster information from Coopr. For this to work, the following configurations need to be set:

  • cluster.info.collector.coopr.clusterId
  • cluster.info.collector.coopr.tenantId
  • cluster.info.collector.coopr.server.uri

To get cluster information from another source, provide a plugin that implements ClusterInfoCollector and set the following configuration:

  • cluster.info.collector.class - classpath of the implementation of ClusterInfoCollector

Additional properties can be passed to the ClusterInfoCollector implementation. Setting the property cluster.info.collector.{propertyName} in the configuration makes {propertyName} available in the properties map passed to the initialize method.

SSH configurations

  • username - Username of the SSH profile (if different from the system user)
  • keyPassphrase - Passphrase for the private key, if applicable
  • privateKey - Path to the private key (default locations are checked unless specified)

HTTP endpoints

The HTTP server is hosted on port 11020 and exposes the following endpoints:

POST /v1/services/{service}/{action}
Values for {action} include stop, kill, terminate, start, restart, and rolling-restart.
By default, the action is performed on all nodes configured with the service. To limit the affected nodes, include one of the following request bodies:

{
 "nodes": ["<nodeAddress1>", "<nodeAddress2>", ...]
}

{
 "percentage": <numberFrom0To100>
}

{
 "count": <numberOfNodes>
}

In addition to the request bodies above, a rolling restart can also be configured with:

{
 "restartTime": <restartTimeSeconds>,
 "delay": <delaySeconds>
}
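
The POST endpoint above could be called from Java along the following lines. This is a minimal sketch using the JDK's java.net.http client (Java 11+), not code from this repository: the class name ChaosMonkeyClient, the host, the example service name, and the Content-Type header are all assumptions. Only the port and path layout come from this README.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; not part of Chaos Monkey itself.
final class ChaosMonkeyClient {
    // Port documented in this README.
    static final int PORT = 11020;

    // Builds http://{host}:11020/v1/services/{service}/{action}.
    public static URI actionUri(String host, String service, String action) {
        return URI.create("http://" + host + ":" + PORT
            + "/v1/services/" + service + "/" + action);
    }

    // Builds the POST request. A null body means "all nodes" (the default).
    public static HttpRequest actionRequest(String host, String service,
                                            String action, String jsonBody) {
        HttpRequest.BodyPublisher body = (jsonBody == null)
            ? HttpRequest.BodyPublishers.noBody()
            : HttpRequest.BodyPublishers.ofString(jsonBody);
        return HttpRequest.newBuilder(actionUri(host, service, action))
            // Assumption: the README does not say whether this header is required.
            .header("Content-Type", "application/json")
            .POST(body)
            .build();
    }

    // Sends the request and returns the raw response.
    public static HttpResponse<String> send(HttpRequest request)
            throws IOException, InterruptedException {
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

For example, `ChaosMonkeyClient.send(ChaosMonkeyClient.actionRequest("localhost", "hbase-master", "rolling-restart", "{\"count\": 2, \"delay\": 30}"))` would ask for a rolling restart of two nodes, assuming a service named hbase-master is configured and that the node selection and rolling-restart fields may be combined in one body.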

GET /v1/nodes/{ip}/status
Get the status of all configured services on the node at the given address

GET /v1/status
Get the status of all configured services on every node of the cluster
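
The two status endpoints could be queried in the same way. The sketch below only builds the GET requests. The class name ChaosMonkeyStatus and the host are assumptions, and the response format is not documented in this README, so the body is best treated as an opaque string.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper class; not part of Chaos Monkey itself.
final class ChaosMonkeyStatus {
    // Port documented in this README.
    static final int PORT = 11020;

    // GET /v1/nodes/{ip}/status: services on a single node.
    public static HttpRequest nodeStatusRequest(String host, String ip) {
        URI uri = URI.create("http://" + host + ":" + PORT
            + "/v1/nodes/" + ip + "/status");
        return HttpRequest.newBuilder(uri).GET().build();
    }

    // GET /v1/status: services on every node of the cluster.
    public static HttpRequest clusterStatusRequest(String host) {
        URI uri = URI.create("http://" + host + ":" + PORT + "/v1/status");
        return HttpRequest.newBuilder(uri).GET().build();
    }
}
```

Either request can then be sent with `java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.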
