
EC2 Discovery

Nick Satterly edited this page Oct 17, 2013 · 3 revisions

Ganglia and EC2

An important feature of Ganglia is the use of multicast to allow the same simple configuration file to be used across all servers in a cluster. Multicast discovery also means nodes are dynamically added and removed without any further configuration updates. This is a fantastic feature of Ganglia; however, none of it works in EC2 or other cloud environments because UDP multicast is not supported.

The work-arounds

To overcome the lack of multicast support and dynamic discovery there are a few work-arounds currently in use:

  1. Use a head node (or nodes) to which all nodes in a cluster forward their metrics for collection and aggregation. The drawback is that there are now a few servers that need specific, very complicated configuration and are also potential points of failure. If you only have a few clusters with hundreds or thousands of instances, that might work. However, if you have dozens and dozens of clusters with only a few nodes in each, this can become an administrative nightmare to manage.

  2. A variation of the above is to use the gmetad servers as the "head node" for all clusters. This is particularly common in dynamic environments where no single server of a cluster can always be guaranteed to be up. However, this requires a gmond for each cluster to be run on the gmetad servers, which can be even more complicated to set up and maintain than option 1.

  3. Use a configuration management tool like Puppet or Chef to continually modify the configurations and restart the agents based on the results of an EC2 API query. The drawbacks are that new instances only appear once the tool has run (typically every 15-30 minutes), and that configurations that are not yet consistent across the cluster can cause false "dead nodes". In very dynamic environments where instances are launched and terminated frequently, the lag between a node coming up and the Puppet run that fixes the metric collection is unacceptable.

The solution

This idea was borrowed from elasticsearch, which does EC2 discovery by using the EC2 API instead of multicast to find other potential cluster members. Appropriate credentials are added to a cloud section of the Ganglia agent configuration so that it can query the EC2 API, and filters are applied to the query to ensure that only servers belonging to a particular cluster are returned.

cloud {
  access_key = AWS_ACCESS_KEY_ID
  secret_key = SECRET_ACCESS_KEY
}

The Ganglia gmond agent combines the tags, groups and availability zones defined in the discovery section into a single filter, which it uses to find the list of matching EC2 instances.

discovery {
 type = "ec2" /* only ec2 API supported in this version */
 # endpoint = "https://ec2.amazonaws.com" /* only required if in us-east-1 */
 tags = {  } /* list of tags eg. stage=prod */
 groups = { } /* list of security groups */
 availability_zones = { } /* eg. eu-west-1a */
 host_type = private_ip /* private_ip, public_ip, private_dns, public_dns */
 discover_every = 90
 port = 8649
}
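As a rough sketch (in Python, with illustrative names that are not part of gmond), the tags, groups and availability zones combine into EC2 DescribeInstances filter parameters of the form Filter.N.Name / Filter.N.Value:

```python
# Illustrative sketch (not gmond source): how tags, groups and
# availability zones could map to EC2 DescribeInstances filter
# parameters of the form Filter.N.Name / Filter.N.Value.
def ec2_filters(tags=(), groups=(), availability_zones=()):
    filters = []
    for tag in tags:                         # e.g. "stage=prod"
        key, _, value = tag.partition("=")
        filters.append(("tag:" + key, value))
    for group in groups:                     # security group names
        filters.append(("group-name", group))
    for zone in availability_zones:          # e.g. "eu-west-1a"
        filters.append(("availability-zone", zone))
    params = {}
    for i, (name, value) in enumerate(filters, start=1):
        params["Filter.%d.Name" % i] = name
        params["Filter.%d.Value" % i] = value
    return params
```

For example, tags of "stage=dev" and "role=test" yield Filter.1.Name=tag:stage, Filter.1.Value=dev, Filter.2.Name=tag:role, Filter.2.Value=test, matching the query string visible in the debug output later on this page.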

The result

There is only one configuration file per cluster, and it doesn't change during the lifetime of the server. New instances are discovered as soon as they are launched, and terminated instances are forgotten within 90 seconds (configurable).

Auto Scaling

Whenever a new instance comes up (as part of an auto scaling group, or otherwise) it will discover all the instances in its cluster and send metrics to them. When a gmond receives metrics from an instance it does not know about, it triggers a forced discovery. This means that a newly launched instance is discovered and added to a cluster almost immediately.

It will also rediscover every so often (by default every 90 seconds) so that instances that have been terminated are removed from its list of UDP send destinations.
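The forced and periodic rediscovery described above can be sketched as follows (a simplified model, not the gmond implementation; all names are illustrative):

```python
# Simplified model of forced + periodic EC2 rediscovery.
# discover_fn stands in for the EC2 API query and returns the
# current set of matching instance addresses.
import time

class DiscoveryCache:
    def __init__(self, discover_fn, interval=90):
        self.discover_fn = discover_fn   # queries the EC2 API
        self.interval = interval         # discover_every, default 90s
        self.nodes = set()               # current UDP send destinations
        self.last_run = 0.0

    def rediscover(self, now=None):
        self.nodes = set(self.discover_fn())
        self.last_run = now if now is not None else time.time()

    def on_metric(self, sender_ip, now=None):
        # Metrics from an unknown sender force an immediate rediscovery,
        # so newly launched instances are added almost immediately.
        if sender_ip not in self.nodes:
            self.rediscover(now)

    def maybe_refresh(self, now):
        # Periodic refresh drops terminated instances from the send list.
        if now - self.last_run >= self.interval:
            self.rediscover(now)
```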

Example

If all the production web frontend servers in EC2 were tagged with environment=PROD and cluster=frontend, then the discovery section would look like the following:

discovery {
 type = "ec2"
 endpoint = "https://ec2.eu-west-1.amazonaws.com"
 tags = { "environment=PROD", "cluster=frontend" }
 groups = { }
 availability_zones = { }
 host_type = "private_ip"
 discover_every = 90
 port = 8649
}

If you wanted to separate the web servers in each availability zone into their own clusters, it would simply be a matter of adding availability_zones = { "eu-west-1a" }, availability_zones = { "eu-west-1b" }, availability_zones = { "eu-west-1c" } and so on to the configurations for the servers in each AZ.
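For example, a sketch of the discovery section for the servers in eu-west-1a (all other settings unchanged from the example above) might look like:

```
discovery {
 type = "ec2"
 endpoint = "https://ec2.eu-west-1.amazonaws.com"
 tags = { "environment=PROD", "cluster=frontend" }
 groups = { }
 availability_zones = { "eu-west-1a" }
 host_type = "private_ip"
 discover_every = 90
 port = 8649
}
```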

Troubleshooting

Debug mode

Run the Ganglia agent in debug mode and look for debug output starting with [discovery.cloud] or [discovery.ec2].

[discovery.ec2] Using dynamic discovery to build list of nodes
[discovery.ec2] List of nodes will be refreshed every 90 seconds
[discovery.cloud] access key=AKIAIM3EQ4SCWGT6UZBA, secret key=***********wxyz
[discovery.ec2] using host_type [private_ip], tags [role=test,stage=dev], groups [], availability_zones []
[discovery.ec2] using endpoint https://ec2.amazonaws.com -> ec2.amazonaws.com
[discovery.ec2] base64 encoded hash kkZjBeftvfnlbUdmTyETKSdsQRRRe7ArqK4wqWoRIWI=
[discovery.ec2] API request https://ec2.amazonaws.com?AWSAccessKeyId=AKIAIM3EQ4SCWGT6UZBA&Action=DescribeInstances&Filter.1.Name=tag%3Astage&Filter.1.Value=dev&Filter.2.Name=tag%3Arole&Filter.2.Value=test&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2012-10-24T21%3A57%3A41Z&Version=2012-08-15&Signature=kkZjBeftvfnlbUdmTyETKSdsQRRRe7ArqK4wqWoRIWI%3D
[discovery.ec2] HTTP response code 200, 9482 bytes retrieved
[discovery.ec2] i-b5569bc9 security group quicklaunch-1
[discovery.ec2] i-b7569bcb security group quicklaunch-1
[discovery.ec2] i-b1569bcd security group quicklaunch-1
[discovery.ec2] Found 3 matching instances
[discovery.ec2] adding i-b1569bcd, udp send channel privateIpAddress 10.243.123.143:8649
[discovery.ec2] UDP send channels = 10.195.178.185:8649
[discovery.ec2] UDP send channels = 10.194.63.158:8649
[discovery.ec2] UDP send channels = 10.243.123.143:8649
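The "base64 encoded hash" line above is the AWS Signature Version 2 signature of the request. A minimal Python sketch of that signing scheme (sign_v2 is an illustrative name, not part of gmond) is:

```python
# Sketch of AWS Signature Version 2 (HmacSHA256) request signing.
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_v2(secret_key, host, params, path="/"):
    # Canonical query string: parameters sorted by name,
    # RFC 3986 percent-encoded ("-_.~" left unescaped).
    qs = "&".join(
        "%s=%s" % (quote(k, safe="-_.~"), quote(v, safe="-_.~"))
        for k, v in sorted(params.items())
    )
    string_to_sign = "\n".join(["GET", host.lower(), path, qs])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode()
```

The resulting base64 string is appended to the query string as the Signature parameter, as seen in the API request line above.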

Get the XML response

Cut and paste the HTTP request into a browser, or use curl, and review the XML API response.

curl "https://ec2.amazonaws.com?AWSAccessKeyId=AKIAIM3EQ4SCWGT6UZBA&Action=DescribeInstances..."

<?xml version="1.0" encoding="UTF-8"?>
<DescribeInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2012-08-15/">
    <requestId>963385d0-9c20-4b19-bca5-56dfe9d67657</requestId>
    <reservationSet>
        <item>
            <reservationId>r-07f88b61</reservationId>
            <ownerId>496780030265</ownerId>
            <groupSet>
                <item>
                    <groupId>sg-c91ec3a1</groupId>
                    <groupName>quicklaunch-1</groupName>
                </item>
            </groupSet>
            <instancesSet>
                <item>
                    <instanceId>i-b5569bc9</instanceId>
                    <imageId>ami-1624987f</imageId>
                    <instanceState>
                        <code>16</code>
                        <name>running</name>
                    </instanceState>
                    <privateDnsName>ip-10-194-63-158.ec2.internal</privateDnsName>
                    <dnsName>ec2-54-242-76-61.compute-1.amazonaws.com</dnsName>
                    <reason/>
[snip]
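A response like this can be picked apart with nothing more than the standard library. The sketch below (illustrative names, not gmond code) pulls the requested address field out of each running instance; the element and namespace names are taken from the response above:

```python
# Sketch: extracting instance details from a DescribeInstances response.
import xml.etree.ElementTree as ET

NS = {"ec2": "http://ec2.amazonaws.com/doc/2012-08-15/"}

def instance_addresses(xml_text, field="privateDnsName"):
    """Return the requested address field for every running instance."""
    root = ET.fromstring(xml_text)
    addresses = []
    for item in root.findall(".//ec2:instancesSet/ec2:item", NS):
        state = item.findtext("ec2:instanceState/ec2:name", namespaces=NS)
        if state == "running":
            addresses.append(item.findtext("ec2:" + field, namespaces=NS))
    return addresses
```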

No receive channels

It is not necessary to comment out or delete references to udp_send_channel from the configuration file; they will simply be ignored if present. However, if the agent fails to start with the error apr_pollset_create failed: Invalid argument, it means that no udp_recv_channel has been defined in the configuration file. The following stanza (or one like it) must be present:

udp_recv_channel {
  port = 8649
}

Support for other cloud environments

Any cloud provider that supports the de facto standard EC2 API is, naturally, also supported.

Support for most other cloud providers could be added by conforming to the DeltaCloud API or CIMI standard.