Skip to content

EC2 Testing Infrastructure

kirktrue edited this page Aug 17, 2010 · 2 revisions

Goals and Deliverables

The primary goals of this project are as follows:

  1. Ability to initialize EC2 instances for use with Voldemort
  2. Cluster deployment, configuration
  3. Individual node start/stop
  4. Ability to leverage above for performance and correctness tests

The goal for steps two through four is to provide utilities for cluster/client management and performance and correctness testing. Whether the cluster/clients and tests will reside on internally-managed systems or externally-managed systems should be transparent. That is, we want users to be able to perform cluster and client management and testing on external machines (e.g. EC2) as well as internal machines (e.g. departmental testing lab). The prerequisites for the cluster and clients are described later.

“Ability to initialize EC2 instances for use with Voldemort”

For the first goal, we’re going to use an appropriate Java library for EC2 initialization (Typica, etc.). This will introduce one or more dependencies to the project.

“Cluster deployment, configuration” and “Individual node start/stop”

For the second and third, we will essentially be leveraging SSH to connect and execute the needed commands on the remote instances. Research will be done to determine if a Java SSH wrapper library exists and suits our needs. Otherwise we can fall back on the Runtime/Process/ProcessBuilder/etc. Java APIs.

“Ability to leverage above for performance and correctness tests”

For the fourth goal, we will create a class in the testing directory that contains a set of static methods that provide simple APIs for cluster deployment, client deployment, node starting and stopping, etc. This will use the static import idiom of JUnit, absolve the unit/integration tests from needing to extend from a particular super class, yet still provide simple a set of APIs for the test writers.

The deliverable has several pieces:

  • Java library for achieving the above (as described later)
  • Shell scripts for invoking the individual tasks using the Java library
  • Convenience wrapper/utility class for JUnit tests that sits atop the library

The following shell script-based utilities will be created specific to EC2 functionality:

  • voldemort-ec2instancecreator.sh
  • voldemort-ec2instanceterminator.sh

The remaining shell scripts can be leveraged in a set of internally-managed or externally-managed systems:

  • voldemort-clustergenerator.sh
  • voldemort-deployer.sh
  • voldemort-clusterstarter.sh
  • voldemort-clusterstopper.sh
  • voldemort-clustercleaner.sh
  • voldemort-clusterremotetest.sh

Java APIs will also be created to do the above. However, these will not be listed here for the sake of brevity.

Environment Prerequisites

In order to perform the aforementioned operations, the cluster and client machines must have certain environmental configuration present. As mentioned before, while LinkedIn will leverage the EC2 infrastructure for testing, we want the utilities/library to be as agnostic as possible. So we define a minimal set of requirements.

Prerequisites for server nodes:

  • 2.6-based distribution of Linux
  • ulimit set appropriate to number of sockets involved in test
  • Key-based SSH access (i.e. does not require a password)
  • Java 5 installed with JAVA_HOME environment variable set
  • Required ports open for cluster connectivity

Prerequisites for client nodes:

  • 2.6-based distribution of Linux
  • ulimit set appropriate to number of sockets involved in test
  • Key-based SSH access (i.e. does not require a password)
  • Java 5 installed with JAVA_HOME environment variable set

EC2 Instance Creator

This utility will initialize EC2 instances for use as either Voldemort servers or clients. This step is independent of Voldemort and is simply a means to provision instances meeting our prerequisites.

Parameters:

  • Amazon account’s access ID
  • Amazon account’s secret key
  • AMI ID
  • Instance size (i.e. small, medium, large, etc.)
  • Key pair ID
  • Number of instances to create
  • Type of instances to create

The AMI ID will be an AMI that meets the environmental requirements above. Usually that will mean that a user has created a custom AMI that meets those requirements. The instance size will be dependent on the AMI. For example, if the AMI is 32-bit, it cannot use a “large” EC2 instance. Conversely, if the AMI is 64-bit, a “small” instance cannot be allocated. (This is an Amazon constraint.)

Logic:

  1. Process parameters
  2. Issue calls to library/Amazon to create server instances
  3. Perform the following in parallel:
    1. Perform polling, waiting for all instances to complete startup
    2. Determine external and internal host names for each instance
  4. Format results for output

The output will be a line-delineated, equal sign-separated list of external host name and internal host name. Essentially it is a Java properties file. For example:

<external host name 1>=<internal host name 1>
<external host name 2>=<internal host name 2>

EC2 Instance Terminator

This utility will terminate EC2 instances. This step is also independent of Voldemort and is simply a means to de-provision instances previously created.

This too will leverage an existing EC2 Java library for the actual heavy lifting.

Parameters:

  • Amazon account’s access ID
  • Amazon account’s secret key
  • External host names of instances to destroy (optional)

Logic:

  1. Process parameters
  2. Issue calls to library/Amazon to terminate server instances
  3. Perform the following in parallel:
    1. Perform polling, waiting for all instances to terminate

It may be beneficial to have a ‘terminate with extreme prejudice’ option wherein the utility destroys all instances, just to avoid having any stragglers. If the user provides an appropriate flag (--force) and no host names, we can terminate all of his instances.

Cluster Descriptor Generator

This is a simple Java API that creates a cluster.xml dynamically, based upon host names. It implements the generate_cluster_xml.py script in Java for use both as a command line utility and from within JUnit.

Deployer

The deployer utility/library API is used to deploy the Voldemort directory structure to a set of remote machines. These machines can be either servers or clients, or a mixture of both.

Parameters:

  • Server/client nodes’ (external) host names
  • Server/client nodes’ user ID
  • SSH private key path
  • Source directory (on local machine)
  • Destination directory (on remote machines)

Logic:

  1. Process parameters
  2. Using the server/client instance host name, user, and SSH key, format a call to rsync to copy data from the source directory to the destination directory
  3. Process the progress of the rsync call

Based on previous research, there are some Java-based rsync libraries. For now, however, we’ve resorted to invoking the rsync command line utility from Java to perform the actual data transfer.

Cluster Starter

The Voldemort server runner utility/library API is used to start the Voldemort servers on a set of remote machines. The heavy lifting will be performed by SSH to gain access to the machine and voldemort-start.sh for starting the server.

Parameters:

  • Map of server nodes’ external host names to internal host names
  • Server nodes’ user ID
  • SSH private key path
  • Voldemort root directory (on remote machines)
  • Voldemort config directory (on remote machines)
  • cluster.xml file (on local machine)

Logic:

  1. Process parameters
  2. For each server, determine the mapping of external server host name to node ID using cluster.xml and the server name mapping
  3. Using the server instance host name, user, and SSH key, SSH into each node
    1. Export VOLDEMORT_HOME to be the remote machine’s configuration directory
    2. Export VOLDEMORT_NODE_ID for the system’s unique node ID
    3. Invoke voldemort-start.sh
  4. Monitor output from each server node

VoldemortConfig already includes logic to look for the node ID in the VOLDEMORT_NODE_ID environment variable if not present in server.properties.

Cluster Stopper

The Voldemort server stopper utility/library API is used to stop the Voldemort servers on a set of remote machines. It does not halt the operating system, just the Voldemort process. The heavy lifting will be performed by SSH to gain access to the machine and voldemort-stop.sh for stopping the server.

Parameters:

  • List of server nodes’ external host names
  • Server nodes’ user ID
  • SSH private key path
  • Voldemort root directory (on remote machines)

Logic:

  1. Process parameters
  2. Using the server instance host name, user, and SSH key, SSH into each node and invoke voldemort-stop.sh

Remote Test Client Runner

The Voldemort test client runner utility/library API is used to start the Voldemort test clients on a set of remote machines. The heavy lifting will be performed by SSH to gain access to the machine and voldemort-remote-test.sh for starting the client.

Unfortunately I have not thought of a good way to handle the remote test arguments. One thought would be to simply provide a properties file mapping the remote test client with a string representing the arguments to pass to the remote test script (voldemort-remote.test.sh). However, that would make scripting a test from the command line burdensome, though it offers the best flexibility. Another approach would be to have the remote test client runner mimic the arguments used for the voldemort-remote-test script and build up the call to the

Parameters:

  • List of client nodes’ (public) host names
  • Client nodes’ user ID
  • SSH private key path
  • Voldemort directory (on remote machines)

Logic:

  1. Process parameters
  2. Using the client instance host name, user, and SSH key, SSH into each node and…
    1. …invoke voldemort-remote-test.sh, redirecting the output
  3. Prepare a means to monitor the process, output, and report that to the client or via the API
  4. At an appropriate interval, check that each machine is running the client process
  5. Provide a means to determine the progress of the test run – is it running, stalled, complete?
  6. Upon completion of the test run, return the results in the form: <client host name>,<iteration>,<read ms>,<write ms>,<delete ms>

The data that is returned is intended to be generic enough that arbitrary shell scripts and tools can take use these results for aggregation, trending, graphing, etc.

Open Questions/Issues

  1. What facilities need to be provided by the test framework to ensure correctness? For example, let’s say we’re implementing a JUnit test that checks that on server node failure the other nodes pick up the appropriate keys. How can the testing framework provide feedback to the JUnit test that this occurred?

Clone this wiki locally