Big Data Compute Cluster using Chef, Hadoop and Cassandra in the Amazon AWS/EC2 cloud
Ruby Perl Shell Other
Pull request Compare This branch is 1512 commits behind infochimps-labs:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



Chef is a powerful tool for maintaining and describing the software and configurations that let a machine provide its services.

cluster_chef is

  • a clean, expressive way to describe how machines and roles are assembled into a working cluster
  • Our collection of Industrial-strength, cloud-ready recipes for Hadoop, Cassandra, HBase, Elasticsearch and more
  • a set of conventions and helpers that make provisioning cloud machines easier


Here’s a basic 3-node hadoop cluster:

    ClusterChef.cluster 'demohadoop' do
      facet 'master' do
        instances           1
        role                "nfs_server"
        role                "hadoop_master"
        role                "hadoop_worker"
        role                "hadoop_initial_bootstrap"

      facet 'worker' do
        instances           2
        role                "nfs_client"
        role                "hadoop_worker"
        server 0 do
	  chef_node_name 'demohadoop_worker_zero'

      cloud :ec2 do
        image_name          "lucid"
        flavor              "c1.medium"
        availability_zones  ['us-east-1d']
        security_group :logmuncher do
          authorize_group "webnode"

This defines a cluster (group of machines that serve some common purpose) with two facets, or unique configurations of machines within the cluster. (For another example, a webserver farm might have a loadbalancer facet, a database facet, and a webnode facet).

In the example above, the master serves out a home directory over NFS, and runs the processes that distribute jobs to hadoop workers. In this small cluster, the master also has workers itself, and a utility role that helps initialize it out of the box.

There are 2 workers; they use the home directory served out by the master, and run the hadoop worker processes.

Lastly, we define what machines to use for this cluster. Instead of having to look up and type in an image ID, we just say we want the Ubuntu ‘Lucid’ distribution on a c1.medium machine. Cluster_chef understands that this means we need the 32-bit image in the us-east-1 region, and makes the cloud instance request accordingly. It also creates a ‘logmunchers’ security group, opening it so all the ‘webnode’ machines can push their server logs onto the HDFS.

The following commands launch each machine, and once ready, ssh in to install chef and converge all its software.

    knife cluster launch demohadoop master --bootstrap
    knife cluster launch demohadoop worker --bootstrap

You can also now launch the entire cluster at once with the following

   knife cluster launch demohadoop --bootstrap

The cluster launch operation is (mostly) idempotent. (There is currently a short time after the completion of a cluster launch where a second cluster launch will create extra nodes that do not belong.)


h2. Getting Started

h3. Prelaunch

Follow the normal knife setup. If you can use the normal knife bootstrap
commands to launch a machine, you're ready to start.

h3. Setup

Install these gems,

    sudo gem install chef fog broham highline configliere right_aws net-ssh-multi formatador terminal-table

and (if you haven’t already), git clone or download the repo:

  git clone

Since we’ll need to refer back to it a few times in the setup that follows, please set an environment variable called CLUSTER_CHEF_PATH (modifying it to match the actual location):


AWS credentials

You need to make a cloud keypair, a secure key for communication with Amazon AWS cloud.

  1. Log in to the AWS console and create a new keypair named demohadoop. Your browser will download the private key file.
  2. Create a directory ~/.chef/keypairs/, and move the private key file you just downloaded to be ~/.chef/keypairs/demohadoop.pem.
  3. Make the private key unsnoopable, or ssh will complain:
      chmod 600 ~/.chef/keypairs/*.pem   

While you’re on the AWS console, also go to Account/Security Credentials and take note of your aws_access_key_id and aws_secret_access_key — you’ll need to add them to your knife.rb as shown below.

Knife setup

Clusterchef uses the ‘knife’ tool to control both chef and the cloud APIs.

Make the following additions to its configuration file (typically found at ~/.chef/knife.rb).

    # Type in the full path to your cluster_chef installation
    cluster_chef_path File.expand_path('~/path/to/cluster_chef')
    # Type in the full path to the directory holding your cloud keypairs.
    keypair_path      File.expand_path("~/.chef/keypairs")

    # Make sure knife can find all your junk
    cookbook_path ["#{cluster_chef_path}/cookbooks", "#{cluster_chef_path}/site-cookbooks",] # and anything else you want

    # Set your AWS access credentials
    knife[:aws_access_key_id]      = "XXXXXXXXXXX"
    knife[:aws_secret_access_key]  = "XXXXXXXXXXXXX"

Please hand-edit the cluster_chef_path line to give the correct location. If you already have a cookbook_path definition, you should merge it with the cookbook_path line above —just make sure that "#{cluster_chef_path}/cookbooks" and "#{cluster_chef_path}/site-cookbooks" appear in there too.

Push to chef server

We need to send all the cookbooks and role to the chef server. Visit your cluster_chef directory and run:

    knife cookbook upload --all
    for foo in roles/*.rb ; do knife role from file $foo & sleep 1 ; done

You should see all the cookbooks defined in cluster_chef/cookbooks (ant, apt, …) and cluster_chef/site-cookbooks (azkaban, cassandra, …) listed among

Stupid Surgical bits

On older versions of chef that don’t have a plugin mechanism for new commands, we have to do surgery on the knife itself… we’ll just symlink the new commands into chef’s lib/chef/knife directory, and symlink the bootstrap templates into chef’s lib/chef/knife/bootstrap directory. Set the path to your cluster_chef directory and run the following:

    sudo ln -s $CLUSTER_CHEF_PATH/lib/cluster_chef/knife/*.rb            $(dirname `gem which chef`)/chef/knife/
    sudo ln -s $CLUSTER_CHEF_PATH/lib/cluster_chef/knife/bootstrap/*     $(dirname `gem which chef`)/chef/knife/bootstrap/

Cluster chef is not currently set up to work as a pluggable gem for chef 0.10.0, but will in the near future. Until that time, you can use the knife commands in your local plugins directory with the following command:

    ln -s $CLUSTER_CHEF_PATH/lib/cluster_chef/knife  ~/.chef/plugins/knife

Cluster chef knife commands

knife cluster launch

Now if you type knife cluster launch you should see it found the new scripts:

    knife cluster launch CLUSTER_NAME [FACET_NAME] (options)
    knife cluster show CLUSTER_NAME [FACET_NAME] (options)

Go ahead and launch a cluster:

    knife cluster launch demohadoop master --bootstrap

It will kick off a node and then bootstrap it. By the time it’s done, you should be able to see the hadoop dashboard (follow the instructions for proxy setup). Once you’re convinced the cluster works, kick off the workers:

    knife cluster launch demohadoop worker --bootstrap


  • The initial startup is still finicky, but is at least down to only two passes for hadoop:
    for foo in hadoop-0.20-{datanode,namenode,tasktracker,jobtracker,secondarynamenode} ; do sudo service $foo stop ; done
    sudo chef-client
  • For hbase, still dialing it in but there’s also this:
    sudo -u hdfs hadoop fs -chown -R hbase:hbase /hadoop/hbase
    sudo chef-client
  • Once the master runs to completion with all daemons started, remove the hadoop_initial_bootstrap recipe from its run_list. (Note that you may have to edit the runlist on the machine itself depending on how you bootstrapped the node).
  • For problems starting NFS server on ubuntu maverick systems, read, understand and then run /tmp/ — See this thread for more

Zero-bootstrap, fire and forget cluster launch!

Note: Although you can (probably) still use broham to launch a cluster, cluster chef no longer needs it. ClusterChef is now able to assign node with node names without external assistnace.

  • Register for Amazon SimpleDB. (Although you do need a credit card, there’s no conceivable way broham will approach the free limit.)
  • You’ll have to run the following one-time command:
    sudo gem install broham configliere right_aws
    ruby -rubygems -e 'require "broham"; Broham.establish_connection :access_key=>"YOUR_ACCESS_KEY", :secret_access_key=>"YOUR_KEY"; Broham.create_domain'
  • Now you should be able to use broham:
     broham-register `hostname` 
  • To have it assign node names dynamically, se the client.rb script in cluster_chef/config as your /etc/chef/client.rb

Chef Concepts

ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more — use as many or as few as you like. For example, you can create and:

  • A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
  • A spot-priced, ebs-backed cluster for unattended computing at rock-bottom prices
  • A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS for

In chef,

  • A Recipe gives concrete steps that make a node achieve its desired final configuration. For example, the hadoop_cluster cookbook has a recipe to install the hadoop packages, and another to configure and run the namenode. If the cookbook isn’t installed,
  • A Cookbook holds a collections of related recipes and attributes, and the templates, libraries &c that support them.
  • A Role is a collection of related recipes and default attributes that work together. For example, there is a ‘hadoop_worker’