Big Data Compute Cluster using Chef, Hadoop and Cassandra in the Amazon AWS/EC2 cloud


Big Data Cluster using Chef, Hadoop and Cassandra


ClusterChef helps you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more; use as many or as few as you like. For example, you can create:

  • A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
  • A spot-priced, EBS-backed cluster for unattended computing at rock-bottom prices
  • A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS for


  • With Chef, you declare a final state for each node, not a procedure to follow. Administration is more efficient, robust and maintainable.
  • You get a nice central dashboard to manage clients
  • You can easily roll out configuration changes across all your machines
  • Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
  • Poolparty makes creating Amazon cloud machines concise and easy: you can specify spot instances, EBS-backed volumes, disable-api-termination, and more.
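
To make "declare a final state, not a procedure" concrete, here is a minimal sketch in plain Ruby. It is not Chef's API: the `DESIRED` hash and the `plan` and `converge` helpers are hypothetical names invented for illustration. The point it shows is that the same declaration can be applied to a node in any starting state, and applying it a second time does nothing.

```ruby
# Hypothetical model, NOT Chef's API: a node is a Hash of package => state.
DESIRED = {
  "hadoop"    => :installed,
  "cassandra" => :installed,
  "nfs"       => :absent
}.freeze

# Return only the actions needed to move `current` to `desired`.
def plan(current, desired)
  desired.each_with_object([]) do |(pkg, want), actions|
    have = current.fetch(pkg, :absent)
    next if have == want # already in the declared state, nothing to do
    actions << (want == :installed ? [:install, pkg] : [:remove, pkg])
  end
end

# Apply the plan to a copy of `current` and return the resulting state.
def converge(current, desired)
  plan(current, desired).each_with_object(current.dup) do |(action, pkg), state|
    state[pkg] = action == :install ? :installed : :absent
  end
end

node = { "hadoop" => :installed, "nfs" => :installed }
puts plan(node, DESIRED).inspect  # [[:install, "cassandra"], [:remove, "nfs"]]
node = converge(node, DESIRED)
puts plan(node, DESIRED).inspect  # [] (a second run has nothing left to do)
```

A procedural script would have to know what state the machine started in; the declarative version only compares current state against the declaration, which is why repeated runs are safe.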



Chef Concepts

In Chef:

  • A Recipe gives concrete steps that make a node achieve its desired final configuration. For example, the hadoop_cluster cookbook has a recipe to install the hadoop packages, and another to configure and run the namenode. If the cookbook isn’t installed,
  • A Cookbook holds a collection of related recipes and attributes, and the templates, libraries, etc. that support them.
  • A Role is a collection of related recipes and default attributes that work together. For example, there is a ‘hadoop_worker’
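
As a rough illustration of how these three pieces fit together, the sketch below builds a role as a plain Ruby hash in the shape Chef uses for JSON role files and prints it. Only the role name `hadoop_worker` and the cookbook name `hadoop_cluster` come from the text above. The `hadoop_cluster::worker` recipe name, the description and the attribute keys are assumptions made up for the example, so check the cookbook itself for the real ones.

```ruby
require "json"

# Hypothetical role definition. The run_list entries and attribute keys are
# illustrative assumptions, not taken from the actual ClusterChef cookbooks.
hadoop_worker = {
  "name"        => "hadoop_worker",
  "description" => "Example role for a Hadoop worker node",
  "json_class"  => "Chef::Role",
  "chef_type"   => "role",
  # A run list names recipes as recipe[cookbook] or recipe[cookbook::recipe].
  "run_list" => [
    "recipe[hadoop_cluster]",         # the cookbook's default recipe
    "recipe[hadoop_cluster::worker]"  # hypothetical recipe name
  ],
  # Default attributes are values the recipes can read at converge time.
  "default_attributes"  => { "hadoop" => { "example_setting" => 2 } },
  "override_attributes" => {}
}

puts JSON.pretty_generate(hadoop_worker)
```

The role carries no steps of its own: it only names recipes (which live in cookbooks) and supplies default attributes for them, which is what lets one role be assigned to many nodes.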

Cluster Roles

Feedback on documentation:

Alright, so I gave it a QUICK once-over. I shouldn’t spend much more time on this over the weekend though…

– install dependencies: what about right_aws? Or is this no longer necessary?
– aws scavenger hunt: wtf is broham? Maybe just a link to its repo is fine?
– kick off chef+master node for option A: maybe a short note on what the -c option is doing (i.e. I’d find it very handy to know that the configuration for my cluster is stored in that file …)
– set up web ui: I believe the command “sudo cat /etc/chef/server.rb | grep -i pass” is meant to be run on the master node. Indicate this?

– wtf? I have no idea what’s going on here.

– where is the explanation for how to launch slaves?

Setting up a chef server is explained pretty well. I wouldn’t be able to spin up a hadoop master or hadoop nodes with these instructions though. Maybe just a handwavy explanation for what command to run and where the config lives for both would be sufficient.

- Adding a role?
- Nuking a role?
- wtf is a role? (Are we assuming prior knowledge of chef?)
- Modifying a running cluster?
