Big Data Cluster using Chef, Hadoop and Cassandra
ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more — use as many or as few as you like. For example, you can create:
- A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
- A spot-priced, EBS-backed cluster for unattended computing at rock-bottom prices
- A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS
ClusterChef builds on Chef and Poolparty, which bring several benefits:
- With Chef, you declare a final state for each node, not a procedure to follow (see the sketch after this list). Administration is more efficient, robust and maintainable.
- You get a nice central dashboard to manage clients
- You can easily roll out configuration changes across all your machines
- Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
- Poolparty makes creating Amazon cloud machines concise and easy: you can specify spot instances, EBS-backed volumes, disable-api-termination, and more.
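To make the "declare a final state" point concrete, here is a minimal sketch of what a Chef recipe looks like. It is not taken from the hadoop_cluster cookbook (the package, file, and service names are illustrative assumptions), but `package`, `template`, and `service` are standard Chef resources.

```ruby
# A minimal, hypothetical recipe: each resource declares the state it should
# end up in, and Chef converges the node to match. The package and service
# names here are illustrative, not the hadoop_cluster cookbook's own.
package "hadoop" do
  action :install              # "this package should be installed"
end

template "/etc/hadoop/conf/core-site.xml" do
  source "core-site.xml.erb"   # render config from a template in the cookbook
  mode   "0644"
end

service "hadoop-namenode" do
  action [:enable, :start]     # "this service should be enabled and running"
end
```

Because each resource is a declaration, running chef-client on an already-converged node changes nothing; that idempotence is what makes rolling out configuration changes across a whole cluster safe and repeatable.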
- Part 1: Initial settings and credentials
- Part 2: Set up Chef Server and your local Chef environment
- Part 3: Launch your Hadoop master node
- Part 4: Launch the rest of the Hadoop cluster
- Designing your cluster
- A persistent HDFS using EBS volumes
- Burn your own AMI
- Tips and Troubleshooting
Chef concepts:
- A Recipe gives concrete steps that make a node achieve its desired final configuration. For example, the hadoop_cluster cookbook has a recipe to install the Hadoop packages, and another to configure and run the namenode.
- A Cookbook holds a collection of related recipes and attributes, and the templates, libraries &c that support them.
- A Role is a collection of related recipes and default attributes that work together. For example, a ‘hadoop_worker’ role can bundle everything a Hadoop worker node should run (see the sketch after this list).
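To illustrate how recipes and roles fit together, here is what a role file might look like. This is a sketch, not the actual hadoop_worker role from ClusterChef: the recipe names and the attribute override are assumptions, while `name`, `description`, `run_list`, and `default_attributes` are the standard Chef role DSL.

```ruby
# roles/hadoop_worker.rb -- a hypothetical role file. A role is just a named
# run list plus default attributes; the recipe names below are illustrative.
name        "hadoop_worker"
description "Everything a Hadoop worker node should run"
run_list(
  "recipe[hadoop_cluster]",              # install and configure Hadoop itself
  "recipe[hadoop_cluster::datanode]",    # serve HDFS blocks
  "recipe[hadoop_cluster::tasktracker]"  # run map/reduce tasks
)
default_attributes(
  "hadoop" => { "java_heap_size_max" => "1000m" }  # example default attribute
)
```

Once a role like this exists, assigning it to a node (for example, `knife node run_list add NODE 'role[hadoop_worker]'`) is all it takes: the node converges to a full Hadoop worker on its next chef-client run.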
Feedback on documentation:
Alright, so I gave it a QUICK once-over. I shouldn’t spend much more time on it this weekend, though…
– install dependencies: what about right_aws? Or is this no longer necessary?
– aws scavenger hunt: wtf is broham? Maybe just a link to its repo is fine?
– kick off chef+master node for option A: maybe a short note on what the -c option is doing (i.e., I’d find it very handy to know that the file it points to stores the configuration for my cluster…)
– set up web ui: I believe the command “sudo cat /etc/chef/server.rb | grep -i pass” is meant to be run on the master node. Indicate this?
– wtf? I have no idea what’s going on here.
– where is the explanation for how to launch slaves?
Setting up a Chef server is explained pretty well. I wouldn’t be able to spin up a Hadoop master or Hadoop worker nodes with these instructions, though. Maybe just a handwavy explanation of what command to run and where the config lives for both would be sufficient.
- Adding a role?
- Nuking a role?
- wtf is a role? (Are we assuming prior knowledge of Chef?)
- Modifying a running cluster?