Wrapping this cookbook

Chris Gianelloni edited this page May 2, 2014 · 2 revisions

Using this cookbook

This cookbook is intended to be wrapped by another cookbook, which sets this cookbook's attributes and orchestrates initialization and service states. Wrapping is needed because of the complex nature of Hadoop: certain actions require other actions to have taken place first, often on a different machine. To facilitate this, many of the cookbook's execute and service resources are declared with action :nothing, so that a wrapper can trigger them at the right time.

Example: Creating directories in HDFS

For example, you must format the NameNode, start the NameNode, and start at least one DataNode before you can create a directory in HDFS. Since it is very likely that the NameNode and DataNode reside on different machines, orchestration is required, and performing these actions automatically could be dangerous. Recipes and resources need to be executed in the following order:

  1. recipe[hadoop::hadoop_hdfs_namenode] on NameNode machine
  2. recipe[hadoop::hadoop_hdfs_datanode] on all DataNode machines
  3. execute[hdfs-namenode-format] with action :run from recipe[hadoop::hadoop_hdfs_namenode] on NameNode machine
  4. service[hadoop-hdfs-namenode] with action :start from recipe[hadoop::hadoop_hdfs_namenode] on NameNode machine
  5. service[hadoop-hdfs-datanode] with action :start from recipe[hadoop::hadoop_hdfs_datanode] on all DataNode machines

At this point, you will have a functional HDFS cluster and can perform hdfs commands.

Wrapping resources

Chef lets you look up a resource in the resource collection and trigger it with the #run_action method. While this can be done in plain Ruby at the top level of a recipe, we recommend putting these calls within a named ruby_block. This causes the resource to be called during the execution phase of a Chef run rather than during the compile phase. This is necessary because not all resources may have been added to the resource collection yet during the compile phase.
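To make the compile/execution distinction concrete, here is a minimal plain-Ruby sketch of a two-phase run. It is a toy model, not Chef's implementation: the ToyRun class and its declare, lookup, run_action, and converge methods are all hypothetical names invented for illustration. The only point is that a lookup placed inside a block body is deferred until the execution phase, by which time every resource has been declared.

```ruby
# Toy model of a two-phase run. Every name here is hypothetical; this is
# not Chef's implementation, only an illustration of the ordering problem.
class ToyRun
  attr_reader :log

  def initialize
    @collection = {} # resource name => { action:, body: }, in declaration order
    @log = []        # names of resources whose action actually ran
  end

  # "Compile phase": declaring a resource only records it.
  def declare(name, action: :nothing, &body)
    @collection[name] = { action: action, body: body }
  end

  # Look a resource up by name; fails if it has not been declared yet.
  def lookup(name)
    @collection.fetch(name) { raise KeyError, "#{name} has not been declared" }
  end

  # Run one resource immediately, regardless of its declared action.
  def run_action(name)
    resource = lookup(name)
    @log << name
    instance_exec(&resource[:body]) if resource[:body]
  end

  # "Execution phase": run every resource whose action is not :nothing.
  def converge
    @collection.each_key do |name|
      run_action(name) unless @collection[name][:action] == :nothing
    end
  end
end

run = ToyRun.new

# The wrapper is declared before the resource it wraps. The lookup lives in
# the block body, so it is deferred until converge.
run.declare('ruby_block[format]', action: :run) do
  run_action('execute[hdfs-namenode-format]')
end
run.declare('execute[hdfs-namenode-format]') # action defaults to :nothing

run.converge
puts run.log.inspect
```

In this sketch the log ends up holding the wrapper followed by the wrapped resource. Calling run_action('execute[hdfs-namenode-format]') directly, before the second declare, would raise instead, which mirrors the reason for deferring the call into a ruby_block.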

Example: wrapping execute resources

Here is an example, taken from continuuity/hadoop_wrapper_cookbook's hive_init recipe.

dfs = node['hadoop']['core_site']['fs.defaultFS']

ruby_block 'initaction-create-hive-hdfs-homedir' do
  block do
    resources('execute[hive-hdfs-homedir]').run_action(:run)
  end
  not_if "hdfs dfs -test -d #{dfs}/user/hive", :user => 'hdfs'
end

First, we set the dfs variable from the attribute that points to the location of our HDFS NameNode. Next is a ruby_block that looks up the execute[hive-hdfs-homedir] resource in the resource collection and calls it with the :run action. We guard execution of this ruby_block with a call to HDFS, using the hdfs shell command, so the block is skipped when the directory already exists. This block can only be executed during the execution phase of a Chef run, and only after HDFS is fully operational. Otherwise, the hdfs shell command may not be available, may fail, or may not return an accurate result.
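When a not_if guard is given a string, Chef runs it as a shell command and skips the resource if the command exits with status 0. The plain-Ruby sketch below imitates only that exit-status rule. The guard_skips? helper is a hypothetical name, the true and false commands assume a Unix-like system, and Chef's real guard handling (including options such as :user) is not reproduced here.

```ruby
# Simplified stand-in for a string guard; guard_skips? is a hypothetical
# helper, not part of Chef.
def guard_skips?(command)
  # Kernel#system returns true only when the command exits with status 0.
  # It returns false on a non-zero exit and nil if the command cannot run.
  system(command, out: File::NULL, err: File::NULL) == true
end

puts guard_skips?('true')  # command succeeds, so a not_if guard would skip
puts guard_skips?('false') # command fails, so the resource would run
```

Applied to the recipe above, hdfs dfs -test -d exits 0 once the directory exists, so later Chef runs skip the ruby_block. If HDFS is not reachable, the command exits non-zero for a different reason and the block runs anyway, which is why HDFS must be fully operational first.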

Example: wrapping service resources

Service resources are similar to execute resources. The main difference is that service resources support multiple actions. This is an example that simply starts a service.

ruby_block 'service-hadoop-hdfs-namenode-start' do
  block do
    resources('service[hadoop-hdfs-namenode]').run_action(:start)
  end
end

Since service resources are idempotent, we do not need to add a guard to the ruby_block. However, what if we wanted to both start the service and enable it to start at boot? In this pattern, each call to #run_action performs a single action, so we must perform both actions ourselves.

ruby_block 'service-hadoop-hdfs-namenode-start-and-enable' do
  block do
    %w(enable start).each do |action|
      resources('service[hadoop-hdfs-namenode]').run_action(action.to_sym)
    end
  end
end

This causes our ruby_block to signal the service[hadoop-hdfs-namenode] resource from the resource collection to enable, then start.