Humboldt provides a tool-set on top of Rubydoop to run Hadoop jobs effortlessly both locally and on Amazon EMR. There is also some sugar added on top of the Rubydoop DSL.
Humboldt adds a number of type converters:

- `binary`: `String` to/from `Hadoop::Io::BytesWritable`
- `encoded`: `String` to/from a MessagePack-encoded `Hadoop::Io::BytesWritable`
- `text`: `String` to/from `Hadoop::Io::Text`
- `json`: `Hash` to/from `Hadoop::Io::Text`
- `long`: `Integer` to/from `Hadoop::Io::LongWritable`
- `none`: `nil` to/from `Hadoop::Io::NullWritable`
Use them like so:
```ruby
class Mapper < Humboldt::Mapper
  input :long, :json
  # ...
end
```
Hadoop does not perform well with many small input files, since by default each file is handled by its own map task. Humboldt bundles an input format that combines multiple files into each split. Due to a bug in Hadoop 1.0.3 (and other versions), Hadoop 2.2.0 is required to use this input format with input files on S3, see this bug.
Example usage:
```ruby
Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    input input_paths, format: :combined_text
    set 'mapreduce.input.fileinputformat.split.maxsize', 32 * 1024 * 1024
    # ...
  end
end
```
`mapreduce.input.fileinputformat.split.maxsize` controls the maximum size of an input split.
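As a rough back-of-the-envelope (an assumption for illustration, not a guarantee of how Hadoop schedules tasks): with the combined input format, the number of map tasks is approximately the total input size divided by the configured maximum split size, so many small files collapse into far fewer tasks.

```ruby
# Approximate map task count under the combined input format.
max_split_size   = 32 * 1024 * 1024    # 32 MB, as in the example above
total_input_size = 10_000 * 100 * 1024 # e.g. 10,000 files of ~100 KB each

approx_map_tasks = (total_input_size.to_f / max_split_size).ceil
puts approx_map_tasks
```

Without the combined input format, the same 10,000 files would produce 10,000 map tasks.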
A common MapReduce pattern when you need to count uniques is secondary sort, which can be quite a pain to implement. Humboldt makes it really easy: all you need to do is say which indexes of the key to partition and group by.

Example usage:
```ruby
Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    # ...
    secondary_sort 0, 10
    # ...
  end
end
```
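To illustrate what partitioning and grouping on a slice of the key achieves, here is a plain-Ruby sketch (hypothetical keys and values; this shows the semantics, not how Hadoop executes it). Assuming the two arguments are the start index and length of the key slice, records sharing `key[0, 10]` land in the same reducer group, and within each group values arrive ordered by the full key:

```ruby
# Plain-Ruby illustration of secondary sort semantics: group by the first
# 10 characters of the key, sort by the full key, so each group's values
# arrive ordered by the key's remainder (here, a date suffix).
records = [
  ['user-0001:20140103', 'c'],
  ['user-0002:20140101', 'x'],
  ['user-0001:20140101', 'a'],
  ['user-0001:20140102', 'b'],
]

grouped = records
  .sort_by { |key, _| key }         # sort by the full key
  .group_by { |key, _| key[0, 10] } # group by key[0, 10], like secondary_sort 0, 10

grouped.each do |group_key, pairs|
  puts "#{group_key} #{pairs.map(&:last).join(', ')}"
end
```

In a real job the sort happens in Hadoop's shuffle phase, so no group ever has to be held in memory the way this sketch does.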
See the API documentation for `Rubydoop::JobDefinition#secondary_sort` for more information on how to use it.
Download Hadoop and set up the classpath using

```shell
$ rake setup
```

The default is Hadoop 1.0.3. Specify another Hadoop release by setting `$HADOOP_RELEASE`, e.g.

```shell
$ HADOOP_RELEASE=hadoop-2.2.0/hadoop-2.2.0 rake setup
```
Run the tests with

```shell
$ rake spec
```
Bump the version number in `lib/humboldt/version.rb`, then run `rake gem:release`.
© 2014 Burt AB, see LICENSE.txt (BSD 3-Clause).