Humboldt provides a tool-set on top of Rubydoop to run Hadoop jobs effortlessly both locally and on Amazon EMR. There is also some sugar added on top of the Rubydoop DSL.
Humboldt adds a number of type converters:

- `binary`: `String` to/from `Hadoop::Io::BytesWritable`
- `encoded`: `String` to/from a MessagePack-encoded `Hadoop::Io::BytesWritable`
- `text`: `String` to/from `Hadoop::Io::Text`
- `json`: `Hash` to/from `Hadoop::Io::Text`
- `long`: `Integer` to/from `Hadoop::Io::LongWritable`
- `none`: `nil` to/from `Hadoop::Io::NullWritable`
Use them like so:
```ruby
class Mapper < Humboldt::Mapper
  input :long, :json
  # ...
end
```
Hadoop does not perform well with many small input files, since by default each file is handled by its own map task. Humboldt bundles an input format that combines multiple files into each split. Due to a bug in Hadoop 1.0.3 (and other versions), Hadoop 2.2.0 is required to use this input format with input files on S3, see this bug.
Example usage:
```ruby
Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    input input_paths, format: :combined_text
    set 'mapreduce.input.fileinputformat.split.maxsize', 32 * 1024 * 1024
    # ...
  end
end
```
`mapreduce.input.fileinputformat.split.maxsize` controls the maximum size of an input split.
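As a rough back-of-the-envelope (an assumption for illustration, not a guarantee of how Hadoop schedules tasks): with the combined input format, the number of map tasks is approximately the total input size divided by the configured maximum split size, so many small files collapse into far fewer tasks.

```ruby
# Approximate map task count under the combined input format.
max_split_size   = 32 * 1024 * 1024    # 32 MB, as in the example above
total_input_size = 10_000 * 100 * 1024 # e.g. 10,000 files of ~100 KB each

approx_map_tasks = (total_input_size.to_f / max_split_size).ceil
puts approx_map_tasks
```

Without the combined input format, the same 10,000 files would produce 10,000 map tasks.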
A common MapReduce pattern when you need to count uniques is secondary sort, which can be quite a pain to implement. Humboldt makes it really easy: all you need to do is say which indexes of the key to partition and group by.

Example usage:
```ruby
Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    # ...
    secondary_sort 0, 10
    # ...
  end
end
```
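To illustrate what partitioning and grouping on a slice of the key achieves, here is a plain-Ruby sketch (hypothetical keys and values; this shows the semantics, not how Hadoop executes it). Assuming the two arguments are the start index and length of the key slice, records sharing `key[0, 10]` land in the same reducer group, and within each group values arrive ordered by the full key:

```ruby
# Plain-Ruby illustration of secondary sort semantics: group by the first
# 10 characters of the key, sort by the full key, so each group's values
# arrive ordered by the key's remainder (here, a date suffix).
records = [
  ['user-0001:20140103', 'c'],
  ['user-0002:20140101', 'x'],
  ['user-0001:20140101', 'a'],
  ['user-0001:20140102', 'b'],
]

grouped = records
  .sort_by { |key, _| key }         # sort by the full key
  .group_by { |key, _| key[0, 10] } # group by key[0, 10], like secondary_sort 0, 10

grouped.each do |group_key, pairs|
  puts "#{group_key} #{pairs.map(&:last).join(', ')}"
end
```

In a real job the sort happens in Hadoop's shuffle phase, so no group ever has to be held in memory the way this sketch does.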
See the API documentation for `Rubydoop::JobDefinition#secondary_sort` for more information on how to use it.
Download Hadoop and set up the classpath using

```shell
$ rake setup
```

The default is Hadoop 1.0.3. Specify another Hadoop release by setting `$HADOOP_RELEASE`, e.g.

```shell
$ HADOOP_RELEASE=hadoop-2.2.0/hadoop-2.2.0 rake setup
```
Run the tests with

```shell
$ rake spec
```
Bump the version number in `lib/humboldt/version.rb`, then run `rake gem:release`.
© 2014 Burt AB, see LICENSE.txt (BSD 3-Clause).