This branch is 36 commits behind mrwalker:master

Reapply changeset to remove Cascading/Hadoop home dependencies, but this time correctly identify the JRuby 1.7 issue as a classpath problem.

Rather unimaginative solution: just set the classpath at the point of calling each app controlled by Rake (test, spec, samples).  This demonstrates a variety of ways of setting up the classpath for use by a JVM that will run c.j code.
latest commit dfb341bca5
Matt Walker authored
lib Reapply changeset to remove Cascading/Hadoop home dependencies, but t…
samples Add group_by and unique samples
spec Make changes necessary to work on JRuby 1.7.0 (used by Travis CI)
tasks Reapply changeset to remove Cascading/Hadoop home dependencies, but t…
test Make changes necessary to work on JRuby 1.7.0 (used by Travis CI)
.gitignore Remove the majority of unnecessary code in the gem, mostly in tasks
.travis.yml Travis CI integration
Gemfile Add Bundler integration, particularly for tests
Gemfile.lock Add Bundler integration, particularly for tests
LICENSE.txt Add license file (LGPL)
README.md Simplify and expand README. Plan is to add documentation on the githu…
Rakefile Clarify clean target for samples
build.xml Offline property is no longer necessary since you can just touch buil…
cascading.jruby.gemspec
ivy.xml Upgrade Cascading 2.0.0 wip-286 -> 2.0.0 (final)
ivysettings.xml Complete replace Ant build script with something sane that uses Ivy t…

README.md

Cascading.JRuby

cascading.jruby is a DSL for Cascading, which is a dataflow API written in Java. With cascading.jruby, Ruby programmers can rapidly script efficient MapReduce jobs for Hadoop.

To give you a quick idea of what a cascading.jruby job looks like, here's word count:

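require 'rubygems'
require 'cascading'

# The input path is the first command-line argument; fail fast without it.
input_path = ARGV.shift || (raise 'input_path required')

# :mode => :local runs the job in Cascading's local mode rather than on Hadoop.
cascade 'wordcount', :mode => :local do
  flow 'wordcount' do
    source 'input', tap(input_path)

    assembly 'input' do
      # Split each 'line' into one row per 'word', keeping only the 'word' field.
      split_rows 'line', 'word', :pattern => /[.,]*\s+/, :output => 'word'
      group_by 'word' do
        count
      end
    end

    # The sink shares the assembly's name, 'input'.
    sink 'input', tap('output/wordcount', :sink_mode => :replace)
  end
end.complete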
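
If the dataflow is hard to picture, the same computation can be sketched in plain Ruby with no Cascading involved. This is only an illustration of what the job computes, not how cascading.jruby runs it: `word_count` and `PATTERN` are hypothetical names that are not part of the gem, and the sketch works on an in-memory array of lines instead of a tap.

```ruby
# Plain-Ruby illustration only; nothing here is part of cascading.jruby.
# PATTERN is the same regular expression passed to split_rows above.
PATTERN = /[.,]*\s+/

# Hypothetical helper: counts occurrences of each word in an array of lines.
def word_count(lines)
  counts = Hash.new(0)
  lines.each do |line|
    # Mirrors split_rows: one word per split of the line.
    line.split(PATTERN).each do |word|
      # Mirrors group_by 'word' followed by count.
      counts[word] += 1
    end
  end
  counts
end

counts = word_count(['the quick fox', 'the lazy dog, the end'])
counts.each { |word, n| puts "#{word}\t#{n}" }
```

Unlike this sketch, the cascading.jruby job only describes the flow; the counting itself is carried out by Cascading when `complete` is called.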

cascading.jruby provides a clean Ruby interface to Cascading, but doesn't attempt to add abstractions on top of it. Therefore, you should be acquainted with the Cascading API before you begin.

For operations you can apply to your dataflow within a pipe assembly, see the Assembly class. For operations available within a block passed to a group_by, union, or join, see the Aggregations class.

Note that the Ruby code you write merely constructs a Cascading job, so no JRuby runtime is required on your cluster. This stands in contrast with writing Hadoop streaming jobs in Ruby. To run cascading.jruby applications on a Hadoop cluster, you must use Jading to package them into a job jar.

cascading.jruby has been tested on JRuby versions 1.2.0, 1.4.0, 1.5.3, 1.6.5, and 1.6.7.2.
