Scrunch jobs can be launched from a REPL. #42

lockoff commented Jun 28, 2012

This commit modifies the Scrunch project so that Scrunch jobs can be run from
a Scala REPL. To get a REPL capable of launching Scrunch jobs, build Scrunch
with `mvn package` and run `bin/scrunch` from the resulting distribution
directory. Several changes were made to the project to accomplish this:

  1. The project has been modified to produce a release distribution. Maven
    creates the distribution, as both a folder and a tarball, when `mvn package`
    is run. The distribution folder contains a bin dir that holds scripts, a lib
    dir that holds all library jars, and a log dir that holds a log4j
    configuration file.
  2. A modified Scala REPL was added to the project. An object, InterpreterRunner,
    was created to launch it; it is a modification of Scala's MainGenericRunner.
    Unlike the original, the Scrunch version lets client code determine whether a
    REPL is actually running, and it includes methods for creating a jar from the
    code compiled from REPL input. A script named "scrunch" was added to the
    project that, when run, launches this modified Scala REPL. The script is a
    modification of the script distributed with Scala that launches the Scala REPL.
  3. Scrunch's Pipeline class was modified so that every MapReduce pipeline it
    constructs automatically adds the Scrunch lib jars to the job's Distributed
    Cache and to the classpath of the job's tasks.
  4. Methods on PCollection/PTable/etc. that result in a job being launched were
    modified to check if the REPL is running and, if so, create a jar of code
    compiled from REPL input and ship that jar with the job so that it's on the
    classpath of run tasks.
  5. To facilitate extensions, the From/To/At objects were changed to traits, and
    identically named singleton objects that extend those traits were created.
  6. The examples in the examples directory, and the script scrunch.py for running
    those examples, are now included in the project distribution. The scrunch.py
    script was renamed to scrunch-job.py and modified to work with the new
    distribution structure and to take advantage of the fact that Scrunch lib jars
    are now automatically added to the classpath of the jobs it runs.
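
To illustrate the trait-plus-singleton arrangement described in item 5, here is a minimal sketch. `SourceSpec`, `textFile`, `csvFile`, and `MyFrom` are hypothetical names chosen for illustration and are not taken from the Scrunch code; only the shape of the pattern is the point.

```scala
// Hypothetical stand-in for a source description; not the real Scrunch type.
case class SourceSpec(format: String, path: String)

// The factory methods live in a trait so that other code can inherit them.
trait From {
  def textFile(path: String): SourceSpec = SourceSpec("text", path)
}

// A singleton with the same name keeps call sites such as
// From.textFile(...) working exactly as before.
object From extends From

// An extension mixes in the trait and adds its own factory methods
// alongside the inherited ones.
trait MyFrom extends From {
  def csvFile(path: String): SourceSpec = SourceSpec("csv", path)
}

object MyFrom extends MyFrom
```

With this arrangement, `MyFrom.textFile("/in")` and `From.textFile("/in")` resolve to the same inherited method, while `MyFrom.csvFile(...)` is available only on the extension. When trying this in a Scala REPL, each trait and its same-named object need to be entered together (for example with `:paste`).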

I started an integration test for actually launching jobs, but the MiniMRCluster
testing framework does not behave properly when jars are added to the
Distributed Cache; the problem is related to MAPREDUCE-2884. I have verified
that jobs can be launched from the REPL on an actual cluster.
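
For readers unfamiliar with the jar-packaging step mentioned in item 4, the following is a rough sketch of building a jar from a directory of compiled classes using only the JDK's `java.util.jar` classes. It assumes the compiled classes are available as files on disk; `ReplJar`, `listFiles`, and `createJar` are hypothetical names, and this is not the code in this commit.

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.jar.{JarEntry, JarOutputStream}

// Hypothetical helper, for illustration only.
object ReplJar {
  /** Recursively lists the regular files under `dir`. */
  def listFiles(dir: File): List[File] = {
    val children = Option(dir.listFiles).map(_.toList).getOrElse(Nil)
    children.flatMap(f => if (f.isDirectory) listFiles(f) else List(f))
  }

  /** Writes every file under `classDir` into `jarFile`, naming each
    * entry by its path relative to `classDir`. */
  def createJar(classDir: File, jarFile: File): Unit = {
    val out = new JarOutputStream(new FileOutputStream(jarFile))
    try {
      val base = classDir.toURI
      for (file <- listFiles(classDir)) {
        // Jar entry names use '/' separators; URI.relativize yields that form.
        val name = base.relativize(file.toURI).getPath
        out.putNextEntry(new JarEntry(name))
        val in = new FileInputStream(file)
        try {
          val buffer = new Array[Byte](4096)
          var n = in.read(buffer)
          while (n != -1) {
            out.write(buffer, 0, n)
            n = in.read(buffer)
          }
        } finally in.close()
        out.closeEntry()
      }
    } finally out.close()
  }
}
```

A call such as `ReplJar.createJar(new File("classes"), new File("repl-code.jar"))` would produce a jar that could then be shipped with a job; the directory and jar names here are placeholders.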

@kiyan kiyan Scrunch jobs can be launched from a REPL.
6bcf329
Contributor

jwills commented Jul 11, 2012

Okay to revert this one, I think.

Contributor

jwills commented Aug 14, 2012

Integrated into Apache Crunch (incubating).

jwills closed this Aug 14, 2012
