Scalding Workshop/Tutorial README
About this Workshop/Tutorial
This session is a half-day tutorial on Scalding and its place in the Hadoop ecosystem. Scalding is a Scala API developed at Twitter for distributed data programming that uses the Cascading Java API, which in turn sits on top of Hadoop's Java API. However, Scalding, through Cascading, also offers a local mode that makes it easy to run jobs without using the Hadoop libraries, for simpler testing and learning. We'll use this feature for most of this session.
We use sbt, the de facto Scala build tool, to resolve dependencies (such as the Scalding and Cascading jars), and to compile the one Hadoop example (but not the rest of the exercises...). You will need to install Git, Java, Scala, and sbt for this workshop, as we discuss next.
Please do the following installation steps before the workshop!
It helps to pick a work directory where you will install some of the packages. In what follows, we'll assume you're using $HOME/fun on Linux, Mac OSX, or Cygwin for Windows with the bash shell (or a similar shell), or C:\fun on Windows.
Once Git is installed, clone this workshop from GitHub. Use your favorite Git GUI or the command line. Using bash:
    cd $HOME/fun
    git clone git://github.com/deanwampler/scalding-workshop.git
On Windows:
    cd C:\fun
    git clone git://github.com/deanwampler/scalding-workshop.git
Java v1.7 or Better
If it's not already installed, install Java from java.com.
Scala v2.11.7 (or v2.10.6)
We'll use a build of Scalding for Scala v2.11.7 (although you can also use Scala v2.10.6). Install Scala following the instructions here.
See the sbt website for installation instructions. What you actually install is a driver Java program. The actual version of sbt used will be bootstrapped for the project...
Setting Up The Project and a Sanity Check
Once you've completed these steps, we need to "bootstrap" the project with sbt and then run a "sanity check" script, our exercise 0.
The first of the following three commands changes to the root directory of the workshop. (We'll spend the whole session working in this directory.) The second command runs sbt to create an "assembly" (an all-inclusive jar file with all the dependent jars we need included - well, most of them...). Finally, the third command runs the sanity check script. We'll run it using a Scala script called run in the root directory of the project, which we'll use for all the exercises.
Using bash (assuming you installed the workshop in $HOME/fun):
    cd $HOME/fun/scalding-workshop
    sbt assembly
    ./run scripts/SanityCheck0.scala
On Windows (assuming you installed the workshop in C:\fun):
    cd C:\fun\scalding-workshop
    sbt assembly
    scala run scripts/SanityCheck0.scala
The commands should run without error. If you get an error like sbt not found or scala not found, make sure these tools are on your command "path".
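If you'd rather check the "path" before running anything, here is a minimal sketch that reports which of the required tools the shell cannot find. The function name missing_tools is our own invention for this illustration; it is not part of the workshop.

```shell
#!/usr/bin/env bash
# Print the name of each given tool that cannot be found on the PATH,
# one per line. Prints nothing if every tool is found.
# (missing_tools is a hypothetical helper, not part of the workshop.)
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` succeeds only when the shell can resolve the name.
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# The four tools this README asks you to install.
missing=$(missing_tools git java scala sbt)
if [ -z "$missing" ]; then
  echo "All required tools are on the PATH."
else
  echo "Not found on the PATH:"
  echo "$missing"
fi
```

Anything listed as not found needs to be installed, or its bin directory added to your PATH, before the commands above will work.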
The sbt assembly command first runs an update task, which downloads all the dependencies, using the specification in project/Build.scala. You'll see lots of messages as it tries different repositories. Note that these dependencies will be downloaded to your $HOME/.ivy2 directory (on *nix systems). This may take a while to run!
The assembly task builds an all-inclusive "jar" (Java ARchive) file that includes all the dependencies, including Scalding and Hadoop. This jar file makes it easier to run Scalding scripts on Hadoop, because it simplifies working with dependency jars and the CLASSPATH. The output of
X.Y.Z will be the current version number for the workshop.
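To see what the build produced, the following sketch lists jar files below a directory, largest first. It assumes that sbt writes its build output under target/ in the project root, which is sbt's default; the helper name list_jars is hypothetical.

```shell
#!/usr/bin/env bash
# List jar files below a directory, largest first, with sizes in kilobytes.
# (list_jars is a hypothetical helper, not part of the workshop.)
list_jars() {
  find "$1" -type f -name '*.jar' -exec du -k {} + | sort -rn
}

# Assumption: sbt puts build output under target/ (its default location).
if [ -d target ]; then
  list_jars target
else
  echo "No target directory here; run sbt assembly from the workshop root first."
fi
```

The all-inclusive jar should stand out as by far the largest file in the list, because it bundles the dependencies.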
For completeness, note also that the version of sbt itself is specified in project/build.properties. There is also a project/plugins.sbt file that specifies some sbt plugins we use.
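As a quick illustration, this sketch prints the sbt version that a project pins. It relies on the standard sbt.version key that sbt reads from project/build.properties; the function name sbt_version is our own.

```shell
#!/usr/bin/env bash
# Print the value of the sbt.version key from a properties file.
# (sbt_version is a hypothetical helper, not part of the workshop.)
sbt_version() {
  # Drop the key and any whitespace around "=", then print what is left.
  sed -n 's/^[[:space:]]*sbt\.version[[:space:]]*=[[:space:]]*//p' "$1"
}

props=project/build.properties
if [ -f "$props" ]; then
  echo "sbt version for this project: $(sbt_version "$props")"
else
  echo "Run this from the workshop root; $props was not found."
fi
```

This is the version the sbt driver program bootstraps, regardless of which driver version you installed.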
The run Scala script takes a moment to compile the Scalding script and then run it. The output is written to output/SanityCheck0.txt. (What's in that file?)
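One way to peek at the result is sketched below. It assumes that a script named scripts/Name.scala writes to output/Name.txt. This README only confirms that pattern for SanityCheck0, so treat the mapping as a guess for other exercises. The helper name output_for is hypothetical.

```shell
#!/usr/bin/env bash
# Map a script path to its presumed output file.
# Assumption: scripts/Name.scala writes output/Name.txt; this README only
# confirms it for SanityCheck0. (output_for is a hypothetical helper.)
output_for() {
  echo "output/$(basename "$1" .scala).txt"
}

out=$(output_for scripts/SanityCheck0.scala)
if [ -s "$out" ]; then
  # Show just the first few lines of the result.
  head -n 5 "$out"
else
  echo "No output yet at $out; run the sanity check first."
fi
```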
If you have Ruby installed on your system, there is a port of run in Ruby called run.rb. To use it, just replace the run command above with run.rb for the *nix bash shell, or, for Windows, use ruby run.rb instead of scala run.
See the Appendix below for "optional installs". If you decide to use Scalding after the tutorial, you'll want to install some of these packages.
NOTE: There is now an interpreter "shell" mode available for Scalding. See the Scalding README for details.
You can now start with the workshop itself. Go to the companion Workshop page.
Notes on Releases
Upgraded to Scala v2.11.7, with optional support for v2.10.6, SBT 0.13.9, and upgraded dependencies like Algebird. However, the newer features of Scalding, like the Typed API and the REPL/shell, haven't been adopted. Pull requests welcome!
Moved to Scala v2.10.3 and Scalding v0.9.0rc4. Refined some of the exercises and added one that uses Scalding's newer "type-safe" API.
Moved to Scala v2.10.2 and Scalding v0.8.6. Completely reworked the build process and the script running process. Refined many of the exercises.
Added a file missing from distribution. Refined the run scripts to work better with different Java versions.
Refined several exercises and fixed bugs. Added a Makefile for building releases. (Since removed...)
First release for the StrangeLoop 2012 workshop.
For Further Information
I'm Dean Wampler from Lightbend. I prepared this workshop. Send me email with questions about the workshop or for information about consulting and training on Scala, Scalding, the Lightbend Reactive Platform, and other Hadoop and Big Data technologies.
Some of the data used in these exercises was obtained from InfoChimps.
NOTE: The first version of this workshop was written while I worked at Think Big Analytics. The original and now obsolete fork of the workshop is here.
Appendix - Optional Installs
If you're serious about using Scalding, you should clone and build the Scalding repo itself. We'll talk briefly about it in the workshop, but it isn't required.
Scalding from GitHub
Clone Scalding from GitHub. Using bash and assuming you'll clone it into $HOME/fun:
    cd $HOME/fun
    git clone https://github.com/twitter/scalding.git
Windows is similar.
Ruby v1.8.7 or v1.9.X
Scalding uses Ruby as a platform-independent language for driver scripts (e.g., their scripts/scald.rb). See ruby-lang.org for details on installing Ruby. Either version 1.8.7 or 1.9.X will work.
Build Scalding according to its Getting Started page. By default, Twitter builds with Scala v2.9.3, but Scalding also builds with 2.10.2, and the project/Build.scala file can be edited for this version.
Edit project/Build.scala. Near the top, you'll see a line scalaVersion := 2.9.2 and next to it, a commented line for version 2.10.0. Comment out the line with 2.9.2 and uncomment the 2.10.0 line, then change the last zero to "2" or "3". Save your changes.
Now, here is a synopsis of the build steps. Using bash:
    cd $HOME/fun/scalding
    sbt update
    sbt assembly
On Windows:
    cd C:\fun\scalding
    sbt update
    sbt assembly
(The Getting Started page says to build the
test target between
assembly, but the later builds
Once you've built Scalding, run the following command as a sanity check to ensure everything is set up properly. Using bash:
    cd $HOME/fun/scalding
    scripts/scald.rb --local tutorial/Tutorial0.scala
On Windows:
    cd C:\fun\scalding
    ruby scripts\scald.rb --local tutorial/Tutorial0.scala