%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Scalding (Scala on Cascading)

Refer to https://github.com/twitter/scalding for the latest and greatest in Scalding documentation.


### Contents

1. [Background](#background)
1. [The Scalding API](#api)
1. [The canonical example](#countingwords)
1. [Tools set-up](#setup)
1. [A big data exercise](#exercise)

<span id='background'></span>

## Frameworks on top of MapReduce

By now you've been exposed to two distributed computing frameworks: Hadoop MapReduce and Apache Spark. Spark showed us how "nice" programming around big data can be: you express your algorithms with primitives and functions that let you think more abstractly than "key, value, shuffle, map-and-reduce-only." However, there are still many situations in which we would like the cluster to run MapReduce under the hood without sacrificing convenient utilities like joins.

Thankfully, several APIs exist on top of MapReduce that allow exactly this: [Pig](https://pig.apache.org/docs/latest/basic.html), [Crunch](https://crunch.apache.org), and [Cascading](http://www.cascading.org) are the most popular; they each offer benefits and tradeoffs over one another. 

We'll introduce Scalding, the Scala DSL of Cascading, primarily because:  
1. we've already been introduced to Scala via Spark,  
1. Scalding has wide usage in sophisticated production settings among a number of recognizable companies,  
1. it has an advantage over the UDFs and SQL-like operations of Pig: using Pig in production requires maintaining two codebases (one for the Pig operations, the other in Python/Java to call them),
1. for a comparison to Crunch, see [What is the main difference between Crunch and Cascading?](#crunch_difference)

<span id='api'></span>


## Scalding API primer


Scalding's **functions**, i.e. the "verbs" of the DSL, can be divided into four types, plus a handful of miscellaneous functions:
- Map-like functions
- Grouping functions
- Group/Reduce functions
- Join operations

Scalding's key **objects**, i.e. the "nouns":
- `TypedPipe[T]` : distributed list of objects of type T
- `KeyedList[K,V]` : represents some sharding of objects of key K and value V
    - `Grouped[K,V]` : usual groupings
    - `CoGrouped[K,V]` : co-groupings / joins

**APIs**  
- Fields-based API (legacy)
- Type-safe API (use this!)
- Matrix API 

<span id='countingwords'></span>


## MapReduce paradigm in Scalding  


#### The "Scalding" way to approach the word count algorithm:

- Have a distributed list of objects of type `String`, i.e. a `TypedPipe[String]` resulting from reading in a text file  
- From the `TypedPipe`, generate a flattened list of words. Split tokens on one or more whitespace characters, i.e. with regex: `\\s+`  
- Group this list of words by word  
- Count the number of occurrences of each word
- Write tab-separated word (String), count (Long) to file  
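
As a rough illustration of the steps above, here is a minimal in-memory sketch in plain Python (not Scalding). The input `lines` is hypothetical sample data standing in for the contents of a text file, and `Counter` plays the role of the group-and-count steps; nothing here is distributed.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the lines of a text file
# (the role a TypedPipe[String] plays in Scalding).
lines = ["hello big data", "hello  Scalding", "big hello"]

# "flatMap": split each line on runs of whitespace and flatten into one list of words.
words = list(chain.from_iterable(re.split(r"\s+", line.strip()) for line in lines))

# "group" + "count": Counter groups equal words and counts the occurrences of each.
counts = Counter(words)

# "write": build tab-separated (word, count) rows, sorted by word for a stable order.
rows = [f"{word}\t{n}" for word, n in sorted(counts.items())]
print("\n".join(rows))
```

The Scalding versions that follow express the same pipeline, but each step runs as part of a MapReduce job rather than on a local list.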


#### Script that does the above

```scala
import com.twitter.scalding._

val lines : TypedPipe[String] = TypedPipe.from(TextLine("hello.txt"))

// Write a word count to file
lines.flatMap(_.split("\\s+"))
    .map { word => (word, 1L) } // pair each word with a count of 1; group needs (key, value) tuples
    .group
    .sum
    .write(TypedTsv[(String, Long)]("output"))
```    
    

#### Scala class (that is also stylistically nicer) that does the above:

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => tokenize(line) }
    .groupBy { word => word } // use each word for a key
    .size // in each group, get the size
    .write(TypedTsv[(String, Long)](args("output")))

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
```

<span id='setup'></span>

## Installing and using Scalding locally


```bash
$ git clone https://github.com/twitter/scalding.git
$ cd scalding
$ ./sbt update
$ ./sbt test     # runs the tests; if you do 'sbt assembly' below, these tests, which are long, are repeated
$ ./sbt assembly # creates a fat jar with all dependencies, which is useful when using the scald.rb script
```

Also add a `SCALDING_HOME` variable to your `~/.bash_profile` pointing to the Scalding directory.


#### Scalding REPL

To open the REPL:
`$ $SCALDING_HOME/scripts/scald.rb --repl --local`

Walk through this REPL example at https://gist.github.com/johnynek/a47699caa62f4f38a3e2 to get a feel for the Scalding programming paradigms.

### Running your Scalding code as an app


`$ scald.rb --local {YourApp.scala}`

Your `build.sbt` should look something like this:  
```scala
name := "Your App Name" // replace with your app's name
version := "1.0"
scalaVersion := "2.10.4"

resolvers += "Concurrent Maven Repo" at "http://conjars.org/repo"

// ++= appends a whole Seq of dependencies (+= would append a single one).
// Note: the _2.9.2 suffixes below name the Scala version those artifacts were
// built for; they should match scalaVersion above.
libraryDependencies ++= Seq(
    "cascading" % "cascading-core" % "2.0.2",
    "cascading" % "cascading-local" % "2.0.2",
    "cascading" % "cascading-hadoop" % "2.0.2",
    "cascading.kryo" % "cascading.kryo" % "0.4.4",
    "com.twitter" % "meat-locker" % "0.3.0",
    "com.twitter" % "maple" % "0.2.2",
    "commons-lang" % "commons-lang" % "2.4",
    "com.twitter" % "scalding_2.9.2" % "0.7.3",
    "org.specs2" % "specs2_2.9.2" % "1.12.1"
    )
```


## Using Scalding on Amazon EMR 


Because Scalding is Hadoop MapReduce under the hood, submitting a `jar` of Scalding code works the same way as submitting a `jar` of plain Java MapReduce code; unlike with Spark, no special submission instructions are needed. To build your project, run `sbt assembly` in the root directory of your project.

You'll fire up an EMR cluster as per usual (use `./scripts/launch_emr.sh` as a guide) and submit a step either during the creation or after the cluster is running through the EMR Console or the AWS CLI.

Official EMR documentation is here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html

TL;DR the command to create a cluster and add a step should be something like:
```bash
aws emr create-cluster --name "Scalding job cluster" --ami-version 3.6 \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--bootstrap-actions Path=pathtobootstrapscript,Name="CascadingSDK" \
--steps Type="CUSTOM_JAR",Name="Scalding Step",ActionOnFailure=CONTINUE,Jar=s3://pathtojarfile,\
Args=["-input","s3://pathtoinputdata","-output","s3://pathtooutputbucket","arg1","arg2"]
```

To add to a running cluster:
```bash
aws emr add-steps --cluster-id j-XXXXX --steps Type="CUSTOM_JAR",Name="Scalding Step",ActionOnFailure=CONTINUE,Jar=s3://pathtojarfile,\
Args=["-input","s3://pathtoinputdata","-output","s3://pathtooutputbucket","arg1","arg2"]
```

<span id='exercise'></span>

### Exercise: Analyzing Wikipedia Traffic in 2008


To tackle this exercise, you can use your choice of big-/small-data tools since the data is on `s3` and you can therefore fire up an EC2/EMR cluster installed with the tools of your choice. We suggest using either Spark or Scalding. In both cases, you should be able to complete all the questions using just that one tool (i.e. both the questions that require big-data-munging and those that can be answered in-memory); however you may also choose to use a mix of Python Pandas for the in-memory-size data (e.g. for small time series calculations) after ETLing with one of the big data frameworks. 

If you answer all the questions in the exercise: congratulations! You'll have completed a full data scientist workflow using "production-like" data. You'll get a feel for how the workflow for big data differs from small data. You'll also gain an interesting talking point to potential employers.


### About the data set


The English-language data set is located at `s3://dataincubator-course/wikidata/wikistats/en_pagecounts` and is derived from the [publicly available data sets](https://aws.amazon.com/datasets/) hosted on AWS. It's about 20 GB compressed and covers the period 10/1/08 through 12/8/08.

(If you want to look at the entire data set, i.e. not just the English-language pages, the full `wikistats` `pagecount` data set is at `s3://dataincubator-course/wikidata/wikistats/pp_pagecounts/`; it is ~75 GB and covers the period 10/1/08 through 12/8/08.)
 
Each log record has 4 space-separated fields (`yearmonthdayhour`, `projectcode`, `pagename`, `pageviews`), for example:
```
2008120701 en Still_Breathing 2
2008120701 en Still_Climbing_%28album%29 1
2008120701 en Still_Creepin_On_Ah_Come_Up 1
2008120701 en Still_Life_with_Spherical_Mirror 2
2008120701 en Stillwater_Mining_Company 2
```
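
To make the record format concrete, here is a small parsing sketch in Python. The `PageCount` type and `parse_line` helper are hypothetical names introduced for illustration. Decoding percent-escapes in `pagename` is an assumption about what is convenient for analysis, and the sketch does no handling of malformed lines, which real logs may contain.

```python
from datetime import datetime
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    """One parsed log record (hypothetical helper type)."""
    hour: datetime  # start of the hour the views were counted in
    project: str    # project code, e.g. "en"
    page: str       # page name with percent-escapes decoded
    views: int      # page views during that hour


def parse_line(line: str) -> PageCount:
    """Parse one 'yearmonthdayhour projectcode pagename pageviews' record."""
    stamp, project, page, views = line.split()
    return PageCount(datetime.strptime(stamp, "%Y%m%d%H"), project, unquote(page), int(views))


record = parse_line("2008120701 en Still_Climbing_%28album%29 1")
print(record)
```

The same field-splitting logic carries over to a `map` step in Spark or Scalding; only the container around the records changes.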


### Questions (for the English language data set only)

1. Give the 100 most popular pages by `pageviews` from 10/1/08-10/15/08 (inclusive), ordered by `pageviews` descending.

1. For (start date) to (end date): Visualize the distribution of pages' cumulative `pageviews` during this time period. Give the mean, median, standard deviation. What kind of distribution is it?

1. For the topic "Barack Obama": Look at the time series of daily page views from (start) to (end) date. Calculate the correlation with topic "Sarah Palin."

1. Bonus: How does the PageRank of a page predict its mean views over time?

1. For a sample of pairs of linked nodes, calculate the `pageview` correlations. How does the distribution of correlations compare to that of a randomly chosen sample of unlinked node pairs?
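
For the correlation questions, the in-memory part of the computation might look like the following sketch in plain Python. The records in `page_a` and `page_b` are made-up values, not real traffic data, and `daily_views` and `pearson` are hypothetical helper names. It assumes the heavy filtering has already been done by a big data framework and that the Pearson coefficient is the correlation of interest.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (yearmonthdayhour, pageviews) records for two pages; not real data.
page_a = [("2008100100", 5), ("2008100113", 7), ("2008100205", 20), ("2008100300", 9)]
page_b = [("2008100101", 2), ("2008100207", 6), ("2008100223", 4), ("2008100311", 3)]


def daily_views(records):
    """Sum hourly pageviews into a {yearmonthday: total} dict."""
    totals = defaultdict(int)
    for stamp, views in records:
        totals[stamp[:8]] += views  # the first 8 characters are yearmonthday
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


a, b = daily_views(page_a), daily_views(page_b)
days = sorted(set(a) & set(b))  # compare only days present in both series
r = pearson([a[d] for d in days], [b[d] for d in days])
print(days, round(r, 3))
```

Restricting to days present in both series is one possible choice; treating missing days as zero views is another, and the two can give different correlations.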



<span id='crunch_difference'></span>

**What is the main difference between Crunch and Cascading?**  
(Source: https://github.com/cloudera/crunch/wiki/Frequently-Asked-Questions)  

The main difference between Crunch and Cascading/Pig/Hive is in their data models. In Pig and Cascading, most operations in a pipeline are performed on collections of Tuples (here are the Javadocs for the Pig Tuple and the Cascading Tuple). In my answer to this question on Quora, I refer to this as the "single serializable type" (SST) model. Using the SST data model makes it much easier to implement common operations, and Pig and Cascading provide big libraries of built-in functions that are designed to operate on their respective Tuple types. They also provide APIs for developers to create their own user-defined functions that interact with their SST data models.

Crunch, like the FlumeJava library that it is based on, uses a data model that has multiple serializable types (MST). At each stage in a Crunch pipeline, you specify how the data from that stage should be serialized. The benefit of doing this is that it lets you verify at compile-time that each stage of your pipeline will actually receive the type of data it knows how to process. In this sense, the MST model for building pipelines is similar to using a statically typed language, and the SST model is similar to using a dynamically typed language. We feel that the MST model has definite benefits for MapReduce developers:

- Compile-time type verification lowers the probability that a type error will cause your MapReduce pipeline to fail when it actually runs on a Hadoop cluster.
- It makes user-defined functions in Crunch extremely easy to write, saving you from writing the boilerplate for type checking your data that you need to do in Pig or Cascading.
- It makes MapReduce tasks that run over complex data types, like binary data or the results of an HBase scan, much easier to write. There's no work required to map from the complex type onto the Cascading/Pig SST and back again.

Crunch's MST serialization model currently has two different implementations, one based on Writables and the other based on Avro records.

### Exit Tickets

1. Describe the differences between HDFS, Cascading, and Spark.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*