
# Apache Spark

Apache Spark is a framework for distributed computing that aims to make it
simpler to write programs that run in parallel across many nodes in a cluster
of computers.

It is designed from the ground up for high performance in iterative applications,
where the same data is accessed multiple times. This performance is
achieved primarily through caching datasets in memory, combined with low latency
and overhead to launch parallel computation tasks. Together with other features
such as fault tolerance, flexible distributed-memory data structures, and a powerful
functional API, Spark has proved to be broadly useful for a wide range of large-scale
data processing tasks, over and above machine learning and iterative analytics.

A Spark cluster is made up of two types of processes: a driver program and multiple
executors. In local mode, all of these processes run within the same JVM. On a
cluster, they usually run on separate nodes.


## Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run
computations and store data for your application. Next, it sends your application code (defined by JAR or
Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to
run.


<img src="../../images/spark_components.PNG">


## SparkContext and SparkConf

The starting point of writing any Spark program is SparkContext (or
JavaSparkContext in Java). SparkContext is initialized with an instance of a
SparkConf object, which contains various Spark cluster-configuration settings (for
example, the URL of the master node). 

Once initialized, we will use the various methods found in the SparkContext object
to create and manipulate distributed datasets and shared variables. The Spark shell
(available for Scala and Python, but unfortunately not for Java) takes care
of this context initialization for us.



## Resilient Distributed Datasets

The core of Spark is a concept called the Resilient Distributed Dataset (RDD).
An RDD is a collection of "records" (strictly speaking, objects of some type) that is
distributed or partitioned across many nodes in a cluster (for the purposes of the
Spark local mode, the single multithreaded process can be thought of in the same
way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails
(for some reason other than erroneous user code, such as hardware failure, loss of
communication, and so on), the RDD can be reconstructed automatically on the
remaining nodes and the job will still complete.
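
To make partitioning and recovery more concrete, here is a minimal plain-Python sketch (no Spark involved). It assumes a simplified picture in which a lost partition is rebuilt by re-running the operation that produced it on the corresponding source partition. The helper names `partition` and `square` are made up for illustration and are not part of the Spark API.

```python
# Plain-Python illustration only; this is not how Spark is implemented.

def partition(records, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices."""
    size, extra = divmod(len(records), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def square(x):
    return x * x


source = list(range(10))             # the original input records
source_parts = partition(source, 3)  # three partitions, one per hypothetical node

# Each derived partition depends on exactly one source partition.
derived = [[square(x) for x in part] for part in source_parts]

# Simulate losing partition 1, for example because its node failed.
derived[1] = None

# Recovery: recompute only the lost partition from its parent partition.
derived[1] = [square(x) for x in source_parts[1]]
```

Only the lost partition is recomputed; the other two partitions are left untouched.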


### Creating RDDs

RDDs can be created from existing collections. For example, in the Scala Spark shell:
    
    val collection = List("a", "b", "c", "d", "e")
    val rddFromCollection = sc.parallelize(collection)
    
RDDs can also be created from Hadoop-based input sources, including the local
filesystem, HDFS, and Amazon S3. The following
code is an example of creating an RDD from a text file located on the local filesystem:
    
    val rddFromTextFile = sc.textFile("LICENSE")

[Read this documentation](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html) on the AWS website for an overview of AWS EMR.

### Spark operations

Once we have created an RDD, we have a distributed collection of records that
we can manipulate. In Spark's programming model, operations are split into
transformations and actions. Generally speaking, a transformation operation applies
some function to all the records in the dataset, changing the records in some way.
An action typically runs some computation or aggregation operation and returns the
result to the driver program where SparkContext is running.
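
The split between transformations and actions can be sketched in plain Python. The class below, `TinyDataset`, is a hypothetical stand-in for an RDD and not part of Spark: its `map` and `filter` methods only record what should happen, while `collect` and `count` actually run the recorded steps and return a result.

```python
class TinyDataset:
    """Hypothetical stand-in for an RDD: records plus pending transformations."""

    def __init__(self, records, pending=()):
        self._records = records
        self._pending = tuple(pending)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return TinyDataset(self._records, self._pending + (("map", fn),))

    def filter(self, pred):
        return TinyDataset(self._records, self._pending + (("filter", pred),))

    def _run(self):
        """Apply the recorded transformations, in order, to the records."""
        data = iter(self._records)
        for kind, fn in self._pending:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the pipeline and hand a result back to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every value the mapped function actually sees


def double(x):
    calls.append(x)
    return x * 2


ds = TinyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before = list(calls)  # still empty: no action has been called yet
result = ds.collect()       # the action triggers the computation
```

After `collect()`, `result` holds the doubled values greater than 2, and `calls` shows that `double` ran only once the action was invoked.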

In the PySpark shell, the SparkContext is already available as the global object `sc`:

    sc

Output:

    <pyspark.context.SparkContext object at 0x7f6b05536e10>

It provides access to many of the underlying structures used by PySpark.

The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference. Since we are running Spark in shell mode (using PySpark), we can use the global context object sc for this purpose.

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

We’ll then create an RDD using sc.parallelize with 20 partitions, which will be distributed amongst the Spark Worker nodes, and verify the number of partitions in the RDD.

    rdd = sc.parallelize(range(1000), 20)
    rdd.getNumPartitions()

Output:

    20

We can take a look at the first five records using the take action.

    rdd.take(5)

Output:

    [0, 1, 2, 3, 4]

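Let’s now perform a few transformations on the RDD that bin each number down to the nearest lower multiple of 100 and count the frequency of each bin.

    rdd.map(lambda r: (round(r/100)*100, 1))\
      .reduceByKey(lambda x,y: x+y)\
      .collect()

Output:

    [(0.0, 100), (800.0, 100), (900.0, 100), (100.0, 100), (200.0, 100), (300.0, 100), (400.0, 100), (500.0, 100), (600.0, 100), (700.0, 100)]

The float keys come from Python 2, where `/` on two integers is floor division and `round` returns a float. On Python 3, `/` is true division, so use `r // 100 * 100` in the `map` to get the same bins.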
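
As a sanity check on the binning logic, the same computation can be written in plain Python without Spark. This is only a sketch of what the `map` and `reduceByKey` steps compute, not of how Spark distributes the work. It uses floor division (`//`), which behaves the same on Python 2 and Python 3, so the keys are integers here.

```python
from collections import defaultdict

records = range(1000)

# map step: emit a (bin, 1) pair per record; floor division rounds each
# number down to the nearest lower multiple of 100.
pairs = ((r // 100 * 100, 1) for r in records)

# reduceByKey step: combine all values that share a key using x + y.
counts = defaultdict(int)
for key, value in pairs:
    counts[key] = counts[key] + value

# Sort by key so the result has a stable order.
binned = sorted(counts.items())
```

Each of the ten bins (0, 100, ..., 900) ends up with a count of 100.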

### Working with Amazon S3, DataFrames and Spark SQL

Let’s now try to read some data from Amazon S3 using the Spark SQL Context. This makes parsing JSON files significantly easier than before. After the data is read in and parsed, the result is a Spark DataFrame. We can then register this as a table and run SQL queries off of it for simple analytics.
The data we will be working with is a subset of the reddit comments published and compiled by reddit user /u/Stuck_In_the_Matrix, in r/datasets. The current example processes the reddit comments collected in May 2015, which is roughly 30GB of data.


In this example we will calculate the number of distinct gilded authors and the average score of all the comments in each subreddit for the month of May, 2015. The results will then be ranked by the number of distinct gilded authors per subreddit and the average score of all the comments per subreddit.


    from pyspark.sql.types import *

Now read in all the comments from May, 2015 using the Spark SQL Context.

    fields = [StructField("archived", BooleanType(), True),
              StructField("author", StringType(), True),
              StructField("author_flair_css_class", StringType(), True),
              StructField("body", StringType(), True),
              StructField("controversiality", LongType(), True),
              StructField("created_utc", StringType(), True),
              StructField("distinguished", StringType(), True),
              StructField("downs", LongType(), True),
              StructField("edited", StringType(), True),
              StructField("gilded", LongType(), True),
              StructField("id", StringType(), True),
              StructField("link_id", StringType(), True),
              StructField("name", StringType(), True),
              StructField("parent_id", StringType(), True),
              StructField("retrieved_on", LongType(), True),
              StructField("score", LongType(), True),
              StructField("score_hidden", BooleanType(), True),
              StructField("subreddit", StringType(), True),
              StructField("subreddit_id", StringType(), True),
              StructField("ups", LongType(), True)]

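    from pyspark import StorageLevel

    # Read the comments with the explicit schema and persist the DataFrame.
    rawDF = sqlContext.read.json("s3n://reddit-comments/2015/RC_2015-05", StructType(fields)) \
                      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    # registerTempTable returns None, so call it separately to keep rawDF a DataFrame.
    rawDF.registerTempTable("comments")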

We first define the schema of the JSON file. Not defining it is also an option; however, Spark will then need to pass through the data twice:

1. once to infer the schema
2. once to parse the data into a Spark DataFrame


This can be very time consuming when datasets grow much larger. Since we know what the schema will be for this static dataset, it is in our best interest to define it beforehand. Allowing Spark to infer the schema is particularly useful, however, for scenarios when schemas change over time and fields are added or removed.
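
The cost of inference can be pictured with a small plain-Python sketch. It is an illustration of the two-pass idea only, not Spark's implementation: the sample lines and the helpers `infer_schema` and `parse` are made up for this example.

```python
import json

# Hypothetical sample standing in for a JSON-lines file.
lines = [
    '{"author": "alice", "score": 3}',
    '{"author": "bob", "score": 7, "gilded": 1}',
]


def infer_schema(json_lines):
    """Pass 1: scan every record to discover field names and Python types."""
    schema = {}
    for line in json_lines:
        for name, value in json.loads(line).items():
            schema.setdefault(name, type(value))
    return schema


def parse(json_lines, schema):
    """Build one row per record, filling fields missing from a record with None."""
    return [
        tuple(json.loads(line).get(name) for name in schema)
        for line in json_lines
    ]


# Without a schema: two full scans of the data.
inferred = infer_schema(lines)
rows_inferred = parse(lines, inferred)

# With a schema known up front: a single scan.
known = {"author": str, "score": int, "gilded": int}
rows_known = parse(lines, known)
```

Both routes produce the same rows, but the inferred route has to read every record twice, which is what becomes expensive at 30GB.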


Next the data is read from the public S3 reddit-comments bucket as a Spark DataFrame using <span style="color:#a5541a"><b>sqlContext.read.json("...")</b></span>. Manipulations on the Spark DataFrame are in most cases significantly more efficient than working with the core RDDs.


After reading in the data, we would also like to persist it into memory and disk for multiple uses later on with <span style="color:#a5541a"><b>.persist(StorageLevel.MEMORY_AND_DISK_SER)</b></span>. Choosing the memory and disk option permits Spark to gracefully spill the data to disk if it is too large for memory across all the Spark Worker nodes. 

Here we will be executing two queries on the dataset. The second query will be able to read directly from the persisted data instead of having to read in the entire dataset again.


Lastly the Spark DataFrame is registered as a table with <span style="color:#a5541a"><b>.registerTempTable("comments")</b></span>, so we can run SQL queries off of it. The table can then be referenced by the name "comments".


Let’s now run some SQL queries on the dataset to find the total number of distinct gilded authors and the average comment score per subreddit for this month.

    distinct_gilded_authors_by_subreddit = sqlContext.sql("""
        SELECT subreddit, COUNT(DISTINCT author) as authors
        FROM comments
        WHERE gilded > 0
        GROUP BY subreddit
        ORDER BY authors DESC
        """)

    average_score_by_subreddit = sqlContext.sql("""
        SELECT subreddit, AVG(score) as avg_score
        FROM comments
        GROUP BY subreddit
        ORDER BY avg_score DESC
        """)

Let’s take a look at the top 5 subreddits with the most gilded authors commenting and the highest average comment score. Note that every command on this dataset so far has been a transformation, and no data has actually flowed through up to this point. We have essentially been building a Directed Acyclic Graph (DAG) of the operations to perform on the data. Data only begins flowing through when an action such as .collect(), .take(), or .first() is called.

    distinct_gilded_authors_by_subreddit.take(5)

Output:

    [Row(subreddit=u'AskReddit', authors=2677), Row(subreddit=u'funny', authors=506), Row(subreddit=u'pics', authors=459), Row(subreddit=u'videos', authors=379), Row(subreddit=u'news', authors=355)]
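
The logic of the two queries can be checked locally on a tiny table before running them against the full dataset. The sketch below swaps Spark SQL for SQLite (Python's built-in `sqlite3` module) purely as a convenience, and the five sample rows are made up for illustration; they are not taken from the reddit data.

```python
import sqlite3

# In-memory SQLite table with only the columns the two queries touch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (subreddit TEXT, author TEXT, gilded INTEGER, score INTEGER)"
)

# Made-up sample rows: (subreddit, author, gilded, score).
sample = [
    ("pics", "alice", 1, 10),
    ("pics", "alice", 2, 4),
    ("pics", "bob", 1, 1),
    ("news", "carol", 0, 6),
    ("news", "dave", 1, 2),
]
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", sample)

# Distinct authors with at least one gilded comment, per subreddit.
gilded_authors = conn.execute("""
    SELECT subreddit, COUNT(DISTINCT author) AS authors
    FROM comments
    WHERE gilded > 0
    GROUP BY subreddit
    ORDER BY authors DESC
    """).fetchall()

# Average score over all comments, per subreddit.
avg_scores = conn.execute("""
    SELECT subreddit, AVG(score) AS avg_score
    FROM comments
    GROUP BY subreddit
    ORDER BY avg_score DESC
    """).fetchall()
```

Note that `alice` is counted once for `pics` even though she has two gilded comments, and that `carol` is excluded from the first query because her comment is not gilded.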

Since this `take(5)` on the first query is the first action taken on the dataset, the entire 30GB will be read in and parsed from S3. This should take about 15 minutes, depending on the region of the Spark cluster.


You will notice that the next action finishes in about 30 seconds. This is because Spark knows that the original data is persisted into memory and disk and does not need to go to S3 to get the data. Had we not persisted the data at the very beginning, this action would take another 15 minutes (30X slower).

    average_score_by_subreddit.take(5)

Output:

    [Row(subreddit=u'karlsruhe', avg_score=73.3157894736842), Row(subreddit=u'picturesofiansleeping', avg_score=22.92391304347826), Row(subreddit=u'photoshopbattles', avg_score=21.04499959532738), Row(subreddit=u'behindthegifs', avg_score=20.62438118811881), Row(subreddit=u'IAmA', avg_score=18.381243801552937)]
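
Why persisting helps can be pictured with a plain-Python sketch. The `Dataset` class and `load_comments` function below are hypothetical stand-ins, not Spark APIs: `load_comments` plays the role of the expensive S3 read, and a counter records how often it runs with and without `persist()`.

```python
loads = {"count": 0}  # how many times the expensive read has run


def load_comments():
    """Stand-in for the expensive S3 read; the rows are made up."""
    loads["count"] += 1
    return [("pics", 10), ("pics", 4), ("news", 6)]


class Dataset:
    """Hypothetical lazy dataset with an optional persist() flag."""

    def __init__(self, loader):
        self._loader = loader
        self._persist = False
        self._cached = None

    def persist(self):
        self._persist = True
        return self

    def rows(self):
        # Serve from the cache when a previous action already filled it.
        if self._cached is not None:
            return self._cached
        data = self._loader()
        if self._persist:
            self._cached = data
        return data


# Two actions without persisting: the source is read twice.
unpersisted = Dataset(load_comments)
unpersisted.rows()
unpersisted.rows()
loads_without_persist = loads["count"]

# Two actions with persisting: the source is read once.
loads["count"] = 0
persisted = Dataset(load_comments).persist()
persisted.rows()
persisted.rows()
loads_with_persist = loads["count"]
```

In this simplified picture, the second action on the persisted dataset never touches the source, which mirrors why the second query above avoids the 15-minute read from S3.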

# You are done