<DIV ALIGN=CENTER>

# Introduction to Spark
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

Previously in this course, we have discussed doing data science at the Unix command line, and with Python, primarily by using Pandas. We also have discussed other Python libraries that bring new functionalities to the Python data science stack. Other _big data_ technologies, however, also exist and can be relevant to particular data science investigations, depending on the scale of data. Of these other technologies, one of the most promising is [**Spark**][sp].

Spark is a cluster computing system that leverages [Hadoop][sh] technologies like [HDFS][shdfs] for high-performance storage and [YARN][sy] for cluster management. While some may see Spark as a replacement for Hadoop, an alternative argument can be made that [Spark is simply another compute engine][sce] for Hadoop, in addition to MapReduce.

In this IPython Notebook, we explore using Spark to perform data processing in a similar manner to our previous efforts with Pandas. For this we will use the airline data, which has been stored in an HDFS system that is accessible from within our Spark cluster. [Other][dw] tutorials exist, although they often focus on Scala examples since Spark is written in that language.

-----
[sp]: http://spark.apache.org
[sh]: http://hadoop.apache.org
[sy]: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[shdfs]: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[sce]: http://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/
[dw]: https://github.com/deanwampler/spark-workshop/tree/master/tutorial

### Initialization

In this class, we do not use a dedicated Spark cluster; instead, we run our Spark applications in our local Docker container from within our IPython Notebook environment. We still emphasize resource management, however. In particular, we demonstrate how to ensure that any SparkContext previously used by this Jupyter Server is properly released before starting a new one. After this, we initialize a new SparkContext so that this dockerized IPython Notebook can interact with Spark, which here runs in local mode (`local[*]`).

----- 

In [1]:
# Release the existing SparkContext, if one is defined.
# A NameError means no SparkContext named `sc` exists yet.
try:
    sc
except NameError:
    pass
else:
    sc.stop()

# Now handle initial import statements
from pyspark import SparkConf, SparkContext

# Create new Spark Configuration (port numbers might need to be adjusted from defaults.)
myconf = SparkConf()
myconf.setMaster('local[*]')
myconf.setAppName("INFO490 SP17 W14-NB1: Professor Brunner")
myconf.set('spark.executor.memory', '1g')

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 2.0.1


-----

### Using Spark

Spark is a framework for processing large-data tasks; in general, this means petabytes (or more) of data. Spark can run on top of HDFS, which can be set up to chunk files into blocks and to replicate these blocks across a cluster's storage to promote increased performance. Spark abstracts these details, however, allowing us to develop an application on a small system and scale up to large data on a cluster.
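
As a rough illustration of the chunking and replication described above, the following sketch estimates how many blocks a single file would occupy. The 128 MiB block size and the replication factor of 3 are the usual HDFS defaults, but both are configurable, so treat them as assumptions here. The helper `hdfs_footprint` is hypothetical and is not part of Spark or Hadoop.

```python
MIB = 1024 ** 2


def hdfs_footprint(file_bytes, block_bytes=128 * MIB, replication=3):
    """Return (blocks, stored block replicas) for one file.

    block_bytes and replication are assumed defaults; a real cluster
    may be configured differently.
    """
    # Integer ceiling division: a partially filled final block still counts.
    blocks = (file_bytes + block_bytes - 1) // block_bytes
    return blocks, blocks * replication


# A 1 GiB file under the assumed defaults.
blocks, replicas = hdfs_footprint(1024 * MIB)
print(blocks, replicas)
```

Each block can be read by a different worker, which is what allows a large file to be processed in parallel.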

In Spark, communications move between a driver process and the executor processes. This communication is handled for us by a [`SparkContext`][sc], which requests resources from the Spark master process, such as the number of cores, which are reserved to complete our Spark tasks. In the previous code cell, we initialized our `SparkContext`. Once a SparkContext is active, we can use the Spark Console to monitor jobs and the overall Spark infrastructure. The Jupyter Server currently sets an HTTP header (`X-Frame-Options`) that prevents us from easily displaying this console within this Notebook. However, if you open a new web browser to the IP address of this Notebook and use `4040` as the port number, you should be able to view and interact with the console, as shown in the following screenshot:

![Spark Console](images/spark-console.png)

-----

The basic data structure in Spark is a [Resilient Distributed Dataset][rdd] (RDD). An RDD is immutable; thus, if you want to add a column to an RDD, you must create a new copy that includes the new column. In Spark, data processing tasks can be transformations or actions, and these tasks can be pipelined for efficiency. Each transformation creates a new RDD, but since Spark uses lazy evaluation, the transformations are not executed until an action is invoked.

These concepts are demonstrated in the following code cells, where we first create a list of integers, which we use to initialize a new RDD.

-----
[sc]: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
[rdd]: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

In [2]:
data = range(50)
print(data)

range(0, 50)


In [3]:
myRDD = sc.parallelize(data, 8)

-----

In the previous code cell, we create a [parallelized collection][pc] by using
the `parallelize` method, which partitions the data (here into 8
partitions) so that it can be distributed across the cores in a cluster.
The general rule indicated in the Spark documentation is that you want
2-4 partitions per core.
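
To build intuition for what partitioning 50 elements into 8 slices means, here is a minimal pure-Python sketch. It assumes contiguous, nearly equal slices; the helper `slice_evenly` is hypothetical, and how Spark actually lays out elements is an implementation detail that may differ.

```python
def slice_evenly(items, num_slices):
    """Split a sequence into num_slices contiguous, nearly equal slices."""
    items = list(items)
    n = len(items)
    # Boundary i marks where slice i starts; the last boundary is n.
    bounds = [(i * n) // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


partitions = slice_evenly(range(50), 8)
print([len(p) for p in partitions])
```

In a live Spark session, `myRDD.getNumPartitions()` reports the number of partitions and `myRDD.glom().collect()` returns the elements grouped by partition, which shows the actual layout.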

Next, we call several RDD methods: `id` returns the unique ID of the RDD,
which changes whenever a new RDD is created; `setName` assigns a name that
makes the RDD easier to identify in the Spark console; and
`toDebugString` displays the lineage of the RDD.

-----
[pc]: https://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections

In [4]:
print("Initial RDD id: {0}".format(myRDD.id()))

Initial RDD id: 1


In [5]:
myRDD.setName("Professor Brunner's RDD")

Professor Brunner's RDD PythonRDD[1] at RDD at PythonRDD.scala:48

In [6]:
print(myRDD.toDebugString())

b"(8) Professor Brunner's RDD PythonRDD[1] at RDD at PythonRDD.scala:48 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475 []"


-----

Now, given this simple RDD, we can apply a transformation; in this case
we simply add one to each element in the RDD. This transformation doesn't
actually happen until we call an action method, which first occurs in
the third code cell below when we call the `collect` method. The new RDD
has been created, however, as indicated by its new id.

-----

In [7]:
myaddRDD = myRDD.map(lambda a: a + 1)

In [8]:
print(myaddRDD.toDebugString())

b'(8) PythonRDD[2] at RDD at PythonRDD.scala:48 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475 []'


In [9]:
print(myaddRDD.collect())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]


-----
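
The deferred execution described above can be mimicked in plain Python. The class below is a toy stand-in, not Spark code: `LazySeq` is a hypothetical name, and it only imitates two ideas, namely that each transformation returns a new object (the original is never modified) and that no work happens until the `collect` action is called.

```python
class LazySeq:
    """Toy analogue of an RDD: records transformations, runs them on collect()."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, func):
        # Return a new object; the existing one is left untouched.
        return LazySeq(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        return LazySeq(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # The action: only now are the recorded steps applied, in order.
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


base = LazySeq(range(50))
plus_one = base.map(lambda x: x + 1)               # nothing computed yet
by_five = plus_one.filter(lambda x: x % 5 == 0)    # still nothing computed
print(by_five.collect())
```

Because `base` and `plus_one` are never changed, either one can still be collected afterwards, just as an earlier RDD remains usable after new RDDs are derived from it.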

We can now apply a second transformation; in this case we apply a
filter, which selects values from the RDD based on a condition (in this
example we select values that are evenly divisible by 5). The
transformation doesn't occur, however, until we once again call the
`collect` method, which _collects_ the results of the different
transformations.

-----

In [10]:
myfilterRDD = myaddRDD.filter(lambda x: (x % 5) == 0)

In [11]:
myfilterRDD.collect()

[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

In [12]:
print(myfilterRDD.toDebugString())

b'(8) PythonRDD[3] at collect at <ipython-input-11-4166111fb352>:1 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475 []'


-----

Transformations, however, can be chained together in a process called
pipelining. Doing so can produce long code strings, which can be
difficult to follow (or debug). Thus, it is considered [good style][gs]
to break pipelined operations such that each transformation occurs on a
separate line. The following code combines the previous Spark tasks
into a single statement, written in the recommended style.

-----
[gs]: http://nbviewer.ipython.org/github/jdwittenauer/ipython-notebooks/blob/master/Spark-Lab0-Tutorial.ipynb#-(8d)-Readability-and-code-style-

In [13]:
(sc
 .parallelize(data)
 .map(lambda x: x + 1)
 .filter(lambda x: (x % 5) == 0)
 .collect())

[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

-----

### Data Processing

Previously in this Notebook, we have used Spark to create simple RDDs
that demonstrated Spark transformations and actions on small data. Now
we will change approaches and analyze the airline data, starting with
the single 2001 flight data file. We can create a new RDD by reading in
the data as a text file, after which we execute the RDD creation by
counting the number of lines in the RDD. We subsequently display the
first few rows of data by using the `take` method. Finally, we use the
built-in `help` function to see the list of supported RDD methods.

-----



In [14]:
filename = '/home/data_scientist/data/2001/2001-1.csv'

text_file = sc.textFile(filename)

In [15]:
text_file.count()

500000

In [16]:
text_file.take(5)

['Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay',
 '2001,1,17,3,1806,1810,1931,1934,US,375,N700��,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,NA',
 '2001,1,18,4,1805,1810,1938,1934,US,375,N713��,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,NA',
 '2001,1,19,5,1821,1810,1957,1934,US,375,N702��,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA',
 '2001,1,20,6,1807,1810,1944,1934,US,375,N701��,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA']

In [17]:
# Display help info on spark rdd
help(text_file)

Help on RDD in module pyspark.rdd object:

class RDD(builtins.object)
 |  A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
 |  Represents an immutable, partitioned collection of elements that can be
 |  operated on in parallel.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, other)
 |      Return the union of this RDD and another one.
 |      
 |      >>> rdd = sc.parallelize([1, 1, 2, 3])
 |      >>> (rdd + rdd).collect()
 |      [1, 1, 2, 3, 1, 1, 2, 3]
 |  
 |  __getnewargs__(self)
 |  
 |  __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  aggregate(self, zeroValue, seqOp, combOp)
 |      Aggregate the elements of each partition, and then the results for all
 |      the partitions, using a given combine functions and a neutral "zero
 |      value."
 |      
 |      The functions C{op(t1, t2

-----

With this text RDD, we can begin to process the data. Since our data is,
at this point, simply a collection of strings (one per line), we first
need to split each line into columns, extract the columns of interest,
and remove the header row. These steps are pipelined to create a single
RDD that isn't processed until we execute an action method, in this
case the `first` method, which displays the first row in the new RDD.

-----

In [18]:
col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

In [19]:
col_data.first()

('2001', '1', '17', '1806', '-3', '-4', 'BWI', 'CLT', '361')

-----
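
The split-and-select step can also be written with named functions, which are easier to test outside of Spark than lambdas. The sketch below is plain Python; `KEEP`, `parse_flight_line`, and `is_header` are hypothetical names, and the sample line is a simplified copy of the first data row shown earlier (with a shortened tail number).

```python
# Positions of Year, Month, DayofMonth, DepTime, ArrDelay, DepDelay,
# Origin, Dest, and Distance in the header row.
KEEP = (0, 1, 2, 4, 14, 15, 16, 17, 18)


def parse_flight_line(line):
    """Split one CSV line and keep only the columns of interest."""
    parts = line.split(",")
    return tuple(parts[i] for i in KEEP)


def is_header(row):
    """True for the header row, whose first field is the literal 'Year'."""
    return row[0] == "Year"


sample = ("2001,1,17,3,1806,1810,1931,1934,US,375,N700,85,84,60,-3,-4,"
          "BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,NA")
row = parse_flight_line(sample)
print(row, is_header(row))
```

In the Notebook, these functions could be passed to `text_file.map(parse_flight_line)` and to a `filter` that keeps rows for which `is_header` is false.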

Spark RDDs, unlike Pandas, do not handle NA values automatically. Thus we
need an additional transformation to remove rows from our RDD that
contain missing data. We can accomplish this by using an appropriate
filter, which will be executed when we call the `count` method. This
value should be **480106**, which corresponds to the number of complete
rows in the first flight data file.

-----

In [20]:
cols = col_data.filter(lambda line: 'NA' not in line)

In [21]:
cols.count()

480106

-----

To analyze these data, however, we need to convert the columns to the
appropriate data types. In this case, we can simply apply one final
transformation.

-----

In [22]:
fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

In [23]:
fields.count()

480106

-----
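
The missing-data filter and the type conversion can be checked on individual rows without Spark. In this sketch, `has_missing` and `to_typed` are hypothetical helper names. Note that `'NA' in row` on a tuple tests for an element that is exactly `'NA'`, so an airport code such as `BNA` is not mistaken for missing data.

```python
def has_missing(row):
    """True if any field is exactly the string 'NA'."""
    return "NA" in row


def to_typed(row):
    """Convert the numeric fields of a nine-field row to int."""
    year, month, day, dep_time, arr_delay, dep_delay, origin, dest, distance = row
    return (int(year), int(month), int(day), int(dep_time),
            int(arr_delay), int(dep_delay), origin, dest, int(distance))


complete = ('2001', '1', '17', '1806', '-3', '-4', 'BWI', 'CLT', '361')
missing = ('2001', '1', '17', 'NA', 'NA', 'NA', 'BWI', 'CLT', '361')

print(has_missing(complete), has_missing(missing))
print(to_typed(complete))
```

Filtering before converting matters: calling `int('NA')` raises a `ValueError`, which in Spark would only surface later, when an action triggers the computation.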

### Bulk File Processing

As our final Spark exercise, we will perform bulk file processing. Since
our Spark cluster has been set up to support multiple students and is
not optimized for single-user performance, we need to proceed carefully.

First, we confirm that our SparkContext is still active by displaying
the Spark version. Second, we create an RDD from a wildcard pattern that
matches all of the available 2001 flight data files, and we execute the
creation of the RDD by calling the `count` method to display the number
of flight records across all matched files. Next, we create a pipelined
transformation similar to our previous one that converts the rows to
columns, extracts the columns of interest, and removes the header row.
We execute this transformation by issuing the `first` action method to
display the new first row. Finally, we create a second transformation
that removes all rows with missing data and keeps the flights that
involve O'Hare (`ORD`); because the filter tests the whole row, it
matches `ORD` as either the origin or the destination. Counting the
result shows 54,756 such flights in this run.

-----

In [24]:
# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 2.0.1


In [25]:
filename = '/home/data_scientist/data/2001/2001-*.csv'

flight_files = sc.textFile(filename)

flight_files.count()

500000
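
The count above equals the count for the single file, which would be consistent with only one file matching the pattern in this environment. One way to reason about which names a wildcard selects is the standard-library `fnmatch` module, sketched below. The file names are hypothetical, and the glob rules Spark applies to paths are similar to, but not necessarily identical with, those of `fnmatch`.

```python
from fnmatch import fnmatchcase


def matching_files(names, pattern):
    """Return the names that match a shell-style wildcard pattern, sorted."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


# Hypothetical directory listing, for illustration only.
candidates = ["2001-1.csv", "2001-2.csv", "2002-1.csv", "notes.txt"]
print(matching_files(candidates, "2001-*.csv"))
```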

In [26]:
col_data = flight_files.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

In [27]:
%time col_data.first()

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 39.3 ms


('2001', '1', '17', '1806', '-3', '-4', 'BWI', 'CLT', '361')

In [28]:
cols = col_data.filter(lambda line: 'NA' not in line).filter(lambda line: 'ORD' in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

%time fields.count()

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 1.4 s


54756
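
The filter `'ORD' in line` in the cell above tests membership in the whole tuple, so it keeps a flight when either the origin or the destination is `ORD`. If only departures are wanted, the origin field can be tested directly. The sketch below is plain Python with hypothetical helper names (`departs_from`, `touches`) and made-up sample rows in the nine-field layout used in this Notebook.

```python
ORIGIN, DEST = 6, 7  # positions within the nine-field rows built above


def departs_from(row, airport="ORD"):
    """True only when the flight's origin is the given airport."""
    return row[ORIGIN] == airport


def touches(row, airport="ORD"):
    """True when the airport is the origin or the destination."""
    return airport in (row[ORIGIN], row[DEST])


# Made-up rows, for illustration only.
outbound = (2001, 1, 17, 1806, -3, -4, "ORD", "CLT", 599)
inbound = (2001, 1, 17, 1806, -3, -4, "CLT", "ORD", 599)

print(departs_from(outbound), departs_from(inbound))
print(touches(outbound), touches(inbound))
```

Passing `departs_from` to an RDD `filter` would restrict the result to flights leaving O'Hare, which is the distinction the second advanced problem in the Student Activity asks about.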

-----
### Student Activity

In the preceding cells, we introduced Spark. Now that you have run the Notebook, go back and make the
following changes to see how the results change.

1. Change the `myRDD` example to start with all integers from 0 to 199.
Use an appropriate lambda function to convert this RDD to a new RDD that
has all odd integers from 1 to 399.

2. Filter the previous RDD to contain only entries that are divisible by
9.

3. Convert this RDD to a Spark DataFrame, specifying the column name as
`Numbers`.

4. Add an index column to this Spark DataFrame, which sequentially
increases.

Additional, more advanced problems:

1. Create an RDD containing the 'Year', 'Month', 'DayofMonth', 'DepDelay',
and 'Origin' columns for the airline data.

2. Filter this RDD to contain only flight data for flights leaving O'Hare
airport.

-----

### Ending the Spark Session

We must stop the `SparkContext` in order to release the Spark resources before exiting this Notebook.

-----

In [29]:
sc.stop()