# Introduction to PySpark with Jupyter

PySpark is the Python interface to the Apache Spark framework:

> Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.

Spark is used for big data applications, whose datasets are by definition too large to process on a single compute resource.  A common use for the framework is to process large amounts of data and apply machine learning techniques to analyze, understand, and predict outcomes of external processes.

This notebook aggregates information from various sources, including notebooks and code that I have developed on projects, as well as the following books: [Learning Spark](http://shop.oreilly.com/product/0636920028512.do), [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), and [High Performance Spark](http://shop.oreilly.com/product/0636920046967.do).

Some of these resources do not cover Python or PySpark directly, but I have translated the information into Pythonic, or at least Python, code for use here.

In addition, many resources exist on the web for exploring [Python](https://www.python.org/) and [PySpark](http://spark.apache.org/docs/latest/api/python/index.html) as well as Machine Learning and other big data uses in general.  Due to the dynamic nature of these resources, you should always search and use the most current information available at the time you need it.

## Import the module

PySpark is already installed in the Docker container, so simply import it here.

In [1]:
import pyspark

## Create a Spark Context

Creating a SparkContext requires a configuration that defines how Spark will operate.  The easiest way to provide one is to create a SparkConf object with the desired parameter values.  Here we define a 'local' style operation, using the `local[*]` master to run Spark on this machine with as many worker threads as logical cores, since we want to explore Spark and PySpark without needing a cluster for job execution.

In [2]:
# Create a simple local Spark configuration.
conf = (
    pyspark
      .SparkConf()
      .setMaster('local[*]')
      .setAppName('Introduction Notebook')
)

# Show the configuration:
import pprint as pp
print('Configuration:')
pp.pprint(conf.getAll())

Configuration:
[('spark.master', 'local[*]'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'Introduction Notebook')]



A context should be created only once per session.  Guarding the creation with a `try` block ensures that the context is created only the first time the following cell is executed: referencing `sc` raises a `NameError` when it has not been defined yet, and only then is the context created.


In [3]:
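# Create a Spark context for local work, but only if one does not exist yet.
try:
    sc
except NameError:
    # `sc` is undefined the first time this cell runs, so create it now.
    sc = pyspark.SparkContext(conf=conf)

# Check that we are using the expected version of PySpark.
print('Version: ', sc.version)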

Version:  1.6.1
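
The guard pattern can be tried out without Spark. The following is a minimal pure-Python sketch in which `make_context`, `run_cell`, and `ctx` are hypothetical stand-ins for `pyspark.SparkContext`, the notebook cell, and `sc`. It only illustrates that re-running the guarded code reuses the existing object instead of creating a second one.

```python
# Pure-Python sketch of the "create once" guard; no Spark required.
# `make_context` is a hypothetical stand-in for pyspark.SparkContext.
created = []

def make_context():
    """Record each creation and return a placeholder context object."""
    created.append(object())
    return created[-1]

def run_cell():
    """Mimic re-running the notebook cell that guards creation of `ctx`."""
    global ctx
    try:
        ctx  # raises NameError if `ctx` has not been defined yet
    except NameError:
        ctx = make_context()
    return ctx

first = run_cell()
second = run_cell()
print(first is second, len(created))  # True 1
```

PySpark's `SparkContext.getOrCreate` class method serves a similar purpose and may be preferable where it is available.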


## Prove the module is available

Create a simple example and execute it to demonstrate that the module is working and the context is configured correctly.

The following creates an RDD initialized with a range of numbers, then samples 5 of them without replacement.  Spark distributes the RDD data and the work among the available executors in order to perform this processing.

In [4]:
# Prove that Spark is installed and working correctly
rdd = sc.parallelize(range(1000))
result = rdd.takeSample(False, 5)
print('5 randomly selected values from the range: %s' % result)

5 randomly selected values from the range: [379, 848, 521, 851, 596]
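
To make the idea of distributing data more concrete, here is a conceptual pure-Python sketch. It is an illustration only, not Spark's actual partitioning or sampling implementation: the helper `split_into_partitions` and the choice of 4 partitions are assumptions made for this example. It splits the same range into contiguous chunks, as might be handed to separate workers, and then draws 5 values without replacement.

```python
import random

def split_into_partitions(data, num_partitions):
    """Split a sequence into contiguous, nearly equal chunks (hypothetical helper)."""
    size = len(data)
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    return [list(data[bounds[i]:bounds[i + 1]]) for i in range(num_partitions)]

# Assumption: 4 partitions, standing in for 4 local worker threads.
partitions = split_into_partitions(range(1000), 4)
print([len(p) for p in partitions])  # [250, 250, 250, 250]

# Sampling without replacement, analogous in spirit to takeSample(False, 5).
all_values = [value for part in partitions for value in part]
sample = random.sample(all_values, 5)
print(len(sample), len(set(sample)))  # 5 5
```

The sampled values differ on each run, just as the output of `takeSample` does when no seed is given.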
