# Introduction to PySpark with Jupyter

PySpark is the Python interface to the Apache Spark framework:

> Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.

Spark is used for big data applications, which by definition involve more data than a single compute resource can process.  A common use of the framework is to process large amounts of data and apply machine learning techniques to analyze, understand, and predict outcomes for external processes.

This notebook was created by aggregating information from various sources, including notebooks and code that I have developed on projects, as well as the following books: [Learning Spark](http://shop.oreilly.com/product/0636920028512.do), [Advanced Analytics with Spark](http://shop.oreilly.com/product/0636920035091.do), and [High Performance Spark](http://shop.oreilly.com/product/0636920046967.do).

Some of these resources do not include Python or PySpark usage directly, but I have been able to translate the information into Pythonic (or at least Python) code for use here.

In addition, many resources exist on the web for exploring [Python](https://www.python.org/) and [PySpark](http://spark.apache.org/docs/latest/api/python/index.html) as well as Machine Learning and other big data uses in general.  Due to the dynamic nature of these resources, you should always search and use the most current information available at the time you need it.

## Import the module

PySpark is already installed in the Docker container, so simply import it here.

In [1]:
import pyspark

## Create a Spark Context

Creating a SparkContext requires the configuration for Spark operation to be defined.  This is most easily done by creating a SparkConf object with the desired parameter values for the way you want Spark to operate.  Here we define a 'local' style operation since we want to explore Spark and PySpark without needing to have a cluster available for job execution.

In [2]:
# Create a simple local Spark configuration.
conf = (
    pyspark
      .SparkConf()
      .setMaster('local[*]')
      .setAppName('Introduction Notebook')
)

# Show the configuration:
import pprint as pp
print('Configuration:')
pp.pprint(conf.getAll())

Configuration:
[('spark.master', 'local[*]'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'Introduction Notebook')]



A context should be created only once per session.  Guarding the creation with a `try` block ensures that the context is created only the first time the following cell is executed; on later executions the name `sc` already exists and is reused.


In [3]:
# Create a Spark context for local work.
try:
    sc
except NameError:
    # 'sc' is not defined yet, so this is the first execution of the cell.
    sc = pyspark.SparkContext(conf=conf)

# Check that we are using the expected version of PySpark.
print('Version: ',sc.version)

Version:  1.6.1


## Prove the module is available

Create a simple example and execute it to demonstrate that the module is working and the context is configured correctly.

The following creates an RDD initialized with a range of numbers, then takes a random sample of 5 of them without replacement.  Spark will have distributed the RDD data and the work execution among the available executors in order to perform this processing.

In [4]:
# Prove that Spark is installed and working correctly
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

[390, 464, 204, 403, 203]

## Process Some Text

Once we have a working Spark instance, we can perform some actual work.  A common example is to perform word counting on a corpus of text.

For the following, the full text of Shakespeare's _The Taming of the Shrew_ was obtained and will be processed.  The text was obtained from [lexically.net](http://lexically.net/wordsmith/support/shakespeare.html), which obtained the actual corpus from the [Online Library of Liberty](http://www.lexically.net).

If you download and process these files yourself, note that they are stored in 16-bit Unicode.  For simplicity, there is a local copy of this one play located in the _data_ directory that is stored in UTF-8 format.

In [5]:

# UTF8 encoded textfile.
shrewText = sc.textFile('data/tamingoftheshrew.txt')
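
If you do download the original files, they would need to be re-encoded before being read as above.  The following is a minimal sketch of one way to do that in plain Python, assuming the downloads are UTF-16 with a byte order mark (the notebook only says "16-bit Unicode").  The function name and the file names are hypothetical, and the demonstration runs on a small throwaway file rather than the real corpus.

```python
import os
import tempfile

def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper)."""
    # Assumption: the source is UTF-16; the 'utf-16' codec reads the
    # byte order mark (BOM) to determine the byte order.
    with open(src_path, 'r', encoding='utf-16') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)

# Demonstrate on a small temporary file instead of a real download.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'sample_utf16.txt')
dst_path = os.path.join(workdir, 'sample_utf8.txt')
with open(src_path, 'w', encoding='utf-16') as f:
    f.write('KATHARINA: I pray you, sir\n')

convert_utf16_to_utf8(src_path, dst_path)
with open(dst_path, 'r', encoding='utf-8') as f:
    converted = f.read()
print(converted)
```

Reading the whole file into memory is fine for a single play; a much larger corpus would probably call for converting in chunks.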


As in most text processing and NLP workflows, removing stop words reduces the size of the required tasks.  We can use the standard stop words available from the _stop-words_ Python package for our list of words to remove from the corpus.  Since we are processing the text in Spark, we go ahead and _broadcast_ this data to all workers.  This is an efficient mechanism to ensure that all workers have the data available with a minimum of network traffic involved.

In [6]:

# Grab stop words to remove from the corpus.
from stop_words import get_stop_words
stopwords = sc.broadcast( set(get_stop_words('en')) )


Now that we have the corpus to process and some immutable data to work with, we can start working on the data.  First, we split the input into individual words.  We can easily do this by splitting on whitespace of any kind and size, then creating an output record for each word resulting from the split.

In the code below, the splitting is done internal to the flatMap call.  In that call, each line is processed to replace whitespace of any kind with a single space, all text is converted to lower case for counting, and then the single spaces are used to split the line into a record for each word.

The runs of whitespace in a line are found and replaced using a regular expression pattern.  This pattern is broadcast to all worker processes since the pattern itself is immutable and common to all workers.  Note that using a broadcast variable requires accessing its _.value_ attribute to obtain the original object.

Once the words have been converted to records, the stop words are removed and any remaining empty records are removed.

In [7]:
# Generate rows by splitting at (any number of) spaces.
import re
pattern = sc.broadcast( re.compile(r'\s+') )
shrewWords  = (
    shrewText.flatMap(lambda line: pattern.value.sub(' ',line.strip().lower()).split(" "))
      .filter(lambda w: w not in stopwords.value) # Remove stop words.
      .filter(lambda w: len(w) > 0)               # Remove empty words.
)
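
To make the per-line logic easier to follow, here is a small plain-Python sketch of what the `flatMap` and the two `filter` steps do to a single line.  It runs without Spark; the tiny stop word set, the sample line, and the `tokenize` helper are hypothetical stand-ins for the broadcast values and lambdas used above.

```python
import re

# Hypothetical stand-ins for the broadcast stop words and pattern.
sample_stopwords = {'the', 'a', 'of', 'and'}
whitespace = re.compile(r'\s+')

def tokenize(line):
    """Mirror the flatMap and filter steps for one line of text."""
    # Collapse runs of whitespace to single spaces, lower-case, then split.
    words = whitespace.sub(' ', line.strip().lower()).split(' ')
    # Drop stop words and the empty strings produced by blank lines.
    return [w for w in words if w not in sample_stopwords and len(w) > 0]

tokens = tokenize('  The   Taming\tof the  Shrew ')
print(tokens)          # ['taming', 'shrew']
print(tokenize('   ')) # []
```

The second call shows why the empty-word filter is needed: splitting an empty string on a space yields a single empty record rather than no records.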

Now that we have a record for each individual word in the corpus, we can group and count them.  To do this we create Key/Value records by mapping the words.  The Key is set to the word, and the Value is given an integer value of 1.

Calling _.reduceByKey()_ on these Key/Value records groups the records for each word together and processes them using the provided function.  Here we add up all of the individual Values for the records of that Key.  Since each word in the corpus started with a Value of 1, adding these together results in the count of the number of times that (Key) word appears in the corpus.

We go on to sort the result in descending order by count, then save the result.  The saved result is stored in parts by the RDD and will need to be combined in order to see the entire output together.  Other storage types can provide a single output file for review.

In [8]:
# Count the words by mapping a value for each row
# and adding the values up for each unique key.
shrewCounts = (
    shrewWords.map(lambda word: (word, 1)) # Generate the Key/Value records.
      .reduceByKey(lambda x, y: x + y)     # Generate the word counts.
      .map(lambda t: (t[1],t[0]))          # Swap Key and Value to sort by Value.
      .sortByKey(False)                    # Sort in descending order.
      .map(lambda t: (t[1],t[0]))          # Swap back to original sense of Key/Value.
)

resultsLocation = 'shrewcounts'

# Ensure that there is no previous output in the location.
# Choose to store multiple results by using multiple locations.
import shutil
shutil.rmtree(resultsLocation,ignore_errors=True)

# Store the results
shrewCounts.saveAsTextFile(resultsLocation)
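
Since the saved output is split across several part files, one way to view it as a whole is to read the parts back in order.  The sketch below does this in plain Python, assuming the usual `part-00000`, `part-00001`, ... naming and that each line is the text form of a `(word, count)` tuple.  The `read_saved_counts` helper is hypothetical, and the demonstration uses hand-written part files rather than real Spark output.

```python
import ast
import glob
import os
import tempfile

def read_saved_counts(location):
    """Read every part file in an output directory, in file-name order."""
    records = []
    for path in sorted(glob.glob(os.path.join(location, 'part-*'))):
        with open(path, 'r', encoding='utf-8') as part:
            for line in part:
                line = line.strip()
                if line:
                    # Parse the text form of a (word, count) tuple.
                    records.append(ast.literal_eval(line))
    return records

# Demonstrate on hypothetical part files instead of real Spark output.
demo_location = tempfile.mkdtemp()
with open(os.path.join(demo_location, 'part-00000'), 'w', encoding='utf-8') as f:
    f.write("('will', 146)\n('thou', 112)\n")
with open(os.path.join(demo_location, 'part-00001'), 'w', encoding='utf-8') as f:
    f.write("('shall', 99)\n")

combined = read_saved_counts(demo_location)
print(combined)
```

Against the real output this would be called as `read_saved_counts(resultsLocation)`.  Because the RDD was sorted before saving, reading the parts in name order should preserve the descending order of the counts.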

Now we can collect and display interesting information about the processing results.  Note that the ETL processing is not yet complete, since we can see what appear to be XML tags and partial XML tags that should be removed (or at least transformed) to produce a clean result.

In [9]:
# Count the number of unique words and the total number of words.
# Sans stop words, of course.
countOfUniqueWords = shrewCounts.count()
totalCountOfWords  = shrewCounts.map(lambda t: t[1]).reduce(lambda x,y: x+y)

# Look at some results.
print('Unique words: ',countOfUniqueWords,', Total words: ',totalCountOfWords,'\n')
for k,v in shrewCounts.take(10):
    print(k,': ',v)


Unique words:  5203 , Total words:  15403 

dir> :  364
</stage :  182
<stage :  181
</petruchio> :  158
<petruchio> :  158
will :  146
thou :  112
shall :  99
</tranio> :  91
<tranio> :  91

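
One possible next step for the cleanup is to strip the markup from each line before it is split into words.  The following plain-Python sketch shows the idea on a few hypothetical lines written in the style of the tags seen above; the `strip_tags` helper and the sample lines are assumptions, stop word removal is omitted for brevity, and a `Counter` stands in for the map and `reduceByKey` steps.

```python
import re
from collections import Counter

tag_pattern = re.compile(r'<[^>]*>')  # Anything between angle brackets.
whitespace = re.compile(r'\s+')

def strip_tags(line):
    """Replace markup tags with a space so adjacent words stay separate."""
    return tag_pattern.sub(' ', line)

# Hypothetical lines in the style of the corpus markup.
sample_lines = [
    '<PETRUCHIO>\tI will attend her here,</PETRUCHIO>',
    '<STAGE DIR>Exit</STAGE DIR>',
    '<TRANIO>\tI will</TRANIO>',
]

counts = Counter()
for line in sample_lines:
    cleaned = whitespace.sub(' ', strip_tags(line).strip().lower())
    # Adding 1 per occurrence is the same sum that reduceByKey computes.
    counts.update(w for w in cleaned.split(' ') if w)

print(counts.most_common(2))
```

In the notebook, a function like this could presumably be applied with a `map` on `shrewText` ahead of the `flatMap` step.  Note that this pattern only handles tags that open and close on the same line, and that punctuation (such as the trailing comma in `here,`) is still attached to the words.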