# <center> Introduction to Spark In-memory Computing via Python PySpark </center>

- Spark is an implementation of the MapReduce programming paradigm that operates on in-memory data and allows data reuse across multiple computations.
- For workloads that reuse data across computations, Spark typically performs significantly better than its predecessor, Hadoop MapReduce.
- Spark's primary data abstraction is the Resilient Distributed Dataset (RDD):
    - Read-only, partitioned collection of records
    - Created (aka written) through deterministic operations on data:
        - Loading from stable storage
        - Transforming from other RDDs
        - Generating through coarse-grained operations such as map, join, filter ...
    - Does not need to be materialized at all times and is recoverable via **data lineage** (the recorded chain of operations that produced it)
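
As a rough illustration of lineage, the plain-Python toy below stores only the recipe for each dataset plus a pointer to its parent, so the records can be rebuilt from the source whenever they are needed. This is not Spark code and not how Spark is implemented; `ToyRDD` and its methods are hypothetical names invented for this sketch.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how to compute its records, not the records."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute   # zero-argument function that rebuilds the records
        self.parent = parent      # the dataset this one was derived from
        self.op = op              # name of the operation that produced it

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()], parent=self, op="map")

    def filter(self, f):
        return ToyRDD(lambda: [x for x in self.collect() if f(x)], parent=self, op="filter")

    def collect(self):
        # Nothing is stored: every call replays the chain back to the source.
        return self._compute()

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


source = ToyRDD(lambda: [1, 2, 3, 4, 5, 6])
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.lineage())   # chain of operations, oldest first
print(evens_squared.collect())   # records rebuilt on demand
```

Because `collect()` replays the whole chain each time, losing a computed result costs only recomputation, which is the idea behind recovering RDD partitions from their lineage.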

<img src="figures/spark2_arch.png" width="600"/>

In [1]:
!module list

Currently Loaded Modulefiles:
  1) anaconda3/4.2.0   3) zeromq/4.1.5
  2) matlab/2015a      4) hdp/0.1


## 1. Getting Started

Spark keeps working data in the memory of the cluster's executors. The notebook reaches that cluster through the variable **sc** (a SparkContext), which is the entry point for creating RDDs.

In [2]:
import sys
import os

sys.path.insert(0, '/usr/hdp/current/spark2-client/python')
sys.path.insert(0, '/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip')

os.environ['SPARK_HOME'] = '/usr/hdp/current/spark2-client/'
os.environ['SPARK_CONF_DIR'] = '/etc/hadoop/synced_conf/spark2/'
os.environ['PYSPARK_PYTHON'] = '/software/anaconda3/4.2.0/bin/python'

import pyspark
conf = pyspark.SparkConf()
conf.setMaster("yarn")
conf.set("spark.driver.memory","4g")
conf.set("spark.executor.memory","60g")
conf.set("spark.executor.instances","3")
conf.set("spark.executor.cores","12")

sc = pyspark.SparkContext(conf=conf)

In [3]:
sc

<pyspark.context.SparkContext at 0x2b74abc71fd0>

In [4]:
textFile = sc.textFile("/repository/complete-shakespeare.txt")

In [5]:
print (textFile)

/repository/complete-shakespeare.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0


## 2. What does Spark do with my data?

**Storage Level:**
- Does the RDD use disk?
- Does the RDD use memory?
- Does the RDD use off-heap memory?
- Is the RDD kept in deserialized form (`False` means it is stored serialized)?
- How many replicas to keep (default: 1; must be less than 40)?

In [6]:
textFile.getStorageLevel()

StorageLevel(False, False, False, False, 1)

In [7]:
textFile.getNumPartitions()

2

In [8]:
textFile.cache()

/repository/complete-shakespeare.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
textFile.getStorageLevel()

StorageLevel(False, True, False, False, 1)
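
The five values printed by `getStorageLevel()` follow the order of the questions listed at the top of this section, assuming PySpark's documented argument order `StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)`. The sketch below decodes the two tuples shown so far; `describe_storage_level` is a hypothetical helper written for this notebook, not part of PySpark.

```python
def describe_storage_level(use_disk, use_memory, use_off_heap, deserialized, replication=1):
    """Turn the five StorageLevel fields into a readable sentence (illustrative only)."""
    candidates = [(use_memory, "memory"), (use_off_heap, "off-heap memory"), (use_disk, "disk")]
    places = [name for flag, name in candidates if flag]
    if not places:
        return "not persisted"
    form = "deserialized" if deserialized else "serialized"
    return "{} in {}, {} replica(s)".format(form, " + ".join(places), replication)


before_cache = describe_storage_level(False, False, False, False, 1)
after_cache = describe_storage_level(False, True, False, False, 1)

print(before_cache)   # the level reported before cache()
print(after_cache)    # the level reported after cache()
```

Read this way, the only change made by `cache()` is that the second field (memory) becomes `True`.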

- By default, a transformed RDD may be recomputed every time you run an action on it.
- To avoid this, you can *persist* an RDD using *persist()* or *cache()*:
    - *persist()* lets you specify the storage level for the RDD
    - *cache()* persists the RDD at the default storage level (memory only)
    - *unpersist()* removes the RDD from storage

## 3. WordCount

Data operations in Spark fall into two groups: *transformations* and *actions*.
- A *transformation* creates a new dataset from an existing one. Examples include map, filter, reduceByKey, and sort.
- An *action* runs a computation on the dataset and returns a value to the driver program (here, the Python process behind this notebook). Examples include count, collect, reduce, and save.

"All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program." -- Spark Documentation

#### RDD Operations in Spark

**Transformations:**

- *map*(f: T -> U) : RDD[T] -> RDD[U]
- *filter*(f: T -> Bool) : RDD[T] -> RDD[T]
- *flatMap*(f: T -> Seq[U]) : RDD[T] -> RDD[U]
- *sample*(*fraction*: Float) : RDD[T] -> RDD[T] (deterministic sampling)
- *groupByKey*() : RDD[(K,V)] -> RDD[(K, Seq[V])]
- *reduceByKey*(f: (V,V) -> V) : RDD[(K,V)] -> RDD[(K,V)]
- *union*() : (RDD[T], RDD[T]) -> RDD[T]
- *join*() : (RDD[(K,V)], RDD[(K,W)]) -> RDD[(K,(V,W))]
- *cogroup*() : (RDD[(K,V)], RDD[(K,W)]) -> RDD[(K, (Seq[V],Seq[W]))]
- *crossProduct*() : (RDD[T], RDD[U]) -> RDD[(T,U)]
- *mapValues*(f: V -> W) : RDD[(K,V)] -> RDD[(K,W)] (preserves partitioning)
- *sort*(c: Comparator[K]) :  RDD[(K,V)] -> RDD[(K,V)]
- *partitionBy*(p: Partitioner[K]) : RDD[(K,V)] -> RDD[(K,V)]

**Actions:**

- *count*() : RDD[T] -> Long
- *collect*() : RDD[T] -> Seq[T]
- *reduce*(f: (T,T) -> T) : RDD[T] -> T
- *lookup*(k : K) : RDD[(K,V)] -> Seq[V] (on hash/range partitioned RDDs)
- *save*(path: String) : Outputs RDD to a storage system 
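
To make a few of these signatures concrete, the sketch below mimics *flatMap*, *map*, and *reduceByKey* on ordinary Python lists. It only shows what each operation computes; it is not Spark code and involves no partitioning or distribution. The helpers `flat_map` and `reduce_by_key` and the sample `lines` are made up for this illustration.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, records):
    """flatMap: apply f to each record and flatten the resulting sequences."""
    return [out for rec in records for out in f(rec)]


def reduce_by_key(f, pairs):
    """reduceByKey: combine all values that share a key using f."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(f, values) for key, values in grouped.items()}


lines = ["to be or", "not to be"]

words = flat_map(lambda line: line.split(" "), lines)   # one record per word
pairs = [(word, 1) for word in words]                   # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)       # sum the 1s per word

print(words)
print(counts)
```

Note how *flatMap* changes the number of records (two lines become six words), while *map* keeps it the same and *reduceByKey* leaves one record per distinct key.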

In [10]:
textFile = sc.textFile("/repository/complete-shakespeare.txt")

In [11]:
textFile

/repository/complete-shakespeare.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

In [12]:
%%time
textFile.count()

CPU times: user 10 ms, sys: 4.79 ms, total: 14.8 ms
Wall time: 4.19 s


124796

In [13]:
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

In [14]:
wordcount

PythonRDD[9] at RDD at PythonRDD.scala:48
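
Evaluating `wordcount` prints only an RDD description because its three transformations have not run yet. A loose plain-Python analogy (not Spark code) is the built-in `map`, which is also lazy: the function is called only once something consumes the result. The names `tag`, `calls`, and `lines` are made up for this sketch.

```python
calls = []


def tag(line):
    calls.append(line)        # record that the function really ran
    return line.upper()


lines = ["to be", "or not", "to be"]

pending = map(tag, lines)     # like a transformation: only describes the work
calls_before = len(calls)     # nothing has been computed at this point

result = list(pending)        # like an action: forces the computation
calls_after = len(calls)

print(calls_before, calls_after)
print(result)
```

In the same way, Spark does the actual work for `wordcount` only when an action such as `take()` or `saveAsTextFile()` asks for results.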

In [17]:
!hdfs dfs -mkdir intro-to-spark

In [18]:
wordcount.saveAsTextFile("intro-to-spark/output-wordcount-01")

In [19]:
!hdfs dfs -cat intro-to-spark/output-wordcount-01/part-00000 \
    2>/dev/null | head -n 20

('', 506672)
('Quince', 1)
('Corin,', 2)
('Just', 10)
('enrooted', 1)
('divers', 20)
('Doubtless', 2)
('undistinguishable,', 1)
('widowhood,', 1)
('incorporate.', 1)
('rare,', 10)
('Sir-I', 1)
("Stain'd", 2)
('sith', 12)
("O'erpays", 1)
('a-going?', 1)
('perfection.', 5)
('twice,', 2)
('LIBRARY,', 221)
('Gloucestershire;', 3)


**Step-by-step actions:**

In [32]:
!hdfs dfs -cat /repository/complete-shakespeare.txt \
    2>/dev/null | head -n 500

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by 
William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
**     Please follow the copyright guidelines in this file.     **

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***




Produced by World Library, Inc., from their Library of the Future




This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., f

In [21]:
wordcount_step_01 = textFile.flatMap(lambda line: line.split(" "))

In [22]:
wordcount_step_01

PythonRDD[16] at RDD at PythonRDD.scala:48

In [23]:
wordcount_step_01.take(20)

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare,',
 'by',
 '',
 'William',
 'Shakespeare',
 '',
 'This',
 'eBook',
 'is',
 'for']
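
The empty strings in this output come from how `str.split(" ")` treats trailing spaces and blank lines, which is also why `''` ends up as a key in the word counts. The short check below uses plain Python strings modeled on the first lines of the file; the variable names are made up for this sketch.

```python
line_with_trailing_space = "Shakespeare, by "
blank_line = ""

# An explicit " " separator keeps the empty pieces around each space.
explicit_sep = line_with_trailing_space.split(" ")
blank_explicit = blank_line.split(" ")

# With no argument, split() collapses runs of whitespace and drops empty pieces.
default_sep = line_with_trailing_space.split()
blank_default = blank_line.split()

print(explicit_sep, blank_explicit)
print(default_sep, blank_default)
```

Switching the `flatMap` lambda to `line.split()` would be one way to avoid counting empty strings, at the cost of changing the outputs shown in this notebook.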

In [24]:
wordcount_step_02 = wordcount_step_01.map(lambda word: (word, 1))

In [25]:
wordcount_step_02.take(20)

[('The', 1),
 ('Project', 1),
 ('Gutenberg', 1),
 ('EBook', 1),
 ('of', 1),
 ('The', 1),
 ('Complete', 1),
 ('Works', 1),
 ('of', 1),
 ('William', 1),
 ('Shakespeare,', 1),
 ('by', 1),
 ('', 1),
 ('William', 1),
 ('Shakespeare', 1),
 ('', 1),
 ('This', 1),
 ('eBook', 1),
 ('is', 1),
 ('for', 1)]

In [26]:
wordcount_step_03 = wordcount_step_02.reduceByKey(lambda a, b: a + b)

In [27]:
wordcount_step_03.take(20)

[('', 506672),
 ('Quince', 1),
 ('Corin,', 2),
 ('Just', 10),
 ('enrooted', 1),
 ('divers', 20),
 ('Doubtless', 2),
 ('undistinguishable,', 1),
 ('widowhood,', 1),
 ('incorporate.', 1),
 ('rare,', 10),
 ('Sir-I', 1),
 ("Stain'd", 2),
 ('sith', 12),
 ("O'erpays", 1),
 ('a-going?', 1),
 ('perfection.', 5),
 ('twice,', 2),
 ('LIBRARY,', 221),
 ('Gloucestershire;', 3)]

### Challenge

- Augment the mapping step of WordCount with a function that strips punctuation and ignores capitalization, so that variants of the same word are counted together.

To stop the Spark application and release its cluster resources, call `sc.stop()`.

In [None]:
sc.stop()