# A tour of Spark's toolset
![](images/toolset.png)

**TOC**
- Running production applications with spark-submit
- Datasets: type-safe APIs for structured data
- Structured Streaming
- Machine learning and advanced analytics
- Resilient Distributed Datasets (RDD): Spark’s low level APIs
- SparkR
- The third-party package ecosystem

## Running Production Applications

- `spark-submit`: sends application code to a cluster (via the cluster manager) and executes it there.
- The application runs until it exits (the task completes) or encounters an error.

**Spark's cluster manager**
- Standalone
- Mesos
- YARN

### Examples

In [None]:
## not run here
## scala version
## only via Command Line under spark root directory
./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master local \
    ./examples/jars/spark-examples_2.11-2.2.0.jar 10

This sample application estimates the value of pi to a certain level of precision. Here, we’ve told `spark-submit` that we want to run on our local machine, which class and which JAR we would like to run, and some command-line arguments for that class.
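As a rough illustration of what the bundled example computes, the sketch below estimates pi by Monte Carlo sampling in plain Python, with no Spark involved. The function name `estimate_pi` and the sample count are assumptions made for this sketch; in the Spark examples the trailing `10` is, as far as I can tell, the number of partitions the sampling is spread across.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # count the points that fall inside the unit circle
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of circle) / (area of square) = pi / 4
    return 4.0 * inside / num_samples


pi_estimate = estimate_pi(100_000)
print('Pi is roughly {}'.format(pi_estimate))
```

More samples give a tighter estimate, which is why the work is worth distributing across a cluster.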

In [None]:
## not run here
## python version
./bin/spark-submit \
    --master local \
    ./examples/src/main/python/pi.py 10

By changing the `--master` argument of `spark-submit`, we can also submit the same application to a cluster running Spark’s standalone cluster manager, Mesos, or YARN.
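To make the role of the master argument concrete, here is a small sketch that builds the `spark-submit` argument list for different cluster managers. The helper `spark_submit_command` and the host names `master-host` and `mesos-host` are hypothetical; only the URL schemes (`local`, `spark://`, `mesos://`, `yarn`) are Spark's own.

```python
def spark_submit_command(master, app, app_args=(), main_class=None):
    """Build the argv list for a spark-submit call (hypothetical helper)."""
    cmd = ['./bin/spark-submit']
    if main_class is not None:
        cmd += ['--class', main_class]
    cmd += ['--master', master, app]
    cmd += [str(arg) for arg in app_args]
    return cmd


# host names below are placeholders, not real machines
masters = {
    'local': 'local',
    'standalone': 'spark://master-host:7077',
    'mesos': 'mesos://mesos-host:5050',
    'yarn': 'yarn',
}

for name, url in masters.items():
    cmd = spark_submit_command(url, './examples/src/main/python/pi.py', [10])
    print(name, ':', ' '.join(cmd))
```

Only the value after `--master` changes; the application and its arguments stay the same.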

## Datasets: Type-Safe Structured APIs
- Only for **statically typed** languages: Java and Scala  
- Not available in **dynamically typed** languages: Python and R

Similar to Java `ArrayList` or Scala `Seq`
- APIs are *type-safe*

*Come back in the future*

## Structured Streaming (Part V)
- A high-level API for stream processing.
- Reduces latency and allows incremental processing.

#### Example - retail dataset

In [None]:
! head ../Spark-The-Definitive-Guide-master/data/retail-data/by-day/2010-12-01.csv -n 3

In [1]:
# create spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .master('local[3]')\
    .appName('Cha3')\
    .getOrCreate()
    

In [2]:
# load the data as static DataFrame
staticDataFrame = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('../Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv')

staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema

In [3]:
from pyspark.sql.functions import window, column, desc, col
# set partitions = 5 after shuffle
spark.conf.set('spark.sql.shuffle.partitions', '5')

# select query
staticDataFrame\
    .selectExpr(
    'CustomerId',
    '(UnitPrice * Quantity) as total_cost',
    'InvoiceDate')\
    .groupBy(
    col('CustomerId'), window(col('InvoiceDate'), '1 day'))\
    .sum('total_cost')\
    .show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   14075.0|[2011-12-04 16:00...|316.78000000000003|
|   18180.0|[2011-12-04 16:00...|            310.73|
|   15358.0|[2011-12-04 16:00...| 830.0600000000003|
|   15392.0|[2011-12-04 16:00...|304.40999999999997|
|   15290.0|[2011-12-04 16:00...|263.02000000000004|
+----------+--------------------+------------------+
only showing top 5 rows
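The grouping in the query above can be pictured with a plain-Python sketch: each row is assigned to the one-day window that contains its timestamp, and costs are summed per (customer, window) pair. The rows are made-up data, and `window_start` is a hypothetical helper that aligns windows to the Unix epoch in UTC, which is presumably why the window boundaries above print as 16:00 in the notebook's local time zone.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, size=timedelta(days=1)):
    """Start of the fixed-size window containing ts, aligned to the epoch."""
    return EPOCH + ((ts - EPOCH) // size) * size


# made-up rows: (CustomerId, UnitPrice, Quantity, InvoiceDate)
rows = [
    (1, 2.0, 3, datetime(2010, 12, 1, 8, 26, tzinfo=timezone.utc)),
    (1, 1.5, 2, datetime(2010, 12, 1, 17, 0, tzinfo=timezone.utc)),
    (1, 4.0, 1, datetime(2010, 12, 2, 9, 0, tzinfo=timezone.utc)),
    (2, 10.0, 1, datetime(2010, 12, 1, 12, 0, tzinfo=timezone.utc)),
]

totals = defaultdict(float)
for customer_id, unit_price, quantity, invoice_date in rows:
    key = (customer_id, window_start(invoice_date))
    totals[key] += unit_price * quantity  # same as sum(total_cost)

for (customer_id, start), total in sorted(totals.items()):
    print(customer_id, start.date(), total)
```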



In [4]:
# create streaming dataframe
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option('maxFilesPerTrigger', 1)\
    .format('csv')\
    .option('header', 'true')\
    .load('../Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv')

In [5]:
# check streaming
streamingDataFrame.isStreaming

True

In [6]:
# select query as above 
# lazy operation
purchaseByCustomerPerHour = streamingDataFrame\
    .selectExpr(
    'CustomerId',
    '(UnitPrice * Quantity) as total_cost',
    'InvoiceDate')\
    .groupBy(
    col('CustomerId'), window(col('InvoiceDate'), '1 day'))\
    .sum('total_cost')

#### Streaming actions
The action we will use outputs to an in-memory table that is updated after each `trigger`. In this case, each trigger is based on an individual file (the `maxFilesPerTrigger` read option that we set).

Spark will mutate the data in the in-memory table such that we will always have the highest value as specified in our previous aggregation:

In [7]:
# format('memory')       : store the results in an in-memory table
# queryName(...)         : the name of the in-memory table
# outputMode('complete') : all the counts should be in the table
purchaseByCustomerPerHour.writeStream\
    .format('memory')\
    .queryName('customer_purchases')\
    .outputMode('complete')\
    .start()

<pyspark.sql.streaming.StreamingQuery at 0x7f40e508cc18>

In [8]:
# query the in-memory table while the stream is running
spark.sql('''
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
''')\
    .show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|      null|[2011-11-10 16:00...|13636.969999999936|
|      null|[2011-01-23 16:00...|  8101.42000000001|
|      null|[2011-01-30 16:00...|  4822.28000000001|
|      null|[2011-11-11 16:00...| 4538.830000000009|
|   15311.0|[2011-11-11 16:00...| 4041.180000000001|
+----------+--------------------+------------------+
only showing top 5 rows
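The trigger-by-trigger behaviour can be mimicked in plain Python. This is only a conceptual sketch, not how Spark implements it: the generator `run_stream` and the three tiny "files" are invented for illustration. Each file is one trigger, the running totals are kept as state, and "complete" output mode is modelled by emitting the whole table after every trigger.

```python
from collections import defaultdict


def run_stream(files):
    """Yield the full result table after each trigger (one file per trigger)."""
    state = defaultdict(float)
    for batch in files:
        # incremental processing: only the new rows are folded into the state
        for customer_id, total_cost in batch:
            state[customer_id] += total_cost
        # 'complete' mode: the whole table is emitted, not just the changed rows
        yield dict(state)


# made-up input: each inner list stands for one CSV file
files = [
    [('a', 10.0), ('b', 5.0)],
    [('a', 2.5)],
    [('c', 7.0), ('b', 1.0)],
]

snapshots = list(run_stream(files))
for trigger, table in enumerate(snapshots, start=1):
    print('after trigger', trigger, ':', table)
```

Querying the in-memory table at different moments corresponds to looking at different snapshots, which is why the results change while the stream is still reading files.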



In [10]:
# write the results to the console 
purchaseByCustomerPerHour.writeStream\
    .format("console")\
    .queryName("customer_purchases_2")\
    .outputMode("complete")\
    .start()

<pyspark.sql.streaming.StreamingQuery at 0x7f40e508c320>

## Machine Learning and Advanced Analytics

In [3]:
staticDataFrame.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



Machine learning algorithms in MLlib require that data is represented as __numerical values__.

In [5]:
# Data transformation
from pyspark.sql.functions import date_format, col
preppedDataFrame = staticDataFrame\
    .na.fill(0)\
    .withColumn('day_of_week', date_format(col('InvoiceDate'), 'EEEE'))\
    .coalesce(5)

In [8]:
preppedDataFrame.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = false)
 |-- CustomerID: double (nullable = false)
 |-- Country: string (nullable = true)
 |-- day_of_week: string (nullable = true)



In [12]:
# train/test split 
trainDataFrame = preppedDataFrame\
    .where('InvoiceDate < "2011-07-01"')
testDataFrame = preppedDataFrame\
    .where('InvoiceDate >= "2011-07-01"')

In [15]:
print('TrainDF size : {}'.format(trainDataFrame.count()))
print('TestDF size : {}'.format(testDataFrame.count()))

TrainDF size : 245903
TestDF size : 296006


Spark’s MLlib also provides a number of transformations with which we can automate some of our general transformations. One such transformer is a `StringIndexer`:

In [16]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer()\
    .setInputCol('day_of_week')\
    .setOutputCol('day_of_week_index')

This will turn our days of the week into corresponding numerical values. For example, Spark might
represent Saturday as 6 and Monday as 1. However, with this numbering scheme, we are
implicitly stating that Saturday is greater than Monday (by pure numerical value), which is
obviously *incorrect*. To fix this, we use a `OneHotEncoder` to encode each of
these values as its own column. These Boolean flags state whether a given day of the week is the
relevant one:

In [20]:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
    .setInputCol('day_of_week_index')\
    .setOutputCol('day_of_week_encoded')
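A plain-Python sketch of the two steps may help. `fit_string_index` and `one_hot` are hypothetical helpers, not MLlib calls. They assume the default behaviours as I understand them: labels are indexed by descending frequency, and the one-hot vector drops the last category, so `n` labels give vectors of length `n - 1`. Breaking ties alphabetically is a choice made for this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector; the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


days = ['Monday', 'Saturday', 'Monday', 'Sunday', 'Monday', 'Saturday']
mapping = fit_string_index(days)
encoded = [one_hot(mapping[day], len(mapping)) for day in days]

print(mapping)
for day, vec in zip(days, encoded):
    print(day, vec)
```

After encoding, no day is numerically "greater" than another; each day is simply a different position in the vector.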

Each of these will result in a set of columns that we will “assemble” into a vector. All machine
learning algorithms in Spark take as **input** a `Vector` type, which must be a set of numerical
values:

In [46]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler()\
    .setInputCols(['UnitPrice', 'Quantity', 'day_of_week_encoded'])\
    .setOutputCol('features')

Here, we have three key features: the price, the quantity, and the day of week. Next, we’ll set this
up into a **pipeline** so that any future data we need to transform can go through the exact same
process:

In [47]:
## set up the pipeline
from pyspark.ml import Pipeline
transformationPipeline = Pipeline()\
    .setStages([indexer, encoder, vectorAssembler])

We first need to fit our **transformers** to this dataset (covered in depth in *Part VI*). Basically, our `StringIndexer` needs to know how many
unique values there are to index. Once that mapping exists, encoding is easy, but Spark must first look at all the distinct values in the column to be indexed in order to store them for later use:

In [48]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)

In [49]:
# transform the data
transformedTraining = fittedPipeline.transform(trainDataFrame)
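The fit-then-transform flow can be sketched without Spark. All classes below (`Scale`, `Center`, `CenterModel`, `Pipeline`, `PipelineModel`) are toy stand-ins written for this example and only mirror the shape of the MLlib API. A stage that needs to look at the data exposes `fit` and returns a fitted model; a stage that does not is applied as-is.

```python
class Scale:
    """A plain transformer: nothing to learn from the data."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [r * self.factor for r in rows]


class Center:
    """An estimator: it must see the data to learn the mean."""
    def fit(self, rows):
        return CenterModel(sum(rows) / len(rows))


class CenterModel:
    """The fitted version of Center; remembers the training mean."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [r - self.mean for r in rows]


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # fit estimators on the output of the previous stages
            model = stage.fit(rows) if hasattr(stage, 'fit') else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train, test = [1.0, 2.0, 3.0], [4.0]
fitted = Pipeline([Scale(2.0), Center()]).fit(train)
print(fitted.transform(train))
print(fitted.transform(test))
```

The test rows are centered with the mean learned from the training rows, which is the point of fitting once and reusing the fitted pipeline on future data.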

**ML**: There are always two types for every algorithm in MLlib’s DataFrame API.
- `Algorithm`: the untrained version
- `AlgorithmModel`: the trained version
    
In our example, this is `KMeans` and then `KMeansModel`.

In [59]:
# Comparison on w/ or w/o caching
import timeit as t
from pyspark.ml.clustering import KMeans

kmeans = KMeans()\
    .setK(20)\
    .setSeed(1)

**Caching** : an optimization (more detail in *Part IV*). This will put a copy of the intermediately transformed dataset into memory, allowing us to repeatedly access it at much lower cost than running the entire pipeline again.
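To compare runs with and without caching, one option is to read a wall-clock timer before and after the call. The helper `timed` below is a hypothetical convenience written for this sketch, and `sum(range(...))` merely stands in for an expensive call.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# stand-in workload; in the notebook this would be the model fit
result, seconds = timed(sum, range(1_000_000))
print('Elapsed : {:.4f} s'.format(seconds))
```

In the notebook this could be used as `timed(kmeans.fit, transformedTraining)`. Because `cache()` is lazy, the first action after calling it still pays the full cost of computing the data; the saving shows up on later passes over the same dataset.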

In [62]:
# w/o caching
# default_timer() returns the current wall-clock time in seconds
t1 = t.default_timer()

kmModel = kmeans.fit(transformedTraining)

t2 = t.default_timer()
print('Training time : {}'.format(t2-t1))


In [63]:
# w/ caching
t1 = t.default_timer()

# cache() is lazy: the data is materialized the first time an action runs
transformedTraining.cache()
kmModel = kmeans.fit(transformedTraining)

t2 = t.default_timer()
print('Training time : {}'.format(t2-t1))


Compute the cost, according to some success metric, on our training and test sets.

In [68]:
transformedTest = fittedPipeline.transform(testDataFrame)


In [80]:
print('Cost of train set : {}'.format(kmModel.computeCost(transformedTraining)),
      '\nCost of test set : {}'.format(kmModel.computeCost(transformedTest)))

Cost of train set : 84553739.96537484 
Cost of test set : 517507094.72221166
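For k-means, the cost reported here is the sum of squared distances from each point to its nearest cluster center. A minimal plain-Python sketch of that quantity follows; `kmeans_cost` is a hypothetical helper, and the centers and points are made up.

```python
def kmeans_cost(points, centers):
    """Sum over points of the squared Euclidean distance to the nearest center."""
    def squared_distance(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    return sum(min(squared_distance(p, c) for c in centers) for p in points)


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]

cost = kmeans_cost(points, centers)  # 1 + 4 + 1
print('Cost : {}'.format(cost))
```

Because this is a sum rather than a mean, it also grows with the number of rows, so the train and test figures above are not directly comparable without normalizing by row count.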


## Lower Level APIs (Part IV)

Spark includes a number of lower-level primitives to allow for arbitrary Java and Python object
manipulation via **Resilient Distributed Datasets (RDDs)** (Chapter 4). Virtually everything in Spark is built on
top of RDDs.

**Usage 1**: parallelize raw data that you have stored in memory on the driver machine.

In [81]:
from pyspark.sql import Row
spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()

DataFrame[_1: bigint]
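Conceptually, parallelizing a local collection means cutting it into partitions that can be processed independently. The sketch below is a simplified picture in plain Python, not Spark's actual implementation; the function `parallelize` here is a local stand-in that only shares a name with the Spark method.

```python
def parallelize(data, num_slices):
    """Split a local list into num_slices contiguous, near-equal partitions."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


partitions = parallelize([1, 2, 3, 4, 5, 6, 7], 3)

# each partition could be handled by a different executor,
# and the partial results combined afterwards
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

print(partitions)
print(partial_sums, total)
```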

**Note**: In modern Spark there are basically no cases in which you
should use RDDs instead of the structured APIs, beyond manipulating very raw,
unprocessed, and unstructured data.

## SparkR (Part VII)

In [83]:
%%R ## need to set up R kernel for jupyter notebook
library(SparkR)

sparkDF <- read.df("/data/flight-data/csv/2015-summary.csv",
                   source = "csv", header="true", inferSchema = "true")
take(sparkDF, 5)

UsageError: Cell magic `%%R` not found.


In [None]:
%%R
collect(orderBy(sparkDF, "count"), 20)


SparkR can also be used with other R libraries, such as `magrittr`:

In [None]:
library(magrittr)
sparkDF %>%
    orderBy(desc(sparkDF$count)) %>%
    groupBy("ORIGIN_COUNTRY_NAME") %>%
    count() %>%
    limit(10) %>%
    collect()

## Spark's ecosystem and Packages

Many mature packages and projects have been developed as part of the Spark ecosystem.
- The largest index of Spark packages is at [spark-packages.org](https://spark-packages.org/), where any user can publish a package.
