### In notebook...

* MLlib
* Graph analysis
* Neural networks

## Advanced Analytics and Machine Learning

* Preprocessing your data (cleaning data and feature engineering)
* Supervised Learning
* Unsupervised Learning
* Recommendation Engines
* Graph Analysis
* Deep Learning

## What is MLlib?

MLlib is a package, built on and included in Spark, that provides interfaces for
* gathering and cleaning data,
* feature engineering and feature selection,
* training and tuning large scale supervised and unsupervised machine learning models,
* and using those models in production.

MLlib helps with all of these steps, although it really shines in the earlier ones for reasons that we will
touch on shortly.

The most common low-level data type that you will come across in MLlib is the Vector. Whenever we pass a set of
features into a machine learning model, we must do it as a vector that consists of Doubles. This vector can be either
sparse (where most of the elements are zero) or dense (where there are many unique values). The two are specified
differently: for a dense vector we give the exact values, while for a sparse vector we give the total size and which
values are nonzero. Sparse is appropriate, as you might have guessed, when the majority of the values are zero, as this
is a more compressed representation than other formats.
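
To make the two representations concrete, here is a minimal plain-Python sketch (it does not use the Spark API). The helper names `sparse_to_dense` and `dense_to_sparse` are made up for this illustration; they only show how a (size, indices, values) description relates to the full list of values.

```python
# Plain-Python illustration (not the Spark API): a sparse vector is stored as
# its total size plus the positions and values of its nonzero entries.
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Keep only the nonzero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


dense = sparse_to_dense(3, [1, 2], [2.0, 3.0])
print(dense)                   # [0.0, 2.0, 3.0]
print(dense_to_sparse(dense))  # (3, [1, 2], [2.0, 3.0])
```

The sparse form only pays off when most entries are zero; for a vector with few zeros it stores more than the dense form does.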

In [6]:
import pyspark
from pyspark.sql import SparkSession
sc = pyspark.SparkContext("local[*]")
spark = SparkSession.builder.appName('notebook').getOrCreate()

In [7]:
from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)
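size = 3
idx = [1, 2]  # locations of the nonzero values in the vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
# sparseVec.toDense() and denseVec.toSparse() do not work here: as far as we can tell,
# the PySpark vector classes do not expose them. sparseVec.toArray() gives the dense values.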

In [8]:
sparseVec

SparseVector(3, {1: 2.0, 2: 3.0})

## MLlib in Action

In [None]:
df = spark.read.json("Spark-The-Definitive-Guide-master/data/simple-ml")
df.orderBy("value2").show()

This dataset consists of a categorical label with two values, a categorical variable (color), and two numerical variables.
While the data is synthetic, an example of when this data might be used would be to predict customer health at a
company. The label represents their true current health, the color represents a rating before a phone call to determine
their true health and the two values represent some sort of usage metric. You should immediately recognize that this
will be a classification task where we hope to predict our binary output variable based on the inputs.

In [55]:
#libsvmData = spark.read.format("libsvm")\
#.load("Spark-The-Definitive-Guide-master/data/sample_libsvm_data.txt")

## Transformers
As we mentioned, transformers help us manipulate our current columns in one way or another. These columns,
in machine learning terminology, represent features (that we will input into our model) and, in our particular case, a
label that represents the correct output. Transformers exist to either cut down on the number of features, add more
features, manipulate current ones, or simply help us format our data correctly. In general, transformers add new
columns to DataFrames.

To do this, we are going to specify an RFormula. This is a declarative language for specifying
machine learning models and is incredibly simple to use once you understand the syntax. Currently RFormula
supports a limited subset of the R operators that in practice work quite well for simple models. The basic operators
are:
* ~ separate target and terms;
* + concat terms, "+ 0" means removing the intercept (this means that the y-intercept of the line that we will fit
will be 0);
* - remove a term, "- 1" means removing the intercept (this means that the y-intercept of the line that we will fit will
be 0; yes, this does the same thing as the bullet above);
* : interaction (multiplication for numeric values, or binarized categorical values);
* . all columns except the target/dependent variable.
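
As a rough illustration of the interaction operator, the plain-Python sketch below shows what a categorical:numeric term such as color:value1 amounts to: the category is binarized and each indicator is multiplied by the numeric value. The category list, its order, and the helper names are assumptions made for this sketch; the exact encoding that RFormula produces (index order, which level is dropped) may differ.

```python
# Plain-Python sketch of an interaction term such as color:value1.
# The category list and its order are assumptions made for this illustration.
CATEGORIES = ["red", "green", "blue"]


def one_hot(category):
    """Binarize a categorical value into a 0/1 indicator per category."""
    return [1.0 if category == c else 0.0 for c in CATEGORIES]


def interaction(category, value):
    """Multiply the numeric value into each indicator (categorical:numeric)."""
    return [indicator * value for indicator in one_hot(category)]


print(interaction("green", 15))  # [0.0, 15.0, 0.0]
```

The effect is that each color gets its own coefficient for the numeric variable, instead of one shared coefficient.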


In [56]:
from pyspark.ml.feature import RFormula

Then we go through the process of defining our formula. In this case we want to use all available variables (the .) and
then specify interactions between value1 and color and between value2 and color.

In [57]:
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")

In [58]:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show()

+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
|green|good|    15| 38.97187133755819|(10,[1,2,3,5,8],[...|  1.0|
|green|good|    12|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
|green| bad|    16|14.386294994851129|(10,[1,2,3,5,8],[...|  0.0|
|  red|good|    35|14.386294994851129|(10,[0,2,3,4,7],[...|  1.0|
|  red| bad|     1| 38.97187133755819|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|     2|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|    16|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red|good|    45| 38.97187133755819|(10,[0,2,3,4,7],[...|  1.0|
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| ba

In [59]:
train, test = preparedDF.randomSplit([0.7, 0.3])

## Estimators
Now that we have transformed our data into the correct format and created some valuable features, it’s time to actually
fit our model. In this case we will use logistic regression.

In [60]:

from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")

In [61]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

In [62]:
fittedLR = lr.fit(train)

In [63]:
fittedLR.transform(train).select("label", "prediction").show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows



## Pipelining our Workflow

In [64]:
train, test = df.randomSplit([0.7, 0.3])
rForm = RFormula()

In [65]:
lr = LogisticRegression()\
.setLabelCol("label")\
.setFeaturesCol("features")

In [66]:
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)

The next step will be evaluating the performance of this pipeline.
Spark does this by setting up a parameter grid of all the combinations of the parameters that you specify. You
should immediately notice in the following code snippet that even our RFormula is tuning specific parameters.
In a pipeline, we can modify more than just the model’s hyperparameters; we can even modify the transformer’s
properties.

In our current grid there are three hyperparameters that will diverge from the defaults:
* two different options for the R formula
* three different options for the elastic net parameter
* two different options for the regularization parameter

In [67]:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
.addGrid(rForm.formula, [
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.addGrid(lr.regParam, [0.1, 2.0])\
.build()

This gives us a total of twelve different combinations of these parameters, which means we will be training twelve
different versions of logistic regression.
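
The count of twelve is simply the product of the number of options per parameter (2 × 3 × 2). The plain-Python sketch below enumerates the same combinations with `itertools.product`; it only mirrors the idea of a parameter grid and is not how ParamGridBuilder is implemented.

```python
from itertools import product

formulas = [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2",
]
elastic_net_params = [0.0, 0.5, 1.0]
reg_params = [0.1, 2.0]

# Every combination of one formula, one elastic net value and one
# regularization value is a separate candidate model.
grid = [
    {"formula": f, "elasticNetParam": e, "regParam": r}
    for f, e, r in product(formulas, elastic_net_params, reg_params)
]
print(len(grid))  # 12
```

Because the grid grows multiplicatively, adding one more parameter with three options would triple the number of models to train.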

With the grid built, it is now time to specify our evaluation. There are evaluators for classification and regression, which
we cover in subsequent chapters. In this case, we will be using the BinaryClassificationEvaluator. This
evaluator allows us to automatically optimize our model training according to some specific criteria that we specify.
In this case we will specify areaUnderROC, which is the total area under the receiver operating characteristic curve, a very
common measure of classification performance that we cover in the classification chapter.
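
For intuition about this metric, the area under the ROC curve equals the fraction of (positive, negative) example pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python; the function name `area_under_roc` and the sample scores are illustrative, and this is not Spark's implementation.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic, so it is only suitable for small illustrations.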

In [68]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
.setMetricName("areaUnderROC")\
.setRawPredictionCol("prediction")\
.setLabelCol("label")

As you may know, it is a best practice in machine learning to fit your hyperparameters on a validation set (instead of
your test set). This prevents overfitting to the test set. Therefore we cannot use our holdout test set (that we
created before) to tune these parameters. Luckily Spark provides two options for performing this hyperparameter
tuning in an automated way. We can use a TrainValidationSplit, which will simply perform an arbitrary
random split of our data into two different groups, or a CrossValidator, which performs K-fold cross validation by
splitting the dataset into k non-overlapping randomly partitioned folds.
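
The K-fold idea can be sketched in a few lines of plain Python. The function names and the seed below are assumptions for this illustration; CrossValidator handles the partitioning internally and may do it differently.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Randomly partition row indices 0..n_rows-1 into k non-overlapping folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_validation_pairs(n_rows, k, seed=42):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = k_fold_indices(n_rows, k, seed)
    for i, validation_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, validation_idx


for train_idx, validation_idx in train_validation_pairs(10, 3):
    print(len(train_idx), len(validation_idx))
```

Each row is used for validation exactly once, which is why cross validation costs roughly k times as much as a single train/validation split.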

In [69]:
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(params)\
.setEstimator(pipeline)\
.setEvaluator(evaluator)

Now we can fit our entire pipeline. This will test out every version of the model against the validation set. You will
notice that the type of tvsFitted is TrainValidationSplitModel. Any time that we fit a given model, it
outputs a "model" type.
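
The estimator/model relationship can be mimicked with two small plain-Python classes. `MeanCenterer` and `MeanCentererModel` are hypothetical names invented for this sketch and are not part of Spark; the point is only that fit returns a new object that carries what was learned.

```python
class MeanCentererModel:
    """The fitted result: holds learned state and can transform data."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class MeanCenterer:
    """The estimator: holds configuration only; fit() learns from data."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))


model = MeanCenterer().fit([1.0, 2.0, 3.0])
print(type(model).__name__)         # MeanCentererModel
print(model.transform([2.0, 4.0]))  # [0.0, 2.0]
```

Keeping the learned state in a separate object means the same estimator can be fit again on other data without overwriting an earlier result.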

In [70]:
tvsFitted = tvs.fit(train)

In [71]:
evaluator.evaluate(tvsFitted.transform(test))

0.9117647058823529

## Deployment Patterns

1. Train your ML algorithm offline and then put the results into a database (usually a key-value store). This works
well for something like recommendation but poorly for something like classification or regression where you
cannot just lookup a value for a given user but must calculate one.
2. Train your ML algorithm offline, persist the model to disk, then use that for serving. This is not a low latency
solution as the overhead of starting up a Spark job can be quite high - even if you’re not running on a cluster.
Additionally this does not parallelize well, so you’ll likely have to put a load balancer in front of multiple model
replicas and build out some REST API integration yourself. There are some interesting potential solutions to
this problem, but nothing quite production ready yet.
3. Manually (or via some other software) convert your distributed model to one that can run much more quickly
on a single machine. This works well when there is not too much manipulation of the raw data in Spark, but it
can be hard to maintain over time. Again, there are solutions working on this as well,
but nothing production ready. This cannot be found in the previous illustration because it’s something that
requires manual work.
4. Train your ML algorithm online and use it online. This is possible when used in conjunction with streaming, but
it is quite sophisticated. This landscape will likely continue to mature as Structured Streaming development
continues.

## Graph Analysis

This chapter is going to dive into a more specialized toolset: graph processing. In the context of graphs, nodes or
vertices are the units, while edges define the relationships between those nodes. The process of graph analysis is the
process of analyzing these relationships. An example graph might be your friend group: in the context of graph
analysis, each vertex or node would represent a person and each edge would represent a relationship.


Graphs are a natural way of describing relationships and many different problem sets, and Spark provides several ways
of working in this analytics paradigm. Some business use cases could be detecting credit card fraud, determining the
importance of papers in bibliographic networks (which papers are most referenced), and ranking web pages, as Google
famously used the PageRank algorithm to do.

In [10]:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.7.0-spark2.3-s_2.11 pyspark-shell'
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.8.0-spark3.0-s_2.12 pyspark-shell'

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import udf

In [13]:
bikeStations = spark.read\
.option("header","true")\
.csv("../data/bike-data/201508_station_data.csv")
tripData = spark.read\
.option("header","true")\
.csv("../data/bike-data/201508_trip_data.csv")

## Building A Graph
The first step is to build the graph. To do this we need to define the vertices and edges. In our case we’re creating
a directed graph. This graph will point from the source to the destination. In the context of this bike trip data, this will
point from a trip’s starting location to a trip’s ending location. To define the graph, we use the naming conventions
presented in the GraphFrames library. In the vertices table we define our identifier as id and in the edges table we
label the source id as src and the destination id as dst.
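
To see what the src/dst convention buys us, here is a small plain-Python sketch of a directed graph stored as an edge list. The station names "A", "B" and "C" are hypothetical, and this does not use GraphFrames; it only shows how in-degree and out-degree fall out of the two columns.

```python
from collections import Counter

# Hypothetical station names; each edge is a (src, dst) pair, mirroring the
# src/dst column convention described above.
vertices = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "B")]

# In a directed graph, out-degree counts trips starting at a station and
# in-degree counts trips ending there.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

for v in vertices:
    print(v, "out:", out_degree[v], "in:", in_degree[v])
```

Because the graph is directed, a trip from A to B and a trip from B to A are different edges and contribute to different counts.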

This is not working in Python. Alternative: Scala.

In [14]:
stationVertices = bikeStations\
.withColumnRenamed("name", "id")\
.distinct()
tripEdges = tripData\
.withColumnRenamed("Start Station", "src")\
.withColumnRenamed("End Station", "dst")

This allows us to build our graph out of the DataFrames we have so far. We will also leverage caching because we’ll be
accessing this data frequently in the following queries.

In [15]:
from graphframes import GraphFrame

In [16]:
stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()

GraphFrame(v:[id: string, station_id: string ... 5 more fields], e:[src: string, dst: string ... 9 more fields])

## MLlib Neural Network Support

Spark’s MLlib currently has native support for one deep learning algorithm, the multilayer perceptron classifier in the
ml.classification.MultilayerPerceptronClassifier class. This class is limited to training relatively
shallow networks containing fully connected layers with the sigmoid activation function, and an output layer with
a softmax activation function. This class is most useful for training the last few layers of a classification model when
using transfer learning on top of an existing deep-learning-based featurizer. For example, it can be added on top of the
Deep Learning Pipelines library we describe later in this chapter to quickly perform transfer learning over Keras and
TensorFlow models. However, the MultilayerPerceptronClassifier alone is not enough to train a deep learning model
from scratch on raw input data.
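
The network shape described above (fully connected sigmoid layers followed by a softmax output) can be written as a short forward pass in plain Python. The weights below are arbitrary numbers chosen for illustration, and the function names are made up for this sketch; it shows prediction only, not training, and is not Spark's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    shifted = [x - max(xs) for x in xs]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dense_layer(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    hidden = [sigmoid(z) for z in dense_layer(inputs, hidden_w, hidden_b)]
    return softmax(dense_layer(hidden, out_w, out_b))


# Arbitrary illustrative weights: 2 inputs -> 3 hidden units -> 2 classes.
probs = forward(
    [1.0, -2.0],
    hidden_w=[[0.5, -0.1], [0.3, 0.8], [-0.6, 0.2]],
    hidden_b=[0.0, 0.1, -0.1],
    out_w=[[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]],
    out_b=[0.0, 0.0],
)
print(probs, sum(probs))
```

The softmax output always sums to one, so each entry can be read as a class probability.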

## Images

One of the historical challenges when it came to working with images in Spark is that getting them into a DataFrame
was difficult and tedious. Deep Learning Pipelines includes utility functions that make loading and decoding images
in a distributed fashion easy.

In [5]:
from sparkdl import readImages

Using TensorFlow backend.


In [6]:
img_dir = 'Spark-The-Definitive-Guide-master/data/deep-learning-images/'
image_df = readImages(img_dir)

In [7]:
image_df.show()
image_df.printSchema()

+--------------------+-----+
|            filePath|image|
+--------------------+-----+
|file:/home/erikap...| null|
+--------------------+-----+

root
 |-- filePath: string (nullable = false)
 |-- image: struct (nullable = true)
 |    |-- mode: string (nullable = false)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- data: binary (nullable = false)



## Transfer Learning

Now that we have some data, we can get started with some simple transfer learning. Remember, this means
leveraging a model that someone else created and modifying it to better suit our own purposes. First we will load the
data for each type of flower and create a training and a test set.

In [25]:
#!pip install sparkdl
#!pip install tensorframes
#!pip install kafka

Collecting kafka
  Downloading kafka-1.3.5-py2.py3-none-any.whl (207 kB)
     |████████████████████████████████| 207 kB 5.1 MB/s eta 0:00:01
Installing collected packages: kafka
Successfully installed kafka-1.3.5


In [27]:
from sparkdl import readImages
from pyspark.sql.functions import lit
tulips_df = readImages(img_dir + "/tulips").withColumn("label", lit(1))
daisy_df = readImages(img_dir + "/daisy").withColumn("label", lit(0))
tulips_train, tulips_test = tulips_df.randomSplit([0.6, 0.4])
daisy_train, daisy_test = daisy_df.randomSplit([0.6, 0.4])
train_df = tulips_train.unionAll(daisy_train)
test_df = tulips_test.unionAll(daisy_test)

SyntaxError: invalid syntax (simple.py, line 54)

The next step will be to leverage a transformer called the DeepImageFeaturizer. This will allow us to leverage a
pre-trained model called Inception, a powerful neural network successfully used to identify patterns in images. The
version we are using is pre-trained to work well with images. It is part of the standard pretrained models that
ship with the Keras library. However, this particular neural network is not trained to work with our particular set of
images (involving flowers). Therefore we’re going to use transfer learning in order to make it into something useful for
our own purposes.
One thing that’s quite powerful here is that we can use the same ML pipeline concepts that we saw throughout this
part of the book and leverage them with Deep Learning: DeepImageFeaturizer is just a Spark ML transformer.
Additionally, all that we’ve done to extend this model is add on a logistic regression model in order to facilitate the
training of our end model. We could use another classifier in its place.

In [9]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=1, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)

INFO:tensorflow:Froze 376 variables.


2019-02-13 13:55:26,807 INFO (MainThread-6629) Froze 376 variables.


INFO:tensorflow:Converted 376 variables to const ops.


2019-02-13 13:55:27,042 INFO (MainThread-6629) Converted 376 variables to const ops.


INFO:tensorflow:Froze 0 variables.


2019-02-13 13:55:54,473 INFO (MainThread-6629) Froze 0 variables.


INFO:tensorflow:Converted 0 variables to const ops.


2019-02-13 13:55:54,591 INFO (MainThread-6629) Converted 0 variables to const ops.
2019-02-13 13:55:55,010 INFO (MainThread-6629) Fetch names: ['sdl_flattened_mixed10/concat:0']
2019-02-13 13:55:55,013 INFO (MainThread-6629) Spark context = <SparkContext master=local[*] appName=pyspark_python>


Py4JJavaError: An error occurred while calling o235.loadClass.
: java.lang.ClassNotFoundException: org.tensorframes.impl.DebugRowOps
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


Once we have trained the model, we can use the same classification evaluator that we saw several chapters ago. We can
specify the metric we’d like to test and then test against that.

In [22]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
tested_df = p_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select("prediction", "label"))))

NameError: name 'p_model' is not defined

With our DataFrame of examples, we can inspect the rows and images in which we made mistakes in the previous
training.
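
The idea behind the inspection that follows is to rank examples by how far the predicted probability of class 1 is from the true label. The plain-Python sketch below does this on a handful of hypothetical rows (the file names and probabilities are invented for illustration, not results from this notebook).

```python
# Hypothetical (filePath, p_1, label) rows; p_1 is the predicted probability of label 1.
rows = [
    ("img_a.jpg", 0.95, 1),
    ("img_b.jpg", 0.20, 1),
    ("img_c.jpg", 0.60, 0),
    ("img_d.jpg", 0.05, 0),
]

# The largest |p_1 - label| values are the most confident mistakes.
worst_first = sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)
for path, p_1, label in worst_first:
    print(path, p_1, label)
```

Sorting this way puts the images the model got most confidently wrong at the top, which is usually where labeling errors or unusual inputs show up.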

In [19]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import expr
# a simple UDF to convert the value to a double
def _p1(v):
    return float(v.array[1])


In [20]:
p1 = udf(_p1, DoubleType())


AttributeError: 'DoubleType' object has no attribute 'array'

In [21]:
df = tested_df.withColumn("p_1", p1(tested_df.probability))
wrong_df = df.orderBy(expr("abs(p_1 - label)"), ascending=False)
wrong_df.select("filePath", "p_1", "label").limit(10).show()

NameError: name 'tested_df' is not defined