# Spark Learning Note - MLlib
Jia Geng | gjia0214@gmail.com


<a id='directory'></a>

## Directory

- [Data Source](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/)
- [1. Some Machine Learning Examples](#sec1)
- [2. Classic ML Developmental Stages](#sec2)
- [3. Spark MLlib Overview](#sec3)
- [4. Simple Example Walk Through](#sec4)
    - [4.1 Load the data](#sec4-1)
    - [4.2 Transformer - RFormula](#sec4-2)
    - [4.3 Estimator](#sec4-3)
    - [4.4 Pipeline and GridSearch](#sec4-4)
    - [4.5 Tuning (Evaluator and GridSearch)](#sec4-5)

## 1. Some Machine Learning Examples <a id='sec1'></a>

Supervised Learning
- classification
    - predicting disease
    - classifying images
- regression
    - predicting sales
    - predicting the number of viewers of a show
    
Recommendation
- movie recommendation
- product recommendation

Unsupervised Learning
- anomaly detection
- user segmentation 
- topic modeling

Graph Analysis
- fraud prediction
    - e.g. an account within two hops of a fraudulent number might be considered suspicious
- anomaly detection
    - e.g. if each vertex in the data typically has ten edges, a vertex with only one edge is a possible anomaly
- classification
    - influencers' networks tend to have a similar structure
- recommendation
    - PageRank is a graph algorithm!
   
[back to top](#directory)

## 2. Classic ML Developmental Stages <a id='sec2'></a>

- collect data
- clean data
- feature engineering
- modeling
- evaluating and tuning
- leveraging model/insights

[back to top](#directory)


## 3. Spark MLlib Overview <a id='sec3'></a>

Spark MLlib provides two core packages for machine learning:
- `pyspark.ml`: provides high-level DataFrame APIs for building machine learning pipelines
- `pyspark.mllib`: provides low-level RDD APIs


**Spark MLlib vs Other ML packages**
- most other ML packages are **single-machine tools**
- when to use MLlib?
    - when the data is large: use MLlib for feature engineering, then a single-machine tool for modeling
    - when the data and the model are both too large to fit on one machine: MLlib makes distributed machine learning simple
- potential disadvantage of MLlib
    - MLlib has no built-in way to serve low-latency predictions from a deployed model
    - you might want to export the model to another serving system or a custom application for that
    
**Spark Structural Types**
- Transformers: functions that convert raw data in some way
- Estimators
    - can be a kind of transformer that is initialized with data, e.g. normalizing data requires the mean and std computed from the data
    - can also be algorithms that allow users to train a model from data
- Evaluator: provides insight into how a model performs according to some criterion we specify, such as AUC
- Pipeline: a container that chains the stages of the process, like the scikit-learn pipeline
- **The transformer, estimator and evaluator classes can usually be instantiated as 'blank' objects, with attributes and configuration set later. This is what lets these classes support Pipeline construction and grid search.**
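
The estimator/transformer split can be sketched in plain Python. This is a minimal illustration, not MLlib code: `SketchScaler` and `SketchScalerModel` are hypothetical names, and the data is a plain list rather than a DataFrame. The point is that the estimator starts out 'blank', and `fit` returns a separate transformer that carries what was learned from the data.

```python
from statistics import mean, stdev


class SketchScalerModel:
    """Transformer: applies the statistics learned during fit."""

    def __init__(self, center, scale):
        self.center = center
        self.scale = scale

    def transform(self, values):
        return [(v - self.center) / self.scale for v in values]


class SketchScaler:
    """Estimator: holds configuration only, until fit() sees data."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, values):
        center = mean(values) if self.with_mean else 0.0
        return SketchScalerModel(center, stdev(values))


scaler = SketchScaler()                    # 'blank' estimator, no data yet
model = scaler.fit([1.0, 2.0, 3.0])        # learns mean 2.0 and sample std 1.0
scaled = model.transform([1.0, 2.0, 3.0])
print(scaled)                              # [-1.0, 0.0, 1.0]
```

Because the configuration (`with_mean`) lives on the estimator and the learned state lives on the model, the same blank estimator can be refit with different settings, which is what a grid search relies on.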

**Spark Low Level Data Types**
- `from pyspark.ml.linalg import Vectors`
- Dense Vector: `Vectors.dense(1.0, 2.0, 3.0)`
- Sparse Vector: `Vectors.sparse(size, idx, values)`, where `idx` lists the positions that are not zero and `values` holds the entries at those positions
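
To make the sparse layout concrete, here is a small plain-Python sketch of what a `(size, indices, values)` triple stands for. `sparse_to_dense` is a hypothetical helper written for this note, not part of `pyspark.ml.linalg`.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Same layout as Vectors.sparse(5, [1, 3], [2.0, 4.0])
dense = sparse_to_dense(5, [1, 3], [2.0, 4.0])
print(dense)  # [0.0, 2.0, 0.0, 4.0, 0.0]
```

Spark prints sparse vectors in the same order, e.g. `(5,[1,3],[2.0,4.0])`.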

[back to top](#directory)

## 4. Simple Example Walk Through <a id='sec4'></a>


### 4.1 Load the data <a id='sec4-1'></a>

Initialize the Spark session, load the data, set up partitions, cache if needed, and do some exploration such as counting rows, checking for nulls, and summarizing columns.

[back to top](#directory)

In [1]:
from pyspark.sql.session import SparkSession

data_example_path = '/home/jgeng/Documents/Git/SparkLearning/data/simple-ml' 
spark = SparkSession.builder.appName('MLexample').getOrCreate()
spark

In [2]:
# load the data
df = spark.read.json(data_example_path)

In [3]:
from pyspark.sql.functions import col  # only col is used below; importing max/min would shadow the Python builtins

# check on schema
df.show(3)
df.printSchema()

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 3 rows

root
 |-- color: string (nullable = true)
 |-- lab: string (nullable = true)
 |-- value1: long (nullable = true)
 |-- value2: double (nullable = true)



In [4]:
# check nulls
for col_name in df.columns:
    print(df.where('{} is null'.format(col_name)).count())

0
0
0
0


In [5]:
df.select(col('color')).distinct().show(3)
df.select(col('lab')).distinct().show(3)
df.select('value1', 'value2').summary().show()

+-----+
|color|
+-----+
|green|
|  red|
| blue|
+-----+

+----+
| lab|
+----+
| bad|
|good|
+----+

+-------+------------------+------------------+
|summary|            value1|            value2|
+-------+------------------+------------------+
|  count|               110|               110|
|   mean|14.818181818181818|  21.0914521792258|
| stddev|13.305294399193416|10.999588110596887|
|    min|                 1|14.386294994851129|
|    25%|                 2|14.386294994851129|
|    50%|                12|14.386294994851129|
|    75%|                16| 38.97187133755819|
|    max|                45| 38.97187133755819|
+-------+------------------+------------------+



### 4.2 Transformer - RFormula <a id='sec4-2'></a>

Most machine learning algorithms in MLlib need the input to be transformed into:
- Double for labels
- Vector[Double] for features

**Use R-like operators to build an `RFormula` as the transformer**
- under `pyspark.ml.feature`
- `~` separates the target and the terms
- `+` to concat/include a feature
    - `+ 0` to remove the intercept
- `-` to remove a term
    - `- 1` to remove the intercept (same as `+ 0`)
- `:` for the interaction between two features, i.e. multiplication for numeric values or binarized categorical values
- `.` for all columns except the target

E.g.
`lab~.+color:value1+color:value2` means
- `lab` is the target
- the model takes all columns except the `lab` column as input
- the model also takes the interaction terms `color:value1` and `color:value2` as input
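
One way to read this formula is to expand a row by hand. The sketch below is plain Python, not Spark's implementation. The category order `("red", "green", "blue")`, and the rule that the main `color` term drops its last category while the interaction terms keep all three, are assumptions read off the `features` column printed by the transform cell in this section.

```python
def encode_row(color, value1, value2, categories=("red", "green", "blue")):
    """Expand one row for lab~.+color:value1+color:value2 (assumed layout)."""
    one_hot = [1.0 if color == c else 0.0 for c in categories]
    return (
        one_hot[:-1]                        # color main effect, last level dropped
        + [float(value1), float(value2)]    # numeric columns as-is
        + [h * value1 for h in one_hot]     # color:value1 interaction
        + [h * value2 for h in one_hot]     # color:value2 interaction
    )


def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries."""
    idx = [i for i, v in enumerate(dense) if v != 0.0]
    return len(dense), idx, [dense[i] for i in idx]


green = to_sparse(encode_row("green", 1, 14.386294994851129))
blue = to_sparse(encode_row("blue", 8, 14.386294994851129))
print(green)  # (10, [1, 2, 3, 5, 8], [1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129])
print(blue[:2])  # (10, [2, 3, 6, 9])
```

Under these assumptions the ten features are: two for `color`, one each for `value1` and `value2`, and three for each interaction term.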

To transform data into usable features:
- build an `RFormula` object
- use `RFormula.fit(data_df)` to set up the transform configuration; `fit` returns an `RFormulaModel` object
- transform the data via the `RFormulaModel` by calling `.transform(data)`

[back to top](#directory)

In [6]:
from pyspark.ml.feature import RFormula

# specify the transformer using RFormula
rfm = RFormula()
rfm.setFormula('lab~.+color:value1+color:value2')

RFormula_1be4c97d054f

In [7]:
# fit the rformula object with data to create the transformer
transformer = rfm.fit(df)
print(type(transformer))
print(transformer.explainParams())

<class 'pyspark.ml.feature.RFormulaModel'>
featuresCol: features column name (default: features)
forceIndexLabel: Force to index label whether it is numeric or string (default: False)
formula: R model formula (current: lab~.+color:value1+color:value2)
handleInvalid: How to handle invalid data (unseen or NULL values) in features and label column of string type. Options are 'skip' (filter out rows with invalid data), error (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels). (default: error)
labelCol: label column name (default: label)
stringIndexerOrderType: How to order categories of a string FEATURE column used by StringIndexer. The last category after ordering is dropped when encoding strings. Supported options: frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc. The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', RFormula drops the same category as R when encoding strings. (default: frequencyDesc)


In [8]:
# transform - it appends the features and label columns to the original df
preparedDF = transformer.transform(df)
preparedDF.show(3)
preparedDF.printSchema()
preparedDF.select('features').show(3, False)

+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
+-----+----+------+------------------+--------------------+-----+
only showing top 3 rows

root
 |-- color: string (nullable = true)
 |-- lab: string (nullable = true)
 |-- value1: long (nullable = true)
 |-- value2: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)

+--------------------------------------------------------------------+
|features                                                            |
+--------------------------------------------------------------------+
|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])|

In [9]:
# split the data into train and test (the split is random, so the counts vary between runs)
train, test = preparedDF.randomSplit([0.7, 0.3])
print(train.count())
print(test.count())

78
32


### 4.3 Estimator <a id='sec4-3'></a>

Most of the algorithms are under `pyspark.ml`, e.g. logistic regression is `pyspark.ml.classification.LogisticRegression`.

The classifier constructor usually takes parameters that specify the feature column and the label column, along with some hyperparameters. **Most classifier objects have a function `explainParams()` that provides information about the hyperparameters.**

Estimator
- the class object only contains the parameter configuration for the model, e.g. `LogisticRegression`
- use `.fit()` to fit the training data
    - `fit` returns a trained classifier, e.g. `LogisticRegressionModel`. This is the object that contains the weights etc. for making predictions!
- use `.transform()` to make predictions since, logically, prediction just transforms the input into labels!
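
To see why prediction counts as a transform, here is a plain-Python sketch of how a fitted binary logistic model could turn one feature vector into the three output columns. `predict_binary` is a hypothetical helper. The two-element `[-margin, margin]` layout is an assumption that mirrors the `rawPrediction` and `probability` columns printed by the prediction cell in this section, and the sketch has no overflow guards.

```python
import math


def predict_binary(features, coefficients, intercept, threshold=0.5):
    """Score one row with a standard binary logistic model."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    p1 = 1.0 / (1.0 + math.exp(-margin))    # probability of label 1
    raw_prediction = [-margin, margin]       # one entry per class
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return raw_prediction, probability, prediction


# Hypothetical weights and inputs, chosen only to make the arithmetic easy
raw, prob, pred = predict_binary([2.0, 1.0], [1.0, -2.0], intercept=0.5)
print(raw, pred)  # [-0.5, 0.5] 1.0
```

A large positive first entry in `rawPrediction`, as in the printed rows, corresponds to a strongly negative margin and therefore to a prediction of `0.0`.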

[back to top](#directory)

In [10]:
from pyspark.ml.classification import LogisticRegression

In [11]:
logit = LogisticRegression(labelCol='label', featuresCol='features')
print(type(logit))
print(logit.explainParams())

<class 'pyspark.ml.classification.LogisticRegression'>
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercep

In [12]:
# fit the model with training data
clf = logit.fit(train)
print(type(clf))

<class 'pyspark.ml.classification.LogisticRegressionModel'>


In [13]:
print(clf.coefficientMatrix)

DenseMatrix([[-86.68052131,  98.78827878,   1.91225145,  -0.7738317 ,
                3.56939876, -10.37593858,  -6.27052032,  -1.56046852,
                2.68388541,  -4.62152395]])


In [14]:
# making predictions
clf.transform(test).show(3) # probably want to select the probability and prediction column only 

+-----+---+------+------------------+--------------------+-----+--------------------+--------------------+----------+
|color|lab|value1|            value2|            features|label|       rawPrediction|         probability|prediction|
+-----+---+------+------------------+--------------------+-----+--------------------+--------------------+----------+
| blue|bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|[119.925940598685...|[1.0,8.2570660939...|       0.0|
| blue|bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|[119.925940598685...|[1.0,8.2570660939...|       0.0|
| blue|bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|[119.925940598685...|[1.0,8.2570660939...|       0.0|
+-----+---+------+------------------+--------------------+-----+--------------------+--------------------+----------+
only showing top 3 rows



### 4.4 Pipeline and GridSearch <a id='sec4-4'></a>

Spark also has a pipeline class: `pyspark.ml.Pipeline`. A `Pipeline` is essentially a compact estimator that handles feature transformation, model fitting and prediction. Its stages (the transformers and estimators, in order) are set with `setStages()`, and the fitted `PipelineModel` exposes the fitted stages through its `.stages` attribute.
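
The fitting logic of a pipeline can be sketched in plain Python. This is a conceptual illustration with hypothetical classes (`AddOne`, `Centerer`, `SketchPipeline`), not the MLlib implementation: each estimator is fitted on the data as transformed by the stages before it, and the result is a model whose stages are all transformers.

```python
class AddOne:
    """Hypothetical transformer: no fitting needed."""

    def transform(self, values):
        return [v + 1 for v in values]


class CentererModel:
    def __init__(self, center):
        self.center = center

    def transform(self, values):
        return [v - self.center for v in values]


class Centerer:
    """Hypothetical estimator: learns the mean, returns a transformer."""

    def fit(self, values):
        return CentererModel(sum(values) / len(values))


class SketchPipelineModel:
    def __init__(self, stages):
        self.stages = stages  # fitted stages, in order

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class SketchPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return SketchPipelineModel(fitted)


pipeline_model = SketchPipeline([AddOne(), Centerer()]).fit([1, 2, 3])
print(pipeline_model.transform([1, 2, 3]))  # [-1.0, 0.0, 1.0]
print(pipeline_model.stages[1].center)      # 3.0
```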

[back to top](#directory)

### 4.5 Tuning (Evaluator and GridSearch)  <a id='sec4-5'></a>

Spark provides a very compact way to do model selection.

Steps:
- initialize the transformer, estimator and pipeline
- set up the `ParamGridBuilder` under `pyspark.ml.tuning` for the grid search
    - `ParamGridBuilder` can configure the search space for both the transformer (feature subsets) and the estimator (model hyperparameters)
    - use `ParamGridBuilder.addGrid(param, candidates)` to add each parameter and its candidate values
- create an evaluator. `pyspark.ml.evaluation` hosts different types of evaluators for different tasks, used to evaluate model performance. When constructing the evaluator, you usually need to call:
    - `setMetricName()`
    - `setRawPredictionCol()`
    - `setLabelCol()`
- create a validator, e.g. `TrainValidationSplit`. This is a compact class that takes in the pipeline, the evaluator and the parameter grid, and does the tuning.
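
A grid search tries every combination of the candidate values. The plain-Python sketch below shows that expansion with `itertools.product`, using the candidate lists from the tuning cell in this section. The string keys are stand-ins for the actual `Param` objects, so this is only a picture of the search space, not what `ParamGridBuilder.build()` returns.

```python
from itertools import product

# Candidate values copied from the tuning cell in this section
grid = {
    "formula": ["lab~.", "lab~.+color:value1+color:value2"],
    "elasticNetParam": [0, 0.5, 1],
    "regParam": [0, 1e-3, 1e-2, 1e-1, 1, 10],
}

# One dict per combination: 2 * 3 * 6 = 36 candidate configurations
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(param_maps))  # 36
print(param_maps[0])    # {'formula': 'lab~.', 'elasticNetParam': 0, 'regParam': 0}
```

Each candidate means one full pipeline fit, so the grid size multiplies the training cost.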

After training:
- use `evaluator.evaluate(tvsFitted.bestModel.transform(test))` to measure performance on the test set
- to check the training record of the best model, use `summary = tvsFitted.bestModel.stages[1].summary`
    - index into `stages` to get the classifier when a pipeline was used as the estimator
    - `summary.objectiveHistory` is the loss history during training
    - `summary.roc.show()` gives the ROC curve data
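
The ROC data relates to the `areaUnderROC` metric through the area under the (FPR, TPR) points. Below is a minimal plain-Python sketch using the trapezoid rule; `auc_trapezoid` is a hypothetical helper, and Spark's own computation may differ in detail. The points are the first rows of the ROC output printed in this section plus the `(1.0, 1.0)` end point that every ROC curve has. The rows in between are omitted; since TPR cannot decrease once it reaches 1.0, they do not change the area.

```python
def auc_trapezoid(points):
    """Area under a curve given (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


roc_points = [
    (0.0, 0.0),
    (0.0, 0.7222222222222222),
    (0.0, 0.8611111111111112),
    (0.0, 1.0),
    (1.0, 1.0),  # end point of any ROC curve
]
print(auc_trapezoid(roc_points))  # 1.0
```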

To save or load a model, use the `write` and `load` methods.

[back to top](#directory)

In [15]:
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# prepare the transformer and the estimator
# do not specify any hyperparameters here
rfm = RFormula()
logit = LogisticRegression().setLabelCol('label').setFeaturesCol('features')

# construct the pipeline
ppBuilder = Pipeline()

# set up the stage
stages = [rfm, logit]
pp = ppBuilder.setStages(stages)  # setStages configures the pipeline in place and returns the same object
print(type(ppBuilder), type(pp))

<class 'pyspark.ml.pipeline.Pipeline'> <class 'pyspark.ml.pipeline.Pipeline'>


In [122]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

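# build the evaluator
# note: 'prediction' holds hard 0/1 labels, so this AUC is computed from a single threshold;
# the default 'rawPrediction' column would give a threshold-free AUC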
evaluator = BinaryClassificationEvaluator().setMetricName('areaUnderROC')\
                                            .setRawPredictionCol('prediction')\
                                            .setLabelCol('label')

In [123]:
from pyspark.ml.tuning import ParamGridBuilder

# building the grid search space
rfm_can = ['lab~.', 'lab~.+color:value1+color:value2']  # feature space
enet_can = [0, 0.5, 1]  # 0 for L2, 0.5 for a mix of L1 and L2, 1 for L1
reg_can = [0, 1e-3, 1e-2, 1e-1, 1, 10]  # 0 for no regularization
params = ParamGridBuilder().addGrid(rfm.formula, rfm_can)\
                            .addGrid(logit.elasticNetParam, enet_can)\
                            .addGrid(logit.regParam, reg_can)\
                            .build()  # don't forget to call build!

In [124]:
from pyspark.ml.tuning import TrainValidationSplit

# build the train/validation splitter
# tvs will hold out 0.25 of the training data for validation
tvs = TrainValidationSplit().setTrainRatio(0.75)\
                            .setEstimator(pp)\
                            .setEvaluator(evaluator)\
                            .setEstimatorParamMaps(params)  

In [125]:
train, test = df.randomSplit([0.7, 0.3])
tvsFitted = tvs.fit(train)  # train on the train data

In [126]:
evaluator.evaluate(tvsFitted.bestModel.transform(test))

1.0

In [127]:
# get the validation results via validationMetrics
# get the associated model params via getEstimatorParamMaps
print(len(tvsFitted.getEstimatorParamMaps()))
print(len(tvsFitted.validationMetrics))
print(tvsFitted.validationMetrics)

36
36
[0.8444444444444443, 0.8444444444444443, 0.8444444444444443, 0.9, 0.6, 0.5, 0.8444444444444443, 0.8444444444444443, 0.8444444444444443, 0.9, 0.5, 0.5, 0.8444444444444443, 0.8444444444444443, 0.8444444444444443, 0.9, 0.5, 0.5, 1.0, 1.0, 0.9, 0.9, 0.8, 0.5, 1.0, 1.0, 0.9, 0.9, 0.5, 0.5, 1.0, 1.0, 0.9, 0.9, 0.5, 0.5]
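
The metrics list lines up index by index with `tvsFitted.getEstimatorParamMaps()`, so the best configurations can be found by position. The plain-Python sketch below uses the values printed above (the variable names are only for this note) and shows that several candidates tie for the top score.

```python
a = 0.8444444444444443  # repeated value in the printed metrics

# Validation metrics copied from the output above, six per row
metrics = [
    a, a, a, 0.9, 0.6, 0.5,
    a, a, a, 0.9, 0.5, 0.5,
    a, a, a, 0.9, 0.5, 0.5,
    1.0, 1.0, 0.9, 0.9, 0.8, 0.5,
    1.0, 1.0, 0.9, 0.9, 0.5, 0.5,
    1.0, 1.0, 0.9, 0.9, 0.5, 0.5,
]

best = max(metrics)
ties = [i for i, m in enumerate(metrics) if m == best]
print(best, ties)  # 1.0 [18, 19, 24, 25, 30, 31]
```

With the fitted object at hand, `tvsFitted.getEstimatorParamMaps()[i]` for each index in `ties` shows which settings reached that score. Ties like this are unsurprising on a holdout set carved from a training split of roughly 110 * 0.7 rows.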


In [128]:
# get the best logistic model summary
# since the model is a pipeline,
# first get the classifier via stages,
# then get the summary
summary = tvsFitted.bestModel.stages[1].summary
loss_history = summary.objectiveHistory
print(loss_history, len(loss_history))
summary.roc.show()  # get the roc curve detail

[0.6892163745019179, 0.510502314570867, 0.3664536939110651, 0.31679257987553894, 0.24655236616423126, 0.18633891607378847, 0.12416752564202191, 0.08844790966404317, 0.0634782659653297, 0.03650144877617272, 0.01753512868955874, 0.008457676545887327, 0.004137592979524203, 0.002055533225166825, 0.0010322038166535655, 0.0007264889100313303, 0.00033311393984353466, 0.0002064664704583792, 7.880561922556143e-05, 3.8462902620327634e-05, 2.1691150840552562e-05, 1.4897791984951597e-05, 7.593582642831163e-06, 4.03270219201795e-06, 2.031491353222154e-06, 1.0324961219027968e-06, 5.207674153398636e-07, 2.6302611265871177e-07, 1.3270426678462717e-07, 6.697408502800347e-08, 3.37985945598921e-08] 31
+-------------------+------------------+
|                FPR|               TPR|
+-------------------+------------------+
|                0.0|               0.0|
|                0.0|0.7222222222222222|
|                0.0|0.8611111111111112|
|                0.0|               1.0|
|0.20930232558139536|

[back to top](#directory)