# Ch24 Advanced Analytics and Machine Learning Overview

### Summary of modules and objects used in the notebook
- From **`pyspark.ml`**, import **`Pipeline`** for pipelines
- From **`pyspark.ml.tuning`**, import **`TrainValidationSplit`** and **`ParamGridBuilder`** for tuning
- From **`pyspark.ml.feature`**, import **`RFormula`** for feature engineering with transformers
- From **`pyspark.ml.classification`**, import **`LogisticRegression`** for modeling with estimators
- From **`pyspark.ml.evaluation`**, import **`BinaryClassificationEvaluator`** for evaluation with evaluators

In [2]:
df = spark.read.json("../pyspark-training/data/The-Definitive-Guide/simple-ml")
df.orderBy("value2").show()

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
|green| bad|    16|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green| bad|    16|14.386294994851129|
|green|good|    12|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red| bad|     2|14.386294994851129|
|  red| bad|    16|14.386294994851129|
|  red| bad|    16|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|green|good|     1|14.386294994851129|
|green|good|    12|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|  red|good|    35|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|  red| bad|    16|14.386294994851129|
|green|good|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 20 rows



## Feature Engineering with Transformers
### RFormula
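`RFormula` must be fitted before it can transform data: `fit` scans the input (for example, to discover the distinct values of `color`) and returns a fitted model, which is the transformer that adds the `features` and `label` columns.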

In [3]:
from pyspark.ml.feature import RFormula

In [4]:
supervised = RFormula(formula = "lab ~ . + color:value1 + color:value2")

In [7]:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show()

+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
|green|good|    15| 38.97187133755819|(10,[1,2,3,5,8],[...|  1.0|
|green|good|    12|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
|green| bad|    16|14.386294994851129|(10,[1,2,3,5,8],[...|  0.0|
|  red|good|    35|14.386294994851129|(10,[0,2,3,4,7],[...|  1.0|
|  red| bad|     1| 38.97187133755819|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|     2|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|    16|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red|good|    45| 38.97187133755819|(10,[0,2,3,4,7],[...|  1.0|
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| ba
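
The sparse vectors in the `features` column can be hard to read. The sketch below is plain Python, not Spark's implementation. It mimics the 10-slot layout that can be inferred from the indices printed above: two dummy-coded `color` slots, `value1`, `value2`, then one slot per color for each interaction term. The names `COLORS`, `encode`, and `nonzero_indices` are hypothetical, and the category order is an assumption read off the output.

```python
# Illustrative only: reproduces the slot layout inferred from the output above.
# The category order is an assumption; blue acts as the reference level.
COLORS = ["red", "green", "blue"]


def encode(color, value1, value2):
    """Build a dense 10-element feature list for one row."""
    vec = [0.0] * 10
    idx = COLORS.index(color)
    # Slots 0-1: dummy-coded color (the last category gets no slot of its own).
    if idx < 2:
        vec[idx] = 1.0
    # Slots 2-3: the numeric columns themselves.
    vec[2] = float(value1)
    vec[3] = float(value2)
    # Slots 4-6: color:value1, one slot per color.
    vec[4 + idx] = float(value1)
    # Slots 7-9: color:value2, one slot per color.
    vec[7 + idx] = float(value2)
    return vec


def nonzero_indices(vec):
    """Return the positions a sparse vector would list as active."""
    return [i for i, v in enumerate(vec) if v != 0.0]


print(nonzero_indices(encode("green", 1, 14.386294994851129)))
print(nonzero_indices(encode("blue", 8, 14.386294994851129)))
print(nonzero_indices(encode("red", 35, 14.386294994851129)))
```

The active positions for green, blue, and red rows line up with the index lists shown in the `features` column.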

### Create a test set

In [8]:
train, test = preparedDF.randomSplit([0.7, 0.3])
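
A weighted random split assigns each row to a bucket independently, so the resulting sizes are only approximately 70/30. The sketch below shows that idea in plain Python. It is not Spark's implementation, and `random_split` is a hypothetical helper written for this note.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any floating-point remainder.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, for example `preparedDF.randomSplit([0.7, 0.3], seed=42)`, which helps make a split repeatable across runs.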

## Estimator

In [10]:
from pyspark.ml.classification import LogisticRegression

In [15]:
lr = LogisticRegression(labelCol = "label", featuresCol = "features")

In [13]:
# Inspect parameters
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: feature)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The 
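
The `elasticNetParam` description above refers to a mixing weight between the L1 and L2 penalties. As a sketch of how it combines with the regularization strength `regParam`, the penalty term can be written in the form given in Spark's linear-methods documentation. The symbols are notation chosen for this note: $\alpha$ is `elasticNetParam`, $\lambda$ is `regParam`, and $w$ is the coefficient vector.

$$
\alpha \left( \lambda \lVert w \rVert_1 \right) + (1 - \alpha) \left( \frac{\lambda}{2} \lVert w \rVert_2^2 \right), \qquad \alpha \in [0, 1],\ \lambda \ge 0
$$

With $\alpha = 0$ only the L2 term remains, and with $\alpha = 1$ only the L1 term remains, which matches the parameter description.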

In [16]:
fittedLR = lr.fit(train)

In [18]:
fittedLR.transform(train).select("label", "prediction").show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows
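
The `prediction` column holds a hard class label. For a binomial logistic regression, that label comes from thresholding a predicted probability. The sketch below is plain Python with made-up coefficients. `predict` is a hypothetical helper, and the 0.5 threshold is assumed to match the model's default.

```python
import math


def predict(features, coefficients, intercept, threshold=0.5):
    """Return (predicted label, probability of class 1) for one feature vector."""
    # Linear margin: intercept plus the dot product of weights and features.
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # The logistic (sigmoid) function maps the margin to a probability.
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1.0 if probability > threshold else 0.0
    return label, probability


# Hypothetical coefficients, chosen only to exercise both outcomes.
print(predict([1.0, 2.0], [0.5, 0.75], 0.0))
print(predict([1.0, 2.0], [-0.5, -0.75], 0.0))
```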



## Pipeline

In [20]:
from pyspark.ml import Pipeline

In [22]:
# Create holdout set from original data
train, test = df.randomSplit([0.7, 0.3])

rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)
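
A pipeline fits its stages in order, passing each stage's transformed output to the next. The fitted result then replays those learned transformations on new data. The sketch below illustrates that flow in plain Python with toy stages. `MeanCenterer`, `SignClassifier`, and `fit_pipeline` are hypothetical stand-ins, not Spark classes.

```python
class MeanCenterer:
    """Estimator-like stage: fit() learns a mean and returns a transformer."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return lambda vs: [v - mean for v in vs]


class SignClassifier:
    """Estimator-like stage whose fitted model labels positive inputs as 1.0."""

    def fit(self, values):
        return lambda vs: [1.0 if v > 0 else 0.0 for v in vs]


def fit_pipeline(stages, values):
    """Fit each stage on the output of the previous one; return a transform function."""
    fitted = []
    current = values
    for stage in stages:
        model = stage.fit(current)
        current = model(current)
        fitted.append(model)

    def transform(vs):
        for model in fitted:
            vs = model(vs)
        return vs

    return transform


model = fit_pipeline([MeanCenterer(), SignClassifier()], [1.0, 2.0, 3.0, 6.0])
print(model([0.0, 10.0]))
```

Because the mean is learned only from the data passed to `fit_pipeline`, new rows are centered with the training-time statistic. This is one reason to split the raw DataFrame before fitting the pipeline, as the cell above does.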

## Training and Evaluation

In [23]:
from pyspark.ml.tuning import ParamGridBuilder

In [24]:
params = ParamGridBuilder()\
    .addGrid(rForm.formula, ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .addGrid(lr.regParam, [0.1, 2.0])\
    .build()
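
The grid above is the Cartesian product of the listed values, so it produces 2 × 3 × 2 = 12 parameter combinations, each of which is trained once. A quick way to check that count in plain Python, using ordinary dictionaries as a stand-in for Spark's parameter maps:

```python
from itertools import product

formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net_values = [0.0, 0.5, 1.0]
reg_values = [0.1, 2.0]

# One dictionary per combination; Spark builds analogous parameter maps.
grid = [
    {"formula": f, "elasticNetParam": a, "regParam": r}
    for f, a, r in product(formulas, elastic_net_values, reg_values)
]

print(len(grid))
print(grid[0])
```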

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [27]:
evaluator = BinaryClassificationEvaluator()\
    .setMetricName("areaUnderROC")\
    .setRawPredictionCol("prediction")\
    .setLabelCol("label")
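
The evaluator above reads its scores from the hard 0/1 `prediction` column rather than from a continuous score (the evaluator's default raw prediction column is `rawPrediction`). The sketch below shows what that choice can do to the metric. It uses the pairwise-ranking definition of AUC in plain Python, with made-up labels and probabilities. `auc` is a hypothetical helper, not Spark's implementation.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example data, for illustration only.
labels = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]
hard_predictions = [1.0 if p > 0.5 else 0.0 for p in probabilities]

print(auc(labels, probabilities))     # uses the full ranking of scores
print(auc(labels, hard_predictions))  # collapses to (TPR + TNR) / 2
```

With hard predictions the score only has two levels, so the metric reduces to the average of the true-positive and true-negative rates and discards ranking information that continuous scores would keep.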

In [28]:
from pyspark.ml.tuning import TrainValidationSplit

In [29]:
tvs = TrainValidationSplit()\
    .setTrainRatio(0.75)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)
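
Train-validation split tuning holds out part of the training data once, fits one model per parameter combination on the remainder, scores each on the held-out part, and refits the best combination on all of the data it was given. The sketch below shows that procedure in plain Python with a toy threshold "model". All function names are hypothetical, and the shuffle-and-cut split is a simplification of a random split.

```python
import random


def train_validation_split(rows, param_grid, fit, score, train_ratio=0.75, seed=0):
    """Pick the best parameters on a single held-out split, then refit on all rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train_part, validation_part = shuffled[:cut], shuffled[cut:]
    results = []
    for params in param_grid:
        model = fit(train_part, params)
        results.append((score(model, validation_part), params))
    # Larger is better here, as with area under the ROC curve.
    best_score, best_params = max(results, key=lambda r: r[0])
    return fit(rows, best_params), best_params, best_score


# Toy setup: rows are (x, label) pairs and the "model" is just a cutoff on x.
def fit(rows, params):
    return params["threshold"]


def score(model, rows):
    correct = sum(1 for x, y in rows if (1.0 if x >= model else 0.0) == y)
    return correct / len(rows)


data = [(x, 1.0 if x >= 50 else 0.0) for x in range(100)]
grid = [{"threshold": 50}, {"threshold": 20}, {"threshold": 80}]
best_model, best_params, best_score = train_validation_split(data, grid, fit, score)
print(best_params, best_score)
```

With `setTrainRatio(0.75)`, a quarter of the rows passed to `fit` are used only for comparing parameter combinations. A separate test set is still needed for the final estimate.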

In [30]:
tvsFitted = tvs.fit(train)

In [31]:
evaluator.evaluate(tvsFitted.transform(test))

0.9411764705882353

## Persisting and Applying Models
`tvsFitted.write().overwrite().save("path")`