## Machine Learning with MLlib

In [1]:
import org.apache.spark.sql.Row

val frame = spark
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("../data/heart.csv")

frame.show()

Intitializing Scala interpreter ...

Spark Web UI available at http://8f3d79309c47:4041
SparkContext available as 'sc' (version = 2.4.3, master = local[*], app id = local-1563986132061)
SparkSession available as 'spark'


+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|
| 56|  1|  1|     120| 236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|
| 57|  0|  0|     120| 354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|
| 57|  1|  0|     140| 192|  0|      1|    148|    0|    0.4|    1|  0|   1|     1|
| 56|  0|  1|     140| 294|  0|      0|    153|    0|    1.3|    1|  0|   2|     1|
| 44|  1|  1|     120| 263|  0|      1|    173|    0|    0.0|    2|  0|   3|     1|
| 52|  1|  2|     172| 199|  1|      1|    162|    0|    0.5|    2|  0|   3|

import org.apache.spark.sql.Row
frame: org.apache.spark.sql.DataFrame = [age: int, sex: int ... 12 more fields]


## Showing the schema

In [2]:
frame.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: integer (nullable = true)
 |-- cp: integer (nullable = true)
 |-- trestbps: integer (nullable = true)
 |-- chol: integer (nullable = true)
 |-- fbs: integer (nullable = true)
 |-- restecg: integer (nullable = true)
 |-- thalach: integer (nullable = true)
 |-- exang: integer (nullable = true)
 |-- oldpeak: double (nullable = true)
 |-- slope: integer (nullable = true)
 |-- ca: integer (nullable = true)
 |-- thal: integer (nullable = true)
 |-- target: integer (nullable = true)



## Isolate the features into a feature column

* Data scientists refer to the input columns as features
* Each model needs a single column that holds all of its features

In [3]:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
    .setInputCols(Array("age"))
    .setOutputCol("features")

import org.apache.spark.ml.feature.VectorAssembler
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_7337dfccc34c
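
Conceptually, the assembler gathers the chosen column values of each row into one numeric vector. The following is a minimal plain-Scala sketch of that idea with no Spark involved; `assemble` and the `Map`-based row are hypothetical stand-ins, not the `VectorAssembler` API.

```scala
// Hypothetical stand-in for one row: column name -> numeric value.
// The values are taken from the first row of the table shown earlier.
val row: Map[String, Double] = Map("age" -> 63.0, "sex" -> 1.0, "chol" -> 233.0)

// Collect the requested columns, in the order given, into one feature vector.
def assemble(row: Map[String, Double], inputCols: Seq[String]): Vector[Double] =
  inputCols.map(col => row(col)).toVector

val oneFeature = assemble(row, Seq("age"))          // Vector(63.0)
val twoFeatures = assemble(row, Seq("age", "sex"))  // Vector(63.0, 1.0)
```

With a single input column, as in the cell above, each feature vector has exactly one element.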


## The column we want to predict needs to be renamed to `label`

* The model requires that the features are isolated into their own column
* Notice that `features` is a vector of the elements we plug in to the model
* We are going to calculate the regression of `chol` on `age`, so `chol` will be our label and `age` our only feature

In [4]:
val newFrame = assembler.transform(frame).withColumnRenamed("chol", "label")
newFrame.show()

+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+
|age|sex| cp|trestbps|label|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|features|
+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+
| 63|  1|  3|     145|  233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|  [63.0]|
| 37|  1|  2|     130|  250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|  [37.0]|
| 41|  0|  1|     130|  204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|  [41.0]|
| 56|  1|  1|     120|  236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|  [56.0]|
| 57|  0|  0|     120|  354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|  [57.0]|
| 57|  1|  0|     140|  192|  0|      1|    148|    0|    0.4|    1|  0|   1|     1|  [57.0]|
| 56|  0|  1|     140|  294|  0|      0|    153|    0|    1.3|    1|  0|   2|     1|  [56.0]|
| 44|  1|  1|     120|  263|  0|      1|    173|    0|    0.

newFrame: org.apache.spark.sql.DataFrame = [age: int, sex: int ... 13 more fields]


In [5]:
val focusedFrame = newFrame.select("label", "features")

focusedFrame: org.apache.spark.sql.DataFrame = [label: int, features: vector]


## Split the data

* We need to split the data into training and testing sets
* We are going to split 70% training - 30% testing
* It is essential to set a random seed so that the random selection of rows (observations) is reproducible

In [6]:
import org.apache.spark.sql.Dataset
val splitData: Array[Dataset[Row]] = focusedFrame.randomSplit(Array(0.7, 0.3), seed = 1234L)
val trainingData = splitData(0)
trainingData.show()

+-----+--------+
|label|features|
+-----+--------+
|  149|  [49.0]|
|  149|  [71.0]|
|  157|  [41.0]|
|  160|  [45.0]|
|  164|  [62.0]|
|  166|  [61.0]|
|  167|  [40.0]|
|  168|  [57.0]|
|  169|  [44.0]|
|  172|  [41.0]|
|  175|  [38.0]|
|  175|  [38.0]|
|  175|  [51.0]|
|  176|  [59.0]|
|  177|  [43.0]|
|  177|  [46.0]|
|  177|  [59.0]|
|  177|  [65.0]|
|  178|  [60.0]|
|  180|  [42.0]|
+-----+--------+
only showing top 20 rows



import org.apache.spark.sql.Dataset
splitData: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([label: int, features: vector], [label: int, features: vector])
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: int, features: vector]
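
A seed makes a random split repeatable. The sketch below shows the idea in plain Scala, assuming a simple per-row coin flip; `seededSplit` is a hypothetical helper and not how `randomSplit` is implemented.

```scala
import scala.util.Random

// Send each row to the first part with probability `trainFraction`.
// The same seed always produces the same sequence of draws, hence the same split.
def seededSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw one flag per row first, so every row gets exactly one draw.
  val tagged = rows.toVector.map(row => (row, rng.nextDouble() < trainFraction))
  (tagged.collect { case (r, true) => r }, tagged.collect { case (r, false) => r })
}

val observations = (1 to 100).toVector
val (trainA, testA) = seededSplit(observations, 0.7, seed = 1234L)
val (trainB, testB) = seededSplit(observations, 0.7, seed = 1234L)
// trainA == trainB and testA == testB because the seed is the same.
```

Because each row is drawn independently, the parts are only approximately 70% and 30% of the data, which should also be expected of `randomSplit`.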


In [7]:
val testingData = splitData(1)
testingData.show()

+-----+--------+
|label|features|
+-----+--------+
|  126|  [57.0]|
|  131|  [57.0]|
|  141|  [44.0]|
|  174|  [70.0]|
|  184|  [56.0]|
|  185|  [60.0]|
|  188|  [54.0]|
|  193|  [56.0]|
|  196|  [52.0]|
|  197|  [53.0]|
|  197|  [76.0]|
|  198|  [35.0]|
|  199|  [39.0]|
|  201|  [54.0]|
|  204|  [59.0]|
|  206|  [54.0]|
|  211|  [43.0]|
|  212|  [52.0]|
|  212|  [59.0]|
|  212|  [66.0]|
+-----+--------+
only showing top 20 rows



testingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: int, features: vector]


## Linear Regression

* Linear Regression is a model that fits a straight line through the data points
* After training it provides a coefficient (line slope) and an intercept, e.g. $y = mx + b$
* Here we will use some standard settings (called hyperparameters by data scientists)
* For a visual understanding of linear regression [enjoy this visualization](http://setosa.io/ev/ordinary-least-squares-regression/)

In [8]:
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)

import org.apache.spark.ml.regression.LinearRegression
lr: org.apache.spark.ml.regression.LinearRegression = linReg_fd517d08ceec
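
The two regularization settings above can be read against the elastic net objective described in the Spark MLlib documentation. As a sketch, with $\lambda$ standing for `regParam` (0.3 here), $\alpha$ for `elasticNetParam` (0.8 here), $w$ for the coefficients, $b$ for the intercept, and $n$ for the number of training rows:

$$
\min_{w, b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( w^\top x_i + b - y_i \right)^2 + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
$$

With $\alpha = 0$ the penalty is pure L2 (ridge) and with $\alpha = 1$ it is pure L1 (lasso), so 0.8 leans mostly toward L1. `setMaxIter(10)` caps the number of optimizer iterations.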


## Training 

In [9]:
val lrModel = lr.fit(trainingData)

lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_fd517d08ceec
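
For intuition about what fitting a line means, here is a minimal plain-Scala sketch of ordinary least squares for a single feature. It assumes no regularization, so it will not reproduce the numbers Spark returns for the `lr` configured above; `fitLine` and the sample points are hypothetical.

```scala
// Ordinary least squares for one feature:
//   slope     = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   intercept = meanY - slope * meanX
def fitLine(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  require(xs.size == ys.size && xs.nonEmpty, "need matching, non-empty inputs")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
  val slope = covXY / varX
  (slope, meanY - slope * meanX)
}

// Made-up points that lie exactly on y = 2x + 1.
val (fittedSlope, fittedIntercept) =
  fitLine(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))
```

For these points the fitted slope is 2 and the fitted intercept is 1, recovering the line the points were generated from.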


## Print the coefficient (slope) and intercept

In [10]:
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

Coefficients: [1.5602976407592346] Intercept: 163.61093163371797
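
The printed coefficient and intercept define the line $y = mx + b$, so a prediction can be reproduced by hand. This sketch copies the two values from the output above; `predictChol` is a hypothetical helper name.

```scala
// Values copied from the printed output above.
val coefficient = 1.5602976407592346
val intercept = 163.61093163371797

// Predicted cholesterol for a given age: y = m * x + b.
def predictChol(age: Double): Double = coefficient * age + intercept

val at50 = predictChol(50.0)
val at60 = predictChol(60.0)
// Each additional year of age adds `coefficient` to the prediction,
// so at60 - at50 equals 10 * coefficient.
```

Read this way, the model predicts roughly 1.56 more units of `chol` for every extra year of age.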


## Summarize the model over the testing set and print out some metrics

In [11]:
val summary = lrModel.evaluate(testingData)
println(f"Mean Squared Error: ${summary.meanSquaredError}%1.2f")
println(f"Mean Absolute Error: ${summary.meanAbsoluteError}%1.2f")

Mean Squared Error: 2155.67
Mean Absolute Error: 35.70


summary: org.apache.spark.ml.regression.LinearRegressionSummary = org.apache.spark.ml.regression.LinearRegressionSummary@63b9060f
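
Both metrics are simple averages over the prediction errors. A minimal plain-Scala sketch, using made-up labels and predictions rather than the heart data:

```scala
// Mean squared error: the average of the squared differences.
def meanSquaredError(labels: Seq[Double], predictions: Seq[Double]): Double =
  labels.zip(predictions).map { case (y, p) => (y - p) * (y - p) }.sum / labels.size

// Mean absolute error: the average of the absolute differences.
def meanAbsoluteError(labels: Seq[Double], predictions: Seq[Double]): Double =
  labels.zip(predictions).map { case (y, p) => math.abs(y - p) }.sum / labels.size

// Made-up values, for illustration only.
val actual = Seq(3.0, 5.0)
val predicted = Seq(2.0, 7.0)
val mse = meanSquaredError(actual, predicted)  // (1 + 4) / 2 = 2.5
val mae = meanAbsoluteError(actual, predicted) // (1 + 2) / 2 = 1.5
```

Taking the square root of the reported Mean Squared Error, 2155.67, gives a root mean squared error of roughly 46.4 in the same units as `chol`, which is easier to compare with the Mean Absolute Error of 35.70.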


## Decision Tree

* Decision Trees split the data with a series of `if` statements that they learn internally
* They do so by splitting recursively and scoring the purity of each candidate split
* Decision Trees can take multiple features (columns) as input
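
One common purity score is Gini impurity, which is the default for Spark's `DecisionTreeClassifier`. The sketch below scores a single candidate split in plain Scala; `gini`, `splitImpurity`, and the sample rows are hypothetical and greatly simplified compared with a real tree learner.

```scala
// Gini impurity of a set of class labels: 1 - sum of squared class proportions.
// 0.0 means the set is pure (it contains a single class).
def gini(labels: Seq[Int]): Double =
  if (labels.isEmpty) 0.0
  else {
    val n = labels.size.toDouble
    1.0 - labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      p * p
    }.sum
  }

// Size-weighted impurity of the two sides of the test `feature <= threshold`.
def splitImpurity(rows: Seq[(Double, Int)], threshold: Double): Double = {
  val (left, right) = rows.partition { case (feature, _) => feature <= threshold }
  val n = rows.size.toDouble
  (left.size / n) * gini(left.map(_._2)) + (right.size / n) * gini(right.map(_._2))
}

// Made-up (feature, label) pairs.
val samples = Seq((1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1))
val before = gini(samples.map(_._2))          // 0.5: the classes are evenly mixed
val afterSplit = splitImpurity(samples, 2.0)  // 0.0: both sides are pure
```

A tree learner would try many thresholds on many features, keep the split with the lowest impurity, and then repeat the process on each side.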


### Use `VectorAssembler` to arrange all the features 

* A Decision Tree can use all the features, so we will include all of them
* The column with all the features will be called `features`

In [12]:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
      .setInputCols(Array("age", "sex", "cp", "trestbps", "chol",
        "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"))
      .setOutputCol("features")

import org.apache.spark.ml.feature.VectorAssembler
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_87e274ae718d


### Perform the transformation

* Notice the `features` column and the elements that it contains
* We will plug in this data along with the `target`, which records whether or not the patient has heart disease

In [13]:
val transformed = assembler.transform(frame)
transformed.show()

+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|            features|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|[63.0,1.0,3.0,145...|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|[37.0,1.0,2.0,130...|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|[41.0,0.0,1.0,130...|
| 56|  1|  1|     120| 236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|[56.0,1.0,1.0,120...|
| 57|  0|  0|     120| 354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|[57.0,0.0,0.0,120...|
| 57|  1|  0|     140| 192|  0|      1|    148|    0|    0.4|    1|  0|   1|     1|[57.0,1.0,0.0,140...|
| 56|  0|  1|     140| 294|  0|      0|    153|    0|  

transformed: org.apache.spark.sql.DataFrame = [age: int, sex: int ... 13 more fields]


### Applying the Decision Tree Model

* When setting up the model, we point it to the `features` column and the `target` column

In [14]:
import org.apache.spark.ml.classification.DecisionTreeClassifier
val decisionTreeClassifier = new DecisionTreeClassifier()
      .setFeaturesCol("features")
      .setLabelCol("target")

import org.apache.spark.ml.classification.DecisionTreeClassifier
decisionTreeClassifier: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_f0145e5a03be


### Splitting the data for training and testing

In [15]:
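// Note: this splits `newFrame`, whose `features` column holds only `age` and whose `chol` is renamed to `label`.
// To train on all thirteen features assembled above, split `transformed` instead.
val splitData = newFrame.randomSplit(Array(0.7, 0.3), seed = 1234L)
val trainingData = splitData(0)
val testingData = splitData(1)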

splitData: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([age: int, sex: int ... 13 more fields], [age: int, sex: int ... 13 more fields])
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, sex: int ... 13 more fields]
testingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, sex: int ... 13 more fields]


### Training the model

In [16]:
val model = decisionTreeClassifier.fit(trainingData)

model: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_f0145e5a03be) of depth 4 with 15 nodes


### Calling `transform` to view the data

In [17]:
val result = model.transform(testingData)
result.show(10)

+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+-------------+--------------------+----------+
|age|sex| cp|trestbps|label|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|features|rawPrediction|         probability|prediction|
+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+-------------+--------------------+----------+
| 29|  1|  1|     130|  204|  0|      0|    202|    0|    0.0|    2|  0|   2|     1|  [29.0]|  [13.0,36.0]|[0.26530612244897...|       1.0|
| 34|  0|  1|     118|  210|  0|      1|    192|    0|    0.7|    2|  0|   2|     1|  [34.0]|  [13.0,36.0]|[0.26530612244897...|       1.0|
| 34|  1|  3|     118|  182|  0|      0|    174|    0|    0.0|    2|  0|   2|     1|  [34.0]|  [13.0,36.0]|[0.26530612244897...|       1.0|
| 39|  0|  2|     138|  220|  0|      1|    152|    0|    0.0|    1|  0|   2|     1|  [39.0]|  [13.0,36.0]|[0.26530612244897...|       1.0|
| 41|  1|  1|     13

result: org.apache.spark.sql.DataFrame = [age: int, sex: int ... 16 more fields]


### Determining the score and our performance

* We will use the `org.apache.spark.ml.evaluation.BinaryClassificationEvaluator` for this decision tree
* This is a binary response: has heart disease, or does not have heart disease
* The default metric for the `BinaryClassificationEvaluator` is the area under the ROC (Receiver Operating Characteristic) curve, or AUC (Area Under the Curve), which plots the true positive rate against the false positive rate
* The best AUC is 1.0; an AUC of 0.5 is no better than random guessing

In [18]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
                    .setLabelCol("target") 
                    .setRawPredictionCol("rawPrediction") 

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_2ed50842b3a8


### Displaying the final score

In [19]:
val aucScore = evaluator.evaluate(result)
println(s"AUC Score = $aucScore")

AUC Score = 0.5243055555555556


aucScore: Double = 0.5243055555555556
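
One way to build intuition for AUC is that it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The plain-Scala sketch below computes it that way on made-up scores; `auc` is a hypothetical helper for illustration, not Spark's implementation.

```scala
// AUC as the fraction of (positive, negative) pairs in which the positive
// example is scored higher; a tie counts as half a win.
def auc(scored: Seq[(Double, Int)]): Double = {
  val positives = scored.collect { case (score, 1) => score }
  val negatives = scored.collect { case (score, 0) => score }
  require(positives.nonEmpty && negatives.nonEmpty, "need both classes")
  val wins = for {
    p <- positives
    n <- negatives
  } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (positives.size * negatives.size)
}

// Made-up (score, label) pairs: three of the four pairs are ranked correctly.
val toyAuc = auc(Seq((0.9, 1), (0.4, 1), (0.5, 0), (0.1, 0))) // 0.75
```

By this reading, the 0.52 reported above means a positive case outranks a negative one only slightly more often than a coin flip would.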


## What does a random forest do?

* A Random Forest trains multiple trees and determines a prediction by averaging or voting
* This is the wisdom of the crowd
* Each tree can be trained on a sample drawn either with replacement (WR) or without replacement (WOR)
* Sampling with replacement is like measuring a fish and throwing it back in the water: you may catch the same one again

In [20]:
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier()
      .setFeaturesCol("features")
      .setLabelCol("target")
      .setNumTrees(100)

import org.apache.spark.ml.classification.RandomForestClassifier
rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_38d32acf4f93
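
The two sampling schemes and the voting step can be sketched in a few lines of plain Scala. All names here are hypothetical, and the sketch makes no claim about how `RandomForestClassifier` samples internally.

```scala
import scala.util.Random

// Draw `n` items with replacement: the same item may be drawn more than once.
def sampleWithReplacement[A](items: IndexedSeq[A], n: Int, rng: Random): Seq[A] =
  Seq.fill(n)(items(rng.nextInt(items.size)))

// Draw `n` items without replacement: each item is drawn at most once.
def sampleWithoutReplacement[A](items: IndexedSeq[A], n: Int, rng: Random): Seq[A] =
  rng.shuffle(items).take(n)

// Majority vote over the class predicted by each tree (ties are broken arbitrarily).
def majorityVote(votes: Seq[Int]): Int =
  votes.groupBy(identity).maxBy { case (_, group) => group.size }._1

val rng = new Random(1234L)
val fish = Vector("a", "b", "c", "d", "e")
val wr = sampleWithReplacement(fish, 5, rng)     // may contain repeats
val wor = sampleWithoutReplacement(fish, 5, rng) // a reordering of all five fish
val verdict = majorityVote(Seq(1, 0, 1))         // 1
```

Training each tree on a different sample is what makes the trees disagree, and the vote then smooths out their individual mistakes.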


In [21]:
val model = rf.fit(trainingData)

model: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_38d32acf4f93) with 100 trees


In [22]:
val result = model.transform(testingData)
result.show(10)

+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+--------------------+--------------------+----------+
|age|sex| cp|trestbps|label|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|features|       rawPrediction|         probability|prediction|
+---+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+--------+--------------------+--------------------+----------+
| 29|  1|  1|     130|  204|  0|      0|    202|    0|    0.0|    2|  0|   2|     1|  [29.0]|[33.7344011536839...|[0.33734401153683...|       1.0|
| 34|  0|  1|     118|  210|  0|      1|    192|    0|    0.7|    2|  0|   2|     1|  [34.0]|[33.7344011536839...|[0.33734401153683...|       1.0|
| 34|  1|  3|     118|  182|  0|      0|    174|    0|    0.0|    2|  0|   2|     1|  [34.0]|[33.7344011536839...|[0.33734401153683...|       1.0|
| 39|  0|  2|     138|  220|  0|      1|    152|    0|    0.0|    1|  0|   2|     1|  [39.0]|[41.6632611947362...|[0.4

result: org.apache.spark.sql.DataFrame = [age: int, sex: int ... 16 more fields]


In [23]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
                    .setLabelCol("target") 
                    .setRawPredictionCol("rawPrediction") 

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_ea849f1dcc97


In [24]:
val aucScore = evaluator.evaluate(result)
println(s"AUC Score = $aucScore")

AUC Score = 0.5779671717171717


aucScore: Double = 0.5779671717171717
