## Machine Learning with MLlib

In [None]:
import org.apache.spark.sql.Row

val frame = spark
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("../data/heart.csv")

frame.show()

Intitializing Scala interpreter ...

Spark Web UI available at http://8f3d79309c47:4041
SparkContext available as 'sc' (version = 2.4.3, master = local[*], app id = local-1563986132061)
SparkSession available as 'spark'


## Showing the schema

In [None]:
frame.printSchema()

## Isolate the features into a feature column

* Data scientists call columns *features*
* Each model needs a single column that holds all of its features

In [None]:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
    .setInputCols(Array("age"))
    .setOutputCol("features")

## The column we want to predict needs to be renamed to `label`

* The model requires that the features are isolated into their own column
* Notice that `features` is a vector of the input values we plug into the model
* We are going to calculate the regression of `chol` on `age`, so `age` is our feature and `chol` is our label

In [None]:
val newFrame = assembler.transform(frame).withColumnRenamed("chol", "label")
newFrame.show()

In [None]:
val focusedFrame = newFrame.select("label", "features")

## Split the data

* We need to split the data into training and testing sets
* We are going to split 70% training - 30% testing
* Setting a random seed makes the random selection of rows (observations) reproducible from run to run
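
To see why the seed matters, here is a minimal sketch in plain Scala (no Spark required). The helper `seededSplit` is a hypothetical name made up for this illustration, and it does not mirror Spark's internals. It sends each row to the training set with probability 0.7, so the split sizes are only approximately 70/30, and the same seed always produces the same split.

```scala
import scala.util.Random

// Hypothetical helper (not a Spark API): each row goes to training
// with probability trainFraction, otherwise to testing.
def seededSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // One random draw per row, so the seed fully determines every decision
  val tagged = rows.map(row => (row, rng.nextDouble() < trainFraction))
  val (train, test) = tagged.partition(_._2)
  (train.map(_._1), test.map(_._1))
}

// Made-up row ids standing in for observations
val rowIds = (1 to 20).toVector
val (trainA, testA) = seededSplit(rowIds, 0.7, seed = 1234L)
val (trainB, testB) = seededSplit(rowIds, 0.7, seed = 1234L)

println(s"Training rows: $trainA")
println(s"Testing rows:  $testA")
println(s"Same split on rerun: ${trainA == trainB && testA == testB}")
```

Changing the seed would generally give a different split, which is why fixing it makes results comparable between runs.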

In [None]:
import org.apache.spark.sql.Dataset
val splitData: Array[Dataset[Row]] = focusedFrame.randomSplit(Array(0.7, 0.3), seed = 1234L)
val trainingData = splitData(0)
trainingData.show()

In [None]:
val testingData = splitData(1)
testingData.show()

## Linear Regression

* Linear Regression is a model that draws a line through the data points
* After training it provides a coefficient (the slope of the line) and an intercept, e.g. $y = mx + b$
* Here we will use some standard parameters (called hyperparameters by data scientists)
* For a visual understanding of linear regression [enjoy this visualization](http://setosa.io/ev/ordinary-least-squares-regression/)
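
For intuition about the slope and intercept, here is a minimal sketch of ordinary least squares for a single feature in plain Scala. The helper `fitLine` and the sample points are made up for this illustration. It fits an unregularized line, so it is not what Spark computes when a regularization parameter is set; Spark's coefficients will generally differ from these.

```scala
// Hypothetical helper (not a Spark API): fits y = m * x + b
// by ordinary least squares for one feature.
def fitLine(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  require(xs.size == ys.size && xs.nonEmpty, "xs and ys must be non-empty and the same length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Slope: co-variation of x and y divided by the variation of x
  val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
  require(varX != 0.0, "xs must not all be equal")
  val m = covXY / varX
  // Intercept: the fitted line passes through the point (meanX, meanY)
  val b = meanY - m * meanX
  (m, b)
}

// Made-up points that lie exactly on y = 2x + 1
val sampleXs = Seq(1.0, 2.0, 3.0, 4.0)
val sampleYs = Seq(3.0, 5.0, 7.0, 9.0)
val (fittedSlope, fittedIntercept) = fitLine(sampleXs, sampleYs)
println(s"slope = $fittedSlope, intercept = $fittedIntercept")
```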

In [None]:
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)

## Training 

In [None]:
val lrModel = lr.fit(trainingData)

## Print the coefficient (slope) and intercept

In [None]:
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

## Evaluate the model over the testing set and print out some metrics

In [None]:
val summary = lrModel.evaluate(testingData)
println(f"Mean Squared Error: ${summary.meanSquaredError}%1.2f")
println(f"Mean Absolute Error: ${summary.meanAbsoluteError}%1.2f")
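
As a reference for what the two metrics mean, here is a minimal sketch that computes them by hand in plain Scala. The helper names and the sample values are made up for this illustration and are not part of Spark's API.

```scala
// Hypothetical helpers (not Spark APIs)

// Mean squared error: average of the squared differences
def meanSquaredError(actual: Seq[Double], predicted: Seq[Double]): Double = {
  require(actual.size == predicted.size && actual.nonEmpty)
  actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }.sum / actual.size
}

// Mean absolute error: average of the absolute differences
def meanAbsoluteError(actual: Seq[Double], predicted: Seq[Double]): Double = {
  require(actual.size == predicted.size && actual.nonEmpty)
  actual.zip(predicted).map { case (a, p) => math.abs(a - p) }.sum / actual.size
}

// Made-up label values and predictions
val actualValues = Seq(200.0, 250.0, 300.0)
val predictedValues = Seq(210.0, 240.0, 320.0)

println(f"Mean Squared Error: ${meanSquaredError(actualValues, predictedValues)}%1.2f")
println(f"Mean Absolute Error: ${meanAbsoluteError(actualValues, predictedValues)}%1.2f")
```

Squaring penalizes large errors more heavily than small ones, so the two metrics can rank models differently.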

## Decision Tree

* Decision Trees split the data with a series of `if` statements learned from the training data
* They do so by recursively splitting the data and choosing each split by a purity score
* Decision Trees can take multiple features (columns) as input
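
One common purity score is Gini impurity. The following is a minimal sketch in plain Scala of how a single split could be chosen with it. The helpers `gini` and `splitImpurity` and the sample rows are made up for this illustration, and this is a simplification, not Spark's implementation.

```scala
// Hypothetical helper: Gini impurity = 1 - sum of squared class proportions.
// 0.0 means the labels are pure (all the same class).
def gini(labels: Seq[Int]): Double =
  if (labels.isEmpty) 0.0
  else {
    val n = labels.size.toDouble
    val squaredProportions = labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      p * p
    }
    1.0 - squaredProportions.sum
  }

// Hypothetical helper: size-weighted impurity of the two sides of a split
def splitImpurity(rows: Seq[(Double, Int)], threshold: Double): Double = {
  val (left, right) = rows.partition(_._1 <= threshold)
  val n = rows.size.toDouble
  (left.size / n) * gini(left.map(_._2)) + (right.size / n) * gini(right.map(_._2))
}

// Made-up rows of (age, target)
val ageRows = Seq((35.0, 0), (41.0, 0), (48.0, 0), (55.0, 1), (60.0, 1), (67.0, 1))

// Try every observed value except the largest, which would leave one side empty
val candidates = ageRows.map(_._1).distinct.sorted.init
val bestThreshold = candidates.minBy(t => splitImpurity(ageRows, t))

println(s"Impurity before split: ${gini(ageRows.map(_._2))}")
println(s"Best threshold: $bestThreshold, impurity after: ${splitImpurity(ageRows, bestThreshold)}")
```

A tree repeats this search on each side of the chosen split, which is the recursive part.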


### Use `VectorAssembler` to arrange all the features 

* A Decision Tree can use all the features, so we will include them all
* The column with all the features will be called `features`

In [None]:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
      .setInputCols(Array("age", "sex", "cp", "trestbps", "chol",
        "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"))
      .setOutputCol("features")

### Perform the transformation

* Notice the `features` column and the elements that it contains
* We will plug in the data along with the `target` column, which records whether or not the patient has heart disease

In [None]:
val transformed = assembler.transform(frame)
transformed.show()

### Applying the Decision Tree Model

* Plugging in the model, we will direct it to the `features` column and the `target` column

In [None]:
import org.apache.spark.ml.classification.DecisionTreeClassifier
val decisionTreeClassifier = new DecisionTreeClassifier()
      .setFeaturesCol("features")
      .setLabelCol("target")

### Splitting the data for training and testing

In [None]:
// Split `transformed`, which holds all 13 features, not the single-feature `newFrame`
val splitData = transformed.randomSplit(Array(0.7, 0.3), seed = 1234L)
val trainingData = splitData(0)
val testingData = splitData(1)

### Training the model

In [None]:
val model = decisionTreeClassifier.fit(trainingData)

### Calling `transform` to view the data

In [None]:
val result = model.transform(testingData)
result.show(10)

### Determining the score and our performance

* We will use the `org.apache.spark.ml.evaluation.BinaryClassificationEvaluator` for this decision tree
* This is a binary response: has heart disease, or does not have heart disease
* The default score for the `BinaryClassificationEvaluator` is the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate
* The best possible AUC is 1.0; a score of 0.5 is no better than random guessing
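
One way to read the AUC is as the fraction of (positive, negative) pairs in which the positive row receives the higher score. Here is a minimal sketch of that pairwise view in plain Scala. The helper `pairwiseAuc` and the scores are made up for this illustration; it is not how the evaluator computes the value internally.

```scala
// Hypothetical helper: AUC as the share of (positive, negative) pairs
// ranked correctly, counting ties as half a win.
def pairwiseAuc(scored: Seq[(Double, Int)]): Double = {
  val positives = scored.collect { case (score, 1) => score }
  val negatives = scored.collect { case (score, 0) => score }
  require(positives.nonEmpty && negatives.nonEmpty, "need at least one row of each class")
  val pairScores = for {
    p <- positives
    n <- negatives
  } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  pairScores.sum / (positives.size.toLong * negatives.size)
}

// Made-up (score, target) rows: one positive (0.6) is ranked below one negative (0.7)
val scoredRows = Seq((0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.3, 0), (0.2, 0))
println(f"AUC = ${pairwiseAuc(scoredRows)}%1.3f")
```

Of the nine pairs here, eight are ranked correctly, so the score is 8/9.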

In [None]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
                    .setLabelCol("target") 
                    .setRawPredictionCol("rawPrediction") 

### Displaying the final score

In [None]:
val aucScore = evaluator.evaluate(result)
println(s"AUC Score = $aucScore")

## What does a random forest do?

* Random Forest trains multiple trees and combines their predictions by averaging or voting
* This is the wisdom of the crowd
* Each tree is trained on a sample of the rows, drawn either with replacement (WR) or without replacement (WOR)
* Sampling with replacement is like measuring a fish and throwing it back in the water: you may catch the same one again
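
The two sampling styles and the voting step can be sketched in plain Scala. The names `pond` and `majorityVote` and the values are made up for this illustration; this shows the general idea only, not Spark's implementation.

```scala
import scala.util.Random

val sampler = new Random(42L)
val pond = Vector("a", "b", "c", "d", "e")

// With replacement (WR): every draw picks from the full pond, so repeats are possible
val withReplacement = Vector.fill(pond.size)(pond(sampler.nextInt(pond.size)))

// Without replacement (WOR): a caught fish is not thrown back, so there are no repeats
val withoutReplacement = sampler.shuffle(pond).take(3)

// Hypothetical helper: each tree predicts a class and the most common prediction wins
def majorityVote(predictions: Seq[Int]): Int =
  predictions.groupBy(identity).maxBy(_._2.size)._1

// Made-up predictions from five trees for one row
val treePredictions = Seq(1, 0, 1, 1, 0)

println(s"With replacement:    $withReplacement")
println(s"Without replacement: $withoutReplacement")
println(s"Majority vote: ${majorityVote(treePredictions)}")
```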

In [None]:
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier()
      .setFeaturesCol("features")
      .setLabelCol("target")
      .setNumTrees(100)

In [None]:
val model = rf.fit(trainingData)

In [None]:
val result = model.transform(testingData)
result.show(10)

In [None]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
                    .setLabelCol("target") 
                    .setRawPredictionCol("rawPrediction") 

In [None]:
val aucScore = evaluator.evaluate(result)
println(s"AUC Score = $aucScore")