<br><br>
<span style="color:green;font-size:xx-large">Spark ML</span>


<li>provides a uniform API for building and tuning ML models</li>
<li>provides libraries for feature extraction and transformation</li>
<li>provides support for <b>ML pipelines</b></li>
<li>provides support for linear algebra, statistics, scaling, etc.</li>
<li>Mostly works with Spark DataFrames</li>

<span style="color:blue;font-size:large">Why Spark ML?</span>

<span style="color:red">Machine learning at scale</span>: Spark's ML library and its data analytic models can support ML operations with billions of observations

<br><br>
<span style="color:green;font-size:xx-large">Spark ML Data Structures</span>

<span style="color:blue;font-size:large">Vectors</span>

<li><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/linalg/Vectors.html">https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/linalg/Vectors.html</a></li>
<li><span style="color:blue">dense vectors</span>: Ordinary vectors. Each index has a data value associated with it</li>
<li><span style="color:blue">sparse vectors</span>: Only actual elements are stored. Specify indices and values. Unspecified locations are 0.0. Arguments: number of elements, (index,value) pairs for non-zero elements</li>
<li>Which one should you use? It depends on the data (sparse vectors suit sparse data) and on the algorithm (some algorithms, e.g., naive Bayes, work better with dense vectors than with sparse vectors)</li>
<li>The <span style="color:blue">Vectors</span> object provides two useful functions: <span style="color:blue">norm</span> and <span style="color:blue">sqdist</span> (squared distance)</li>
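
The sparse representation described above can be illustrated without Spark. The following is a plain Scala sketch, not Spark's implementation; the helper name `expandSparse` is hypothetical. It rebuilds a dense array from the three pieces of a sparse vector: the number of elements, the indices of the non-zero elements, and their values.

```scala
// Hypothetical helper: expand (size, indices, values) into a dense array.
// Positions that are not listed in `indices` stay at 0.0.
def expandSparse(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// Same numbers as the sparse vector used in the cells of this section
val expanded = expandSparse(14, Array(0, 4, 5, 10, 13), Array(3.2, 4.7, 1.6, 10.2, 11.1))
println(expanded.mkString("[", ",", "]"))
```

Here the sparse form keeps 5 indices and 5 values instead of 14 doubles; the saving grows as the share of zeros grows.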

In [2]:
import org.apache.spark.ml.linalg.Vectors
val x = Array(3.2,0,0,0,4.7,1.6,0,0,0,0,10.2,0,0,11.1)
val data_dense = Vectors.dense(x) 
val data_sparse = data_dense.toSparse

import org.apache.spark.ml.linalg.Vectors
x: Array[Double] = Array(3.2, 0.0, 0.0, 0.0, 4.7, 1.6, 0.0, 0.0, 0.0, 0.0, 10.2, 0.0, 0.0, 11.1)
data_dense: org.apache.spark.ml.linalg.Vector = [3.2,0.0,0.0,0.0,4.7,1.6,0.0,0.0,0.0,0.0,10.2,0.0,0.0,11.1]
data_sparse: org.apache.spark.ml.linalg.SparseVector = (14,[0,4,5,10,13],[3.2,4.7,1.6,10.2,11.1])


In [3]:
data_dense

res0: org.apache.spark.ml.linalg.Vector = [3.2,0.0,0.0,0.0,4.7,1.6,0.0,0.0,0.0,0.0,10.2,0.0,0.0,11.1]


In [4]:
data_sparse // (size = 14, indices of the nonzero elements, values of the nonzero elements)

res1: org.apache.spark.ml.linalg.SparseVector = (14,[0,4,5,10,13],[3.2,4.7,1.6,10.2,11.1])


In [5]:
val data_dense = Vectors.dense(3.2,0,0,0,4.7,1.6,0,0,0,0,10.2,0,0,11.1) 
val data_sparse = Vectors.sparse(14,Array(0,4,5,10,13),Array(3.2,4.7,1.6,10.2,11.1))

data_dense: org.apache.spark.ml.linalg.Vector = [3.2,0.0,0.0,0.0,4.7,1.6,0.0,0.0,0.0,0.0,10.2,0.0,0.0,11.1]
data_sparse: org.apache.spark.ml.linalg.Vector = (14,[0,4,5,10,13],[3.2,4.7,1.6,10.2,11.1])


In [6]:
println(Vectors.norm(data_dense,1)) //returns the p=1 norm (taxicab)
println(Vectors.norm(data_dense,2)) //returns the euclidean norm
println(Vectors.sqdist(data_dense,data_sparse)) //squared Euclidean distance between two vectors

30.799999999999997
16.190738093119784
0.0
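
To make the three numbers above concrete, here is a plain Scala sketch that computes a p-norm and a squared Euclidean distance from their textbook definitions. The helper names `pNorm` and `squaredDistance` are hypothetical and are not Spark functions; the sketch only assumes that these definitions are what the cell above reports.

```scala
// Hypothetical helpers written from the textbook definitions, not taken from Spark.

// p-norm: (sum of |x_i|^p)^(1/p)
def pNorm(xs: Array[Double], p: Double): Double =
  math.pow(xs.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

// Squared Euclidean distance: sum of (a_i - b_i)^2
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

val v: Array[Double] =
  Array(3.2, 0.0, 0.0, 0.0, 4.7, 1.6, 0.0, 0.0, 0.0, 0.0, 10.2, 0.0, 0.0, 11.1)

println(pNorm(v, 1))            // taxicab norm: sum of absolute values
println(pNorm(v, 2))            // Euclidean norm
println(squaredDistance(v, v))  // a vector is at distance 0 from itself
```

The squared distance in the cell above is 0.0 because the dense and sparse vectors hold the same values; only the storage differs.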


<span style="color:blue;font-size:large">Matrix</span>

In [7]:
import org.apache.spark.ml.linalg.SparseMatrix
import org.apache.spark.ml.linalg.Matrices


val data = Array(1.0, 0.0, 4.0, 0.0, 3.0, 5.0, 2.0, 0.0, 6.0)
val dense_m = Matrices.dense(3,3,data)
val sparse_m = dense_m.toSparse

import org.apache.spark.ml.linalg.SparseMatrix
import org.apache.spark.ml.linalg.Matrices
data: Array[Double] = Array(1.0, 0.0, 4.0, 0.0, 3.0, 5.0, 2.0, 0.0, 6.0)
dense_m: org.apache.spark.ml.linalg.Matrix =
1.0  0.0  2.0
0.0  3.0  0.0
4.0  5.0  6.0
sparse_m: org.apache.spark.ml.linalg.SparseMatrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 4.0
(1,1) 3.0
(2,1) 5.0
(0,2) 2.0
(2,2) 6.0


In [8]:
sparse_m

res3: org.apache.spark.ml.linalg.SparseMatrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 4.0
(1,1) 3.0
(2,1) 5.0
(0,2) 2.0
(2,2) 6.0


In [9]:
dense_m.transpose

res4: org.apache.spark.ml.linalg.Matrix =
1.0  0.0  4.0
0.0  3.0  5.0
2.0  0.0  6.0


In [10]:
sparse_m.transpose

res5: org.apache.spark.ml.linalg.SparseMatrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
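
The printed `dense_m` above shows that the input array fills the matrix column by column (column-major order): the first three values form the first column, not the first row. The plain Scala sketch below reproduces that layout by hand; `entryAt` is a hypothetical helper, not part of the Spark API.

```scala
// Hypothetical helper: read entry (row, col) from a column-major array.
def entryAt(data: Array[Double], numRows: Int, row: Int, col: Int): Double =
  data(row + col * numRows)

val values = Array(1.0, 0.0, 4.0, 0.0, 3.0, 5.0, 2.0, 0.0, 6.0)
val numRows = 3
val numCols = 3

// Print the matrix row by row, as the notebook output does
for (i <- 0 until numRows) {
  println((0 until numCols).map(j => entryAt(values, numRows, i, j)).mkString("  "))
}
```

This also explains why transposing the matrix prints the original array in reading order.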


<br><br><br>
<h2 style="color:red;font-size:50px">Feature transformers</h2>
<br>

<li>Spark contains an extensive library of feature transformers</li>
<li>We'll take a quick look at a few here</li>
<li><a href="https://spark.apache.org/docs/latest/ml-features.html">https://spark.apache.org/docs/latest/ml-features.html</a></li>



<br><br>
<span style="color:green;font-size:xx-large">Vector Assembler</span>
<br>

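<li>Combines a set of columns into a single <b>vector</b> column (Spark stores the result as a dense or sparse vector, whichever is more compact)</li>
<li>In supervised learning, the independent features are combined into a single vector</li>
<li>As a result, each case is represented by a pair (dv, iv-vector): the dependent variable and the vector of independent variables</li>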
<li><a href="https://spark.apache.org/docs/latest/ml-features#vectorassembler">https://spark.apache.org/docs/latest/ml-features#vectorassembler</a></li>




In [11]:
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (22.0, 23.1,3),
  (12.2, 13.0,2),
  (43.7, 16.2,4),
  (36.4, 34.8,3),
  (6.1, 71.0,3),
  (28.2, 22.1,7)
)).toDF("feature1", "feature2","dv")

df.show

+--------+--------+---+
|feature1|feature2| dv|
+--------+--------+---+
|    22.0|    23.1|  3|
|    12.2|    13.0|  2|
|    43.7|    16.2|  4|
|    36.4|    34.8|  3|
|     6.1|    71.0|  3|
|    28.2|    22.1|  7|
+--------+--------+---+



import org.apache.spark.ml.feature.VectorAssembler
df: org.apache.spark.sql.DataFrame = [feature1: double, feature2: double ... 1 more field]


<span style="color:blue;font-size:large">Create an assembler object identifying the columns that need to be vectorized</span>


In [12]:
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1","feature2"))
  .setOutputCol("features")



assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_037dcb832033, handleInvalid=error, numInputCols=2


<span style="color:blue;font-size:large">Call transform on the dataframe</span>
<br>
<li>This creates a new dataframe, vectorizing the input columns, and storing the vector in the output column</li>
<li>By default, Spark ML models assume the dv is in a column called label and the iv in a column called features</li>
<li>In the examples below, we select the two columns of interest and then rename the dv as "label"</li>

In [13]:
assembler.transform(df)
.select("dv","features")
.show(false)

+---+-----------+
|dv |features   |
+---+-----------+
|3  |[22.0,23.1]|
|2  |[12.2,13.0]|
|4  |[43.7,16.2]|
|3  |[36.4,34.8]|
|3  |[6.1,71.0] |
|7  |[28.2,22.1]|
+---+-----------+



In [None]:
val df_lr = assembler.transform(df)
    .select("dv","features")
    .withColumnRenamed("dv","label")

In [None]:
df_lr.show

In [None]:
assembler.transform(df).show

<span style="color:blue;font-size:large">Putting it all together</span>



In [14]:
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (22.0, 23.1,3),
  (12.2, 13.0,2),
  (43.7, 16.2,4),
  (36.4, 34.8,3),
  (6.1, 71.0,3),
  (28.2, 22.1,7)
)).toDF("feature1", "feature2","dv")

//Create an assembler object identifying the columns that need to be vectorized
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1","feature2"))
  .setOutputCol("features")

//Call transform on the dataframe. This creates a new dataframe using the specifications
//(specs = which columns to keep)
//by default, spark ml models assume the dv is in a column called label and the iv in a column called features
val df_lr = assembler.transform(df)
    .select("dv","features")
    .withColumnRenamed("dv","label")

df_lr.show

+-----+-----------+
|label|   features|
+-----+-----------+
|    3|[22.0,23.1]|
|    2|[12.2,13.0]|
|    4|[43.7,16.2]|
|    3|[36.4,34.8]|
|    3| [6.1,71.0]|
|    7|[28.2,22.1]|
+-----+-----------+



import org.apache.spark.ml.feature.VectorAssembler
df: org.apache.spark.sql.DataFrame = [feature1: double, feature2: double ... 1 more field]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_fc11181aad9a, handleInvalid=error, numInputCols=2
df_lr: org.apache.spark.sql.DataFrame = [label: int, features: vector]


<br><br><br>
<span style="color:green;font-size:xx-large">String indexer</span>
<br>


<li>ML algorithms need numbers!</li>
<li>Any string variables need to be converted into numbers before they can be used</li>
<li><a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/StringIndexer.html">StringIndexer</a> is a Spark feature transformer that does this</li>
<li>By default, the most frequent category is given the index 0.0, the second most frequent 1.0, etc.</li>
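
The frequency-based numbering can be sketched in plain Scala. This is not Spark's implementation; `frequencyIndex` is a hypothetical helper, and the alphabetical tie-break is an assumption chosen because it reproduces the indices shown in the notebook output for this section, where the most frequent string receives 0.0.

```scala
// Hypothetical helper, not Spark's implementation: order distinct strings by
// descending frequency, break ties alphabetically (an assumption), and number
// them starting from 0.
def frequencyIndex(values: Seq[String]): Map[String, Double] =
  values
    .groupBy(identity)
    .map { case (s, occurrences) => (s, occurrences.size) }
    .toSeq
    .sortBy { case (s, count) => (-count, s) }
    .zipWithIndex
    .map { case ((s, _), idx) => s -> idx.toDouble }
    .toMap

// The city codes used in the example of this section
val cities = Seq("MIA", "NYC", "SFO", "NYC", "CHI", "SFO", "LAX")
val mapping = frequencyIndex(cities)
cities.foreach(c => println(s"$c -> ${mapping(c)}"))
```

Because the mapping is learned from the data, it has to be built once and then reused for any later data set, which is why StringIndexer separates fit from transform.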

In [16]:
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(Seq(
  ("MIA", 17.2,2),  
  ("NYC", 23.1,3),
  ("SFO", 13.0,2),
  ("NYC", 16.2,4),
  ("CHI", 34.8,3),
  ("SFO", 71.0,3),
  ("LAX", 22.1,7)
)).toDF("feature1", "feature2","dv")

df.show

+--------+--------+---+
|feature1|feature2| dv|
+--------+--------+---+
|     MIA|    17.2|  2|
|     NYC|    23.1|  3|
|     SFO|    13.0|  2|
|     NYC|    16.2|  4|
|     CHI|    34.8|  3|
|     SFO|    71.0|  3|
|     LAX|    22.1|  7|
+--------+--------+---+



import org.apache.spark.ml.feature.StringIndexer
df: org.apache.spark.sql.DataFrame = [feature1: string, feature2: double ... 1 more field]


<span style="color:blue;font-size:large">Create a StringIndexer object</span>
<br>
<li>Note the similarity with VectorAssembler</li>


In [17]:
val indexer = new StringIndexer()
  .setInputCol("feature1")
  .setOutputCol("feature1Index")



indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_823ea4a5c0b1


<span style="color:blue;font-size:large">Fit the StringIndexer to the data</span>
<br>
<li><span style="color:red">fit</span> parameterizes a model</li>
<li>For StringIndexer, that means building a mapping from each string to a unique number</li>
<li>This step was not necessary for VectorAssembler because it has nothing to learn from the data</li>

In [18]:
val stringMapper = indexer.fit(df) // fitted string-to-index mapping (a label encoder); reuse it to transform test or unseen data

stringMapper: org.apache.spark.ml.feature.StringIndexerModel = StringIndexerModel: uid=strIdx_823ea4a5c0b1, handleInvalid=error


<span style="color:blue;font-size:large">Use transform to generate numerical values for strings</span>



In [19]:
val indexedStrings = stringMapper.transform(df)

indexedStrings: org.apache.spark.sql.DataFrame = [feature1: string, feature2: double ... 2 more fields]


In [20]:
indexedStrings.show //most frequent => 0.0

+--------+--------+---+-------------+
|feature1|feature2| dv|feature1Index|
+--------+--------+---+-------------+
|     MIA|    17.2|  2|          4.0|
|     NYC|    23.1|  3|          0.0|
|     SFO|    13.0|  2|          1.0|
|     NYC|    16.2|  4|          0.0|
|     CHI|    34.8|  3|          2.0|
|     SFO|    71.0|  3|          1.0|
|     LAX|    22.1|  7|          3.0|
+--------+--------+---+-------------+



<span style="color:blue;font-size:large">Putting it all together</span>



In [21]:
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(Seq(
  ("MIA", 17.2,2),  
  ("NYC", 23.1,3),
  ("SFO", 13.0,2),
  ("NYC", 16.2,4),
  ("CHI", 34.8,3),
  ("SFO", 71.0,3),
  ("LAX", 22.1,7)
)).toDF("feature1", "feature2","dv")

//Create a StringIndexer object with the column specifications
val indexer = new StringIndexer()
  .setInputCol("feature1")
  .setOutputCol("feature1Index")

//The "fit" operation determines the category to number relationship
//The "transform" operation does the actual assigning of values
val indexed = indexer
                .fit(df)
                .transform(df)
indexed.show()

+--------+--------+---+-------------+
|feature1|feature2| dv|feature1Index|
+--------+--------+---+-------------+
|     MIA|    17.2|  2|          4.0|
|     NYC|    23.1|  3|          0.0|
|     SFO|    13.0|  2|          1.0|
|     NYC|    16.2|  4|          0.0|
|     CHI|    34.8|  3|          2.0|
|     SFO|    71.0|  3|          1.0|
|     LAX|    22.1|  7|          3.0|
+--------+--------+---+-------------+



import org.apache.spark.ml.feature.StringIndexer
df: org.apache.spark.sql.DataFrame = [feature1: string, feature2: double ... 1 more field]
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_4a09fb538a4f
indexed: org.apache.spark.sql.DataFrame = [feature1: string, feature2: double ... 2 more fields]


<br><br><br>
<span style="color:green;font-size:xx-large">One Hot Encoding</span>
<br>

<li>One hot encoding maps a categorical (numerical) column with k distinct categories into k-1 binary columns</li>
<li>Example: if the possible values of a column are 1, 2, and 3, then one hot encoding will produce two columns</li>
    <ul>
        <li>column 1 will take the value 1 if the original value was 1 and 0 otherwise</li>
        <li>column 2 will take the value 1 if the original value was 2 and 0 otherwise</li>
        <li>column 1 and column 2 will both take the value 0 if the original value was 3</li>
    </ul>
<li>The spark feature transformer OneHotEncoder does this for us</li> 
<li>The one hot coded data is returned as a vector in a single column</li>
<li>The input values to the encoder must be numeric. Use StringIndexer to convert strings to numeric first</li>
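
The k-1 column scheme can be sketched in plain Scala. This is an illustration only, not Spark's OneHotEncoder; `oneHot` is a hypothetical helper that assumes 0-based category indices, such as those produced by StringIndexer, and drops the last category by default.

```scala
// Hypothetical helper: one hot encode a category index in [0, numCategories).
// With dropLast = true the last category is represented by an all-zero vector,
// so the result has numCategories - 1 entries.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
  require(index >= 0 && index < numCategories, "index out of range")
  val size = if (dropLast) numCategories - 1 else numCategories
  val encoded = Array.fill(size)(0.0)
  if (index < size) encoded(index) = 1.0
  encoded
}

// Three categories, so each value is encoded with two entries
for (i <- 0 until 3) {
  println(s"$i -> ${oneHot(i, 3).mkString("[", ",", "]")}")
}
```

Dropping one category avoids a redundant column: if every other column is 0, the value must be the dropped category.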

In [22]:
import org.apache.spark.ml.feature.OneHotEncoder

val df = spark.createDataFrame(Seq(
    ("Jack","A","IEOR"),
    ("Jill","B","IEOR"),
    ("Jiahuo","A","CS"),
    ("Pierre","C","APAM"),
    ("Clemence","B","APAM"),
    ("Savitri","A","CS"),
    ("Bjorn","A","QMSS")
)).toDF("student","grade","department")

import org.apache.spark.ml.feature.OneHotEncoder
df: org.apache.spark.sql.DataFrame = [student: string, grade: string ... 1 more field]


In [23]:
df.show

+--------+-----+----------+
| student|grade|department|
+--------+-----+----------+
|    Jack|    A|      IEOR|
|    Jill|    B|      IEOR|
|  Jiahuo|    A|        CS|
|  Pierre|    C|      APAM|
|Clemence|    B|      APAM|
| Savitri|    A|        CS|
|   Bjorn|    A|      QMSS|
+--------+-----+----------+



<span style="color:blue;font-size:large">Convert strings to numeric labels</span>

<li>Note that we can index multiple columns simultaneously</li>

In [24]:
import org.apache.spark.ml.feature.StringIndexer


val indexer = new StringIndexer()
  .setInputCols(Array("grade","department"))
  .setOutputCols(Array("gradeIndex","departmentIndex"))

//The "fit" operation determines the category to number relationship
//The "transform" operation does the actual assigning of values
val indexedDf = indexer
                .fit(df)
                .transform(df)
indexedDf.show()

+--------+-----+----------+----------+---------------+
| student|grade|department|gradeIndex|departmentIndex|
+--------+-----+----------+----------+---------------+
|    Jack|    A|      IEOR|       0.0|            2.0|
|    Jill|    B|      IEOR|       1.0|            2.0|
|  Jiahuo|    A|        CS|       0.0|            1.0|
|  Pierre|    C|      APAM|       2.0|            0.0|
|Clemence|    B|      APAM|       1.0|            0.0|
| Savitri|    A|        CS|       0.0|            1.0|
|   Bjorn|    A|      QMSS|       0.0|            3.0|
+--------+-----+----------+----------+---------------+



import org.apache.spark.ml.feature.StringIndexer
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_d4321fa15102
indexedDf: org.apache.spark.sql.DataFrame = [student: string, grade: string ... 3 more fields]


<span style="color:blue;font-size:large">Create an Encoder object</span>



In [25]:
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCols(Array("gradeIndex", "departmentIndex")) // setInputCols expects an Array[String], so a tuple will not work
  .setOutputCols(Array("gradeVec", "departmentVec"))


import org.apache.spark.ml.feature.OneHotEncoder
encoder: org.apache.spark.ml.feature.OneHotEncoder = oneHotEncoder_016c480b470c


<span style="color:blue;font-size:large">Fit the model</span>

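<li>Why do we need to fit? The encoder has to learn how many categories each input column has, which determines the length of each output vector</li>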
<li>Also, note that we're now using indexedDf</li>

In [26]:
val model = encoder.fit(indexedDf)

model: org.apache.spark.ml.feature.OneHotEncoderModel = OneHotEncoderModel: uid=oneHotEncoder_016c480b470c, dropLast=true, handleInvalid=error, numInputCols=2, numOutputCols=2


<span style="color:blue;font-size:large">Run transform to generate the one hot encoded columns</span>
<li>The one hot encoded values are returned in the form of a sparse vector</li>


In [27]:


val encoded = model.transform(indexedDf)
encoded.show()
//gradeVec: length is 2

+--------+-----+----------+----------+---------------+-------------+-------------+
| student|grade|department|gradeIndex|departmentIndex|     gradeVec|departmentVec|
+--------+-----+----------+----------+---------------+-------------+-------------+
|    Jack|    A|      IEOR|       0.0|            2.0|(2,[0],[1.0])|(3,[2],[1.0])|
|    Jill|    B|      IEOR|       1.0|            2.0|(2,[1],[1.0])|(3,[2],[1.0])|
|  Jiahuo|    A|        CS|       0.0|            1.0|(2,[0],[1.0])|(3,[1],[1.0])|
|  Pierre|    C|      APAM|       2.0|            0.0|    (2,[],[])|(3,[0],[1.0])|
|Clemence|    B|      APAM|       1.0|            0.0|(2,[1],[1.0])|(3,[0],[1.0])|
| Savitri|    A|        CS|       0.0|            1.0|(2,[0],[1.0])|(3,[1],[1.0])|
|   Bjorn|    A|      QMSS|       0.0|            3.0|(2,[0],[1.0])|    (3,[],[])|
+--------+-----+----------+----------+---------------+-------------+-------------+



encoded: org.apache.spark.sql.DataFrame = [student: string, grade: string ... 5 more fields]


In [28]:
encoded.rdd

res15: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[25] at rdd at <console>:43


In [29]:
encoded.select("departmentVec").where($"student"==="Jack").rdd.map{ case Row(v: Vector) => v.toDense}
.collect()(0)(0)

<console>: 43: error: not found: value Row

In [30]:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row


encoded.select("departmentVec").where($"student"==="Jack").rdd.map { case Row(v: Vector) => v}.first.toDense

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
res17: org.apache.spark.ml.linalg.DenseVector = [0.0,0.0,1.0]


In [31]:
encoded.select("departmentVec").rdd.map { case Row(v: Vector) => v}.map(v=>v.toDense).collect

res18: Array[org.apache.spark.ml.linalg.DenseVector] = Array([0.0,0.0,1.0], [0.0,0.0,1.0], [0.0,1.0,0.0], [1.0,0.0,0.0], [1.0,0.0,0.0], [0.0,1.0,0.0], [0.0,0.0,0.0])


<br><br><br><br>
<h2 style="color:red;font-size:50px">Example: California home values</h2>
<br><br>

<li>California housing data from the 1990 census</li>
<ol>
<li>longitude: A measure of how far west a house is; a more negative value is farther west</li>
<li>latitude: A measure of how far north a house is; a higher value is farther north</li>
<li>housingMedianAge: Median age of a house within a block; a lower number is a newer building</li>
<li>totalRooms: Total number of rooms within a block</li>
<li>totalBedrooms: Total number of bedrooms within a block</li>
<li>population: Total number of people residing within a block</li>
<li>households: Total number of households, a group of people residing within a home unit, for a block</li>
<li>medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)</li>
<li>medianHouseValue: Median house value for households within a block (measured in US Dollars)</li>
</ol>

<span style="color:green;font-size:xx-large">Predict median home values from California housing data</span>


<span style="color:blue;font-size:large">Read the data </span>
<li>Set inferschema to true and header to false (the file has no header row)</li> 

In [None]:
!pwd

In [32]:
!head "cal_housing.data"

-122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000
-122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000
-122.240000,37.850000,52.000000,1467.000000,190.000000,496.000000,177.000000,7.257400,352100.000000
-122.250000,37.850000,52.000000,1274.000000,235.000000,558.000000,219.000000,5.643100,341300.000000
-122.250000,37.850000,52.000000,1627.000000,280.000000,565.000000,259.000000,3.846200,342200.000000
-122.250000,37.850000,52.000000,919.000000,213.000000,413.000000,193.000000,4.036800,269700.000000
-122.250000,37.840000,52.000000,2535.000000,489.000000,1094.000000,514.000000,3.659100,299200.000000
-122.250000,37.840000,52.000000,3104.000000,687.000000,1157.000000,647.000000,3.120000,241400.000000
-122.260000,37.840000,42.000000,2555.000000,665.000000,1206.000000,595.000000,2.080400,226700.000000
-122.250000,37.840000,52.000000,3549.000000,707.000000,1551.000000,714.000000,3.691200,

In [34]:
spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .show

+-------+-----+----+------+------+------+------+------+--------+
|    _c0|  _c1| _c2|   _c3|   _c4|   _c5|   _c6|   _c7|     _c8|
+-------+-----+----+------+------+------+------+------+--------+
|-122.23|37.88|41.0| 880.0| 129.0| 322.0| 126.0|8.3252|452600.0|
|-122.22|37.86|21.0|7099.0|1106.0|2401.0|1138.0|8.3014|358500.0|
|-122.24|37.85|52.0|1467.0| 190.0| 496.0| 177.0|7.2574|352100.0|
|-122.25|37.85|52.0|1274.0| 235.0| 558.0| 219.0|5.6431|341300.0|
|-122.25|37.85|52.0|1627.0| 280.0| 565.0| 259.0|3.8462|342200.0|
|-122.25|37.85|52.0| 919.0| 213.0| 413.0| 193.0|4.0368|269700.0|
|-122.25|37.84|52.0|2535.0| 489.0|1094.0| 514.0|3.6591|299200.0|
|-122.25|37.84|52.0|3104.0| 687.0|1157.0| 647.0|  3.12|241400.0|
|-122.26|37.84|42.0|2555.0| 665.0|1206.0| 595.0|2.0804|226700.0|
|-122.25|37.84|52.0|3549.0| 707.0|1551.0| 714.0|3.6912|261100.0|
|-122.26|37.85|52.0|2202.0| 434.0| 910.0| 402.0|3.2031|281500.0|
|-122.26|37.85|52.0|3503.0| 752.0|1504.0| 734.0|3.2705|241800.0|
|-122.26|37.85|52.0|2491.

In [36]:
import org.apache.spark.sql.types._
val df = spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .toDF("Longitude","Latitude","MedianAge",
                     "TotalRooms","TotalBedrooms","Population","Households",
                     "MedianIncome","MedianHomeValue")


import org.apache.spark.sql.types._
df: org.apache.spark.sql.DataFrame = [Longitude: double, Latitude: double ... 7 more fields]


In [37]:
df.show()

+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|Longitude|Latitude|MedianAge|TotalRooms|TotalBedrooms|Population|Households|MedianIncome|MedianHomeValue|
+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|  -122.23|   37.88|     41.0|     880.0|        129.0|     322.0|     126.0|      8.3252|       452600.0|
|  -122.22|   37.86|     21.0|    7099.0|       1106.0|    2401.0|    1138.0|      8.3014|       358500.0|
|  -122.24|   37.85|     52.0|    1467.0|        190.0|     496.0|     177.0|      7.2574|       352100.0|
|  -122.25|   37.85|     52.0|    1274.0|        235.0|     558.0|     219.0|      5.6431|       341300.0|
|  -122.25|   37.85|     52.0|    1627.0|        280.0|     565.0|     259.0|      3.8462|       342200.0|
|  -122.25|   37.85|     52.0|     919.0|        213.0|     413.0|     193.0|      4.0368|       269700.0|
|  -122.25|   37.84|     52.0|    253

In [38]:
df.printSchema

root
 |-- Longitude: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- MedianAge: double (nullable = true)
 |-- TotalRooms: double (nullable = true)
 |-- TotalBedrooms: double (nullable = true)
 |-- Population: double (nullable = true)
 |-- Households: double (nullable = true)
 |-- MedianIncome: double (nullable = true)
 |-- MedianHomeValue: double (nullable = true)



<span style="color:blue;font-size:large">Eyeball the data</span>

In [39]:
df.show(false)

+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|Longitude|Latitude|MedianAge|TotalRooms|TotalBedrooms|Population|Households|MedianIncome|MedianHomeValue|
+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|-122.23  |37.88   |41.0     |880.0     |129.0        |322.0     |126.0     |8.3252      |452600.0       |
|-122.22  |37.86   |21.0     |7099.0    |1106.0       |2401.0    |1138.0    |8.3014      |358500.0       |
|-122.24  |37.85   |52.0     |1467.0    |190.0        |496.0     |177.0     |7.2574      |352100.0       |
|-122.25  |37.85   |52.0     |1274.0    |235.0        |558.0     |219.0     |5.6431      |341300.0       |
|-122.25  |37.85   |52.0     |1627.0    |280.0        |565.0     |259.0     |3.8462      |342200.0       |
|-122.25  |37.85   |52.0     |919.0     |213.0        |413.0     |193.0     |4.0368      |269700.0       |
|-122.25  |37.84   |52.0     |2535.0 

<span style="color:green;font-size:xx-large">Feature Engineering</span>

<span style="color:blue;font-size:large">Setting up the dependent variable</span>
<li>We'll rescale the median home value by dividing it by 100,000, so that it is measured in hundreds of thousands of US Dollars</li>

In [40]:
df.withColumn("MedianHomeValue",$"MedianHomeValue"/100000).show

+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|Longitude|Latitude|MedianAge|TotalRooms|TotalBedrooms|Population|Households|MedianIncome|MedianHomeValue|
+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+
|  -122.23|   37.88|     41.0|     880.0|        129.0|     322.0|     126.0|      8.3252|          4.526|
|  -122.22|   37.86|     21.0|    7099.0|       1106.0|    2401.0|    1138.0|      8.3014|          3.585|
|  -122.24|   37.85|     52.0|    1467.0|        190.0|     496.0|     177.0|      7.2574|          3.521|
|  -122.25|   37.85|     52.0|    1274.0|        235.0|     558.0|     219.0|      5.6431|          3.413|
|  -122.25|   37.85|     52.0|    1627.0|        280.0|     565.0|     259.0|      3.8462|          3.422|
|  -122.25|   37.85|     52.0|     919.0|        213.0|     413.0|     193.0|      4.0368|          2.697|
|  -122.25|   37.84|     52.0|    253

In [43]:
import org.apache.spark.sql.types._
val df = spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .toDF("Longitude","Latitude","MedianAge",
                     "TotalRooms","TotalBedrooms","Population","Households",
                     "MedianIncome","MedianHomeValue")
        .withColumn("MedianHomeValue",$"MedianHomeValue"/100000)

import org.apache.spark.sql.types._
df: org.apache.spark.sql.DataFrame = [Longitude: double, Latitude: double ... 7 more fields]


<span style="color:blue;font-size:large">Setting up independent variables</span>
<li>We'll divide total rooms, total bedrooms, and population by the number of households to get per household data</li>
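
As a worked check of the arithmetic, the plain Scala sketch below applies the same three divisions to the first row of cal_housing.data (880 rooms, 129 bedrooms, 322 people, 126 households). The `Block` case class and the `perHousehold` helper are hypothetical names used only for this illustration.

```scala
// Hypothetical container for the four raw counts of one block
case class Block(totalRooms: Double, totalBedrooms: Double, population: Double, households: Double)

// Returns (rooms per household, people per household, bedrooms per household)
def perHousehold(b: Block): (Double, Double, Double) = {
  require(b.households > 0, "households must be positive")
  (b.totalRooms / b.households, b.population / b.households, b.totalBedrooms / b.households)
}

// First row of cal_housing.data
val firstBlock = Block(totalRooms = 880.0, totalBedrooms = 129.0, population = 322.0, households = 126.0)
val (rooms, people, bedrooms) = perHousehold(firstBlock)
println(s"rooms=$rooms people=$people bedrooms=$bedrooms")
```

Per-household ratios are easier to compare across blocks than raw totals, because block sizes vary widely.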


In [44]:
df
    .withColumn("RoomsPerHouse", col("TotalRooms")/col("Households"))
    .withColumn("PeoplePerHouse", col("Population")/col("Households"))
    .withColumn("BedroomsPerHouse", col("TotalBedrooms")/col("Households"))
    .show

+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+------------------+------------------+------------------+
|Longitude|Latitude|MedianAge|TotalRooms|TotalBedrooms|Population|Households|MedianIncome|MedianHomeValue|     RoomsPerHouse|    PeoplePerHouse|  BedroomsPerHouse|
+---------+--------+---------+----------+-------------+----------+----------+------------+---------------+------------------+------------------+------------------+
|  -122.23|   37.88|     41.0|     880.0|        129.0|     322.0|     126.0|      8.3252|          4.526| 6.984126984126984|2.5555555555555554|1.0238095238095237|
|  -122.22|   37.86|     21.0|    7099.0|       1106.0|    2401.0|    1138.0|      8.3014|          3.585| 6.238137082601054| 2.109841827768014|0.9718804920913884|
|  -122.24|   37.85|     52.0|    1467.0|        190.0|     496.0|     177.0|      7.2574|          3.521| 8.288135593220339|2.8022598870056497| 1.073446327683616|
|  -122.25|   37

In [45]:
import org.apache.spark.sql.types._
val df = spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .toDF("Longitude","Latitude","MedianAge",
                     "TotalRooms","TotalBedrooms","Population","Households",
                     "MedianIncome","MedianHomeValue")
        .withColumn("MedianHomeValue",$"MedianHomeValue"/100000)
        .withColumn("RoomsPerHouse", col("TotalRooms")/col("Households"))
        .withColumn("PeoplePerHouse", col("Population")/col("Households"))
        .withColumn("BedroomsPerHouse", col("TotalBedrooms")/col("Households"))

import org.apache.spark.sql.types._
df: org.apache.spark.sql.DataFrame = [Longitude: double, Latitude: double ... 10 more fields]


<span style="color:blue;font-size:large">Select the features we need</span>


In [46]:
df.printSchema

root
 |-- Longitude: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- MedianAge: double (nullable = true)
 |-- TotalRooms: double (nullable = true)
 |-- TotalBedrooms: double (nullable = true)
 |-- Population: double (nullable = true)
 |-- Households: double (nullable = true)
 |-- MedianIncome: double (nullable = true)
 |-- MedianHomeValue: double (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)



In [None]:
df.select("MedianHomeValue", 
              "MedianAge", 
              "Population", 
              "Households", 
              "MedianIncome", 
              "RoomsPerHouse", 
              "PeoplePerHouse", 
              "BedroomsPerHouse",
               "Latitude",
               "Longitude").printSchema

In [48]:
import org.apache.spark.sql.types._
val df = spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .toDF("Longitude","Latitude","MedianAge",
                     "TotalRooms","TotalBedrooms","Population","Households",
                     "MedianIncome","MedianHomeValue")
        .withColumn("MedianHomeValue",$"MedianHomeValue"/100000)
        .withColumn("RoomsPerHouse", col("TotalRooms")/col("Households"))
        .withColumn("PeoplePerHouse", col("Population")/col("Households"))
        .withColumn("BedroomsPerHouse", col("TotalBedrooms")/col("Households"))
        .select("MedianHomeValue",
                "MedianAge",
                "Population",
                "Households",
                "MedianIncome",
                "RoomsPerHouse",
                "PeoplePerHouse",
                "BedroomsPerHouse",
                "Latitude",
                "Longitude")


import org.apache.spark.sql.types._
df: org.apache.spark.sql.DataFrame = [MedianHomeValue: double, MedianAge: double ... 8 more fields]


<br><br><br>
<span style="color:green;font-size:xx-large">Machine Learning Pipelines</span>

<li>A machine learning pipeline is an end-to-end framework that takes raw data as an input and produces the output of the model</li>
<li>A pipeline is designed to manage the flow of data as it goes through the feature engineering and feature transformation process, into the ML model, and gathers the results</li>
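<li>Spark contains pipeline support</li>
<li>Spark pipelines are built from stages: transformers (which implement transform) and estimators (which implement fit); evaluators then score the resulting predictions</li>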
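
A minimal sketch of a pipeline, assuming Spark 3.x on the classpath: it chains the three feature transformers covered earlier (StringIndexer, OneHotEncoder, VectorAssembler) into one `Pipeline` and reuses the small city data set from the StringIndexer example. The variable names and the local `session` are choices made for this illustration; in a notebook or spark-shell, `getOrCreate()` returns the session that already exists.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one
val session = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()

val cityDf = session.createDataFrame(Seq(
  ("MIA", 17.2, 2),
  ("NYC", 23.1, 3),
  ("SFO", 13.0, 2),
  ("NYC", 16.2, 4),
  ("CHI", 34.8, 3),
  ("SFO", 71.0, 3),
  ("LAX", 22.1, 7)
)).toDF("feature1", "feature2", "dv")

// Stage 1: strings to numeric indices (an estimator, needs fit)
val cityIndexer = new StringIndexer()
  .setInputCol("feature1")
  .setOutputCol("feature1Index")

// Stage 2: indices to one hot vectors (an estimator, needs fit)
val cityEncoder = new OneHotEncoder()
  .setInputCols(Array("feature1Index"))
  .setOutputCols(Array("feature1Vec"))

// Stage 3: combine the independent variables into one vector (a transformer)
val featureAssembler = new VectorAssembler()
  .setInputCols(Array("feature1Vec", "feature2"))
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array[PipelineStage](cityIndexer, cityEncoder, featureAssembler))

// fit runs every estimator in order; transform then applies all fitted stages
val fittedPipeline = pipeline.fit(cityDf)
val prepared = fittedPipeline.transform(cityDf)
  .select("dv", "features")
  .withColumnRenamed("dv", "label")

prepared.show(false)
```

The fitted pipeline can be applied unchanged to a test set, so the mappings learned from the training data are reused rather than re-learned.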

<span style="color:blue;font-size:large">Package the initial data preparation steps</span>
<li>Package the various steps that massage the data to get it ready for Spark ML into a function</li>

<span style="color:blue;font-size:large">Read the data from a file and split into train and test</span>


In [49]:
import org.apache.spark.sql.DataFrame


def readData(): (DataFrame,DataFrame) = {
    val df = spark.read.format("csv")
        .option("header","false")
        .option("inferschema","true")
        .load("cal_housing.data")
        .toDF("Longitude","Latitude","MedianAge",
                     "TotalRooms","TotalBedrooms","Population","Households",
                     "MedianIncome","MedianHomeValue")
    val Array(train,test) = df.randomSplit(Array(0.8,0.2),seed=1234L)
    (train,test)
}

import org.apache.spark.sql.DataFrame
readData: ()(org.apache.spark.sql.DataFrame, org.apache.spark.sql.DataFrame)


In [50]:
val (train, test) = readData

train: org.apache.spark.sql.DataFrame = [Longitude: double, Latitude: double ... 7 more fields]
test: org.apache.spark.sql.DataFrame = [Longitude: double, Latitude: double ... 7 more fields]


<span style="color:blue;font-size:large">Do the preprocessing steps</span>
<li>train and test can be separately passed through this function</li>

In [53]:
import org.apache.spark.sql.DataFrame


def prepareData(df: DataFrame): DataFrame = {
    df.withColumn("MedianHomeValue",$"MedianHomeValue"/100000)
        .withColumn("RoomsPerHouse", col("TotalRooms")/col("Households"))
        .withColumn("PeoplePerHouse", col("Population")/col("Households"))
        .withColumn("BedroomsPerHouse", col("TotalBedrooms")/col("Households"))
        .select("MedianHomeValue", 
                  "MedianAge", 
                  "Population", 
                  "Households", 
                  "MedianIncome", 
                  "RoomsPerHouse", 
                  "PeoplePerHouse", 
                  "BedroomsPerHouse",
                   "Latitude",
                   "Longitude")
        .withColumnRenamed("MedianHomeValue","label") //Spark ML models look for a dependent-variable column named label by default
}


import org.apache.spark.sql.DataFrame
prepareData: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame


In [54]:
prepareData(test)

res29: org.apache.spark.sql.DataFrame = [label: double, MedianAge: double ... 8 more fields]


In [55]:
prepareData(test).show

+-----+---------+----------+----------+------------+------------------+------------------+------------------+--------+---------+
|label|MedianAge|Population|Households|MedianIncome|     RoomsPerHouse|    PeoplePerHouse|  BedroomsPerHouse|Latitude|Longitude|
+-----+---------+----------+----------+------------+------------------+------------------+------------------+--------+---------+
|0.858|     19.0|    1298.0|     478.0|      1.9797| 5.589958158995816| 2.715481171548117|1.1548117154811715|    41.8|   -124.3|
|1.114|     52.0|     907.0|     369.0|      2.3571| 6.008130081300813|2.4579945799457996| 1.067750677506775|   40.58|  -124.26|
|0.705|     39.0|     883.0|     337.0|       1.745| 5.448071216617211| 2.620178041543027|1.0445103857566767|   40.79|  -124.18|
|1.289|     17.0|     873.0|     313.0|      4.0357| 6.472843450479234|2.7891373801916934|1.0798722044728435|   40.74|  -124.17|
|1.161|     13.0|     951.0|     353.0|      4.8516|  6.15014164305949|2.6940509915014164|0.96033

<br><br>
<span style="color:green;font-size:xx-large">Applying Spark feature transformations</span>


<span style="color:blue;font-size:large">Prepare data for ML</span>


<li>Gather independent features into a column of vectors</li>
<li>Each vector corresponds to one data point</li>
<li>Specify the dependent variable (by default, ml models look for a column named <span style="color:blue">label</span>)</li>
<li>Keep only the column of vectors (independent variables) and the dependent variable</li>


<span style="color:blue;font-size:large">VectorAssembler</span>


<li>A feature transformer provided by Spark ML that combines a given set of columns into a single vector column</li>
<li><a href="https://spark.apache.org/docs/latest/ml-features#vectorassembler">https://spark.apache.org/docs/latest/ml-features#vectorassembler</a></li>
<li>It returns a dataframe with all the columns of the input df plus a new column that holds the selected columns concatenated into a vector</li>
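
As a rough picture of what "concatenated into a vector" means, the following plain-Scala sketch assembles selected columns of a row into one feature array. It is an illustration only: rows are modeled as a `Map[String, Double]`, and the names `assemble` and `assembleAll` are hypothetical helpers, not part of Spark.

```scala
object AssembleSketch {
  // Concatenate the selected columns of one row into a single feature array,
  // keeping the order given in inputCols.
  def assemble(row: Map[String, Double], inputCols: Seq[String]): Array[Double] =
    inputCols.map(row).toArray

  // Keep only the label and the assembled features for every row.
  def assembleAll(rows: Seq[Map[String, Double]],
                  inputCols: Seq[String],
                  labelCol: String): Seq[(Double, Array[Double])] =
    rows.map(r => (r(labelCol), assemble(r, inputCols)))
}
```

The order of `inputCols` determines the position of each feature in the vector, which matters later when reading the model coefficients.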

In [56]:
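//IV (independent variable) column names: every column except label
val cols = prepareData(train).columns
    .filter(_ != "label")

//What if a column has missing values? handleInvalid defaults to error (see the assembler output below),
//so transform would fail on rows with nulls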

cols: Array[String] = Array(MedianAge, Population, Households, MedianIncome, RoomsPerHouse, PeoplePerHouse, BedroomsPerHouse, Latitude, Longitude)


In [57]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Matrix, Vectors}

//Get the names of all columns except label (the renamed MedianHomeValue)
val cols = prepareData(train).columns
    .filter(_ != "label")


//Create a vectorassembler from the list of columns and specify the name of the column of vectors
val assembler = new VectorAssembler()
  .setInputCols(cols)
  .setOutputCol("features")

//Apply the transform function to the prepared data frame and keep only the label and features columns
//(the dependent variable was already renamed to label in prepareData)

val vector_df = assembler.transform(prepareData(train))
    .select("label","features")

vector_df.show

+-----+--------------------+
|label|            features|
+-----+--------------------+
|0.946|[52.0,806.0,270.0...|
|1.036|[17.0,1244.0,456....|
| 0.79|[36.0,1194.0,465....|
|0.761|[32.0,434.0,187.0...|
|1.067|[52.0,1152.0,435....|
|0.508|[52.0,544.0,172.0...|
|0.732|[11.0,1343.0,479....|
|0.783|[28.0,1530.0,653....|
|0.581|[32.0,620.0,268.0...|
|0.669|[20.0,1993.0,721....|
|0.684|[17.0,1947.0,647....|
|0.901|[21.0,2907.0,972....|
| 0.69|[30.0,1367.0,583....|
|  0.7|[37.0,640.0,260.0...|
|0.746|[15.0,1645.0,640....|
| 1.07|[35.0,480.0,179.0...|
|0.722|[33.0,656.0,236.0...|
| 0.67|[34.0,950.0,317.0...|
|0.702|[37.0,867.0,310.0...|
|0.646|[40.0,788.0,279.0...|
+-----+--------------------+
only showing top 20 rows



import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Matrix, Vectors}
cols: Array[String] = Array(MedianAge, Population, Households, MedianIncome, RoomsPerHouse, PeoplePerHouse, BedroomsPerHouse, Latitude, Longitude)
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_d48b0f798214, handleInvalid=error, numInputCols=9
vector_df: org.apache.spark.sql.DataFrame = [label: double, features: vector]


<span style="color:blue;font-size:large">Scaling</span>


<li>Scale all independent variables </li>
<li><a href="https://spark.apache.org/docs/latest/ml-features#standardscaler">https://spark.apache.org/docs/latest/ml-features#standardscaler</a></li>
<li><span style="color:blue">withStd</span> (default true) scales each feature to a std of 1</li>
<li><span style="color:blue">withMean set to false</span> (the default) does not center the data: each value is only divided by the std, so the mean is rescaled rather than shifted to 0</li>
<li><span style="color:blue">withMean set to true</span> centers each feature to mean 0 before scaling to a std of 1</li>




<li>Create a Scaler object</li>
<li>Specify the column to be scaled (features)</li>
<li>Specify the options (withMean/withStd)</li>
<li>Then apply fit to generate the parameters for scaling (the mean and the std)</li>
<li>And apply transform to scale the data (using the fitted mean and std)</li>
<li>A new column will be added to the dataframe with the scaled values</li>
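
The arithmetic behind the fit and transform steps can be sketched for a single feature in plain Scala. This is an illustration, not Spark's implementation: it assumes the sample standard deviation (n - 1 denominator), and the handling of a zero std (values left unscaled) is a choice made for this sketch only.

```scala
object ScalingSketch {
  final case class Params(mean: Double, std: Double)

  // "fit": estimate the mean and the sample standard deviation (n - 1 denominator).
  def fit(xs: Seq[Double]): Params = {
    val n = xs.size
    val mean = xs.sum / n
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    Params(mean, math.sqrt(variance))
  }

  // "transform": apply previously fitted parameters to any data.
  // withMean subtracts the fitted mean; withStd divides by the fitted std.
  // Assumption of this sketch: a zero std leaves the values unscaled.
  def transform(xs: Seq[Double], p: Params,
                withMean: Boolean, withStd: Boolean): Seq[Double] = {
    val shift = if (withMean) p.mean else 0.0
    val scale = if (withStd && p.std != 0.0) p.std else 1.0
    xs.map(x => (x - shift) / scale)
  }
}
```

For the values 2, 4, 6 the fitted mean is 4 and the fitted std is 2, so `withMean = true` gives -1, 0, 1 while `withMean = false` gives 1, 2, 3: the spread is the same, but only the first result is centered on 0.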

In [58]:
import org.apache.spark.ml.feature.StandardScaler
val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithStd(true)
      .setWithMean(true)

//Generate the parameters (fit the scaling object to the data)
val fitted_scaler = scaler.fit(vector_df)

//scale the data
val scaled_df = fitted_scaler.transform(vector_df)
scaled_df.show(false)

+-----+-------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                         |scaledFeatures                                                                                                                                                                         |
+-----+-------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.946|[52.0,806.0,270.0,3.0147,6.7407407407407405,2.9851851851851854,1.1111111111111112,40.54,-124.35] |[1.8529357551237848,-0.54

import org.apache.spark.ml.feature.StandardScaler
scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_1c8c420141a8
fitted_scaler: org.apache.spark.ml.feature.StandardScalerModel = StandardScalerModel: uid=stdScal_1c8c420141a8, numFeatures=9, withMean=true, withStd=true
scaled_df: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]


In [59]:
scaled_df.printSchema

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



<span style="color:green;font-size:xx-large">Select and setup a model</span>


<li>We'll use a linear regression model with elastic net regularization</li>
<li>and fit the model to the training data</li>
<li><a href="https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression">https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression</a></li>

In [60]:
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
    .setMaxIter(10) //maximum number of optimizer iterations
    .setRegParam(0.3) //regularization strength
    .setElasticNetParam(0.8) //elastic net mixing parameter: 0 = pure L2 (ridge), 1 = pure L1 (lasso)
    .setFeaturesCol("scaledFeatures") //independent variables
    .setLabelCol("label") //dependent variable (optional here, since label is the default column name)

val lrModel = lr.fit(scaled_df) //fit the regression to the training data



import org.apache.spark.ml.regression.LinearRegression
lr: org.apache.spark.ml.regression.LinearRegression = linReg_58e54e320c3e
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = LinearRegressionModel: uid=linReg_58e54e320c3e, numFeatures=9
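
To see how `regParam` and `elasticNetParam` interact, the sketch below computes the elastic net penalty that is added to the squared-error loss, following the formula given in the Spark linear methods documentation: regParam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm). The helper name `penalty` is hypothetical, and the sketch covers the penalty term only, not the optimizer.

```scala
object ElasticNetSketch {
  // penalty = regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // alpha corresponds to elasticNetParam: 0 gives ridge (L2), 1 gives lasso (L1).
  def penalty(w: Seq[Double], regParam: Double, alpha: Double): Double = {
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
  }
}
```

With `alpha` close to 1 the L1 term dominates, which tends to push some coefficients to exactly 0.0.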


In [61]:
lr.fit(scaled_df)

res34: org.apache.spark.ml.regression.LinearRegressionModel = LinearRegressionModel: uid=linReg_58e54e320c3e, numFeatures=9


<br><br><br>
<span style="color:green;font-size:xx-large">Set up the pipeline</span>
<li>and send data through it</li>

In [62]:
import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipeline = new Pipeline().setStages(Array(assembler,scaler,lr))
val model = pipeline.fit(prepareData(train))

import org.apache.spark.ml.{Pipeline, PipelineModel}
pipeline: org.apache.spark.ml.Pipeline = pipeline_20ad1d99528f
model: org.apache.spark.ml.PipelineModel = pipeline_20ad1d99528f


<br><br><br>
<span style="color:green;font-size:xx-large">Model evaluation</span>

<li>The model contains the estimated parameters</li>
<li>And can be used to get predictions on the training and testing data</li>

<span style="color:blue;font-size:large">Create an evaluator</span>



In [63]:
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  

import org.apache.spark.ml.evaluation.RegressionEvaluator
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_99ff2961ba17, metricName=rmse, throughOrigin=false


<span style="color:blue;font-size:large">Get predictions</span>



In [64]:
val predictions = model.transform(prepareData(test))

predictions: org.apache.spark.sql.DataFrame = [label: double, MedianAge: double ... 11 more fields]


<span style="color:blue;font-size:large">Get evaluation metrics</span>



In [65]:
val rmse_test = evaluator.setMetricName("rmse").evaluate(predictions)
val r2_test = evaluator.setMetricName("r2").evaluate(predictions)
println(s"Test  RMSE: $rmse_test, Test  r2: $r2_test")

Test  RMSE: 0.8771558134520177, Test  r2: 0.41378266806495867


rmse_test: Double = 0.8771558134520177
r2_test: Double = 0.41378266806495867
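
For reference, the two metrics can be written out in a few lines of plain Scala. This is a sketch of the standard definitions (r2 as 1 minus the ratio of residual to total sum of squares), not the evaluator's source code; the object and function names are illustrative.

```scala
object MetricsSketch {
  // Root mean squared error: square root of the average squared residual.
  def rmse(labels: Seq[Double], preds: Seq[Double]): Double = {
    require(labels.nonEmpty && labels.size == preds.size)
    val sse = labels.zip(preds).map { case (y, p) => (y - p) * (y - p) }.sum
    math.sqrt(sse / labels.size)
  }

  // Coefficient of determination: 1 - SS_residual / SS_total.
  def r2(labels: Seq[Double], preds: Seq[Double]): Double = {
    require(labels.nonEmpty && labels.size == preds.size)
    val mean = labels.sum / labels.size
    val ssRes = labels.zip(preds).map { case (y, p) => (y - p) * (y - p) }.sum
    val ssTot = labels.map(y => (y - mean) * (y - mean)).sum
    1.0 - ssRes / ssTot
  }
}
```

Lower is better for rmse (it is in the units of the label, here 100,000s of dollars), while higher is better for r2.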


<span style="color:blue;font-size:large">Get model estimated parameters</span>



In [71]:
model.stages(2)
//stages holds Transformers; Transformer is the common supertype of fitted models (LinearRegressionModel, StandardScalerModel, ...) and feature transformers

res41: org.apache.spark.ml.Transformer = LinearRegressionModel: uid=linReg_58e54e320c3e, numFeatures=9


In [66]:
import org.apache.spark.ml.regression.LinearRegressionModel

val lrModel = model.stages(2).asInstanceOf[LinearRegressionModel] //cast the Transformer to a LinearRegressionModel
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")

Coefficients: [0.0,0.0,0.0,0.5287039916961691,0.0,0.0,0.0,0.0,0.0]
Intercept: 2.074400715885009


import org.apache.spark.ml.regression.LinearRegressionModel
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = LinearRegressionModel: uid=linReg_58e54e320c3e, numFeatures=9


<br><br><br>
<span style="color:green;font-size:xx-large">Hyperparameter tuning</span>
<br><br>

<li>Pipelines are useful for hyperparameter tuning</li>
<li>Create a parameter grid with parameter options</li>
<li>Create a cross validation model</li>
<li>fit the model (finds the best model)</li>
<li>extract model information</li>



<span style="color:blue;font-size:large">Create a parameter grid</span>



In [72]:
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .addGrid(lr.elasticNetParam,Array(0.7,0.8, 0.9))
  .build()

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linReg_58e54e320c3e-elasticNetParam: 0.7,
	linReg_58e54e320c3e-regParam: 0.1
}, {
	linReg_58e54e320c3e-elasticNetParam: 0.7,
	linReg_58e54e320c3e-regParam: 0.01
}, {
	linReg_58e54e320c3e-elasticNetParam: 0.8,
	linReg_58e54e320c3e-regParam: 0.1
}, {
	linReg_58e54e320c3e-elasticNetParam: 0.8,
	linReg_58e54e320c3e-regParam: 0.01
}, {
	linReg_58e54e320c3e-elasticNetParam: 0.9,
	linReg_58e54e320c3e-regParam: 0.1
}, {
	linReg_58e54e320c3e-elasticNetParam: 0.9,
	linReg_58e54e320c3e-regParam: 0.01
})
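
The grid is simply the cartesian product of the value lists. A plain-Scala sketch of that idea follows; `Candidate` and `build` are hypothetical names, and the ordering produced here is not claimed to match the ordering Spark uses.

```scala
object GridSketch {
  final case class Candidate(regParam: Double, elasticNetParam: Double)

  // Every combination of one regParam value with one elasticNetParam value.
  def build(regParams: Seq[Double], elasticNetParams: Seq[Double]): Seq[Candidate] =
    for {
      e <- elasticNetParams
      r <- regParams
    } yield Candidate(r, e)
}
```

Two values for `regParam` and three for `elasticNetParam` give 2 x 3 = 6 candidates, which matches the six parameter maps printed above. The grid grows multiplicatively with every parameter added.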


<span style="color:blue;font-size:large">Specify cross-validation parameters</span>
<li>CrossValidator can evaluate several parameter settings in parallel (see <span style="color:blue">setParallelism</span>)</li>


In [73]:
val cv = new CrossValidator()
  .setEstimator(pipeline) 
  .setEvaluator(new RegressionEvaluator()) //default metric is rmse; the setting with the lowest average rmse wins
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)  // Use at least 3 in practice!
  .setParallelism(3)  // Evaluate up to 3 parameter settings in parallel

cv: org.apache.spark.ml.tuning.CrossValidator = cv_c5e3ce0e94d9
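
What k-fold cross validation does with the training rows can be sketched in plain Scala. This is a simplification for illustration: rows are assigned to folds round-robin by index, whereas a real implementation would assign them randomly, and `folds` and `averageMetric` are hypothetical helper names.

```scala
object KFoldSketch {
  // Assign row i to fold (i % k). For each fold, return (training indices, validation indices).
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { f =>
      val (valid, train) = (0 until n).partition(_ % k == f)
      (train, valid)
    }

  // Score one candidate: evaluate it on every fold and average the results.
  // score receives (training indices, validation indices) and returns the metric.
  def averageMetric(n: Int, k: Int)(score: (Seq[Int], Seq[Int]) => Double): Double = {
    val scores = folds(n, k).map { case (train, valid) => score(train, valid) }
    scores.sum / k
  }
}
```

Every row is used for validation exactly once and for training k - 1 times. With 6 parameter settings and 3 folds that is 18 pipeline fits, plus one final refit of the best setting on the full training data.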


<span style="color:blue;font-size:large">Fit the cross validation model</span>



In [74]:
val cvModel = cv.fit(prepareData(train))

22/12/07 10:23:30 WARN BlockManager: Block rdd_258_0 already exists on this machine; not re-adding it
22/12/07 10:23:30 WARN BlockManager: Block rdd_258_0 already exists on this machine; not re-adding it


cvModel: org.apache.spark.ml.tuning.CrossValidatorModel = CrossValidatorModel: uid=cv_c5e3ce0e94d9, bestModel=pipeline_20ad1d99528f, numFolds=3


<span style="color:blue;font-size:large">Get predictions</span>

<li>The cross validation model contains the best model found in the grid search</li>

In [75]:
val test_r = cvModel.transform(prepareData(test))

test_r: org.apache.spark.sql.DataFrame = [label: double, MedianAge: double ... 11 more fields]


<span style="color:blue;font-size:large">Get best model evaluation metrics</span>



In [76]:
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  

val rmse = evaluator.setMetricName("rmse").evaluate(test_r)
val r2 = evaluator.setMetricName("r2").evaluate(test_r)

import org.apache.spark.ml.evaluation.RegressionEvaluator
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_2ffe5920ff1a, metricName=r2, throughOrigin=false
rmse: Double = 0.6866377577428902
r2: Double = 0.6407799863685091


<span style="color:blue;font-size:large">Extract the best model</span>

In [77]:
println(cvModel.bestModel.asInstanceOf[PipelineModel]
    .stages(2)
    .asInstanceOf[LinearRegressionModel]
    .coefficients)

println(cvModel.bestModel.asInstanceOf[PipelineModel]
    .stages(2)
    .asInstanceOf[LinearRegressionModel]
    .intercept)

[0.1701674396025173,-0.42697526239126027,0.4872216691862174,0.7698920441756001,-0.08262425865397742,0.0,0.1234461715492403,-0.6780525787325298,-0.6117890691324789]
2.0744007158850004


In [78]:
predictions.printSchema

root
 |-- label: double (nullable = true)
 |-- MedianAge: double (nullable = true)
 |-- Population: double (nullable = true)
 |-- Households: double (nullable = true)
 |-- MedianIncome: double (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [79]:
import org.apache.spark.ml.tuning.CrossValidator
cvModel.bestModel.getParam _

import org.apache.spark.ml.tuning.CrossValidator
res44: String => org.apache.spark.ml.param.Param[Any] = $Lambda$6020/0x0000000801df6040@5023bfda


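<span style="color:blue;font-size:large">Get the average cross-validation score for each parameter setting</span>
<li>RegressionEvaluator defaults to rmse as the metric</li>
<li>avgMetrics holds one value per entry of the parameter grid, averaged over the folds</li>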

In [80]:
cvModel.getEvaluator

res45: org.apache.spark.ml.evaluation.Evaluator = RegressionEvaluator: uid=regEval_8fd4b9399a8b, metricName=rmse, throughOrigin=false


In [81]:
cvModel.avgMetrics

res46: Array[Double] = Array(0.8110491985964897, 0.7042989389165119, 0.8147699859940992, 0.7047059724075506, 0.8189638446724051, 0.704868132797246)


<span style="color:blue;font-size:large">Try a different metric</span>
<li>Each candidate is still fit by the same linear regression algorithm</li>
<li>However, the cross validation model will choose the best model based on the value of the new metric (for r2, the highest average value)</li>


In [82]:
val cv = new CrossValidator()
  .setEstimator(pipeline) 
  .setEvaluator(new RegressionEvaluator().setMetricName("r2")) //Candidates are fit as before; the one with the highest average r2 is selected
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)  // Use at least 3 in practice!
  .setParallelism(3)  // Evaluate up to 3 parameter settings in parallel

val cvModel = cv.fit(prepareData(train))

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  
val test_r = cvModel.transform(prepareData(test))
val rmse = evaluator.setMetricName("rmse").evaluate(test_r)
val r2 = evaluator.setMetricName("r2").evaluate(test_r)

cv: org.apache.spark.ml.tuning.CrossValidator = cv_1c4e941b9178
cvModel: org.apache.spark.ml.tuning.CrossValidatorModel = CrossValidatorModel: uid=cv_1c4e941b9178, bestModel=pipeline_20ad1d99528f, numFolds=3
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_9fb908b99570, metricName=r2, throughOrigin=false
test_r: org.apache.spark.sql.DataFrame = [label: double, MedianAge: double ... 11 more fields]
rmse: Double = 0.6866377577428902
r2: Double = 0.6407799863685091


In [83]:
cvModel.avgMetrics

res47: Array[Double] = Array(0.5075670843870324, 0.6286806429919505, 0.5030403790572823, 0.6282521923184208, 0.4979128859605077, 0.6280815855468848)
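
The selection step can be reproduced by hand from the two avgMetrics arrays printed above. The sketch below is plain Scala with a hypothetical helper `bestIndex`; it assumes that the entries of avgMetrics follow the order of paramGrid.

```scala
object BestCandidateSketch {
  // Index of the best average score: the maximum if larger is better, else the minimum.
  def bestIndex(avgMetrics: Seq[Double], largerIsBetter: Boolean): Int = {
    val best = if (largerIsBetter) avgMetrics.max else avgMetrics.min
    avgMetrics.indexOf(best)
  }

  // Values copied from the two cvModel.avgMetrics outputs above.
  val rmseScores: Seq[Double] = Seq(0.8110491985964897, 0.7042989389165119, 0.8147699859940992,
                                    0.7047059724075506, 0.8189638446724051, 0.704868132797246)
  val r2Scores: Seq[Double] = Seq(0.5075670843870324, 0.6286806429919505, 0.5030403790572823,
                                  0.6282521923184208, 0.4979128859605077, 0.6280815855468848)
}
```

Both metrics pick index 1: the lowest average rmse and the highest average r2 fall on the same grid entry (under the ordering assumption, elasticNetParam 0.7 with regParam 0.01). That is consistent with the test rmse and r2 being identical in the two runs.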
