# Model Fitting with MLlib
brian higginbotham

In this example we will fit three different types of machine learning models using the ```MLlib``` library from ```pyspark```. To do this, we will use the **Abalone** data set available [here at the UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone). This data set describes just over 4,000 abalone with seven continuous (measured) features, one categorical feature, and one target variable: rings. We will use the rings variable to predict the age of the abalone.

To begin with, let's import some libraries that we will be using.

In [2]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession

If we inspect the dataset outside of the pyspark environment (just open it in a text editor), we'll notice that it does not have any column names. The column names are listed in a companion file, which we'll use to create a list of names that we can attach to the dataset when we read it in.

In [3]:
cols = ['sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']

Now we'll create a Spark session, read in the dataset, and inspect the first 5 rows.

In [5]:
spark = SparkSession.builder.getOrCreate()
abalone = spark.createDataFrame(pd.read_csv('abalone.data', names=cols))
abalone.show(5)

+---+------+--------+------+------------+--------------+--------------+------------+-----+
|sex|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|rings|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
|  M| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   15|
|  M|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|    7|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|
|  M|  0.44|   0.365| 0.125|       0.516|        0.2155|         0.114|       0.155|   10|
|  I|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|    7|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
only showing top 5 rows



For our ML models, we'll want to split the data into a training set and a test set. We can do this using ```.randomSplit()```. We'll print the row count of each split to make sure we get roughly the right proportions.

In [7]:
train, test = abalone.randomSplit([0.8, 0.2], seed=54)
print(train.count(), test.count())

3331 846
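The counts above are close to, but not exactly, 80% and 20% of the 4,177 rows. That is what we would expect if each row is assigned to a split by its own random draw. The plain-Python sketch below illustrates that idea only; `random_split` is a hypothetical helper written for this note, not Spark's implementation, so its counts will differ from the ones printed above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent uniform draw."""
    total = sum(weights)
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                parts[i].append(row)
                break
    return parts

rows = list(range(4177))  # stand-in for the 4,177 abalone rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=54)
print(len(train_rows), len(test_rows))
```

Because every row is drawn independently, the split sizes vary a little from seed to seed while staying near the requested weights.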


Additionally, we will want to make sure that our categorical variable, 'sex', is roughly equally proportioned in the training set. We wouldn't want there to be an overly disproportionate number of one of these categories as that could negatively affect the model fitting and the resultant predictions.

In [8]:
print(train.filter(train['sex']=='F').count())
print(train.filter(train['sex']=='M').count())
print(train.filter(train['sex']=='I').count())

1026
1227
1078


Looks pretty good! If the proportions were significantly off, we would check the proportions of each category in the overall dataset and compare them with the training set. If the two differed significantly, we would want to look into stratifying the train and test sets by the categorical variable. Ideally we would check this before making the split, but since the proportions look reasonable we'll leave the split as is.
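For reference, a stratified split can be sketched in plain Python as follows: group the rows by category, shuffle each group, and cut every group at the same fraction. The helper `stratified_split` and the toy rows are hypothetical and only illustrate the idea; in PySpark, ```DataFrame.sampleBy``` may be a reasonable starting point for the same task.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.8, seed=54):
    """Split rows so each category keeps roughly the same train fraction."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Toy data: 10 'F', 20 'M', 10 'I' rows (made up for illustration)
toy = [{"sex": s, "id": i} for i, s in enumerate("F" * 10 + "M" * 20 + "I" * 10)]
train_rows, test_rows = stratified_split(toy, "sex")
train_counts = {s: sum(r["sex"] == s for r in train_rows) for s in "FMI"}
print(train_counts)
```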

# MLR/Regularized Regression

The first model will be a **Regularized Regression** model. These models are based on Multiple Linear Regression (MLR) but add a penalty term that influences feature selection.

In an MLR model, multiple features are used to predict a target or outcome. These features can be considered individually (Main Effects), multiplied together (Interaction Effects), raised to a power (Polynomial Effects), or any combination thereof. Depending on the number of features in the dataset, it can quickly become difficult and time consuming to determine the best combination of features to include in an MLR model. This is where **Regularized Regression** can be helpful. These models add a penalty term to the MLR objective that shrinks the 'weight' (coefficient) of each feature, so that less useful features contribute less to the model and, with LASSO, can drop out entirely.

There are three types of **Regularized Regression models**: LASSO, Ridge, and ElasticNet.

**LASSO**: the penalty on a LASSO is a 'tuning parameter' (alpha) multiplied by the sum of the absolute values of the coefficients. This parameter can range from 0 (which means the penalty is not applied and the model reverts to an MLR model) to very large. The higher the alpha, the more the coefficients shrink; some shrink all the way to zero, so fewer features end up determining the target or outcome.

**Ridge**: the penalty on a Ridge is the 'tuning parameter' multiplied by the sum of the squared coefficients. As with the LASSO, the parameter can range from 0 (penalty not applied - MLR model) to very large. Squaring the coefficients penalizes large coefficients most heavily, so Ridge shrinks coefficients toward zero but generally does not set them exactly to zero.

**ElasticNet**: This is a combination of the LASSO and the Ridge models. A mixing ratio (L1) determines the proportion of the LASSO and Ridge penalties used in the model.

As with the MLR, it can be difficult and time consuming to determine which alpha to use for the LASSO and the Ridge, and which alpha and L1 to use for the ElasticNet. On top of that, we'll also need to decide which model, with its optimal parameters, performs best.

With the MLlib ```LinearRegression()``` and ```CrossValidator()``` classes, we'll be able to supply a range of values for both alpha and L1 and produce a model with the best-performing combination! Depending on the range of parameters, the ```CrossValidator``` will effectively determine whether the best model is an MLR (alpha = 0), LASSO (L1 = 1), Ridge (L1 = 0), or ElasticNet (0 < L1 < 1).
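To make the roles of the two parameters concrete, here is a small plain-Python sketch of an elastic net penalty. It assumes the common form in which alpha scales the whole penalty and L1 mixes the absolute-value and squared terms (with a factor of one half on the squared term); the function name and the example coefficients are made up for illustration.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty added to the regression loss for a given coefficient vector.

    alpha    -> overall penalty strength (regParam in this notebook)
    l1_ratio -> mix between LASSO and Ridge (elasticNetParam in this notebook)
    """
    l1_term = sum(abs(c) for c in coefs)   # LASSO part
    l2_term = sum(c * c for c in coefs)    # Ridge part
    return alpha * (l1_ratio * l1_term + (1.0 - l1_ratio) * 0.5 * l2_term)

coefs = [1.0, -2.0, 0.5]  # hypothetical coefficients
no_penalty = elastic_net_penalty(coefs, alpha=0.0, l1_ratio=0.75)  # MLR
lasso = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=1.0)
ridge = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.0)
mixed = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.5)
print(no_penalty, lasso, ridge, mixed)
```

Note that when alpha is 0 the penalty vanishes whatever the value of L1, which is why every L1 setting collapses to the same MLR model in that case.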

To get started, we'll need to prep our data for the model fit.

In [9]:
from pyspark.ml.feature import SQLTransformer, StringIndexer, Binarizer, VectorAssembler, VectorIndexer

The one categorical feature, 'sex', has three values - female ('F'), male ('M'), and infant ('I'). We'll need to create dummy variables for these values. We'll start by using ```StringIndexer()``` which creates a column with numerical values for each category in a categorical column. We then fit the ```StringIndexer``` instance to the data and then ```.transform()``` the data to create the numerical column.

In [11]:
indexer_f = StringIndexer(inputCols=['sex'], outputCols=['sex_F'])
indexer_fTrans = indexer_f.fit(train)
indexer_fTrans.transform(train).show(20)

+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+
|sex|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|rings|sex_F|
+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|  2.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|  2.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|  2.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|  2.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|  2.0|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|   10|  2.0|
|  F| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26|   16|  2.0|
|  F|  0.55|    0.44|  0.15|  

Now we 'binarize' the column using ```Binarizer()```. Since 'F' was assigned 2.0 by the indexer, which happens to be the highest of the three values, we'll use a threshold of 1.5 to 'binarize' on. This means everything above 1.5 is assigned '1' and everything else '0' in the 'female' column.
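The thresholding rule can be sketched in a few lines of plain Python. The `binarize` helper is hypothetical, and the index values for 'M' and 'I' are assumed from the category counts printed earlier ('M' most frequent, then 'I', then 'F').

```python
def binarize(value, threshold):
    """Return 1.0 for values strictly above the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Indices as assigned by the indexer: 0.0 = most frequent category
sex_index = {"M": 0.0, "I": 1.0, "F": 2.0}
female_flags = {sex: binarize(idx, 1.5) for sex, idx in sex_index.items()}
print(female_flags)
```

Only the category with the highest index clears the 1.5 threshold, so a single threshold can isolate exactly one category.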

In [12]:
binary_fTrans = Binarizer(threshold = 1.5, inputCol='sex_F', outputCol='female')
binary_fTrans.transform(indexer_fTrans.transform(train)).show(5)

+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+------+
|sex|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|rings|sex_F|female|
+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+------+
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|  2.0|   1.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|  2.0|   1.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|  2.0|   1.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|  2.0|   1.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|  2.0|   1.0|
+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+------+
only showing top 5 rows



Now we need to create a dummy variable for the males or the infants, but not both. Once we create a dummy variable for one of the remaining two categories, the last category is automatically represented in the model through its absence (for example, if we create dummy variables for 'F' and 'M', then 'I' is implied whenever 'F'=0 and 'M'=0).

However, ```StringIndexer()``` assigns numbers based on the frequency of each category, in descending order by default. Thus 'F' was assigned 2.0 because it was the least frequent category (0.0 being the most frequent). This creates a problem for ```Binarizer()```, which only assigns '1' to values above the threshold, so it cannot isolate a category that sits at index 0.0 or 1.0. We can remedy this by using the ```stringOrderType``` argument of ```StringIndexer()```. By setting ```stringOrderType='frequencyAsc'```, 'M' (the most frequent category) is now assigned 2.0, and from there we can use the same ```Binarizer()``` threshold for a 'male' dummy column.
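The effect of the ordering option can be sketched in plain Python using the training-set counts printed earlier. `index_by_frequency` is a hypothetical stand-in for the indexer; breaking ties alphabetically is an assumption of this sketch and does not matter here because all three counts differ.

```python
def index_by_frequency(counts, ascending=False):
    """Map each label to a float index ordered by how often it occurs."""
    ordered = sorted(
        counts,
        key=lambda label: (counts[label] if ascending else -counts[label], label),
    )
    return {label: float(i) for i, label in enumerate(ordered)}

train_counts = {"F": 1026, "M": 1227, "I": 1078}  # from the counts above
desc = index_by_frequency(train_counts)                  # default-style ordering
asc = index_by_frequency(train_counts, ascending=True)   # 'frequencyAsc'-style
print(desc)
print(asc)
```

In descending order 'F' lands on the top index, and in ascending order 'M' does, so the same 1.5 threshold picks out a different category in each case.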

In [13]:
indexer_m = StringIndexer(inputCols=['sex'], outputCols=['sex_M'], stringOrderType='frequencyAsc')
indexer_mTrans = indexer_m.fit(train)
indexer_mTrans.transform(train).show(20)

+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+
|sex|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|rings|sex_M|
+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|  0.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|  0.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|  0.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|  0.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|  0.0|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|   10|  0.0|
|  F| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26|   16|  0.0|
|  F|  0.55|    0.44|  0.15|  

In [14]:
binary_mTrans = Binarizer(threshold = 1.5, inputCol='sex_M', outputCol='male')
binary_mTrans.transform(indexer_mTrans.transform(binary_fTrans.transform(indexer_fTrans.transform(train)))).show(20)

+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+------+-----+----+
|sex|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|rings|sex_F|female|sex_M|male|
+---+------+--------+------+------------+--------------+--------------+------------+-----+-----+------+-----+----+
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|  2.0|   1.0|  0.0| 0.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|  2.0|   1.0|  0.0| 0.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|  2.0|   1.0|  0.0| 0.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|  2.0|   1.0|  0.0| 0.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|  2.0|   1.0|  0.0| 0.0|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.

Now we have the dummy columns for 'female' and 'male' along with the implied column for 'infant'.

```SQLTransformer()``` can be used to create, select, and name columns that will be used in the model.

The initial dataset has the column 'rings' as the target; the age of an abalone is obtained by adding 1.5 to this value. We can make that transformation in ```SQLTransformer()``` and add a few interaction terms at the same time. Remember, an interaction term combines two (or more) features into a new feature that may be influential in fitting the model. Here, we'll multiply a continuous feature by a categorical dummy variable, twice, to create two new features - 'intact1' and 'intact2'.
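As a quick illustration of what these derived columns contain, here is the same arithmetic for a single row in plain Python. The helper name is hypothetical; the input values are taken from the first training row shown earlier.

```python
def add_label_and_interactions(row):
    """Derive the age label and the two interaction terms for one row."""
    out = dict(row)
    out["label"] = row["rings"] + 1.5                   # age = rings + 1.5
    out["intact1"] = row["whole_weight"] * row["male"]  # weight, males only
    out["intact2"] = row["length"] * row["female"]      # length, females only
    return out

first_row = {"length": 0.44, "whole_weight": 0.451, "rings": 10,
             "male": 0.0, "female": 1.0}
result = add_label_and_interactions(first_row)
print(result["label"], result["intact1"], result["intact2"])
```

Because the dummy variables are 0 or 1, each interaction term is simply the continuous value for rows in that category and 0 for everything else.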

In [15]:
sqlTrans_lr = SQLTransformer(statement = '''SELECT length, diameter, height, whole_weight, shucked_weight, \
viscera_weight, shell_weight, rings+1.5 as label, male, female, whole_weight*male as intact1, length*female as intact2 \
FROM __THIS__''')

In [16]:
sqlTrans_lr.transform(binary_mTrans.transform(indexer_mTrans.transform\
        (binary_fTrans.transform(indexer_fTrans.transform(train))))).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|intact1|intact2|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+
|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13| 11.5| 0.0|   1.0|    0.0|   0.44|
|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185| 11.5| 0.0|   1.0|    0.0|   0.47|
| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21| 15.5| 0.0|   1.0|    0.0|  0.525|
|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33| 21.5| 0.0|   1.0|    0.0|   0.53|
|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21| 10.5| 0.0|   1.0|    0.0|   0.53|
| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|

The last step in the data preparation phase is to take all of our features and add them to one column as a vector. This is the only input column the model will use in the ```.fit()``` method. We can use ```VectorAssembler()``` to accomplish this.

In [17]:
assembler_lr = VectorAssembler(inputCols=['length', 'diameter', 'height', 'whole_weight', \
                'shucked_weight', 'viscera_weight', 'shell_weight', \
                'male', 'female', 'intact1', 'intact2'], outputCol='features', handleInvalid='keep')

**Note** how we repeatedly apply ```.transform()``` with each successive transformer to produce the desired dataframe. This same sequence of steps will be used to create our ```Pipeline()``` for the model fitting.
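The relationship between the nested calls and an ordered list of stages can be sketched in plain Python. The stage functions and `run_stages` below are hypothetical stand-ins that operate on a single dictionary rather than a dataframe; they only show that applying stages in order gives the same result as nesting the calls by hand.

```python
from functools import reduce

def run_stages(stages, data):
    """Feed the output of each stage into the next, in list order."""
    return reduce(lambda current, stage: stage(current), stages, data)

def add_female(row):
    return {**row, "female": 1.0 if row["sex"] == "F" else 0.0}

def add_male(row):
    return {**row, "male": 1.0 if row["sex"] == "M" else 0.0}

def add_label(row):
    return {**row, "label": row["rings"] + 1.5}

row = {"sex": "F", "rings": 10}
nested = add_label(add_male(add_female(row)))                # innermost call runs first
piped = run_stages([add_female, add_male, add_label], row)   # first stage runs first
print(piped)
```

Note that the list order reads first-to-last, while the nested form reads inside-out.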

In [18]:
assembler_lr.transform(sqlTrans_lr.transform(binary_mTrans.transform\
            (indexer_mTrans.transform(binary_fTrans.transform\
            (indexer_fTrans.transform(train)))))).show(5)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+--------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|intact1|intact2|            features|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+--------------------+
|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13| 11.5| 0.0|   1.0|    0.0|   0.44|[0.44,0.34,0.1,0....|
|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185| 11.5| 0.0|   1.0|    0.0|   0.47|[0.47,0.355,0.1,0...|
| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21| 15.5| 0.0|   1.0|    0.0|  0.525|[0.525,0.38,0.14,...|
|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33| 21.5| 0.0|   1.0|    0.0|   0.53|[0.53,0.415,0.15,...|
|  0.53|    0.42| 0.135|       0.677|        0.2

Now that the data is ready to be fit, we can set up the cross validation function that will iterate through selected parameters and choose the optimal parameters for the fitted model.

Here are the libraries we'll use.

In [19]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

Now set up an instance for ```LinearRegression()``` that will be used in ```pipeline_lr``` and ```paramGrid_lr```.

Use ```ParamGridBuilder()``` to set up the list of **alphas** (```regParam```) and **L1**'s (```elasticNetParam```) to be tested.

```Pipeline()``` is a list of the transformations we used to prep the data for fitting. Start with the first transformation and end with the Linear Regression instance 'lr'.

```CrossValidator()``` combines all these pieces into one object that can be used to fit the model. It also takes the argument ```evaluator=RegressionEvaluator()```, which specifies the metric used to score each cross-validation fold. The default metric is **RMSE**.
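It helps to know how much work a grid search implies before launching it. The plain-Python sketch below expands the same alpha and L1 values used in this notebook into every combination and counts the fits needed for 5 folds; the variable names are illustrative only.

```python
from itertools import product

alphas = [0, 0.01, 0.05, 0.1]           # regParam values to try
l1_ratios = [0, .5, .75, .9, .95, 1]    # elasticNetParam values to try

# Every (alpha, L1) pair becomes one candidate model
grid = [{"regParam": a, "elasticNetParam": l}
        for a, l in product(alphas, l1_ratios)]

num_folds = 5
fits_during_cv = len(grid) * num_folds  # one fit per candidate per fold
print(len(grid), fits_during_cv)
```

Each added grid value multiplies the number of candidates, so the cost grows quickly as more parameters are tuned.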

**RMSE** (Root Mean Squared Error): take the difference between each prediction and its actual value and square it; average the squared differences; then take the square root of that average. Since the differences were squared at the start, taking the square root at the end converts the value back to the same units as the measurements. When comparing model training and performance, the lower the **RMSE**, the better the model.
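The steps of the calculation translate directly into a few lines of plain Python. The labels and predictions here are made-up numbers chosen to keep the arithmetic easy to follow.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)

labels = [10.0, 12.0, 8.0]        # hypothetical actual ages
predictions = [11.0, 12.0, 6.0]   # hypothetical predicted ages
score = rmse(labels, predictions)
print(score)
```

The errors here are 1, 0, and -2, so the squared errors are 1, 0, and 4; the larger miss dominates the average, which is the usual behaviour of a squared-error metric.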

In [21]:
lr = LinearRegression()
paramGrid_lr = ParamGridBuilder().addGrid(lr.regParam,[0, 0.01, 0.05, 0.1])\
.addGrid(lr.elasticNetParam,[0, .5, .75, .9, .95, 1])\
.build()
pipeline_lr=Pipeline(stages=[indexer_fTrans, binary_fTrans, indexer_mTrans, \
                             binary_mTrans, sqlTrans_lr, assembler_lr, lr])
crossval_lr=CrossValidator(estimator=pipeline_lr,
                        estimatorParamMaps=paramGrid_lr,
                        evaluator=RegressionEvaluator(),
                        numFolds=5)

Use the ```.fit()``` method on the training data to fit the model.

Note that the train data is the same unchanged data from the initial test/train split. The **pipeline** will perform all the data transformations for us!

In [None]:
cv_lr_Model=crossval_lr.fit(train)

Now we'll view the **RMSE** for each combination of alpha and L1.

In [23]:
my_list = []
for i in range(len(paramGrid_lr)):
    my_list.append([cv_lr_Model.avgMetrics[i], paramGrid_lr[i].values()])

In [24]:
my_list

[[2.240841778433401, dict_values([0.0, 0.0])],
 [2.2408417784333965, dict_values([0.0, 0.5])],
 [2.2408417784334054, dict_values([0.0, 0.75])],
 [2.2408417784334005, dict_values([0.0, 0.9])],
 [2.240841778433398, dict_values([0.0, 0.95])],
 [2.2408417784334005, dict_values([0.0, 1.0])],
 [2.25267676836313, dict_values([0.01, 0.0])],
 [2.260743591656917, dict_values([0.01, 0.5])],
 [2.266270937241732, dict_values([0.01, 0.75])],
 [2.27678870254163, dict_values([0.01, 0.9])],
 [2.2737135047268895, dict_values([0.01, 0.95])],
 [2.2783504660262612, dict_values([0.01, 1.0])],
 [2.2917163478500067, dict_values([0.05, 0.0])],
 [2.3148706086291506, dict_values([0.05, 0.5])],
 [2.3255559117430638, dict_values([0.05, 0.75])],
 [2.3320902239181533, dict_values([0.05, 0.9])],
 [2.3328185941148996, dict_values([0.05, 0.95])],
 [2.3335951639851147, dict_values([0.05, 1.0])],
 [2.3166678013918527, dict_values([0.1, 0.0])],
 [2.3481170694593, dict_values([0.1, 0.5])],
 [2.3599841777881876, dict_values

In the list above, the lowest average **RMSE** is at **alpha**=0 and **L1**=0.5. However, if **alpha** is 0, no penalty is applied and we have an **MLR** model regardless of **L1**. This makes sense looking at the output - all values of **RMSE** for **alpha**=0 agree to 13 decimal places!

So really, the best parameter is **alpha**=0 - an **MLR** model.
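Rather than scanning the list by eye, the lowest score can be picked out programmatically. The sketch below copies the first twelve entries of the output above into plain tuples of (RMSE, (alpha, L1)) and also measures how far apart the **alpha**=0 scores are.

```python
results = [
    (2.240841778433401, (0.0, 0.0)),
    (2.2408417784333965, (0.0, 0.5)),
    (2.2408417784334054, (0.0, 0.75)),
    (2.2408417784334005, (0.0, 0.9)),
    (2.240841778433398, (0.0, 0.95)),
    (2.2408417784334005, (0.0, 1.0)),
    (2.25267676836313, (0.01, 0.0)),
    (2.260743591656917, (0.01, 0.5)),
    (2.266270937241732, (0.01, 0.75)),
    (2.27678870254163, (0.01, 0.9)),
    (2.2737135047268895, (0.01, 0.95)),
    (2.2783504660262612, (0.01, 1.0)),
]

best_rmse, best_params = min(results)  # tuples compare on RMSE first

# Spread of the alpha = 0 scores: differences this small are numerical noise
alpha0 = [score for score, (alpha, _) in results if alpha == 0.0]
alpha0_spread = max(alpha0) - min(alpha0)
print(best_params, alpha0_spread)
```

The spread among the **alpha**=0 rows is many orders of magnitude smaller than the gap to the next **alpha**, which is why the particular **L1** value attached to the winner carries no meaning.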

Now we'll use this model to make predictions by using the ```transform()``` method on the model object. We can view a portion of the results to compare the predictions to the actual values (the 'label' column).

In [25]:
cv_lr_Model.transform(test).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+--------------------+------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|intact1|intact2|            features|        prediction|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+-------+-------+--------------------+------------------+
|  0.55|   0.415| 0.135|      0.7635|         0.318|          0.21|         0.2| 10.5| 0.0|   1.0|    0.0|   0.55|[0.55,0.415,0.135...| 11.22553523529227|
| 0.565|    0.44| 0.155|      0.9395|        0.4275|         0.214|        0.27| 13.5| 0.0|   1.0|    0.0|  0.565|[0.565,0.44,0.155...|11.755413465129166|
|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|  8.5| 0.0|   0.0|    0.0|    0.0|[0.33,0.255,0.08,...|  7.98469848282023|
|  0.38|   0.275|   0.1|      0.2255|          0.08|         0.049|   

Finally, run ```RegressionEvaluator()``` with the ```.evaluate()``` method to score the model on the test set.

We'll change the metric to **mae** - 'mean absolute error'. This metric sums the absolute differences between each prediction and the label (actual value) and then divides by the number of observations.
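As a small illustration of the metric, here is the calculation in plain Python applied to the three complete test rows visible in the prediction output above. Three rows are far too few to say anything about the model; the point is only to show the arithmetic.

```python
def mae(labels, predictions):
    """Mean absolute error between actual and predicted values."""
    absolute_errors = [abs(p - y) for y, p in zip(labels, predictions)]
    return sum(absolute_errors) / len(absolute_errors)

# First three rows of the test-set predictions shown above
labels = [10.5, 13.5, 8.5]
predictions = [11.22553523529227, 11.755413465129166, 7.98469848282023]
sample_mae = mae(labels, predictions)
print(sample_mae)
```

Unlike a squared-error metric, each miss counts in proportion to its size, so a single large error has less influence on the score.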

In [26]:
RegressionEvaluator(metricName='mae').evaluate(cv_lr_Model.transform(test))

1.4557640125829179

# Random Forest

The next model we'll look at is a **Random Forest**. Random Forest is an extension of the Regression Tree model. In a Regression Tree model, the predictor space is split into regions, and the predicted output is the average of the training observations within a region. The regions are defined by splits on feature values - the model chooses the most informative feature to split on first and may then add further splits based on other feature values. Because successive splits on different features capture interactions implicitly, we do not need to include interaction features for tree models.
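To make the idea of regions and region averages concrete, here is a plain-Python sketch of the simplest possible regression tree: a single split on one feature (a 'stump'). The helper names and the toy data are made up for illustration, and real tree implementations search over many features and split recursively.

```python
def fit_stump(xs, ys):
    """Pick the threshold that minimizes squared error; return
    (threshold, mean of left region, mean of right region)."""
    best = None
    for t in sorted(set(xs))[1:]:  # rows with x < t fall in the left region
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

# Toy data (made up): one feature and the age to predict
shell_weight = [0.05, 0.07, 0.10, 0.20, 0.25, 0.30]
age = [7.0, 8.0, 8.0, 12.0, 13.0, 14.0]
stump = fit_stump(shell_weight, age)
print(stump, predict_stump(stump, 0.28))
```

Every new observation that falls in a region receives that region's average as its prediction, which is why tree predictions take only a limited set of distinct values.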

**Random Forest** fits multiple trees (an ensemble), each on a bootstrap sample of the training data, and averages their predictions. This reduces variance compared with an individual tree fit. Additionally, Random Forest considers only a random subset of features when choosing splits. This *reduces* the influence of really strong features, which would otherwise dominate every tree and leave the trees highly correlated with one another.

The parameters to consider in a Random Forest are the **depth** and **number of trees**. Depth is the maximum number of successive splits between the top of a tree and its final regions - too shallow and the model is too general; too deep and the model is overfit. Number of trees is how many different tree fits the model makes - too few loses accuracy; too many is computationally intensive in both time and resources.

We've already done a lot of the heavy lifting with our Regularized Regression model, so we'll just need to make a few tweaks to get the Random Forest model up and running.

We'll keep the binary transformations for the 'sex' column but we'll need to update the ```SQLTransformer()``` object. Remember that trees automatically account for interactions, so we can remove those from our column selection.

In [27]:
sqlTrans_rf = SQLTransformer(statement = '''SELECT length, diameter, height, whole_weight, shucked_weight, \
viscera_weight, shell_weight, rings+1.5 as label, male, female FROM __THIS__''')

In [28]:
sqlTrans_rf.transform(binary_mTrans.transform(indexer_mTrans.transform\
            (binary_fTrans.transform(indexer_fTrans.transform(train))))).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+
|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13| 11.5| 0.0|   1.0|
|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185| 11.5| 0.0|   1.0|
| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21| 15.5| 0.0|   1.0|
|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33| 21.5| 0.0|   1.0|
|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21| 10.5| 0.0|   1.0|
| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205| 11.5| 0.0|   1.0|
| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26| 17.5| 0.0|   1.0|
|  0.55|  

We'll need to remove those interactions from the ```VectorAssembler()``` as well.

In [29]:
assembler_rf = VectorAssembler(inputCols=['length', 'diameter', 'height', 'whole_weight', \
                'shucked_weight', 'viscera_weight', 'shell_weight', \
                'male', 'female'], outputCol='features', handleInvalid='keep')

In [30]:
assembler_rf.transform(sqlTrans_rf.transform(binary_mTrans.transform\
                (indexer_mTrans.transform(binary_fTrans.transform\
                (indexer_fTrans.transform(train)))))).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|            features|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+
|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13| 11.5| 0.0|   1.0|[0.44,0.34,0.1,0....|
|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185| 11.5| 0.0|   1.0|[0.47,0.355,0.1,0...|
| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21| 15.5| 0.0|   1.0|[0.525,0.38,0.14,...|
|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33| 21.5| 0.0|   1.0|[0.53,0.415,0.15,...|
|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21| 10.5| 0.0|   1.0|[0.53,0.42,0.135,...|
| 0.535|   0.405| 0.145|      0.6845|   

Import the ```RandomForestRegressor``` class.

In [31]:
from pyspark.ml.regression import RandomForestRegressor

Now set up a ```CrossValidator()``` object with updated model instance, paramGrids, and pipelines.

In ```ParamGridBuilder()``` specify the numbers of trees to try and the maximum depths to try for each. The ```Pipeline()``` has the same structure as before, with the updated transformers and model swapped in, and ```RegressionEvaluator()``` defaults to **rmse**.

In [32]:
rf = RandomForestRegressor(labelCol='label', featuresCol='features')
paramGrid_rf = ParamGridBuilder().addGrid(rf.numTrees,[10, 15, 20])\
.addGrid(rf.maxDepth,[1, 5, 10])\
.build()
pipeline_rf=Pipeline(stages=[indexer_fTrans, binary_fTrans, indexer_mTrans, binary_mTrans, \
            sqlTrans_rf, assembler_rf, rf])
crossval_rf=CrossValidator(estimator=pipeline_rf,
                        estimatorParamMaps=paramGrid_rf,
                        evaluator=RegressionEvaluator(),
                        numFolds=3)

Fit the model.

In [None]:
cv_rf_Model = crossval_rf.fit(train)

Some results we can get from the fitted model include the number of trees in the best model and the number of features used.

In [34]:
cv_rf_Model.bestModel.stages[-1]

RandomForestRegressionModel: uid=RandomForestRegressor_bcb2c782ae5e, numTrees=20, numFeatures=9

As well as a breakdown of weighted importance of each feature.

In [35]:
cv_rf_Model.bestModel.stages[-1].featureImportances

SparseVector(9, {0: 0.0592, 1: 0.1005, 2: 0.1508, 3: 0.1028, 4: 0.1195, 5: 0.117, 6: 0.3162, 7: 0.0161, 8: 0.0178})
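The vector above is indexed by position, so it is easier to read once each position is matched to its feature name. The sketch below pairs the values printed above with the column order passed to ```assembler_rf``` and sorts them; it uses plain Python lists and a dictionary rather than the Spark vector object.

```python
feature_names = ['length', 'diameter', 'height', 'whole_weight',
                 'shucked_weight', 'viscera_weight', 'shell_weight',
                 'male', 'female']

# Values copied from the SparseVector output above
importances = {0: 0.0592, 1: 0.1005, 2: 0.1508, 3: 0.1028, 4: 0.1195,
               5: 0.117, 6: 0.3162, 7: 0.0161, 8: 0.0178}

ranked = sorted(((feature_names[i], w) for i, w in importances.items()),
                key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(f"{name:15s} {weight:.4f}")
```

Read this way, 'shell_weight' carries the most weight in this fit, while the two 'sex' dummy variables carry the least.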

Run predictions on the test set.

In [36]:
cv_rf_Model.transform(test).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|            features|        prediction|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+------------------+
|  0.55|   0.415| 0.135|      0.7635|         0.318|          0.21|         0.2| 10.5| 0.0|   1.0|[0.55,0.415,0.135...|11.346372938355417|
| 0.565|    0.44| 0.155|      0.9395|        0.4275|         0.214|        0.27| 13.5| 0.0|   1.0|[0.565,0.44,0.155...|11.643525221962573|
|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|  8.5| 0.0|   0.0|[0.33,0.255,0.08,...|  8.03042034870752|
|  0.38|   0.275|   0.1|      0.2255|          0.08|         0.049|       0.085| 11.5| 0.0|   0.0|[0.38,0.275,0.1,0...| 9.078998983157696|
| 0.355|    0.28| 0.095|   

Calculate the performance of the model to compare with our previous model. 

In [37]:
RegressionEvaluator(metricName='mae').evaluate(cv_rf_Model.transform(test))

1.4486161589442321

# Gradient Boosting

Gradient Boosting is another ensemble method based on Regression Trees. As the name ensemble implies, it fits multiple regression trees, but instead of averaging the results, as Random Forest does, Gradient Boosting fits a tree, computes the residuals of that fit, fits another tree to the residuals, and repeats this pattern a given number of times. The final model is the sum of all the trees. The idea is that each tree is a weak learner (for example, a shallow tree with few leaves) and that each successive tree corrects some of the error left by the trees before it. Each tree's contribution is scaled by the learning rate, which becomes an additional parameter to be tested when fitting a Gradient Boosted Tree.
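The residual-fitting loop can be sketched in plain Python with one-split trees. This is a simplified illustration under stated assumptions, not how MLlib implements it: the helpers are hypothetical, the data are made up, the starting prediction is the mean of the labels, and each tree is a single split on one feature.

```python
def fit_stump(xs, ys):
    """One-split regression tree: (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def boost(xs, ys, n_trees, step_size):
    """Fit each new stump to the residuals left by the model so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each tree's contribution is scaled by the learning rate
        preds = [p + step_size * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [0.05, 0.07, 0.10, 0.20, 0.25, 0.30]   # toy feature values
ys = [7.0, 8.0, 8.0, 12.0, 13.0, 14.0]      # toy ages
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
_, _, preds_10 = boost(xs, ys, n_trees=10, step_size=0.1)
_, _, preds_20 = boost(xs, ys, n_trees=20, step_size=0.1)
print(baseline_mse, mse(ys, preds_10), mse(ys, preds_20))
```

With a small learning rate each tree removes only part of the remaining error, so more trees are needed to reach the same training fit; that trade-off is why the number of trees and the learning rate are usually tuned together.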

In [38]:
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import Normalizer

Here we're going to try out the ```pyspark``` feature ```Normalizer()```. This rescales each row's feature vector in the 'features' column to unit norm; with ```p=1.0```, each value is divided by the sum of the absolute values in its row.

Except for adding this ```Normalizer``` to the end of our data prep, our data pipeline will remain exactly the same as it was for the Random Forest model.
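To see what normalizing with ```p=1.0``` does to a single row, here is the arithmetic in plain Python, using the feature values of the first training row. The helper `l1_normalize` is a hypothetical stand-in for the transformer.

```python
def l1_normalize(vector):
    """Divide each entry by the sum of absolute values of the vector."""
    total = sum(abs(v) for v in vector)
    return [v / total for v in vector] if total else list(vector)

# length, diameter, height, whole_weight, shucked_weight,
# viscera_weight, shell_weight, male, female (first training row)
first_row = [0.44, 0.34, 0.1, 0.451, 0.188, 0.087, 0.13, 0.0, 1.0]
normed = l1_normalize(first_row)
print(round(normed[0], 6))
```

The first entry comes out near 0.1608, in line with the 'features_norm' column in the cell output that follows. Note that the scaling is applied per row, not per column, so the same raw length can map to different normalized values in different rows.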

In [39]:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
normalizer.transform(assembler_rf.transform(sqlTrans_rf.transform(binary_mTrans.transform(indexer_mTrans.transform \
                    (binary_fTrans.transform(indexer_fTrans.transform(train))))))).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+--------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|            features|       features_norm|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+--------------------+
|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13| 11.5| 0.0|   1.0|[0.44,0.34,0.1,0....|[0.16081871345029...|
|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185| 11.5| 0.0|   1.0|[0.47,0.355,0.1,0...|[0.16587259572966...|
| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21| 15.5| 0.0|   1.0|[0.525,0.38,0.14,...|[0.16390883546674...|
|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33| 21.5| 0.0|   1.0|[0.53,0.415,0.15,...|[0.14800335101926...|
|  0.53|    0

Set up the ```CrossValidator()``` object with the updated model instance, paramGrid, and pipeline. Note that the features column used in building the model has been updated to the normalized features column ('features_norm').

In ```ParamGridBuilder()``` specify the number of trees to iterate through (```maxIter```), the maximum depth used for each tree (```maxDepth```), and the learning rate (```stepSize```).

Also note that 'normalizer' is added to the pipeline after the assembler, just before the model.

In [40]:
gb = GBTRegressor(labelCol='label', featuresCol='features_norm')
paramGrid_gb = ParamGridBuilder() \
.addGrid(gb.maxIter,[10, 15, 20])\
.addGrid(gb.maxDepth,[1, 5, 8])\
.addGrid(gb.stepSize, [0.1, 0.01])\
.build()
pipeline_gb = Pipeline(stages=[indexer_fTrans, binary_fTrans, indexer_mTrans, \
                binary_mTrans, sqlTrans_rf, assembler_rf, normalizer, gb])
crossval_gb = CrossValidator(estimator=pipeline_gb,
                        estimatorParamMaps=paramGrid_gb,
                        evaluator=RegressionEvaluator(),
                        numFolds=3)

Fit the model on the training set.

In [41]:
cv_gb_Model = crossval_gb.fit(train)

We can use the same attributes as in Random Forest to view the parameter results

In [42]:
cv_gb_Model.bestModel.stages[-1]

GBTRegressionModel: uid=GBTRegressor_e939eda97a25, numTrees=10, numFeatures=9

and feature importances

In [43]:
cv_gb_Model.bestModel.stages[-1].featureImportances

SparseVector(9, {0: 0.2179, 1: 0.0696, 2: 0.048, 3: 0.1221, 4: 0.1815, 5: 0.092, 6: 0.234, 7: 0.0092, 8: 0.0257})

Run predictions on the test set...

In [44]:
cv_gb_Model.transform(test).show(20)

+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+--------------------+------------------+
|length|diameter|height|whole_weight|shucked_weight|viscera_weight|shell_weight|label|male|female|            features|       features_norm|        prediction|
+------+--------+------+------------+--------------+--------------+------------+-----+----+------+--------------------+--------------------+------------------+
|  0.55|   0.415| 0.135|      0.7635|         0.318|          0.21|         0.2| 10.5| 0.0|   1.0|[0.55,0.415,0.135...|[0.15313935681470...|11.444813516423684|
| 0.565|    0.44| 0.155|      0.9395|        0.4275|         0.214|        0.27| 13.5| 0.0|   1.0|[0.565,0.44,0.155...|[0.14086262777362...|11.528288359204032|
|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|  8.5| 0.0|   0.0|[0.33,0.255,0.08,...|[0.31309297912713...|  7.96364266726127|
|  0.38|   0.275|   0.1|      0.2255|   

and calculate the **mae** to compare with the other models.

In [45]:
RegressionEvaluator(metricName='mae').evaluate(cv_gb_Model.transform(test))

1.4427118816091002

All three results are pretty close (**mae** of about 1.456, 1.449, and 1.443), but using **mae** as the metric, it looks like the best performing model on the test set was **gradient boosting**!