
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# Random Forests and Hyperparameter Tuning

Now let's take a look at how to tune random forests using grid search and cross validation in order to find the optimal hyperparameters.  Using the Databricks Runtime for ML, MLflow automatically logs metrics of all your experiments with the SparkML cross-validator as well as the best fit model!

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Lesson Objectives:<br>

By the end of this lesson, you should be able to;

* Tune hyperparameters using Spark ML’s grid search feature
* Explain cross validation concepts and how to use cross validation in Spark ML pipelines
* Optimize a Spark ML pipeline


## 📌 Requirements

**Required Databricks Runtime Version:** 
* Please note that in order to run this notebook, you must use one of the following Databricks Runtime(s): **12.2.x-cpu-ml-scala2.12**

## Lesson Setup

The first thing we're going to do is to **run setup script**. This script will define the required configuration variables that are scoped to each user

In [0]:
%run "./Includes/Classroom-Setup"

Python interpreter will be restarted.
Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(1 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (12 seconds)



## Build a Model Pipeline

In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline

# load data and split between train and test data 
file_path = f"dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

# then do "feature eng" 
categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
# vector assemble = ML friendly format adding all numeric and caterogical variables into a single column as a list 
numeric_cols = [field for (field, dataType) in train_df.dtypes if ((dataType == "double") & (field != "price"))]
assembler_inputs = index_output_cols + numeric_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

# choose the algo, predicting the price and setting max bins so data can be split into multiple workers evenly
rf = RandomForestRegressor(labelCol="price", maxBins=40)
stages = [string_indexer, vec_assembler, rf]
# run all the steps sequentially 
pipeline = Pipeline(stages=stages)




## ParamGrid

First let's take a look at the various hyperparameters we could tune for random forest.

**Pop quiz:** what's the difference between a parameter and a hyperparameter?

In [0]:
print(rf.explainParams())

bootstrap: Whether bootstrap samples are used when building trees. (default: True)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the featur




There are a lot of hyperparameters we could tune, and it would take a long time to manually configure.

Instead of a manual (ad-hoc) approach, let's use Spark's <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.ParamGridBuilder.html?highlight=paramgridbuilder#pyspark.ml.tuning.ParamGridBuilder" target="_blank">ParamGridBuilder</a> to find the optimal hyperparameters in a more systematic approach.

Let's define a grid of hyperparameters to test:
  - **`maxDepth`**: max depth of each decision tree (Use the values **`2, 5`**)
  - **`numTrees`**: number of decision trees to train (Use the values **`5, 10`**)

**`addGrid()`** accepts the name of the parameter (e.g. **`rf.maxDepth`**), and a list of the possible values (e.g. **`[2, 5]`**).

In [0]:
from pyspark.ml.tuning import ParamGridBuilder

param_grid = (ParamGridBuilder()
              .addGrid(rf.maxDepth, [2, 5]) # depth of each tree - list of possible values are depth or 2 or 5
              .addGrid(rf.numTrees, [5, 10]) # list of possible values are 5 trees and 10 trees
              .build())




## Cross Validation

We are also going to use 3-fold cross validation to identify the optimal hyperparameters.

![crossValidation](https://files.training.databricks.com/images/301/CrossValidation.png)

With 3-fold cross-validation, we train on 2/3 of the data, and evaluate with the remaining (held-out) 1/3. We repeat this process 3 times, so each fold gets the chance to act as the validation set. We then average the results of the three rounds.




We pass in the **`estimator`** (pipeline), **`evaluator`**, and **`estimatorParamMaps`** to <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html?highlight=crossvalidator#pyspark.ml.tuning.CrossValidator" target="_blank">CrossValidator</a> so that it knows:
- Which model to use
- How to evaluate the model
- What hyperparameters to set for the model

We can also set the number of folds we want to split our data into (3), as well as setting a seed so we all have the same split in the data.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator

evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

cv = CrossValidator(estimator=pipeline, # includes preprocessing of the data and algo we want to use - need to rebuild over and over
                    evaluator=evaluator, # defines how we evaluate model success 
                    estimatorParamMaps=param_grid, # what hyperparameters to use 
                    numFolds=3, # split our data into 3 - use 2/3 for training and 1/3 for validation, switching up which third is used for validation 
                    seed=42 # random number to make result reproducable 
                    )




**Question**: How many models are we training right now?

In [0]:
cv_model = cv.fit(train_df)




**Question**: Should we put the pipeline in the cross validator, or the cross validator in the pipeline?

It depends if there are estimators or transformers in the pipeline. If you have things like StringIndexer (an estimator) in the pipeline, then you have to refit it every time if you put the entire pipeline in the cross validator.

However, if there is any concern about data leakage from the earlier steps, the safest thing is to put the pipeline inside the CV, not the other way. CV first splits the data and then .fit() the pipeline. If it is placed at the end of the pipeline, we potentially can leak the info from hold-out set to train set.

In [0]:
cv = CrossValidator(estimator=rf, evaluator=evaluator, estimatorParamMaps=param_grid, 
                    numFolds=3, seed=42)

stages_with_cv = [string_indexer, vec_assembler, cv] # much faster to avoid rebuilding the pipeline each time = duplicate work for each pass
pipeline = Pipeline(stages=stages_with_cv)

pipeline_model = pipeline.fit(train_df)

In the current method, only **one MLflow run is logged**, whereas in the previous method, **five runs** were logged. This is because, in the first method, the pipeline was placed within the CrossValidator, which automatically logs all runs. However, in the second method, since **the pipeline only returns the best model without evaluating metrics**, only that single model is seen. Additionally, no evaluation metrics are logged. Essentially, the Pipeline logs all stages and directly returns the best model without performing model evaluations.




Let's take a look at the model with the best hyperparameter configuration

In [0]:
results = list(zip(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics))
print(results) 
print(len(results)) # 4
# so we do a combo of  the different params = 2 x 2 = 4 models created
# model 1 = train 5 trees with depth of 2 
# model 2 = train 10 trees with depth of 2
# model 3 = train 5 trees with depth of 5
# model 4 = train 10 trees with depth of 5


[({Param(parent='RandomForestRegressor_7336550d560e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2, Param(parent='RandomForestRegressor_7336550d560e', name='numTrees', doc='Number of trees to train (>= 1).'): 5}, 280.13676707305024), ({Param(parent='RandomForestRegressor_7336550d560e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2, Param(parent='RandomForestRegressor_7336550d560e', name='numTrees', doc='Number of trees to train (>= 1).'): 10}, 280.00286295955465), ({Param(parent='RandomForestRegressor_7336550d560e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5, Param(parent='RandomForestRegressor_7336550d560e', name='numTrees', do

In [0]:
# the pipeline_model will only store the model that had the best performance based on the performance metric we defined in the evaluator 
pred_df = pipeline_model.transform(test_df)

rmse = evaluator.evaluate(pred_df)
r2 = evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 0.35841179228176645
R2 is 0.35841179228176645





Progress!  Looks like we're out-performing decision trees.


## Classroom Cleanup

Run the following cell to remove lessons-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>