
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>



 
# Hyperopt Lab

The <a href="https://github.com/hyperopt/hyperopt" target="_blank">Hyperopt library</a> allows for parallel hyperparameter tuning using either random search or Tree of Parzen Estimators (TPE). With MLflow, we can record the hyperparameters and corresponding metrics for each hyperparameter combination. You can read more about <a href="https://github.com/hyperopt/hyperopt/blob/master/docs/templates/scaleout/spark.md" target="_blank">SparkTrials with Hyperopt</a>.

> SparkTrials fits and evaluates each model on one Spark executor, allowing massive scale-out for tuning. To use SparkTrials with Hyperopt, simply pass the SparkTrials object to Hyperopt's fmin() function.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:<br>

By the end of this lab, you should be able to:

* Train a single-node machine learning model in a distributed way
* Explain the difference between the `SparkTrials` class and the default `Trials` class

## Lab Setup

The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "../Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(1 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (9 seconds)



## Load Dataset

Read in a cleaned version of the Airbnb dataset with just numeric features.

In [0]:
from sklearn.model_selection import train_test_split
import pandas as pd # pandas loads the data on a single node; Spark is only used later to parallelize the tuning

# pandas reads through the local /dbfs/ mount, so rewrite the dbfs:/ URI into a local path
df = pd.read_csv("dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02/airbnb/sf-listings/airbnb-cleaned-mlflow.csv".replace("dbfs:/", "/dbfs/")).drop(["zipcode"], axis=1)

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1),
                                                    df[["price"]].values.ravel(),
                                                    test_size = 0.2,
                                                    random_state = 42) 
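
The path rewrite in the cell above is needed because pandas reads files through the local `/dbfs/` mount rather than the `dbfs:/` URI scheme. As a minimal sketch, the same conversion could be wrapped in a small helper. The name `to_local_path` and the example path are hypothetical and not part of the course code.

```python
def to_local_path(dbfs_uri: str) -> str:
    """Convert a dbfs:/ URI into the matching /dbfs/ local mount path.

    Hypothetical helper for illustration; it is not part of the lab's setup code.
    """
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"expected a dbfs:/ URI, got {dbfs_uri!r}")
    # Keep everything after the scheme and re-root it under /dbfs/
    return "/dbfs/" + dbfs_uri[len(prefix):]


# Example path is made up purely to show the conversion
local_path = to_local_path("dbfs:/mnt/example/data.csv")
print(local_path)
```

Unlike a bare `str.replace`, this version only touches the leading scheme and fails loudly on unexpected input.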


## Define Objective Function

Now we need to define an **`objective_function`** where you evaluate the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" target="_blank">random forest's</a> predictions using R2.

In the code below, compute the **`r2`** and return its negative: Hyperopt's **`fmin`** minimizes the objective, so maximizing R2 means minimizing -R2.

In [0]:
# TODO
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, r2_score
from numpy import mean
  
def objective_function(params):
    # Hyperparameters to tune; hp.quniform yields floats, so cast max_depth to int
    max_depth = int(params['max_depth'])
    max_features = params['max_features']

    regressor = RandomForestRegressor(max_depth=max_depth, max_features=max_features, random_state=42)

    # 3-fold cross-validated R2 on the training set (R2 is the default score for regressors)
    r2 = mean(cross_val_score(regressor, X_train, y_train, cv=3))

    # fmin minimizes the returned value, so return the negated R2 as the loss ("loss": -metric)
    return -r2
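
To make the sign convention concrete, here is a small plain-Python sketch that computes R2 by hand for two made-up sets of predictions and then picks the candidate with the lowest loss. The data, the model names, and the helper `r2_score_manual` are assumptions for illustration only.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices and predictions, for illustration only
y_true = [100.0, 150.0, 200.0, 250.0]
candidates = {
    "model_a": [110.0, 140.0, 210.0, 240.0],  # close to the true values
    "model_b": [175.0, 175.0, 175.0, 175.0],  # always predicts the mean
}

# A minimizer sees the negated R2, so the best model has the smallest loss
losses = {name: -r2_score_manual(y_true, preds) for name, preds in candidates.items()}
best = min(losses, key=losses.get)
print(losses, best)
```

Minimizing the negated score selects the same model that maximizing R2 would.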



## Define Search Space

We need to define a search space for Hyperopt. Let **`max_depth`** vary between 2 and 10, and let **`max_features`** be one of "auto", "sqrt", or "log2".

In [0]:
# TODO
from hyperopt import hp

# options for max_features: how many features each split considers
max_features_choices = ["auto", "sqrt", "log2"]
search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "max_features": hp.choice("max_features", max_features_choices)
}
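
The following sketch emulates, in plain Python, what drawing from this search space looks like. It assumes the documented definition of `hp.quniform`, `round(uniform(low, high) / q) * q`, and uses hypothetical helpers (`sample_quniform`, `sample_choice`) rather than Hyperopt's own sampler.

```python
import random


def sample_quniform(rng, low, high, q):
    """Emulate hp.quniform: round(uniform(low, high) / q) * q, returned as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


def sample_choice(rng, options):
    """Emulate hp.choice: the objective receives one of the options itself."""
    return rng.choice(options)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
choices = ["auto", "sqrt", "log2"]
samples = [
    {
        "max_depth": sample_quniform(rng, 2, 10, 1),
        "max_features": sample_choice(rng, choices),
    }
    for _ in range(5)
]
for sample in samples:
    print(sample)
```

Note that the quantized values are floats holding whole numbers, so casting `max_depth` to `int` before handing it to an estimator is a sensible precaution.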


## Train Models Concurrently

Instead of using the default **`Trials`** class, you can leverage the **`SparkTrials`** class to trigger the distribution of tuning tasks across Spark executors. On Databricks, SparkTrials are automatically logged with MLflow.

**`SparkTrials`** takes 3 optional arguments, namely **`parallelism`**, **`timeout`**, and **`spark_session`**. You can refer to this <a href="http://hyperopt.github.io/hyperopt/scaleout/spark/" target="_blank">page</a> to read more.

In the code below, fill in the **`fmin`** function.

In [0]:
# TODO
from hyperopt import fmin, tpe, SparkTrials
import mlflow
import numpy as np

# Number of models to evaluate
num_evals = 8
# Number of models to train concurrently
spark_trials = SparkTrials(parallelism=2)
# Automatically logs to MLflow
best_hyperparam = fmin(
    fn=objective_function, # loss to minimize (here, the negated cross-validated R2)
    space=search_space, # search over the max_depth and max_features hyperparameters
    max_evals=num_evals, # maximum number of models to train
    trials=spark_trials, # stores trial results and distributes trials across Spark executors
    algo=tpe.suggest # Tree of Parzen Estimators, a Bayesian approach
)

# Re-train best model and log metrics on test dataset
with mlflow.start_run(run_name="best_model"):
    # get optimal hyperparameter values
    best_max_depth = int(best_hyperparam["max_depth"]) # hp.quniform returns a float, so cast to int
    best_max_features = max_features_choices[best_hyperparam["max_features"]] # for hp.choice, fmin returns the index into max_features_choices

    # train model on entire training data
    regressor = RandomForestRegressor(max_depth=best_max_depth, max_features=best_max_features, random_state=42)
    regressor.fit(X_train, y_train)

    # evaluate on holdout/test data
    r2 = regressor.score(X_test, y_test)

    # Log param and metric for the final model
    mlflow.log_param("max_depth", best_max_depth)
    mlflow.log_param("max_features", best_max_features)
    mlflow.log_metric("loss", -r2) # negated test-set R2, matching the sign of the trial losses

Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.


  0%|          | 0/8 [00:00<?, ?trial/s, best loss=?]
 25%|██▌       | 2/8 [00:07<00:21,  3.59s/trial, best loss: -0.5718057846111207]
 50%|█████     | 4/8 [00:12<00:11,  2.99s/trial, best loss: -0.6468700448781513]
 75%|███████▌  | 6/8 [00:17<00:05,  2.80s/trial, best loss: -0.6658971922113791]
 88%|████████▊ | 7/8 [00:22<00:03,  3.34s/trial, best loss: -0.6658971922113791]
100%|██████████| 8/8 [00:23<00:00,  2.73s/trial, best loss: -0.6658971922113791]
100%|██████████| 8/8 [00:23<00:00,  2.93s/trial, best loss: -0.6658971922113791]


Total Trials: 8: 8 succeeded, 0 failed, 0 cancelled.





Now you can compare all of the models using the MLflow UI. 

To understand the effect of tuning a hyperparameter:

1. Select the resulting runs and click Compare.
2. In the Scatter Plot, select a hyperparameter for the X-axis and loss for the Y-axis.


## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>