# Hyperparameter tuning with Hyperopt

In this lab, you will learn to tune hyperparameters in Azure Databricks. This lab will cover the following exercises:
- Exercise 2: Using Hyperopt for hyperparameter tuning.

To upload the necessary data, please follow the instructions in the lab guide.

## Attach notebook to your cluster
Before executing any cells in the notebook, you need to attach it to your cluster. Make sure that the cluster is running.

In the notebook's toolbar, select the drop-down arrow next to Detached, and then select your cluster under Attach to.

Make sure you run each cell in order.

## Exercise 2: Using Hyperopt for hyperparameter tuning
[Hyperopt](https://github.com/hyperopt/hyperopt) is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the `SparkTrials` class for distributed tuning.  

This exercise illustrates how to scale up hyperparameter tuning for a single-machine Python ML algorithm and track the results using MLflow. Run the cell below to import the required packages.

In [0]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed, so no separate installation step is needed. The import itself is still required because `mlflow` is used later in this notebook.
import mlflow

### Load the data
In this exercise, you will use a dataset that includes chemical and visual features of wine samples to classify them based on their cultivar (grape variety).

The dataset consists of 13 numeric features and a classification label with the following classes:
- **0** (variety A)
- **1** (variety B)
- **2** (variety C)

Run the following cell to load the table into a Spark DataFrame and review its contents.

In [0]:
df = spark.sql("select * from wine")
display(df)

Separate the features from the label (WineVariety):

In [0]:
import numpy as np
df_features = df.select('Alcohol','Malic_acid','Ash','Alcalinity','Magnesium','Phenols','Flavanoids','Nonflavanoids','Proanthocyanins','Color_intensity','Hue','OD280_315_of_diluted_wines','Proline').collect()
X = np.array(df_features)

df_label = df.select('WineVariety').collect()
# Flatten the (n, 1) array of rows into a 1-D label array, which scikit-learn expects
y = np.array(df_label).ravel()

Check the first four wines to see if the data is loaded in correctly:

In [0]:
for n in range(0,4):
    print("Wine", str(n+1), "\n  Features:",list(X[n]), "\n  Label:", y[n])

## Part 1. Single-machine Hyperopt workflow

Here are the steps in a Hyperopt workflow:  
1. Define a function to minimize.  
2. Define a search space over hyperparameters.  
3. Select a search algorithm.  
4. Run the tuning algorithm with Hyperopt `fmin()`.

For more information, see the [Hyperopt documentation](https://github.com/hyperopt/hyperopt/wiki/FMin).

### Define a function to minimize
In this example, we use a support vector machine classifier. The objective is to find the best value for the regularization parameter `C`.  

Most of the code for a Hyperopt workflow is in the objective function. This example uses the [support vector classifier from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}

### Define the search space over hyperparameters

See the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) for details on defining a search space and parameter expressions.

In [0]:
search_space = hp.lognormal('C', 0, 1.0)
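
A lognormal search space draws values whose logarithm is normally distributed, so every candidate for `C` is positive. The sketch below mimics `hp.lognormal('C', 0, 1.0)` with NumPy instead of calling Hyperopt, purely to show the shape of the distribution; the seed and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# A lognormal variable with mu=0, sigma=1 is exp() of a standard normal draw.
samples = np.exp(rng.normal(loc=0.0, scale=1.0, size=10_000))

share_in_range = np.mean((samples > 0.1) & (samples < 10))
print("min:", samples.min())
print("median:", np.median(samples))
print("share between 0.1 and 10:", share_in_range)
```

The median sits near 1 and most draws fall within about one order of magnitude of it, which suits a parameter such as `C` that is usually explored on a multiplicative scale.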

### Select a search algorithm

The two main choices are:
* `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results
* `hyperopt.rand.suggest`: Random search, a non-adaptive approach that samples over the search space

In [0]:
algo=tpe.suggest

### Run the tuning algorithm with Hyperopt `fmin()`

Set `max_evals` to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate.

In [0]:
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16)

In [0]:
# Print the best value found for C
print("Best value found: ", argmin)

## Part 2. Distributed tuning using Apache Spark and MLflow

To distribute tuning, add one more argument to `fmin()`: a `Trials` class called `SparkTrials`. 

`SparkTrials` takes 2 optional arguments:  
* `parallelism`: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
* `timeout`: Maximum time (in seconds) that `fmin()` can run. The default is no maximum time limit.

This example uses the very simple objective function defined in Part 1. In this case, the function runs quickly and the overhead of starting the Spark jobs dominates the calculation time, so the distributed run takes longer than the single-machine run. For typical real-world problems, the objective function is more complex, and using `SparkTrials` to distribute the calculations will be faster than single-machine tuning.

Automated MLflow tracking is enabled by default. To use it, call `mlflow.start_run()` before calling `fmin()` as shown in the example.

In [0]:
from hyperopt import SparkTrials

# To display the API documentation for the SparkTrials class, uncomment the following line.
# help(SparkTrials)

In [0]:
spark_trials = SparkTrials()

with mlflow.start_run():
  argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)

In [0]:
# Print the best value found for C
print("Best value found: ", argmin)

To view the MLflow experiment associated with the notebook, click the **Experiment** icon in the notebook context bar on the upper right.  There, you can view all runs. To view runs in the MLflow UI, click the icon at the far right next to **Experiment Runs**. 

To examine the effect of tuning `C`:

1. Select the resulting runs and click **Compare**.
1. In the Scatter Plot, select **C** for X-axis and **loss** for Y-axis.