# Custom hyperparameter search spaces and custom MlModels

In the final part of the classification tutorial, we will see how you can set up your own search spaces both within `do_datasciencing` and outside of it.

In [0]:
import datasciencefunctions as ds
import datasciencefunctions.classification as dsclass

# 0. Load and prepare data

* We will load the adult Databricks example dataset.
* It contains categorical, ordinal and numeric (continuous) predictors representing demographic info of US adults and a target denoting whether their income exceeded USD 50 000.
* We want to train a model to predict whether the person's income exceeds USD 50 000.

You can read more details in the README printed below.

In [0]:
with open("/dbfs/databricks-datasets/adult/README.md") as f:
    x = f.read()

print(x)

In [0]:
schema = """
  age DOUBLE,
  workclass STRING,
  fnlwgt DOUBLE,
  education STRING,
  education_num DOUBLE,
  marital_status STRING,
  occupation STRING,
  relationship STRING,
  race STRING,
  sex STRING,
  capital_gain DOUBLE,
  capital_loss DOUBLE,
  hours_per_week DOUBLE,
  native_country STRING,
  income STRING
"""

df_adult = (
    spark
    .read
    .format("csv")
    .schema(schema)
    .option("header", True)
    .option("path", "dbfs:/databricks-datasets/adult/adult.data")
    .load()
    .sample(fraction=0.35) # only take a sample of the dataset for tutorial purposes
)

df_adult.printSchema()

When running binary classification, datasciencefunctions expects you to specify a label (or target) column with values 1 and 0.

We will create a classification target column called "income_above_50K" with value 1 if the person's income exceeds USD 50K and 0 otherwise.

In [0]:
import pyspark.sql.functions as F

df_adult_ml = (
    df_adult
    .withColumn(
        "income_above_50K", 
        F.when(F.col("income")==" >50K", 1).otherwise(0)
    )
    # we drop the income column because it is perfectly correlated with our label
    .drop("income")
)

In [0]:
display(df_adult_ml.limit(10))

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_above_50K
50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,0
53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,0
37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,0
49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,0
30.0,State-gov,141297.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0.0,0.0,40.0,India,1
40.0,Private,121772.0,Assoc-voc,11.0,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0.0,0.0,40.0,?,1
32.0,Private,186824.0,HS-grad,9.0,Never-married,Machine-op-inspct,Unmarried,White,Male,0.0,0.0,40.0,United-States,0
54.0,Private,302146.0,HS-grad,9.0,Separated,Other-service,Unmarried,Black,Female,0.0,0.0,20.0,United-States,0
35.0,Federal-gov,76845.0,9th,5.0,Married-civ-spouse,Farming-fishing,Husband,Black,Male,0.0,0.0,40.0,United-States,0
43.0,Private,117037.0,11th,7.0,Married-civ-spouse,Transport-moving,Husband,White,Male,0.0,2042.0,40.0,United-States,0


In [0]:
import mlflow
from datasciencefunctions.utils import current_dbx_notebook_path

# set experiment based on current notebook path, you will probably want to change it to the shared experiment you'll be working on
mlflow.set_experiment(current_dbx_notebook_path(dbutils) + "_test_experiment")

# 1. Custom hyperparameter search

In `datasciencefunctions`, models are trained using a validation option of your choice and one of two hyperparameter search techniques:
* hyperopt
* param_grid

#### Hyperopt
`hyperopt` is a Python package for framework-agnostic hyperparameter optimization, so it can be used with models from e.g. scikit-learn and PySpark. It uses a stochastic technique with similarities to gradient descent (and also Markov chain Monte Carlo methods) to search the parameter space. During the search it explores the given hyperparameter space and in each iteration tries to move to the parts of the space with the highest chance of improving a given loss function (in our case the loss is the value `-metric`, because we try to maximize `metric`). Several advantages of `hyperopt` include:
 * No need to provide specific values for numeric parameters (just define the distribution).
 * Searches whole ranges, whereas `param_grid` only tries specific points in the space and can therefore miss good values that lie between them.
 * Usually finds better solution than `param_grid` if given the same number of combinations to try.
 * Easier control over the computation time needed (simply specify `max_evals` instead of calculating the number of points in the param grid).

However, since the method is stochastic by nature, it might produce different results on each run when a low number of iterations is used in a large hyperparameter space.

#### Hyperparameter grid
The `param_grid` option is the classical way of searching the hyperparameter space - the user gives a set of possible hyperparameter values as an input, all combinations of hyperparameters from the grid are tried and the one with the best loss function on the test set is then selected. The advantage of `param_grid` is the fact that it is deterministic and will explore all combinations of possible hyperparameters given by the user. However, it is not suitable for large hyperparameter spaces with broad ranges and large numbers of hyperparameters since it is computationally heavy. Unlike `hyperopt` it also requires the user to specify all possible options for each hyperparameter by hand.
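
To make the cost of an exhaustive grid concrete, here is a minimal sketch in plain Python. The grid below is a hypothetical example (a dictionary of value lists keyed by Spark random forest parameter names) and is not necessarily the syntax `datasciencefunctions` expects for `param_grid`; it only illustrates how the number of combinations multiplies.

```python
from itertools import product

# Hypothetical grid for illustration only: three candidate values per hyperparameter.
param_grid = {
    "maxDepth": [5, 10, 20],
    "numTrees": [10, 50, 100],
    "subsamplingRate": [0.5, 0.75, 1.0],
}

# Enumerate every combination, as an exhaustive grid search would.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 3 * 3 * 3 = 27 models to train
print(combinations[0])    # {'maxDepth': 5, 'numTrees': 10, 'subsamplingRate': 0.5}
```

Adding a fourth hyperparameter with three values would already triple the number of models to train, which is why grids become impractical for large spaces.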

## Custom searches with hyperopt

Let us first see how we can specify a custom hyperparameter space for our model when using the hyperopt technique. We will demonstrate that on a random forest classifier architecture in PySpark, which is available in `ds.MlModel` as `spark_random_forest_classifier`. The default hyperparameter space can be found as follows.

In [0]:
ds.MlModel.spark_random_forest_classifier.default_hyperopt_param_space

We still need to implement a function that prints the defaults in an easy-to-read way, but once we do, the output of the above will look like this:

```
{
    "max_depth": scope.int(hp.quniform("max_depth", 2, 50, 1)),
    "n_estimators": scope.int(hp.quniform("n_estimators", 1, 100, 1)),
}
```

This is the default hyperparameter space. You can read [how to define hyperopt spaces here (a part of hyperopt documentation)](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions).

In the example above (which is the default hyperopt search space for the `ds.MlModel.spark_random_forest_classifier` model), we say that we want to search the parameter space for the maximum tree depth from 2 to 50, considering every integer value in between. The same applies to the number of estimators (i.e. the number of trees), which ranges from 1 to 100.
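
The hyperopt documentation describes `hp.quniform(label, low, high, q)` as returning a value like `round(uniform(low, high) / q) * q`. The sketch below mimics that formula with the standard library only (it does not call hyperopt, and `quniform_sketch` is a hypothetical helper name), to show which values such an expression can produce.

```python
import random


def quniform_sketch(rng, low, high, q):
    """Mimic the documented hp.quniform formula: round(uniform(low, high) / q) * q."""
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(0)

# q=1 yields whole numbers stored as floats, hence the scope.int(...) wrapper
# in the default space; int(...) plays that role here.
depths = [int(quniform_sketch(rng, 2, 50, 1)) for _ in range(1000)]

# A larger q gives a coarser set of values (here multiples of 10).
coarse = [int(quniform_sketch(rng, 10, 100, 10)) for _ in range(1000)]

print(min(depths), max(depths))
print(sorted(set(coarse)))
```

Note that with a step of `q=10` and a lower bound below 5, the formula can round down to 0, which would not be a valid number of trees.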

Unlike `param_grid`, hyperopt allows you to define the search space for a hyperparameter as an interval and even allows you to skew the possible values to one side (making them more likely to be tried).

For instance in `ds.MlModel.spark_GLM_binomial` we could specify the search space for the regularization parameter as 

```"regParam": hp.uniform("regParam", 0.0, 1.0)```

which would tell hyperopt to search the entire interval from 0 to 1 uniformly or for example

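`"regParam": hp.loguniform("regParam", np.log(1e-4), np.log(1.0))`

if we believe values closer to zero are more likely to be useful but still want to include larger values as well. Note that `hp.loguniform` draws `exp(uniform(low, high))`, so its bounds are given on the log scale and the lower end of the range must be positive; `1e-4` is an illustrative choice here, and `np` refers to NumPy.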
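
Since the hyperopt documentation defines `hp.loguniform(label, low, high)` as `exp(uniform(low, high))`, the bounds are interpreted on the log scale. The following standard-library sketch (it does not call hyperopt; `loguniform_sketch` and the bound `1e-4` are illustrative assumptions) shows the effect of passing raw bounds versus log-transformed ones.

```python
import math
import random


def loguniform_sketch(rng, low, high):
    """Mimic the documented hp.loguniform formula: exp(uniform(low, high))."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)

# Raw bounds 0.0 and 1.0 produce values between exp(0) = 1 and exp(1) = e.
raw_bounds = [loguniform_sketch(rng, 0.0, 1.0) for _ in range(10_000)]

# Log-transformed bounds produce values between 1e-4 and 1, skewed towards zero.
log_bounds = [
    loguniform_sketch(rng, math.log(1e-4), math.log(1.0)) for _ in range(10_000)
]
share_below_0_01 = sum(v < 0.01 for v in log_bounds) / len(log_bounds)

print(min(raw_bounds), max(raw_bounds))
print(min(log_bounds), max(log_bounds))
print(share_below_0_01)  # roughly half of the samples fall below 0.01
```

About half of the samples land in the lowest two of the four decades, which is the skew towards small values that a regularization parameter usually benefits from.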

Let us now see how to change the hyperopt hyperparameter search space within `do_datasciencing`:

In [0]:
from hyperopt import hp
from hyperopt.pyll import scope

cat_list = [
    "workclass",
    "education",
    "marital_status",
    "race",
    "sex",
    "native_country",
]

num_list = [
    "age",
    "capital_gain",
    "capital_loss",
    "hours_per_week",
]

df_test, df_train, model_summary = (
    dsclass.do_datasciencing(
        df_adult_ml,
        model_type=ds.MlModel.spark_random_forest_classifier,
        label_col="income_above_50K",
        # we are logging the model as an MLFlow experiment
        use_mlflow=True,
        # we specify the categorical and continuous columns and scaling used
        params_fit_pipeline={
            "cat_cols": cat_list,
            "num_cols": num_list,
            "scaling": "standard",
        },  
        # now we use params_fit_model to specify the desired number of hyperopt iterations (given by "max_evals")
        # and define a custom hyperparameter space 
        params_fit_model={
            "max_evals": 10,
            "custom_params": {
                "impurity": hp.choice("impurity", ["gini"]),
                "numTrees": scope.int(hp.quniform("numTrees", 1, 100, 1)),
                "subsamplingRate": hp.uniform("subsamplingRate", 0.5, 1.0),
            },
        },    
    )
)

In the code above, what interests us is `"max_evals"` where we set the number of hyperopt iterations to `10` and `"custom_params"` where we put in a dictionary of hyperparameter ranges in the form

```<parameter_name>: <parameter_range_in_hyperopt_syntax>```.

In the example above we have:
```
"custom_params": {
    "impurity": hp.choice("impurity", ["gini"]),
    "numTrees": scope.int(hp.quniform("numTrees", 1, 100, 1)),
    "subsamplingRate": hp.uniform("subsamplingRate", 0.5, 1.0),
},
```
which says that:
1. We want to set impurity to gini (we would not even have to specify this since that is the default value used in Spark and all hyperparameters not specified in the hyperparameter space are set to their PySpark/scikit-learn defaults automatically)
2. We want to explore different numbers of trees, trying every integer from 1 to 100 as in the default (the last argument of `hp.quniform` is the step, here 1)
3. We also want to see how the random forest model performs with different subsampling rates in the interval between 0.5 and 1.0

#### Modifying the default hyperparameter search space
Suppose we only want to add the subsampling rate to our hyperparameter space and otherwise keep the default space. In that case, we do not have to rewrite the entire hyperparameter search space. Since it is given as a dictionary of hyperparameter names and hyperopt (or param_grid) ranges, you can simply add the new hyperparameter to the dictionary (or overwrite an existing one) like this:

In [0]:
# copy the default space so that adding a key cannot modify the default dictionary in place
modified_param_space = dict(ds.MlModel.spark_random_forest_classifier.default_hyperopt_param_space)
modified_param_space["subsamplingRate"] = hp.uniform("subsamplingRate", 0.5, 1.0)

...and then we just pass our new param_space into custom_params:

In [0]:
df_test, df_train, model_summary = (
    dsclass.do_datasciencing(
        df_adult_ml,
        model_type=ds.MlModel.spark_random_forest_classifier,
        label_col="income_above_50K",
        # we are logging the model as an MLFlow experiment
        use_mlflow=True,
        # we specify the categorical and continuous columns and scaling used
        params_fit_pipeline={
            "cat_cols": cat_list,
            "num_cols": num_list,
            "scaling": "standard",
        },  
        # now we use params_fit_model to specify the desired number of hyperopt iterations (given by "max_evals")
        # and define a custom hyperparameter space 
        params_fit_model={
            "max_evals": 10,
            "custom_params": modified_param_space,
        },    
    )
)
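
One thing to watch when modifying a default space: in Python, assigning a dictionary to a new name does not copy it. Whether `default_hyperopt_param_space` returns a fresh dictionary on each access is not stated here, so the sketch below uses a plain stand-in dictionary (with placeholder strings instead of hyperopt expressions) to show the difference between an alias and a shallow copy.

```python
# Stand-in for a library default; the string values are placeholders,
# not real hyperopt expressions.
default_space = {"maxDepth": "depth range", "numTrees": "tree range"}

# Plain assignment creates a second name for the same dictionary...
aliased_space = default_space
aliased_space["subsamplingRate"] = "rate range"
alias_changed_default = "subsamplingRate" in default_space

# ...whereas a shallow copy leaves the original untouched.
default_space = {"maxDepth": "depth range", "numTrees": "tree range"}
copied_space = dict(default_space)
copied_space["subsamplingRate"] = "rate range"
copy_changed_default = "subsamplingRate" in default_space

print(alias_changed_default, copy_changed_default)  # True False
```

Copying with `dict(...)` before adding keys is therefore the safer habit if the same default space may be reused later in the notebook.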

## Custom searches with `param_grid`

Custom hyperparameter spaces for `param_grid` are defined in almost the same way as for hyperopt. The only difference is the syntax of the hyperparameter ranges. You can find the default values as follows:

In [0]:
ds.MlModel.spark_random_forest_classifier.default_param_grid_values

# Where to now?

You've finished the tutorial series!

[Back to the introductory notebook](classification.ipynb)