# Hyperparameter Tuning with Scikit-Learn and Dask Demo

## **DSE 230**

### Import libraries

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

### Load Data from weather_encoded.csv

In [None]:
data_file = "weather_encoded.csv"
df = pd.read_csv(data_file)

### Print the columns of the dataframe

In [None]:
df.columns

### Print the number of rows in the dataframe

In [None]:
len(df)

### Print the schema of the dataframe

In [None]:
df.info()

## Prepare Data

Drop the column `RISK_MM`. Assign `RainTomorrow` as the target label

In [None]:
features, label = df.drop(["RainTomorrow", "RISK_MM"], axis=1), df["RainTomorrow"]

print(f"feature columns: \n{features.columns}\n")
print(f"Features: \n{features.head()}\n")
print(f"Labels: \n{label.head()}")

### Split into train and test data
* Pass `random_state=seed` to `train_test_split` for reproducibility
* Use 60% for training and 40% for testing
* Print the number of samples in train data and test data

In [None]:
from sklearn.model_selection import train_test_split

seed=123
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.4, random_state=seed)
print("Number of train samples:", len(X_train))
print("Number of test samples:", len(X_test))

### Build Decision Tree Classifier

* Refer - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
* Pass `random_state=seed` as an argument to `DecisionTreeClassifier`

In [None]:
from sklearn.tree import DecisionTreeClassifier

<< YOUR CODE HERE >>

### Print train and test accuracy
Refer - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.score

In [None]:
print("Train accuracy:", << YOUR CODE HERE >>)
print("Test accuracy:", << YOUR CODE HERE >>)
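
As a point of reference, the fit-and-score pattern described above can be sketched on synthetic data. This is a minimal sketch, not the notebook's solution: `make_classification` stands in for the weather data, and the names `X_demo`, `y_demo`, and `demo_tree` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 123

# Synthetic stand-in for the weather features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=seed)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=seed
)

# Fit an unconstrained tree; random_state makes tie-breaking reproducible.
demo_tree = DecisionTreeClassifier(random_state=seed)
demo_tree.fit(X_tr, y_tr)

# score() returns mean accuracy on the given data and labels.
train_acc = demo_tree.score(X_tr, y_tr)
test_acc = demo_tree.score(X_te, y_te)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

An unconstrained tree typically fits the training split far better than the test split, which is the gap the tuning section below tries to narrow.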

### Parameters of the model
* `<model>.tree_` returns the underlying tree structure of a fitted model
* Print the `max_depth` of the fitted tree (`<model>.tree_.max_depth`)
* Refer - https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py
* Print the parameters of the Decision tree model using `<model>.get_params()`
* Refer - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.get_params

In [None]:
<< YOUR CODE HERE >>
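
The difference between the fitted tree's depth and the `max_depth` hyperparameter is easy to miss, so here is a small sketch on synthetic data. The names `X_demo`, `y_demo`, and `demo_tree` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=123)
demo_tree = DecisionTreeClassifier(random_state=123).fit(X_demo, y_demo)

# tree_ only exists after fit(); its max_depth is the depth actually reached.
fitted_depth = demo_tree.tree_.max_depth
print("Fitted depth:", fitted_depth)
print("Node count:", demo_tree.tree_.node_count)

# get_params() returns the constructor settings, not the fitted state.
params = demo_tree.get_params()
print("max_depth hyperparameter:", params["max_depth"])
```

With default settings the `max_depth` hyperparameter is `None` (no limit), while `tree_.max_depth` reports how deep the tree actually grew.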

## Hyperparameter Tuning

### Set up grid search

* Parameters:
    * `max_depth` in the range \[1,10) - Maximum depth of the tree
    * `min_samples_split` in the range \[2, 10) - Minimum number of samples required to split an internal node
    * `criterion` in ['gini', 'entropy'] - Criterion for splitting at a given node

* Refer - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth'        : list(range(1, 10)),
              'min_samples_split': list(range(2, 10)),
              'criterion'        : ['gini','entropy']
             }
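
To get a feel for how much work this grid implies, the number of candidate settings can be counted with `ParameterGrid`. This is a small sketch; `n_candidates` and `n_folds` are illustrative names, and the fold count of 10 assumes the `cv` value used in the next section.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": list(range(1, 10)),         # 9 values
    "min_samples_split": list(range(2, 10)),  # 8 values
    "criterion": ["gini", "entropy"],        # 2 values
}

# Grid search tries every combination: 9 * 8 * 2 candidates.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 10  # assumed cross-validation setting

print("Candidates:", n_candidates)                       # 144
print("Cross-validation fits:", n_candidates * n_folds)  # 1440
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training data, so the total is one fit higher.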

### Scikit-learn with no parallelism

* `%%time` calculates the time taken to execute the cell
* `cv` parameter determines the cross-validation splits. Pass `10` as the value for `cv`
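* `param_grid` is a dictionary with model parameter names as keys and lists of parameter settings to try as values
* Pass `cv` and `param_grid` as parameters to `GridSearchCV`. `GridSearchCV` itself has no random-state argument, so pass `random_state=seed` to the `DecisionTreeClassifier` used as the estimator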

Refer - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
%%time 

<< YOUR CODE HERE >>
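
The overall shape of a sequential grid search can be sketched as follows. It is a minimal illustration under stated assumptions: synthetic data replaces the weather data, the grid is deliberately smaller than the one above, `cv=5` is used to keep it quick, and `demo_search` and `small_grid` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 123

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=seed)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=seed
)

# Reduced grid so the sketch runs in a moment.
small_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

# The seed goes to the estimator; GridSearchCV takes the grid and cv.
demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=seed),
    param_grid=small_grid,
    cv=5,
)
demo_search.fit(X_tr, y_tr)

print("Best params:", demo_search.best_params_)
print("Best CV score:", demo_search.best_score_)
# score() on the search object uses the refitted best estimator.
print("Test accuracy:", demo_search.score(X_te, y_te))
```

Note that `best_score_` is the mean cross-validated score on the training folds, which is a different number from the accuracy on the held-out test split.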

### Print `best_params_` and `best_score_` of the grid search model
Use the fitted grid search object here, *not* `dt_model`.

In [None]:
<< YOUR CODE HERE >>

### Print train and test accuracy of the grid search model
Use the fitted grid search object here, *not* `dt_model`.

In [None]:
print("Train accuracy:", << YOUR CODE HERE >>)
print("Test accuracy:", << YOUR CODE HERE >>)

### Parallelism with Scikit-Learn
Same as above, but in addition to `cv` and `param_grid`, pass `n_jobs=-1` to `GridSearchCV` so that all available CPU cores are used. Keep `random_state=seed` on the `DecisionTreeClassifier`.

In [None]:
%%time 

<< YOUR CODE HERE >>
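
Setting `n_jobs=-1` changes how the fits are scheduled, not what is computed. The sketch below, on synthetic data with a reduced grid (both assumptions, with hypothetical names `sequential` and `parallel`), fits the same search both ways and compares the outcome.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

seed = 123

# Synthetic stand-in data and a reduced grid (assumptions).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=seed)
small_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

# Same search twice: one core versus all available cores.
sequential = GridSearchCV(
    DecisionTreeClassifier(random_state=seed), small_grid, cv=5
)
parallel = GridSearchCV(
    DecisionTreeClassifier(random_state=seed), small_grid, cv=5, n_jobs=-1
)

sequential.fit(X_demo, y_demo)
parallel.fit(X_demo, y_demo)

print("Sequential best:", sequential.best_params_)
print("Parallel best:  ", parallel.best_params_)
```

On a grid this small the parallel run may not be faster, since starting worker processes has a cost; the benefit shows up with larger grids such as the 144-candidate one above.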

### Print `best_params_` and `best_score_`
Same as above

In [None]:
<< YOUR CODE HERE >>

### Print train and test accuracy
Same as above

In [None]:
<< YOUR CODE HERE >>

### Parallelism with Dask

Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism. Joblib is what backs the `n_jobs=` parameter in normal use of Scikit-Learn.

Dask can scale these Joblib-backed algorithms out to a **cluster of machines** by providing an alternative Joblib backend.

Refer - https://ml.dask.org/joblib.html

In [None]:
import joblib
from dask.distributed import Client

# Start a local cluster with two workers and connect a client to it
client = Client(n_workers=2)

# client = Client("scheduler-address:8786")  # connecting to remote cluster

In [None]:
client

### Use dask backend to Joblib

* To use the Dask backend to Joblib you have to create a Client, and wrap your `scikit-learn` code with `joblib.parallel_backend('dask')`

In [None]:
%%time 

# Set up grid search model here
<< YOUR CODE HERE >>
with joblib.parallel_backend("dask"):
    # model.fit here
    << YOUR CODE HERE >>
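
The wrapping pattern above can be sketched end to end as follows. This is a hedged illustration rather than the notebook's solution: it uses synthetic data and a reduced grid, the helper `run_dask_grid_search` is hypothetical, and the client is started with `processes=False` (an in-process, thread-based local cluster) purely so the sketch runs without spawning worker processes.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def run_dask_grid_search(seed=123):
    """Fit a small grid search through Joblib's Dask backend (illustrative helper)."""
    # Synthetic stand-in data and a reduced grid (assumptions).
    X_demo, y_demo = make_classification(
        n_samples=300, n_features=8, random_state=seed
    )
    small_grid = {"max_depth": [2, 4], "criterion": ["gini", "entropy"]}
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=seed), small_grid, cv=3
    )

    # In-process cluster for portability; dashboard_address=None skips the dashboard.
    with Client(
        processes=False, n_workers=1, threads_per_worker=2, dashboard_address=None
    ):
        # Inside this block, scikit-learn's Joblib calls are routed to Dask.
        with joblib.parallel_backend("dask"):
            search.fit(X_demo, y_demo)
    return search


dask_search = run_dask_grid_search()
print("Best params:", dask_search.best_params_)
print("Best CV score:", dask_search.best_score_)
```

Only the `fit` call needs to sit inside the `parallel_backend` block; the search object can be constructed and inspected outside it.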

### Print `best_params_` and `best_score_`
Same as above

In [None]:
<< YOUR CODE HERE >>

### Print train and test accuracy
Same as above

In [None]:
<< YOUR CODE HERE >>

### Shut down the cluster and close the client

In [None]:
client.shutdown()