# Hyperparameter Tuning with Scikit-Learn and Dask Demo

## **DSE 230**

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

### Load Data from weather_encoded.csv

In [2]:
data_file = "weather_encoded.csv"
df = pd.read_csv(data_file)

### Print the columns of the dataframe

In [3]:
df.columns

Index(['Unnamed: 0', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am',
       'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday', 'RISK_MM',
       'RainTomorrow', 'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N',
       'WindGustDir_NE', 'WindGustDir_NNE', 'WindGustDir_NNW',
       'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE', 'WindGustDir_SSE',
       'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW',
       'WindGustDir_WSW', 'WindDir9am_ENE', 'WindDir9am_ESE', 'WindDir9am_N',
       'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW', 'WindDir9am_NW',
       'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE', 'WindDir9am_SSW',
       'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW', 'WindDir9am_WSW',
       'WindDir3pm_ENE', 'WindDir3pm_ESE', 'WindDir3pm_N', 'WindDir3pm_NE',
       'WindDir3pm_NNE', 'WindDir3p

### Print the number of rows in the dataframe

In [4]:
len(df)

328

### Print the schema of the dataframe

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 328 entries, 0 to 327
Data columns (total 65 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       328 non-null    int64  
 1   MinTemp          328 non-null    float64
 2   MaxTemp          328 non-null    float64
 3   Rainfall         328 non-null    float64
 4   Evaporation      328 non-null    float64
 5   Sunshine         328 non-null    float64
 6   WindGustSpeed    328 non-null    int64  
 7   WindSpeed9am     328 non-null    int64  
 8   WindSpeed3pm     328 non-null    int64  
 9   Humidity9am      328 non-null    int64  
 10  Humidity3pm      328 non-null    int64  
 11  Pressure9am      328 non-null    float64
 12  Pressure3pm      328 non-null    float64
 13  Cloud9am         328 non-null    int64  
 14  Cloud3pm         328 non-null    int64  
 15  Temp9am          328 non-null    float64
 16  Temp3pm          328 non-null    float64
 17  RainToday       

## Prepare Data

Drop the column `RISK_MM` and use `RainTomorrow` as the target label. Both columns are removed from the feature set.

In [6]:
features, label = df.drop(["RainTomorrow", "RISK_MM"], axis=1), df["RainTomorrow"]

print(f"feature columns: \n{features.columns}\n")
print(f"Features: \n{features.head()}\n")
print(f"Labels: \n{label.head()}")

feature columns: 
Index(['Unnamed: 0', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am',
       'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday', 'WindGustDir_ENE',
       'WindGustDir_ESE', 'WindGustDir_N', 'WindGustDir_NE', 'WindGustDir_NNE',
       'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE',
       'WindGustDir_SSE', 'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W',
       'WindGustDir_WNW', 'WindGustDir_WSW', 'WindDir9am_ENE',
       'WindDir9am_ESE', 'WindDir9am_N', 'WindDir9am_NE', 'WindDir9am_NNE',
       'WindDir9am_NNW', 'WindDir9am_NW', 'WindDir9am_S', 'WindDir9am_SE',
       'WindDir9am_SSE', 'WindDir9am_SSW', 'WindDir9am_SW', 'WindDir9am_W',
       'WindDir9am_WNW', 'WindDir9am_WSW', 'WindDir3pm_ENE', 'WindDir3pm_ESE',
       'WindDir3pm_N', 'WindDir3pm_NE', 'WindDir3pm_NNE', 'WindDir3pm_NNW',
       '

### Split into train and test data

In [7]:
from sklearn.model_selection import train_test_split

seed = 123
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.4, random_state=seed)
print("Number of train samples:", len(X_train))
print("Number of test samples:", len(X_test))

Number of train samples: 196
Number of test samples: 132
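The 196/132 split follows from how `train_test_split` rounds a fractional `test_size`: the test count is rounded up and the remainder goes to training. A minimal sketch that reproduces the counts with placeholder arrays (the arrays `X_demo` and `y_demo` are stand-ins, not the weather data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 328  # same row count as the weather dataframe

# Placeholder data; only the number of rows matters here
X_demo = np.arange(n_samples).reshape(-1, 1)
y_demo = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=123
)

# Expected test size: ceil(0.4 * 328) = ceil(131.2) = 132
expected_test = math.ceil(0.4 * n_samples)
print(len(X_tr), len(X_te), expected_test)
```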


### Build Decision Tree Classifier

* Refer - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
* Pass `random_state=seed` as an argument to `DecisionTreeClassifier`

In [8]:
from sklearn.tree import DecisionTreeClassifier

# seed for reproducing the same result
dt_model = DecisionTreeClassifier(random_state=seed)

dt_model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=123)

### Print train and test accuracy

In [9]:
print("Train accuracy:", dt_model.score(X_train, y_train))
print("Test accuracy:", dt_model.score(X_test, y_test))

Train accuracy: 1.0
Test accuracy: 0.7196969696969697


### Hyperparameters of the model
* `dt_model.tree_` returns the underlying tree structure of the fitted model
* Refer - https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py

In [10]:
print(dt_model.tree_.max_depth)
dt_model.get_params()

7


{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 123,
 'splitter': 'best'}
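The output above shows two different things: `tree_.max_depth` is the depth the fitted tree actually reached, while the `max_depth` entry in `get_params()` is the hyperparameter, which stays `None` (no limit) by default. A small sketch of the distinction, using a synthetic dataset from `make_classification` as an assumption in place of the weather data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the weather dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=123)

clf = DecisionTreeClassifier(random_state=123).fit(X_demo, y_demo)

# Learned structure: properties of the fitted tree
fitted_depth = clf.tree_.max_depth
n_nodes = clf.tree_.node_count
n_leaves = clf.get_n_leaves()

# Hyperparameter: the limit requested before fitting (None means unlimited)
depth_limit = clf.get_params()["max_depth"]

print("fitted depth:", fitted_depth)
print("nodes:", n_nodes, "leaves:", n_leaves)
print("max_depth hyperparameter:", depth_limit)
```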

## Hyperparameter Tuning

### Set up grid search

* Parameters:
    * `max_depth` in the range \[1, 10) - Maximum depth of the tree
    * `min_samples_split` in the range \[2, 10) - Minimum number of samples required to split an internal node
    * `criterion` in ['gini', 'entropy'] - Criterion for measuring the quality of a split at a given node
* Refer - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth'        : list(range(1, 10)),
              'min_samples_split': list(range(2, 10)),
              'criterion'        : ['gini','entropy']
             }
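To get a feel for the cost of this grid before running it, the number of candidate settings can be counted with `ParameterGrid`. The sketch below redefines the same grid so it runs on its own; the fold count of 10 is taken from the `cv=10` used in the searches that follow.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "max_depth": list(range(1, 10)),          # 9 values
    "min_samples_split": list(range(2, 10)),  # 8 values
    "criterion": ["gini", "entropy"],         # 2 values
}

n_candidates = len(ParameterGrid(grid))  # 9 * 8 * 2
n_folds = 10
n_fits = n_candidates * n_folds          # one fit per candidate per fold

print(n_candidates, n_fits)  # 144 1440
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set after the search.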

### Scikit-learn with no parallelism

* `%%time` reports the time taken to execute the cell
* `cv` determines the cross-validation splitting strategy; an integer such as `cv=10` requests 10-fold cross-validation (stratified for classifiers)
* `param_grid` is a dictionary with parameter names as keys and lists of parameter settings to try as values

In [12]:
%%time 

dt_model = DecisionTreeClassifier(random_state=seed)
dt_model_grid = GridSearchCV(dt_model, param_grid, cv=10)
dt_model_grid.fit(X_train, y_train)

CPU times: user 16.2 s, sys: 0 ns, total: 16.2 s
Wall time: 16.2 s


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=123),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9]})

In [13]:
print(dt_model_grid.best_params_)
print(dt_model_grid.best_score_)
print("Train accuracy:", dt_model_grid.score(X_train, y_train))
print("Test accuracy:", dt_model_grid.score(X_test, y_test))

{'criterion': 'entropy', 'max_depth': 2, 'min_samples_split': 2}
0.908421052631579
Train accuracy: 0.923469387755102
Test accuracy: 0.8106060606060606
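Beyond `best_params_` and `best_score_`, the fitted search object keeps per-candidate scores in `cv_results_`, which is convenient to view as a dataframe. The following is a minimal sketch on a synthetic dataset and a smaller grid (both are assumptions made to keep the example self-contained and fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a reduced grid, not the notebook's own
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=123)
small_grid = {"max_depth": [1, 2, 3], "criterion": ["gini", "entropy"]}

search = GridSearchCV(DecisionTreeClassifier(random_state=123), small_grid, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate; sort so the best-ranked candidates come first
results = pd.DataFrame(search.cv_results_)
cols = ["param_criterion", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[cols].head()
print(top)
```

The top row's `mean_test_score` matches `best_score_`, which is the mean cross-validated accuracy rather than a score on the held-out test set.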


### Parallelism with Scikit-Learn
Same as above, but pass `n_jobs=-1` to `GridSearchCV` so that the fits run in parallel on all available CPU cores.

In [14]:
%%time 

dt_model = DecisionTreeClassifier(random_state=seed)
dt_model_grid = GridSearchCV(dt_model, param_grid, cv=10, n_jobs=-1)
dt_model_grid.fit(X_train, y_train)

CPU times: user 1.44 s, sys: 182 ms, total: 1.63 s
Wall time: 5.62 s


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=123),
             n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9]})

In [15]:
print(dt_model_grid.best_params_)
print(dt_model_grid.best_score_)
print("Train accuracy:", dt_model_grid.score(X_train, y_train))
print("Test accuracy:", dt_model_grid.score(X_test, y_test))

{'criterion': 'entropy', 'max_depth': 2, 'min_samples_split': 2}
0.908421052631579
Train accuracy: 0.923469387755102
Test accuracy: 0.8106060606060606


### Parallelism with Dask

Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism. Joblib is what backs the `n_jobs=` parameter in normal use of Scikit-Learn.

Dask can scale these Joblib-backed algorithms out to a **cluster of machines** by providing an alternative Joblib backend.

Refer - https://ml.dask.org/joblib.html
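The backend-switching mechanism can be tried without a cluster by selecting one of Joblib's built-in backends. The sketch below uses the `threading` backend on a synthetic dataset with a small grid (all three are assumptions chosen for illustration); swapping the backend name for `"dask"` once a `Client` exists is the pattern used in the cells that follow.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the weather dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=123)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    {"max_depth": [1, 2, 3]},
    cv=3,
)

# Inside the context manager, scikit-learn's Joblib calls use the chosen backend
with joblib.parallel_backend("threading", n_jobs=2):
    search.fit(X_demo, y_demo)

print(search.best_params_)
```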

In [16]:
import joblib
from dask.distributed import Client

# Start and connect to local client
client = Client(n_workers=2)

# client = Client("scheduler-address:8786")  # connecting to remote cluster

In [17]:
client

Client: Scheduler: tcp://127.0.0.1:39301, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 2, Cores: 8, Memory: 13.29 GB


### Use the Dask backend for Joblib

* To use the Dask backend for Joblib, create a `Client` and wrap your `scikit-learn` code in a `joblib.parallel_backend('dask')` context

In [18]:
%%time 

dt_model = DecisionTreeClassifier(random_state=seed)
dt_model_grid = GridSearchCV(dt_model, param_grid, cv=10)
with joblib.parallel_backend("dask"):
    # Your scikit-learn code
    dt_model_grid.fit(X_train, y_train)

CPU times: user 7.43 s, sys: 854 ms, total: 8.28 s
Wall time: 17.8 s


In [19]:
print(dt_model_grid.best_params_)
print(dt_model_grid.best_score_)
print("Train accuracy:", dt_model_grid.score(X_train, y_train))
print("Test accuracy:", dt_model_grid.score(X_test, y_test))

{'criterion': 'entropy', 'max_depth': 2, 'min_samples_split': 2}
0.908421052631579
Train accuracy: 0.923469387755102
Test accuracy: 0.8106060606060606


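### Close connection to client

* `client.shutdown()` also stops the scheduler and workers that the client is connected to; `client.close()` disconnects the client only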

In [20]:
client.shutdown()

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
