# Module 6 Lab 6 - Grid searching for the best model hyper parameters

A grid search is an automated way to fit a model with different hyper parameters and compare the results in order to find the best performing model. Hyper parameters are the parameters that control how the model is defined, fitted and trained; they are set before fitting rather than learned from the training data. In this lab, we will see how to do a grid search on a decision tree.

In [1]:
import numpy as np
import pandas as pd

## Apply transformations to our data
These are the same transformations as applied in the earlier labs. See Lab 1 for more explanations.

In [2]:
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('../resources/diabetes_readmission.csv')
display(data.head())

y = data['readmitted']
X = data[['discharge', 'age', 'race', 'admission_type', 'specialty', 'time_in_hospital', 'diag_1', 'A1Cresult', 'change']].copy()

scaler = StandardScaler()
X['time_in_hospital'] = scaler.fit_transform(X[['time_in_hospital']])

# create dummy variables
X = pd.concat([X, pd.get_dummies(X['age'], prefix = 'age', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X['race'], prefix = 'race', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X['admission_type'], prefix = 'admission_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X['specialty'], prefix = 'specialty', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X['diag_1'], prefix = 'diag_1', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X['A1Cresult'], prefix = 'A1Cresult', drop_first=True)], axis=1)

# drop originals
X = X.drop(['age', 'race', 'admission_type', 'specialty', 'diag_1', 'A1Cresult'], axis=1)

# balance the classes
display(y.value_counts())

X_oversampled, y_oversampled = resample(X[y == 1], y[y == 1], replace=True, n_samples=X[y == 0].shape[0], random_state=42)

X = pd.DataFrame(np.vstack((X[y == 0], X_oversampled)), columns=X.columns)
y = np.hstack((y[y == 0], y_oversampled))

display(pd.Series(y).value_counts())

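| Unnamed: 0 | readmitted | discharge | age | race | admission_type | specialty | time_in_hospital | diag_1 | A1Cresult | change |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 8 | 8 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 8 | 0 | 0 |
| 2 | 0 | 1 | 2 | 1 | 1 | 2 | 4 | 8 | 2 | 0 |
| 3 | 0 | 0 | 2 | 1 | 1 | 2 | 3 | 8 | 3 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 2 | 5 | 8 | 0 | 0 |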


0    60145
1     5813
Name: readmitted, dtype: int64

1    60145
0    60145
dtype: int64
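
The dummy-variable step above repeats the same `pd.get_dummies` and `pd.concat` pattern for each categorical column and then drops the originals. As a minimal sketch, the same encoding can likely be produced with a single `pd.get_dummies` call using its `columns` argument. The small `demo` frame below is made up for illustration; only the column names mirror the lab.

```python
import pandas as pd

# Tiny made-up frame; column names mirror the lab, values are illustrative only.
demo = pd.DataFrame({
    'time_in_hospital': [8, 2, 4, 3],
    'age': [0, 0, 2, 2],
    'race': [1, 1, 1, 0],
})
categorical = ['age', 'race']

# Approach used in the lab: add dummies one column at a time, then drop originals.
step_by_step = demo.copy()
for col in categorical:
    dummies = pd.get_dummies(step_by_step[col], prefix=col, drop_first=True)
    step_by_step = pd.concat([step_by_step, dummies], axis=1)
step_by_step = step_by_step.drop(categorical, axis=1)

# Single call: encode the listed columns and remove the originals in one step.
one_call = pd.get_dummies(demo, columns=categorical, drop_first=True)

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(step_by_step, one_call)
print(list(one_call.columns))
```

The single call is shorter and keeps the list of categorical columns in one place, which makes it harder to forget to drop an original column.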

## Grid search and cross validation
For grid searching, we will not use a train/test split. Instead we will use cross validation, as that method is directly supported by scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class.
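
To make the idea of cross validation concrete, here is a minimal sketch that scores a single decision tree with 5-fold cross validation using `cross_val_score`. It runs on synthetic data from `make_classification` rather than the diabetes data, and the names `X_demo`, `y_demo` and `tree` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes data set used in this lab).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# cv=5: fit on four fifths of the rows, score on the remaining fifth, five times.
scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='accuracy')

print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())
```

A grid search repeats this kind of scoring for every candidate set of hyper parameters and keeps the candidate with the best mean score.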

The first task is to select the hyper parameters we wish to grid search. We can tune any parameter that is involved in building the model, which includes any parameter of the class constructor as well as any parameter of a custom function we may have developed for returning a model (as is the case when building a model using KerasClassifier).

The [documentation for DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) will give us some idea of what we can tune.

In this lab you will grid search the following parameters:

* criterion - the function used to measure the quality of a split in the tree
* splitter - the strategy used to choose the split at each node
* max_depth - the maximum depth of the tree (None means no limit)

Refer to the documentation for the valid values for these parameters.
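
A grid search tries every combination of the candidate values, so the number of model fits grows quickly. The sketch below uses only the standard library to enumerate the combinations for a grid like the one used in the next cell. The fold count of 5 is an assumption that matches the default of GridSearchCV in recent scikit-learn versions.

```python
from itertools import product

# Candidate values for each hyper parameter (same shape as the lab's grid).
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 5, 10, 20, None],
    'splitter': ['best', 'random'],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed number of cross validation folds
print(len(candidates), 'candidates,', len(candidates) * n_folds, 'cross validation fits')
print('first candidate:', candidates[0])
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on all of the data after the search.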

To score the grid search, we will use accuracy to select the best set of parameters. The valid values for the `scoring` argument are listed [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
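
Accuracy is not the only option. As a hedged sketch, GridSearchCV can also record several metrics at once when `scoring` is a list, in which case `refit` names the metric used to pick the best candidate. The example uses synthetic data and a deliberately small grid; `X_demo`, `y_demo` and `search` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes data set used in this lab).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [1, 5, None]},
    scoring=['accuracy', 'roc_auc'],  # record both metrics for every candidate
    refit='accuracy',                 # select the best candidate by accuracy
    cv=5,
)
search.fit(X_demo, y_demo)

print('best by accuracy:', search.best_params_)
print('mean accuracy per candidate:', search.cv_results_['mean_test_accuracy'])
print('mean ROC AUC per candidate:', search.cv_results_['mean_test_roc_auc'])
```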

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

model = DecisionTreeClassifier(random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 5, 10, 20, None],
    'splitter': ['best', 'random'],
}

grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X, y)

print("Best model: %f with parameters %s" % (grid_result.best_score_, grid_result.best_params_))



Best model: 0.648408 with parameters {'criterion': 'entropy', 'max_depth': None, 'splitter': 'random'}


## Results
The best model was identified as having been fit with `{'criterion': 'entropy', 'max_depth': None, 'splitter': 'random'}`, with a mean cross validation accuracy of ~0.65. (Note that we cannot compare this to the Lab 2 accuracy, because that was assessed on a test data set rather than by cross validation.)

Note that the selected values for criterion and splitter differ from the DecisionTreeClassifier defaults (`'gini'` and `'best'`), while max_depth matches its default of None. This supports the idea that searching for the best hyper parameters for a model is a worthwhile effort.
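
The best score alone does not show how close the other candidates were. A minimal sketch of one way to compare them is to load the `cv_results_` attribute of a fitted GridSearchCV into a DataFrame and sort by rank. The example below fits a small grid on synthetic data so that it is self-contained; `search` and `ranked` are placeholder names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes data set used in this lab).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [1, 5, None]},
    scoring='accuracy',
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate: parameters, mean and spread of the fold scores, and rank.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranked = results[columns].sort_values('rank_test_score')
print(ranked.to_string(index=False))
```

If the top candidates differ by less than the standard deviation across folds, the choice between them may not be meaningful.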

## Retrieving the best estimator and saving the model

We can save the model to a file for later use, such as if we want to reload it and apply it to a new data set.  We can also directly get the best estimator for immediate use.

In [4]:
import joblib

# get the best estimator to use immediately if we want
best = grid.best_estimator_

# save to a file
joblib.dump(best, '../../../output/best_rf_model.pkl')

# load from a file
best = joblib.load('../../../output/best_rf_model.pkl') 
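
As a quick check that saving and loading preserves the model, the sketch below writes a small fitted tree to a temporary directory, reloads it, and compares predictions. The data is synthetic and the file name `demo_tree.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a small fitted tree to save.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
fitted = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_tree.pkl')  # hypothetical file name
    joblib.dump(fitted, path)      # save to a file
    restored = joblib.load(path)   # load from the file

# The reloaded model should give the same predictions as the original.
same = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print('predictions match after reload:', same)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that saved them.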

## Visualize the results of the best estimator

In [5]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

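model = best
# accuracy on the same data the model was fit on (not a held-out set)
print('accuracy', model.score(X, y))

# predicted probability of the positive class (readmitted == 1)
probs = model.predict_proba(X)
preds = probs[:,1]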

fpr, tpr, threshold = metrics.roc_curve(y, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

y_pred = model.predict(X)

cmatrix = confusion_matrix(y, y_pred)
print('confusion matrix:')
print(cmatrix)
print('\nclassification report:')
print(classification_report(y, y_pred))

accuracy 0.6827167678111231


<Figure size 640x480 with 1 Axes>

confusion matrix:
[[39815 20330]
 [17836 42309]]

classification report:
              precision    recall  f1-score   support

           0       0.69      0.66      0.68     60145
           1       0.68      0.70      0.69     60145

    accuracy                           0.68    120290
   macro avg       0.68      0.68      0.68    120290
weighted avg       0.68      0.68      0.68    120290
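
The metrics above are computed on the same rows that the best estimator was fit on, so they may be optimistic. As a hedged sketch of one alternative, some rows can be held out before the grid search and used only for the final evaluation. The example uses synthetic data, and the 80/20 split and the small grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes data set used in this lab).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the rows; the grid search never sees them.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [1, 5, 10, None]},
    scoring='accuracy',
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out rows only.
best_demo = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_demo.predict(X_test))
test_auc = roc_auc_score(y_test, best_demo.predict_proba(X_test)[:, 1])
print('held-out accuracy:', test_accuracy)
print('held-out ROC AUC:', test_auc)
```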

