# Model Tuning with Grid Search

## Objectives

- Explain what hyperparameters are
- Describe the purpose of grid searching
- Implement grid searching for the purposes of model optimization

![](https://imgs.xkcd.com/comics/machine_learning.png)

## Hyperparameters

Many of the models we have looked at are really *families* of models in the sense that they make use of **hyperparameters**.

Thus for example the $k$-nearest-neighbors algorithm allows us to make:

- a 1-nearest-neighbor model
- a 2-nearest-neighbors model
- a 3-nearest-neighbors model
- etc.

Or, for another example, the decision tree algorithm allows us to make:

- a classifier that branches according to information gain
- a classifier that branches according to Gini impurity
- a regressor that branches according to mean squared error
- etc.

Depending on the sort of problem and data at hand, it is natural to experiment with different values of these hyperparameters to try to improve model performance.
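As a minimal sketch of this idea, the snippet below builds three members of the $k$-nearest-neighbors family and compares them. The synthetic data from `make_classification` and the particular values of `n_neighbors` are assumptions made for illustration; they are not the lesson's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Same algorithm, different hyperparameter values -> different models
knn_scores = {}
for k in [1, 3, 5]:
    model = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for k, score in knn_scores.items():
    print(f"n_neighbors={k}: mean accuracy = {score:.3f}")
```

Each value of `n_neighbors` gives a distinct model with its own cross-validated score, which is exactly what we compare when tuning.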

> We can think of these **hyperparameters** as _dials_ of the base model

<img src='https://cdn.dribbble.com/users/947489/screenshots/4522230/0001-0128.gif' width=500>

### Difference from Parametric / Non-Parametric Models

Contrast the notion of hyperparameters with the distinction between **parametric** and **non-parametric** models.

A linear regression model is parametric in the sense that we start with a given model *form* and we then search for the optimal parameters to fill in that form. But *those* parameters are not the sort we might tweak for the purposes of improving model performance! On the contrary, there is **one** best set of parameters, and the training of the model is a matter of finding those optimal values.
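To make the contrast concrete, here is a small sketch using noise-free synthetic data that follows $y = 2x + 1$ (an assumption for illustration). The hyperparameters are whatever we pass to the constructor, while the parameters (`coef_` and `intercept_`) only exist after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data following y = 2x + 1 (illustrative assumption)
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_line = 2 * x_demo.ravel() + 1

# Hyperparameters are chosen by us when we construct the model...
linreg = LinearRegression(fit_intercept=True)
print(linreg.get_params())

# ...while parameters are learned from the data during fitting
linreg.fit(x_demo, y_line)
print(linreg.coef_, linreg.intercept_)
```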

## Scenario: Cat in the Dat(a)

Let's revisit our Cat in the Dat Kaggle competition: https://www.kaggle.com/c/cat-in-the-dat-ii

In [1]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score, plot_roc_curve, plot_precision_recall_curve
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [2]:
# Grab, then explore data
df = pd.read_csv("data/cat_in_the_dat2_train.csv", index_col='id')

In [3]:
# Define our X and y

X = df.drop(columns=['target'])
y = df['target']

# and train test split - to create our val holdout set!
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1,
                                                  random_state=0)

### Preprocessor Pipeline

Let's bring back the preprocessor pipeline we built earlier:

In [4]:
# First defining our overall used columns - those with less than 30 uniques
used_cols = [col for col in X_train.columns if len(X_train[col].unique()) < 30]

# Defining our object columns
obj_cols = [col for col in used_cols if X_train[col].dtype == 'O']
# And our object transformer
obj_transformer = Pipeline(steps=[
    ("obj_imputer", SimpleImputer(strategy = 'constant', fill_value='WHATEVER')),
    ('ohe', OneHotEncoder(drop=['WHATEVER']*len(obj_cols)))
])

# Now defining our numeric columns
num_cols = [col for col in used_cols if X_train[col].dtype != 'O']
# And our numeric transformer
num_transformer = Pipeline(steps = [
    ('num_imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', MinMaxScaler())
])

# Putting them together into a preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("num_trans", num_transformer, num_cols),
    ('obj_trans', obj_transformer, obj_cols)
])

In [20]:
preprocessor.fit(X_train[used_cols])

ColumnTransformer(transformers=[('num_trans',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('scaler', MinMaxScaler())]),
                                 ['bin_0', 'bin_1', 'bin_2', 'ord_0', 'day',
                                  'month']),
                                ('obj_trans',
                                 Pipeline(steps=[('obj_imputer',
                                                  SimpleImputer(fill_value='WHATEVER',
                                                                strategy='constant')),
                                                 ('ohe',
                                                  OneHotEncoder(drop=['WHATEVER',
                                                                      'WHATEVER',
                                                                      'WHATEVER',
        

In [14]:
num_cols

['bin_0', 'bin_1', 'bin_2', 'ord_0', 'day', 'month']

In [32]:
obj_cols

['bin_3',
 'bin_4',
 'nom_0',
 'nom_1',
 'nom_2',
 'nom_3',
 'nom_4',
 'ord_1',
 'ord_2',
 'ord_3',
 'ord_4']

In [31]:
preprocessor.named_transformers_['obj_trans']['ohe'].get_feature_names()

array(['x0_F', 'x0_T', 'x1_N', 'x1_Y', 'x2_Blue', 'x2_Green', 'x2_Red',
       'x3_Circle', 'x3_Polygon', 'x3_Square', 'x3_Star', 'x3_Trapezoid',
       'x3_Triangle', 'x4_Axolotl', 'x4_Cat', 'x4_Dog', 'x4_Hamster',
       'x4_Lion', 'x4_Snake', 'x5_Canada', 'x5_China', 'x5_Costa Rica',
       'x5_Finland', 'x5_India', 'x5_Russia', 'x6_Bassoon', 'x6_Oboe',
       'x6_Piano', 'x6_Theremin', 'x7_Contributor', 'x7_Expert',
       'x7_Grandmaster', 'x7_Master', 'x7_Novice', 'x8_Boiling Hot',
       'x8_Cold', 'x8_Freezing', 'x8_Hot', 'x8_Lava Hot', 'x8_Warm',
       'x9_a', 'x9_b', 'x9_c', 'x9_d', 'x9_e', 'x9_f', 'x9_g', 'x9_h',
       'x9_i', 'x9_j', 'x9_k', 'x9_l', 'x9_m', 'x9_n', 'x9_o', 'x10_A',
       'x10_B', 'x10_C', 'x10_D', 'x10_E', 'x10_F', 'x10_G', 'x10_H',
       'x10_I', 'x10_J', 'x10_K', 'x10_L', 'x10_M', 'x10_N', 'x10_O',
       'x10_P', 'x10_Q', 'x10_R', 'x10_S', 'x10_T', 'x10_U', 'x10_V',
       'x10_W', 'x10_X', 'x10_Y', 'x10_Z'], dtype=object)

In [15]:
ohe_col_names = preprocessor.named_transformers_['obj_trans']['ohe'].get_feature_names()

In [18]:
output_cols = [*num_cols, *ohe_col_names]
len(output_cols)

87

In [27]:
pd.DataFrame(preprocessor.transform(X_train[used_cols]).A,
             index=X_train.index, columns=output_cols).head()

| id | bin_0 | bin_1 | bin_2 | ord_0 | day | month | x0_F | x0_T | x1_N | x1_Y | ... | x10_Q | x10_R | x10_S | x10_T | x10_U | x10_V | x10_W | x10_X | x10_Y | x10_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111903 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.727273 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 521435 | 0.0 | 1.0 | 0.0 | 0.5 | 0.666667 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 484488 | 0.0 | 0.0 | 0.0 | 0.5 | 0.166667 | 0.636364 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 529639 | 0.0 | 1.0 | 1.0 | 0.0 | 0.666667 | 0.090909 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 589943 | 0.0 | 0.0 | 0.0 | 0.5 | 0.833333 | 0.636364 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Trying Different Models & Values

Let's explore three different model types, using our preprocessor pipeline and evaluating using `cross_validate`:

#### Logistic Regression

In [33]:
clf_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(random_state = 713))
])

In [34]:
cv_results = cross_validate(clf_lr, X_train[used_cols], y_train, 
                            scoring=['roc_auc', 'f1'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
cv_results

{'fit_time': array([5.79211378, 5.3588841 , 5.79528618, 5.84712386, 5.85742164]),
 'score_time': array([1.35079098, 1.34045386, 1.32278991, 1.32265592, 1.32111001]),
 'test_roc_auc': array([0.73592693, 0.73430216, 0.73870237, 0.73259678, 0.73553261]),
 'test_f1': array([0.16649166, 0.17204481, 0.16301652, 0.16921013, 0.17259222])}

#### Decision Tree

In [36]:
clf_dt = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('tree', DecisionTreeClassifier(max_depth=7, random_state=713))
])

In [37]:
cv_results = cross_validate(clf_dt, X_train[used_cols], y_train, 
                            scoring=['roc_auc', 'f1'])

cv_results

{'fit_time': array([5.24018002, 5.21961069, 5.22077107, 5.19924617, 5.18029904]),
 'score_time': array([1.3204999 , 1.31627917, 1.30912685, 1.32200408, 1.30161095]),
 'test_roc_auc': array([0.67141229, 0.66792007, 0.67202339, 0.66846962, 0.66986597]),
 'test_f1': array([0.05099737, 0.04137211, 0.04184697, 0.03716586, 0.05640116])}

#### Your Choice: KNN or Naive Bayes?

In [38]:
clf_nb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('bayes', MultinomialNB())
])

In [39]:
cv_results = cross_validate(clf_nb, X_train[used_cols], y_train, 
                            scoring=['roc_auc', 'f1'])

cv_results

{'fit_time': array([3.00266695, 2.95650196, 2.97177315, 2.97852421, 2.92720199]),
 'score_time': array([1.33264804, 1.3210609 , 1.32920384, 1.32787466, 1.325418  ]),
 'test_roc_auc': array([0.72314457, 0.72089656, 0.72660956, 0.71907788, 0.72207205]),
 'test_f1': array([0.07242211, 0.07242958, 0.07285013, 0.07132257, 0.0764714 ])}

## Automatically Searching for Optimal Hyperparameters with Grid Search

It's not a bad idea to experiment with the values of your models' hyperparameters a bit as you're getting a feel for your models' performance. But there are more systematic ways of going about the search for optimal hyperparameters. One method of hyperparameter tuning is **grid searching**. 

The idea is to build multiple models with different hyperparameter values and then see which one performs the best. The hyperparameters and the values to try form a sort of *grid* along which we are looking for the best performance. 

For example, for KNN:

    n_neighbors | metric      | weights
    ______________________________________
    1           | 'minkowski' | 'uniform'
    3           | 'manhattan' | 'distance'
    5           |


Scikit-Learn has a [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class whose `fit()` method runs this procedure. Note that this can be quite computationally expensive since:

- A model is constructed for each combination of hyperparameter values that we input; and
- Each model is cross-validated.
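A quick way to see the cost before running anything is to count the combinations. The sketch below uses scikit-learn's `ParameterGrid` on the KNN grid from the table above and assumes five-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# The KNN grid from the table above
knn_grid = {
    'n_neighbors': [1, 3, 5],
    'metric': ['minkowski', 'manhattan'],
    'weights': ['uniform', 'distance'],
}

n_combinations = len(ParameterGrid(knn_grid))  # 3 * 2 * 2
n_folds = 5                                    # assumed number of CV folds
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # prints: 12 60
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, so it is worth doing this arithmetic first.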

### `GridSearchCV`

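Note! When using `GridSearchCV` with a scikit-learn `Pipeline`, each hyperparameter name in the grid must be prefixed with the nickname of the pipeline step it belongs to, followed by a double underscore (for example, `tree__max_depth`). This is how the search knows which step each hyperparameter applies to.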

Resource: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
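If you are unsure which names are valid, `get_params()` on the pipeline lists them. The sketch below uses a small stand-in pipeline (the `'scaler'` step is an assumption for illustration, not the lesson's preprocessor).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# A small stand-in pipeline; the step nicknames are 'scaler' and 'tree'
demo_pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('tree', DecisionTreeClassifier(random_state=713)),
])

# Tunable names for the 'tree' step, in '<nickname>__<hyperparameter>' form
tree_param_names = sorted(
    name for name in demo_pipe.get_params() if name.startswith('tree__')
)
print(tree_param_names)
```

Any of the printed names can be used as a key in `param_grid`.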

In [40]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [41]:
# Choose one model type to optimize, and set up the base pipeline

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('tree', DecisionTreeClassifier(random_state=713))
])

In [47]:
# Define the parameter grid for one model type

param_grid = {
    "tree__max_depth": [5, 14],
    "tree__min_samples_leaf": [5, 14]
}

**Question: How many models will we be constructing with this grid?**

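- 20 with cv=5: the grid has 2 × 2 = 4 hyperparameter combinations, and each is fit once per fold (4 × 5 = 20). Because `refit=True` by default, the best combination is then fit one more time on the full training set.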

In [48]:
# Initialize the grid search object with five-fold cross-validation

gs = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc', verbose=3)

# Then fit it to our X_train and y_train
gs.fit(X_train[used_cols], y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] tree__max_depth=5, tree__min_samples_leaf=5 .....................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  tree__max_depth=5, tree__min_samples_leaf=5, score=0.647, total=   4.7s
[CV] tree__max_depth=5, tree__min_samples_leaf=5 .....................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.7s remaining:    0.0s


[CV]  tree__max_depth=5, tree__min_samples_leaf=5, score=0.646, total=   4.6s
[CV] tree__max_depth=5, tree__min_samples_leaf=5 .....................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    9.4s remaining:    0.0s


[CV]  tree__max_depth=5, tree__min_samples_leaf=5, score=0.648, total=   4.6s
[CV] tree__max_depth=5, tree__min_samples_leaf=5 .....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=5, score=0.647, total=   4.7s
[CV] tree__max_depth=5, tree__min_samples_leaf=5 .....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=5, score=0.646, total=   4.6s
[CV] tree__max_depth=5, tree__min_samples_leaf=14 ....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=14, score=0.647, total=   4.7s
[CV] tree__max_depth=5, tree__min_samples_leaf=14 ....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=14, score=0.646, total=   4.7s
[CV] tree__max_depth=5, tree__min_samples_leaf=14 ....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=14, score=0.648, total=   4.9s
[CV] tree__max_depth=5, tree__min_samples_leaf=14 ....................
[CV]  tree__max_depth=5, tree__min_samples_leaf=14, score=0.647, total=   4.6s
[CV] tree__max_depth=5, 

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  4.5min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num_trans',
                                                                         Pipeline(steps=[('num_imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('scaler',
                                                                                          MinMaxScaler())]),
                                                                         ['bin_0',
                                                                          'bin_1',
                                                                          'bin_2',
                                                                          'ord_0',
                                                             

In [49]:
# Check out the best parameters
gs.best_params_

{'tree__max_depth': 14, 'tree__min_samples_leaf': 14}

In [50]:
# Check out the best score
gs.best_score_

0.6767920172812092

In [55]:
# We can grab the best estimator out, as a fit model!
gs_preds = gs.best_estimator_.predict(X_val[used_cols])

roc_auc_score(y_val, gs_preds)

0.5310754291131321
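The validation score above is noticeably lower than `gs.best_score_`. One likely contributor is that `roc_auc_score` was given hard class labels from `.predict()`, whereas the `'roc_auc'` scorer used inside the grid search ranks observations by predicted probability. The sketch below shows both calculations side by side on synthetic, imbalanced data (an assumption for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.8, 0.2],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0,
                                          stratify=y_demo)

tree = DecisionTreeClassifier(max_depth=5, random_state=713).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only a single threshold is evaluated
auc_from_labels = roc_auc_score(y_te, tree.predict(X_te))

# AUC from predicted probabilities of the positive class: all thresholds count
auc_from_probs = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print(f"labels: {auc_from_labels:.3f}, probabilities: {auc_from_probs:.3f}")
```

For a number comparable to `gs.best_score_`, the same idea would apply to the fitted pipeline: pass `gs.best_estimator_.predict_proba(X_val[used_cols])[:, 1]` to `roc_auc_score`.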

In [52]:
# A bigger picture of our results
gs.cv_results_

{'mean_fit_time': array([ 3.99676528,  4.00485163, 21.66217537, 21.10274239]),
 'std_fit_time': array([0.04184848, 0.07868755, 0.31630694, 0.21992343]),
 'mean_score_time': array([0.67216496, 0.68976598, 0.68103204, 0.68554816]),
 'std_score_time': array([0.00444199, 0.02224489, 0.00369937, 0.00550398]),
 'param_tree__max_depth': masked_array(data=[5, 5, 14, 14],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_tree__min_samples_leaf': masked_array(data=[5, 14, 5, 14],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'tree__max_depth': 5, 'tree__min_samples_leaf': 5},
  {'tree__max_depth': 5, 'tree__min_samples_leaf': 14},
  {'tree__max_depth': 14, 'tree__min_samples_leaf': 5},
  {'tree__max_depth': 14, 'tree__min_samples_leaf': 14}],
 'split0_test_score': array([0.64698058, 0.64698058, 0.67136491, 0.67749753]),
 'split1_test_score': array([0.64645104, 0.64645104

In [53]:
# Which we can throw into a dataframe if we like
pd.DataFrame(gs.cv_results_)

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_tree__max_depth | param_tree__min_samples_leaf | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.996765 | 0.041848 | 0.672165 | 0.004442 | 5 | 5 | {'tree__max_depth': 5, 'tree__min_samples_leaf... | 0.646981 | 0.646451 | 0.647599 | 0.646708 | 0.646017 | 0.646751 | 0.000529 | 4 |
| 1 | 4.004852 | 0.078688 | 0.689766 | 0.022245 | 5 | 14 | {'tree__max_depth': 5, 'tree__min_samples_leaf... | 0.646981 | 0.646451 | 0.647599 | 0.646708 | 0.646018 | 0.646751 | 0.000529 | 3 |
| 2 | 21.662175 | 0.316307 | 0.681032 | 0.003699 | 14 | 5 | {'tree__max_depth': 14, 'tree__min_samples_lea... | 0.671365 | 0.669342 | 0.673427 | 0.669577 | 0.6726 | 0.671262 | 0.001613 | 2 |
| 3 | 21.102742 | 0.219923 | 0.685548 | 0.005504 | 14 | 14 | {'tree__max_depth': 14, 'tree__min_samples_lea... | 0.677498 | 0.674446 | 0.677906 | 0.674934 | 0.679177 | 0.676792 | 0.00181 | 1 |
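The full results table is wide, so it often helps to keep only the columns needed to compare candidates and sort by rank. Here is a sketch of that on a small synthetic problem; the data and the grid values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

demo_gs = GridSearchCV(
    DecisionTreeClassifier(random_state=713),
    param_grid={'max_depth': [2, 5], 'min_samples_leaf': [5, 14]},
    cv=5,
    scoring='roc_auc',
)
demo_gs.fit(X_demo, y_demo)

# Keep only the columns needed to compare candidates, best first
keep_cols = ['param_max_depth', 'param_min_samples_leaf',
             'mean_test_score', 'std_test_score', 'rank_test_score']
summary = (
    pd.DataFrame(demo_gs.cv_results_)[keep_cols]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two candidates is larger than the fold-to-fold variation.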


### Choice of Grid Values

Which values should you pick for your grid? Intuitively, you should try both "large" and "small" values, but of course what counts as large and small will really depend on the type of hyperparameter.

- For a k-nearest neighbors model, 1 or 3 would be a small value for the number of neighbors and 15 or 17 would be a large value.
- For a decision tree model, what counts as a small `max_depth` will really depend on the size of your training data. A `max_depth` of 5 would likely have little effect on a very small dataset but, at the same time, it would probably significantly decrease the variance of a model where the dataset is large.
- For a logistic regression's regularization constant, you may want to try a set of values that are exponentially separated, like \[1, 10, 100, 1000\].
- **If a grid search finds optimal values at the ends of your hyperparameter ranges, you might try another grid search with more extreme values.**
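Two of these suggestions are easy to sketch in code: generating exponentially separated values with `np.logspace`, and checking whether the best value landed on the edge of the range. The helper `best_is_on_edge` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Exponentially spaced candidates: 10**0, 10**1, 10**2, 10**3
c_values = np.logspace(0, 3, num=4)
print(c_values)


def best_is_on_edge(candidates, best_value):
    """Return True if the best value is the smallest or largest candidate tried."""
    return best_value in (min(candidates), max(candidates))


# In the grid search above, the best max_depth was 14, the largest value offered,
# which suggests trying a grid that extends further upward
print(best_is_on_edge([5, 14], 14))
```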

### Exercise

Do a grid search on a different model type! What are the optimal values for the hyperparameters you've chosen?

In [None]:
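# Example logistic regression param grid, with model nickname 'lr'
# Note: the 'l1' penalty requires a solver that supports it,
# such as solver='liblinear' or solver='saga'
lr_grid = {
    'lr__C': [.01, .1, 1, 10, 100],
    'lr__penalty': ['l1', 'l2']
}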

# Level Up: Random Searching

It is also possible to search for good hyperparameter values randomly. This is a nice choice if computation time is an issue or if you are tuning over continuous hyperparameters.
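To see what "searching randomly" means in practice, the sketch below draws a handful of candidate settings from continuous distributions using scikit-learn's `ParameterSampler`. The distributions and the number of draws are illustrative assumptions.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# Continuous distributions to sample from (illustrative choices)
demo_distributions = {
    'C': stats.loguniform(1e-2, 1e2),
    'l1_ratio': stats.uniform(loc=0, scale=1),
}

# Draw 5 random candidates instead of enumerating a full grid
samples = list(ParameterSampler(demo_distributions, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

The number of fits is fixed by `n_iter` rather than by the size of a grid, which is what keeps the computation time under control.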

### `RandomizedSearchCV` with `LogisticRegression`

In [None]:
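# Imports
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Distributions to sample from; keys use the pipeline step nickname 'logreg'
log_reg_grid = {'logreg__C': stats.uniform(loc=0, scale=10),
                # l1_ratio must lie in [0, 1], so sample it uniformly on that interval
                'logreg__l1_ratio': stats.uniform(loc=0, scale=1)}

In [None]:
# Wrap the model in the preprocessor pipeline so the raw columns
# are imputed, scaled and encoded before reaching the model
rs_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(penalty='elasticnet',
                                  solver='saga',
                                  max_iter=1000,
                                  random_state=42))
])

rs = RandomizedSearchCV(estimator=rs_clf,
                        param_distributions=log_reg_grid,
                        random_state=42)

rs.fit(X_train[used_cols], y_train)

rs.best_params_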

# Level Up: More Grid Search Practice

Use a classifier of your choice to predict the category of price range for the phones in this dataset. Try tuning some hyperparameters using a grid search, and then write up a short paragraph about your findings.

In [None]:
phones_train = pd.read_csv('data/phone-train.csv')
phones_test = pd.read_csv('data/phone-test.csv')

In [None]:
# Your code here