# Introduction to Gridsearching Hyperparameters

---

![](https://snag.gy/aYcCt2.jpg)

### Learning Objectives
- Describe what the terms gridsearch and hyperparameter mean.
- Build a gridsearching procedure from scratch.
- Apply sklearn's `GridSearchCV` object with basketball data to optimize a KNN model.
- Use and evaluate attributes of the gridsearch object.
- Describe the pitfalls of searching large hyperparameter spaces.

### Lesson Guide
- [What is "grid searching"? What are "hyperparameters"?](#intro)
- [Basketball Data](#basketball-data)
- [Fitting a Default KNN](#fit-knn)
- [Searching for the Best Hyperparameters](#searching)
    - [Grid Search Pseudocode](#pseudocode)
    - [Using `GridSearchCV`](#gscv)
- [A Caution on Grid Searching](#caution)
- [Independent Practice: Grid Searching Regularization Penalties with Logistic Regression](#practice)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>

## What is "Grid Searching"? What are "Hyperparameters"?

---

Models often have built-in specifications that we can use to fine-tune our results. For example, when we choose a linear regression, we may decide to add a penalty to the loss function such as the Ridge or the Lasso. Those penalties require the regularization strength, alpha, to be set. 

**These specifications are called hyperparameters.**

Hyperparameters are different from the parameters of the model that result from a fit, such as the coefficients. They are set prior to the fit, usually when we instantiate the model, and they affect or determine the model's behavior.
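
To make the distinction concrete, here is a minimal sketch using sklearn's `Ridge` on synthetic data. The data, the variable names, and the `alpha` value are assumptions chosen for illustration. `alpha` is a hyperparameter we choose at instantiation, while `coef_` and `intercept_` are parameters that exist only after the fit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data, for illustration only
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hyperparameter: chosen by us before fitting
ridge = Ridge(alpha=1.0)
print(ridge.get_params()['alpha'])

# Parameters: learned from the data during the fit
ridge.fit(X_demo, y_demo)
print(ridge.coef_, ridge.intercept_)
```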

There is often more than one kind of hyperparameter to set for a model. For example, in the KNN algorithm, we have a hyperparameter to set the number of neighbors. We also have a hyperparameter to set the weights, either uniform or distance. Generally, we want to know the *optimal* hyperparameter settings: the set that results in the best model evaluation.

**The search for the optimal set of hyperparameters is called gridsearching.**

Gridsearching gets its name from the fact that we are searching over a "grid" of hyperparameter values. For example, imagine the `n_neighbors` hyperparameter on the x-axis and `weights` on the y-axis. Each point on the grid is one combination of values, and we need to test all of them.
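
As a small illustration of the grid idea, the sketch below enumerates every point on a two-hyperparameter grid with `itertools.product`. The candidate values are illustrative assumptions, not the values searched later in this lesson.

```python
from itertools import product

# Hypothetical candidate values for two KNN hyperparameters
n_neighbors_options = [1, 3, 5]
weights_options = ['uniform', 'distance']

# Every (n_neighbors, weights) pair is one point on the grid
grid_points = list(product(n_neighbors_options, weights_options))

for k, w in grid_points:
    print(f'n_neighbors={k}, weights={w}')

print(len(grid_points), 'combinations to evaluate')
```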

**Gridsearching uses cross-validation internally to evaluate the performance of each set of hyperparameters.** More on this later.

<a id='basketball-data'></a>

## Basketball Data

---

To explore the process of gridsearching over sets of hyperparameters, we will use some basketball data. The data below has statistics for 4 different seasons of NBA basketball: 2013-2016.
- This data includes aggregate statistical data for each game. 
- The data of each game is aggregated by match for all players.
- Scraped from http://www.basketball-reference.com

Many of the columns in the dataset summarize prior games, for example the mean of a statistic across the last 10 games. Non-target statistics are for *prior* games; they do not include information about player performance in the current game.

**We are interested in predicting whether the home team will win the game or not.** This is a classification problem.


### Load the data and create the target and predictor matrix
- The target will be a binary column of whether the home team wins.
- The predictors should be numeric statistics columns.

Exclude these columns from the predictor matrix:

    ['GameId','GameDate','GameTime','HostName',
     'GuestName','total_score','total_line','game_line',
     'winner','loser','host_wins','Season']


In [2]:
data = pd.read_csv('data/basketball_data.csv')

In [3]:
data.columns[:10]

Index(['Season', 'GameId', 'GameDate', 'GameTime', 'HostName', 'GuestName',
       'total_score', 'total_line', 'game_line', 'Host_HostRank'],
      dtype='object')

In [4]:
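# A:
excluded = ['GameId','GameDate','GameTime','HostName',
            'GuestName','total_score','total_line','game_line',
            'winner','loser','host_wins','Season']

# Keep the original column order (a set difference would not preserve it)
predictors = [col for col in data.columns if col not in excluded]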

df = data[predictors].copy()
X = df

# Create a binary int column to represent host's win/loss
y = (data.HostName == data.winner).astype(int)

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([], default=StandardScaler())

Z_train = mapper.fit_transform(X_train)
Z_test = mapper.transform(X_test)

<a id='fit-knn'></a>

## Fitting the Default KNN

---

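Below we fit a `KNeighborsClassifier` to predict win vs. not on the training data, then score it on the testing data. Apart from `n_neighbors=11`, every hyperparameter is left at its default, so we will refer to this model as the default KNN.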

Remember to compare your score to the baseline accuracy.
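
For a binary target, one common baseline is the accuracy from always predicting the majority class. Here is a minimal sketch using a made-up label array (`y_example` and `baseline_accuracy` are hypothetical names, not part of the basketball data or of sklearn):

```python
import numpy as np

def baseline_accuracy(y):
    """Accuracy obtained by always predicting the most common class in a 0/1 target."""
    positive_rate = np.mean(y)
    return max(positive_rate, 1 - positive_rate)

# Hypothetical 0/1 labels: 6 wins, 4 losses
y_example = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
print(baseline_accuracy(y_example))
```

When more than half of the labels are 1, this equals `np.mean(y)`, which is the form used later in this lesson.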

In [7]:
from sklearn.neighbors import KNeighborsClassifier

In [8]:
neighbors = KNeighborsClassifier(n_neighbors=11)
neighbors.fit(Z_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=11, p=2,
                     weights='uniform')

In [9]:
neighbors.score(Z_train, y_train)

0.7038216560509554

In [10]:
neighbors.score(Z_test, y_test)

0.613588110403397

In [11]:
print(np.mean(y_test))

0.5987261146496815


<a id='searching'></a>

## Searching for the Best Hyperparameters

---

Our default KNN barely beats the baseline on the test data (about 0.61 accuracy against about 0.60). But what if we changed the number of neighbors? The weighting? The distance metric?

These are all hyperparameters of the KNN algorithm. How would we tune them manually? We would use cross-validation on the training data to find the set of hyperparameters that performs best, then fit a final model with that set and score it on the testing set.

<a id='pseudocode'></a>
### Gridsearch pseudocode for our KNN

```python
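from sklearn.model_selection import cross_val_score

# Candidate values to try (illustrative)
neighbors_to_test = [1, 3, 5, 9, 15, 21]
weightings_to_test = ['uniform', 'distance']
distance_metrics_to_test = ['euclidean', 'manhattan']

accuracies = {}
for k in neighbors_to_test:
    for w in weightings_to_test:
        for d in distance_metrics_to_test:
            hyperparam_set = (k, w, d)
            knn = KNeighborsClassifier(n_neighbors=k, weights=w, metric=d)
            # Mean 5-fold cross-validated accuracy on the training data
            cv_accuracies = cross_val_score(knn, X_train, y_train, cv=5)
            accuracies[hyperparam_set] = np.mean(cv_accuracies)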
```

In the pseudocode above, we would find the key in the dictionary (a hyperparameter set) that has the largest value (mean cross-validated accuracy).
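
That lookup can be written with `max` and a `key` function. The dictionary below holds made-up accuracies purely for illustration; they are not results from the basketball data.

```python
# Hypothetical mean cross-validated accuracies keyed by (k, weights, metric)
accuracies = {
    (5, 'uniform', 'euclidean'): 0.59,
    (21, 'uniform', 'manhattan'): 0.62,
    (21, 'distance', 'euclidean'): 0.61,
}

# max() iterates over the keys and ranks each key by its value
best_hyperparams = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_hyperparams]

print(best_hyperparams, best_accuracy)
```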



<a id='gscv'></a>
### Using `GridSearchCV`

This would be an annoying process to have to do manually. Luckily sklearn comes with a convenience class for performing gridsearch:

```python
from sklearn.model_selection import GridSearchCV
```

`GridSearchCV` has a handful of important arguments:

| Argument | Description |
| --- | ---|
| **`estimator`** | The sklearn instance of the model to fit on |
| **`param_grid`** | A dictionary where keys are hyperparameters for the model and values are lists of values to test |
| **`cv`** | The number of internal cross-validation folds to run for each set of hyperparameters |
| **`n_jobs`** | How many cores to use on your computer to run the folds (-1 means use all cores) |
| **`verbose`** | How much output to display (0 is none, 1 is limited, 2 is printouts for every internal fit) |


Below is an example for how one might set up the gridsearch for our KNN:

```python
knn_parameters = {
    'n_neighbors':[1,3,5,7,9],
    'weights':['uniform','distance']
}

knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), knn_parameters, verbose=1)
knn_gridsearcher.fit(X_train, y_train)
```

**Try out the sklearn gridsearch below on the training data.**

In [12]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score, make_scorer, f1_score

In [13]:
knn_params = {
    'n_neighbors':[1,3,5,9,15,21],
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']
}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, verbose=1, n_jobs=-1)

knn_gridsearch.fit(Z_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    8.6s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan'],
                         'n_neighbors': [1, 3, 5, 9, 15, 21],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

<a id='gs-results'></a>
### Examining the results of the gridsearch

Once the gridsearch has fit (this can take a while!), we can pull out a variety of information and useful objects from the gridsearch object, stored as attributes:

| Property | Use |
| --- | ---|
| **`results.param_grid`** | Displays parameters searched over. (The equivalent attribute on `RandomizedSearchCV` is `param_distributions`.) |
| **`results.best_score_`** | Best mean cross-validated score achieved. |
| **`results.best_estimator_`** | Reference to model with best score.  Is usable / callable. |
| **`results.best_params_`** | The parameters that have been found to perform with the best score. |
| **`results.cv_results_`** | Dictionary of scores and timings for every parameter combination. (Pass `return_train_score=True` to `GridSearchCV` to include training scores.) |
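
The sketch below exercises these attributes on a small search over synthetic data, so it runs on its own. The data, the grid values, and the variable names are assumptions for illustration, not the basketball search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_grid = {'n_neighbors': [3, 5], 'weights': ['uniform', 'distance']}

# return_train_score=True adds training-score entries to cv_results_
results = GridSearchCV(
    KNeighborsClassifier(), demo_grid, cv=3, return_train_score=True
)
results.fit(X_demo, y_demo)

print(results.param_grid)           # the grid that was searched
print(results.best_params_)         # best combination found
print(results.best_score_)          # its mean cross-validated score
print(results.best_estimator_)      # refit model, ready to predict or score
print(sorted(results.cv_results_))  # keys of the per-candidate results
```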

**Print out the best score found in the search.**

In [14]:
knn_gridsearch

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan'],
                         'n_neighbors': [1, 3, 5, 9, 15, 21],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

#### Examining `grid.cv_results_`

The [sklearn docs for GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) have a full description of the results. It's likely not a coincidence that `grid.cv_results_` is in a convenient format that makes it easy to import into a DataFrame.

In [15]:
knn_gridsearch.best_score_

0.6199575371549894

In [16]:
scores = pd.DataFrame(knn_gridsearch.cv_results_)
scores.head()

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_metric | param_n_neighbors | param_weights | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.027741 | 0.005889 | 0.398121 | 0.005112 | euclidean | 1 | uniform | {'metric': 'euclidean', 'n_neighbors': 1, 'wei... | 0.54947 | 0.552212 | 0.582301 | 0.568142 | 0.566372 | 0.563694 | 0.011895 | 21 |
| 1 | 0.022611 | 0.005421 | 0.374393 | 0.006517 | euclidean | 1 | distance | {'metric': 'euclidean', 'n_neighbors': 1, 'wei... | 0.54947 | 0.552212 | 0.582301 | 0.568142 | 0.566372 | 0.563694 | 0.011895 | 21 |
| 2 | 0.016816 | 0.002095 | 0.388486 | 0.010165 | euclidean | 3 | uniform | {'metric': 'euclidean', 'n_neighbors': 3, 'wei... | 0.574205 | 0.546903 | 0.60708 | 0.576991 | 0.60531 | 0.582095 | 0.022312 | 17 |
| 3 | 0.016576 | 0.003577 | 0.369897 | 0.011103 | euclidean | 3 | distance | {'metric': 'euclidean', 'n_neighbors': 3, 'wei... | 0.574205 | 0.546903 | 0.60708 | 0.576991 | 0.60531 | 0.582095 | 0.022312 | 17 |
| 4 | 0.013506 | 0.00043 | 0.381325 | 0.02368 | euclidean | 5 | uniform | {'metric': 'euclidean', 'n_neighbors': 5, 'wei... | 0.597173 | 0.576991 | 0.60354 | 0.584071 | 0.6 | 0.592357 | 0.010112 | 15 |
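
Because `cv_results_` loads cleanly into a DataFrame, ranking candidates is a one-line sort. The sketch below rebuilds three rows from the output above by hand, keeping only a few columns, rather than rerunning the search. The names `scores_subset` and `ranked` are introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the cv_results_ output above (subset of columns)
scores_subset = pd.DataFrame({
    'param_n_neighbors': [1, 3, 5],
    'param_weights': ['uniform', 'uniform', 'uniform'],
    'mean_test_score': [0.563694, 0.582095, 0.592357],
    'rank_test_score': [21, 17, 15],
})

# Rank 1 is the best candidate, so sort ascending
ranked = scores_subset.sort_values('rank_test_score')
print(ranked[['param_n_neighbors', 'mean_test_score', 'rank_test_score']])
```

With the full results, `scores.sort_values('rank_test_score').head()` would list the top candidates first.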


**Print out the set of hyperparameters that achieved the best score.**

In [17]:
knn_gridsearch.best_params_

{'metric': 'manhattan', 'n_neighbors': 21, 'weights': 'uniform'}

**Assign the best fit model (`best_estimator_`) to a variable and score it on the test data.**

Compare this model to the baseline accuracy (about 0.60) and your default KNN (about 0.61).

In [18]:
KNeighborsClassifier(metric='manhattan', n_neighbors=21, weights='uniform')

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
                     metric_params=None, n_jobs=None, n_neighbors=21, p=2,
                     weights='uniform')

In [19]:
best_knn = knn_gridsearch.best_estimator_
best_knn.score(Z_test, y_test)

0.6443736730360934

In [20]:
print('baseline:', np.mean(y_test))
print('default KNN:', neighbors.score(Z_test, y_test))

baseline: 0.5987261146496815
default KNN: 0.613588110403397


<a id='caution'></a>

## A Word of Caution on Grid Searching

---

Sklearn models often have many options/hyperparameters with many different possible values. It may be tempting to search over a wide variety of them. In general, this is not wise.

Remember that **gridsearch searches over all possible combinations of hyperparameters in the parameter dictionary!**

The KNN model class takes a wider range of options during instantiation than we have explored. Imagine that we had this as our parameter dictionary:

```python
parameter_grid = {
    'n_neighbors':range(1,151),
    'weights':['uniform','distance',custom_function],
    'algorithm':['ball_tree','kd_tree','brute','auto'],
    'leaf_size':range(1,152),
    'metric':['minkowski','euclidean'],
    'p':[1,2]
}
```

**How many different combinations will need to be tested?**

| Parameter | Potential Values | Unique Values |
| --- | ---| ---: |
| **n_neighbors** | int range 1-150 | 150 |
| **weights** | strs:  "uniform", "distance" or user defined function | 3 |
| **algorithm** | strs: "ball_tree", "kd_tree", "brute", "auto" | 4 |
| **leaf_size** | int range 1-151 | 151 |
| **metric** | str: "minkowski" or 'euclidean' type | 2 |
| **p** | int: 1=manhattan_distance, 2= euclidean_distance | 2 |
|| <br>_150 \* 3 \* 4 \* 151 \* 2 \* 2 = n combinations_ <br><br>| _1,087,200_ |

Over a million tests *before we even consider the number of cross-validation folds!*
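
The count can be checked with a few lines of Python. In this sketch `custom_function` is a placeholder standing in for a user-defined weighting function, and the 5-fold figure is an assumed setting used only to show how the number of fits scales.

```python
from math import prod

def custom_function(distances):
    """Placeholder for a user-defined weighting function (hypothetical)."""
    return distances

parameter_grid = {
    'n_neighbors': range(1, 151),
    'weights': ['uniform', 'distance', custom_function],
    'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
    'leaf_size': range(1, 152),
    'metric': ['minkowski', 'euclidean'],
    'p': [1, 2],
}

# Number of combinations = product of the number of options per hyperparameter
n_combinations = prod(len(values) for values in parameter_grid.values())
n_fits = n_combinations * 5  # assuming 5-fold cross-validation

print(f'{n_combinations:,} combinations, {n_fits:,} fits')
```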

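If we're not careful, gridsearching can quickly blow up, taking our time and machine with it. A lot of the hyperparameters we put in the dumb example above are either redundant or not useful. For instance, `metric='euclidean'` is the same as `metric='minkowski'` with `p=2`, and `algorithm` and `leaf_size` affect the speed of the neighbor search rather than its predictions.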

> **It is extremely important to understand what the hyperparameters do and think critically about what ranges are useful and relevant to your model!**


One way to survey a space of possible hyperparameters without testing every single combination is to use a randomized search, which evaluates only a fixed number (`n_iter`) of randomly sampled combinations.
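
Conceptually, the sampling step looks something like the sketch below. This is a plain-Python illustration of the idea, not sklearn's implementation, and the names `knn_grid`, `all_combinations`, and `sampled` are introduced here for illustration.

```python
import random
from itertools import product

knn_grid = {
    'n_neighbors': [1, 3, 5, 9, 15, 21],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}

# All 24 combinations on the grid, as dictionaries
names = list(knn_grid)
all_combinations = [dict(zip(names, values)) for values in product(*knn_grid.values())]

# A randomized search evaluates only a random subset of them
rng = random.Random(42)
sampled = rng.sample(all_combinations, 10)

print(len(all_combinations), 'possible;', len(sampled), 'sampled')
```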

In [21]:
from sklearn.model_selection import RandomizedSearchCV

In [22]:
knn_params = {
    'n_neighbors':[1,3,5,9,15,21],
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']
}

knn_randomsearch = RandomizedSearchCV(KNeighborsClassifier(), knn_params, cv=5, n_iter=10, verbose=1, n_jobs=2, random_state=42)

knn_randomsearch = knn_randomsearch.fit(Z_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    5.1s
[Parallel(n_jobs=2)]: Done  50 out of  50 | elapsed:    5.5s finished


In [23]:
knn_randomsearch.best_params_

{'weights': 'distance', 'n_neighbors': 21, 'metric': 'euclidean'}

In [24]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(Z_train, y_train)
model.score(Z_test, y_test)

0.6985138004246284

In [25]:
param_grid = {
    'solver': ['lbfgs'],
    'max_iter': [1000],
    'C': [0.01, 0.1, 1, 10.0, 100],
    'fit_intercept': [True, False],
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=3, verbose=1, n_jobs=-1)
grid.fit(Z_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  15 out of  30 | elapsed:    1.6s remaining:    1.6s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    3.0s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10.0, 100],
                         'fit_intercept': [True, False], 'max_iter': [1000],
                         'solver': ['lbfgs']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

In [26]:
grid.best_score_
grid.best_estimator_.score(Z_test, y_test)

0.7016985138004246

In [27]:
from sklearn.pipeline import make_pipeline 

pipe = make_pipeline(mapper, grid)
pipe.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    2.1s finished


Pipeline(memory=None,
         steps=[('dataframemapper',
                 DataFrameMapper(default=StandardScaler(copy=True,
                                                        with_mean=True,
                                                        with_std=True),
                                 df_out=False, features=[], input_df=False,
                                 sparse=False)),
                ('gridsearchcv',
                 GridSearchCV(cv=3, error_score='raise-deprecating',
                              estimator=LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,...
                                                           max_iter=100,
                                                      

In [28]:
pipe.score(X_test, y_test)

0.7016985138004246
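
The pipeline above wraps the grid search, so the scaler is fit once on all of the training data. A common alternative, sketched below on synthetic data, is to put the pipeline inside `GridSearchCV` so that scaling is refit within each cross-validation fold. The data, the step choices, and the `C` values are assumptions for illustration. Hyperparameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# make_pipeline names each step after its lowercased class name
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs', max_iter=1000))

# Step parameters are addressed as '<step name>__<parameter>'
pipe_grid = {'logisticregression__C': [0.01, 0.1, 1, 10.0, 100]}

pipe_search = GridSearchCV(scaled_lr, pipe_grid, cv=3)
pipe_search.fit(X_tr, y_tr)

print(pipe_search.best_params_)
print(pipe_search.score(X_te, y_te))
```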

## Conclusion

### Gridsearch searches model hyperparameters!
- Model parameter = coefficient, intercept
- Model hyperparameter = `n_neighbors`, `weights`, `metric`, etc.

Hyperparameters control model behavior.

### Gridsearch syntax
```python
param_grid = {
    "model_hyperparameter_1": ["params", "to", "test"],
    "model_hyperparameter_2": ["params", "to", "test"],
    "model_hyperparameter_3": ["params", "to", "test"],
}

grid = GridSearchCV(
    estimator, 
    param_grid, 
    cv = 5, 
    verbose = 1,
    return_train_score = True
)

grid.fit(X, y)
```

### Gridsearch Evaluation

- grid.best_score_
- grid.best_params_
- grid.best_estimator_
- grid.cv_results_

Be careful when searching hyperparameters!  

```python
parameter_grid = {
    'n_neighbors':range(1,151),
    'weights':['uniform','distance',custom_function],
    'algorithm':['ball_tree','kd_tree','brute','auto'],
    'leaf_size':range(1,152),
    'metric':['minkowski','euclidean'],
    'p':[1,2]
}
```
#### `150 * 3 * 4 * 151 * 2 * 2 = 1,087,200` combinations



<a id='practice'></a>

## Practice: Grid Search Regularization Penalties with Logistic Regression

---

Logistic regression models can also apply the Lasso and Ridge penalties. The `LogisticRegression` class takes these regularization-relevant hyperparameters:

| Argument | Description |
| --- | ---|
| **`penalty`** | `'l1'` for Lasso, `'l2'` for Ridge |
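| **`solver`** | Must support the chosen penalty. `'liblinear'` and `'saga'` support the Lasso penalty; `'lbfgs'` does not. |
| **`C`** | The inverse of the regularization strength, equivalent to `1./alpha`. Smaller values mean stronger regularization. |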

**You should:**
1. Fit and validate the accuracy of a default logistic regression on the basketball data.
2. Perform a gridsearch over different regularization strengths and Lasso and Ridge penalties.
3. Compare the accuracy on the test set of your optimized logistic regression to the baseline accuracy and the default model.
4. Look at the best parameters found. What was chosen? What does this suggest about our data?
5. Look at the (non-zero, if Lasso was selected as best) coefficients and associated predictors for your optimized model. What appear to be the most important predictors of winning the game?


In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
lr = LogisticRegression(solver='lbfgs', max_iter=200)

In [31]:
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=5, verbose=1)