<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Hyperparameters, GridSearch, and Pipelines

_Authors: Kiefer Katovich, David Yerrington, Matt Brems, Noelle Brown_

---

![](https://snag.gy/aYcCt2.jpg)

### Learning Objectives
- Describe what the terms hyperparameters, GridSearch, and pipeline mean.
- Apply `sklearn`'s `GridSearchCV` object.
- Use attributes of the GridSearch object.
- Describe the pitfalls of searching large hyperparameter spaces.
- Build pipelines.

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
# Read in data.
un_data = pd.read_csv('./data/UNdata.csv')

# Examine first five rows.
un_data.head()

Unnamed: 0,country,region,lifeMale,lifeFemale,infantMortality,GDPperCapita
0,Afghanistan,Asia,45.0,46.0,154,2848
1,Albania,Europe,68.0,74.0,32,863
2,Algeria,Africa,67.5,70.3,44,1531
3,Angola,Africa,44.9,48.1,124,355
4,Argentina,America,69.6,76.8,22,8055


## United Nations Data

- `country`: the name of the nation
- `region`: the region of the world (Africa, America, Asia, Europe, Oceania)
- `lifeMale`: the life expectancy of males
- `lifeFemale`: the life expectancy of females
- `infantMortality`: the infant mortality rate (generally reported per 1,000 live births)
- `GDPperCapita`: the Gross Domestic Product per person

In [3]:
un_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          188 non-null    object 
 1   region           188 non-null    object 
 2   lifeMale         188 non-null    float64
 3   lifeFemale       188 non-null    float64
 4   infantMortality  188 non-null    int64  
 5   GDPperCapita     188 non-null    int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 8.9+ KB


In [4]:
# Check for missing values.
un_data.isnull().sum()

country            0
region             0
lifeMale           0
lifeFemale         0
infantMortality    0
GDPperCapita       0
dtype: int64

In [17]:
# Set country to be the index.
un_data.set_index('country', inplace=True)

In [18]:
# Dummy region.
un_data_dums = pd.get_dummies(un_data, columns=['region'], drop_first=True)
un_data_dums.head()

Unnamed: 0_level_0,lifeMale,lifeFemale,infantMortality,GDPperCapita,females_are_strong_as_hell,region_America,region_Asia,region_Europe,region_Oceania
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Afghanistan,45.0,46.0,154,2848,0,0,1,0,0
Albania,68.0,74.0,32,863,1,0,0,1,0
Algeria,67.5,70.3,44,1531,0,0,0,0,0
Angola,44.9,48.1,124,355,0,0,0,0,0
Argentina,69.6,76.8,22,8055,1,1,0,0,0


<details><summary>What is our reference category for this dummy variable?</summary>

- Africa!
- There is no dummy variable for Africa in our data, meaning that all dummy variables would be interpreted **relative to Africa**.
</details>

### Create $Y$ variable

In [19]:
# Create a column with 1 if the female life expectancy is greater
# than the male life expectancy.
un_data_dums['females_are_strong_as_hell'] = (un_data_dums['lifeFemale'] > un_data_dums['lifeMale']).astype(int)
un_data_dums.head()

# The column name is a reference to the 
# Netflix series "The Unbreakable Kimmy Schmidt."

Unnamed: 0_level_0,lifeMale,lifeFemale,infantMortality,GDPperCapita,females_are_strong_as_hell,region_America,region_Asia,region_Europe,region_Oceania
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Afghanistan,45.0,46.0,154,2848,1,0,1,0,0
Albania,68.0,74.0,32,863,1,0,0,1,0
Algeria,67.5,70.3,44,1531,1,0,0,0,0
Angola,44.9,48.1,124,355,1,0,0,0,0
Argentina,69.6,76.8,22,8055,1,1,0,0,0


In [20]:
# What should we check next?
un_data_dums['females_are_strong_as_hell'].value_counts(normalize=True)

1    0.989362
0    0.010638
Name: females_are_strong_as_hell, dtype: float64

<details><summary>Do you have any concerns about the above?</summary>
    
- Our classes are severely unbalanced.
- We should check out our tools for handling unbalanced classes. (e.g. moving our classification threshold, implement stratified $k$-fold cross-validation)
- Given the relatively low sample size and the small number of the observations in the minority category here, it is unlikely that our model would be able to predict that a nation has a higher male life expectancy.
</details>

In [21]:
# Create a column with 1 if the female life expectancy is 5
# or more years longer than the male life expectancy.
un_data['females_are_strong_as_hell'] = (un_data['lifeFemale'] >= (un_data['lifeMale'] + 5)).astype(int)

# Check the thing we need to check!
un_data['females_are_strong_as_hell'].value_counts(normalize=True)

0    0.569149
1    0.430851
Name: females_are_strong_as_hell, dtype: float64

**We are interested in predicting whether or not the female life expectancy of a nation is at least five years greater than the male life expectancy.** This is a classification problem.

### Create the training and testing data

In [22]:
# Set up X and y.
X = un_data_dums.drop(['females_are_strong_as_hell', 'lifeMale', 'lifeFemale'], axis = 'columns')
y = un_data_dums['females_are_strong_as_hell']

In [23]:
# Split our data into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.33,
                                                    random_state = 42,
                                                    stratify = y) # Note the stratify argument here!

In [24]:
X_train

Unnamed: 0_level_0,infantMortality,GDPperCapita,region_America,region_Asia,region_Europe,region_Oceania
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Central.African.Rep,96,379,0,0,0,0
Brunei,9,16683,0,1,0,0
Czech.Republic,9,4450,0,0,1,0
Barbados,9,7173,1,0,0,0
Cape.Verde,41,994,0,0,0,0
...,...,...,...,...,...,...
United.Arab.Emirates,15,17690,0,1,0,0
Pakistan,74,504,0,1,0,0
Cameroon,58,627,0,0,0,0
Armenia,25,354,0,0,1,0


<details><summary>Before we fit a k-Nearest Neighbors model, what do we need to do? Why?</summary>
    
- Standardize our data!
- If we *don't* standardize our data, then features that have larger spreads (e.g. higher ranges or higher standard deviations) will have a disproportionate influence on our model.
- If all of your variables are already on the same scale, then scaling is not necessary.
</details>

In [25]:
# Instantiate.
sc = StandardScaler()

# Fit and transform.
X_train_sc = sc.fit_transform(X_train)

# Transform.
#X_test_sc = sc.transform(X_test)

## Fit the Default kNN

Below we fit a default `KNeighborsClassifier` to predict `y`. ([Here is the documentation.](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

<details><summary>What is the default number of neighbors used in kNN?</summary>
    
- 5.
</details>

In [26]:
# Instantiate.
knn = KNeighborsClassifier()

# Fit.
knn.fit(X_train_sc, y_train)

KNeighborsClassifier()

In [27]:
# Evaluate.
knn.score(X_train_sc, y_train)

0.992

<details><summary>What score is this?</summary>

- Accuracy.
</details>

In [None]:
# Evaluate against the baseline.


<details><summary>Is selecting k = 5 a good choice? Is it the best choice?</summary>

- We don't know!
- $k$ is a hyperparameter.
</details>

## What are "hyperparameters?"

Models often have built-in quantities that we can use to fine-tune our results. 
- What value of $k$ do we select?
- What distance metric do we select?
- Do we use LASSO or Ridge regularization?
- What value of $\alpha$ or $C$ do we use?

These are quantities our model **cannot** learn... **we must decide on these ourselves**!

> These are different from statistical parameters, which are quantities a model _can_ learn.

However, different values for hyperparameters can result in substantially different models. 
- Let's [visualize fits for different values of $k$](http://scott.fortmann-roe.com/docs/BiasVariance.html) in $k$-nearest neighbors.

<details><summary>We want to find the optimal values for our hyperparameters. How do you think we might do this?</summary>

- Try many different values of hyperparameters and see which ones perform the best on our data.
</details>

## Searching for the Best Hyperparameters

Our default kNN performs quite poorly on the test data. But what if we changed the number of neighbors? The weighting? The distance metric?

These are all hyperparameters of kNN. How would we do this manually? We would need to evaluate on the training data the set of hyperparameters that perform best, and then use this set of hyperparameters to fit the final model and score on the testing set.

**One method of searching for the optimal set of hyperparameters is called GridSearching.**

GridSearching gets its name from the fact that we are searching over a "grid" of hyperparameters. For example, imagine the `n_neighbors` hyperparameters as the columns and `weights` as the rows. This makes a grid. We check the accuracy for all combinations of hyperparameters on the grid.

![](./images/grid.jpg)

### Using `GridSearchCV`

This would be an annoying process to have to do manually. Luckily `sklearn` comes in handy:

```python
from sklearn.model_selection import GridSearchCV
```

The `GridSearchCV` has a handful of important arguments:

| Argument | Description |
| --- | ---|
| **`estimator`** | The sklearn instance of the model to fit on |
| **`param_grid`** | A dictionary where keys are hyperparameters for the model and values are lists of values to test |
| **`cv`** | The number of internal cross-validation folds to run for each set of hyperparameters |
| **`n_jobs`** | How many cores to use on your computer to run the folds (-1 means use all cores) |
| **`verbose`** | How much output to display (0 is none, 1 is limited, 2 is printouts for every internal fit) |


Below is an example for how one might set up the GridSearch for our kNN:

```python
knn_parameters = {
    'n_neighbors':[2,3],
    'weights':['uniform','distance'],
    'p':[1,2]
}

knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), knn_parameters, verbose=1)
knn_gridsearcher.fit(X_train, y_train)
```

**Try out the `sklearn` GridSearch below on the training data.** [You can find the GridSearchCV documentation here.](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [None]:
# Create dictionary of hyperparameters.
# The keys MUST match the names of the arguments!
knn_params = {
    'n_neighbors': range(1, 51, 10),
    'metric': ['euclidean', 'manhattan']
}

In [None]:
# Instantiate our GridSearchCV object.
knn_gridsearch = GridSearchCV(, # What is the model we want to fit?
                              , # What is the dictionary of hyperparameters?
                              , # What number of folds in CV will we use?
                              verbose=1)

In [None]:
# Fit the GridSearchCV object to the data
;

### Examining the Results of the GridSearch

Once the GridSearch has fit (this can take awhile!) we can pull out a variety of information and useful objects from the GridSearch object, stored as attributes:

| Property | Description |
| --- | ---|
| **`results.param_grid`** | Displays hyperparameters searched over. |
| **`results.best_score_`** | Best mean cross-validated score achieved. |
| **`results.best_estimator_`** | Reference to model with best score.  Is usable / callable. |
| **`results.best_params_`** | The hyperparameters that have been found to perform with the best score. |
| **`results.grid_scores_`** | Display score attributes with corresponding hyperparameters. | 

In [None]:
# Print out the score.
# from documentation: Mean cross-validated score of the best_estimator
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

knn_gridsearch.best_score_

In [None]:
# Print out the set of hyperparameters that achieved the best score.


In [None]:
# Evaluate the best fit model on the test data.


**Let's see everything!**

In [None]:
pd.DataFrame(knn_gridsearch.cv_results_).sort_values('rank_test_score').head()

In [None]:
gs_df = pd.DataFrame(knn_gridsearch.cv_results_)
gs_df = gs_df[gs_df['param_metric'] == 'euclidean']
gs_df.plot(x='param_n_neighbors', y='mean_test_score');

## A Word of Caution on GridSearching

`sklearn` models often have many hyperparameters with many different possible values. It may be tempting to search over a wide variety of them. In general, this is not wise.

<details><summary>Why not?</summary>

- Remember that GridSearch searches over **all possible combinations of hyperparameters in the parameter dictionary!**

Imagine that we had this as our parameter dictionary:

```python
parameter_grid = {
    'n_neighbors': range(1, 151),
    'weights': ['uniform', 'distance', custom_function],
    'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
    'leaf_size': range(1, 152),
    'metric': ['minkowski', 'euclidean'],
    'p': [1, 2]
}
```

**How many different combinations will need to be tested?**

| Parameter | Number of Chosen Values |
| --- | --- |
| **n_neighbors** | 150 |
| **weights** | 3 |
| **algorithm** | 4 |
| **leaf_size** | 151 |
| **metric** | 2 |
| **p** | 2 |
| <br>_150 \* 3 \* 4 \* 151 \* 2 \* 2 = n combinations_ <br><br>| _1,087,200_ |

If we select `cv = 5`, we would fit 1,087,200 models on five folds, meaning we fit 5,436,000 models!

If you're not careful, GridSearching can quickly scale out of hand computationally.

> **It is extremely important to understand what the hyperparameters do and think critically about what ranges are useful and relevant to your model!**
</details>

## A brief detour: estimators and transformers.
**Estimators** and **transformers** are two types of classes in `sklearn`.

We've seen several examples of each so far.

### Scikit-Learn Estimators
Estimators are essentially _models_. They fit this format:

```python
# Instantiate.
model = LinearRegression(params)
# Fit.
model.fit(X_train, y_train)
# Predict.
y_pred = model.predict(X_test)
```

Estimators have a **fit** and **predict** method.

### Scikit-Learn Transformers
Transformers are not models. They transform your data using similar syntax to estimators. They work like this:

```python
# Instantiate.
ss = StandardScaler(params)
# Fit.
ss.fit(X_train)
# Transform.
X_transformed = ss.transform(X_train)
```

Instead of `fit` and `predict`, they have **fit** and **transform** methods. In fact, since you fit and transform together so often, they have a shortcut:

```python
ss = StandardScaler(params)
X_transformed = ss.fit_transform(X_train)
```

We've seen a few transformers, including `StandardScaler()` and `PolynomialFeatures()`. There's also `OneHotEncoder()` for dummy encoding and `LabelEncoder()` for factorizing variables. Later we'll see `PCA()`, which is also a transformer.

### Why is this relevant?

Check out the [StandardScaler documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Transformers may have hyperparameters as well - **but we can't GridSearch over a transformer**! There's no way to get an accuracy (or other) score from just a transformer, since a transformer can't predict!


![](./images/grid.jpg)

In addition, the acronym ETL, meaning "extract, transform, load," is a very common one in data science. When we gather data from one or more places, there might be **a lot** of preprocessing going on.

Oftentimes, we'll want to apply several transformers to a dataset, *then* build a model. 
- If you do all of these preprocessing steps independently, your code can be messy and it'll be prone to errors!
- It can be challenging to consistently recreate this process.

## Pipelines
![](./images/pipe.png)

Pipelines will allow us to do two things:
1. Chain many transformers together before ending in an estimator.
2. Allow us to GridSearch over a transformer's hyperparameters.

In [None]:
# Instantiate a StandardScaler + kNN pipeline.
pipe =

In [None]:
# Fit.


In [None]:
# Evaluate.


In [None]:
# Get params - yes, you can GridSearchCV over these!
# Notice the naming convention of pipe arguments.


In [None]:
# Instantiate pipeline object.
pipe_2 = Pipeline([
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [None]:
# Define dictionary of hyperparameters.
pipe_2_params = {}

In [None]:
# Instantiate our GridSearchCV object.
pipe_2_gridsearch = GridSearchCV(pipe_2, # What is the model we want to fit?
                                 pipe_2_params, # What is the dictionary of hyperparameters?
                                 cv=5, # What number of folds in CV will we use?
                                 verbose=1)

In [None]:
# Fit the GridSearchCV object to the data.
pipe_2_gridsearch.fit(X_train, y_train);

In [None]:
# Print out best score.
# from documentation: Mean cross-validated score of the best_estimator
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

pipe_2_gridsearch.best_score_

In [None]:
# Print out best estimator.
pipe_2_gridsearch.best_estimator_

In [None]:
# Evaluate the best model on the test data.


<details><summary>What would you conclude from this output?</summary>
    
- Our model performs slightly better when cross-validated on our training data than on our testing data, but the difference is pretty small.
- There may be slight overfitting.
- GridSearching gets us the best performing model on the training set; we always have to take care to not overfit!
</details>

## Interview Question

<details><summary>What is the difference between hyperparameters and statistical parameters?</summary>
    
- Statistical parameters are quantities that a model can learn or estimate. Examples include $\beta_0$ and $\beta_1$ in a linear model.
- Hyperparameters are quantities our model cannot learn, but affect the fit of our model. Examples include $k$ in $k$-nearest neighbors and $alpha$ in regularization.
</details>

## (BONUS) RandomizedSearchCV + Visualizing Results

When you're exploring a particularly high number of different hyperparameters, it can be advantageous to do a randomized search instead of a GridSearch.

`from sklearn.model_selection import RandomizedSearchCV`

A good blog post on GridSearch, RandomizedSearch, and visualizing the outputs of these methods [can be found here](https://towardsdatascience.com/using-3d-visualizations-to-tune-hyperparameters-of-ml-models-with-python-ba2885eab2e9).

Another good example on RandomizedSearch [here](https://github.com/justmarkham/scikit-learn-tips/blob/master/notebooks/17_randomized_search.ipynb).

## (BONUS) make_pipeline

`make_pipeline` does the same thing as `pipeline`, but does not require you to name your steps!

`from sklearn.pipeline import make_pipeline`

See an explanation of the difference between the two [here](https://github.com/justmarkham/scikit-learn-tips/blob/master/notebooks/12_pipeline_vs_make_pipeline.ipynb) and see an example of it used [here](https://github.com/justmarkham/scikit-learn-tips/blob/master/notebooks/08_pipeline.ipynb).

## (BONUS) Named Steps

GridSearch not giving you all of the information you need? Want to see what is happening in the intermediate steps in a pipeline? Use the `named_steps` attribute! An example of how to use this can be found [here](https://github.com/justmarkham/scikit-learn-tips/blob/master/notebooks/13_examine_pipeline_steps.ipynb).