# Grid Search

_By Jeff Hale (mostly)_

---

## Learning Objectives
By the end of this lesson, students will be able to:

- Explain when to use `GridSearchCV` with a pipeline
- Use `GridSearchCV` with a pipeline to find optimal hyperparameters


---

# GridSearch with a Pipeline

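Grid searching is a straightforward, exhaustive way to tune hyperparameters: it tries every combination of the values you list and uses cross-validation to pick the combination that scores best.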

Hyperparameters are the arguments you choose for a model before fitting it; they are set by you, not learned from the data. You tune them to improve model performance. For example, a key hyperparameter for a KNN model is `n_neighbors` (the number of nearest neighbors used to make each prediction).
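As a small illustration (not part of the original lesson code), hyperparameters are passed when you instantiate an estimator, and `get_params()` lists their current values. The value `n_neighbors=3` is an arbitrary choice for this sketch.

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is a hyperparameter: we choose it, the model does not learn it
knn = KNeighborsClassifier(n_neighbors=3)  # 3 is an arbitrary example value

# get_params() returns a dict of every hyperparameter and its current value
knn_params = knn.get_params()
print(knn_params['n_neighbors'])
```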

Pipelines are a clean way to chain multiple preprocessing steps and a model into a single estimator, so each step is fit only on the training data.

Put them together for an awesome chunk of your data science workflow :)

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np


from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge


In [None]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [None]:
# inspect 
boston.head()

In [None]:
# break into X and y
X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

In [None]:
X.head(2)

In [None]:
y.head(2)

#### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [None]:
X_train.head(2)

## GridSearchCV with a pipeline: syntax

`GridSearchCV` accepts a `Pipeline` object as its estimator, along with a parameter grid.

The param grid is a dictionary. 

Each key is the string name of a pipeline step, followed by a dunder `__` (double underscore), followed by the argument name for that step. With `make_pipeline`, the step name is the lowercased class name (for example, `lasso` for `Lasso`).

You then provide an iterable to search over (generally a list or a range-style object).

What's an iterable? Any object Python can loop over, such as a list, a tuple, a `range`, or a NumPy array.
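The naming rule can be checked directly. The sketch below builds a pipeline, prints its step names, and writes a grid whose keys follow the `step__argument` pattern. The `np.logspace` range and the `with_mean` entry are illustrative assumptions, not values from the lesson.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), Lasso())
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'lasso']

# keys are '<step name>__<argument name>'; values are iterables to search over
param_grid = {
    'lasso__alpha': np.logspace(-2, 1, 4),       # example range: 0.01, 0.1, 1, 10
    'standardscaler__with_mean': [True, False],  # a plain list works too
}

# every key should be a valid parameter name of the pipeline
valid_keys = all(key in pipe.get_params() for key in param_grid)
print(valid_keys)
```

If a key is misspelled, `GridSearchCV` raises an error when fitting, so checking the keys against `pipe.get_params()` first can save a debugging round trip.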

### Make a pipeline

### Set up a param grid with the following:
    'lasso__alpha': [.5, 1]

You can also specify the number of folds using `cv`. Default is 5.

### Instantiate our GridSearchCV object with our pipe and param grid

We use the `GridSearchCV` object the same way as any other model, calling `fit` and `score` as usual. After fitting, `score` and `predict` use the model refit with the hyperparameters that gave the best cross-validated results.

#### Fit our gs object on the training data

#### Score the training data

#### Score the test data

So what are our best parameters?

#### look at `.best_params_`
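The steps above (make a pipeline, set up the grid, instantiate, fit, score, and inspect `.best_params_`) can be sketched as follows. To keep the example self-contained, it uses synthetic data from `make_regression` rather than the Boston data; the names `X_demo`, `y_demo`, `X_tr`, and so on are placeholders for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption for this sketch, not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10,
                                 random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# make a pipeline: scale the features, then fit a lasso regression
pipe = make_pipeline(StandardScaler(), Lasso())

# param grid from the lesson: the 'lasso' step's alpha argument
params = {'lasso__alpha': [.5, 1]}

# instantiate the grid search; cv=5 is the default, written out for clarity
gs = GridSearchCV(pipe, param_grid=params, cv=5)

# fit on the training data: each alpha is cross-validated,
# then the best one is refit on all of the training data
gs.fit(X_tr, y_tr)

# score (R^2 for a regressor) on the training and test data
train_score = gs.score(X_tr, y_tr)
test_score = gs.score(X_te, y_te)
print(train_score, test_score)

# the winning hyperparameter combination
print(gs.best_params_)
```

In the lesson itself, you would pass `X_train`, `y_train`, `X_test`, and `y_test` from the train test split instead of the synthetic arrays.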

Note that we'll use our `best_estimator_` to access the `Pipeline` that was fit with our `best_params_`.


We can look at the `coef_` to see how our features are weighted.

#### look at the lasso coefficients

The following code uses `best_estimator_` and `coef_` to align our original columns with our final betas (coefficients).
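A minimal sketch of that alignment, assuming a small synthetic DataFrame with hypothetical column names (`feat_a`, `feat_b`, `feat_c`) in place of the Boston data:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data with hypothetical column names
X_arr, y_arr = make_regression(n_samples=100, n_features=3, noise=5,
                               random_state=0)
X_df = pd.DataFrame(X_arr, columns=['feat_a', 'feat_b', 'feat_c'])

gs = GridSearchCV(make_pipeline(StandardScaler(), Lasso()),
                  param_grid={'lasso__alpha': [.5, 1]})
gs.fit(X_df, y_arr)

# best_estimator_ is the refit Pipeline; named_steps gives access to each step
best_lasso = gs.best_estimator_.named_steps['lasso']

# pair each original column with its fitted coefficient (beta)
betas = pd.Series(best_lasso.coef_, index=X_df.columns)
print(betas)
```

This pairing works because `StandardScaler` keeps the columns in their original order and count. A step that changes the columns, such as `OneHotEncoder`, would need its output feature names instead.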

# Machine Learning Steps

After you have X and y set:

- Split into training and test (holdout) sets.
- Create a pipeline for preprocessing and the model you want to use.
- Create your parameter grid to search over
- Create a GridSearchCV object and pass it the pipeline object and parameter grid
- Fit and score the GridSearchCV object
- Inspect and iterate!

## Titanic Pipeline with GridSearch

Read in the Titanic data from seaborn.

In [None]:
df_titanic = sns.load_dataset('titanic')
df_titanic.head()

In [None]:
df_titanic.info()

## Split into X and y

Let's use `survived` for y and `sex` and `class` for X.

In [None]:
X = df_titanic[['sex', 'class']]
y = df_titanic['survived']

In [None]:
X.head()

In [None]:
y.head()

## Split into training and test sets

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

### Create a pipeline with make_pipeline

The steps are OneHotEncoder and KNN.

### Use GridSearchCV with the pipeline to find the best value of K

#### Make the parameter dictionary

In [None]:
params = ...

#### Instantiate the GridSearchCV object and pass it the pipeline and parameter grid

#### Fit 

#### Score on accuracy

#### What's the best number of neighbors?
Get the best params.
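One way the Titanic steps above could look is sketched below. To stay self-contained, it uses a small made-up DataFrame (`toy`) with the same column names as the seaborn data; the rows and the candidate values `[1, 3, 5]` for K are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# made-up stand-in rows (not the real Titanic data)
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male',
            'male', 'female', 'male', 'female'] * 5,
    'class': ['Third', 'First', 'Second', 'First',
              'Third', 'Third', 'Second', 'First'] * 5,
    'survived': [0, 1, 1, 0, 0, 1, 0, 1] * 5,
})
X_toy = toy[['sex', 'class']]
y_toy = toy['survived']

# split into training and test sets, keeping class balance similar in both
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=123)

# pipeline: one-hot encode the categorical columns, then KNN
pipe = make_pipeline(OneHotEncoder(handle_unknown='ignore'),
                     KNeighborsClassifier())

# key = lowercased class name + '__' + argument name
params = {'kneighborsclassifier__n_neighbors': [1, 3, 5]}

gs = GridSearchCV(pipe, param_grid=params, scoring='accuracy')
gs.fit(X_tr, y_tr)

# accuracy on the held-out test set, and the best K found
test_acc = gs.score(X_te, y_te)
best_k = gs.best_params_['kneighborsclassifier__n_neighbors']
print(test_acc, best_k)
```

With the real data, `X` and `y` from the cells above would replace `X_toy` and `y_toy`, and a wider range of K values would usually be worth searching.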

## Create a baseline model and score it.

Find the accuracy of the baseline model
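For a classifier, a common baseline is to always predict the most frequent class, so the baseline accuracy is that class's share of the data. The sketch below shows two equivalent ways to compute it on a short made-up target series (`y_toy`, a placeholder rather than the real `survived` column).

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# made-up target values for illustration: six 0s and four 1s
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name='survived')

# baseline accuracy = proportion of the most common class
baseline_acc = y_toy.value_counts(normalize=True).max()
print(baseline_acc)  # 0.6

# same idea with sklearn: DummyClassifier ignores the features entirely
X_placeholder = np.zeros((len(y_toy), 1))
dummy = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_toy)
dummy_acc = dummy.score(X_placeholder, y_toy)
print(dummy_acc)  # 0.6
```

A tuned model is only useful if its test accuracy beats this number.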

# Summary

You've seen `GridSearchCV` with pipelines.

## Check for understanding

- Why would you want to use `GridSearchCV` with a pipeline?
- What do you pass `GridSearchCV` if you are using a pipeline?
- How do you specify the parameter grid?


`GridSearchCV` is an extremely powerful tool for your toolkit! 🛠
