# Grid Search

_By Jeff Hale (mostly)_

---

## Learning Objectives
By the end of this lesson students will be able to:

- Explain when to use `GridSearchCV` with a pipeline
- Use `GridSearchCV` with a pipeline to find optimal hyperparameters


---

# GridSearch with a Pipeline

Grid searching is a systematic way to tune hyperparameters: it tries every combination of the values you list and keeps the combination with the best cross-validated score.

Hyperparameters are the arguments you choose for a model before fitting it; they are not learned from the data. You tune them to improve model performance. For example, the most important hyperparameter for a KNN model is `n_neighbors` (the number of nearest neighbors used to make each prediction).

Pipelines are a convenient way to chain multiple preprocessing steps and a model into a single estimator.

Put them together for an awesome chunk of your data science workflow :)

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np


from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge


In [2]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [3]:
# inspect 
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2


In [4]:
# break into X and y
X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

In [5]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33


In [6]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [7]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [8]:
X_train.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
335,0.03961,0.0,5.19,0,0.515,6.037,34.5,5.9853,5,224,20.2,8.01
142,3.32105,0.0,19.58,1,0.871,5.403,100.0,1.3216,5,403,14.7,26.82
170,1.20742,0.0,19.58,0,0.605,5.875,94.6,2.4259,5,403,14.7,14.43
241,0.10612,30.0,4.93,0,0.428,6.095,65.1,6.3361,6,300,16.6,12.4
379,17.8667,0.0,18.1,0,0.671,6.223,100.0,1.3861,24,666,20.2,21.78


## GridSearch Syntax

`GridSearchCV` accepts an estimator (here, a `Pipeline` object) and a parameter grid.

The param grid is a dictionary.

Each key is the string name of a pipeline step, followed by a dunder `__` (double underscore) and the name of the argument to tune for that step.

Each value is an iterable of candidate values to search over (generally a list or a range-style object).

What's an iterable? Any object Python can loop over with `for`, such as a list, a tuple, or a `range`.
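One way to check which keys a param grid may use is to ask the pipeline itself. The sketch below builds the same kind of `StandardScaler` plus `Lasso` pipeline used in this lesson and lists its `step__argument` names. The variable names `demo_pipeline`, `step_names`, and `tunable` are illustrative only.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), Lasso())

# make_pipeline names each step after its class, lowercased
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)  # ['standardscaler', 'lasso']

# get_params() lists every argument; the ones containing '__'
# belong to a step and can be used as keys in a param grid
tunable = [key for key in demo_pipeline.get_params() if '__' in key]
print(tunable)

# A grid that tunes an argument from each step
demo_params = {
    'standardscaler__with_mean': [True, False],
    'lasso__alpha': [.5, 1],
}
```

If a key in the grid does not match one of these names, `GridSearchCV` raises an error when you call `fit`.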

### Make a pipeline

In [9]:
pipeline = make_pipeline(StandardScaler(), Lasso())
pipeline

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lasso',
                 Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=1000, normalize=False, positive=False,
                       precompute=False, random_state=None, selection='cyclic',
                       tol=0.0001, warm_start=False))],
         verbose=False)

### Set up a param grid with the following:
    'lasso__alpha': [.5, 1]

In [10]:
params = {
    'lasso__alpha': [.5, 1]
}

You can also specify the number of cross-validation folds with the `cv` argument. The default is 5.
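As a small sketch of the `cv` argument, the example below runs the same two-candidate grid with 3 folds instead of the default 5. It uses synthetic data from `make_regression` rather than the lesson's dataset, and the names `X_demo`, `y_demo`, and `gs_cv3` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption, not the lesson's data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

demo_pipeline = make_pipeline(StandardScaler(), Lasso())
demo_params = {'lasso__alpha': [.5, 1]}

# cv=3 means each candidate is fit and scored on 3 folds
gs_cv3 = GridSearchCV(demo_pipeline, param_grid=demo_params, cv=3)
gs_cv3.fit(X_demo, y_demo)

# total cross-validation fits = folds * candidates
n_candidates = len(gs_cv3.cv_results_['params'])
n_fits = gs_cv3.n_splits_ * n_candidates
print(gs_cv3.n_splits_, n_candidates, n_fits)
```

More folds give a steadier estimate of each candidate's score but multiply the number of fits, so the search takes longer.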

### instantiate our gridsearch with our pipe and params

In [11]:
gs = GridSearchCV(pipeline, param_grid=params, verbose=1)

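We use this the same way as any other model, calling `fit` and `score` as usual. After the search, `GridSearchCV` refits the pipeline on the full training set with the best hyperparameters it found (the default `refit=True`), so `score` and `predict` use that refit pipeline.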
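To make the link between the search object and the winning pipeline concrete, here is a minimal sketch on synthetic data (an assumption, not the lesson's dataset). It checks that predicting and scoring through the fitted `GridSearchCV` gives the same results as going through its `best_estimator_` directly. Names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_regression(n_samples=150, n_features=6,
                                 noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

gs_demo = GridSearchCV(make_pipeline(StandardScaler(), Lasso()),
                       param_grid={'lasso__alpha': [.5, 1]})
gs_demo.fit(X_tr, y_tr)

# With the default refit=True, these calls are delegated to best_estimator_
same_preds = np.allclose(gs_demo.predict(X_te),
                         gs_demo.best_estimator_.predict(X_te))
same_score = np.isclose(gs_demo.score(X_te, y_te),
                        gs_demo.best_estimator_.score(X_te, y_te))
print(same_preds, same_score)
```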

In [12]:
# fit on the training data
gs.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('lasso',
                                        Lasso(alpha=1.0, copy_X=True,
                                              fit_intercept=True, max_iter=1000,
                                              normalize=False, positive=False,
                                              precompute=False,
                                              random_state=None,
                                              selection='cyclic', tol=0.0001,
                                              warm_start=False))],
                                verbose=False),
             iid='deprecated', n_jobs=None,
    

#### score the training data

In [13]:
gs.score(X_train, y_train)

0.7087967116113545

#### score the test data

In [16]:
gs.score(X_test, y_test)

0.595133419439353

So what are our best parameters?

#### look at `.best_params_`

In [25]:
gs.best_params_

{'lasso__alpha': 0.5}
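`best_params_` only reports the winner. To see how every candidate did, a fitted `GridSearchCV` also stores per-candidate results in `cv_results_`. The sketch below loads a few of those columns into a DataFrame; it uses synthetic data and the illustrative names `gs_demo` and `results`, so the numbers will not match the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=150, n_features=6,
                                 noise=15.0, random_state=2)

gs_demo = GridSearchCV(make_pipeline(StandardScaler(), Lasso()),
                       param_grid={'lasso__alpha': [.5, 1]})
gs_demo.fit(X_demo, y_demo)

# One row per candidate; scores are averaged over the CV folds
results = pd.DataFrame(gs_demo.cv_results_)[
    ['param_lasso__alpha', 'mean_test_score',
     'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))

# The rank-1 row corresponds to best_params_ and best_score_
best_row = results.loc[results['rank_test_score'].idxmin()]
```

Comparing `mean_test_score` across rows shows whether the best candidate won clearly or only by a small margin.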

We use `best_estimator_` to access the `Pipeline` that was refit with our `best_params_`.

Within `best_estimator_` there is a dictionary-like attribute called `named_steps`. We can use the step names as keys to access the individual steps in our `Pipeline`. This is where we go to inspect the fitted transformers and model, including their parameters and learned attributes.

In [26]:
gs.best_estimator_

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lasso',
                 Lasso(alpha=0.5, copy_X=True, fit_intercept=True,
                       max_iter=1000, normalize=False, positive=False,
                       precompute=False, random_state=None, selection='cyclic',
                       tol=0.0001, warm_start=False))],
         verbose=False)

Here our `Lasso` uses an alpha of 0.5, the best value from the grid. We can look at `coef_` to see how our features are weighted.

#### look at the lasso coefficients

In [29]:
gs.best_estimator_.named_steps['lasso'].coef_

array([-0.16217369,  0.        , -0.        ,  0.        , -0.        ,
        3.31843031, -0.        , -0.25424356, -0.        , -0.10441123,
       -1.54716117, -3.86902367])

In [30]:
list(zip(X.columns, gs.best_estimator_.named_steps['lasso'].coef_))

[('CRIM', -0.1621736906849839),
 (' ZN ', 0.0),
 ('INDUS ', -0.0),
 ('CHAS', 0.0),
 ('NOX', -0.0),
 ('RM', 3.3184303116728264),
 ('AGE', -0.0),
 ('DIS', -0.2542435575465667),
 ('RAD', -0.0),
 ('TAX', -0.10441123204401798),
 ('PTRATIO', -1.547161171608939),
 ('LSTAT', -3.869023670138737)]

The cell above pairs each original column name with its final coefficient (beta). Lasso shrank several coefficients to exactly zero, which effectively drops those features from the model.
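An alternative to `zip` is to put the coefficients in a pandas `Series` indexed by column name, which makes sorting and filtering easier. This is a sketch on synthetic data with made-up column names (`feature_0`, `feature_1`, ...); `pipe_demo`, `coefs`, and `kept` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical column names
X_arr, y_demo = make_regression(n_samples=200, n_features=6,
                                n_informative=2, noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

pipe_demo = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
pipe_demo.fit(X_demo, y_demo)

# Label each coefficient with the column it belongs to
coefs = pd.Series(pipe_demo.named_steps['lasso'].coef_, index=X_demo.columns)

# Order by absolute size so the most influential features come first
coefs_sorted = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs_sorted)

# Features Lasso did not shrink all the way to zero
kept = coefs[coefs != 0]
```

This alignment only holds when no step in the pipeline adds, drops, or reorders columns. `StandardScaler` keeps them as they are, so the column names still line up with `coef_`.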

# Machine Learning Steps

Once you have X and y set:

- Split into training and test (holdout) sets.
- Create a pipeline for preprocessing and the model you want to use.
- Create your parameter grid to search over
- Create a GridSearchCV object and pass it the pipeline object and parameter grid
- Fit and score the GridSearchCV object
- Inspect and iterate!
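The steps above can be sketched end to end in one short script. This version uses synthetic data and a `Ridge` model purely for illustration; the grid values and the names `gs_ridge`, `train_r2`, and `test_r2` are assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X and y (synthetic stand-ins for a real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=42)

# 1. Split into training and test (holdout) sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# 2. Create a pipeline for preprocessing and the model
pipe_ridge = make_pipeline(StandardScaler(), Ridge())

# 3. Create the parameter grid to search over
ridge_params = {'ridge__alpha': [0.1, 1, 10]}

# 4. Create the GridSearchCV object from the pipeline and grid
gs_ridge = GridSearchCV(pipe_ridge, param_grid=ridge_params, cv=5)

# 5. Fit on the training data, then score on both sets
gs_ridge.fit(X_tr, y_tr)
train_r2 = gs_ridge.score(X_tr, y_tr)
test_r2 = gs_ridge.score(X_te, y_te)

# 6. Inspect the result before deciding what to try next
print(gs_ridge.best_params_, round(train_r2, 3), round(test_r2, 3))
```

Because the scaler lives inside the pipeline, it is re-fit on the training portion of each fold, so the validation folds never influence the scaling.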

## Titanic Pipeline with GridSearch

Read in titanic data from seaborn

In [37]:
df_titanic = sns.load_dataset('titanic')
df_titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [38]:
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


## Split into X and y

Let's use `survived` for y and `sex` and `class` for X.

In [39]:
X = df_titanic[['sex', 'class']]
y = df_titanic['survived']

In [40]:
X.head()

Unnamed: 0,sex,class
0,male,Third
1,female,First
2,female,Third
3,female,First
4,male,Third


In [41]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

## Split into training and test sets

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [66]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

In [74]:
pipe = make_pipeline(OneHotEncoder(handle_unknown='ignore'), KNeighborsClassifier())
pipe

Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(categories='auto', drop=None,
                               dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', sparse=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)

## Use KNN with a pipeline and GridSearchCV to find the best value of K

In [75]:
params = dict(kneighborsclassifier__n_neighbors=range(1,40))
params

{'kneighborsclassifier__n_neighbors': range(1, 40)}

In [76]:
gs_titanic = GridSearchCV(pipe, params)

In [77]:
gs_titanic.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('onehotencoder',
                                        OneHotEncoder(categories='auto',
                                                      drop=None,
                                                      dtype=<class 'numpy.float64'>,
                                                      handle_unknown='ignore',
                                                      sparse=True)),
                                       ('kneighborsclassifier',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                  

In [78]:
gs_titanic.score(X_test, y_test)

0.7937219730941704

#### Get the best params

In [79]:
gs_titanic.best_params_

{'kneighborsclassifier__n_neighbors': 15}

## Create a baseline model and score it.

Find the accuracy. 
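One reasonable reading of this exercise is a baseline that always predicts the most common class, which scikit-learn provides as `DummyClassifier(strategy='most_frequent')`. The sketch below uses tiny made-up label lists instead of the Titanic split, so every `toy` variable is hypothetical.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration only (not the Titanic data)
y_toy_train = pd.Series([0, 0, 0, 0, 1, 1, 0, 1])
y_toy_test = pd.Series([0, 1, 0, 0, 1])

# The baseline ignores the features, so a constant column is enough
X_toy_train = pd.DataFrame({'unused': [0] * len(y_toy_train)})
X_toy_test = pd.DataFrame({'unused': [0] * len(y_toy_test)})

# Always predict the majority class seen during fit (here, 0)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy_train, y_toy_train)

# For classifiers, score() returns accuracy
baseline_accuracy = baseline.score(X_toy_test, y_toy_test)
print(baseline_accuracy)  # 0.6, since 3 of the 5 test labels are 0
```

In the notebook, fitting the same baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` gives the accuracy to compare against the grid-searched KNN test score.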

# Summary

You've seen `GridSearchCV` with pipelines.

## Check for understanding

- Why would you want to use `GridSearchCV` with a pipeline?
- What do you pass `GridSearchCV` if you are using a pipeline?
- How do you specify the parameter grid?


`GridSearchCV` is an extremely powerful tool for your toolkit! 🛠
