# Grid Search

_By Jeff Hale (mostly)_

---

## Learning Objectives
By the end of this lesson students will be able to:

- Explain when to use `GridSearchCV` with a pipeline
- Use `GridSearchCV` with a pipeline to find optimal hyperparameters


---

# GridSearch with a Pipeline

Grid searching is a simple, exhaustive way to optimize hyperparameters: you list the values to try, and every combination is evaluated with cross-validation.

Hyperparameters are the arguments you choose for a model, as opposed to the values the model learns from the data. You tune them to improve model performance. For example, the most important hyperparameter for a KNN model is `n_neighbors` (the number of nearest neighbors to include in the model).
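As a small illustration of what "an argument you choose" means, the sketch below builds a KNN regressor and reads its hyperparameters back. The choice of `KNeighborsRegressor` and the values 3 and 7 are assumptions made for the example, not part of the lesson's data.

```python
from sklearn.neighbors import KNeighborsRegressor

# n_neighbors is a hyperparameter: we choose it before fitting.
knn = KNeighborsRegressor(n_neighbors=3)

# get_params() returns a dict of every hyperparameter and its current value.
knn_params = knn.get_params()
print(knn_params["n_neighbors"])

# set_params() changes a hyperparameter on an existing estimator.
# A grid search does this for you, once per candidate value.
knn.set_params(n_neighbors=7)
print(knn.n_neighbors)
```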

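Pipelines are a convenient way to chain multiple preprocessing steps with a model. Inside a grid search they also ensure that each preprocessing step is fit only on the training folds, which avoids leaking information from the validation fold.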

Put them together for an awesome chunk of your data science workflow :)

In [3]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge

In [12]:
# read in the data
boston = pd.read_csv('../data/boston_data.csv')

In [13]:
# inspect 
boston.head()

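| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |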


In [14]:
# break into X and y
X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

In [15]:
X.head()

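| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 |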


In [16]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [17]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [18]:
X_train.head()

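| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | 0.03961 | 0.0 | 5.19 | 0 | 0.515 | 6.037 | 34.5 | 5.9853 | 5 | 224 | 20.2 | 8.01 |
| 142 | 3.32105 | 0.0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 |
| 170 | 1.20742 | 0.0 | 19.58 | 0 | 0.605 | 5.875 | 94.6 | 2.4259 | 5 | 403 | 14.7 | 14.43 |
| 241 | 0.10612 | 30.0 | 4.93 | 0 | 0.428 | 6.095 | 65.1 | 6.3361 | 6 | 300 | 16.6 | 12.4 |
| 379 | 17.8667 | 0.0 | 18.1 | 0 | 0.671 | 6.223 | 100.0 | 1.3861 | 24 | 666 | 20.2 | 21.78 |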


## GridSearch Syntax

`GridSearchCV` accepts an estimator, which can be a `Pipeline` object, and a parameter grid.

The param grid is a dictionary. 

Each key is the string name of a pipeline step, followed by a dunder `__` (double underscore) and the argument name for that step. For example, `lasso__alpha` targets the `alpha` argument of the step named `lasso`.

You then provide an iterable to search over (generally a list or a range-style object).

What's an iterable? Something Python can iterate over.
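To see how a param grid turns into candidates, the sketch below expands a small grid with scikit-learn's `ParameterGrid` helper. The second key, `lasso__max_iter`, and its values are hypothetical additions for illustration; the lesson's own grid only searches `lasso__alpha`.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: two values for alpha, two for max_iter.
example_grid = {
    "lasso__alpha": [0.5, 1],
    "lasso__max_iter": [1000, 5000],
}

# ParameterGrid yields one dict per combination of values.
candidates = list(ParameterGrid(example_grid))

for candidate in candidates:
    print(candidate)

# 2 values x 2 values = 4 candidates to evaluate.
print(len(candidates))
```

Each extra key multiplies the number of candidates, so grids grow quickly as you add hyperparameters.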

### Make a pipeline

In [19]:
pipeline = make_pipeline(StandardScaler(), Lasso())
pipeline

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lasso',
                 Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=1000, normalize=False, positive=False,
                       precompute=False, random_state=None, selection='cyclic',
                       tol=0.0001, warm_start=False))],
         verbose=False)

### Set up a param grid with the following:
    'lasso__alpha': [.5, 1]

In [20]:
params = {
    'lasso__alpha': [.5, 1]
}
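If you are unsure which keys are valid, one way to check (a minimal sketch, rebuilding the same pipeline so the snippet runs on its own) is to compare the grid's keys against `pipeline.get_params()`:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), Lasso())

# make_pipeline names each step after its class, lowercased.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Every tunable key has the form <step name>__<argument name>.
valid_keys = pipeline.get_params().keys()

params = {"lasso__alpha": [0.5, 1]}
unknown = [key for key in params if key not in valid_keys]
print(unknown)  # an empty list means every key in the grid is valid
```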

You can also set the number of cross-validation folds with the `cv` argument. The default is 5.
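As a sketch of changing the fold count, the snippet below passes `cv=3`. It uses synthetic data from `make_regression` as a stand-in (an assumption, so the example runs without the Boston CSV); the names `X_demo`, `y_demo`, and `gs_cv3` are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

pipeline = make_pipeline(StandardScaler(), Lasso())
params = {"lasso__alpha": [0.5, 1]}

# cv=3 asks for 3-fold cross-validation instead of the default 5.
gs_cv3 = GridSearchCV(pipeline, param_grid=params, cv=3)
gs_cv3.fit(X_demo, y_demo)

# 2 candidates x 3 folds = 6 fits, plus one final refit on all the data.
print(gs_cv3.n_splits_)
```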

### instantiate our gridsearch with our pipe and params

In [21]:
gs = GridSearchCV(pipeline, param_grid=params, verbose=1)

We use the grid search object like any other model, calling `fit` and `score` as usual. After fitting, it refits the pipeline on the full training set with the best hyperparameters found, and `score` and `predict` use that refit pipeline.

In [22]:
# fit on the training data
gs.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('lasso',
                                        Lasso(alpha=1.0, copy_X=True,
                                              fit_intercept=True, max_iter=1000,
                                              normalize=False, positive=False,
                                              precompute=False,
                                              random_state=None,
                                              selection='cyclic', tol=0.0001,
                                              warm_start=False))],
                                verbose=False),
             iid='deprecated', n_jobs=None,
    

#### score the training data

In [23]:
gs.score(X_train, y_train)

0.7087967116113545

#### score the test data

In [24]:
gs.score(X_test, y_test)

0.595133419439353
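With no `scoring` argument, the number returned by `score` comes from the underlying estimator, which for a regressor such as `Lasso` is R². The sketch below checks this on synthetic stand-in data (an assumption; `X_demo`, `gs_demo`, and the other names are invented for the example).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

gs_demo = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso()),
    param_grid={"lasso__alpha": [0.5, 1]},
)
gs_demo.fit(X_tr, y_tr)

# Three routes to the same number on the holdout set:
grid_score = gs_demo.score(X_te, y_te)                  # via the grid search
best_score = gs_demo.best_estimator_.score(X_te, y_te)  # via the refit pipeline
manual_r2 = r2_score(y_te, gs_demo.predict(X_te))       # R^2 computed by hand

print(grid_score, best_score, manual_r2)
```

A training score noticeably higher than the test score, as in the lesson's output, suggests some overfitting.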

So what are our best parameters?

#### look at `.best_params_`

In [25]:
gs.best_params_

{'lasso__alpha': 0.5}
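Beyond the single best setting, the fitted object keeps the cross-validated score of every candidate in `cv_results_`. A minimal sketch of inspecting it follows, again on synthetic stand-in data (an assumption; the variable names are invented for the example).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

gs_demo = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso()),
    param_grid={"lasso__alpha": [0.5, 1]},
)
gs_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, one entry per candidate.
results = pd.DataFrame(gs_demo.cv_results_)
summary = results[
    ["param_lasso__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))

# best_score_ is the mean cross-validated score of the top-ranked candidate.
print(gs_demo.best_score_)
```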

Note that we'll use our `best_estimator_` to access the `Pipeline` that was fit with our `best_params_`.

Within the `best_estimator_` there is a dictionary-like attribute called `named_steps`. We can use our string names to access the steps in our `Pipeline`. This is where we'll go to access information about the transformations and parameters at each step.

In [26]:
gs.best_estimator_

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lasso',
                 Lasso(alpha=0.5, copy_X=True, fit_intercept=True,
                       max_iter=1000, normalize=False, positive=False,
                       precompute=False, random_state=None, selection='cyclic',
                       tol=0.0001, warm_start=False))],
         verbose=False)

Here our `Lasso` uses an alpha of 0.5, the value in `best_params_`. We can look at `coef_` to see how our features are weighted. Coefficients of exactly zero mark features that the Lasso penalty dropped from the model.

#### look at the lasso coefficients

In [29]:
gs.best_estimator_.named_steps['lasso'].coef_

array([-0.16217369,  0.        , -0.        ,  0.        , -0.        ,
        3.31843031, -0.        , -0.25424356, -0.        , -0.10441123,
       -1.54716117, -3.86902367])

In [30]:
list(zip(X.columns, gs.best_estimator_.named_steps['lasso'].coef_))

[('CRIM', -0.1621736906849839),
 (' ZN ', 0.0),
 ('INDUS ', -0.0),
 ('CHAS', 0.0),
 ('NOX', -0.0),
 ('RM', 3.3184303116728264),
 ('AGE', -0.0),
 ('DIS', -0.2542435575465667),
 ('RAD', -0.0),
 ('TAX', -0.10441123204401798),
 ('PTRATIO', -1.547161171608939),
 ('LSTAT', -3.869023670138737)]

The code above uses these attributes to align our original columns with our final betas (coefficients).
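A labeled `pandas` Series can make the same alignment easier to sort and filter. The sketch below assumes synthetic data with made-up column names (`feat_a` to `feat_d`), and it relies on the fact that `StandardScaler` neither drops nor reorders columns, so coefficient order matches column order.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

gs_demo = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso()),
    param_grid={"lasso__alpha": [0.5, 1]},
)
gs_demo.fit(X_demo, y_demo)

# Pull the fitted Lasso step out of the best pipeline.
lasso = gs_demo.best_estimator_.named_steps["lasso"]

# Label each coefficient with its column name.
coefs = pd.Series(lasso.coef_, index=X_demo.columns)

# Order by absolute size, largest effect first.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)

# Keep only the features Lasso did not shrink to zero.
kept = coefs[coefs != 0]
print(kept)
```

If the pipeline contained a step that removes or adds columns, the column names would need to be mapped through that step first.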

# Machine Learning Steps

After you have `X` and `y` set:

- Split into training and test (holdout) sets.
- Create a pipeline for preprocessing and the model you want to use.
- Create your parameter grid to search over.
- Create a `GridSearchCV` object and pass it the pipeline object and parameter grid.
- Fit and score the GridSearchCV object
- Inspect and iterate!
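The steps above can be sketched end to end as follows. The data is synthetic (an assumption so the snippet runs on its own), and the grid values and the `workflow_*` names are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (an assumption, not the lesson's data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

# 1. Split into training and test (holdout) sets.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# 2. Create a pipeline for preprocessing and the model.
workflow_pipe = make_pipeline(StandardScaler(), Lasso())

# 3. Create the parameter grid to search over.
workflow_grid = {"lasso__alpha": [0.1, 0.5, 1]}

# 4. Create the GridSearchCV object from the pipeline and the grid.
workflow_gs = GridSearchCV(workflow_pipe, param_grid=workflow_grid, cv=5)

# 5. Fit on the training data, then score on both sets.
workflow_gs.fit(X_tr, y_tr)
train_r2 = workflow_gs.score(X_tr, y_tr)
test_r2 = workflow_gs.score(X_te, y_te)

# 6. Inspect the result, then adjust the grid and repeat as needed.
print(workflow_gs.best_params_)
print(round(train_r2, 3), round(test_r2, 3))
```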

# Summary

You've seen `GridSearchCV` with pipelines.

## Check for understanding

- Why would you want to use `GridSearchCV` with a pipeline?
- What do you pass `GridSearchCV` if you are using a pipeline?
- How do you specify the parameter grid?


`GridSearchCV` is an extremely powerful tool for your toolkit! 🛠
