# Interfacing with `GridSearchCV` and `Pipeline`
The syntax to interface `GridSearchCV` and a pipeline estimator can be tricky. There are two methods, each with its pros and cons. The first method feeds a pipeline estimator into `GridSearchCV` while the second has `GridSearchCV` as a step in the pipeline. Remember, both `Pipeline` and `GridSearchCV` are estimator classes, with the `fit`, `predict`, and `score` methods.

In [1]:
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

For this demonstration, let's use the California housing dataset and create a `Pipeline` object containing two steps: a feature scaler and a ridge predictor.

In [2]:
# Get data (fetched once) and perform train/test split
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create pipeline estimator
pipe_1 = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])
pipe_1.fit(X_train, y_train)

print(pipe_1.score(X_test, y_test))

0.5943141338604155


## Perform a gridsearch on a pipeline
The first method performs a gridsearch on a `Pipeline` estimator. A `Pipeline` object exposes the parameters of all of its steps, both the transformers and the final predictor, and uses the name of each step as a prefix to the parameter names. For example, the ridge regularization parameter is now referred to as `regressor__alpha` rather than just `alpha`. The prefix is needed to distinguish between the hyperparameters of the different estimators in the pipeline. The `get_params` method returns a dictionary of the pipeline parameters, using the name of the step and a double underscore (dunder) as a prefix.

In [3]:
# Print dictionary of pipeline parameters
pipe_1.get_params()

{'memory': None,
 'steps': [('scaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('regressor',
   Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
         normalize=False, random_state=None, solver='auto', tol=0.001))],
 'verbose': False,
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'regressor': Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'regressor__alpha': 1.0,
 'regressor__copy_X': True,
 'regressor__fit_intercept': True,
 'regressor__max_iter': None,
 'regressor__normalize': False,
 'regressor__random_state': None,
 'regressor__solver': 'auto',
 'regressor__tol': 0.001}
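The same prefixed names can be passed to `set_params`, which is what a gridsearch relies on when it assigns candidate values. A minimal sketch, assuming a freshly built copy of the two-step pipeline (the variable name `pipe` is only illustrative):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two-step layout as pipe_1, rebuilt so the snippet runs on its own
pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

# <step name>__<parameter name> addresses a parameter of one specific step
pipe.set_params(regressor__alpha=10.0, scaler__with_mean=False)

alpha = pipe.get_params()['regressor__alpha']
with_mean = pipe.named_steps['scaler'].with_mean
print(alpha, with_mean)
```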

To perform a gridsearch on the ridge regularization hyperparameter, the keyword `regressor__alpha` needs to be used.

In [4]:
# Perform hyperparameter tuning on pipeline estimator
param_grid = {'regressor__alpha': np.logspace(-3, 3, 20)} # 10^-3 to 10^3
gs_est = GridSearchCV(pipe_1, param_grid, cv=3, n_jobs=2, verbose=1) # GS of pipe
gs_est.fit(X_train, y_train)

print(gs_est.score(X_test, y_test))

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


0.5941973302098567


[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:    2.0s finished


In [5]:
gs_est.best_params_

{'regressor__alpha': 12.742749857031322}

The advantage of feeding a `Pipeline` object to `GridSearchCV` is that it allows tuning the hyperparameters of every step in the pipeline, not just those of the final predictor. For example, to explore the effect of the `with_mean` parameter of the scaler transformer, we would use the keyword `scaler__with_mean`, since we used the string "scaler" to name the scaler transformer when creating our pipeline.
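As a rough sketch of tuning a transformer and the predictor together, the grid below varies both `scaler__with_mean` and `regressor__alpha`. It assumes a small synthetic dataset from `make_regression` as a stand-in for the housing data, and the names `X_demo`, `y_demo`, and `search` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

# One key per hyperparameter, each prefixed with the name of its step
param_grid = {
    'scaler__with_mean': [True, False],
    'regressor__alpha': np.logspace(-3, 3, 5),
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_['params'])  # 2 * 5 combinations
print(n_candidates)
print(search.best_params_)
```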

When `fit` is called, every training step of the cross-validation scheme refits all steps of the pipeline, transformers included, and not just the final predictor. If the gridsearch does not vary any hyperparameters of the transformers, these refits are redundant and lengthen the runtime of the gridsearch.

## `GridSearchCV` inside the pipeline
`GridSearchCV` can be the final step of the pipeline. Since `GridSearchCV` is then operating on an estimator that is not a pipeline, the name of each hyperparameter is simply the original name from the estimator. For example, to modify the regularization hyperparameter of the ridge estimator, the string that refers to this hyperparameter is simply "alpha". The disadvantage of this approach is that you can only search through the hyperparameters of the estimator that was fed to `GridSearchCV` and not those of any of the other transformers. However, since the gridsearch only starts at the last step of the pipeline, the transformations of the dataset occur only once, which is computationally more efficient. Keep in mind that the scaler is then fit on the full training set before cross-validation, so each validation fold has already influenced the scaling.

In [6]:
# Include gridsearch estimator inside of the pipeline
param_grid = {'alpha': np.logspace(-3, 3, 20)}
gs = GridSearchCV(Ridge(), param_grid) # GS only for predictor
pipe_2 = Pipeline([('scaler', StandardScaler()), ('gs_est', gs)])
pipe_2.fit(X_train, y_train)

print(pipe_2.score(X_test, y_test))

0.5941973302098567




In [14]:
pipe_2.named_steps

{'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'gs_est': GridSearchCV(cv='warn', error_score='raise-deprecating',
              estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                              max_iter=None, normalize=False, random_state=None,
                              solver='auto', tol=0.001),
              iid='warn', n_jobs=None,
              param_grid={'alpha': array([1.00000000e-03, 2.06913808e-03, 4.28133240e-03, 8.85866790e-03,
        1.83298071e-02, 3.79269019e-02, 7.84759970e-02, 1.62377674e-01,
        3.35981829e-01, 6.95192796e-01, 1.43844989e+00, 2.97635144e+00,
        6.15848211e+00, 1.27427499e+01, 2.63665090e+01, 5.45559478e+01,
        1.12883789e+02, 2.33572147e+02, 4.83293024e+02, 1.00000000e+03])},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring=None, verbose=0)}
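Because the search is now a step of the pipeline, its results are reached through `named_steps` rather than from the pipeline directly. A minimal sketch, assuming synthetic data from `make_regression` and the illustrative names `inner_search` and `fitted_search`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

inner_search = GridSearchCV(Ridge(), {'alpha': np.logspace(-3, 3, 5)}, cv=3)
pipe = Pipeline([('scaler', StandardScaler()), ('gs_est', inner_search)])
pipe.fit(X_demo, y_demo)

# The fitted search object lives under the name given to its step
fitted_search = pipe.named_steps['gs_est']
best_alpha = fitted_search.best_params_['alpha']   # unprefixed name
best_ridge = fitted_search.best_estimator_         # Ridge refit with best_alpha
print(best_alpha, best_ridge)
```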

## Caching transformers of a pipeline

### Avoid refitting `StandardScaler()` every time a new hyperparameter value is fit

As of version 0.19, scikit-learn gives `Pipeline` a way to cache transformers. Caching avoids unnecessarily refitting transformers: if a transformer with a given set of arguments has already been fitted on the same data, the fitted result was cached at that time and is simply loaded from the cache. Caching transformers makes the hyperparameter tuning process run faster, so one can perform the search on the pipeline rather than have the search as the final step. Note that caching transformers will only be faster if it is computationally expensive to fit the transformer. You may not see any improvement if you are using something like `StandardScaler`, which is fast to fit.

To cache the transformers of a pipeline, use the `memory` keyword argument and set it equal to a string referring to the path of the caching directory. We can make a temporary directory using `mkdtemp` and remove it after use with `rmtree`.

In [9]:
from tempfile import mkdtemp
from shutil import rmtree

cachedir = mkdtemp()
pipe_cache = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())], memory=cachedir)
pipe_cache.fit(X_train, y_train)
rmtree(cachedir)
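Caching matters most when the cached pipeline is searched over. The sketch below combines `memory` with `GridSearchCV` and, as one possible alternative to `mkdtemp`/`rmtree`, uses `tempfile.TemporaryDirectory` so the cache is removed automatically. The synthetic data and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration.

```python
from tempfile import TemporaryDirectory

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The directory, and the cache in it, is deleted when the with-block exits
with TemporaryDirectory() as cachedir:
    pipe = Pipeline(
        [('scaler', StandardScaler()), ('regressor', Ridge())],
        memory=cachedir,
    )
    # Within each fold, the scaler fit can be reused across alpha values
    search = GridSearchCV(pipe, {'regressor__alpha': np.logspace(-3, 3, 5)}, cv=3)
    search.fit(X_demo, y_demo)
    best_alpha = search.best_params_['regressor__alpha']
    r2 = search.score(X_demo, y_demo)

print(best_alpha, r2)
```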

## Using `RandomizedSearchCV` as an alternative to `GridSearchCV`

The number of grid points to explore when using `GridSearchCV` grows exponentially with the number of hyperparameters you are exploring. You can consider each hyperparameter as a dimension in an n-dimensional space. An alternative to `GridSearchCV` is `RandomizedSearchCV`, which constructs the grid of hyperparameters but only evaluates a random sample of its points. The number of grid points sampled is controlled by the `n_iter` keyword. It has been shown that a randomized search can perform just as well as a full gridsearch but with only a fraction of the runtime. The "savings" in computational time can then be "invested" in exploring other hyperparameters or using a larger grid.
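To make the growth concrete, `ParameterGrid` can count the candidates a full gridsearch would evaluate. The grids below are hypothetical examples built from the parameter names of the pipeline used in this notebook:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Each additional hyperparameter multiplies the number of candidates
grid_1d = {'regressor__alpha': np.logspace(-3, 3, 20)}
grid_2d = {**grid_1d, 'regressor__fit_intercept': [True, False]}
grid_3d = {**grid_2d, 'scaler__with_mean': [True, False]}

sizes = [len(ParameterGrid(g)) for g in (grid_1d, grid_2d, grid_3d)]
print(sizes)  # [20, 40, 80]
```

With 3-fold cross-validation, each of these counts is multiplied by three to get the number of fits.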

In [10]:
# Use randomized search
param_grid = {'regressor__alpha': np.logspace(-4, 4, 100)}
gs_est = RandomizedSearchCV(pipe_1, param_grid, cv=3, n_jobs=2, verbose=1, random_state=0, n_iter=20)
gs_est.fit(X_train, y_train)

print(gs_est.score(X_test, y_test))

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


0.5942975332161693


[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:    1.9s finished
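`RandomizedSearchCV` also accepts distributions in place of explicit lists, in which case values are drawn from the distribution instead of from a fixed grid. This is a variation on the cell above, not what it does. A minimal sketch, assuming SciPy 1.4 or later for `scipy.stats.loguniform` and synthetic stand-in data:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

# Sample alpha log-uniformly between 1e-4 and 1e4 instead of listing values
param_distributions = {'regressor__alpha': loguniform(1e-4, 1e4)}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=3, random_state=0
)
search.fit(X_demo, y_demo)

sampled_alphas = [p['regressor__alpha'] for p in search.cv_results_['params']]
print(len(sampled_alphas), search.best_params_)
```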


## Using `dill` to serialize estimators
`pickle` is a module in the Python standard library that serializes Python objects. Serialization is the process of converting a data structure or object into a byte stream that can be stored to disk. Deserialization refers to the opposite process, where a byte stream is used to reconstruct a data structure or object. Think of pickling as a way to save Python objects to disk so they can be loaded and used at a later time. The `pickle` library is limited in which objects it can successfully serialize. Fortunately, the `dill` library improves on `pickle` by extending the set of Python data types that can be serialized. The syntax of `pickle` and `dill` is nearly identical.

It is useful to serialize scikit-learn estimators, especially after they have been trained. To serialize our trained estimator, we simply open a file and call `dill.dump` with the estimator and the file object. Remember, it is good practice to use the `with` statement when dealing with file objects.

In [11]:
import dill

with open('pipe_of_ridge.dill', 'wb') as fp:
    dill.dump(pipe_1, fp)

In the above code, the `w` in the mode string passed to `open` signifies that we want write access to the file, and the `b` opens it in binary mode, which a serialized byte stream requires. The file extension can be anything; however, using "dill" is descriptive and a nice convention. To deserialize the estimator, call `dill.load` on the file object. Note that `r` is used with `open` because we need read access to the file.

In [12]:
with open('pipe_of_ridge.dill', 'rb') as fp:
    pipe_1 = dill.load(fp)

Now the estimator is loaded in memory and ready to be used. If you are curious about the memory requirements of different trained estimators, you can serialize them with `dill` and check the resulting file sizes.

In [5]:
print(pipe_1.score(X_test, y_test))


**Note:** scikit-learn's preferred way to serialize models is using [`joblib`](https://joblib.readthedocs.io/en/latest/index.html). You can read more [here](https://scikit-learn.org/stable/modules/model_persistence.html), but it works in much the same way as `pickle` and `dill`.
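For comparison, a minimal sketch of the same round trip with `joblib`, which also reports the size of the file on disk as suggested earlier. The synthetic data, the temporary directory, and the `.joblib` file name are assumptions for illustration.

```python
import os
from tempfile import TemporaryDirectory

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])
pipe.fit(X_demo, y_demo)

with TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'pipe_of_ridge.joblib')
    joblib.dump(pipe, path)             # joblib takes a path, not a file object
    size_bytes = os.path.getsize(path)  # size of the serialized estimator
    restored = joblib.load(path)

# The restored pipeline is already fitted and predicts identically
same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(size_bytes, same_predictions)
```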

## Persisting the Model
Let's use a library to save our model.
**Note:** Do not load a dill file from an untrusted source; deserializing it can execute arbitrary code.

`pickle` - part of the Python standard library

`dill` - can save more object types than `pickle`

In [1]:
import dill

In [2]:
dill.dump?

In [None]:
# wb = write binary; the .dill extension is only a convention; f is the file handle
with open('saved_model.dill', 'wb') as f:
    dill.dump(model, f)  # arguments: (object to save, file handle); `model` is any fitted estimator

In [None]:
with open('saved_model.dill', 'rb') as f:  # rb = read binary
    saved_model = dill.load(f)

In [None]:
saved_model.predict(X_test)   # already fit! Can predict straight away

In [None]:
# could also dump e.g. training data!

with open('training_data.dill', 'wb') as f:
    dill.dump(X_train, f)

In [None]:
with open('training_data.dill', 'rb') as f:  # load the training data back from its own file
    X_train = dill.load(f)