# Grid Searching
In this notebook we will look at some very basic techniques for grid searching parameters in scikit-learn.

To start things off, we will load a medium-sized dataset (MNIST), as we have done many times in the past.

To cover:
- Setting up Grid Search in scikit
- Using with pipelines
- Memory replication
- The problems of Parallelism

In [1]:
# more data for handwriting recognition?
# Let's use Raschka's implementation for using the mnist dataset:
# https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb
import os
import struct
import numpy as np
# from sklearn.preprocessing import RobustScaler
 
def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
        
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
 
    return images, labels

X_train, y_train = load_mnist('data/', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))

X_test, y_test = load_mnist('data/', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))

X_train = X_train/255.0
X_test = X_test/255.0



Rows: 60000, columns: 784
Rows: 10000, columns: 784


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# we want to know some parameters of the logistic regression
# in order to use it with the current data

# here are some parameters for tuning
LogisticRegression().get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

___
Clearly, we should adjust the regularization parameter, C. We should probably also look at the type of regularization, L1-norm versus L2-norm for this application.

In order to adjust these parameters, we also need a scoring function. For this application, it makes sense to choose the model with the best accuracy, although other cost metrics/evaluation criteria could be applied. 
___

In [7]:
%%time 

# how long should it take to fit one model?
clf = LogisticRegression()
clf.fit(X_train,y_train)
print(accuracy_score(y_test,clf.predict(X_test)))

0.9203
CPU times: user 1min 21s, sys: 258 ms, total: 1min 21s
Wall time: 1min 21s


So it takes a little over one minute to fit a single model (1 min 21 s above). Running a grid search, then, will take some time. To speed things up, let's use a small number of cross-validation folds (2 or 3). Also, let's limit the search to about four parameter combinations.

In [15]:
grid = {'C':[10,1], # adjust regularization cost
        'penalty':['l1','l2'], #adjust weights regularization
       }
# this will create a parameter grid of 2x2=4 combinations

cv_strat = StratifiedKFold(n_splits=3) # and three splits per combination
# so this will make 4x3=12 different models to create

search = GridSearchCV(estimator=LogisticRegression(),
                      param_grid=grid,
                      cv=cv_strat,
                      n_jobs=-1, # run in parallel 
                     )

# this is the object that will perform the search
print(search)

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [10, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
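
As a quick sanity check on the size of this search, the grid can be expanded by hand with the standard library. This is only a sketch of the counting argument, not how scikit-learn enumerates the grid internally; the names `combinations`, `n_cv_fits` and `n_total_fits` are made up for this illustration.

```python
from itertools import product

# Same grid and fold count as the search above.
grid = {'C': [10, 1], 'penalty': ['l1', 'l2']}
n_splits = 3

# Expand the grid into one dict per parameter combination.
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

n_cv_fits = len(combinations) * n_splits  # one fit per combination per fold
n_total_fits = n_cv_fits + 1              # plus the final refit (refit=True)

for combo in combinations:
    print(combo)
print(len(combinations), n_cv_fits, n_total_fits)
```

The extra fit comes from `refit=True` in the printout above: once the best combination is known, it is trained one more time on all of the training data.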


In [16]:
%%time
# now let's fit it to the training data and get a good estimator! Maybe!
search.fit(X_train,y_train)

CPU times: user 1min 5s, sys: 1.44 s, total: 1min 6s
Wall time: 8min 20s


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [10, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

While this process was running, this is how my machine resources were distributed among the different Python workers:
![resources](PDF_Slides/resources.png)

As you can see, four workers were running, one on each core. Each worker used about 90-97% of its core and about 500MB of RAM. That is because the MNIST data was copied to each worker (memory replication). That's not ideal, but it makes sense in a distributed environment where each machine needs its own copy of the data. It also makes the grid search much easier to program and use, because we don't need to worry about locking shared memory between the different processes (which could slow the search considerably). Since my machine has 8GB of RAM, the replicated data fits comfortably in memory.
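
A rough back-of-the-envelope estimate shows why replication matters. The sketch below only counts the training matrix itself, using the shape printed earlier and assuming 8 bytes per value (dividing the `uint8` pixels by `255.0` produces `float64`). Whatever else makes up the roughly 500MB per worker (the interpreter, imported libraries, temporary copies made during fitting) is not measured here.

```python
n_rows, n_cols = 60000, 784   # shape of X_train printed earlier
bytes_per_value = 8           # float64, since uint8 / 255.0 gives float64 in NumPy
n_workers = 4                 # workers seen in the resource monitor

train_bytes = n_rows * n_cols * bytes_per_value
train_mb = train_bytes / 1e6
replicated_mb = train_mb * n_workers

print('one copy of X_train: %.1f MB' % train_mb)
print('copies in %d workers: %.1f MB' % (n_workers, replicated_mb))
```

So a single copy of the training data is already several hundred megabytes, and four copies come to about 1.5GB before any model state is counted.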

The training happened in parallel, so it took only about 8 minutes to finish! Run one after another, the 12 logistic regression fits (at roughly 1.5 minutes per model) would have taken about 18 minutes. The ideal on four cores would be (12 models x 1.5 mins) / 4 cores = 4.5 mins, but each fit takes more or less time to converge depending on its parameters, and the search also refits the best model on the full training set at the end (`refit=True`). So the overhead (in terms of CPU) was not too terrible, and the processing sped up by a little more than 2 times! That is parallelism! (In fact, it is embarrassingly parallel, so *meh*, but it's okay to get excited when computation gets faster.)
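
The arithmetic behind that comparison can be written out explicitly. This is just a sketch using the rough 1.5 minutes per model assumed in the text and the measured wall time of 8min 20s; the variable names are made up for the illustration.

```python
n_fits = 12
minutes_per_fit = 1.5          # rough per-model estimate used in the text
n_workers = 4
wall_minutes = 8 + 20 / 60.0   # measured wall time: 8min 20s

serial_minutes = n_fits * minutes_per_fit   # one fit after another
ideal_minutes = serial_minutes / n_workers  # perfect 4-way split
speedup = serial_minutes / wall_minutes
efficiency = speedup / n_workers            # fraction of the ideal speedup

print('serial estimate: %.1f min' % serial_minutes)
print('ideal parallel:  %.1f min' % ideal_minutes)
print('speedup: %.2fx, efficiency: %.0f%%' % (speedup, 100 * efficiency))
```

Under these assumptions the search runs a bit more than twice as fast as the serial estimate, which is about half of what four perfectly balanced workers could deliver.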

We will return to memory and CPU tradeoffs later, but for now let's see how the model performed.

In [19]:
# grid_scores_ is deprecated in favor of cv_results_, but it gives a nice summary
search.grid_scores_



[mean: 0.91168, std: 0.00184, params: {'penalty': 'l1', 'C': 10},
 mean: 0.91208, std: 0.00202, params: {'penalty': 'l2', 'C': 10},
 mean: 0.91472, std: 0.00214, params: {'penalty': 'l1', 'C': 1},
 mean: 0.91372, std: 0.00260, params: {'penalty': 'l2', 'C': 1}]
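
Since `grid_scores_` is deprecated, a similar summary can be built from `cv_results_`, which stores `mean_test_score`, `std_test_score` and `params` as parallel sequences. The helper below is a minimal sketch: `summarize` and `example_results` are hypothetical names, and the dictionary is a stand-in filled with the scores printed above so that the snippet runs on its own.

```python
def summarize(cv_results):
    """Return (mean, std, params) rows sorted from best to worst mean score."""
    rows = zip(cv_results['mean_test_score'],
               cv_results['std_test_score'],
               cv_results['params'])
    return sorted(rows, key=lambda row: row[0], reverse=True)

# Stand-in for search.cv_results_, using the scores printed above.
example_results = {
    'mean_test_score': [0.91168, 0.91208, 0.91472, 0.91372],
    'std_test_score': [0.00184, 0.00202, 0.00214, 0.00260],
    'params': [{'C': 10, 'penalty': 'l1'},
               {'C': 10, 'penalty': 'l2'},
               {'C': 1, 'penalty': 'l1'},
               {'C': 1, 'penalty': 'l2'}],
}

for mean, std, params in summarize(example_results):
    print('mean: %.5f, std: %.5f, params: %r' % (mean, std, params))
```

With a fitted search, calling `summarize(search.cv_results_)` should give the same kind of table, best combination first.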

In [25]:
# we can also look at the mean time it took for the parameters
# this accesses the cv_results_ property that has a lot more info in it
list(zip(search.cv_results_['mean_fit_time'],search.cv_results_['params']))

[(166.47111829121908, {'C': 10, 'penalty': 'l1'}),
 (212.12199465433756, {'C': 10, 'penalty': 'l2'}),
 (80.596707344055176, {'C': 1, 'penalty': 'l1'}),
 (85.935367345809937, {'C': 1, 'penalty': 'l2'})]

It looks like the C=10 setting took much longer to fit on average than C=1 (roughly 165-210 seconds versus 80-86 seconds). That accounts for the uneven convergence times we saw above.

So the different parameters did quite well in terms of accuracy on the validation folds (~91-92%). How well do they perform on the test set that we explicitly left out?

In [20]:
# it's easy to simply use the best grid search estimator.
# We simply treat the search like an estimator
yhat = search.predict(X_test)
print(accuracy_score(y_test,yhat))

0.9195


So all-in-all the estimator performed about the same regardless of the input parameters. 92% is nothing to be ashamed of on MNIST, but it certainly is nothing to be proud of. It's much less than state of the art, and it's much less than what we achieved with a multi-layer perceptron. We won't try to increase this value too much though, because the point of this notebook is to illustrate grid search strategy.

Now let's try grid searching with a pipeline.

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA


model = Pipeline([
        ('pca_pre',PCA()),
        ('clf',LogisticRegression())
    ])

# now we can access individual models in the pipe
# using the __ naming
grid = {'clf__C':[1,0.1], # adjust regularization cost
        'pca_pre__n_components':[100,10], #num components
       }
# this will create a parameter grid of 2x2=4 combinations

search = GridSearchCV(estimator=model,
                      param_grid=grid,
                      cv=cv_strat,
                      n_jobs=-1, # run in parallel 
                     )

# this is the object that will perform the search
print(search)

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
       error_score='raise',
       estimator=Pipeline(steps=[('pca_pre', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'clf__C': [1, 0.1], 'pca_pre__n_components': [100, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
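
The `__` naming convention reads as `<step name>__<parameter name>`. The sketch below illustrates that convention with plain string handling; it is not scikit-learn's own routing code, and `split_param` and `routed` are hypothetical names used only here.

```python
# Same grid keys as the pipeline search above.
grid = {'clf__C': [1, 0.1],
        'pca_pre__n_components': [100, 10]}

def split_param(key):
    """Split 'step__param' into (step, param) at the first double underscore."""
    step, sep, param = key.partition('__')
    if not sep:
        raise ValueError('expected a name of the form step__param: %r' % key)
    return step, param

# Group the candidate values by the pipeline step they belong to.
routed = {}
for key, values in grid.items():
    step, param = split_param(key)
    routed.setdefault(step, {})[param] = values

for step in sorted(routed):
    print(step, routed[step])
```

Each key in the grid therefore has to start with one of the step names given to `Pipeline` (`pca_pre` or `clf` here), followed by a parameter that the step accepts.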


In [28]:
%%time
# now let's refit and see if the PCA helped
search.fit(X_train,y_train)

CPU times: user 50.4 s, sys: 1.27 s, total: 51.7 s
Wall time: 2min 32s


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
       error_score='raise',
       estimator=Pipeline(steps=[('pca_pre', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'clf__C': [1, 0.1], 'pca_pre__n_components': [100, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [29]:
search.grid_scores_



[mean: 0.90727, std: 0.00270, params: {'clf__C': 1, 'pca_pre__n_components': 100},
 mean: 0.77648, std: 0.00404, params: {'clf__C': 1, 'pca_pre__n_components': 10},
 mean: 0.90380, std: 0.00235, params: {'clf__C': 0.1, 'pca_pre__n_components': 100},
 mean: 0.77440, std: 0.00443, params: {'clf__C': 0.1, 'pca_pre__n_components': 10}]

So the search finished quickly (about 2.5 minutes), but the accuracy did not improve: 100 components gives about 90-91%, and 10 components only about 77%. We really need a broader search of the parameters if we want things to improve. Let's expand the limits we were working with. This time we will need to create:
- 3 pca params x 3 costs x 2 penalties x 3 folds = 54 models

In [33]:
%%time
# we will use slightly different dictionary construction here
# this might be easier to write
grid = dict(pca_pre__n_components=[100, 200, 500],
            clf__C=[1e-4, 1e-2, 1.0],
            clf__penalty=['l1', 'l2'])

# this is certainly a lot more parameters!!
search = GridSearchCV(estimator=model,
                      param_grid=grid,
                      cv=cv_strat,
                      n_jobs=-1, # run in parallel 
                     )

# now let's refit and see if the increased search space helped
search.fit(X_train,y_train)

CPU times: user 5min 5s, sys: 8.8 s, total: 5min 14s
Wall time: 31min 23s


In [34]:
search.grid_scores_



[mean: 0.20925, std: 0.00025, params: {'clf__penalty': 'l1', 'clf__C': 0.0001, 'pca_pre__n_components': 100},
 mean: 0.20925, std: 0.00025, params: {'clf__penalty': 'l1', 'clf__C': 0.0001, 'pca_pre__n_components': 200},
 mean: 0.20925, std: 0.00025, params: {'clf__penalty': 'l1', 'clf__C': 0.0001, 'pca_pre__n_components': 500},
 mean: 0.81565, std: 0.00683, params: {'clf__penalty': 'l2', 'clf__C': 0.0001, 'pca_pre__n_components': 100},
 mean: 0.81670, std: 0.00715, params: {'clf__penalty': 'l2', 'clf__C': 0.0001, 'pca_pre__n_components': 200},
 mean: 0.81670, std: 0.00701, params: {'clf__penalty': 'l2', 'clf__C': 0.0001, 'pca_pre__n_components': 500},
 mean: 0.87803, std: 0.00385, params: {'clf__penalty': 'l1', 'clf__C': 0.01, 'pca_pre__n_components': 100},
 mean: 0.87795, std: 0.00380, params: {'clf__penalty': 'l1', 'clf__C': 0.01, 'pca_pre__n_components': 200},
 mean: 0.87795, std: 0.00380, params: {'clf__penalty': 'l1', 'clf__C': 0.01, 'pca_pre__n_components': 500},
 mean: 0.89157, 

The results are not so great... But this happens many times in tuning. The above task took about 31 minutes to complete.
___

Now let's get a quick taste of what we have coming up next. We will use dask-learn (it's experimental and may not last, but let's use it anyway) to perform the exact same grid search. The only difference here is the lazily evaluated nature of dask.

In [37]:
%%time

from dask.diagnostics import ProgressBar # nice convenience function
from dklearn.grid_search import GridSearchCV as DaskGridSearchCV

# The use of Dask introduces lazy computations
dsearch = DaskGridSearchCV(estimator=model,
                           param_grid=grid,
                           cv=3, 
                          )

# now let's refit and see if we can decrease the search time
with ProgressBar():
    dsearch.fit(X_train,y_train)

[########################################] | 100% Completed | 21min 14.8s
[########################################] | 100% Completed |  4min 56.6s
CPU times: user 1h 11min 4s, sys: 1min 44s, total: 1h 12min 48s
Wall time: 26min 13s


The above example took about 26 minutes to run, versus about 31 minutes for scikit-learn. Why? Because the lazy evaluation of dask allowed us to cache the result of the PCA, rather than recalculating it each time like the naive implementation in scikit-learn (although overall the PCA was only a modest CPU user). Scikit-learn's implementation isn't necessarily bad, but recalculating the PCA is a needless step in this example (caching does increase the memory footprint, however).
For more explanation, take a look at this example from **Jim Crist**:
- http://matthewrocklin.com/blog/work/2016/07/12/dask-learn-part-1

And his graphic:
![Dask profile](https://mrocklin.github.com/blog/images/grid_search_schedule.gif)
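
To get a feel for how much repeated work the caching can remove, we can count PCA fits under a simplified model: without caching, every pipeline fit refits the PCA, while with caching a PCA is fit once per distinct (number of components, fold) pair. This is only a counting sketch under that assumption. It does not model dask's actual scheduler and it ignores the final refit.

```python
from itertools import product

# Same grid and fold count as the larger search above.
grid = dict(pca_pre__n_components=[100, 200, 500],
            clf__C=[1e-4, 1e-2, 1.0],
            clf__penalty=['l1', 'l2'])
n_folds = 3

naive_pca_fits = 0       # PCA refit inside every pipeline fit
cached_pca_fits = set()  # distinct PCA fits if results are reused

for n_comp, C, penalty, fold in product(grid['pca_pre__n_components'],
                                        grid['clf__C'],
                                        grid['clf__penalty'],
                                        range(n_folds)):
    naive_pca_fits += 1
    # The PCA result depends only on n_components and the training fold.
    cached_pca_fits.add((n_comp, fold))

print('PCA fits without caching:', naive_pca_fits)
print('PCA fits with caching:   ', len(cached_pca_fits))
```

The classifier still has to be fit for all 54 combinations in both cases, which is consistent with the PCA savings translating into only a few minutes of wall time.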

With dask, the CPU is still doing many things at once, but it is not replicating Python processes to do it. In Dask, we get:
![Dask resources](PDF_Slides/resource_dask.png)

The NumPy arrays (the MNIST data) have still been replicated, and we are still running with many threads.