# Completing the ML workflow

Over the past few tutorials we've seen many aspects of a supervised ML workflow, from loading data to preprocessing, selecting and training a model, optimizing hyperparameters and finally evaluating the model. It's time to put all these together into a complete workflow for supervised ML problems.

<img src="https://github.com/rasbt/pattern_classification/raw/master/Images/supervised_learning_flowchart.png" width="65%">

The main steps are:

1. **Load** the data into Python
2. **Split** the data into train/test sets
3. **Preprocess** the data

    1. Perform all **necessary** preprocessing steps. These include:
        - Handling **missing** data (i.e. discard or impute)
        - Feature **encoding** (i.e. convert alphanumeric features into numeric)
        - Feature **scaling** (i.e. transform features so that they occupy similar value ranges) 
        
    2. **Optionally** we might want to perform:
        - Feature **selection** (i.e. discard some of the features)
        - Feature **extraction** (i.e. transform the data into a usually smaller feature space)
        - **Resampling** (i.e. under/over-sampling)
4. **Select** a ML algorithm
5. Optimize the algorithm's **hyperparameters** through **cross-validation**.
6. **Evaluate** its performance on the test set. If it is inadequate, or if we want to improve on the results: **start over from step 2 and refine the process**! 
7. Finally, if we've achieved an adequate performance on the test set: train the model one last time, with the optimal hyperparameters, **on the whole dataset**.
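The steps above can be sketched end to end with the tools from the previous tutorials. This is a minimal illustration rather than a recommended recipe: the dataset, the k-NN model and the three candidate values for `n_neighbors` are assumptions chosen to keep the example short. Note that the scaler is fit once on the full training set before cross-validation, which is a simplification.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Load the data
X, y = load_breast_cancer(return_X_y=True)

# 2. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=13)

# 3. Preprocess: fit the scaler on the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 4-5. Select an algorithm (k-NN) and choose k through 5-fold cross-validation
cv_scores = {}
for k in (3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_s, y_train, cv=5)
    cv_scores[k] = scores.mean()
best_k = max(cv_scores, key=cv_scores.get)

# 6. Evaluate the chosen configuration on the test set
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)
test_accuracy = model.score(X_test_s, y_test)
print(best_k, round(test_accuracy, 3))

# 7. Retrain one last time, with the chosen k, on the whole dataset
final_scaler = StandardScaler().fit(X)
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(final_scaler.transform(X), y)
```

Keeping the scaler and the model in sync by hand, as done here, is exactly the bookkeeping that the two classes introduced next take care of.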

Scikit-learn has two very helpful classes that make our life easier when refining hyperparameters: **pipeline** and **grid search**.

## Pipeline

*scikit-learn* [pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) provide a convenient way to incorporate multiple steps in a ML workflow.

The idea behind a `Pipeline` is to encapsulate multiple steps in a single object. The first steps of the pipeline are **preprocessing** steps, which transform the data. The last step of the pipeline is a model that can make predictions. Unfortunately, **all preprocessing steps must be *scikit-learn* compatible objects**.

All intermediate steps in a pipeline are transforms and must implement both a `.fit()` and a `.transform()` method (like the scaler we saw before). The last step should be an estimator (i.e. have `.fit()` and `.predict()` methods). We pass these steps, in order, as a *list* of *tuples*, each containing the name and the object of the transform/estimator.

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([('transform1', transform1), ('transform2', transform2), ..., ('estimator', estimator)])
```

Let's try to implement a pipeline containing a StandardScaler and a k-NN model. 

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn import datasets

# Load the breast cancer dataset
cancer = datasets.load_breast_cancer()

seed = 13  # random seed for reproducibility

# Shuffle and split the data
train, test, train_labels, test_labels = train_test_split(cancer['data'], cancer['target'], test_size=0.4, random_state=seed)

# Define a scaler (default parameters)
scaler = StandardScaler()

# Define a kNN model (not default parameters)
knn = KNeighborsClassifier(n_neighbors=11)

# Create a pipeline with the scaler and the kNN
pipe = Pipeline([('standardizer', scaler), ('classifier', knn)])

# Train on the training set
pipe.fit(train, train_labels)

# Evaluate on the test set
preds = pipe.predict(test)
print(accuracy_score(test_labels, preds))
```

```
0.956140350877193
```


When we called `pipe.fit()`, the pipeline internally called `.fit_transform()` **on each of its transforms** and `.fit()` **on its estimator**. For a pipeline with $M$ preprocessing steps, `pipe.fit()` is equivalent to fitting and transforming the data through each preprocessing step in turn and then fitting the last step (i.e. the estimator) on the result:

```python
# Assuming that our pipeline is:
pipe = Pipeline([('transform1', transform1), ('transform2', transform2), ..., ('estimator', estimator)])

# If we ran:
pipe.fit(train, train_labels)

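# It would be the equivalent of (the labels are passed along to every step,
# although unsupervised transforms such as scalers ignore them):
tmp = transform1.fit_transform(train, train_labels)
tmp = transform2.fit_transform(tmp, train_labels)
# ...
tmp = transformM.fit_transform(tmp, train_labels)
estimator.fit(tmp, train_labels)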
```

Running `pipe.predict()`, on the other hand, simply calls `.transform()` on each of the preprocessing steps (nothing is re-fit) and `.predict()` on the final step.

```python
# If we ran:
preds = pipe.predict(test)  # predict() takes only the features, not the labels

# It would be the equivalent of:
tmp = transform1.transform(test)
tmp = transform2.transform(tmp)
# ...
tmp = transformM.transform(tmp)
preds = estimator.predict(tmp)
```
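The equivalence described above can be checked directly. The sketch below reuses the same dataset, split and hyperparameters as the earlier example (reloaded here only so that the block is self-contained) and compares the pipeline's predictions with those obtained by calling each step by hand.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=13)

# Route 1: let the pipeline chain the steps
pipe = Pipeline([('standardizer', StandardScaler()),
                 ('classifier', KNeighborsClassifier(n_neighbors=11))])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

# Route 2: call each step manually, in the same order
scaler = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=11)
tmp = scaler.fit_transform(X_train)      # fit + transform on the training set
knn.fit(tmp, y_train)                    # fit the estimator on the transformed data
manual_preds = knn.predict(scaler.transform(X_test))  # transform only, then predict

# Both routes should produce identical predictions
print(np.array_equal(pipe_preds, manual_preds))
```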

An easier way to create pipelines is scikit-learn's `make_pipeline` function. It is a shorthand for the `Pipeline` constructor that does not require naming the steps; instead, each name is set automatically to the lowercase name of the step's type.

```python
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(scaler, knn) 
```
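To see which names `make_pipeline` generates, we can inspect the `steps` attribute of the resulting pipeline. The short sketch below uses the same scaler and k-NN configuration as before; these names matter later, because they are what we use to refer to a step's parameters.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))

# Each step is named after its class, in lowercase
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'kneighborsclassifier']

# Steps can be retrieved by these generated names
print(pipe.named_steps['kneighborsclassifier'].n_neighbors)  # 11
```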

**Note**: If we want to put a sampler from imblearn into our pipeline we **must** use `imblearn.pipeline.Pipeline`, which extends sklearn's pipeline.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # import imblearn's pipeline because one of the steps is SMOTE


pipe = Pipeline([('selector', VarianceThreshold()),
                 ('scaler', StandardScaler()),
                 ('sampler', SMOTE()),
                 ('pca', PCA()),
                 ('knn', KNeighborsClassifier())])

pipe.fit(train, train_labels)

preds = pipe.predict(test)
print(accuracy_score(test_labels, preds))
```

```
0.9473684210526315
```


## Grid search

Before, we attempted to optimize a model by selecting its hyperparameters through a for loop. scikit-learn provides a much easier way: [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV). This class takes two main arguments: an estimator (or pipeline) and a *grid* of parameters we want the grid search to consider. The grid can be one of two things:

- A dictionary with the hyperparameter names as its keys and a list of values as the corresponding dictionary value:
```python
grid = {'name1': [val1, val2, val3], 'name2': [val4, val5], ...}
```
This makes the grid search try **all** possible combinations of parameter values:
```python
(val1, val4, ...), (val1, val5, ...), (val2, val4, ...), (val2, val5, ...), ... etc.
```

- A list of such dictionaries:
```python
grid = [{'name1': [val1, val2, val3], 'name2': [val4, val5], ...},
        {'name1': [val1, val2, val3], 'name3': [val6, val7], ...}]
```
Each dictionary is expanded into its own grid, and the search covers the union of these grids. Parameters from different dictionaries are not combined with each other.
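The difference between the two forms can be made concrete with scikit-learn's `ParameterGrid`, which enumerates the combinations a grid describes. The parameter names and values below are illustrative assumptions for a k-NN classifier.

```python
from sklearn.model_selection import ParameterGrid

# A single dictionary: every value of one parameter is paired with
# every value of the other (3 * 2 combinations)
single = {'n_neighbors': [1, 3, 5], 'p': [1, 2]}

# A list of dictionaries: each dictionary is expanded on its own
# and the results are concatenated (3 * 2 + 3 * 2 combinations)
multi = [{'n_neighbors': [1, 3, 5], 'p': [1, 2]},
         {'n_neighbors': [1, 3, 5], 'weights': ['uniform', 'distance']}]

print(len(ParameterGrid(single)))  # 6
print(len(ParameterGrid(multi)))   # 12

# Peek at the first combination of the second form
print(list(ParameterGrid(multi))[0])
```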

After creating such a grid, we can run the search:

```python
from sklearn.model_selection import GridSearchCV

grid = {...}
clf = GridSearchCV(estimator, grid)
clf.fit(X_train, y_train)  # will search all possible combinations defined by the grid
preds = clf.predict(X_test)  # will generate predictions based on the best configuration

# In order to access the best model:
clf.best_estimator_
```
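Besides `best_estimator_`, a fitted search exposes a few other attributes that are handy when comparing configurations. The sketch below is a small illustration on a deliberately tiny grid (an assumption made to keep it fast); for brevity it fits on the whole dataset, whereas in a real workflow we would fit on the training set only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [3, 5], 'p': [1, 2]},
                      cv=5)
search.fit(X, y)

# The winning hyperparameters and their mean cross-validation score
print(search.best_params_)
print(round(search.best_score_, 3))

# The mean cross-validation score of every combination that was tried
results = search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, round(mean_score, 3))
```

Note that `best_score_` is a cross-validation score computed on the data passed to `.fit()`, not a test-set score.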

```python
from sklearn.model_selection import GridSearchCV

# Scale the data to be comparable to previous.
scaled_train = scaler.fit_transform(train)
scaled_test = scaler.transform(test)

# Define a search grid.
grid = {'n_neighbors': list(range(1, 15, 2)),
        'p': [1, 2, 3, 4]}

# Create the GridSearch class. This will serve as our classifier from now on.
clf = GridSearchCV(knn, grid, cv=5)  # 5-fold cross validation

# Train the model as many times as designated by the grid.
clf.fit(scaled_train, train_labels)

# Evaluate on the test set and print best hyperparameters
preds = clf.predict(scaled_test)
print(accuracy_score(test_labels, preds))
print(clf.best_estimator_)
```

```
0.9780701754385965
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=1,
           weights='uniform')
```


Grid searches can be performed on pipelines too! The only thing that changes is that now we need to specify which step each parameter belongs to. This is done by adding both the name of the step and the name of the parameter separated by two underscores (i.e. `__`). 

```python
pipe = Pipeline([('step1', ...), ...])
grid = {'step1__param1': [val1, ...], ...}  # this dictates param1 from step1 to take the values [val1, ...]
clf = GridSearchCV(pipe, grid)
clf.fit(X_train, y_train)  # will search all possible combinations defined by the grid
preds = clf.predict(X_test)  # will generate predictions based on the best configuration
```
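If we are unsure which `step__parameter` names a pipeline accepts, we can list them with `get_params()`. A minimal sketch, using the step names from the scaler/k-NN pipeline above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('standardizer', StandardScaler()),
                 ('classifier', KNeighborsClassifier())])

# Every key that contains '__' is a parameter of one of the steps
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# The same names work with set_params(), which is what the grid search uses
pipe.set_params(classifier__n_neighbors=11)
print(pipe.named_steps['classifier'].n_neighbors)  # 11
```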

```python
# Revert to the previous pipeline
pipe = Pipeline([('standardizer', scaler), ('classifier', knn)])

# Define a grid that checks for hyperparameters for both steps
grid = {'standardizer__with_mean': [True, False],  # Check parameters True/False for 'with_mean' argument of scaler
        'standardizer__with_std': [True, False],  # Check parameters True/False for 'with_std' argument of scaler
        'classifier__n_neighbors': list(range(1, 15, 2)),  # Check for values of 'n_neighbors' of knn
        'classifier__p': [1, 2, 3, 4]}  # Check for values of 'p' of knn

# Create and train the grid search
clf = GridSearchCV(pipe, grid, cv=5)
clf.fit(train, train_labels)

# Evaluate on the test set and print best hyperparameter values
print('Best accuracy: {:.2f}%'.format(accuracy_score(test_labels, clf.predict(test))*100))
print(clf.best_estimator_)  # print the best configuration
```

```
Best accuracy: 97.37%
Pipeline(memory=None,
     steps=[('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classifier', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=1,
           weights='uniform'))])
```


Let's try to optimize the more complex pipeline. 

```python
pipe = Pipeline(steps=[('selector', VarianceThreshold()),
                       ('scaler', StandardScaler()),
                       ('sampler', SMOTE()),
                       ('pca', PCA()),
                       ('knn', KNeighborsClassifier())])

grid = {'selector__threshold': [0.0, 0.005],
        'pca__n_components': list(range(5, 16, 5)),
        'knn__n_neighbors': list(range(1, 15, 2)),
        'knn__p': [1, 2, 3, 4]}

clf = GridSearchCV(pipe, grid, cv=5)
clf.fit(train, train_labels)

print('Best accuracy: {:.2f}%'.format(accuracy_score(test_labels, clf.predict(test)) * 100))
print(clf.best_estimator_)
```

```
Best accuracy: 96.49%
Pipeline(memory=None,
     steps=[('selector', VarianceThreshold(threshold=0.0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('sampler', SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1,
   out_step='deprecated', random_state=None, ratio=None,
   sampling_strategy='auto', s...ki',
           metric_params=None, n_jobs=None, n_neighbors=5, p=1,
           weights='uniform'))])
```


With the inclusion of the resampling and feature selection/extraction steps, we actually managed to **hurt** our performance here (96.49% test accuracy, versus 97.37% for the simpler pipeline).

### Tips for using grid search:

1. Always **calculate** the number of times a model is fit. In the example above we check $2 \cdot 3 \cdot 7 \cdot 4 = 168$ different hyperparameter combinations. Because we are using 5-fold cross-validation, each combination requires 5 separate model fits, so the above grid search accounts for 840 fits (plus one final refit of the best configuration on the whole training set)! With a grid search it is very easy for this number to climb into the thousands, which would take a **long time to complete**. If we were using model-based feature selection or imputation, we would need to take that into account too!

2. Grid search has a parameter called `verbose` which offers several **levels of verbosity**. I'd recommend setting `verbose=1` so that *scikit-learn* informs you of the number of times a model needs to be trained and how much time it took. You can, however, set a larger value, which will report the progress of each fit in detail. Caution: this will flood your screen!

3. Instead of checking every parameter combination, which is often computationally infeasible, we could use a more **progressive** grid search! Imagine we want to optimize a hyperparameter `x` that ranges from $1$ to $1000$:
 - First perform a grid search on `[1, 5, 10, 50, 100, 500, 1000]` (or even sparser if it takes too long). Suppose we get the best performance for $x = 500$.
 - Now perform a grid search on `[200, 350, 500, 650, 800]`. Suppose the best performance is produced with $x=800$.
 - Choose an even finer grid around that value, e.g. `[725, 750, 775, 800, 825, 850, 875]`.
 - Repeat until you achieve the desired precision.

4. `GridSearchCV` has a parameter called `n_jobs`, which sets the number of jobs to run in parallel (`n_jobs=-1` uses all available processors). This can reduce computation time, but might cripple your PC while the search is running.
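The first and third tips can be sketched in code. The block below first counts the fits for the grid of the complex pipeline using `ParameterGrid`, and then runs a two-stage, coarse-to-fine search over `n_neighbors` on a simple scaler/k-NN pipeline. The helper `finer_grid` is hypothetical (it is not part of scikit-learn), and the coarse values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tip 1: count the fits before launching the search
grid = {'selector__threshold': [0.0, 0.005],
        'pca__n_components': list(range(5, 16, 5)),
        'knn__n_neighbors': list(range(1, 15, 2)),
        'knn__p': [1, 2, 3, 4]}
n_combinations = len(ParameterGrid(grid))
n_fits = n_combinations * 5  # 5-fold cross-validation
print(n_combinations, n_fits)  # 168 840


def finer_grid(values, best, num=5):
    """Hypothetical helper: integer grid between the two neighbours of `best`."""
    i = values.index(best)
    low = values[max(i - 1, 0)]
    high = values[min(i + 1, len(values) - 1)]
    candidates = np.linspace(low, high, num).round().astype(int).tolist()
    return sorted(set(candidates) | {best})  # always keep the current winner


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=13)
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# Tip 3, stage 1: a sparse grid over the whole range
coarse = [1, 5, 10, 50, 100]
search = GridSearchCV(pipe, {'knn__n_neighbors': coarse}, cv=5, verbose=1)
search.fit(X_train, y_train)
coarse_best = search.best_params_['knn__n_neighbors']

# Tip 3, stage 2: a denser grid around the winner of stage 1
fine = finer_grid(coarse, coarse_best)
search = GridSearchCV(pipe, {'knn__n_neighbors': fine}, cv=5, verbose=1)
search.fit(X_train, y_train)
fine_best = search.best_params_['knn__n_neighbors']
print(coarse_best, fine, fine_best)
```

With `verbose=1`, each call to `.fit()` also reports how many fits it is about to perform, which is a quick way to double-check the arithmetic.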


### Drawbacks:

One major drawback of using pipelines is that they support only scikit-learn compatible objects. Many preprocessing steps, however, need to be implemented in a library like *pandas*. To refine these steps we'll need to do so manually! Alternatively, you can write your own class in an sklearn-like manner and incorporate it into a pipeline.
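As a rough sketch of what such an sklearn-like class could look like, the hypothetical `Log1pTransformer` below wraps a NumPy operation (used here instead of *pandas* only to keep the example self-contained). It inherits from scikit-learn's `BaseEstimator` and `TransformerMixin`, which is one common way to get `get_params()`, `set_params()` and `fit_transform()` for free. The class name and its `apply` parameter are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transform: optionally apply log(1 + x) to every feature."""

    def __init__(self, apply=True):
        self.apply = apply  # store constructor arguments unchanged

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.log1p(X) if self.apply else X


X, y = load_breast_cancer(return_X_y=True)  # all features are non-negative

pipe = Pipeline([('log', Log1pTransformer()),
                 ('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# The custom step's parameter can be searched like any other
grid = {'log__apply': [True, False],
        'knn__n_neighbors': [3, 5]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the transform exposes `apply` as a constructor argument, the grid search can decide whether the step helps at all, which is a handy pattern for optional preprocessing.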