# Best Practices for Model Evaluation and Hyperparameter Tuning

## Streamlining workflows with pipelines

As we saw earlier, the parameters obtained while fitting on the training data (for example, the scaling parameters) must be reused when transforming new data such as the test set. We will learn how to use the **Pipeline** class in scikit-learn to do this conveniently.

### Loading a dataset

We will be working with the **Breast Cancer Wisconsin** dataset, which contains 569 samples of malignant and benign tumor cells.

We start by reading the dataset; later we will split it into training and test sets:

In [3]:
import pandas as pd

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', 
    header=None)

Next we assign the 30 features to a NumPy array x. Using **LabelEncoder**, we transform the class labels from their original string representation (**M** and **B**) into integers.

In [4]:
from sklearn.preprocessing import LabelEncoder

x = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

The two classes are now encoded as 1 = M (malignant) and 0 = B (benign). To check that the mapping is correct, we can call the fitted encoder's transform method on two dummy labels:

In [5]:
le.transform(['M', 'B'])

array([1, 0])
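To see where this mapping comes from, here is a minimal sketch on a handful of toy labels. The `labels` array and the `le_demo` name are made up for illustration and are not part of the dataset. It relies on the fact that LabelEncoder sorts the unique labels and uses each label's position as its integer code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (hypothetical data)
labels = np.array(['M', 'B', 'B', 'M', 'B'])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted unique labels; the index of a label is its code
print(le_demo.classes_)   # ['B' 'M']
print(encoded)            # [1 0 0 1 0]

# inverse_transform maps the integer codes back to the original strings
decoded = le_demo.inverse_transform(encoded)
print(decoded)            # ['M' 'B' 'B' 'M' 'B']
```

Because 'B' sorts before 'M', benign becomes 0 and malignant becomes 1, which matches the result above.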

Before constructing the first model pipeline, let's divide the data into training (80%) and test (20%) sets:

In [6]:
# sklearn.cross_validation was removed from scikit-learn; use model_selection
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=1)

### Combining transformers and estimators in a pipeline

As we learned previously, we want to standardize the features before fitting a model. Let's also assume that we want to compress our data from 30 dimensions onto a lower two-dimensional subspace via PCA. Instead of fitting and transforming the training and test sets separately at each step, we can chain the **StandardScaler**, **PCA** and **LogisticRegression** objects in a pipeline:

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [8]:
pipe_lr = Pipeline([('scl', StandardScaler()), 
                    ('pca', PCA(n_components=2)), 
                    ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(x_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(x_test, y_test))

Test Accuracy: 0.947


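The **Pipeline** object takes a list of tuples as input, where the first element of each tuple is an ID (an arbitrary string) used to access the individual steps of the pipeline, and the second element is a scikit-learn transformer or estimator. All intermediate steps must be transformers; the last step is the estimator.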
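The following sketch shows how the step IDs can be used once a pipeline is fitted. It uses a small synthetic dataset from `make_classification` instead of the breast cancer data so that it runs without a download; the `X_demo`, `y_demo` and `pipe_demo` names are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 200 samples, 10 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2)),
                      ('clf', LogisticRegression(random_state=1))])
pipe_demo.fit(X_demo, y_demo)

# Each fitted step is reachable through the ID given in its tuple
scaler = pipe_demo.named_steps['scl']
pca = pipe_demo.named_steps['pca']
print(scaler.mean_.shape)              # one mean per input feature
print(pca.explained_variance_ratio_)   # variance ratio of the 2 components

# Step parameters are addressed as '<ID>__<parameter name>'
pipe_demo.set_params(pca__n_components=3)
print(pipe_demo.named_steps['pca'].n_components)
```

The same `<ID>__<parameter name>` scheme is what makes it possible to tune the parameters of individual steps later on.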

## Cross-validation

To obtain a reliable estimate of how well a model generalizes to unseen data, and thereby detect overfitting, we can use **cross-validation**. There are two major techniques: **holdout cross-validation** and **k-fold cross-validation**.

### Holdout method

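If we split the dataset into training and test sets and then perform model selection and hyperparameter tuning repeatedly on the same test set, the test set effectively becomes part of the training data and the performance estimate becomes overly optimistic. A better way to use the **holdout method** is to split the data into training, validation and test sets: the validation set is used for model selection, and the test set is used only for the final performance estimate.

The drawback of this method is that the performance estimate is sensitive to how the data happens to be split, and setting aside a separate validation set can be impractical for small datasets.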
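A three-way holdout split can be sketched by calling train_test_split twice. The 60/20/20 proportions, the toy arrays and the variable names below are assumptions chosen for illustration, not values taken from this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (hypothetical)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# Step 1: hold out 20% of all samples as the final test set
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Step 2: split the remaining 80 samples 75/25 into training and validation,
# which gives 60/20/20 overall
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_tr), len(X_val), len(X_hold))   # 60 20 20
```

Models would be fitted on the training part, compared on the validation part, and scored once on the held-out test part at the very end.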

### K-fold cross-validation

In **k-fold cross-validation** we randomly split the training dataset into *k* folds without replacement, where $k-1$ folds are used for model training and one fold is used for testing. The procedure is repeated *k* times so that each fold serves as the test fold exactly once, and the *k* performance estimates are then averaged.
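The splitting scheme itself can be sketched in a few lines of NumPy. The `kfold_indices` helper below is a hypothetical illustration of the idea, not the scikit-learn implementation.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)   # shuffle once, without replacement
    folds = np.array_split(indices, k)     # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        # All folds except the i-th one form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Count how often each of 10 samples ends up in a test fold
test_counts = np.zeros(10, dtype=int)
for train_idx, test_idx in kfold_indices(n_samples=10, k=5):
    test_counts[test_idx] += 1
    print('train:', np.sort(train_idx), 'test:', np.sort(test_idx))

print(test_counts)   # every sample is held out exactly once
```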

A standard choice for *k* is 10, which is a reasonable number for many datasets. For small datasets we can increase *k*, so that more data is used for training in every iteration.
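As a rough worked example of this trade-off, the sketch below computes how many samples are available for fitting in each iteration for a few values of *k*. It assumes the 455 training samples that result from the 80/20 split of the 569 samples above; the chosen values of *k* are arbitrary.

```python
# Training-set size per iteration for different k, assuming the 455 training
# samples left after holding out 114 of the 569 samples for testing
n_train = 455

fit_sizes = {}
for k in (5, 10, 20):
    # One fold (about n_train / k samples) is held out in each iteration
    held_out = n_train // k
    fit_sizes[k] = n_train - held_out
    print('k=%2d: about %d samples for fitting, %d held out'
          % (k, fit_sizes[k], held_out))
```

With larger *k*, each model is fitted on a larger share of the data, at the cost of fitting more models.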

For classification problems we can improve on standard k-fold with **stratified k-fold**, which tends to yield better performance estimates when the class proportions are unequal, because every fold preserves the class proportions of the full training set.
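The difference is easiest to see on a deliberately imbalanced toy label vector. The sketch below compares the per-fold class counts of KFold and StratifiedKFold; the 90/10 class ratio and the choice of five folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels with a 90/10 class imbalance, sorted by class (hypothetical)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))   # features are irrelevant for the split itself

fold_counts = {}
for name, splitter in [('KFold', KFold(n_splits=5)),
                       ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    # Class counts [n_class_0, n_class_1] in each test fold
    fold_counts[name] = [
        np.bincount(y_demo[test], minlength=2).tolist()
        for _, test in splitter.split(X_demo, y_demo)]
    print(name, fold_counts[name])
```

Without shuffling, plain KFold produces four test folds that contain no minority-class samples at all, whereas every stratified fold keeps the 90/10 ratio (18 and 2 samples per test fold).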

In [9]:
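import numpy as np
# sklearn.cross_validation was removed from scikit-learn; use model_selection
from sklearn.model_selection import StratifiedKFold

In [10]:
# In the model_selection API, n_splits replaces n_folds and the labels are
# passed to split(); random_state is omitted because it only applies when
# shuffle=True
kfold = StratifiedKFold(n_splits=10).split(x_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    # Fit on the k-1 training folds, then score on the held-out fold
    pipe_lr.fit(x_train[train], y_train[train])
    score = pipe_lr.score(x_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))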

Fold: 1, Class dist.: [256 153], Acc: 0.891
Fold: 2, Class dist.: [256 153], Acc: 0.978
Fold: 3, Class dist.: [256 153], Acc: 0.978
Fold: 4, Class dist.: [256 153], Acc: 0.913
Fold: 5, Class dist.: [256 153], Acc: 0.935
Fold: 6, Class dist.: [257 153], Acc: 0.978
Fold: 7, Class dist.: [257 153], Acc: 0.933
Fold: 8, Class dist.: [257 153], Acc: 0.956
Fold: 9, Class dist.: [257 153], Acc: 0.978
Fold: 10, Class dist.: [257 153], Acc: 0.956


In [11]:
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy: 0.950 +/- 0.029


Although the previous code is useful for understanding how k-fold works under the hood, scikit-learn also provides a more convenient implementation, **cross_val_score**, which runs the same procedure in a single call.

In [12]:
# sklearn.cross_validation was removed from scikit-learn; use model_selection
from sklearn.model_selection import cross_val_score

In [18]:
scores = cross_val_score(estimator=pipe_lr,
                        X=x_train,
                        y=y_train,
                        cv=10,
                        n_jobs=1)
print('CV accuracy scores: %s' % scores)

CV accuracy scores: [ 0.89130435  0.97826087  0.97826087  0.91304348  0.93478261  0.97777778
  0.93333333  0.95555556  0.97777778  0.95555556]


In [19]:
print('CV accuracy %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy 0.950 +/- 0.029


## Debugging algorithms with learning and validation curves

We will now take a look at the performance of algorithms via **learning curves** and **validation curves**.

### Diagnosing bias and variance problems with learning curves

