# Resampling Methods

## Left-out samples validation

The **training error** can be easily calculated by applying the statistical learning method to the observations used in its training. But because of overfitting, the training error rate can dramatically underestimate the error that would be obtained on new samples.


The **test error** is the average error that results from using a learning method to predict the response on new samples, that is, on samples that were not used to train the method. Given a data set, the use of a particular learning method is warranted if it results in a low test error. The test error can be easily calculated if a designated test set is available. Unfortunately, this is usually not the case.

Thus the original dataset is generally split into a training set and a test (or validation) set. With a large training set (e.g. 80%) and a small test set (20%), the estimate of the predictive performance may be poor because it relies on few test samples. Conversely, a large test set and a small training set may produce a poorly estimated learner. This is why, in situations where we cannot afford such a split, it is recommended to use a cross-validation scheme to estimate the predictive power of a learning algorithm.
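
As an illustration of the single split described above, here is a minimal sketch using scikit-learn's `train_test_split`. The synthetic dataset, the 80%/20% proportions and the `Ridge` learner are assumptions chosen for the example, not a recommendation.

```python
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split

# Synthetic data with as many features as samples, so overfitting is likely
X, y = datasets.make_regression(n_samples=100, n_features=100,
                                n_informative=10, random_state=42)

# Hold out 20% of the samples; they are never seen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = lm.Ridge(alpha=10)
model.fit(X_train, y_train)

# The training score is optimistic, the test score estimates generalization
r2_train = metrics.r2_score(y_train, model.predict(X_train))
r2_test = metrics.r2_score(y_test, model.predict(X_test))
print("Train r2:%.2f" % r2_train)
print("Test  r2:%.2f" % r2_test)
```

With only 20 test samples, the test score depends noticeably on which samples happen to be held out, which is the weakness that cross-validation addresses.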


## Cross-Validation (CV)

A cross-validation scheme randomly divides the set of observations into $K$ groups, or **folds**, of approximately equal size. The first fold is treated as a validation set, and the method $f()$ is fitted on the union of the remaining $K - 1$ folds: $f(X_{-K}, y_{-K})$.

The error measure (generally a loss function) is evaluated on the observations in the held-out fold. This procedure is repeated $K$ times; each time, a different group of observations is treated as the test set. For each sample $i$ we consider the model estimated on the data set that did not contain it, noted $-K(i)$.
Then we compare the predicted value $f(X_{-K(i)}) = \hat{y}_i$ with the true value $y_i$ using an error function $L()$. The cross-validation estimate of the prediction error is

$$
CV(f) = \frac{1}{N} \sum_{i=1}^N L\left(y_i, f(X_{-K(i)}) \right).
$$

This validation scheme is known as **K-fold CV**. Typical choices of $K$ are 5 or 10 [Kohavi 1995]. The extreme case where $K = N$ is known as **leave-one-out cross-validation (LOO CV)**.
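
The CV estimate can be written by hand in a few lines. The sketch below assumes a squared-error loss for $L()$ and a `Ridge` learner; `kfold_cv_error` is a hypothetical helper written for this example, not a scikit-learn function. Setting the number of folds to $N$ gives the leave-one-out case.

```python
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm


def kfold_cv_error(model, X, y, n_folds, seed=0):
    """Return CV(f) for the squared-error loss (hypothetical helper)."""
    n = len(y)
    # Shuffle the sample indices, then cut them into n_folds groups
    # of approximately equal size
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_pred = np.zeros(n)
    for test in folds:
        # Train on every sample that is not in the held-out fold
        train = np.setdiff1d(np.arange(n), test)
        model.fit(X[train], y[train])
        y_pred[test] = model.predict(X[test])
    # Average the loss L(y_i, f(X_{-K(i)})) over the N samples
    return np.mean((y - y_pred) ** 2)


X, y = datasets.make_regression(n_samples=60, n_features=20, noise=10,
                                n_informative=5, random_state=42)
model = lm.Ridge(alpha=10)

cv5 = kfold_cv_error(model, X, y, n_folds=5)
cv_loo = kfold_cv_error(model, X, y, n_folds=len(y))  # K = N: leave-one-out
print("5-fold CV error: %.2f" % cv5)
print("LOO CV error:    %.2f" % cv_loo)
```

Leave-one-out requires $N$ model fits instead of $K$, which is one practical reason to prefer 5 or 10 folds on larger datasets.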

### CV for regression

Usually the error function $L()$ is the R-squared score, which is a score (higher is better) rather than a loss. However, other functions could be used.

In [17]:
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

X, y = datasets.make_regression(n_samples=100, n_features=100,
                         n_informative=10, random_state=42)
model = lm.Ridge(alpha=10)

# 5 contiguous folds (no shuffling, so no random_state is needed)
cv = KFold(n_splits=5)
y_test_pred = np.zeros(len(y))
y_train_pred = np.zeros(len(y))

for train, test in cv.split(X):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)
    y_test_pred[test] = model.predict(X_test)
    y_train_pred[train] = model.predict(X_train)

print("Train r2:%.2f" % metrics.r2_score(y, y_train_pred))
print("Test  r2:%.2f" % metrics.r2_score(y, y_test_pred))

Train r2:0.99
Test  r2:0.72


Scikit-learn provides a user-friendly function to perform CV:

In [18]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test  r2:%.2f" % scores.mean())

# provide a cv
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)
print("Test  r2:%.2f" % scores.mean())

Test  r2:0.73
Test  r2:0.73


### CV for classification

With classification problems it is essential to sample folds where each set contains approximately the same percentage of samples of each target class as the complete set. This is called **stratification**. In this case, we will use ``StratifiedKFold``, which is a variation of k-fold that returns stratified folds.
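
To see what stratification changes, the sketch below compares the proportion of the minority class in each test fold for ``KFold`` and ``StratifiedKFold``. The toy labels (80 samples of class 0 followed by 20 of class 1) are an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels, sorted by class: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant to how folds are built

props = {}
for name, cv in [("KFold", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Proportion of class 1 in each test fold
    props[name] = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, np.round(props[name], 2))
```

Because the labels are sorted and ``KFold`` does not shuffle by default, each unstratified test fold contains a single class, whereas every stratified fold keeps the overall 20% proportion of class 1.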

Usually the error functions $L()$ are, at least, the sensitivity and the specificity. However, other functions could be used.

In [19]:
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import StratifiedKFold

X, y = datasets.make_classification(n_samples=100, n_features=100,
                         n_informative=10, random_state=42)

model = lm.LogisticRegression(C=1)

# Stratified folds: the labels are passed to split(), not to the constructor
cv = StratifiedKFold(n_splits=5)
y_test_pred = np.zeros(len(y))
y_train_pred = np.zeros(len(y))

for train, test in cv.split(X, y):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)
    y_test_pred[test] = model.predict(X_test)
    y_train_pred[train] = model.predict(X_train)

recall_test  = metrics.recall_score(y, y_test_pred, average=None)
recall_train = metrics.recall_score(y, y_train_pred, average=None)
acc_test = metrics.accuracy_score(y, y_test_pred)


print("Train SPC:%.2f; SEN:%.2f" % tuple(recall_train))
print("Test  SPC:%.2f; SEN:%.2f" % tuple(recall_test))
print("Test  ACC:%.2f" % acc_test)

Train SPC:1.00; SEN:1.00
Test  SPC:0.80; SEN:0.82
Test  ACC:0.81


Scikit-learn provides a user-friendly function to perform CV:

In [20]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
scores.mean()

# provide CV and score
def balanced_acc(estimator, X, y):
    '''
    Balanced accuracy scorer
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()

scores = cross_val_score(estimator=model, X=X, y=y, cv=cv, scoring=balanced_acc)
print("Test  ACC:%.2f" % scores.mean())

Test  ACC:0.81


Note that the Scikit-learn user-friendly function averages the scores obtained on the individual folds, which may give slightly different results from the overall score computed earlier on the pooled predictions.
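
The two ways of aggregating can be compared side by side. This sketch reuses the regression setting from earlier (synthetic data, `Ridge`, 5 contiguous folds, all assumptions of the example) and relies on `cross_val_predict` to collect the out-of-fold predictions.

```python
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = datasets.make_regression(n_samples=100, n_features=100,
                                n_informative=10, random_state=42)
model = lm.Ridge(alpha=10)
cv = KFold(n_splits=5)

# (a) One score per fold, then the mean of the five scores
fold_scores = cross_val_score(model, X, y, cv=cv)

# (b) Pool the out-of-fold predictions, then compute a single score
y_test_pred = cross_val_predict(model, X, y, cv=cv)
pooled_score = metrics.r2_score(y, y_test_pred)

print("Mean of fold r2:%.2f" % fold_scores.mean())
print("Pooled r2:      %.2f" % pooled_score)
```

The two numbers differ in general because the R-squared score normalizes by the variance of the targets, which is computed within each fold in (a) and over the whole sample in (b).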

## CV for model selection: setting the hyper parameters

It is important to note that CV may be used for two separate goals:

1. **Model assessment**: having chosen a final model, estimating its prediction error (generalization error) on new data.

2. **Model selection**: estimating the performance of different models in order to choose the best one. One special case of model selection is the selection of a model's hyper parameters. Indeed, remember that most learning algorithms have hyper parameters (typically the regularization parameter) that have to be set.

Generally we must address the two problems simultaneously. The usual approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.

- The **training set** (train) is used to fit the models;

- the **validation set** (val) is used to estimate prediction error for model selection or to determine the hyper parameters over a grid of possible values.

- the **test set** (test) is used for assessment of the generalization error of the final chosen model.


### Grid search procedure

Model selection of the best hyper parameters over a grid of possible values.

For each possible value of the hyper parameters $\alpha_k$:

1. Fit the learner on training set: $f(X_{train}, y_{train}, \alpha_k)$

2. Evaluate the model on the validation set and keep the parameter(s) that minimise the error measure

    $\alpha_* = \arg \min_{\alpha_k} L(f_{\alpha_k}(X_{val}), y_{val})$, where $f_{\alpha_k}$ is the learner fitted in step 1 with $\alpha_k$.

3. Refit the learner on all training + validation data using the best hyper parameters: $f^* \equiv f(X_{train \cup val}, y_{train \cup val}, \alpha_*)$

4. **Model assessment** of $f^*$ on the test set: $L(f^*(X_{test}), y_{test})$
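
The four steps above can be sketched with a plain three-way split. The 60%/20%/20% proportions, the `Ridge` learner, the grid of `alpha` values and the mean squared error as $L()$ are all assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split

X, y = datasets.make_regression(n_samples=200, n_features=50, noise=10,
                                n_informative=5, random_state=42)

# Three-way split: 60% train, 20% validation, 20% test (arbitrary proportions)
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=42)

alphas = 10. ** np.arange(-3, 3)
val_errors = []
for alpha in alphas:
    # Steps 1 and 2: fit on train, measure the error on validation
    model = lm.Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors.append(metrics.mean_squared_error(y_val, model.predict(X_val)))

best_alpha = alphas[int(np.argmin(val_errors))]

# Step 3: refit on train + validation with the selected hyper parameter
final_model = lm.Ridge(alpha=best_alpha).fit(X_trval, y_trval)

# Step 4: assess the final model on the untouched test set
test_error = metrics.mean_squared_error(y_test, final_model.predict(X_test))
print("Best alpha:", best_alpha)
print("Test MSE: %.2f" % test_error)
```

The test set is used exactly once, after the hyper parameter has been chosen, so the reported error is not biased by the selection step.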

### Nested CV for Model selection and assessment

Most of the time, we cannot afford such a three-way split. Thus, again we will use CV, but in this case we need two nested CVs.

One **outer CV loop, for model assessment**. This CV performs $K$ splits of the dataset into a training plus validation set ($X_{-K}, y_{-K}$) and a test set ($X_{K}, y_{K}$).

One **inner CV loop, for model selection**. For each run of the outer loop, the inner loop performs $L$ splits of the dataset ($X_{-K}, y_{-K}$) into a training set ($X_{-K,-L}, y_{-K,-L}$) and a validation set ($X_{-K,L}, y_{-K,L}$).
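
Written out by hand, the two nested loops look roughly as follows. This is only a sketch under stated assumptions: a `Ridge` learner, a single hyper parameter `alpha`, the mean squared error for selection, and 5 folds in both loops.

```python
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

X, y = datasets.make_regression(n_samples=100, n_features=50, noise=10,
                                n_informative=5, random_state=42)
alphas = 10. ** np.arange(-3, 3)
outer_cv, inner_cv = KFold(n_splits=5), KFold(n_splits=5)
n_inner = inner_cv.get_n_splits()
y_test_pred = np.zeros(len(y))
selected = []

for trval, test in outer_cv.split(X):
    X_trval, y_trval = X[trval], y[trval]
    # Inner loop: mean validation error of each alpha over the inner folds
    inner_errors = np.zeros(len(alphas))
    for train, val in inner_cv.split(X_trval):
        for k, alpha in enumerate(alphas):
            model = lm.Ridge(alpha=alpha).fit(X_trval[train], y_trval[train])
            err = metrics.mean_squared_error(y_trval[val],
                                             model.predict(X_trval[val]))
            inner_errors[k] += err / n_inner
    best_alpha = alphas[int(np.argmin(inner_errors))]
    selected.append(best_alpha)
    # Refit on training + validation, then predict the outer test fold
    model = lm.Ridge(alpha=best_alpha).fit(X_trval, y_trval)
    y_test_pred[test] = model.predict(X[test])

print("Selected alphas:", selected)
print("Test  r2:%.2f" % metrics.r2_score(y, y_test_pred))
```

The outer test fold never takes part in the choice of `alpha`, so the final score assesses the whole procedure, selection included.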

### Implementation with scikit-learn

Note that the inner CV loop combined with the learner forms a new learner with an automatic model (parameter) selection procedure. This new learner can be easily constructed using Scikit-learn. The learner is wrapped inside a ``GridSearchCV`` class.

Then the new learner can be plugged into the classical outer CV loop.

In [21]:
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

# Dataset
noise_sd = 10
X, y, coef = datasets.make_regression(n_samples=50, n_features=100, noise=noise_sd,
                         n_informative=2, random_state=42, coef=True)

# Use this to tune the noise parameter such that snr < 5
print("SNR:", np.std(np.dot(X, coef)) / noise_sd)

# param grid over alpha & l1_ratio
param_grid = {'alpha': 10. ** np.arange(-3, 3), 'l1_ratio':[.1, .5, .9]}


# Wrap the learner: GridSearchCV runs the inner CV loop
model = GridSearchCV(lm.ElasticNet(max_iter=10000), param_grid, cv=5)

# 1) Biased usage: fit on all data, omit outer CV loop
model.fit(X, y)
print("Train r2:%.2f" % metrics.r2_score(y, model.predict(X)))
print(model.best_params_)

# 2) User made outer CV, useful to extract specific information
cv = KFold(n_splits=5)
y_test_pred = np.zeros(len(y))
y_train_pred = np.zeros(len(y))
alphas = list()

for train, test in cv.split(X):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)
    y_test_pred[test] = model.predict(X_test)
    y_train_pred[train] = model.predict(X_train)
    alphas.append(model.best_params_)

print("Train r2:%.2f" % metrics.r2_score(y, y_train_pred))
print("Test  r2:%.2f" % metrics.r2_score(y, y_test_pred))
print("Selected alphas:", alphas)

# 3) User-friendly sklearn for outer CV
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)
print("Test  r2:%.2f" % scores.mean())

SNR: 2.63584694464
Train r2:0.96
{'l1_ratio': 0.9, 'alpha': 1.0}
Train r2:1.00
Test  r2:0.62
Selected alphas: [{'l1_ratio': 0.9, 'alpha': 0.001}, {'l1_ratio': 0.9, 'alpha': 0.001}, {'l1_ratio': 0.9, 'alpha': 0.001}, {'l1_ratio': 0.9, 'alpha': 0.01}, {'l1_ratio': 0.9, 'alpha': 0.001}]
Test  r2:0.55


## Random Permutations

## Bootstrapping