# Resampling Methods (MACS 30100)
### by [Richard W. Evans](https://sites.google.com/site/rickecon/), February 2018
The code in this Jupyter notebook was written using Python 3.6. It uses the [Titanic dataset](https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv). For the code to run properly, you need either internet access or a copy of the data file in the same folder as the Jupyter notebook file. Otherwise, you will have to change the lines of code that read in the data to reflect the location of that data. Some of this content was taken from Dr. Benjamin Soltoff's resampling methods notes [here](http://cfss.uchicago.edu/persp006_resampling.html).

Resampling methods are a way to test the sensitivity of statistical results to estimation using a different sample. It is often too difficult or too expensive to draw a new sample from the population, so resampling methods reuse the sample at hand. They take advantage of the training-set test-set paradigm to evaluate the sensitivity of estimates to sampling variation. The two main classes of resampling methods are:

1. Cross validation
2. Bootstrapping

In choosing models to predict or match data or to infer relationships between variables, James et al. (2013) decompose the process into *model assessment* and *model selection*. Model assessment, which is treated in this notebook, is the process of evaluating the fit or accuracy of a given model. Model selection is the process of adjusting parameters, variables, or functional relationships between variables to better fit the data.

## 1. Cross validation

### 1.1. Validation set approach
This is the approach that we have already studied in the [classifiers 1 notebook](https://github.com/UC-MACSS/persp-model_W18/blob/master/Notebooks/Classfcn1/KKNlogitLDA.ipynb).

1. Partition the data into a training set and a test set.
2. Estimate the model using the training set.
3. Evaluate the fit or predictive accuracy on the test set.

The primary measure of fit is the mean squared error (MSE) of the estimated model on the test set. Let the test set have $N$ observations. The MSE of the test set is the average of the squared deviations between the actual dependent variable values and the predicted values.

$$ MSE = \frac{1}{N}\sum_{i=1}^N\left(y_i - \hat{y}_i\right)^2 $$
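As a small illustration of the formula above, the following sketch computes the MSE for a made-up test set of eight 0/1 outcomes. The arrays `y_true` and `y_hat` are hypothetical values chosen for the example, not Titanic data. Because both the outcomes and the predictions are binary, each squared deviation is either 0 or 1, so the MSE coincides with the misclassification rate.

```python
import numpy as np

# Hypothetical actual and predicted 0/1 outcomes for a test set with N = 8
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# MSE: average of the squared deviations
mse = np.mean((y_true - y_hat) ** 2)

# Misclassification rate: share of predictions that differ from the truth
error_rate = np.mean(y_true != y_hat)

print('MSE =', mse)
print('error rate =', error_rate)
```

Two of the eight predictions are wrong, so both quantities equal 0.25.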

Let's calculate the MSE from our logistic regression of the Titanic example.

In [3]:
# Import needed stuff
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold
from sklearn import metrics 
from sklearn.metrics import classification_report, mean_squared_error
from pylab import rcParams

import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')



In [4]:
# Read in Titanic data
url = ('https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/' +
      'master/titanic-train.csv')
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age',
                   'SibSp','Parch','Ticket','Fare','Cabin','Embarked']

# Get rid of columns we don't use
titanic = titanic.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)

# Impute missing age values
def age_approx(cols):
    # Select by label; integer keys on a labeled row are ambiguous
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

titanic['Age'] = \
    titanic[['Age', 'Pclass']].apply(age_approx, axis=1)
    
# Drop any observations with missing values
titanic.dropna(inplace=True)

# Make gender dummies and embark dummies and get rid of
# original variables
gender = pd.get_dummies(titanic['Sex'], drop_first=True)
embark_location = pd.get_dummies(titanic['Embarked'],
                                 drop_first=True)
titanic.drop(['Sex', 'Embarked'], axis=1, inplace=True)
titanic = pd.concat([titanic, gender, embark_location], axis=1)

# Drop Pclass variable due to excessive correlation with Fare
titanic.drop(['Pclass'], axis=1, inplace=True)

titanic.head()

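| | Survived | Age | SibSp | Parch | Fare | male | Q | S |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 1 |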


Now partition the data into the same 60% training set and 40% test set that we used in the [logistic regression notebook](https://github.com/UC-MACSS/persp-model_W18/blob/master/Notebooks/Classfcn1/KKNlogitLDA.ipynb) and estimate the logistic regression with all the variables.

In [5]:
X = titanic[['Age', 'SibSp', 'Parch', 'Fare', 'male', 'Q', 'S']]
y = titanic['Survived']
# train_test_split randomly partitions the data; test_size=0.4 holds
# out 40% of the observations as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=25)
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)
y_pred = LogReg.predict(X_test)
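# Note that squaring doesn't matter here: y_test and y_pred are both
# 0 or 1, so each deviation is -1, 0, or 1 and the MSE equals the
# misclassification rate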

# You can code the MSE yourself
MSE_vs = ((y_test - y_pred) ** 2).sum() / y_pred.shape[0]
print('Validation set MSE = ', MSE_vs)

# Or you can use scikit-learn's method
print('Validation set MSE = ', mean_squared_error(y_test, y_pred))

Validation set MSE =  0.2247191011235955
Validation set MSE =  0.224719101124


### 1.2. Leave-one-out cross validation
Leave-one-out cross validation (LOOCV) is an approach in which the model is assessed using $N$ different training sets and test sets of a specific size. Let the data have $N$ observations. LOOCV chooses a training set with $N-1$ observations, such that the test set has only one observation $y_i$. Repeat this $N$ times, each time with a slightly different training set, such that each data point is the test set in exactly one of these subsets.

In this case, the mean squared error MSE has no summation because there is only one observation in the test set.

$$ MSE_i = (y_i - \hat{y}_i)^2 $$

The LOOCV estimate for the test MSE is the average of these $N$ test error estimates.

$$ CV_{loo} = \frac{1}{N}\sum_{i=1}^N MSE_i $$
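The two formulas above can be sketched compactly with scikit-learn's `cross_val_score` helper. This is a minimal sketch on synthetic data, not the Titanic sample: `X_syn`, `y_syn`, and the sample size are assumptions made for the example. With 0/1 outcomes and a single test observation, each accuracy score is 0 or 1, so $MSE_i$ is one minus that score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
N = 60
X_syn = rng.normal(size=(N, 2))
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(int)

# One accuracy score per left-out observation (each is 0.0 or 1.0)
acc = cross_val_score(LogisticRegression(), X_syn, y_syn,
                      cv=LeaveOneOut(), scoring='accuracy')

# For 0/1 outcomes, the squared error of one prediction is 1 - accuracy
mse_i = 1.0 - acc

# CV_loo is the average of the N single-observation MSEs
cv_loo = mse_i.mean()
print('Number of fits =', mse_i.shape[0])
print('CV_loo =', cv_loo)
```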

In [6]:
# Define loo as a leave-one-out object, then
# split it into N different partitions

# Note that LeaveOneOut().split() returns integer position arrays,
# so convert the pandas objects to NumPy arrays first. The removed
# .ix indexer is replaced by the positional .iloc indexer
Xvars = titanic.iloc[:, 1:8].values
yvals = titanic.iloc[:, 0].values
N_loo = Xvars.shape[0]
loo = LeaveOneOut()
loo.get_n_splits(Xvars)
MSE_vec = np.zeros(N_loo)

# This loop will take 20 or 30 seconds
for train_index, test_index in loo.split(Xvars):
    X_train, X_test = Xvars[train_index], Xvars[test_index]
    y_train, y_test = yvals[train_index], yvals[test_index]
    LogReg = LogisticRegression()
    LogReg.fit(X_train, y_train)
    y_pred = LogReg.predict(X_test)
    MSE_vec[test_index] = (y_test - y_pred) ** 2
    print('MSE for test set', test_index, ' is', MSE_vec[test_index])

MSE_loo = MSE_vec.mean()
# .std() is the standard deviation of the N individual errors
MSE_loo_std = MSE_vec.std()
print('test estimate MSE loocv=', MSE_loo,
      ', test estimate MSE std dev=', MSE_loo_std)


MSE for test set [0]  is [ 0.]
MSE for test set [1]  is [ 0.]
MSE for test set [2]  is [ 0.]
MSE for test set [3]  is [ 0.]
MSE for test set [4]  is [ 0.]
MSE for test set [5]  is [ 0.]
MSE for test set [6]  is [ 0.]
MSE for test set [7]  is [ 0.]
MSE for test set [8]  is [ 0.]
MSE for test set [9]  is [ 0.]
MSE for test set [10]  is [ 0.]
MSE for test set [11]  is [ 0.]
MSE for test set [12]  is [ 0.]
MSE for test set [13]  is [ 0.]
MSE for test set [14]  is [ 1.]
MSE for test set [15]  is [ 0.]
MSE for test set [16]  is [ 0.]
MSE for test set [17]  is [ 1.]
MSE for test set [18]  is [ 1.]
MSE for test set [19]  is [ 0.]
MSE for test set [20]  is [ 0.]
MSE for test set [21]  is [ 1.]
MSE for test set [22]  is [ 0.]
MSE for test set [23]  is [ 1.]
MSE for test set [24]  is [ 1.]
MSE for test set [25]  is [ 1.]
MSE for test set [26]  is [ 0.]
MSE for test set [27]  is [ 1.]
MSE for test set [28]  is [ 0.]
MSE for test set [29]  is [ 0.]
MSE for test set [30]  is [ 0.]
...

MSE for test set [272]  is [ 0.]
MSE for test set [273]  is [ 0.]
MSE for test set [274]  is [ 0.]
MSE for test set [275]  is [ 1.]
MSE for test set [276]  is [ 0.]
MSE for test set [277]  is [ 0.]
MSE for test set [278]  is [ 0.]
MSE for test set [279]  is [ 0.]
MSE for test set [280]  is [ 0.]
MSE for test set [281]  is [ 0.]
MSE for test set [282]  is [ 1.]
MSE for test set [283]  is [ 0.]
MSE for test set [284]  is [ 0.]
MSE for test set [285]  is [ 1.]
MSE for test set [286]  is [ 0.]
MSE for test set [287]  is [ 1.]
MSE for test set [288]  is [ 0.]
MSE for test set [289]  is [ 0.]
MSE for test set [290]  is [ 0.]
MSE for test set [291]  is [ 0.]
MSE for test set [292]  is [ 1.]
MSE for test set [293]  is [ 0.]
MSE for test set [294]  is [ 0.]
MSE for test set [295]  is [ 0.]
MSE for test set [296]  is [ 1.]
MSE for test set [297]  is [ 1.]
MSE for test set [298]  is [ 0.]
MSE for test set [299]  is [ 0.]
MSE for test set [300]  is [ 1.]
MSE for test set [301]  is [ 0.]
...

MSE for test set [563]  is [ 1.]
MSE for test set [564]  is [ 0.]
MSE for test set [565]  is [ 0.]
MSE for test set [566]  is [ 1.]
MSE for test set [567]  is [ 0.]
MSE for test set [568]  is [ 1.]
MSE for test set [569]  is [ 1.]
MSE for test set [570]  is [ 0.]
MSE for test set [571]  is [ 1.]
MSE for test set [572]  is [ 0.]
MSE for test set [573]  is [ 0.]
MSE for test set [574]  is [ 0.]
MSE for test set [575]  is [ 0.]
MSE for test set [576]  is [ 0.]
MSE for test set [577]  is [ 1.]
MSE for test set [578]  is [ 1.]
MSE for test set [579]  is [ 0.]
MSE for test set [580]  is [ 0.]
MSE for test set [581]  is [ 0.]
MSE for test set [582]  is [ 0.]
MSE for test set [583]  is [ 0.]
MSE for test set [584]  is [ 0.]
MSE for test set [585]  is [ 0.]
MSE for test set [586]  is [ 1.]
MSE for test set [587]  is [ 0.]
MSE for test set [588]  is [ 0.]
MSE for test set [589]  is [ 0.]
MSE for test set [590]  is [ 0.]
MSE for test set [591]  is [ 0.]
MSE for test set [592]  is [ 1.]
...

MSE for test set [820]  is [ 1.]
MSE for test set [821]  is [ 0.]
MSE for test set [822]  is [ 0.]
MSE for test set [823]  is [ 0.]
MSE for test set [824]  is [ 0.]
MSE for test set [825]  is [ 0.]
MSE for test set [826]  is [ 1.]
MSE for test set [827]  is [ 1.]
MSE for test set [828]  is [ 0.]
MSE for test set [829]  is [ 1.]
MSE for test set [830]  is [ 0.]
MSE for test set [831]  is [ 0.]
MSE for test set [832]  is [ 0.]
MSE for test set [833]  is [ 0.]
MSE for test set [834]  is [ 0.]
MSE for test set [835]  is [ 0.]
MSE for test set [836]  is [ 1.]
MSE for test set [837]  is [ 1.]
MSE for test set [838]  is [ 0.]
MSE for test set [839]  is [ 0.]
MSE for test set [840]  is [ 0.]
MSE for test set [841]  is [ 0.]
MSE for test set [842]  is [ 0.]
MSE for test set [843]  is [ 0.]
MSE for test set [844]  is [ 0.]
MSE for test set [845]  is [ 0.]
MSE for test set [846]  is [ 0.]
MSE for test set [847]  is [ 0.]
MSE for test set [848]  is [ 0.]
MSE for test set [849]  is [ 0.]
...

### 1.3. k-fold cross validation
$k$-fold cross validation is a method in which the dataset is randomly divided into $k$ groups (folds) of approximately equal size. Each fold $j = 1, \dots, k$ serves as the test set exactly once, with the model estimated on the data from the other $k-1$ folds. Let the number of observations in the $j$th fold be $N_j$, and let $\mathcal{K}_j$ be the set of observations in the $j$th fold. The $MSE_j$ of the $j$th fold is:

$$ MSE_j = \frac{1}{N_j}\sum_{i\in\mathcal{K}_j}(y_i - \hat{y}_i)^2 $$

Then the $k$-fold estimate for the test MSE is the average of these $k$ test error estimates.

$$ CV_{kf} = \frac{1}{k}\sum_{j=1}^k MSE_j $$

LOOCV is a special case of $k$-fold cross validation in which $k=N$.
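The special case can be checked directly. The sketch below uses a hypothetical toy array with $N=7$ rows and compares the test sets produced by `LeaveOneOut` with those produced by `KFold` when `n_splits` equals $N$ and no shuffling is applied.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data with N = 7 observations (values are irrelevant to the splits)
N = 7
X_toy = np.arange(N).reshape(-1, 1)

# Collect the test indices from each splitter
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
kf_tests = [test.tolist() for _, test in KFold(n_splits=N).split(X_toy)]

print('LOOCV test sets: ', loo_tests)
print('N-fold test sets:', kf_tests)
print('Identical splits:', loo_tests == kf_tests)
```

Each splitter yields $N$ test sets of one observation each, so the two procedures fit the same $N$ models.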

Let's use the Titanic data again and test our logit model performance with a $k$-fold cross validation with $k=6$.

In [None]:
# Number of folds, matching the k=6 stated in the text above
k = 6
kf = KFold(n_splits=k, random_state=10, shuffle=True)
kf.get_n_splits(Xvars)

MSE_vec_kf = np.zeros(k)

k_ind = int(0)
for train_index, test_index in kf.split(Xvars):
    # print("TRAIN:", train_index, "TEST:", test_index)
    print('k index=', k_ind)
    X_train, X_test = Xvars[train_index], Xvars[test_index]
    y_train, y_test = yvals[train_index], yvals[test_index]
    LogReg = LogisticRegression()
    LogReg.fit(X_train, y_train)
    y_pred = LogReg.predict(X_test)
    MSE_vec_kf[k_ind] = ((y_test - y_pred) ** 2).mean()
    print('MSE for test set', k_ind, ' is', MSE_vec_kf[k_ind])
    k_ind += 1

MSE_kf = MSE_vec_kf.mean()
MSE_kf_std = MSE_vec_kf.std()
# .std() is the standard deviation of the k fold-level MSEs
print('test estimate MSE k-fold=', MSE_kf,
      ', test estimate MSE std dev=', MSE_kf_std)
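The manual loop in the cell above can also be written with scikit-learn's `cross_val_score` helper. The sketch below is an illustration on synthetic data (`X_syn` and `y_syn` are assumptions, not the Titanic sample) and checks that the helper reproduces the fold-by-fold MSE of a manual loop when both use the same `KFold` object. It relies on the fact that, for 0/1 outcomes, the MSE of a fold equals one minus its accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(3)
N = 120
X_syn = rng.normal(size=(N, 3))
y_syn = (X_syn[:, 0] + X_syn[:, 1] + rng.normal(size=N) > 0).astype(int)

k = 6
kf = KFold(n_splits=k, shuffle=True, random_state=10)

# Manual loop: fit on k-1 folds, compute the MSE on the held-out fold
mse_manual = np.zeros(k)
for j, (train_index, test_index) in enumerate(kf.split(X_syn)):
    model = LogisticRegression().fit(X_syn[train_index], y_syn[train_index])
    y_hat = model.predict(X_syn[test_index])
    mse_manual[j] = np.mean((y_syn[test_index] - y_hat) ** 2)

# Same folds via cross_val_score; MSE = 1 - accuracy for 0/1 outcomes
acc = cross_val_score(LogisticRegression(), X_syn, y_syn,
                      cv=kf, scoring='accuracy')
mse_helper = 1.0 - acc

print('manual loop CV_kf =', mse_manual.mean())
print('helper CV_kf      =', mse_helper.mean())
```

Fixing `random_state` in `KFold` is what makes the two passes use identical folds.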

### 1.4. Bias versus variance
LOOCV has low bias, because each model is estimated on nearly all of the data ($N-1$ observations), but high variance, because each test error is based on a single observation. In contrast, the $k$-fold method has more bias, because each model is estimated with less data, but lower variance, because each test set has more observations.

* $k$-fold cross validation can often provide more accurate estimates of the test error rate.
* $k$-fold is less computationally intensive, because it requires only $k$ model fits.
* LOOCV has the least bias.
* LOOCV is the most computationally expensive, because it requires $N$ model fits.
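The computational side of this comparison can be made concrete by counting fits and split sizes. The sketch below assumes a hypothetical dataset of $N=100$ observations and the illustrative choices $k=5$ and $k=10$. For each scheme it records the number of model fits and the size of each training and test set.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical dataset size; only the number of rows matters here
N = 100
X_toy = np.zeros((N, 1))

schemes = [('LOOCV', LeaveOneOut()),
           ('5-fold', KFold(n_splits=5)),
           ('10-fold', KFold(n_splits=10))]

summary = {}
for name, splitter in schemes:
    sizes = [(len(train), len(test)) for train, test in splitter.split(X_toy)]
    n_fits = len(sizes)               # one model fit per split
    train_size, test_size = sizes[0]  # all splits have equal size here
    summary[name] = (n_fits, train_size, test_size)
    print(name, ': fits =', n_fits, ', training obs =', train_size,
          ', test obs =', test_size)
```

LOOCV trains each model on 99 observations but needs 100 fits, while 5-fold trains on 80 observations and needs only 5 fits.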

## 2. Bootstrapping
This name comes from the expression "to pull oneself up by one's own bootstraps." In a way similar to the cross validation methods of the last section, we can use *the bootstrap* to quantify the uncertainty associated with a given estimator, learning model, or method. In the econometrics and statistics literature, this often shows up as "bootstrapped standard errors". Bootstrapping is valuable because it is so widely applicable to a range of models.

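1. Randomly draw $S$ datasets of size $N_S$ with replacement from the original data. Define each training set of observations as $\mathcal{K}_s$ and each corresponding test set, the observations not in $\mathcal{K}_s$, as $-\mathcal{K}_s$.
2. Estimate the model on each training set $\mathcal{K}_s$ and calculate the MSE for each test set $-\mathcal{K}_s$.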

The bootstrap estimate for the test MSE is the average MSE from each random test set.

$$ CV_{boot} = \frac{1}{S}\sum_{s=1}^S MSE_s $$
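The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are synthetic (`X_syn`, `y_syn`), each bootstrap training set has the same size as the original sample ($N_S = N$), and the test set is taken to be the "out-of-bag" observations that were not drawn in that replication.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(1)
N = 200
X_syn = rng.normal(size=(N, 2))
y_syn = (X_syn[:, 0] - X_syn[:, 1] + 0.5 * rng.normal(size=N) > 0).astype(int)

S = 20                 # number of bootstrap replications
mse_s = np.zeros(S)

for s in range(S):
    # Step 1: draw N row indices with replacement (training set K_s)
    train_idx = rng.integers(0, N, size=N)
    # Test set -K_s: the observations that were never drawn
    test_idx = np.setdiff1d(np.arange(N), train_idx)

    # Step 2: estimate on K_s and compute the MSE on -K_s
    model = LogisticRegression().fit(X_syn[train_idx], y_syn[train_idx])
    y_hat = model.predict(X_syn[test_idx])
    mse_s[s] = np.mean((y_syn[test_idx] - y_hat) ** 2)

cv_boot = mse_s.mean()
print('CV_boot =', cv_boot)
```

Because sampling is with replacement, roughly a third of the observations are left out of each training set on average, so the out-of-bag test set is not empty in practice.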

In [None]:
N_bs = 10

MSE_vec_bs = np.zeros(N_bs)

for bs_ind in range(N_bs):
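    # Note: train_test_split samples without replacement, so each
    # iteration is a fresh random 60/40 split of the data rather
    # than a draw with replacement as described in step 1
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.4)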
    LogReg = LogisticRegression()
    LogReg.fit(X_train, y_train)
    y_pred = LogReg.predict(X_test)
    MSE_vec_bs[bs_ind] = ((y_test - y_pred) ** 2).mean()
    print('MSE for test set', bs_ind, ' is', MSE_vec_bs[bs_ind])

MSE_bs = MSE_vec_bs.mean()
MSE_bs_std = MSE_vec_bs.std()
# .std() is the standard deviation of the N_bs replication MSEs
print('test estimate MSE bootstrap=', MSE_bs,
      ', test estimate MSE std dev=', MSE_bs_std)
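The "bootstrapped standard errors" mentioned at the start of this section can be illustrated with the simplest possible estimator, the sample mean. The sketch below uses hypothetical simulated data (`sample`) and an assumed number of replications `S`. It compares the bootstrap standard error with the textbook formula $s/\sqrt{N}$, which is available for the mean but not for most other estimators.

```python
import numpy as np

# Hypothetical sample of N = 500 observations
rng = np.random.default_rng(2)
sample = rng.normal(loc=30.0, scale=10.0, size=500)

# Re-estimate the mean on S resamples drawn with replacement
S = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(S)
])

# Bootstrap standard error: spread of the estimate across resamples
se_boot = boot_means.std(ddof=1)

# Analytical standard error of the mean, for comparison
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)

print('bootstrap SE =', se_boot)
print('formula SE   =', se_formula)
```

The two numbers should be close. The appeal of the bootstrap is that the same resampling loop works when no analytical formula exists.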

## References
* James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [*An Introduction to Statistical Learning with Applications in R*](http://link.springer.com.proxy.uchicago.edu/book/10.1007%2F978-1-4614-7138-7), New York, Springer (2013).