# Bootstrapping

Today, we are going to examine two tools that I think border on statistical magic: **bootstrapping** and **k-fold cross-validation**. We saw bootstrapping before when we introduced descriptive statistics, so we will start there today.

Recall the difference between a **population** and a **sample**. The population is all the possible observations out there. For instance, if I were an epidemiologist, this might be people in the U.S. or children in Washington state.


<img width="304" height="232" alt="Image result for population versus sample" src="http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/Sampling3.jpg">

The same idea applies to measurements in the lab, such as all the possible voltage values you could read from a battery. There could be an infinite number of possible measurements, but either way, accessing the actual population is virtually impossible.

<img width="304" height="223" alt="Image result for thermometer gif" src="https://media.giphy.com/media/26FL3uMhARSAvIZZS/giphy.gif">


As a scientist, you only have access to a sample. Part of designing an experiment is choosing how big your sample should be.

But a key problem is this: if you change your sample, your sample mean and sample variance could change significantly. We talked previously about how calculating **p values** is one way of establishing confidence in our measurements. **Bootstrapping** is a fancy statistical way of getting uncertainties (such as confidence intervals and standard errors in linear regression) just by resampling the data you already have.
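To make the sampling problem concrete, here is a minimal sketch using made-up numbers. The simulated "population" of voltage readings, its mean and spread, and the sample size of 20 are all assumptions chosen for illustration. It first shows how the sample mean and variance shift from one sample to the next, then bootstraps a single sample to estimate that spread without touching the population again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population": 100,000 simulated voltage readings (assumed values)
population = rng.normal(loc=1.5, scale=0.1, size=100_000)

# Three different samples of 20 give three slightly different means and variances
for trial in range(3):
    sample = rng.choice(population, size=20, replace=False)
    print('sample {}: mean = {:.3f}, variance = {:.4f}'.format(
        trial, sample.mean(), sample.var(ddof=1)))

# Bootstrapping: resample ONE sample with replacement, many times over
sample = rng.choice(population, size=20, replace=False)
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# The spread of the bootstrap means estimates the standard error of the mean
boot_sem = boot_means.std(ddof=1)
analytical_sem = sample.std(ddof=1) / np.sqrt(sample.size)
print('bootstrap standard error of the mean:  {:.4f}'.format(boot_sem))
print('analytical standard error of the mean: {:.4f}'.format(analytical_sem))
```

The two standard errors should land close to each other, which is a useful sanity check before applying the same trick to quantities that have no simple formula.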

Let's demonstrate:

Let's say we were in the lab, and we were making a calibration curve, say absorbance against concentration using UV-vis measurements. You generate the following plot:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd

Generate data and do linear regression

In [None]:
y = np.linspace(0, 100, 20)
x_raw = np.linspace(0, 4, 20)
x = x_raw + np.exp(x_raw*0.4*np.random.rand(20)) - 1

slope, intercept, rvalue, pvalue, stderr = stats.linregress(x, y)
y_fit = intercept + np.array([0,8])*slope

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(x, y, alpha=0.5, label='raw data')
ax.plot(np.array([0,8]), y_fit, linewidth=4, label='linear fit')
#ax.set_yscale('log')
#ax.set_xscale('log')
ax.set_xlim([0, 8])
ax.set_ylim([0, 120])
ax.legend()

But uh oh. You can see that some of those points are a little wiggly. You want to get some indication of how your calibration curve would change if you repeated the experiment. Instead of actually repeating the experiment, you decide to use bootstrapping.

In bootstrapping, you resample from your original dataset with replacement (pretending it's representative of the actual population) and perform a regression on each resample. Because we sample with replacement, a resample can be the same size as the original dataset while containing some points more than once and others not at all. We will write some code to do this for us, taking advantage of `pandas` sampling capabilities:

In [None]:
data = pd.DataFrame({'x': x, 'y':y})
data.head()

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, y, alpha=0.5, label='raw data')

# initializing outputs
slope = np.zeros(10)
intercept = np.zeros(10)
rvalue = np.zeros(10)
pvalue = np.zeros(10)
stderr = np.zeros(10)

for i in range(0, 10):
    # Using pandas to grab a random sample WITH replacement
    subsample = data.sample(n=20, replace=True)
    #print(i)
    # Linear regression performed on each sample.
    slope[i], intercept[i], rvalue[i], pvalue[i], stderr[i] = stats.linregress(subsample['x'], subsample['y'])
    y_fit = intercept[i] + np.array([0,8])*slope[i]

    ax.plot(np.array([0,8]), y_fit, linewidth=2, color='b', alpha=0.2)

We can repeat this, like, 1000 times, generate a distribution of slopes and intercepts, and use the distributions to calculate confidence intervals on the slope and intercept:

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, y, alpha=0.5, label='raw data')
n=1000

# initializing outputs
slope = np.zeros(n)
intercept = np.zeros(n)
rvalue = np.zeros(n)
pvalue = np.zeros(n)
stderr = np.zeros(n)

for i in range(0, n):
    # Using pandas to grab a random sample WITH replacement
    subsample = data.sample(n=20, replace=True)
    #print(i)
    # Linear regression performed on each sample.
    slope[i], intercept[i], rvalue[i], pvalue[i], stderr[i] = stats.linregress(subsample['x'], subsample['y'])
    y_fit = intercept[i] + np.array([0,8])*slope[i]

    ax.plot(np.array([0,8]), y_fit, linewidth=2, color='b', alpha=0.2)

In [None]:
print('slope: {:.2f} (95% CI: {:.2f} - {:.2f})'.format(np.mean(slope),
                                                       np.quantile(slope, 0.025),
                                                       np.quantile(slope, 0.975)))

print('intercept: {:.2f} (95% CI: {:.2f} - {:.2f})'.format(np.mean(intercept),
                                                       np.quantile(intercept, 0.025),
                                                       np.quantile(intercept, 0.975)))


As a note, this isn't the most useful example, as `stats.linregress` already reports the standard error of the slope for you. But bootstrapping generalizes to any statistical quantity you may be interested in. One useful example is error propagation. If you want to know how the uncertainty in your raw data will impact downstream calculations, you can recompute those calculations on each bootstrap resample without having to do the calculus involved in standard error propagation.
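As a rough sketch of both points, the block below bootstraps a calibration line, compares the bootstrap spread of the slope to the standard error reported by `stats.linregress`, and then carries the uncertainty through to a downstream quantity: the concentration of an unknown sample read off the curve. The calibration data, the noise level, and the `unknown_absorbance` reading are all assumed values invented for illustration, not the dataset used above.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(1)

# Hypothetical calibration data (assumed values, not the notebook's dataset)
conc = np.linspace(0, 4, 20)
absorbance = 0.25 * conc + 0.02 + rng.normal(0, 0.02, size=conc.size)
calib = pd.DataFrame({'conc': conc, 'absorbance': absorbance})

fit = stats.linregress(calib['conc'], calib['absorbance'])

n_boot = 1000
unknown_absorbance = 0.60  # assumed reading from an unknown sample
boot_slope = np.zeros(n_boot)
boot_conc = np.zeros(n_boot)

for i in range(n_boot):
    # Resample the calibration points with replacement
    resample = calib.sample(n=len(calib), replace=True, random_state=i)
    res = stats.linregress(resample['conc'], resample['absorbance'])
    boot_slope[i] = res.slope
    # Downstream calculation: invert the calibration line for the unknown
    boot_conc[i] = (unknown_absorbance - res.intercept) / res.slope

print('slope standard error: linregress {:.4f}, bootstrap {:.4f}'.format(
    fit.stderr, boot_slope.std(ddof=1)))

point_estimate = (unknown_absorbance - fit.intercept) / fit.slope
ci_low, ci_high = np.quantile(boot_conc, [0.025, 0.975])
print('unknown concentration: {:.2f} (95% CI: {:.2f} - {:.2f})'.format(
    point_estimate, ci_low, ci_high))
```

Note that this sketch only propagates the uncertainty in the calibration line. It treats the unknown sample's absorbance reading as exact, which a real analysis would probably not want to assume.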

Similarly, bootstrapping can be used to estimate error in machine learning methods extending beyond simple linear regression, but we'll leave it at that so we can move on to other topics.

# Cross-Validation

Cross validation is similar, but, well, different. Cross validation is related to sampling, just like bootstrapping, but is used in slightly different contexts. 

Remember when we talked about the difference between training and test datasets?

<img width="304" height="175" alt="Image result for training test dataset" src="http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png">

When **training** a machine learning model, we split our dataset into a **training dataset** and a **test dataset**. The training dataset is used to train our machine learning model. The test dataset is used to verify the accuracy of our trained model. It is **very** important that the test dataset and training dataset are completely separate, otherwise you could overestimate the accuracy of your model.

We rarely have an independent test dataset. In practice, we randomly split our sample dataset into a training dataset and a test dataset. This process is called **validation**. Validation is typically used in two scenarios: **model selection** (choosing the proper level of flexibility) and **model assessment** (evaluating a model's performance). We will take a look at both use cases.

`scikit-learn` handily has a function that does this split for us. But first, let's create a dataset that we can practice on using a regression model:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
x = np.linspace(0, 10, 1000)
y = 0.2*x + 1*x**2 - 0.3*x**3 + 0.021*x**4 + 0.3*np.random.rand(1000)*x

plt.scatter(x, y, alpha=0.1)

# Polynomial features x^1 through x^17, one column per degree (17 columns)
X = pd.DataFrame({'x':x, 'x2':x**2, 'x3':x**3, 'x4':x**4, 'x5':x**5, 'x6':x**6, 'x7':x**7, 'x8':x**8,
                  'x9':x**9, 'x10':x**10, 'x11':x**11, 'x12':x**12, 'x13':x**13, 'x14':x**14, 'x15':x**15,
                  'x16':x**16, 'x17':x**17})

We'll use validation to select the degree of flexibility of our linear model. First, the split:

In [None]:
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
reg = LinearRegression().fit(X_train[['x']], y_train)
y_pred = reg.predict(X_test[['x']])

In [None]:
plt.scatter(x, y, alpha=0.1)
plt.plot(X_test[['x']], y_pred, linewidth=2)

We can measure the quality of the fit using the mean squared error, which is the average of the squared differences between the observed and predicted values:

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(y_test, y_pred)
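As a quick check on what that number means, here is a minimal sketch that computes the mean squared error by hand and compares it to `mean_squared_error`. The four observed and predicted values are made up so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values, chosen for easy arithmetic
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 5.0])

# Mean of the squared residuals: (0.25 + 0 + 1 + 1) / 4 = 0.5625
mse_by_hand = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_by_hand, mse_sklearn)
```

Because the residuals are squared, large misses are penalized much more heavily than small ones, and the result is in squared units of `y`.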

Now, let's repeat the linear regression while increasing the degree of our polynomial:

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, y, alpha=0.1)
mse = np.zeros(17)

# Fit polynomials of degree 1 through 17 using the first i columns of X
for i in range(1, 18):
    reg = LinearRegression().fit(X_train[X_train.columns[0:i].values], y_train)
    y_pred = reg.predict(X_test[X_test.columns[0:i].values])

    ax.scatter(X_test[['x']], y_pred, marker='o', s=25, label=i)
    mse[i-1] = mean_squared_error(y_test, y_pred)

# Draw the legend once, after every degree has been plotted
ax.legend()

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
# Integer degrees on the x-axis, one per entry in mse
ax.plot(np.arange(1, len(mse) + 1), mse, linewidth=4)
ax.set_ylabel('Mean Squared Error')
ax.set_xlabel('Degree of Polynomial')

From these results, we can see that the MSE reaches a local minimum at a degree of 4, which matches the fourth-degree polynomial we used to generate the data. This seems to be the best balance between bias and variance.

This validation method has two disadvantages:
    
* The test error is highly variable, depending on which observations end up in your training dataset. If we used a different random split, we'd get a different error.
* We only use a subset of the full dataset to train our model. Models trained on fewer observations tend to perform worse, so this approach tends to overestimate the test error rate of a model trained on all of the data!

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

for run in range(8):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    mse = np.zeros(17)

    for i in range(1, 18):
        reg = LinearRegression().fit(X_train[X_train.columns[0:i].values], y_train)
        y_pred = reg.predict(X_test[X_test.columns[0:i].values])

        mse[i-1] = mean_squared_error(y_test, y_pred)

    
    # One curve per random split, with integer degrees on the x-axis
    ax.plot(np.arange(1, len(mse) + 1), mse, linewidth=4, color='b', alpha=0.2)

ax.set_ylabel('Mean Squared Error')
ax.set_xlabel('Degree of Polynomial')

One common alternative is **k-fold cross-validation**. In this configuration, we split the original dataset into *k* units (folds). In one training-test configuration, *k-1* of the units are used as a training dataset, while the remaining unit is used as a test dataset. We train a model *k* times, switching out the test dataset each time, and use the average error as an estimate of the actual test error.

<img width="608" height="334" alt="Image result for k-fold cross-validation" src="https://cdn-images-1.medium.com/max/1600/1*me-aJdjnt3ivwAurYkB7PA.png">
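The splitting scheme described above can be sketched by hand with plain `numpy`, just to show the bookkeeping. The dataset size of 20 rows and the choice of 5 folds are assumptions picked to keep the printout short. This is an illustration of the idea only, not how `scikit-learn` implements it internally.

```python
import numpy as np

rng = np.random.default_rng(2)

n_obs, k = 20, 5  # assumed sizes, chosen for a short printout

# Shuffle the row indices, then cut them into k roughly equal folds
indices = rng.permutation(n_obs)
folds = np.array_split(indices, k)

for j, test_index in enumerate(folds):
    # Every fold except fold j is pooled into the training set
    train_index = np.concatenate([f for m, f in enumerate(folds) if m != j])
    print('fold {}: train on {} rows, test on rows {}'.format(
        j, train_index.size, np.sort(test_index)))
```

Every row lands in the test set exactly once and in the training set *k-1* times, which is why the averaged error uses all of the data.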

In [None]:
from sklearn.model_selection import KFold

In [None]:
kf = KFold(n_splits=10, shuffle=True)
kf.get_n_splits(X, y)

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

mse = np.zeros((17, 10))
j = 0
for train_index, test_index in kf.split(X, y):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    for i in range(1, 18):
        reg = LinearRegression().fit(X_train[X_train.columns[0:i].values], y_train)
        y_pred = reg.predict(X_test[X_test.columns[0:i].values])

        mse[i-1, j] = mean_squared_error(y_test, y_pred)

    
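    # One faint curve per fold, with integer degrees on the x-axis
    ax.plot(np.arange(1, mse.shape[0] + 1), mse[:, j], linewidth=4, color='b', alpha=0.09)
    j += 1

# Average the error across folds for each polynomial degree
avg_mse = mse.mean(axis=1)
ax.plot(np.arange(1, mse.shape[0] + 1), avg_mse, color='purple')
ax.set_ylabel('Mean Squared Error')
ax.set_xlabel('Degree of Polynomial')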

In [None]:
avg_mse
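For comparison, the same model-selection loop can be written more compactly with `cross_val_score`. This is a sketch of an alternative, not the approach used above. It swaps the hand-built polynomial columns for a `PolynomialFeatures` pipeline, adds a `StandardScaler` step to keep the high powers numerically tame, and only scans degrees 1 through 8. The variables `x_demo`, `y_demo`, and `cv_mse` are names introduced here, and the data is regenerated with the same functional form so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)

# Same functional form as the practice dataset, regenerated for this sketch
x_demo = np.linspace(0, 10, 1000)
y_demo = (0.2*x_demo + 1*x_demo**2 - 0.3*x_demo**3 + 0.021*x_demo**4
          + 0.3*rng.random(1000)*x_demo)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
degrees = np.arange(1, 9)
cv_mse = np.zeros(degrees.size)

for idx, degree in enumerate(degrees):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(),
                          LinearRegression())
    # scikit-learn maximizes scores, so it reports the NEGATIVE mean squared error
    scores = cross_val_score(model, x_demo.reshape(-1, 1), y_demo,
                             cv=cv, scoring='neg_mean_squared_error')
    cv_mse[idx] = -scores.mean()

# Degree with the lowest cross-validated error
best_degree = degrees[np.argmin(cv_mse)]
print('cross-validated MSE by degree:', np.round(cv_mse, 3))
print('degree with lowest CV error:', best_degree)
```

The error should drop sharply up to degree 4 and then flatten out, so the exact minimum can wander among the higher degrees from run to run. Looking at the whole curve is usually more informative than the `argmin` alone.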