Chapter 17
# Estimation with Bootstrap

The bootstrap method is a resampling technique for estimating statistics of a population by sampling a dataset with replacement. It can be used to estimate summary statistics such as the mean or standard deviation.

It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.

# Bootstrap Method
This is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Samples are constructed by drawing observations from a large data sample one at a time, and returning them to the data sample after they have been chosen (sampling with replacement).  This allows a given observation to be included in a given small sample more than once.

The bootstrap method can be used to estimate a quantity of a population by repeatedly taking small samples, calculating the statistic on each, and taking the average of the calculated statistics.

1. choose a number of bootstrap samples to perform
2. choose a sample size
3. for each bootstrap sample:
   - draw a sample of the specified size with replacement
   - calculate the statistic on the sample
4. calculate the mean of the calculated sample statistics
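
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `bootstrap_estimate`, its parameters, and the choice of the mean as the statistic are assumptions made for the example.

```python
import numpy as np

def bootstrap_estimate(data, statistic=np.mean, n_repeats=1000, sample_size=None, seed=1):
    """Return (mean of the bootstrap statistics, array of the individual statistics)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # steps 1 and 2: the number of samples and the sample size are chosen by the caller
    size = len(data) if sample_size is None else sample_size
    estimates = []
    for _ in range(n_repeats):
        # step 3a: draw a sample of the specified size with replacement
        sample = rng.choice(data, size=size, replace=True)
        # step 3b: calculate the statistic on the sample
        estimates.append(statistic(sample))
    estimates = np.array(estimates)
    # step 4: calculate the mean of the calculated sample statistics
    return float(estimates.mean()), estimates

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
estimate, estimates = bootstrap_estimate(data, n_repeats=1000, sample_size=4)
print('Bootstrap estimate of the mean: %.3f' % estimate)
```

Passing a different callable as `statistic` (for example `np.std` or `np.median`) reuses the same loop for another summary statistic.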

The procedure can also be used to estimate the skill of a machine learning model. This is done by training the model on the sample and evaluating its skill on the observations not included in the sample (known as the out-of-bag, or OOB, sample).

1. choose a number of bootstrap samples to perform
2. choose a sample size
3. for each bootstrap sample:
   - draw a sample of the specified size with replacement
   - fit a model on the data sample
   - estimate the skill of the model on the out-of-bag sample
4. calculate the mean of the sample of model skill estimates

Importantly, any data preparation prior to fitting the model must occur within the for-loop on the data sample.  This is to avoid data leakage, where knowledge of the test dataset is used to improve the model.
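
A sketch of this loop, with the data preparation kept inside it, might look like the following. The synthetic dataset from `make_classification`, the `StandardScaler` preparation step, the `LogisticRegression` model and the accuracy metric are all illustrative assumptions and are not part of the notes above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
n = len(X)
rng = np.random.default_rng(1)

scores = []
for _ in range(30):
    # draw row indices with replacement; rows never drawn are out-of-bag
    train_idx = rng.integers(0, n, size=n)
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[train_idx] = False
    if not oob_mask.any():
        continue  # no OOB rows to evaluate on (very unlikely with 200 rows)

    # data preparation is fit on the bootstrap sample only, inside the loop
    scaler = StandardScaler().fit(X[train_idx])
    model = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])

    # estimate skill on the out-of-bag rows, reusing the scaler fit on the sample
    y_pred = model.predict(scaler.transform(X[oob_mask]))
    scores.append(accuracy_score(y[oob_mask], y_pred))

print('Mean OOB accuracy: %.3f (%d repeats)' % (np.mean(scores), len(scores)))
```

Fitting the scaler on the full dataset before the loop would let statistics from the out-of-bag rows influence the model, which is the leakage described above.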

A useful feature of the bootstrap method is that the resulting sample of estimates often forms an approximately Gaussian distribution.

# Configuration of the Bootstrap
There are two parameters that must be chosen when performing the bootstrap:
- Sample Size - in machine learning, it is common to use a sample size that is the same as the original dataset (unless computational efficiency is an issue)
- Repetitions - must be large enough that meaningful statistics can be calculated on the sample of estimates. A minimum might be 20 or 30 repetitions; fewer repetitions add variance to the estimated statistic. Ideally, the number of repetitions is as large as the available time allows (hundreds or thousands)
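
The effect of the number of repetitions can be seen by running the whole procedure several times and measuring how much the final estimate moves between runs. The sketch below assumes a synthetic normally distributed dataset and a hypothetical helper `bootstrap_mean`; the specific values (200 observations, 30 runs, 10 versus 1000 repetitions) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=200)  # synthetic dataset

def bootstrap_mean(data, n_repeats, rng):
    # each row is one bootstrap sample the same size as the dataset
    samples = rng.choice(data, size=(n_repeats, len(data)), replace=True)
    # mean of each sample, then the mean of those sample means
    return samples.mean(axis=1).mean()

spread = {}
for n_repeats in (10, 1000):
    # repeat the whole procedure 30 times to see how much the estimate varies
    runs = [bootstrap_mean(data, n_repeats, rng) for _ in range(30)]
    spread[n_repeats] = np.std(runs)
    print('repetitions=%4d  std of estimate across runs: %.4f' % (n_repeats, spread[n_repeats]))
```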

# Worked Example
Imagine we have a dataset with 6 observations.
- data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

We will choose the size of the sample (4), and randomly choose the first observation from the dataset.
- sample = [0.2]

Return the observation to the dataset, and repeat this step 3 more times.  We now have our data sample.
- sample = [0.2, 0.1, 0.2, 0.6]

An estimate can then be calculated on the drawn sample.
- statistic = calculation([0.2, 0.1, 0.2, 0.6])

Observations not chosen for the sample may be used as out-of-bag (OOB) observations.
- oob = [0.3, 0.4, 0.5]

To evaluate a machine learning model, the model is fit on the sample, and evaluated on the OOB sample.
- train = [0.2, 0.1, 0.2, 0.6]
- test = [0.3, 0.4, 0.5]
- model = fit(train)
- statistic = evaluation(model, test)

The above is repeated 30 or more times to give a sample of calculated statistics. That sample can then be summarised, for example by its mean and standard deviation.
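
As a rough sketch of that summary step, the snippet below repeats the worked example 1000 times with the sample mean as the statistic, then reports the mean and standard deviation of the resulting estimates. The percentile range at the end is an extra summary added here for illustration, and the repetition count and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# 1000 bootstrap samples of size 4 (one per row); statistic = sample mean
stats = rng.choice(data, size=(1000, 4), replace=True).mean(axis=1)

print('mean: %.3f' % stats.mean())
print('std:  %.3f' % stats.std())

# range covering the middle 95% of the bootstrap estimates
lower, upper = np.percentile(stats, [2.5, 97.5])
print('2.5th to 97.5th percentile: %.3f to %.3f' % (lower, upper))
```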

The bootstrap method does not have to be implemented manually.  The scikit-learn function resample() can be used.  It takes the arguments:
- data array
- whether or not to sample with replacement
- size of sample
- seed for pseudorandom number generator used prior to sampling

Unfortunately, the API does not include any mechanism to easily gather the OOB observations.

In [2]:
# scikit-learn bootstrap
from sklearn.utils import resample

# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)

# out of bag observations (filtered by value, so this assumes the data values are unique)
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)

Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]


# Extensions

In [22]:
# implement your own function to create a sample and OOB sample with the bootstrap method
import numpy as np
from sklearn.utils import resample

# data sample
data = np.random.randint(0, 100, 100)
print('Data:', data)

# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=100, random_state=1)
print('Bootstrap Sample: %s' % boot)

# out of bag observations (filtered by value; repeated values in data can make this list incomplete)
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)

Data: [23 25 95 16 30 81 17 54 53 54 86 66 47 83 27 97 21 14 24 81 41 92  3 28
 31 79 69 23 52 72 59 72 38 88  7 97 44 88 20  4 57 59  4 49 65 16 84  1
 86 84  6 54 32 19 51 77 72 27  8 51 75 21 44 12 35 59 17  3 16 39 80 38
  2 91 76 73 31 10 73 73 47 91 15 73  8 18 76 97 70 16  0 71 61 69 31 13
 76 18 71 42]
Bootstrap Sample: [88 47  2 54 73 81 73 35 21 25 31 38 17 79  6 41 24  8 66 52 72 27  6 16
 97 97 31 76 76 83 54 54 12 21  3 27 25 23 75 91 53 70 83  1  2 59 38 16
 80 92 84 27 16 16 31 49 31 69 32 47 59 15 97 35 16 79 71 97 54 69 79  3
 54  3 28 23 88 27 73 20 53 38  7 86 28 97 97 79 38 61 76 44 84 38 70 28
 77 59 10 16]
OOB Sample: [95, 30, 14, 4, 57, 4, 65, 19, 51, 51, 39, 18, 0, 13, 18, 42]
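
The cell above still relies on `resample()` and on the value-based filter `x not in boot`. When the data contains repeated values, an observation that was never drawn is dropped from the OOB list if another observation with the same value was drawn. One possible way to write the function this exercise asks for is to sample row indices instead of values. The name `bootstrap_split` and its parameters are hypothetical, and this is only a sketch.

```python
import numpy as np

def bootstrap_split(data, n_samples=None, seed=None):
    """Return (sample, oob), where oob holds the rows whose index was never drawn."""
    data = np.asarray(data)
    n = len(data)
    rng = np.random.default_rng(seed)
    # draw indices with replacement rather than values
    idx = rng.integers(0, n, size=n if n_samples is None else n_samples)
    # start with every row marked out-of-bag, then clear the rows that were drawn
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False
    return data[idx], data[oob_mask]

data = np.array([5, 5, 7, 9, 9, 9])  # contains repeated values
sample, oob = bootstrap_split(data, seed=1)
print('Bootstrap Sample: %s' % sample)
print('OOB Sample: %s' % oob)
```

Because membership is tracked by index, a repeated value can appear in both the bootstrap sample and the OOB sample, which is the correct behaviour when the two entries are distinct observations.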
