# Bootstrapping

Bootstrapping is a simple approach to evaluating the uncertainty of parameter estimates, given a dataset of points $\mathcal{D} = \lbrace x_i \rbrace_{i=1, \ldots, N}$.

Algorithm:
1. Choose two integers, $M \leq N$ and $L > 1$. The classic bootstrap takes $M = N$; with $M < N$, the spread of the estimates reflects a sample of size $M$ rather than $N$.
2. Draw $M$ samples with replacement from the original dataset $\mathcal{D}$ to create a resampled dataset $\mathcal{D}_M$.
3. Estimate the chosen parameter over $\mathcal{D}_M$.
4. Repeat steps 2 and 3 $L$ times to get $L$ values for the estimate, then look at their distribution.
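
The steps above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than the implementation used later in this notebook: the helper name `bootstrap_estimates`, the toy dataset, and the values of `M` and `L` are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_estimates(data, estimator, M, L, seed=0):
    """Return L estimates, each computed on M points drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(L):
        # Step 2: resample M points with replacement.
        resample = rng.choices(data, k=M)
        # Step 3: evaluate the estimator on the resampled dataset.
        estimates.append(estimator(resample))
    # Step 4: the caller inspects the distribution of these L values.
    return estimates


# Toy dataset: 500 Gaussian draws (parameters chosen for illustration only).
data_rng = random.Random(42)
data = [data_rng.gauss(5.4, 1.2) for _ in range(500)]

estimates = bootstrap_estimates(data, statistics.fmean, M=400, L=200)
print(f"mean of the estimates:   {statistics.fmean(estimates):.3f}")
print(f"spread of the estimates: {statistics.stdev(estimates):.3f}")
```

Passing the estimator as a function means the same loop works for a median, a variance, or any other statistic that maps a list of points to a number.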

Let's see this in practice: we'll sample $N=10000$ values from a Gaussian distribution with a chosen mean and variance (which we'll then forget), then estimate the mean and the uncertainty of that estimate.

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import seaborn as sns

tfd = tfp.distributions

sns.set_theme()

## Generate data

In [None]:
N = 10000

# Choose parameters for the distribution.
μ = 5.4
σ = 1.2

# Instantiate the probability distribution object.
gaussian = tfd.Normal(loc=μ, scale=σ)

# Generate samples.
samples = gaussian.sample(sample_shape=N)

Plot the data.

In [None]:
fig = plt.figure(figsize=(14, 6))

sns.histplot(
    x=samples,
    kde=True,
    stat='density',
    color=sns.color_palette()[0]
)

plt.axvline(
    x=μ,
    color=sns.color_palette()[1],
    label='True mean',
    alpha=1.
)

plt.axvline(
    x=samples.numpy().mean(),
    color=sns.color_palette()[2],
    label='Estimated mean (whole dataset)',
    alpha=1.
)

plt.xlabel('Value', fontsize=15)
plt.xticks(fontsize=12)
plt.ylabel('Density', fontsize=15)
plt.yticks(fontsize=12)

plt.legend(loc='upper right', fontsize=15)

plt.title('Sampled values', fontsize=18)

## Bootstrap estimation

In [None]:
M = 7000
L = 1000

subdatasets = tf.stack([
    tf.gather(
        samples,
        tf.random.uniform(shape=[M], minval=0, maxval=samples.shape[0], dtype=tf.dtypes.int32)
    )
    for _ in range(L)
])
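
The resampling above builds one sub-dataset per loop iteration. As a hedged alternative, the same index-based resampling can be done in a single vectorised draw with NumPy. The sketch below assumes NumPy's `default_rng` instead of TensorFlow, and uses a small toy array with illustrative sizes rather than the notebook's `samples`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-in for the original dataset (values are illustrative only).
data = rng.normal(loc=5.4, scale=1.2, size=1_000)

M, L = 700, 200

# One (L, M) array of indices, each drawn uniformly from 0..N-1.
# Indices may repeat, which is what "with replacement" means here.
idx = rng.integers(low=0, high=data.shape[0], size=(L, M))

# Fancy indexing turns the index array into L resampled datasets of M points.
resampled = data[idx]

print(resampled.shape)
```

The trade-off is memory: the full index array holds $L \times M$ integers at once, which is fine at this scale but may call for batching with much larger values.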

Plot some sub-datasets and their respective estimated means.

In [None]:
n_plots = 3

indices = np.random.choice(range(subdatasets.shape[0]), size=n_plots, replace=False)

fig = plt.figure(figsize=(14, 6))

for i in range(n_plots):
    sns.histplot(
        x=subdatasets[indices[i], :],
        kde=True,
        stat='density',
        color=sns.color_palette()[i],
        alpha=0.3
    )
    
    plt.axvline(
        x=subdatasets[indices[i], :].numpy().mean(),
        color=sns.color_palette()[i],
        alpha=1.,
        label=f'Estimated mean for dataset {indices[i]}'
    )
    
plt.xlabel('Value', fontsize=15)
plt.xticks(fontsize=12)
plt.ylabel('Density', fontsize=15)
plt.yticks(fontsize=12)

plt.legend(loc='upper right', fontsize=14)

Compute the estimated mean for each sub-dataset and plot the distribution of the estimates.

In [None]:
mean_estimates = tf.reduce_mean(subdatasets, axis=1)

fig = plt.figure(figsize=(14, 6))

sns.histplot(
    x=mean_estimates,
    kde=True,
    stat='density'
)

plt.xlabel('Value', fontsize=15)
plt.xticks(fontsize=12)
plt.ylabel('Density', fontsize=15)
plt.yticks(fontsize=12)

plt.title('Mean estimates over the sub-datasets', fontsize=18)

The point estimate can be computed over the whole dataset, while quantiles of the distribution of the estimates over the sub-datasets can be used to gauge the uncertainty on that point estimate.

In [None]:
# Compute the 5th and 95th percentiles (0.05 and 0.95 quantiles).
quantile_05, quantile_95 = np.quantile(mean_estimates.numpy(), [0.05, 0.95])

# Compute the standard deviation of the estimated means.
σ_μ_bootstrap = mean_estimates.numpy().std()

# The point estimate uses the whole dataset.
point_estimate = samples.numpy().mean()

print(
    f'Estimated μ: {point_estimate} (true: {μ})\n'
    f'5%-95% interval of the bootstrap means: [{quantile_05}, {quantile_95}]\n'
    f'Estimated σ_μ: {σ_μ_bootstrap} (true: {σ / np.sqrt(N)})'
)
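
One caveat about the comparison above: each sub-dataset holds $M < N$ points, so for the sample mean the bootstrap spread tracks $\sigma / \sqrt{M}$ rather than $\sigma / \sqrt{N}$. The sketch below illustrates this and the $\sqrt{M/N}$ rescaling that brings the two in line. It assumes NumPy's `default_rng` in place of the TensorFlow sampling used in this notebook and re-creates a comparable toy dataset; the names `boot_means`, `spread_M`, and `spread_N` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N, M, L = 10_000, 7_000, 1_000
sigma = 1.2
data = rng.normal(loc=5.4, scale=sigma, size=N)

# Bootstrap means from L resamples of M points each.
# Resamples are drawn one at a time to keep memory use small.
boot_means = np.array([
    data[rng.integers(0, N, size=M)].mean() for _ in range(L)
])

spread_M = boot_means.std(ddof=1)      # reflects samples of size M
spread_N = spread_M * np.sqrt(M / N)   # rescaled to sample size N

print(f"raw bootstrap spread: {spread_M:.4f}")
print(f"rescaled to size N:   {spread_N:.4f}")
print(f"sigma / sqrt(M):      {sigma / np.sqrt(M):.4f}")
print(f"sigma / sqrt(N):      {sigma / np.sqrt(N):.4f}")
```

This rescaling is specific to estimators whose spread shrinks like $1/\sqrt{n}$, such as the mean; choosing $M = N$ avoids the correction altogether.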

## Sanity check

Since we have a distribution from which we can generate multiple datasets, let's numerically estimate the value of $\sigma_\mu$ (which should agree with the classic value $\sigma / \sqrt{N}$).
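
For reference, the classic value follows in one line, assuming the $N$ draws $x_i$ are independent and share the same variance $\sigma^2$, and writing $\hat{\mu}$ (a symbol introduced here) for the sample mean:

$$
\mathrm{Var}(\hat{\mu})
= \mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)
= \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}(x_i)
= \frac{\sigma^2}{N}
\quad\Longrightarrow\quad
\sigma_\mu = \frac{\sigma}{\sqrt{N}}.
$$

Independence is what lets the variance of the sum split into a sum of variances in the second step.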

In [None]:
generated_means = tf.reduce_mean(gaussian.sample(sample_shape=[L, N]), axis=1)

σ_μ_numerical = generated_means.numpy().std()

print(
    f'Bootstrap-estimated σ_μ: {σ_μ_bootstrap}\n'
    f'Numerically estimated σ_μ: {σ_μ_numerical}'
)