# Machine Learning 2023-2024 - UMONS

# The Bootstrap.

In this lab, we'll experiment with **the bootstrap**, a simple but powerful resampling method that allows us to quantify the uncertainty associated with any statistic estimated from a sample of a population. Additionally, the bootstrap lets us estimate the validation error of any learning model, similarly to cross-validation.

The steps of the (non-parametric) bootstrap procedure can be summarized as:
- Start with a dataset $\mathcal{D} = \{z_i\}_{i=1}^n$.
- For $b=1,...,B$, do:
  - Sample a dataset $\mathcal{D}^{(b)}=\{z_i^{\ast~(b)}\}_{i=1}^n$ with replacement from the original dataset $\mathcal{D}$.
  - Compute the statistic of interest on the resampled dataset, $\hat{\theta}^{\ast~(b)} = s(\mathcal{D}^{(b)})$ (e.g. the mean, $\hat{\theta}^{\ast~(b)} = \frac{1}{n}\sum_{i=1}^n z_i^{\ast~(b)}$).
- Approximate the sampling distribution of the statistic $s$ by the empirical distribution of $\{\hat{\theta}^{\ast~(1)},...,\hat{\theta}^{\ast~(B)}\}$.
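
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the helper name `bootstrap_statistic` and the toy array are hypothetical and not part of the lab, and the sketch draws indices with a NumPy generator rather than with `sklearn.utils.resample`.

```python
import numpy as np

def bootstrap_statistic(data, statistic, n_boot=1000, seed=0):
    """Return n_boot bootstrap replicates of `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n indices with replacement to form the bootstrap dataset D^(b)
        idx = rng.integers(0, n, size=n)
        # Evaluate the statistic s on the resampled dataset
        replicates[b] = statistic(data[idx])
    return replicates

# Toy data (an assumption for illustration, not the lab's dataset)
toy = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
boot_means = bootstrap_statistic(toy, np.mean, n_boot=500)
```

The array `boot_means` plays the role of $\{\hat{\theta}^{\ast~(1)},...,\hat{\theta}^{\ast~(B)}\}$; its spread gives a rough picture of the uncertainty of the sample mean.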

The sampling distribution of the statistic $s$ can then be used to quantify the uncertainty associated with $\hat{\theta} = s(\mathcal{D})$, the statistic's estimate from the original dataset $\mathcal{D}$. To build some intuition, we will start by applying the bootstrap algorithm to simulated data whose mean and variance are known.

**We import the necessary libraries**

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import norm
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate

np.random.seed(0)

**1) Generate a dataset $\mathcal{D}$ with $n=100$ observations from a Normal distribution with $\mu =6$ and $\sigma=2$. Check the `np.random.normal` function.**

In [2]:
true_mean = 6
true_std = 2
# TODO: z = ...

**Let's take a look at the CDFs of the population distribution and the empirical distribution.**

In [3]:
fig, ax = plt.subplots()
z_plot = np.linspace(0, 12, 10000)
ax.plot(z_plot, norm.cdf(z_plot, true_mean, true_std), label='True CDF')
ax.plot(z_plot, np.mean(z <= z_plot[:, np.newaxis], axis=1), label='Empirical CDF')
ax.legend();

**2) Compute the sample mean $\bar{z}$ and sample standard deviation $\hat{\sigma}$ of the dataset $\mathcal{D}$.**

**3) Implement the bootstrap algorithm defined above for $B=1000$. Our goal here is to quantify the uncertainty associated with the sample mean $\bar{z} = \frac{1}{n} \sum_{i=1}^n z_i$ and the sample standard deviation $\hat{\sigma} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (z_i - \bar{z})^2}$ using a 90% confidence interval.**

**First, resample $B$ datasets using `sklearn.utils.resample` and collect the sample mean and standard deviation for each dataset.**

**4) What is the average number of unique points per bootstrap sample (i.e., the number of points that have been sampled at least once)? Compute it by modifying the code above. You can use the function `np.unique`.**

**Can you compute the expected number of unique points per bootstrap sample using the theory of the course?**
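
One standard argument, which may be presented differently in the course notes, goes as follows. In a single draw, a given point $z_i$ is missed with probability $1 - \frac{1}{n}$, so it is missed in all $n$ independent draws with probability $\left(1 - \frac{1}{n}\right)^n$. By linearity of expectation, the expected number of unique points $U$ in a bootstrap sample is:

$$\mathbb{E}[U] = \sum_{i=1}^n \left(1 - \left(1 - \frac{1}{n}\right)^n\right) = n\left(1 - \left(1 - \frac{1}{n}\right)^n\right) \approx n\left(1 - e^{-1}\right) \approx 0.632\,n$$

For $n=100$, the exact expression gives $100\,(1 - 0.99^{100}) \approx 63.4$ unique points.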

**5) Let $\hat{z}^{\ast~(b)}$, $b=1,...,B$, denote the sample mean obtained on the $b$-th bootstrapped dataset. For a confidence level $1 - \alpha$, a bootstrap empirical confidence interval for $\bar{z}$ can be obtained as:**

$$\text{CI} = [q_{\alpha/2}(\hat{z}^{\ast~(b)}), q_{1-\alpha/2}(\hat{z}^{\ast~(b)})]$$

**where $q_{\alpha}(\hat{z}^{\ast~(b)})$ is the $\alpha$-quantile of the sampling distribution of $\hat{z}^{\ast~(b)}$ (e.g. the median of $\hat{z}^{\ast~(b)}$ is $q_{0.5}(\hat{z}^{\ast~(b)})$).**

**Check the method `np.quantile` to compute quantiles.**
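
As a minimal sketch of the interval defined above, `np.quantile` accepts a list of probabilities and returns both bounds at once. The replicates below are placeholder values drawn from a Normal distribution (an assumption for illustration); in the lab they would be the bootstrapped sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder replicates (assumed): stand-ins for the bootstrapped sample means
boot_means = rng.normal(loc=6.0, scale=0.2, size=1000)

alpha = 0.10  # 1 - alpha = 90% confidence level
# Lower and upper bounds are the alpha/2 and 1 - alpha/2 quantiles
ci_lower, ci_upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
print(f'90% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
```

By construction, about 90% of the replicates fall between `ci_lower` and `ci_upper`.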

**Then, plot the sampling distribution of the mean (e.g., using `sns.histplot`) and add the true value, the sample estimate and the 90% confidence interval.**

**Repeat the same steps for the sample standard deviation $\hat{\sigma}$.**


**We will now use bootstrapping to estimate the uncertainty associated with the mean of a sample drawn from an unknown distribution. To this end, we will reuse the 'Fish Market' dataset, and apply the bootstrap to the variable 'Height'.**

In [8]:
df = pd.read_csv('data/fish_lab.csv', index_col=0)
df = df.astype({'Species': 'category'})
df = df.sample(frac=1)
df

**6) Using the same procedure as before, compute a $90\%$ confidence interval for the mean of the variable 'Height'.**

**7) Similarly to cross-validation, the bootstrap can be used to estimate the validation error of any learning model. To this end, perform the following steps:**
- **Using the provided preprocessor, create a pipeline to apply the same preprocessing steps as in the previous lab followed by a linear regression model.** 
- **For $b=1,...,1000$:**
    - **Sample a training dataset $\mathcal{D}_{\text{train}}^{(b)}$ with replacement from the original dataset $\mathcal{D}$.**
    - **Define a test dataset $\mathcal{D}_{\text{test}}^{(b)}$ containing the observations from $\mathcal{D}$ that are not in $\mathcal{D}_{\text{train}}^{(b)}$.**
    - **For $\mathcal{D}_{\text{train}}^{(b)}$ and $\mathcal{D}_{\text{test}}^{(b)}$, select 'Height' as the target variable, and the remaining variables as predictors.**
    - **Fit the pipeline on $\mathcal{D}_{\text{train}}^{(b)}$.**
    - **Predict on $\mathcal{D}_{\text{test}}^{(b)}$.**
    - **Compute the $\text{MSE}^{(b)}$ on $\mathcal{D}_{\text{test}}^{(b)}$.**
    - **Put $\text{MSE}^{(b)}$ in a list.**

**With this procedure, we obtain the sampling distribution of the MSE.**
- **Plot this sampling distribution using a histogram.**
- **Add a point estimate $\text{MSE} = \frac{1}{B}\sum_{b=1}^B \text{MSE}^{(b)}$ on the plot.**
- **Add a 90% upper bound for the MSE on the plot, which corresponds to an interval from $-\infty$ to the 0.9-quantile of the sampling distribution.**
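
The resampling loop described above can be sketched on synthetic data. This is an illustration under assumptions, not the lab's solution: the arrays `X_toy` and `y_toy` are made up, a plain `LinearRegression` replaces the preprocessing pipeline, and only 200 iterations are run to keep it fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Synthetic regression data (an assumption, standing in for the lab's dataset)
n = 80
X_toy = rng.normal(size=(n, 2))
y_toy = 1.5 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.3, size=n)

indices = np.arange(n)
mse_list = []
for b in range(200):
    # Training indices are drawn with replacement from the original indices
    train_idx = resample(indices, replace=True, n_samples=n, random_state=b)
    # Test indices are the observations that were never drawn
    test_idx = np.setdiff1d(indices, train_idx)
    if test_idx.size == 0:
        continue  # extremely unlikely, but avoids scoring on an empty test set
    model_b = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    y_pred = model_b.predict(X_toy[test_idx])
    mse_list.append(mean_squared_error(y_toy[test_idx], y_pred))

mse_point = np.mean(mse_list)             # point estimate of the MSE
mse_upper_90 = np.quantile(mse_list, 0.9)  # 90% upper bound
```

Resampling indices rather than rows makes it straightforward to identify the left-out observations with `np.setdiff1d`; with a pandas DataFrame, the same indices could be used with `.iloc`.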

In [10]:
X, y = df.drop(columns='Height'), df[['Height']]

cont_columns = X.select_dtypes(include=['float64']).columns
cat_columns = X.select_dtypes(include=['category']).columns
# Transformers for imputation
cont_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')
cat_pipeline = make_pipeline(cat_imputer, OneHotEncoder(sparse_output=False, handle_unknown='ignore'))

# ColumnTransformer to apply transformations to the correct features
preprocessor = ColumnTransformer(transformers=[
    ('cont', cont_imputer, cont_columns),
    ('cat', cat_pipeline, cat_columns)
])

**We can verify that we obtain a relatively similar test MSE with cross-validation.**

In [12]:
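# `model` is assumed to be the pipeline (preprocessor + linear regression) built in step 7.
# Note: `return_indices` is only available in recent scikit-learn versions.
cv_results = cross_validate(
    model, X, y, cv=3,
    scoring=['neg_mean_squared_error'],
    return_estimator=True, return_indices=True,
)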
test_mse_per_fold = -cv_results['test_neg_mean_squared_error']
print(f'Test MSE: {test_mse_per_fold.mean():.2f}')