[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/drbob-richardson/stat220/blob/main/Lecture_Code/10_4_Bootstrapping.ipynb)

In [45]:
import pandas as pd

# Load the dataset
exp = pd.read_csv("https://richardson.byu.edu/220/student_expenses.csv")

# Display the first few rows of the dataset
exp.head()

   Gender  Age  Study_year Scholarship Transporting  expenses
0  Female   21           2          No           No       150
1    Male   25           3          No   Motorcycle       220
2    Male   23           2         Yes           No       180
3    Male   19           3          No   Motorcycle       200
4  Female   19           2          No   Motorcycle       300


Let's look at what resample does. By default it draws, with replacement, a sample of the same size as its input.

In [74]:
import numpy as np
from sklearn.utils import resample

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(np.mean(resample(X)))


5.7


In [78]:
y = ['a','b','c','d','e','f','g','h','i','j']
print(resample(X,y))

[[5, 8, 5, 10, 3, 10, 4, 1, 8, 6], ['e', 'h', 'e', 'j', 'c', 'j', 'd', 'a', 'h', 'f']]
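To make the behavior above easier to inspect, here is a small sketch using the same toy lists. The `random_state=0` argument and the names `X_boot`, `y_boot`, `pairs_ok`, and `X_perm` are additions for illustration only; they are not part of the original cells.

```python
from sklearn.utils import resample

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# random_state is set here only so the draw is reproducible (an assumption
# made for illustration; the original cells leave it unset)
X_boot, y_boot = resample(X, y, random_state=0)

# The bootstrap sample has the same length as the original data
print(len(X_boot))

# X and y are resampled with the same indices, so the pairs stay aligned
pairs_ok = all(y[X.index(xv)] == yv for xv, yv in zip(X_boot, y_boot))
print(pairs_ok)

# Sampling with replacement usually repeats some values and omits others
print(len(set(X_boot)), "distinct values out of", len(X_boot))

# With replace=False and the default size, resample only shuffles the data
X_perm = resample(X, replace=False, random_state=0)
print(sorted(X_perm) == X)
```

Because both lists are indexed by the same random draw, each number keeps its matching letter, which is what lets us resample features and target together later on.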


We will resample our data a large number of times. The logic is that we treat the data we have as an empirical distribution of possible observations (we've seen empirical distributions before when discussing Bayes model output). By resampling with replacement, we are essentially taking many draws from that distribution of possible observations.
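One way to see the "empirical distribution" idea is to check that sampling with replacement picks each observed value with probability 1/n. The sketch below uses only NumPy and the five expense values shown in the first rows of the dataset; the seed and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility

# The five expense values from the first rows displayed above
observed = np.array([150, 220, 180, 200, 300])

# Draw many values with replacement from the observed data
draws = rng.choice(observed, size=100_000, replace=True)

# Each observed value should appear with frequency close to 1/5
values, counts = np.unique(draws, return_counts=True)
freqs = counts / draws.size
for v, f in zip(values, freqs):
    print(v, round(f, 3))
```

Every observed value carries equal weight, and values that were never observed can never be drawn.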

Finding a bootstrapped confidence interval
1. Resample the data with replacement.
2. Fit a model to the resampled data.
3. Make a prediction with that model and store it.
4. Repeat steps 1 to 3 a large number of times (for example, 1,000 or 100,000).
5. Take the 2.5th and 97.5th percentiles of the stored predictions as the bounds of a 95% interval.
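The steps above can be sketched in a few lines with the simplest possible "model", the sample mean. This is a minimal illustration, not the code used for the expenses data: the synthetic `data`, the seed, and the names `n_boot` and `estimates` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility

# Hypothetical data standing in for a real sample
data = rng.normal(loc=200, scale=50, size=100)

n_boot = 1000
estimates = np.empty(n_boot)

for b in range(n_boot):
    # Step 1: resample the data with replacement
    sample = rng.choice(data, size=data.size, replace=True)
    # Steps 2 and 3: "fit" the model (here, the mean) and store its prediction
    estimates[b] = sample.mean()

# Step 5: percentile bounds of the stored predictions
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(f"95% bootstrap interval for the mean: ({lower:.1f}, {upper:.1f})")
```

Swapping the mean for a fitted regressor and its prediction gives the same loop structure used in the cells that follow.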

This code does it for the full dataset, meaning it finds a bootstrapped confidence interval for every in-sample observation.

In [84]:
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Assuming 'expenses' is the target variable and the rest are features
X = exp.drop('expenses', axis=1)
y = exp['expenses']

# Convert categorical variables to dummy variables
X = pd.get_dummies(X)

# Initialize the regressor
regressor = DecisionTreeRegressor(max_depth = 3)

# Number of bootstrap samples
n_bootstrap_samples = 1000

# Store predictions for each sample
bootstrap_predictions = []

for _ in range(n_bootstrap_samples):
    # Bootstrap sample
    X_sample, y_sample = resample(X, y)

    # Fit the model
    regressor.fit(X_sample, y_sample)

    # Predict on the original data
    predictions = regressor.predict(X)
    bootstrap_predictions.append(predictions)

# Convert to numpy array for calculations
bootstrap_predictions = np.array(bootstrap_predictions)

# Calculate percentile bounds of the bootstrapped predictions (2.5th and 97.5th percentiles for a 95% interval)
lower_bound = np.percentile(bootstrap_predictions, 2.5, axis=0)
upper_bound = np.percentile(bootstrap_predictions, 97.5, axis=0)

# Display prediction intervals for the first few observations
for i in range(5):
    print(f"Observation {i}: Lower Bound = {lower_bound[i]}, Upper Bound = {upper_bound[i]}")


Observation 0: Lower Bound = 161.81439393939394, Upper Bound = 223.26013289036544
Observation 1: Lower Bound = 158.55158730158732, Upper Bound = 257.1517857142857
Observation 2: Lower Bound = 157.5, Upper Bound = 240.0
Observation 3: Lower Bound = 183.90981198589895, Upper Bound = 251.25721153846152
Observation 4: Lower Bound = 177.60504201680672, Upper Bound = 262.00357142857143


This code makes predictions for a single new observation. Note how we again use reindex so that the new observation's dummy variables have the same columns as the training data, with any level that does not appear in the new row filled with 0.
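To see why the reindex step matters, here is a small sketch on a toy frame. The three training rows and the `'Car'` level are hypothetical and exist only to show the column mismatch; they are not taken from the expenses data.

```python
import pandas as pd

# Hypothetical training data with three transport levels
train = pd.DataFrame({
    'Age': [21, 25, 23],
    'Transporting': ['No', 'Motorcycle', 'Car'],
})
X_train = pd.get_dummies(train)

# A single new row only contains one of those levels
new = pd.DataFrame([{'Age': 20, 'Transporting': 'No'}])
new_raw = pd.get_dummies(new)
print(list(new_raw.columns))      # dummy column only for the level present

# Align to the training columns; levels missing from the new row become 0
new_aligned = new_raw.reindex(columns=X_train.columns, fill_value=0)
print(list(new_aligned.columns))  # same columns, same order as X_train
```

Without the alignment, the new row would have fewer columns than the model was trained on and `predict` would not accept it. Note that reindex also drops any column of the new row that is not in the training columns, so a misspelled level silently becomes all zeros.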

In [79]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
exp = pd.read_csv("https://richardson.byu.edu/220/student_expenses.csv")

# Preparing the data
X = exp.drop('expenses', axis=1)
y = exp['expenses']

# Convert categorical variables to dummy variables
X = pd.get_dummies(X)

# Initialize list to store bootstrapped models
bootstrap_models = []

# Number of bootstrap samples
n_bootstrap_samples = 1000

for _ in range(n_bootstrap_samples):
    # Bootstrap sample
    X_sample, y_sample = resample(X, y)

    # Fit the model
    regressor = DecisionTreeRegressor(max_depth = 3)
    regressor.fit(X_sample, y_sample)

    # Store the model
    bootstrap_models.append(regressor)

# Define a new observation (hypothetical student data)
new_student = {'Gender': 'Female', 'Age': 20, 'Study_year': 2, 'Scholarship': 'Yes', 'Transporting': 'No'}

# Convert new observation to DataFrame and create dummy variables
new_student_df = pd.DataFrame([new_student])
new_student_df = pd.get_dummies(new_student_df).reindex(columns=X.columns, fill_value=0)

# Predict expenses for the new observation using each bootstrapped model
new_student_predictions = [model.predict(new_student_df)[0] for model in bootstrap_models]

# Calculate prediction intervals
lower_bound_new_student = np.percentile(new_student_predictions, 2.5)
upper_bound_new_student = np.percentile(new_student_predictions, 97.5)

# Print prediction interval for the new observation
print(f"Prediction interval for the new student: Lower Bound = {lower_bound_new_student}, Upper Bound = {upper_bound_new_student}")



Prediction interval for the new student: Lower Bound = 164.44202898550725, Upper Bound = 225.837426686217


How does this compare to the intervals we get from linear regression?

In [80]:
import pandas as pd
import statsmodels.formula.api as smf

# Load the dataset
exp = pd.read_csv("https://richardson.byu.edu/220/student_expenses.csv")


formula = "expenses ~ Age + Study_year + Gender + Scholarship + Transporting"

# Fit the linear regression model using the formula API
model = smf.ols(formula, data=exp).fit()

# Define new observation for linear regression
new_student = pd.DataFrame([{'expenses':'', 'Gender': 'Female ', 'Age': 20, 'Study_year': 2, 'Scholarship': 'Yes', 'Transporting': 'No'}])

# Predict and calculate the prediction interval for the new observation
predictions_lr = model.get_prediction(new_student)
interval_lr = predictions_lr.summary_frame(alpha=0.05)[['obs_ci_lower', 'obs_ci_upper']]

# Print the prediction interval from linear regression
print("Linear Regression Prediction Interval for the New Student:")
print(interval_lr)


Linear Regression Prediction Interval for the New Student:
   obs_ci_lower  obs_ci_upper
0     88.781429      299.2447


In [83]:
predictions_lr.summary_frame(alpha=0.05)

         mean    mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  194.013065  12.270581     169.639064     218.387066     88.781429      299.2447
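The summary frame reports two intervals: `mean_ci_*` for the expected expenses of students like this one, and `obs_ci_*` for a single new student's expenses. The sketch below works backward from the printed numbers, assuming the usual OLS relationships (interval half-width = t critical value × standard error, and the new-observation standard error adds the residual variance to the squared `mean_se`). The variable names are hypothetical.

```python
import math

# Values copied from the summary_frame output above
mean_se = 12.270581
mean_ci_lower, mean_ci_upper = 169.639064, 218.387066
obs_ci_lower, obs_ci_upper = 88.781429, 299.2447

# Half-widths of the two intervals
mean_half = (mean_ci_upper - mean_ci_lower) / 2
obs_half = (obs_ci_upper - obs_ci_lower) / 2

# Implied t critical value: half-width of the mean interval over its SE
t_crit = mean_half / mean_se

# Implied SE for a new observation, and the residual SD it implies
# (assumes obs_se**2 = mean_se**2 + residual variance)
obs_se = obs_half / t_crit
resid_sd = math.sqrt(obs_se**2 - mean_se**2)

print(f"t critical value: {t_crit:.3f}")
print(f"SE for a new observation: {obs_se:.2f}")
print(f"Implied residual SD: {resid_sd:.2f}")
```

Under these assumptions, the extra width of the `obs_ci` interval comes almost entirely from the residual spread of individual students. The bootstrapped interval for the new student (about 164 to 226) is much closer to the `mean_ci` columns than to the `obs_ci` columns, which suggests it captures uncertainty in the fitted prediction rather than the spread of individual outcomes.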
