Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Varshini Rana"

# Presenting Uncertainty
## School of Information, University of Michigan

## Week 1: Assignment Overview
Version 1.1
### The objectives for this week are for you to:
- Visualize a population distribution, sample, and sampling distribution
- Practice visualizing density plots and histograms
- Practice bootstrapping a sampling distribution
- Review and reflect on the concept and motivation of bootstrap

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import altair as alt
from collections import Counter

# Part 1. Visualize the bootstrap of a single mean (15 points)

Suppose that you want to summarize how many times a day students at a university pick up their smartphones. At a university of a reasonable size (say 5,000 students), it would be hard to take a complete **census** of everyone at the school. Instead, you make an online survey which points respondents to a pickup-counting app they can install on their phone to help track pickups. In the next few days, you receive 30 students' responses with their number of pickups in a given day: a **sample** of all students at the school. You will calculate the mean of these 30 pickup counts to get an **estimate** of the average number of daily pickups among all students at the university.

What we would like to know is the **population mean**: the average number of times per day students at this university pick up their phones. This is a **population parameter**; hence our uncertainty in it is **parameter uncertainty**.

Since it is infeasible to collect data from all 5,000 students, we'll have to estimate that number based on a smaller **sample**. We also want to **quantify our uncertainty** in this estimate using some kind of distribution, as discussed in lecture. While there are many ways to do this (confidence distributions, Bayesian posterior distributions, etc.), for the purposes of this assignment we will construct a **bootstrap sampling distribution**. Below is a Python example to work through. (5 points)

### Question 1.1.1 Generating the population (2 points)

For the purposes of this example, we will first generate the population we want to sample from. Remember that in reality, we would not be generating the population, nor would we be able to observe the population directly! But for this first example, we want to explore the relationship between the population, the sample, and between the population mean and an estimate of the mean. So we will start with a known population, a luxury we do not have in real-world analyses.

Fill in the code below to construct a function to generate the population.

In [3]:
def get_population(lam, size=5000, seed=42):
    
    """
    Construct a function to generate the population for the phone pickup scenario.
    The population should be randomly drawn from a Poisson distribution with a mean
    of `lam`. Use the np.random.poisson() function to generate the data.
    Hint: the `lam` parameter of np.random.poisson() is the mean of the Poisson distribution.
    
    lam: the mean (lambda parameter of a Poisson distribution) used to generate the population
    size: the total number of people in the population (default: 5000 people).
    seed: seed used to initialize the random number generator for reproducibility (default: 42).
    
    This function should return the following variables as a tuple: (pickups, mean, std).
    pickups: a NumPy array of length `size` where each item is a randomly generated
        total pickup count from a Poisson distribution with mean `lam`
    mean: the population mean number of pickups
    std: the population standard deviation of pickups
    """
    
    np.random.seed(seed)
    # YOUR CODE HERE
#     raise NotImplementedError()

    pickups=np.random.poisson(lam=lam, size=size)
    mean=pickups.mean()
    std=pickups.std()
    
    return pickups, mean, std

In [4]:
# test your code, we want to check that population is correctly generated before continuing
pop_pickups, pop_mean, pop_std = get_population(45, 5000, 42)
assert np.abs(pop_mean - 45) < 0.5, "Get population problem, testing pop_mean, population mean returned does not match expected population mean"
assert np.abs(pop_std - np.sqrt(45)) < 0.5, "Get population problem, testing pop_std, population standard deviation returned does not match expected standard deviation"

Using the above function, you can construct a population and inspect the population mean and standard deviation as follows:

In [5]:
pop_pickups, pop_mean, pop_std = get_population(45, 5000)
print("Population mean:", pop_mean)
print("Population SD:", pop_std)

Population mean: 45.0598
Population SD: 6.786886175559452
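
As a quick sanity check on these numbers: a Poisson distribution has variance equal to its mean, so with `lam = 45` the theoretical SD is $\sqrt{45} \approx 6.71$, which is what the test above compares against. The sketch below is only an illustration; it generates a fresh population with an arbitrary seed (not the seed used by `get_population()`), and the variable names are assumptions.

```python
import numpy as np

lam = 45  # Poisson mean used in the example above
rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch only
population = rng.poisson(lam=lam, size=5000)

theoretical_sd = np.sqrt(lam)    # Poisson: variance equals the mean
empirical_sd = population.std()  # ddof=0, i.e. the population SD

print("Theoretical SD:", round(theoretical_sd, 3))
print("Empirical SD:  ", round(empirical_sd, 3))
```

With 5,000 draws the empirical SD should land close to the theoretical value, though not exactly on it.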


### Question 1.1.2 Sampling from the population (2 points)

Next, we'll generate a **sample** from the population. Fill in the `get_sample_statistics()` function:

In [6]:
def get_sample_statistics(pickups, size, seed=42):
    
    """
    Construct a function to draw a random sample from a population for the phone pickup scenario.
    
    pickups: the NumPy array containing the full population; e.g. the first element in the tuple
        generated by `get_population()`
    size: the number of people included in the sample pulled from the total population.
        For example, it may be 30 students.
    seed: seed used to initialize the random number generator for reproducibility (default: 42).
    
    This function should return the following variables as a tuple: (sample, mean, std)
    sample: a NumPy array where each item is randomly selected from the total population.
    mean: the mean of the sample
    std: the standard deviation of the sample
    """
    
    np.random.seed(seed)
    # YOUR CODE HERE
#     raise NotImplementedError()

    sample=np.random.choice(a=pickups, size=size)
    mean=sample.mean()
    std=sample.std()
    
    return sample, mean, std

In [7]:
# test your code, we want to check that the statistics for sample data looks correct before we continue doing more exploration
pop_pickups, pop_mean, pop_std = get_population(45, 5000, 42)
sample_pickups, sample_mean, sample_std = get_sample_statistics(pop_pickups, 30, 42)
assert np.abs(sample_mean - 44) < 0.5, "Get sample statistics problem, testing sample_mean, sample mean returned does not match expected sample mean"
assert np.abs(sample_std - 6) < 0.5, "Get sample statistics problem, testing sample_std, sample standard deviation returned does not match expected standard deviation"

Using the above function, you can construct a sample from the population as follows:

In [8]:
pop_pickups, pop_mean, pop_std = get_population(45, 5000)
sample_pickups, sample_mean, sample_std = get_sample_statistics(pop_pickups, 30)
print("Sample mean:", sample_mean)
print("Sample SD:", sample_std)

Sample mean: 44.333333333333336
Sample SD: 5.605552802554109
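
Because this example starts from a known population, we can also see how much the sample mean would vary if the survey were repeated many times. The sketch below is an illustration rather than part of the assignment: it generates its own Poisson population with an arbitrary seed, draws 2,000 hypothetical samples of size 30 without replacement, and compares the spread of their means with the textbook approximation $\sigma / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this illustration
population = rng.poisson(lam=45, size=5000)  # stand-in for the student population

n = 30            # sample size, as in the survey scenario
n_repeats = 2000  # hypothetical number of repeated surveys

# Mean of each repeated sample, drawn without replacement from the population
repeated_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_repeats)
])

approx_se = population.std() / np.sqrt(n)  # sigma / sqrt(n)
print("SD of repeated sample means:", repeated_means.std())
print("sigma / sqrt(n):            ", approx_se)
```

In a real analysis only one sample is available, which is why the bootstrap is used to approximate this distribution.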


## 1.3 Visualize and compare population distribution, sample, and bootstrap sampling distribution
The population distribution shows the values of the variable for all the individuals in the population. The distribution of sample data contains the values of the variable for all the individuals in the sample, which is a subset of the population.

For a reminder of the ideas behind sampling and confidence intervals, go to [Seeing Theory Chapter 2](https://seeing-theory.brown.edu/frequentist-inference/index.html#section2).

We'll start by visualizing the population distribution:

In [9]:
df = pd.DataFrame(data = {"Pickups" : pop_pickups})

pop_hist_chart = alt.Chart(df).mark_bar().encode(
    alt.X("Pickups", bin=alt.Bin(maxbins=40), 
        # set the x axis limits to be from 0 to 80
        # so that it is easier to compare across visualizations.
        scale=alt.Scale(domain=[0,80])  
    ),
    y='count()'
)

pop_hist_chart

Alternatively, we could map population density onto the y position by using the `transform_density()` method in Altair. Combined with an `area` mark, this allows us to create a density plot. We'll also add a dotted vertical rule indicating the **population mean** in black:

In [10]:
df = pd.DataFrame(data = {"pop_pickups" : pop_pickups})

pop_density_chart = alt.Chart(df).transform_density(
    density='pop_pickups',
    as_=['pop_pickups', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('pop_pickups', axis=alt.Axis(title="Population Distribution of Pickups")),
    y=alt.Y('density:Q', axis=alt.Axis(title="Density"))
)

pop_mean_chart = alt.Chart(df).mark_rule(color="black", strokeDash=[1,1]).encode(
    x="mean(pop_pickups)"
)

pop_chart = pop_density_chart + pop_mean_chart

pop_chart
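
The histogram and density views are closely related: a histogram rescaled so that its bars have a total area of 1 is a coarse density estimate. Below is a minimal NumPy sketch of that rescaling, using an illustrative Poisson array rather than the notebook's `pop_pickups`. Note that Altair's `transform_density()` uses kernel density estimation, which smooths the data instead of binning it, so this shows the idea rather than what Altair computes.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for this illustration
values = rng.poisson(lam=45, size=5000)

# Raw counts per bin (what a count histogram shows)
counts, edges = np.histogram(values, bins=40)
# The same bins, rescaled to a density
density, _ = np.histogram(values, bins=40, density=True)
bin_widths = np.diff(edges)

# density = counts / (n * bin width), so the bar areas sum to 1
total_area = np.sum(density * bin_widths)
print("Total area under the density histogram:", total_area)
```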

We can also look at a dot plot of the sample drawn from the population. We'll add a vertical rule indicating the **sample mean** in red:

In [11]:
df = pd.DataFrame(data = {"sample_pickups" : sample_pickups})

sample_dot_chart = alt.Chart(df).mark_circle(color="gray").encode(
    alt.X("sample_pickups", axis=alt.Axis(title="Sample of 30 Pickups"))
)

sample_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(sample_pickups)"
)

sample_chart = sample_dot_chart + sample_mean_chart + pop_mean_chart

sample_chart

### Question 1.3.1 Sampling distribution of the mean (4 points)

As mentioned in this week's lecture, one way to quantify the uncertainty in an estimate is through a **bootstrap sampling distribution**. To derive a bootstrap sampling distribution for a statistic (say, a mean) from a sample of size *k* (here, *k* = 30), repeat the following steps *B* times to obtain *B* statistics, which together form draws from your bootstrap sampling distribution:

1. Draw a random sample of size *k* with replacement from your existing sample.
2. Compute the statistic (e.g. the mean) for the sample.

We then choose a large value of *B* (say 5000) to approximate the sampling distribution of the desired statistic (e.g., the mean).
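
The two steps above can be sketched generically as follows. This is a minimal illustration under assumed names (`bootstrap_statistic` and the toy `data` array are hypothetical, not part of the assignment), and it uses NumPy's `Generator` API rather than the global seed used elsewhere in this notebook.

```python
import numpy as np

def bootstrap_statistic(sample, stat_fn, B=5000, seed=0):
    """Return B bootstrap replicates of stat_fn computed on resamples of `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    k = len(sample)
    replicates = np.empty(B)
    for b in range(B):
        # Step 1: resample k observations with replacement
        resample = rng.choice(sample, size=k, replace=True)
        # Step 2: compute the statistic on the resample
        replicates[b] = stat_fn(resample)
    return replicates

data = np.array([41, 52, 38, 47, 44, 50, 39, 45, 43, 48])  # toy data
boot_means = bootstrap_statistic(data, np.mean, B=2000)
print("Sample mean:            ", data.mean())
print("Mean of bootstrap means:", boot_means.mean())
```

Passing the statistic as a function means the same loop can be reused for a median or a regression coefficient.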

Follow the instructions below, within the code cell, to construct the bootstrap sampling distribution for the mean of your sample (i.e. the `sample_pickups` variable):

In [12]:
#Number of bootstrap samples --- typically something large, like 5000
B = 5000

# set the random seed for reproducibility
np.random.seed(42)

bootstrap_means = []
# construct the bootstrap sampling distribution of the mean of your sample (`sample_pickups`)
# and store it in `bootstrap_means`
# Hint: use np.random.choice()
# YOUR CODE HERE
# raise NotImplementedError()

for i in range(B):
    sample=np.random.choice(sample_pickups, size=30)
    avg=np.mean(sample)
    bootstrap_means.append(avg)

# bootstrap_means=np.random.choice(a=sample_pickups, size=B, replace=True)

print("Population mean:        ", pop_mean)
print("Sample mean:            ", sample_mean)
# the bootstrap sampling mean should be similar to the mean of your sample
print("Bootstrap sampling mean:", np.mean(bootstrap_means))

Population mean:         45.0598
Sample mean:             44.333333333333336
Bootstrap sampling mean: 44.35646666666666


In [13]:
# the bootstrap sampling mean should be close to the sample mean
sample_mean_se = np.std(sample_pickups, ddof=1) / np.sqrt(len(sample_pickups))
assert len(bootstrap_means) == 5000, "Bootstrap sampling distribution: length of bootstrap_means should be 5000"
assert np.abs(np.std(bootstrap_means) - sample_mean_se) / sample_mean_se < 0.05, "Bootstrap sampling distribution: SD of bootstrap_means does not match SE of sample_pickups"
assert np.abs(np.mean(bootstrap_means) - sample_mean)/sample_mean < 5*1e-3, "Bootstrap sampling distribution: mean of bootstrap_means does not match sample_mean"
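
The second assertion relies on a standard result: the SD of the bootstrap means approximates the standard error of the sample mean, $s / \sqrt{n}$. Here is a small sketch of that comparison on hypothetical data (the `toy_sample` array and the seed are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for this illustration
toy_sample = rng.poisson(lam=45, size=30)  # hypothetical sample of 30 pickups
n = len(toy_sample)

# Vectorized bootstrap: each row is one resample of size n
resamples = rng.choice(toy_sample, size=(5000, n), replace=True)
boot_means = resamples.mean(axis=1)

analytic_se = toy_sample.std(ddof=1) / np.sqrt(n)  # s / sqrt(n)
bootstrap_se = boot_means.std()                    # SD of the bootstrap means

print("Analytic SE: ", analytic_se)
print("Bootstrap SE:", bootstrap_se)
```

The bootstrap SD tends to be slightly smaller than $s / \sqrt{n}$ because resampling effectively uses the `ddof=0` variance of the sample; with $n = 30$ that difference is under 2%, which fits within the 5% tolerance used in the test above.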

### Question 1.3.2 Combined plot of population, sample, and sampling distribution (7 points)

Finally, plot your population distribution, sample, and the sampling distribution of the mean all together. Your output should look like this:

![Plot showing the population distribution, a sample of size 30, and a bootstrap sampling distribution of the mean stacked on top of each other](assets/assignment1_pop_sample_bootstrap.png)

In [14]:
# Hints: use the `.properties()` method to adjust width/height of plots, and
# use `.resolve_scale()` to make the x axes line up with each other. 
# `alt.vconcat()` (or the `&` operator) can be used to stack charts.

# YOUR CODE HERE
# raise NotImplementedError()

# population distribution
df = pd.DataFrame(data = {"pop_pickups" : pop_pickups})

pop_density_chart = alt.Chart(df).transform_density(
    density='pop_pickups',
    as_=['pop_pickups', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('pop_pickups', axis=alt.Axis(title="Population distribution of pickups")),
    y=alt.Y('density:Q', axis=alt.Axis(title="density"))
)

pop_mean_chart = alt.Chart(df).mark_rule(color="black", strokeDash=[1,1]).encode(
    x="mean(pop_pickups)"
)

pop_chart = (pop_density_chart + pop_mean_chart).properties(height=100, width=600)

# sample
df = pd.DataFrame(data = {"sample_pickups" : sample_pickups})

sample_dot_chart = alt.Chart(df).mark_circle(color="gray").encode(
    alt.X("sample_pickups", axis=alt.Axis(title="sample of 30 pickups"), scale=alt.Scale(domain=[20,70]))
)

sample_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(sample_pickups)"
)

sample_chart = (sample_dot_chart + sample_mean_chart + pop_mean_chart).properties(width=600)

# bootstrap sampling distribution
df = pd.DataFrame(data = {"bootstrap_means" : bootstrap_means})

bootstrap_density_chart = alt.Chart(df).transform_density(
    density='bootstrap_means',
    as_=['bootstrap_means', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('bootstrap_means', axis=alt.Axis(title="Bootstrap sampling distribution of mean pickups"), 
            scale=alt.Scale(domain=[20,70])),
    y=alt.Y('density:Q', axis=alt.Axis(title="density"))
)

bootstrap_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(bootstrap_means)"
)

bootstrap_chart = (bootstrap_density_chart + bootstrap_mean_chart + pop_mean_chart).properties(height=100, width=600)

# combined plot
pop_chart & sample_chart & bootstrap_chart

# Part 2. Motor Trend Cars Data Set (15 points)

Practice: In this section you will predict the gas efficiency (miles per gallon) of a car given its weight and horsepower. We will use the Motor Trend Cars Data Set (a.k.a. mtcars), a toy data set assembled in 1974 in the United States. The response variable is miles per gallon (mpg), a common measure of fuel efficiency. The data file contains 10 attributes of automobiles that can be used to explain or predict fuel efficiency (mpg).

Linear regression models the relationship between a response variable (sometimes called a dependent variable) and one or more predictors (sometimes called independent variables) as a linear function, like this:

$\begin{align}
y_i &\sim \mathrm{Normal}(\mu_i, \sigma^2)\\
\mu_i &= \theta_{0} + \theta_{1}x_{1,i} + \cdots + \theta_{p}x_{p,i}
\end{align}
$

Where:

- $y_i$ is observation $i$
- $x_{p,i}$ is predictor $p$ for observation $i$
- $\theta_p$ is the *coefficient* for predictor $p$
- $\theta_0$ is the intercept

We would like to estimate those coefficients and quantify their uncertainty using bootstrapping.
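
To make the model concrete, the sketch below simulates data from it with made-up coefficients and recovers them by least squares. It uses `np.linalg.lstsq` purely for illustration, whereas the assignment itself uses scikit-learn's `LinearRegression`. All numeric values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed for this illustration
n = 500
theta = np.array([10.0, 2.0, -3.0])  # hypothetical [theta_0, theta_1, theta_2]
sigma = 1.0                          # hypothetical residual SD

x = rng.normal(size=(n, 2))                # two predictors
design = np.column_stack([np.ones(n), x])  # column of ones for the intercept
mu = design @ theta                        # mu_i = theta_0 + theta_1*x_1i + theta_2*x_2i
y = rng.normal(loc=mu, scale=sigma)        # y_i ~ Normal(mu_i, sigma^2)

# Ordinary least squares estimate of the coefficients
theta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("True coefficients:     ", theta)
print("Estimated coefficients:", theta_hat.round(2))
```

The estimates are close to, but not exactly equal to, the true values; that gap is the parameter uncertainty the bootstrap is meant to quantify.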

### Reminder: Large-world versus small-world uncertainty

We are going to use ordinary least squares (OLS) linear regression, which has several assumptions behind it. These include things like *constant variance* (also called homoskedasticity), which means that the variance of observations ($\sigma^2$ in the above formula) is not a function of the predictors (the $x$s); and that the effects of predictors on the mean of the response are *linear* and *additive*. The validity of these assumptions is related to large-world uncertainty, and should be evaluated in real world applications. We will return to large-world uncertainty later in the course.
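
As a rough illustration of what a violation of constant variance looks like, the sketch below simulates two hypothetical data sets, one with constant residual spread and one where the spread grows with the predictor, and compares the residual SD in the upper and lower halves of the predictor range. The helper `residual_sd_ratio` and all numeric values are assumptions for illustration; this is an informal check, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed for this illustration
n = 2000
x = rng.uniform(1, 10, size=n)

def residual_sd_ratio(x, y):
    """Residual SD for high-x observations divided by that for low-x ones,
    after fitting a straight line by least squares."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    high = x > np.median(x)
    return residuals[high].std() / residuals[~high].std()

y_constant = 3 + 2 * x + rng.normal(scale=2.0, size=n)     # constant variance
y_growing = 3 + 2 * x + rng.normal(scale=0.5 * x, size=n)  # SD grows with x

ratio_constant = residual_sd_ratio(x, y_constant)
ratio_growing = residual_sd_ratio(x, y_growing)
print("Residual SD ratio, constant variance:", round(ratio_constant, 2))
print("Residual SD ratio, growing variance: ", round(ratio_growing, 2))
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 suggests the spread depends on the predictor.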

## 2.1 Load and explore the data

As stated above, we will be using the mtcars data set to predict (estimate) the fuel economy of a car (mpg). There are several things to note. First, this is not a regression modeling class, though it helps if you have a basic understanding of regression. Resources will be shared to help round out details of regression for those new to the subject. You do need to be able to interpret the meaning of a predictor coefficient in a basic multiple regression model. Second, the purpose of this exercise is to explore and discuss uncertainty in regression modeling. The focus of your understanding should be on how and why we model uncertainty in this context.

From the mtcars data set, we're going to use two variables as predictors:

- **hp**: horsepower -- gross horsepower, a measure of the engine's theoretical power output
- **wt**: weight -- the overall weight of the vehicle in units of 1,000 lbs (half a US ton)

Our response variable will be **mpg** -- miles per gallon (a measure of fuel efficiency).

Let's load the dataset and view some descriptive statistics.

In [15]:
#Read in the data files
data = pd.read_csv(r'assets/mtcars.csv')
data.columns

Index(['model', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

In [16]:
#create a dataframe containing predictors (cars_X) and the response variable (cars_y)

cars_X = data[['hp','wt']]
cars_y = pd.Series(data['mpg'])

#also create a combined data frame with both predictors and response variables
cars_df = pd.concat([cars_y, cars_X], axis=1)

# show the first ten rows of the data we'll be using
cars_df.head(n = 10).round(1)

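|   | mpg | hp | wt |
|---|---|---|---|
| 0 | 21.0 | 110 | 2.6 |
| 1 | 21.0 | 110 | 2.9 |
| 2 | 22.8 | 93 | 2.3 |
| 3 | 21.4 | 110 | 3.2 |
| 4 | 18.7 | 175 | 3.4 |
| 5 | 18.1 | 105 | 3.5 |
| 6 | 14.3 | 245 | 3.6 |
| 7 | 24.4 | 62 | 3.2 |
| 8 | 22.8 | 95 | 3.2 |
| 9 | 19.2 | 123 | 3.4 |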


In [17]:
#Get some statistics from our variables: count, mean, standard deviation, etc.
cars_df.describe().round(1)

|   | mpg | hp | wt |
|---|---|---|---|
| count | 32.0 | 32.0 | 32.0 |
| mean | 20.1 | 146.7 | 3.2 |
| std | 6.0 | 68.6 | 1.0 |
| min | 10.4 | 52.0 | 1.5 |
| 25% | 15.4 | 96.5 | 2.6 |
| 50% | 19.2 | 123.0 | 3.3 |
| 75% | 22.8 | 180.0 | 3.6 |
| max | 33.9 | 335.0 | 5.4 |


Many factors may have an impact on the fuel efficiency of an automobile. It's usually a good idea to generate a data visualization to show the relationship between your variables. Below, we visualize the relationship between each of our two predictors (`wt` and `hp`) and our measure of fuel efficiency (`mpg`):

In [18]:
base = alt.Chart(cars_df).mark_circle(color="black").encode(
    alt.X("wt"), alt.Y("mpg")
)

# Create regression line
linear_fit = base.transform_regression(
    "wt", "mpg", method="linear"
).mark_line()

base + linear_fit

In [19]:
base = alt.Chart(cars_df).mark_circle(color="black").encode(
        alt.X("hp"), alt.Y("mpg")
)

# Create regression line
linear_fit = base.transform_regression(
    "hp", "mpg", method="linear"
).mark_line()

base + linear_fit

From the plots above, we see that both `wt` and `hp` are correlated with `mpg` (though the scatterplots illustrate that these relationships are far from perfect). Let's fit a linear model (multiple regression) using these explanatory variables to predict fuel efficiency (mpg).

## 2.2 Train a linear model

A linear model for `mpg` based on `hp` and `wt` might look like this:

$\begin{align}
\mathrm{mpg}_i &\sim \mathrm{Normal}(\mu_i, \sigma^2)\\
\mu_i &= \theta_{0} + \theta_{\mathrm{hp}}\mathrm{hp}_{i} + \theta_{\mathrm{wt}}\mathrm{wt}_{i}
\end{align}
$

Where:

- $\mathrm{mpg}_i$ is the `mpg` on row $i$
- $\mathrm{hp}_{i}$ is the value of `hp` on row $i$
- $\mathrm{wt}_{i}$ is the value of `wt` on row $i$
- $\theta_\mathrm{hp}$ is the *coefficient* for `hp`
- $\theta_\mathrm{wt}$ is the *coefficient* for `wt`
- $\theta_0$ is the intercept
- $\sigma$ is the standard deviation of the mpg conditional on the predictors

### Question 2.2.1 Linear regression (5 points)

- Complete the `linear_reg` function below to return the model above given a data frame of predictors (`X`) and a data frame with the response variable (`y`). You should use sklearn's class: `linear_model.LinearRegression`.

In [20]:
def linear_reg(X, y):
    """
    X: A pandas DataFrame containing the predictors as columns
    y: A pandas Series containing the response variable
    
    step 1: Initialize the linear regression model
    step 2: Fit the model on X and y
    """
    # YOUR CODE HERE
#     raise NotImplementedError()

    model=linear_model.LinearRegression()
    model.fit(X, y)
   
    return model

In [21]:
# test your linear regression code
reg = linear_reg(cars_X, cars_y)
assert len(reg.coef_) == 2, "Implement linear regression problem, testing the number of coefficients, number of coefficients returned does not match expected, check your linear regression code"
assert np.abs(reg.coef_[0] + 0.03177) < 1, "Implement linear regression problem, testing the hp coefficient, hp coefficient returned does not match expected, check your linear regression code"
assert np.abs(reg.coef_[1] + 3.8778) < 1, "Implement linear regression problem, testing the wt coefficient, wt coefficient returned does not match expected, check your linear regression code"

The coefficients below give us an estimate of the impact of a one-unit increase in each predictor on the fuel efficiency (mpg), holding the other predictor constant ---*contingent upon our model assumptions matching well with the real world*.

But remember that there is uncertainty in these estimates!

In [22]:
#Print the coefficients/weights for each feature/column of our model
print("hp coefficient:        %.2f" % (reg.coef_[0]))
print("weight coefficient:    %.2f" % (reg.coef_[1]))

hp coefficient:        -0.03
weight coefficient:    -3.88
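
To read these numbers: under the fitted model, the predicted difference in `mpg` between two cars depends only on their differences in `hp` and `wt`, so the intercept is not needed. Here is a small worked sketch using the rounded coefficients printed above (the helper name `predicted_mpg_change` is hypothetical):

```python
theta_hp = -0.03  # rounded hp coefficient from the output above
theta_wt = -3.88  # rounded wt coefficient from the output above

def predicted_mpg_change(delta_hp, delta_wt):
    """Predicted change in mpg for given changes in hp and wt (wt in 1,000 lbs)."""
    return theta_hp * delta_hp + theta_wt * delta_wt

# A car that is 1,000 lbs heavier with the same horsepower
heavier = predicted_mpg_change(delta_hp=0, delta_wt=1)
# A car with 50 more horsepower at the same weight
stronger = predicted_mpg_change(delta_hp=50, delta_wt=0)

print(f"1,000 lbs heavier: {heavier:.2f} mpg")
print(f"50 more hp:        {stronger:.2f} mpg")
```

These are point predictions only; they say nothing yet about how uncertain the coefficients are.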


## 2.3 Derive small-world uncertainty using bootstrapping

In our bootstrapping methods and analysis, we will focus on the coefficient for weight, $\theta_{wt}$. We would like to explore the partial relationship between fuel efficiency (mpg) and weight (wt), holding the horsepower (hp) constant.

### Question 2.3.1 Bootstrapping the weight coefficient, theta_wt (5 points)

We will follow the same steps as when we bootstrapped the mean in Part 1:

1. Draw a random sample of size *k* with replacement from your existing sample.
2. Compute the statistic for the sample (in this case, fit a linear model and extract $\theta_\mathrm{wt}$ from it)
3. Replicate steps 1 and 2 *B* times, to get *B* statistics, forming samples from your bootstrap sampling distribution, $\tilde\theta_\mathrm{wt}$
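
The three steps above can be sketched as follows on synthetic data. This is an illustration under stated assumptions, not the assignment solution: the data are simulated, the coefficients are made up, and `np.linalg.lstsq` stands in for the `linear_reg()` function.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed for this illustration
k = 32                          # number of rows in the hypothetical data set

# Simulated predictors and response with made-up coefficients
x1 = rng.uniform(50, 300, size=k)
x2 = rng.uniform(1.5, 5.5, size=k)
y = 37.0 - 0.03 * x1 - 4.0 * x2 + rng.normal(scale=2.5, size=k)
design = np.column_stack([np.ones(k), x1, x2])  # intercept, x1, x2

B = 2000
boot_theta_2 = np.empty(B)
for b in range(B):
    # Step 1: resample whole rows with replacement (predictors and response together)
    rows = rng.integers(0, k, size=k)
    # Step 2: refit the model and extract the coefficient of interest
    coef, *_ = np.linalg.lstsq(design[rows], y[rows], rcond=None)
    boot_theta_2[b] = coef[2]
# Step 3 is the loop itself: B replicates of the x2 coefficient

print("Bootstrap mean of x2 coefficient:", boot_theta_2.mean())
print("Bootstrap SD of x2 coefficient:  ", boot_theta_2.std())
```

Resampling row indices keeps each response paired with its own predictors, which is the same thing `DataFrame.sample(replace=True)` does for a data frame.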

In [23]:
k = len(cars_df)
B = 5000
# seed for the random number generator (for reproducibility)
random_state = np.random.RandomState(42)

bootstrap_theta_wt = []
for _ in range(B):
    #get a random sample of k rows from cars_df *with replacement*
    df_sample = cars_df.sample(n=k, replace=True, random_state=random_state)
    
    #split out predictors (assign them to a data frame called X) 
    #and response variable (assign it to a series called y)
    # YOUR CODE HERE
#     raise NotImplementedError()
    X=df_sample.drop("mpg", axis=1)
    y=df_sample["mpg"]
    
    #fit the model using the linear_regression function we implemented before
    # assign it to the variable `model`
    # YOUR CODE HERE
#     raise NotImplementedError()
    model=linear_reg(X, y)
    
    #extract theta_wt coefficient from the fit
    # assign it to the variable `theta_wt`
    # YOUR CODE HERE
#     raise NotImplementedError()
    theta_wt=model.coef_[1]
    
    #add wt coefficient to list of bootstrap samples
    bootstrap_theta_wt.append(theta_wt)

In [24]:
# test your bootstrap code for theta_wt
assert np.abs(np.mean(bootstrap_theta_wt) + 3.8635) < .001, "Bootstrap theta_wt: bootstrap mean of theta_wt does not match expected value, check your bootstrapping code"
assert np.abs(np.std(bootstrap_theta_wt) - 0.7066) < .001, "Bootstrap theta_wt: bootstrap SD of theta_wt does not match expected value, check your bootstrapping code"
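
One common way to summarize a roughly bell-shaped bootstrap distribution is the mean plus or minus about two bootstrap SDs. As a rough sketch using the expected values from the test above (the normal approximation is an assumption here; a percentile interval computed directly from `bootstrap_theta_wt` is an alternative):

```python
boot_mean = -3.8635  # expected bootstrap mean of theta_wt, from the test above
boot_sd = 0.7066     # expected bootstrap SD of theta_wt, from the test above

z = 1.96  # normal quantile for an approximate 95% interval
lower = boot_mean - z * boot_sd
upper = boot_mean + z * boot_sd

print(f"Approximate 95% interval for theta_wt: [{lower:.2f}, {upper:.2f}]")
# prints: Approximate 95% interval for theta_wt: [-5.25, -2.48]
```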

### Question 2.3.2 Visualizing the bootstrap sampling distribution of theta_wt (5 points)

Visualize the bootstrap sampling distribution ($\tilde{\theta}_{wt}$ = `bootstrap_theta_wt`) as a *density plot*, marking the mean with a vertical rule, as we did for the sampling distribution at the end of Part 1:

In [25]:
# YOUR CODE HERE
# raise NotImplementedError()

df = pd.DataFrame(data = {"bootstrap_theta_wt" : bootstrap_theta_wt})

bootstrap_theta_wt_density_chart = alt.Chart(df).transform_density(
    density='bootstrap_theta_wt',
    as_=['bootstrap_theta_wt', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('bootstrap_theta_wt', axis=alt.Axis(title="Bootstrap Sampling Distribution of theta_wt")),
    y=alt.Y('density:Q', axis=alt.Axis(title="Density"))
)

bootstrap_theta_wt_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(bootstrap_theta_wt)"
)

bootstrap_theta_wt_chart = bootstrap_theta_wt_density_chart + bootstrap_theta_wt_mean_chart

bootstrap_theta_wt_chart

### Question 2.3.3 Reflect on your visualization and results (5 points)

How would you describe the results of your model above? Write up a description referencing the above visualization and taking into account your uncertainty. Be sure to focus your analysis on what this tells us about the uncertainty in the relationship between car weight (wt) and fuel efficiency (mpg). Also, discuss what in this exercise should be considered large world uncertainty and what should be considered small world uncertainty. Be sure to be specific.


Answer: Ordinary least squares (OLS) regression was used to model the relationship between fuel efficiency, measured in miles per gallon (mpg), and two predictors: car weight (wt) and horsepower (hp). In the fitted model, mpg is negatively associated with both predictors, with coefficients of -3.88 for wt and -0.03 for hp. Holding the other predictor constant, each additional 1,000 lbs of weight is associated with a decrease of about 3.88 mpg, and each additional unit of horsepower with a decrease of about 0.03 mpg.
 
However, these results are contingent upon the model assumptions matching well with the real world, which is a matter of **large-world uncertainty**. These assumptions include constant variance (homoskedasticity), meaning that the variance of the observations is not a function of the predictors, and that the effects of the predictors on the mean of the response are linear and additive. Even if these assumptions hold, the coefficients are estimated from only 32 cars and therefore have some uncertainty associated with them.

To quantify the uncertainty in the estimates within the model, a bootstrap sampling distribution was used, which is a measure of **small-world uncertainty**. In particular, the partial relationship between mpg and wt was explored, holding hp constant and focusing on the wt coefficient $\theta_\mathrm{wt}$. The visualization above shows the bootstrap sampling distribution $\tilde\theta_\mathrm{wt}$ as a density plot, with a vertical red line marking the mean of the distribution. This distribution is the result of resampling the dataset 5000 times with replacement and fitting the model to each bootstrap sample to obtain a wt coefficient. The bootstrap values range from about -6.5 to about -1.5, with a mean of about -3.86 and an SD of about 0.71. Every value in that range is negative, so the direction of the relationship between weight and fuel efficiency appears stable, while its magnitude remains uncertain. This bootstrap distribution approximates the true sampling distribution of the coefficient: it is centered near the sample estimate (-3.88), and its width indicates the standard error of that estimate.

Please remember to submit both the HTML and .ipynb formats of your completed notebook. When generating your HTML, be sure to run your complete code first before downloading as HTML.
Please remember to work on your explanations and interpretations!