# Tutorial 2: Bootstrapping and its Relationship to the Sampling Distribution

### Lecture and Tutorial Learning Goals
After completing this week's lecture and tutorial work, you will be able to:
1. Contrast quantitative and categorical variables.
2. Explain why we don’t know/have a sampling distribution in practice/real life.
3. Describe the standard deviation and the variance and write computer scripts to calculate estimates of these parameters.
4. Define standard error and explain its purpose.
5. Define bootstrapping.
6. Write a computer script to create a bootstrap distribution to approximate a sampling distribution.
7. Contrast a bootstrap sampling distribution with a sampling distribution obtained using multiple samples.
8. Contrast sampling with and without replacement.

In [None]:
# Run this cell before continuing.
library(cowplot)
library(datateachr)
library(digest)
library(gridExtra)
library(infer)
library(repr)
library(taxyvr)
library(tidyverse)
source("tests_tutorial_02.R")

## 1. Warm-Up Questions

Let's start off with a few questions about bootstrapping and sampling practices in reality.

**Question 1.0**
<br>{points: 3}

In 1-2 sentences, explain what bootstrapping is useful for.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.1**
<br>{points: 1}

True or false?

A bootstrap sampling distribution will **always** have a similar width to the sampling distribution it is approximating.

_Assign your answer to an object called `answer1.1`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer1.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

True or false?

In reality, when we take a sample from the population, we are sampling with replacement.

_Assign your answer to an object called `answer1.2`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

## 2. Proportions

So far, we have only been interested in the mean and the median as population parameters. However, these parameters can only describe numerical data; what about categorical data? For this type of data, we often refer to **proportions**. Recall that a proportion is the ratio of the number of individuals in the population with a given attribute to the size of the entire population. For example, if there are $n$ individuals in our population and $x$ of them have a given attribute of interest, then the proportion, $p$, of individuals with that attribute is

$$p = \frac{x}{n}$$
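
As a quick illustration of the formula, the sketch below computes a proportion from a small made-up vector (the `has_pet` values and the names `p_toy` and `p_toy_alt` are hypothetical and not part of any dataset used in this tutorial). Because R treats `TRUE` as 1 and `FALSE` as 0, taking the mean of a logical comparison gives the same result as dividing the count by the total.

```r
# Hypothetical toy data: whether each of 8 individuals has a pet.
has_pet <- c("yes", "no", "no", "yes", "no", "no", "no", "yes")

x <- sum(has_pet == "yes")  # number of individuals with the attribute
n <- length(has_pet)        # total number of individuals

p_toy <- x / n                       # proportion, following p = x / n
p_toy_alt <- mean(has_pet == "yes")  # equivalent shortcut

p_toy
```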

Just like with the other parameters we have explored, we can also estimate the proportion of a population by taking a sample and calculating a point estimate. Thus, we can also have a sampling distribution of sample proportions. But how does this sampling distribution look? Does it have the same properties as the others we have explored? Let's find out.


Before we continue, here are a couple of questions to reinforce your knowledge of the concept of proportion.

**Question 2.0**
<br>{points: 1}

Consider the population of students enrolled at UBC's Vancouver campus. As of the start of the 2019 Winter term, there were 57250 students in total, and among them, 10294 were Faculty of Science students ([source](https://bog3.sites.olt.ubc.ca/files/2020/01/4_2020.02_Enrolment-Annual-Report.pdf)). What is the proportion of Science students at UBC?

_Assign your answer to an object called `answer2.0`. Your answer should be a single number._

In [None]:
# answer2.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

answer2.0

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

You are told that 168 students in your BIOL 200 section have taken, or are currently taking, one or more CPSC or STAT courses, which is 60% of the students in the section. How many students are there in total in your BIOL 200 section?

_Assign your answer to an object called `answer2.1`. Your answer should be a single number._

In [None]:
# answer2.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer
answer2.1

In [None]:
test_2.1()

### Proportions of Vancouver Property Tax 

To learn more about proportions and the sampling distribution of sample proportions, we are going to revisit the `tax_2019` dataset from the `taxyvr` R package. However, this time we'll be looking at a different variable for **all** properties in Vancouver: `geo_local_area`, which describes the local area where the property can be found (such as Kitsilano, Downtown, or Oakridge). Specifically, we will be looking at the **proportion of properties in Vancouver located downtown** (those with a `geo_local_area` of `"Downtown"`), which is our population parameter of interest.

Just as before, don't forget about where we are in the grand scheme of things. Let's remind ourselves of the two things we should be keeping in mind as we work through this section:

> First, we don't usually have access to data for the entire population that we are interested in. If we did, we could always calculate the population parameter directly. Here, we are taking the opportunity of having access to these entire populations to study sampling distributions. Second, always remember the purpose of learning about sampling distributions. By learning about the properties of sampling distributions, you will be able to understand the inherent variability/error in point estimates. This "error" associated with a point estimate is critical, and in later weeks we will learn how to report it formally.
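
To make the idea of variability in point estimates concrete, here is a minimal base R sketch using a made-up population (the vector `toy_pop`, the seed, the sample size, and the number of samples are all assumptions for illustration; it deliberately does not use `tax_2019` or `rep_sample_n`). It draws many samples, computes a sample proportion from each, and summarizes how much those estimates vary.

```r
set.seed(1)  # arbitrary seed, for reproducibility of this sketch only

# Hypothetical population of 10,000 individuals, 30% of whom are in category "A".
toy_pop <- rep(c("A", "B"), times = c(3000, 7000))

# Draw 1,000 samples of size 50 (without replacement) and record the
# proportion of "A" in each sample.
toy_estimates <- replicate(1000, mean(sample(toy_pop, size = 50) == "A"))

mean(toy_estimates)  # centre of the simulated sampling distribution
sd(toy_estimates)    # spread of the point estimates across samples
```

Each value in `toy_estimates` is one point estimate; the spread of these values is the variability that the sampling distribution describes.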

**Question 2.2**
<br>{points: 1}

True or false?

The `geo_local_area` variable in the `tax_2019` dataset is an example of a categorical variable.

_Assign your answer to an object called `answer2.2`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer2.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

From the `tax_2019` dataset, filter out the rows that contain an `NA` in the `geo_local_area` column and select only this column.

_Assign your data frame to an object called `geo_pop`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(geo_pop)

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Calculate the proportion of properties in our population `geo_pop` that are `"Downtown"`; this is the population parameter of interest. Recall that we are going to be exploring the effect of sample size on the distributions of the point estimates for this parameter.

To do this, copy and paste the code below and then rearrange the lines so that the code runs properly.

    count(geo_local_area) %>% 
    p <- 
    geo_pop %>% 
    pull(p)
    filter(geo_local_area == "Downtown") %>% 
    mutate(p = n/sum(n)) %>% 
    
_Assign your answer to an object called `p`. Your answer should be a single number._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

p

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

Let's take our first set of samples from our population `geo_pop`. First, take 2000 random samples of size 10 using the `rep_sample_n` function and a seed of `2410`.

_Assign your data frame to an object called `samples_10`._

In [None]:
set.seed(2410) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer
head(samples_10)

In [None]:
test_2.5()

**Question 2.6**
<br>{points: 1}

Next, calculate the proportion of `"Downtown"` properties in each sample you took in the previous question. Name the new column containing the sample proportions `sample_proportion`.

Use the scaffolding provided below as a guide:

```r
sample_proportions_10 <- 
    ... %>% 
    group_by(...) %>% 
    summarize(... = 
              sum(...)/n())
```

_Assign your data frame to an object called `sample_proportions_10`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(sample_proportions_10)

In [None]:
test_2.6()

**Question 2.7**
<br>{points: 1}

Finally, visualize the distribution of the sample proportions from the previous question by plotting a histogram using `geom_histogram` with the argument `binwidth = 1/10`. Add a title of "n = 10" to the plot using `ggtitle` and ensure that the x-axis has a human-readable label.

_Assign your plot to an object called `sampling_dist_10`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_10

In [None]:
test_2.7()

**Question 2.8**
<br>{points: 1}

The example above shows that not all sampling distributions come out as a nice symmetrical bell shape. 

_Use the histogram above to answer the following question._

The true proportion of buildings in Vancouver that are located downtown is 0.195. Suppose the data was adjusted such that the true proportion is now 0.5, and we created another sampling distribution with samples of size 100 using the code above. How would the symmetry of the new sampling distribution compare to the one generated above?

A. The new sampling distribution would be less symmetrical.

B. The symmetry of the new sampling distribution would be about the same.

C. The new sampling distribution would be more symmetrical.

D. It is impossible to tell how the symmetry of the new sampling distribution would compare.

_Assign your answer to an object called `answer2.8`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer2.8 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.8()

In Module 3, we will explore in detail how sample size affects the sampling distribution.

## 3. Bootstrap Distribution vs Sampling Distribution

### Root Barriers

In this section, we are going to test the limits of bootstrapping to see whether it results in reliable approximations of asymmetrical sampling distributions, such as the one shown above. To do this, we will attempt to use bootstrapping to estimate sampling distributions that we know are even less symmetrical and compare them to see if our estimates are reasonable. One population that we have at our disposal that yields some asymmetrical sampling distributions is the `vancouver_trees` data set from the `datateachr` package. One example of this is the sampling distribution of sample proportions for the `root_barrier` variable; in this section, we will be looking at the proportion of trees that **do not** have a root barrier.

<img src="https://www.flexiblelining.co.uk/media/shared/product-images/urban-hard-landscaping/rootbarrier/170UR4170-rootbarrier-panels-1.jpg" width=400>

<div style="text-align: center"><i>Image from <a href="https://www.flexiblelining.co.uk/green-roof-systems/roof-garden-root-barrier/ribbed-root-barrier-panels">Flexible Lining Products</a></i></div>

Recall that the `vancouver_trees` dataset contains information about public trees planted along boulevards in Vancouver. The `root_barrier` variable in this dataset specifies whether or not a tree was planted with a root barrier. A root barrier is a type of underground wall that protects buildings, sidewalks, and roads from roots, which can severely damage these structures. One example of a root barrier is shown in the picture above.

**Question 3.1** 
<br> {points: 1}

Filter `vancouver_trees` such that there are no `NA` values in the `root_barrier` column, and then select only that column. Use the scaffolding provided below as a guide:

```r
barrier_pop <- vancouver_trees %>% 
    filter(...) %>% 
    ...(root_barrier)
```

_Assign your data frame to an object called `barrier_pop`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(barrier_pop)

In [None]:
test_3.1()

**Question 3.2** 
<br> {points: 1}

Draw 2000 random samples of size 20 from the population `barrier_pop` using the `rep_sample_n` function and a seed of 3909. For each sample, calculate the proportion of trees that **do not** have a root barrier (i.e. where `root_barrier == "N"`) as the point estimate. Lastly, we will visualize the distribution of the sample proportions you just calculated by plotting a histogram. The plot should have a variable named `p` on the x-axis. Use the scaffolding provided below as a guide:

```r
barrier_sampling_dist <- 
    ... %>% 
    rep_sample_n(size = ..., reps = ..., replace = ...) %>% 
    ...(replicate) %>% 
    summarize(x = sum(... == "N"),
              n = n()) %>% 
    mutate(p = ... / ...) %>% 
    ggplot(aes(x = ...)) +
        geom_histogram(binwidth = 1/20, color = 'white') +
        xlab("Proportion") +
        ggtitle("Sampling Distribution of Proportions (n = 20)")
        
```

_Assign your plot to an object called `barrier_sampling_dist`._

In [None]:
set.seed(3909) # DO NOT CHANGE!
        
# your code here
fail() # No Answer - remove if you provide an answer

barrier_sampling_dist

In [None]:
test_3.2()

**Question 3.3** 
<br> {points: 1}

Take a single random sample of size 20 from `barrier_pop` using `rep_sample_n` and a seed of 1933. Ensure your resulting data frame only has a single column: `root_barrier`.

**Hint:** Remember to `ungroup()` before using `select()`!
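
The reason for this hint is that `rep_sample_n` returns a data frame grouped by `replicate`, and `select()` always keeps grouping columns. Here is a minimal sketch with a hypothetical grouped tibble (`toy_grouped` and its values are made up for illustration and simply stand in for the output of `rep_sample_n`):

```r
library(dplyr)

# Hypothetical stand-in for a grouped data frame such as rep_sample_n() output.
toy_grouped <- tibble(replicate = c(1, 1, 1),
                      value = c("Y", "N", "N")) %>%
    group_by(replicate)

# select() on a grouped data frame keeps the grouping column.
still_grouped <- toy_grouped %>% select(value)
names(still_grouped)  # "replicate" "value"

# Dropping the grouping first leaves only the requested column.
one_column <- toy_grouped %>% ungroup() %>% select(value)
names(one_column)     # "value"
```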

_Assign your data frame to an object called `barrier_sample`._

In [None]:
set.seed(1933) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer

head(barrier_sample)

In [None]:
test_3.3()

**Question 3.4** 
<br> {points: 1}

Now we want to produce a bootstrap sampling distribution using the `barrier_sample` sample we just took, which we will be able to compare to the sampling distribution we generated above. We want to use the exact same scaffolding as **question 3.2** (except for the name of the object we are saving to) to complete the following task: 

> Take 2000 bootstrap samples from `barrier_sample` using `rep_sample_n` with a seed of 2767. Then, calculate the proportion of trees in each sample that do not have a root barrier (`root_barrier == "N"`); name the column containing the sample proportions `p`. Lastly, use `geom_histogram` with bin widths of 1/20 to visualize the bootstrap distribution. Add a descriptive title to the plot using `ggtitle` and ensure that the x-axis has a human-readable label. 

**Which two `...`'s in the scaffolding below _must_ be different than the code you used in question 3.2?**

```R
# LINE  1:    bootstrap_dist_20 <- ... %>% 
# LINE  2:       rep_sample_n(size = ..., reps = ..., replace = ...) %>% 
# LINE  3:       ...(replicate) %>% 
# LINE  4:       summarize(x = sum(... == "N"),
# LINE  5:                 n = n()) %>% 
# LINE  6:       mutate(p = ... / ...) %>% 
# LINE  7:       ggplot(aes(x = p)) +
# LINE  8:           geom_histogram(... = ...) +
# LINE  9:           xlab("Proportion") +
# LINE 10:           ...("n = 20")
```

A. The `...` in `LINE 1` and the third `...` from the left in `LINE 2`

B. The `...` in `LINE 1` and the second `...` from the left in `LINE 8`

C. The first `...` from the left in `LINE 2` and the third `...` from the left in `LINE 2`

D. The first `...` from the left in `LINE 2` and the second `...` from the left in `LINE 8`

E. Some other two `...`'s not listed above.

F. None of the above; only one `...` must be different.

G. None of the above; three or more of the `...` must be different.

_Assign your answer to an object called `answer3.4`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer3.4 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.4()

**Question 3.5** 
<br> {points: 1}

Take 2000 bootstrap samples from `barrier_sample` using `rep_sample_n` with a seed of 2767. Then, calculate the proportion of trees in each sample that do not have a root barrier (`root_barrier == "N"`). Lastly, use `geom_histogram` with bin widths of 1/20 to visualize the bootstrap distribution. Add a descriptive title to the plot using `ggtitle` and ensure that the x-axis has a human-readable label. 

**Hint:** use your answer to the previous question and your code from **question 3.2**.

_Assign your plot to an object called `barrier_bootstrap_dist`._

In [None]:
set.seed(2767) # DO NOT CHANGE!

# barrier_bootstrap_dist <- 
#     ... %>% # this is a multiline command that you need to fill
#     ggplot(aes(x = p)) +
#     geom_histogram(binwidth = 1/20, color = 'white') +
#     xlab("Proportion") +
#     ggtitle("Bootstrap Distribution of Sample Proportions (n = 20)")

# your code here
fail() # No Answer - remove if you provide an answer

barrier_bootstrap_dist

In [None]:
test_3.5()

**Question 3.6** 
<br> {points: 1}

**Note:** this question has two parts!

a) Calculate the standard deviation of the sampling distribution you generated above (`barrier_sampling_dist`); this is the standard error of the corresponding estimator.

_Assign your answer to an object called `standard_error`. Your answer should be a single number._

<br>

b) Calculate the standard deviation of the bootstrap distribution you generated above (`barrier_bootstrap_dist`).

_Assign your answer to an object called `standard_deviation`. Your answer should be a single number._

**Hints:**
- You can get the data that was used to generate a plot using `plot_name$data`, for example: `barrier_sampling_dist$data`.
- You can convert a 1x1 data frame to a number using `as.numeric()`.
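
Both hints can be sketched on a made-up plot (the `toy_props` data frame, the `toy_plot` object, and the other names below are hypothetical; substitute your own plot objects when answering):

```r
library(ggplot2)

# Hypothetical data frame of five sample proportions.
toy_props <- data.frame(p = c(0.2, 0.4, 0.4, 0.6, 0.8))

toy_plot <- ggplot(toy_props, aes(x = p)) +
    geom_histogram(binwidth = 0.2)

# The data frame passed to ggplot() is stored in the plot object.
plot_data <- toy_plot$data

# A 1x1 data frame can be converted to a plain number.
one_by_one <- data.frame(sd = sd(plot_data$p))
toy_sd <- as.numeric(one_by_one)
toy_sd
```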

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
standard_error
standard_deviation

In [None]:
test_3.6()

**Question 3.7** 
<br> {points: 1}

True or false?

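The standard deviation of a bootstrap distribution is a "good guess" of the standard deviation of the corresponding sampling distribution.

_Assign your answer to an object called `answer3.7`. Your answer should be either "true" or "false", surrounded by quotes._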

In [None]:
# answer3.7 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.7()

**Question 3.8** 
<br> {points: 3}

Will the standard deviation of a bootstrap distribution **always** be relatively close to the standard deviation of the corresponding sampling distribution?
- If no, describe one situation related to our root barrier scenario above that would result in the `standard_deviation` object from **question 3.6** being very different than the `standard_error` object.
- If yes, explain why no such situation exists.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 4. A Closer Look at Bootstrapping

There is one "rule" related to bootstrapping that we have not mentioned yet:

> When generating a bootstrap distribution to estimate the sampling distribution for the original sample size, the **bootstrap samples** should be the **same size** as the **original sample** to get a useful estimate.

For example, we would get poor results if we took a sample of size 30 from the population, and then took many bootstrap samples (resamples from the original sample, with replacement) of size 60 to estimate a sampling distribution for samples of size 30. Why? Let's try it out ourselves to discover the answer. Afterwards, we'll also go through some other questions to continue to solidify our understanding of the various nuances related to bootstrapping.
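
Before trying this out, it may help to see the mechanics of a single bootstrap sample. The sketch below uses base R's `sample()` on a made-up sample (`toy_sample`, the seed, and the other names are assumptions for illustration) rather than `rep_sample_n`; it resamples with replacement and keeps the resample the same size as the original sample, as the rule states.

```r
set.seed(42)  # arbitrary seed, for this sketch only

# Hypothetical original sample of 5 measurements.
toy_sample <- c(12.5, 3.0, 8.25, 15.0, 6.5)

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so some values may repeat and others may be left out.
boot_sample <- sample(toy_sample, size = length(toy_sample), replace = TRUE)
boot_sample

# Sampling WITHOUT replacement at the full sample size only reorders the values.
shuffled <- sample(toy_sample, size = length(toy_sample), replace = FALSE)
identical(sort(shuffled), sort(toy_sample))
```

Repeating the first `sample()` call many times, and computing a statistic on each resample, is what produces a bootstrap distribution.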

### Pissard plum

To explore the "rule of thumb" that we mentioned above, we will again use the `vancouver_trees` data set from the `datateachr` package. However, this time the population we are interested in is only the trees with the common name `"PISSARD PLUM"`, and the parameter that we are interested in is the standard deviation of the `diameter` of these trees.

In [None]:
head(vancouver_trees)

**Question 4.0** 
<br> {points: 1}

Filter the `vancouver_trees` dataset for the population that we are interested in and then select the variable that we are interested in (your final data frame should have a single column).

_Assign your data frame to an object called `plum_pop`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(plum_pop)

In [None]:
test_4.0()

**Question 4.1** 
<br> {points: 1}

Take a single random sample of size 10 from `plum_pop` using the `rep_sample_n` function and a seed of 0737. Ensure your resulting data frame only has a single column: `diameter`.

_Assign your data frame to an object called `plum_sample`._

In [None]:
set.seed(0737) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer

head(plum_sample)

In [None]:
test_4.1()

**Question 4.2** 
<br> {points: 1}

Take 2500 bootstrap samples **of size 100** from the sample you took in the previous question by using the `rep_sample_n` function and a seed of 9284. 

_Assign your data frame to an object called `plum_resamples`._

In [None]:
set.seed(9284) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer

head(plum_resamples)

In [None]:
test_4.2()

**Question 4.3** 
<br> {points: 3}

Calculate the standard deviation for each resample that you took in the previous question with `group_by()` and `summarize()`. Name the new column containing the standard deviation `sd`.

_Assign your data frame to an object called `resample_estimates`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(resample_estimates)

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "resample_estimates"', {
  expect_true(exists("resample_estimates"))
})
test_that("Solution should be a data frame", {
  expect_true("data.frame" %in% class(resample_estimates))
})

**Question 4.4** 
<br> {points: 3}

Visualize the bootstrap distribution (of `resample_estimates`) by plotting a histogram using `geom_histogram` with bin widths of 0.25. Ensure that the x-axis has a human-readable label.

_Assign your plot to an object called `plum_bootstrap_dist`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

plum_bootstrap_dist

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "plum_bootstrap_dist"', {
  expect_true(exists("plum_bootstrap_dist"))
})
test_that("Solution should be a ggplot object", {
  expect_true(is.ggplot(plum_bootstrap_dist))
})

**Question 4.5** 
<br> {points: 3}

Produce a sampling distribution (**not** a bootstrap distribution) of sample standard deviations for samples of size 10 from the population `plum_pop` using a procedure similar to the previous questions and the last section; use 2500 sample replicates and a seed of 2362. Then, visualize the distribution using a histogram. 

_Assign your plot to an object called `plum_sampling_dist`._

In [None]:
set.seed(2362) # DO NOT CHANGE!

# plum_sampling_dist <- 
#     ... %>% # this is a multiline command
#     ggplot(aes(x = sd)) +
#         geom_histogram(binwidth = 0.5) +
#         xlab("Standard Deviation")

# your code here
fail() # No Answer - remove if you provide an answer

plum_sampling_dist

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "plum_sampling_dist"', {
  expect_true(exists("plum_sampling_dist"))
})
test_that("Solution should be a ggplot object", {
  expect_true(is.ggplot(plum_sampling_dist))
})

In the code cell below, we have used `plot_grid` to plot the sampling distribution and bootstrap distribution side by side.

**Note:** some of the sample standard deviations are not visible because we have manually set bounds on the x-axis so you can compare the important parts of the distributions more easily.

_Use the two plots below to answer the next **three questions**._

In [None]:
options(repr.plot.width = 12, repr.plot.height = 4)
plot_grid(plum_sampling_dist +
              labs(title = "Sampling Distribution",
                   caption = "Generated using 2500 sample replicates of size 10.") +
              scale_x_continuous(limits = c(0, 10)),
          plum_bootstrap_dist +
              labs(title = "Bootstrap Distribution",
                   caption = "Generated using 2500 bootstrap samples of size 100 from a sample of size 10.") + 
              scale_x_continuous(limits = c(0, 10)),
          ncol = 2)

**Question 4.6** 
<br> {points: 3}

Which statement **best** describes the bootstrap distribution above?

A. The distribution of many point estimates for the standard deviation of the population, which were acquired by taking many samples from the population and calculating the standard deviation of each sample.

B. The distribution of many point estimates for the standard deviation of the sampling distribution (which is the standard error of the corresponding estimator), which were acquired by re-sampling from the original sample and calculating the standard deviation of each re-sample.

C. The distribution of the standard deviations of many samples that were taken from the population.

D. The distribution of standard deviations for many re-samples that were taken from the original sample.

_Assign your answer to an object called `answer4.6`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer4.6 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "answer4.6"', {
  expect_true(exists("answer4.6"))
})
test_that('Solution should be a single character ("A", "B", "C", or "D")', {
  expect_match(answer4.6, "a|b|c|d", ignore.case = TRUE)
})

**Question 4.7** 
<br> {points: 3}

By referencing the plots above, explain why it's not a good idea to take bootstrap samples of a **larger size than the original sample** to estimate the sampling distribution for the original sample size.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.8** 
<br> {points: 3}

Suppose you took a single sample of size 164 and then took many bootstrap samples of size 10 from the first sample to produce a bootstrap distribution for the mean of the `diameter` variable in the `plum_pop` population. Suppose you wanted to use the standard deviation of the bootstrap distribution to estimate the standard deviation of the sampling distribution of sample means for the `diameter` variable for samples of size 164. How would you expect the estimate to compare to the actual standard error?

A. The estimate would likely be an under-estimate.

B. The estimate would likely be accurate.

C. The estimate would likely be an over-estimate.

D. There is not enough information to make this comparison.

_Assign your answer to an object called `answer4.8`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer4.8 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "answer4.8"', {
  expect_true(exists("answer4.8"))
})
test_that('Solution should be a single character ("A", "B", "C", or "D")', {
  expect_match(answer4.8, "a|b|c|d", ignore.case = TRUE)
})

### More Bootstrapping Nuances

**Question 4.9** 
<br> {points: 3}

Suppose a bootstrap distribution of sample means of the `diameter` variable in `plum_pop` was created by using `rep_sample_n` to take a single sample of size 8 from the population and then 3000 bootstrap samples. The resulting distribution is displayed below with bin widths of 0.25:

<img src="plot.png" width=600/>

a) Given that the standard deviation of the `diameter` variable for the population `plum_pop` is around 5.0, is this a shape that you would expect the bootstrap distribution to have?

b) If you answered yes, justify your answer in 1-2 sentences. If you answered no, justify your answer in 1-2 sentences and, in one more sentence, describe an error or scenario that would result in such a distribution.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.10** 
<br> {points: 3}

Consider the following single random sample of 6 observations of the reported average hours of screen time a person is exposed to each day:

| `screen_time` <br> `<dbl>`|
| -- |
| 3 |
| 6 |
| 8 |
| 1 |
| 7 |
| 7 |
    
Below are two more data frames that are claimed to have been created by bootstrapping from the original sample.

| `screen_time` <br> `<dbl>`|
| -- |
| 6 |
| 7 |
| 6 |
| 7 |
| 7 |
| 1 |

| `screen_time` <br> `<dbl>`|
| -- |
| 7 |
| 1 |
| 7 |
| 3 |
| 6 |
| 8 |

Consider the values in the two data frames above. Do you agree that the two data frames above were bootstrapped samples? Explain why or why not in your own words in a few sentences.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.