# Week 3 - Fundamentals of Sampling and Variability

By the end of this worksheet, you will be able to:

- Distinguish between a population and a sample from a finite population. 
- Define commonly used population parameters and their corresponding point estimates.   
- Identify the population, sample, parameters and point estimates in a given scenario. 
- Explain what sampling variability is and how it arises from samples drawn at random. 
- Define standard error of an estimator and describe how it relates to sampling variability.  
- Distinguish between sampling from a finite population and model-based approaches to randomness. 
- Classify random variables as discrete or continuous, with supporting examples. 
- Recognize common types of probability distributions and when they are appropriate. 

## Getting Started

Let's load the necessary packages we will be using throughout the worksheet!

In [None]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(infer))

Cherry blossom trees have become an iconic part of Vancouver’s spring landscape. They were first introduced in the 1930s through a gift from the mayors of Kobe and Yokohama, Japan, as a gesture of goodwill and to honour Japanese-Canadian veterans. Over the decades, the number of cherry trees in the city has grown significantly, and today they are planted widely across Vancouver’s streets and parks.

<div style="text-align: center;">
  <img 
    src="https://images.unsplash.com/photo-1738653904986-43ffbad2e59a?q=80&w=1740&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" 
    alt="Cherry blossom tree" 
    style="width: 400px; max-width: 100%;"
  />
</div>

In this activity, we will focus on the three most common cherry blossom cultivars in Vancouver: Akebono, Kanzan, and Shirofugen. These varieties not only represent a large portion of the city’s cherry tree population but also bloom in a predictable sequence each spring. Akebono trees bloom first, typically in early April, followed by Kanzan about one to two weeks later, and then Shirofugen another one to two weeks after that. For more details, check out [this article](https://news.ubc.ca/2023/03/finding-the-best-cherry-blossoms-sakura-trees-in-vancouver/#:~:text=We%20actually%20have%2055%20different,week%20or%20two%20after%20Kanzan.)!



For the purposes of this worksheet, we will define cherry blossom trees as those of the Akebono, Kanzan, or Shirofugen cultivars and use them to explore ideas of sampling and variability using data adapted from Vancouver’s public tree inventory ([https://opendata.vancouver.ca/explore/dataset/public-trees/](https://opendata.vancouver.ca/explore/dataset/public-trees/)).



In [None]:
#Reading the data
trees <- read_csv("data/vancouver_trees.csv")

In [None]:
head(trees)

In [None]:
#Checking the structure of the data
glimpse(trees) 

In this worksheet, we will explore key ideas in sampling variability by working with data from the City of Vancouver’s public tree inventory, which includes information on over 183,000 individual trees. For the purposes of this activity, we will treat this dataset as a finite population. From this population, we will take samples to estimate characteristics such as average tree diameter, the proportion of cherry blossom trees, and other summary measures. Through this process, you will learn to distinguish between a population and a sample, identify parameters and point estimates, and observe how results can vary from sample to sample due to randomness. This variability in sample outcomes is known as sampling variability, and understanding it is key to interpreting data accurately. You will also be introduced to the concept of standard error and how it relates to the variability of point estimates. Finally, we will connect this work to broader probability concepts, including random variables, sample spaces, and distributions, while considering how sampling from a finite population differs from using model-based approaches.
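
To make the terms population, sample, parameter, and point estimate concrete, here is a minimal base-R sketch on a made-up population. The vector `toy_population` and its 12% share of cherry blossoms are assumptions for illustration only; they are not values from the Vancouver data.

```r
# Hypothetical toy population: 1,000 trees, 120 of which are cherry blossoms.
set.seed(1)
toy_population <- c(rep(TRUE, 120), rep(FALSE, 880))

# Population parameter: the true proportion across the whole population.
p <- mean(toy_population)

# One simple random sample of 50 trees, drawn without replacement.
toy_sample <- sample(toy_population, size = 50)

# Point estimate: the same proportion, computed from the sample only.
p_hat <- mean(toy_sample)

c(parameter = p, point_estimate = p_hat)
```

The parameter is fixed once the population is fixed, while the point estimate depends on which 50 trees happen to be drawn.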

---

### Warm-up

**Question 1**

For each of the following questions, assign the correct option to the given variable name in quotes. For example, if the correct answer is option B, enter `answer1.1 <- "B"` in the answer cell.


1.1. Which of the following best describes the population of interest?

- A. The cherry blossom trees in one neighbourhood
- B. A random sample of 100 trees from the dataset
- C. All 183,473 trees recorded in the Vancouver tree dataset
- D. All the trees in Canada


In [None]:
# answer1.1 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer1.1 is not character"= setequal(digest(paste(toString(class(answer1.1)), "c6fcd")), "daa5d0bca52e5216c22f39b8285c77a8"))
stopifnot("length of answer1.1 is not correct"= setequal(digest(paste(toString(length(answer1.1)), "c6fcd")), "4522d10fb1060cb7fbeaf805e96549da"))
stopifnot("value of answer1.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.1)), "c6fcd")), "dae2de16122b45f6af503503a408dd6e"))
stopifnot("letters in string value of answer1.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.1), "c6fcd")), "9a8db0074bb5ce268bbf8ff39eb89c6d"))

print('Success!')

1.2. Suppose you randomly select 200 trees from the dataset to examine their species and diameter.
What is the sample in this context?

- A. The 200 randomly selected trees
- B. The average diameter of all 183,473 trees
- C. The cherry blossom trees among the 183,473
- D. All trees in the city of Vancouver

In [None]:
# answer1.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer1.2 is not character"= setequal(digest(paste(toString(class(answer1.2)), "25c54")), "d89134f0a13882fc082c3b0bf06a1dbc"))
stopifnot("length of answer1.2 is not correct"= setequal(digest(paste(toString(length(answer1.2)), "25c54")), "2c86c22fc7d1b0d54074f259bc5f4410"))
stopifnot("value of answer1.2 is not correct"= setequal(digest(paste(toString(tolower(answer1.2)), "25c54")), "c91c5802a6589afdb554d2429bdbc2ba"))
stopifnot("letters in string value of answer1.2 are correct but case is not correct"= setequal(digest(paste(toString(answer1.2), "25c54")), "a25176fb74cadc5665945275d2a473ba"))

print('Success!')

1.3. Let’s say we are interested in the true proportion of cherry blossom trees (Akebono, Kanzan, and Shirofugen) in the Vancouver tree population. What is the population **parameter** in this context?

- A. The proportion of cherry blossom trees in a sample of 200
- B. The mean height of a sample of trees
- C. The true proportion of cherry blossom trees in all 183,473 trees
- D. The number of neighbourhoods with cherry blossom trees

In [None]:
# answer1.3 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer1.3 is not character"= setequal(digest(paste(toString(class(answer1.3)), "e4677")), "e94da9577bf3d578e5a2c25fb50eca77"))
stopifnot("length of answer1.3 is not correct"= setequal(digest(paste(toString(length(answer1.3)), "e4677")), "da18bf1e70bc68d6bb7d53b9af3ff9c1"))
stopifnot("value of answer1.3 is not correct"= setequal(digest(paste(toString(tolower(answer1.3)), "e4677")), "ca6cad040382def4b745230d3597eddc"))
stopifnot("letters in string value of answer1.3 are correct but case is not correct"= setequal(digest(paste(toString(answer1.3), "e4677")), "d6e73778db5cb86a4132823e1b664ebb"))

print('Success!')

1.4. Suppose that in your random sample of 200 trees, 42 are cherry blossom trees.
What is the **point estimate** for the proportion of cherry blossom trees in the population?

- A. 42
- B. The total number of cherry blossom trees in the full data set
- C. The proportion of cherry blossom trees in the full data set
- D. 42/200

In [None]:
# answer1.4 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer1.4 is not character"= setequal(digest(paste(toString(class(answer1.4)), "54852")), "6fb3f05f096d7532b346e05914a8350c"))
stopifnot("length of answer1.4 is not correct"= setequal(digest(paste(toString(length(answer1.4)), "54852")), "8c442a0503b0d8b961c61f7ed255a8f7"))
stopifnot("value of answer1.4 is not correct"= setequal(digest(paste(toString(tolower(answer1.4)), "54852")), "3041c5e2a158beb68803ba5a673ec3e6"))
stopifnot("letters in string value of answer1.4 are correct but case is not correct"= setequal(digest(paste(toString(answer1.4), "54852")), "4f3e608e4842e548d85099b8126a562e"))

print('Success!')

---
**Question 2**

This bar chart shows the proportion of cherry blossom trees compared to other trees in the entire population of 183,473 public trees recorded in the City of Vancouver. It visualizes how common cherry blossoms are relative to all other tree types in the dataset. We can see that about 7% of the trees in our population are cherry blossoms.

In [None]:
pop_counts <- trees |>
  mutate(Type = ifelse(CHERRY_BLOSSOM, "Cherry Blossom", "Other")) |>
  count(Type) |>
  mutate(Proportion = n / sum(n))

ggplot(pop_counts, aes(x = Type, y = Proportion, fill = Type)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  scale_fill_manual(values = c("Cherry Blossom" = "#D8A7B1",  
                   "Other" = "#A8C3A0")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Proportion of Cherry Blossom Trees in the Population",
    y = "Proportion",
    x = ""
  ) +
  geom_text(aes(label = scales::percent(Proportion, accuracy = 1)),
            vjust = -0.5, size = 6, fontface = "bold") +
  theme_minimal(base_size = 14)

Use the `rep_sample_n()` function from the infer package to take a random sample of 100 trees from the dataset, then calculate the proportion of cherry blossom trees in your sample and store it as `answer2`. Compare your result to the population parameter of 0.07. Set the random seed to 200 to ensure your result is reproducible.

In [None]:
set.seed(200)

# answer2 <- trees |>
#     ...(...) |>
#     ...(prop_blossom = mean(..., na.rm=TRUE)) |>
#     ...(prop_blossom)

 # YOUR CODE HERE
 fail()

answer2

In [None]:
library(digest)
stopifnot("type of answer2 is not numeric"= setequal(digest(paste(toString(class(answer2)), "622ee")), "15536ac780cec99cf44ff04a00c4773c"))
stopifnot("value of answer2 is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(answer2, 2)), "622ee")), "4a10f48bc0365dde073616c2b6bfc16c"))
stopifnot("length of answer2 is not correct"= setequal(digest(paste(toString(length(answer2)), "622ee")), "7f890e8c3758e3ecb413ab5e12de424d"))
stopifnot("values of answer2 are not correct"= setequal(digest(paste(toString(sort(round(answer2, 2))), "622ee")), "4a10f48bc0365dde073616c2b6bfc16c"))

print('Success!')

**Question 3**

In the previous question, you calculated the proportion of cherry blossom trees in your random sample of 100 trees. The true proportion of cherry blossom trees in the full population is 0.07 (7%).

Which of the following best explains why your sample proportion might be higher or lower than 0.07? Assign your response in quotation marks, for example `answer3 <- "E"`.

- A. There must be an error in the dataset
- B. The sample proportion is always exactly equal to the population proportion
- C. Point estimates naturally differ from the population parameter due to sampling variability
- D. Cherry blossom trees were deliberately over-sampled in this case

In [None]:
# answer3 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer3 is not character"= setequal(digest(paste(toString(class(answer3)), "f0180")), "39748484aba87f9ea85f6dc3970d3aae"))
stopifnot("length of answer3 is not correct"= setequal(digest(paste(toString(length(answer3)), "f0180")), "3596ef2c0b345483ed940672239b1d0f"))
stopifnot("value of answer3 is not correct"= setequal(digest(paste(toString(tolower(answer3)), "f0180")), "e8adbbeffb7ad0dea1e8de8bde515a34"))
stopifnot("letters in string value of answer3 are correct but case is not correct"= setequal(digest(paste(toString(answer3), "f0180")), "ca4d094429bba71ddf480414219fbdee"))

print('Success!')

**Question 4**

In the previous question, you drew a single random sample of 100 trees and noticed that the sample proportion of cherry blossom trees was different from the population proportion of 0.07. But was that just a fluke? To explore this question, we’ll now draw many samples of the same size and look at how the sample proportions vary.

Use `rep_sample_n()` to take 1000 samples of size 100 from the trees data set. Then, calculate the proportion of cherry blossom trees in each sample.

In [None]:
set.seed(200)

# answer4 <- trees |>
#     ...(...) |>
#     ...(...) |>
#     ...(prop_blossom = mean(...))

# YOUR CODE HERE
fail()

head(answer4)

In [None]:
library(digest)
stopifnot("answer4 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(answer4)), "cb447")), "1c013c2dfdb94507092c5d8fd0db39c9"))
stopifnot("dimensions of answer4 are not correct"= setequal(digest(paste(toString(dim(answer4)), "cb447")), "e014e8ab2bc774106fe9fc07eb82f433"))
stopifnot("column names of answer4 are not correct"= setequal(digest(paste(toString(sort(colnames(answer4))), "cb447")), "87785e958edb44c58df1ec4de6966efa"))
stopifnot("types of columns in answer4 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(answer4, class)))), "cb447")), "b43c9167da2c1432cf15951df95b7a03"))
stopifnot("values in one or more numerical columns in answer4 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer4, is.numeric))) sort(round(sapply(answer4[, sapply(answer4, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "cb447")), "19942ea50a8711de9606412b68fa044e"))
stopifnot("values in one or more character columns in answer4 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer4, is.character))) sum(sapply(answer4[sapply(answer4, is.character)], function(x) length(unique(x)))) else 0), "cb447")), "602566cfb266d7ee14a5551235cbe2ed"))
stopifnot("values in one or more factor columns in answer4 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer4, is.factor))) sum(sapply(answer4[, sapply(answer4, is.factor)], function(col) length(unique(col)))) else 0), "cb447")), "602566cfb266d7ee14a5551235cbe2ed"))

print('Success!')

**Question 5**

When we take many random samples from a population and calculate a statistic like the proportion of cherry blossom trees, those statistics will vary from sample to sample. Visualizing this variation helps us understand sampling variability and how reliable a single point estimate might be.

In the previous question, you created a data set (`answer4`) containing the proportion of cherry blossom trees for 1000 different samples of size 100. Now, create a histogram to visualize the distribution of these sample proportions (i.e., the sampling distribution of the sample proportions).

In [None]:
# answer5 <- ggplot(..., aes(x = ...)) +
#     ...(binwidth = 0.005, fill = "#D8A7B1", color = "white") +
#     labs(title = "...",
#          x = "...",
#          y = "...") +
#     theme_minimal()

# YOUR CODE HERE
fail()

answer5

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer5$layers)), function(i) {c(class(answer5$layers[[i]]$geom))[1]})), "776e1")), "c65766cb8c490bd43a4d709af12baae7"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer5$layers)), function(i) {rlang::get_expr(c(answer5$layers[[i]]$mapping, answer5$mapping)$x)}), as.character))), "776e1")), "4034d0f761589d249b060680c9e742ef"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer5$layers)), function(i) {rlang::get_expr(c(answer5$layers[[i]]$mapping, answer5$mapping)$y)}), as.character))), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$x)!= answer5$labels$x), "776e1")), "d672d9da4a647e99a5c48dc85aa9253c"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$y)!= answer5$labels$y), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("incorrect colour variable in answer5, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$colour)), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("incorrect shape variable in answer5, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$shape)), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("the colour label in answer5 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$colour) != answer5$labels$colour), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("the shape label in answer5 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer5$layers[[1]]$mapping, answer5$mapping)$colour) != answer5$labels$shape), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("fill variable in answer5 is not correct"= setequal(digest(paste(toString(quo_name(answer5$mapping$fill)), "776e1")), "1558a5e020f2b16fb38f358677fcfc4f"))
stopifnot("fill label in answer5 is not informative"= setequal(digest(paste(toString((quo_name(answer5$mapping$fill) != answer5$labels$fill)), "776e1")), "a00611d41e0c2d454d340e6344b5193e"))
stopifnot("position argument in answer5 is not correct"= setequal(digest(paste(toString(class(answer5$layers[[1]]$position)[1]), "776e1")), "90dff1017ce1e684f9053a0d3abcb65e"))

print('Success!')

Reflect on the shape of the distribution. Is it symmetric? Skewed? Centered around a particular value? What is the spread of the distribution?

**Question 6**

We will now investigate how changing the sample size affects the variability of the sample proportions.

Repeat the same process as in question 4, but with different sample sizes: 50 and 500. That is: 

1. Use `rep_sample_n()` to take 1000 samples each of sizes 50 and 500 from `trees`.
2. Calculate the proportion of cherry blossom trees in each sample.
3. Add a column 'n' indicating the sample size.
4. Then bind these results together with your previous `answer4` (size 100 samples) in the order of $n=50, 100, 500$. 

In [None]:
set.seed(200)

# samples_50 <- ...

# samples_500 <- ...

# samples_100 <- ...

# all_samples <- bind_rows(samples_50, samples_100, samples_500)

# YOUR CODE HERE
fail()

head(all_samples)

In [None]:
library(digest)
stopifnot("all_samples should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(all_samples)), "4245a")), "e808b56e4c59c1e81974452b2f13c922"))
stopifnot("dimensions of all_samples are not correct"= setequal(digest(paste(toString(dim(all_samples)), "4245a")), "2641d761b50e2c4f8933d970bda3b3b8"))
stopifnot("column names of all_samples are not correct"= setequal(digest(paste(toString(sort(colnames(all_samples))), "4245a")), "41673ea484febef0a1b0b51cc9849cb5"))
stopifnot("types of columns in all_samples are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(all_samples, class)))), "4245a")), "77807207879eb49094170800e0a5b74b"))
stopifnot("values in one or more numerical columns in all_samples are not correct"= setequal(digest(paste(toString(if (any(sapply(all_samples, is.numeric))) sort(round(sapply(all_samples[, sapply(all_samples, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "4245a")), "cee8ecb6fc378b4e313fc2ba1721789a"))
stopifnot("values in one or more character columns in all_samples are not correct"= setequal(digest(paste(toString(if (any(sapply(all_samples, is.character))) sum(sapply(all_samples[sapply(all_samples, is.character)], function(x) length(unique(x)))) else 0), "4245a")), "07851a216dec5924c7baf1180cd0d49f"))
stopifnot("values in one or more factor columns in all_samples are not correct"= setequal(digest(paste(toString(if (any(sapply(all_samples, is.factor))) sum(sapply(all_samples[, sapply(all_samples, is.factor)], function(col) length(unique(col)))) else 0), "4245a")), "07851a216dec5924c7baf1180cd0d49f"))

print('Success!')

**Question 7**

Now, let's take a look at histograms of the sampling distribution for each sample size, $n=50, 100, 500$.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 15)

ggplot(all_samples, aes(x = prop_blossom)) +
  geom_histogram(binwidth = 0.02, fill = "#D8A7B1", color = "white") +
  facet_wrap(~ n, ncol = 1) +
  labs(
    title = "Sampling Distributions by Sample Size",
    x = "Sample Proportion",
    y = "Number of Samples"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 14),
    strip.text = element_text(size = 16, face = "bold")
  )

Answer the following multiple choice questions based on the plot above. For each of the following questions, assign the correct option to the given variable in quotes. For example, if the correct answer is option B, enter `answer7.1 <- "B"` in the answer cell.

7.1. Which of the following statements best describes the effect of increasing sample size on the histogram of the sampling distribution of the sample proportion?

- A) Increasing sample size increases the spread of the sampling distribution
- B) Increasing sample size decreases the spread of the sampling distribution
- C) Increasing sample size has no effect on the shape or spread
- D) Increasing sample size makes the histogram more skewed

In [None]:
# answer7.1 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer7.1 is not character"= setequal(digest(paste(toString(class(answer7.1)), "9b7d4")), "91e2f0e81fb0c70e9fc0308e6b78a465"))
stopifnot("length of answer7.1 is not correct"= setequal(digest(paste(toString(length(answer7.1)), "9b7d4")), "0315dd4388fb739b24cf346a778683e4"))
stopifnot("value of answer7.1 is not correct"= setequal(digest(paste(toString(tolower(answer7.1)), "9b7d4")), "a283df34cad3738053d709b2d59224e6"))
stopifnot("letters in string value of answer7.1 are correct but case is not correct"= setequal(digest(paste(toString(answer7.1), "9b7d4")), "c7e904f590b4b7f2c1d7b8bebed05ba4"))

print('Success!')

7.2. Which of the following statements best describes the central tendency of the sampling distributions?

- A) The sampling distributions are centered around 0.5 regardless of the true population proportion.
- B) The sampling distributions are centered around the sample size, $n$.
- C) The sampling distributions are centered around the population proportion.
- D) The sampling distributions have no specific center and vary widely.

In [None]:
# answer7.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer7.2 is not character"= setequal(digest(paste(toString(class(answer7.2)), "d49a7")), "6efc938a8d146883ffc99b1ec1570d09"))
stopifnot("length of answer7.2 is not correct"= setequal(digest(paste(toString(length(answer7.2)), "d49a7")), "3aa7c29e42016226c51c12688bbcdc99"))
stopifnot("value of answer7.2 is not correct"= setequal(digest(paste(toString(tolower(answer7.2)), "d49a7")), "c4f7f2b8c509c99eb2cade9aa6b1e315"))
stopifnot("letters in string value of answer7.2 are correct but case is not correct"= setequal(digest(paste(toString(answer7.2), "d49a7")), "11b5cbcb8287c657754f23526f260bfa"))

print('Success!')

**Question 8**

In practice, we often want to understand how much sampling variability we can expect in a proportion estimated from a single sample. This variability is captured by the standard error (SE): a measure of how much we expect a sample proportion to vary from sample to sample.

One way to estimate the SE is by simulating many repeated samples from the population (like we have done above), calculating the sample proportion in each, and then computing the standard deviation of those sample proportions. This standard deviation of the sampling distribution is referred to as the empirical standard error.
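
The simulation idea can be sketched in base R on a made-up population. This is an illustration under stated assumptions, not the worksheet's solution: `toy_population`, the helper `one_sample_prop()`, and the sample size of 40 are all hypothetical choices.

```r
set.seed(2)

# Hypothetical toy population: 1,000 trees, 120 of which are cherry blossoms.
toy_population <- c(rep(TRUE, 120), rep(FALSE, 880))

# Proportion of TRUE values in one random sample of size n (without replacement).
one_sample_prop <- function(n) mean(sample(toy_population, size = n))

# 1,000 repeated samples of size 40; each sample yields one sample proportion.
sample_props <- replicate(1000, one_sample_prop(40))

# Empirical standard error: the standard deviation of those sample proportions.
empirical_se <- sd(sample_props)
empirical_se
```

Rerunning this with a larger sample size in place of 40 is a quick way to see how the empirical standard error responds.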

8.1 Use `all_samples` to calculate the empirical standard deviation of the sample proportions for each sample size. What pattern do you notice as sample size increases? Note: you should have two columns named `n` and `sd`, respectively.

In [None]:
# answer8.1 <- ... |>
#   group_by(...) |>
#   ...

# YOUR CODE HERE
fail()

answer8.1

In [None]:
library(digest)
stopifnot("answer8.1 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(answer8.1)), "88a70")), "b93dbd44e23f1cbcd7453d9d4ef67393"))
stopifnot("dimensions of answer8.1 are not correct"= setequal(digest(paste(toString(dim(answer8.1)), "88a70")), "ab3dff554246723a237c7dda7e9f19b2"))
stopifnot("column names of answer8.1 are not correct"= setequal(digest(paste(toString(sort(colnames(answer8.1))), "88a70")), "042370b041df0e26e18bdfd96f4640a6"))
stopifnot("types of columns in answer8.1 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(answer8.1, class)))), "88a70")), "b9feeeddefbe8ce9aa37889193b93c28"))
stopifnot("values in one or more numerical columns in answer8.1 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer8.1, is.numeric))) sort(round(sapply(answer8.1[, sapply(answer8.1, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "88a70")), "72fa2fae8bf35b6e05fcf2936cdb375d"))
stopifnot("values in one or more character columns in answer8.1 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer8.1, is.character))) sum(sapply(answer8.1[sapply(answer8.1, is.character)], function(x) length(unique(x)))) else 0), "88a70")), "bba9c15397b5588150afccc1d2dfd72f"))
stopifnot("values in one or more factor columns in answer8.1 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer8.1, is.factor))) sum(sapply(answer8.1[, sapply(answer8.1, is.factor)], function(col) length(unique(col)))) else 0), "88a70")), "bba9c15397b5588150afccc1d2dfd72f"))

print('Success!')

8.2. The standard error (SE) measures how much we expect a statistic like the sample proportion to vary from sample to sample, if we could take many samples from the population.

Which of the following best explains why we typically estimate the standard error from a single sample, instead of computing it directly from a simulated sampling distribution?

- A. Because we rarely have access to the entire population to draw repeated samples from
- B. Because it's impossible
- C. Because random samples don’t vary much
- D. Because it's unethical to collect more than one sample

In [None]:
# answer8.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer8.2 is not character"= setequal(digest(paste(toString(class(answer8.2)), "5c170")), "e14176d9f25845ec213be7cc04638f04"))
stopifnot("length of answer8.2 is not correct"= setequal(digest(paste(toString(length(answer8.2)), "5c170")), "cd551964a03e4d5436d56e6b0f78b38e"))
stopifnot("value of answer8.2 is not correct"= setequal(digest(paste(toString(tolower(answer8.2)), "5c170")), "d2cc4953cf3e8fc3e6da0e675c24e762"))
stopifnot("letters in string value of answer8.2 are correct but case is not correct"= setequal(digest(paste(toString(answer8.2), "5c170")), "db3e0a6eb3766fe5ed253f4efac1b88a"))

print('Success!')

**Question 9**

In the previous exercise, we used repeated samples from the full Vancouver trees data set to simulate a sampling distribution of the sample proportion. From that, we computed the empirical standard deviation of the sample proportions (an estimate of the standard error).

But in practice, we almost never have access to the entire population, so we can’t take thousands of samples like we did here. Instead, we typically only have one sample. So how can we still estimate how much our sample proportion might vary from sample to sample?

Fortunately, there is a theoretical formula we can use to estimate the standard error of a sample proportion, based on just a single sample:

$$SE(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$
where $\hat{p}$ is the observed sample proportion (e.g., the proportion of cherry blossom trees) and $n$ is the sample size.

This formula gives us an estimate of the expected variability in sample proportions due to random sampling.
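
As a worked illustration with made-up numbers (not taken from the worksheet data), suppose a sample of $n = 150$ trees contains 30 cherry blossom trees. Then

$$\hat{p}=\frac{30}{150}=0.2, \qquad SE(\hat{p})=\sqrt{\frac{0.2\,(1-0.2)}{150}}=\sqrt{\frac{0.16}{150}}\approx 0.033.$$

In other words, a sample proportion from a sample of this size would typically differ from the population proportion by roughly three percentage points.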

Now, using the sample proportion you previously calculated in `answer2` from a sample of size 100, write R code to compute the standard error of this sample proportion.  Note that your answer should be a single numeric value.

In [None]:
# answer9 <- ...

# YOUR CODE HERE
fail()

answer9

In [None]:
library(digest)
stopifnot("type of answer9 is not numeric"= setequal(digest(paste(toString(class(answer9)), "f1f31")), "d02912e14c9461259f7a2ab1dfdea09f"))
stopifnot("value of answer9 is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(answer9, 2)), "f1f31")), "e986544f1a0b5c9ffc5921b2f6ae6dca"))
stopifnot("length of answer9 is not correct"= setequal(digest(paste(toString(length(answer9)), "f1f31")), "78e83dd1a02175d02c04d382297e3c32"))
stopifnot("values of answer9 are not correct"= setequal(digest(paste(toString(sort(round(answer9, 2))), "f1f31")), "e986544f1a0b5c9ffc5921b2f6ae6dca"))

print('Success!')

Compare the standard error you just calculated using the formula to the empirical standard error estimated from the sampling distribution for $n=100$. Are the values close? What does this tell you about using the formula to estimate the standard error in practice?

YOUR ANSWER HERE

---
Up until now, we've focused on sampling from a finite population, like our Vancouver Trees data set, and have explored how point estimates like the proportion of cherry blossom trees can vary from sample to sample. This approach assumes the population is fixed and that all randomness comes from the act of sampling.

Now, we’ll explore a different perspective: the model-based approach to randomness. In this approach, we consider probability models to describe uncertainty, variability, and randomness, whether or not there is a fixed population. These models allow us to describe random variables and how data might behave under certain assumptions, rather than relying solely on sampling.
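
As a small sketch of the model-based view, consider rolling a fair six-sided die. The die is an assumed example that does not come from the tree data: there is no finite list of "all rolls" to sample from, only a probability model that assigns probability 1/6 to each face.

```r
set.seed(3)

# Probability model for one roll of a fair die: each face has probability 1/6.
faces <- 1:6
face_probs <- rep(1 / 6, 6)

# Draw 10 independent rolls from the model. Because there is no fixed
# population to exhaust, values are drawn with replacement.
rolls <- sample(faces, size = 10, replace = TRUE, prob = face_probs)
rolls

# The model's expected value, computed from the probabilities alone.
expected_roll <- sum(faces * face_probs)
expected_roll
```

Here the randomness comes from the model itself, and quantities such as the expected value are properties of the model, not of any particular data set.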

**Question 10**

Imagine we had more data about the City of Vancouver, including the following variables:
- `tree_canopy_cover`: the percentage of land area in a neighborhood covered by tree canopy
- `is_park_space`: whether a land parcel is designated as public park space (TRUE/FALSE)
- `pollution_level`: measured air pollution index 
- `num_traffic_signals`: number of traffic signals in a given neighborhood
- `building_height`: the height of a building, in meters

Indicate whether each variable is discrete ("D") or continuous ("C"), and store your answer in a tibble called `answer10`. For each one, assign either "D" for discrete or "C" for continuous by replacing the `...` in the corresponding row under `variable_type`.

In [None]:
# answer10 <- tibble(
#   name = c(
#     "tree_canopy_cover",
#     "is_park_space",
#     "pollution_level",
#     "num_traffic_signals",
#     "building_height"
#   ),
#   variable_type = c(
#     "...",  # tree_canopy_cover
#     "...",  # is_park_space
#     "...",  # pollution_level
#     "...",  # num_traffic_signals
#     "..."   # building_height
#   )
# )

# YOUR CODE HERE
fail()

answer10

In [None]:
library(digest)
stopifnot("answer10 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(answer10)), "d95d3")), "8efd9950161b392078544f248e854a32"))
stopifnot("dimensions of answer10 are not correct"= setequal(digest(paste(toString(dim(answer10)), "d95d3")), "349b1e95828c987a22a7ae0549789df0"))
stopifnot("column names of answer10 are not correct"= setequal(digest(paste(toString(sort(colnames(answer10))), "d95d3")), "890d396bc1534e48ddcfd442b23bde27"))
stopifnot("types of columns in answer10 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(answer10, class)))), "d95d3")), "0e8c83f6e4bf4874bcf2bfdfbf252e4f"))
stopifnot("values in one or more numerical columns in answer10 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer10, is.numeric))) sort(round(sapply(answer10[, sapply(answer10, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "d95d3")), "46320f155fa3a728a0bebf1fd3185a47"))
stopifnot("values in one or more character columns in answer10 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer10, is.character))) sum(sapply(answer10[sapply(answer10, is.character)], function(x) length(unique(x)))) else 0), "d95d3")), "621b50e646fefcac295bdc19f2f26996"))
stopifnot("values in one or more factor columns in answer10 are not correct"= setequal(digest(paste(toString(if (any(sapply(answer10, is.factor))) sum(sapply(answer10[, sapply(answer10, is.factor)], function(col) length(unique(col)))) else 0), "d95d3")), "46320f155fa3a728a0bebf1fd3185a47"))

print('Success!')

**Question 11**

Now that we have identified the types of variables as either discrete or continuous, we want to think about how to model the randomness in these variables using probability distributions. Different types of variables and situations are best described by different probability distributions.

In the next set of questions, you will see scenarios based on the variables above. Your task is to select the most appropriate probability distribution to model each situation.
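As a rough orientation, base R's `stats` package provides a function for each of the four distribution families that appear in the options: `dbinom()`, `dpois()`, `dnorm()`, and `dunif()`. The sketch below simply evaluates each one at a single point. All parameter values are arbitrary assumptions chosen for illustration and are not tied to any of the scenarios.

```r
# For discrete families, the d-function returns a probability P(X = x).
binom_prob <- dbinom(3, size = 10, prob = 0.5)  # 3 successes in 10 trials
pois_prob  <- dpois(2, lambda = 3)              # a count of 2 when the mean count is 3

# For continuous families, the d-function returns a density height, not a probability.
norm_dens <- dnorm(0, mean = 0, sd = 1)         # standard normal density at 0
unif_dens <- dunif(0.5, min = 0, max = 1)       # flat density on the interval [0, 1]

c(binomial = binom_prob, poisson = pois_prob, normal = norm_dens, uniform = unif_dens)
```

When choosing among the families, it helps to ask what values the variable can take (counts or a continuum) and what parameters the model needs.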

11.1. You randomly select 15 land parcels and record whether each is designated as park space (`is_park_space: TRUE/FALSE`). You want to model the number of parcels that are park space.

Which distribution best models this?

- A) Uniform
- B) Poisson
- C) Normal
- D) Binomial

In [None]:
# answer11.1 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer11.1 is not character"= setequal(digest(paste(toString(class(answer11.1)), "d3d6a")), "6ee3234375ed935e8ed4520cb3bf730c"))
stopifnot("length of answer11.1 is not correct"= setequal(digest(paste(toString(length(answer11.1)), "d3d6a")), "177750b033f3fdb94d32e52893c1ed00"))
stopifnot("value of answer11.1 is not correct"= setequal(digest(paste(toString(tolower(answer11.1)), "d3d6a")), "7275c350d2c699236c53de30b8ac15f4"))
stopifnot("letters in string value of answer11.1 are correct but case is not correct"= setequal(digest(paste(toString(answer11.1), "d3d6a")), "836556a21ec664d9f94bbd4c2bf7758b"))

print('Success!')

11.2. You count the number of traffic signals (`num_traffic_signals`) in each of several neighborhoods. The count varies from neighborhood to neighborhood, and signals are assumed to occur independently of one another.

Which distribution best fits this data?

- A) Uniform
- B) Poisson
- C) Normal
- D) Binomial

In [None]:
# answer11.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer11.2 is not character"= setequal(digest(paste(toString(class(answer11.2)), "735bc")), "5a8011676d927b12a99a8bcef379d8f7"))
stopifnot("length of answer11.2 is not correct"= setequal(digest(paste(toString(length(answer11.2)), "735bc")), "0e561e8a009981d35314f4108cade4f3"))
stopifnot("value of answer11.2 is not correct"= setequal(digest(paste(toString(tolower(answer11.2)), "735bc")), "3dd06b31f691cd171a3ca80f2d5c4429"))
stopifnot("letters in string value of answer11.2 are correct but case is not correct"= setequal(digest(paste(toString(answer11.2), "735bc")), "0635d0648435b094c98c896ed06546f7"))

print('Success!')

11.3. You measure building heights (`building_height`) in meters. Heights vary smoothly around an average value with some natural variation.

Which distribution best models this?

- A) Uniform
- B) Poisson
- C) Normal
- D) Binomial

In [None]:
# answer11.3 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer11.3 is not character"= setequal(digest(paste(toString(class(answer11.3)), "49f31")), "0c8ad7e25c075c841fe96233be93216b"))
stopifnot("length of answer11.3 is not correct"= setequal(digest(paste(toString(length(answer11.3)), "49f31")), "018dfa4cd4e1012722e05a7dbf07f7a8"))
stopifnot("value of answer11.3 is not correct"= setequal(digest(paste(toString(tolower(answer11.3)), "49f31")), "8178bf3454742bea17006adf680bbd44"))
stopifnot("letters in string value of answer11.3 are correct but case is not correct"= setequal(digest(paste(toString(answer11.3), "49f31")), "e6fc63506733dae9aa4eaa3a29aca21b"))

print('Success!')

11.4. The percentage of tree canopy cover (`tree_canopy_cover`) in different neighborhoods is assumed to be equally likely anywhere from 0% to 100%.

Which distribution best fits this assumption?

- A) Uniform
- B) Poisson
- C) Normal
- D) Binomial

In [None]:
# answer11.4 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer11.4 is not character"= setequal(digest(paste(toString(class(answer11.4)), "c8104")), "ac73223a83baaeb7ec415106b951b68f"))
stopifnot("length of answer11.4 is not correct"= setequal(digest(paste(toString(length(answer11.4)), "c8104")), "80cb34fe9d994806294991e305ef7dad"))
stopifnot("value of answer11.4 is not correct"= setequal(digest(paste(toString(tolower(answer11.4)), "c8104")), "b4087a1f37357e7ecd4c4563c03380e4"))
stopifnot("letters in string value of answer11.4 are correct but case is not correct"= setequal(digest(paste(toString(answer11.4), "c8104")), "3bf350f8153644fd5a3c27e6f4e2b01e"))

print('Success!')

> Great job! 🎉 You've practiced identifying appropriate probability distributions for different variables and learned the fundamentals of sampling variability. Later in the course, we will learn how to simulate these random variables using computer code to better understand how samples can vary and why this matters for data analysis.