First, load some dependencies.

In [None]:
library(moderndive)
library(tidyverse)

We will be working with the `bowl` data set from [ModernDive Chapter 7](https://moderndive.com/7-sampling.html).

In [None]:
bowl |> glimpse()

What does the following code calculate?

In [None]:
bowl |> summarize(n_red = sum(color == 'red'), prop_red = n_red / n())

## Sample with for loop

The code below shows how we can sample 50 rows from the bowl data frame.

In [None]:
bowl |> sample_n(50)

The loop below runs the `sample_n` function `n_samples` times. For each iteration, we sample `sample_size` rows from `bowl`.

In [None]:
n_samples = 1000
sample_size = 100

bowl_samples = data.frame()
for (i in 1:n_samples) {
    bowl_sample = dplyr::sample_n(bowl, sample_size) |>
        mutate(rep = i)
    bowl_samples = rbind(bowl_samples, bowl_sample)
}

bowl_samples |> head()

Inspect the code above. 

1. What does the `rbind` function do?
2. Where does the `rep` variable come from in the `bowl_samples` data frame?
3. How many rows are in the `bowl_samples` table? (Use the `nrow` function) Does the row count make sense based on `sample_size` and `n_samples`?
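
If the loop is hard to follow, it can help to try `rbind` on something small first. The sketch below uses two made-up data frames (`first_sample` and `second_sample` are hypothetical, not part of `bowl`) with the same columns as a labeled sample, and repeats the two steps the loop performs.

```r
# Two small, made-up data frames with the same columns
first_sample = data.frame(ball_ID = c(1, 2), color = c("red", "white"), rep = 1)
second_sample = data.frame(ball_ID = c(3, 4), color = c("white", "red"), rep = 2)

# Same pattern as the loop: start from an empty data frame...
stacked = data.frame()

# ...then rbind one sample at a time, checking the row count after each step
stacked = rbind(stacked, first_sample)
nrow(stacked)

stacked = rbind(stacked, second_sample)
nrow(stacked)

stacked
```

Compare the row counts after each step with the number of rows in each input.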

Create a new table that counts the number of red balls for each `rep` in `bowl_samples`.
1. `group_by` rep.
2. Use `summarize` and `sum(color == 'red')` to calculate the number of red balls per group. Call this new variable `n_red`.
3. Use `n_red / n()` to calculate the *proportion* of red balls in each sample. Call this proportion `prop_red`.
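
As a rough illustration of the three steps above, here is the same pattern on a tiny made-up table. `toy_samples` is a hypothetical stand-in for `bowl_samples` with only two reps of four balls each, so the result is easy to check by hand.

```r
library(dplyr)

# Made-up stand-in for bowl_samples: 2 reps of 4 balls each
toy_samples = data.frame(
    rep = c(1, 1, 1, 1, 2, 2, 2, 2),
    color = c("red", "white", "white", "red", "white", "white", "white", "red")
)

toy_props = toy_samples |>
    group_by(rep) |>                          # one group per sample
    summarize(n_red = sum(color == "red"),    # count red balls in each group
              prop_red = n_red / n())         # divide by the group size

toy_props
```

Here rep 1 contains 2 red balls out of 4 and rep 2 contains 1 out of 4, so `prop_red` should be 0.5 and 0.25.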

Plot a histogram of `prop_red`. Does the central tendency make sense knowing the composition of the full bowl dataset?

## Sample with `purrr`

We can also use functions from [`purrr`](https://purrr.tidyverse.org/) to iterate in R. This tends to be faster than the loop above, mainly because it avoids growing a data frame with `rbind` on every iteration.

In [None]:
library(purrr)

Run this code. Can you make sense of the `map_dfr` function?

In [None]:
n_samples = 10000
sample_size = 1000

bowl_samples = 1:n_samples |> map_dfr(function(i) sample_n(bowl, sample_size), .id = 'rep')
# bowl_samples = 1:n_samples |> map_dfr(~sample_n(bowl, sample_size), .id = 'rep')

bowl_samples |> head()

Uncomment and run the commented line above ☝️. What's the difference? What happens if you change the value of the `.id` argument?
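
To experiment with `map_dfr` without waiting on 10,000 samples, a smaller sketch may help. `make_row` is a hypothetical helper that returns a one-row data frame, and `"draw"` is an arbitrary column name chosen for illustration.

```r
library(purrr)

# Hypothetical helper: returns a one-row data frame for input i
make_row = function(i) data.frame(value = i * 10)

# Call make_row once per element and bind the results by row
no_id = 1:3 |> map_dfr(make_row)

# Same call, but with the .id argument set
with_id = 1:3 |> map_dfr(make_row, .id = "draw")

names(no_id)
names(with_id)
class(with_id$draw)
```

Note that the column created by `.id` is stored as character, which matters if you later want to sort or plot by it as a number.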

## Using `rep_sample_n`

Use the `Contextual Help` feature in JupyterLab to see the documentation for the `rep_sample_n` function. Where does this function come from?

Use `rep_sample_n` to sample the `bowl` table. Plot a histogram of the proportion of red balls in each sample.

## Central Limit Theorem

Make three tables of 1,000 samples from the `bowl` table. Use three different sample sizes, 100, 250, and 1,000 (one sample size for each of your three tables). Use `mutate` to indicate the sample size with a new column called `sample_size`. 

Use `rbind` to concatenate your three tables together into a new data frame. Group by sample size and replicate ID and calculate the proportion of red balls in each sample. Plot overlapping histograms of the red ball proportion for each sample size (i.e. color by sample size). Which sample size produces the narrowest distribution?

(**Hint:** you will want to convert your `sample_size` column to a factor!)
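
The hint matters because a numeric `sample_size` is treated as a continuous variable, so `fill` does not split the data into one histogram per sample size. A minimal sketch of the conversion, using made-up proportions (`toy_props` and its values are hypothetical, not results from `bowl`):

```r
library(dplyr)
library(ggplot2)

# Made-up proportions standing in for the summarized samples
toy_props = data.frame(
    sample_size = c(100, 100, 100, 250, 250, 250),
    prop_red = c(0.30, 0.41, 0.36, 0.35, 0.39, 0.37)
)

# Convert the numeric column to a factor so each size becomes its own group
toy_props = toy_props |> mutate(sample_size = factor(sample_size))

# position = "identity" overlaps the histograms instead of stacking them;
# alpha makes the overlapping bars see-through
p = ggplot(toy_props, aes(x = prop_red, fill = sample_size)) +
    geom_histogram(position = "identity", alpha = 0.5, bins = 10)
p
```

The same `mutate` and `ggplot` calls should carry over to your combined table, provided its columns are named `sample_size` and `prop_red`.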