# Worksheet 3: The central limit theorem and mathematical approximation of the sampling distribution

### Lecture and Tutorial Learning Goals
After completing this week's lecture and tutorial work, you will be able to:

1. Describe the Law of Large Numbers.
2. Describe a normal distribution.
3. Explain the Central Limit Theorem and other general asymptotic results.
4. List the properties of the sampling distribution.
5. Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty.

In [None]:
# Run this cell before continuing.
library(cowplot)
library(digest)
library(gridExtra)
library(infer)
library(repr)
library(taxyvr)
library(tidyverse)
source("tests_worksheet_03.R")

## 1. Warm-Up Questions

Before we dive in, let's answer a few warm-up questions.

**Question 1.0**
<br>{points: 1}

True or false?

The distribution of a single sample will **always** have a similar shape to the population distribution.

_Assign your answer to an object called `answer1.0`. Your answer should be either "true" or "false" surrounded by quotes._

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

Which of the following variables is **not** an example of a quantitative variable?

A. The number of red Skittles in a given package.

B. The species of a tree.

C. The age of a student in STAT 201.

D. The weight of a newborn puppy.

_Assign your answer to an object called `answer1.1`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer1.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

We quantify the variation of a point estimator by using:

A. The standard deviation of the population distribution, which is called the standard error.

B. The standard deviation of the sample distribution, which is called the standard error.

C. The number of possible samples we can take from the population.

D. The standard deviation of the sampling distribution, which is called the standard error.

_Assign your answer to an object called `answer1.2`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

True or false?

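An estimator of a population parameter is a random variable whose distribution is the sampling distribution.

_Assign your answer to an object called `answer1.3`. Your answer should be either "true" or "false" surrounded by quotes._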

In [None]:
# answer1.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

## 2. Influence of Sample Size

What happens to our sampling distributions when we increase or decrease the sample size (e.g. a sample size of 10 or 100)? Does it matter? How does it affect our sampling distribution(s)? Is there a pattern? These are some questions that we will answer as we progress through the remainder of this worksheet. To do so, we will revisit the real population we worked with in `worksheet_01`.

<div style="text-align: center">
    <img src="https://media.giphy.com/media/5XRB3Ay93FZw4/giphy.gif"><br>
    <i>Image from <a href="https://media.giphy.com/media/5XRB3Ay93FZw4/giphy.gif">giphy.com</a></i>
    <br><br>
</div>

As you work through the rest of this worksheet, keep two important things in mind. First, we don't usually have access to data for the entire population of interest, as we have had so far. If we did, we could always calculate the population parameter directly. Here, we take advantage of having access to these entire populations to study sampling distributions. Second, always remember the purpose of learning about sampling distributions: by learning about their properties, you will be able to understand the inherent variability/error in point estimates. This "error" associated with a point estimate is critical, and in later weeks we will learn how to report it formally.

### Vancouver Property Tax (Revisited)

Recall that in `worksheet_01` we explored the population distribution, some sample distributions, and the sampling distribution of sample means for the tax assessment value for **multiple-family dwellings in strata housing** in Vancouver using the `tax_2019` dataset from the `taxyvr` R package. As mentioned previously, we're going to revisit and extend our exploration of the sampling distribution of sample means by altering the size of our simulated samples from this dataset.

Let's start by filtering the `tax_2019` dataset for the population that we are interested in again. Since you already did this in a previous worksheet, we have done it for you in the code cell below. Recall that we were interested in the `current_land_value` of properties that meet the following criteria:
- **Have a `current_land_value` greater than \$1:**  Some properties are assigned a value of `NA` and these are the properties undergoing big renovations. These values get amended after the improvement and are reflected in the following year's assessment. The same occurs with homes that are assessed at \\$0 and \\$1.
- **Are of `legal_type` `"STRATA"`**
- **Are of `zone_category` `"Multiple Family Dwelling"`** 

_If you need a refresher on the `tax_2019` dataset and where it came from, please look back at `worksheet_01` and re-read the introduction of section 2 there._

In [None]:
# Run this cell before continuing.
multi_family_strata <- 
    tax_2019 %>%  
    filter(!is.na(current_land_value),
           current_land_value > 1,
           legal_type == "STRATA",
           zone_category == "Multiple Family Dwelling") %>% 
    select(current_land_value)
head(multi_family_strata)

**Question 2.0** 
<br> {points: 1}

Now let's start taking samples from our population `multi_family_strata`. First, take 2000 random samples of size 10 using the `rep_sample_n` function and a seed of `9869`.

_Assign your data frame to an object called `samples_10`._

In [None]:
set.seed(9869) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer

head(samples_10)

In [None]:
test_2.0()

**Question 2.1** 
<br> {points: 1}

Next, calculate the mean of each sample you took in **question 2.0**; these are our point estimates. Name the new column containing the sample means `sample_mean`.

_Assign your data frame to an object called `sample_means_10`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(sample_means_10)

In [None]:
test_2.1()

Let's visualize the distribution of the sample means from the previous question.

In [None]:
sampling_dist_10 <- 
    sample_means_10 %>% 
    ggplot(aes(x = sample_mean)) +
    geom_histogram(binwidth = 50000, color='white') +
    xlab("Sample Mean Land Value (CAD)") +
    ggtitle("n = 10")

sampling_dist_10

**Question 2.2** 
<br> {points: 1}

Using the same strategy as you did above, draw 2000 random samples of size 30 from `multi_family_strata` using the `rep_sample_n` function, but this time use the seed `7032`. For each sample, calculate the mean as the point estimate. Lastly, let's visualize the distribution of the sample means (point estimates) you just calculated by plotting a histogram.

_Assign your plot to an object called `sampling_dist_30`._

In [None]:
set.seed(7032) # DO NOT CHANGE!

# sampling_dist_30 <- 
#     multi_family_strata %>% 
#     ... %>%  # This is a multiline command
#     ggplot(aes(x = sample_mean)) +
#     geom_histogram(binwidth = 25000, color = 'white') +
#     xlab("Mean Land Value (CAD)") +
#     ggtitle("n = 30")

# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_30

In [None]:
test_2.2()

**Question 2.3** 
<br> {points: 1}

Again, using the same strategy as you did above, draw 2000 random samples of size 100 using the `rep_sample_n` function. Use the seed `8408`. For each sample, calculate the mean as the point estimate. Lastly, let's visualize the distribution of the sample means (point estimates) you just calculated by plotting a histogram.

_Assign your plot to an object called `sampling_dist_100`._

In [None]:
set.seed(8408) # DO NOT CHANGE!

# sampling_dist_100 <- 
#     multi_family_strata %>% 
#     ... %>% # This is a multiline command
#     ggplot(aes(x = sample_mean)) +
#     geom_histogram(binwidth = 14000, color = 'white') +
#     xlab("Mean Land Value (CAD)") +
#     ggtitle("n = 100")


# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_100

In [None]:
test_2.3()

In the code cell below, we have used `plot_grid` to plot the three sampling distributions side-by-side. We have sorted the plots by increasing order of sample size from left to right. **Note**: a small number of the sample means are not visible because we manually set bounds on the x-axis so you can compare the distributions more easily (this causes the warnings you observe below).

_Use the set of plots below to answer the **next question**. Some of the code may be confusing, but you do not need to understand the code to answer the question._

In [None]:
# Run this cell before continuing.
options(repr.plot.width = 20) # temp

mean_plot_row <- plot_grid(sampling_dist_10 +
                theme(axis.text.x = element_text(angle = 90)) +
                scale_x_continuous(breaks = seq(400000, 1200000, 200000),
                                   limits = c(400000, 1200000)),
            sampling_dist_30 +
                theme(axis.text.x = element_text(angle = 90)) +
                scale_x_continuous(breaks = seq(400000, 1200000, 200000),
                                   limits = c(400000, 1200000)),
            sampling_dist_100 +
                theme(axis.text.x = element_text(angle = 90)) +
                scale_x_continuous(breaks = seq(400000, 1200000, 200000),
                                   limits = c(400000, 1200000)),
            ncol = 3)
title <- ggdraw() + 
  draw_label("Comparison of sampling distributions of sample means",
             fontface = 'bold',
             x = 0,
             hjust = 0) +
  theme(plot.margin = margin(0, 0, 0, 7))

means_grid <- plot_grid(title,
                        mean_plot_row,
                        ncol = 1,
                        rel_heights = c(0.1, 1))

means_grid

**Question 2.4**
<br> {points: 1}

Considering the set of plots above, which of the statements below is **not** correct?

A. As the sample size increases, the sampling distribution becomes narrower.

B. As the sample size increases, there are more sample point estimates closer to the true population mean.

C. As the sample size increases, the sampling distribution appears more bell-shaped.

D. As the sample size increases, the standard error of the estimator increases.

_Assign your answer to an object called `answer2.4`. Your answer should be a single character surrounded by quotes._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

**Question 2.5**
<br> {points: 1}

Given what you observed above, and considering the real life scenario where you only have the resources to take one sample, answer the true or false question below: 

True or false?

The smaller your random sample, the better your point estimate reflects the true population quantity you are trying to estimate. 

_Assign your answer to an object called `answer2.5`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer2.5 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.5()

## 3. Mathematical models and approximations

In many situations, there are alternatives to bootstrapping for studying sampling distributions. These methods rely on mathematical models and approximations. This week, we will examine two significant results in probability and a vital probability distribution, namely the Normal distribution.

### 3.1 Law of Large Numbers

What happens to the sample average as the sample size increases? In this section, we will try to answer this question.

**Question 3.1.1**
<br> {points: 1}

In this exercise, you will study what happens to the sample mean as the sample size increases. We will consider three populations with very different distributions, namely, `pop1`, `pop2`, and `pop3`. Their histograms are plotted by the cell below.

In [None]:
# Run this cell before continuing.
options(repr.plot.width = 18, repr.plot.height = 8)

N <- 10e5 # pop size
pops <- 
    tibble(pop1 = rnorm(N, 5, 4),
           pop2 = rexp(N, 1/3),
           pop3 = rnorm(N, rbinom(N,1,0.7)*15, 3)) %>% 
    pivot_longer(cols = starts_with('pop'), names_to = 'pop')

pops_plot <- 
    pops %>% 
    ggplot() + 
    geom_histogram(aes(value, fill=pop), color='white', binwidth = 1) + 
    facet_wrap(~pop, scales = "free") +
    theme(text = element_text(size=25)) 

pops_plot

To check how the sample mean changes as the sample size increases, we will examine 5 replicates from each population with samples ranging in size from 5 to 5000 elements. To do this, we will sample 5 replicates of size 5000 and calculate the sample mean when we include only the first 5 randomly selected elements (as if we sampled 5 replicates of size 5), then include another random element and calculate the sample mean (as if we sampled 5 replicates of size 6), and so on, all the way up to a sample size of 5000. We will do this process for each population. The code has already been written for you in the cell below.
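The "add one more element and recompute the mean" idea described above is a running (cumulative) mean. The sketch below illustrates it in base R on a toy sample; the seed, the Exponential population, and the object names are assumptions made for this illustration and are not part of the worksheet's data or solution.

```r
# Toy illustration of the running-mean idea (not the worksheet's data).
set.seed(2024)                  # arbitrary seed, only for reproducibility
x <- rexp(5000, rate = 1/3)     # one sample of size 5000; the true mean is 3

# Mean of the first k observations, for k = 1, ..., 5000.
running_mean <- cumsum(x) / seq_along(x)

# Sanity check: the 5th running mean equals the mean of the first 5 values.
stopifnot(isTRUE(all.equal(running_mean[5], mean(x[1:5]))))

# Running means at a few sample sizes; later values tend to sit closer to 3.
running_mean[c(5, 50, 500, 5000)]
```

In the worksheet's own cell, `cummean(value)` from `dplyr` plays the role of `cumsum(x) / seq_along(x)`.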

In [None]:
set.seed(7812653)
law_large_numbers <- 
    pops %>% 
    group_by(pop) %>% 
    group_modify(~rep_sample_n(.x, size = 5000, reps = 5)) %>%
    group_by(pop, replicate) %>% 
    mutate(sample_size = row_number(), mean = cummean(value), replicate = as_factor(replicate)) %>% 
    filter(sample_size > 5)

# Let's take a peek at the first 2 rows of each replicate for each population
law_large_numbers %>% slice_head(n = 2)
# Let's take a peek at the last 2 rows of each replicate for each population
law_large_numbers %>% slice_tail(n = 2)

We now have all the data that we need, but we cannot make sense of it in a data frame. Your job is to use `geom_line` to plot, for each population and each replicate, how the sample mean changes as the sample size increases. The scaffolding below was written to assist you.
```r
law_large_numbers_plot <- 
    law_large_numbers %>% 
    ggplot() + 
    geom_line(aes(x = ..., y = ..., color = replicate), alpha=.75) +
    xlab(...) + # What is the xlab?
    ylab(...) + # What is the ylab?
    geom_hline(data = tibble(pop=c("pop1", 'pop2', 'pop3'), true_mean=c(5, 3, 0.7*15)),
               aes(yintercept = true_mean)) +                
    facet_wrap(~pop, scales="free", nrow = 1) +
    ggtitle("Sample mean for different sample sizes and population distribution (black line is the population mean)") + 
    theme(text = element_text(size=20))
```


_Assign your plot to an object called `law_large_numbers_plot`._

In [None]:
set.seed(54652) # Do not change this.

# your code here
fail() # No Answer - remove if you provide an answer

law_large_numbers_plot

In [None]:
test_3.1.1()

**Question 3.1.2**
<br>{points: 1}

Considering the effect that increasing the sample size has on the sample mean, which of the statements below is **not** true?

A. By increasing the sample size, the sampling distribution becomes narrower.

B. By sufficiently increasing the sample size, it is possible to <em><u>guarantee</u></em> (with probability 1!) that the sample mean is as close as we want to the population mean, regardless of the population distribution. 

C. By increasing the sample size, the centre of the sampling distribution becomes closer and closer to the true mean.

D. For any given sample size, the population distribution affects how much the sample mean varies.

_Assign your answer to an object called `answer3.1.2`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer3.1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1.2()

### 3.2 Normal (also known as Gaussian) distribution

By plotting the histogram of a population, we can check the population distribution. The population distribution is a key part of statistical inference, mainly because it affects the estimator's sampling distribution. So far in the course, we haven't made any assumptions about the population distribution. Instead, to study the sampling distribution, we used bootstrapping. 

In many situations, however, it is reasonable to approximate the exact population distribution, which is unknown, by a theoretical probability distribution. You can think of it as having the formula to approximate the histogram of the population distribution. In fact, you might see people referring to probability distributions as a population.

There is one probability distribution that is especially important: the Normal distribution. Besides being very common in practice (i.e., it is a good model for a broad range of problems), the Central Limit Theorem (which you will study in the next section) makes the Normal distribution a fundamental part of Statistics.

In this section, you are going to study the Normal distribution.
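For reference, the "formula" behind the Normal distribution is its density function. The expression below is the standard Normal density with mean $\mu$ and standard deviation $\sigma$; it is what R's `dnorm(x, mu, sigma)` evaluates.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty.$$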


**Question 3.2.1**
<br> {points: 1}

For this exercise, we are going to use the wing lengths of houseflies (made available by [Fs Lab](https://fslab.org/datasets/04_HouseflyWingLength.html) and  [Seattle Central College](https://seattlecentral.edu) from [this paper](https://academic-oup-com.ezproxy.library.ubc.ca/aesa/article-abstract/48/6/499/26366?redirectedFrom=fulltext) -- the code to load the data is in the cell below). Naturally, this is not the population; however, let us treat it as our population for the sake of discussion. Your job is to fill in the scaffolding below to plot the population distribution. Make sure you don't change the binwidth.

```r
pop_dist_flies <- 
    houseflies %>% 
    ...() + 
    ...(aes(x = ..., y = after_stat(density)), color = 'white', binwidth = 0.198)+
    theme(text = element_text(size=25)) +
    xlab(...) +
    ylab(...) + 
    ggtitle(...)
    
```

_Assign your plot to an object called `pop_dist_flies`._

In [None]:
# Run this cell before continuing.
houseflies <- 
    read_table("data/HouseflyWingLength.txt", col_names = 'wings_length') %>% 
    mutate(wings_length = wings_length/10)

pop_mu = 4.55  # Ignore this for now.
pop_sd = 0.391964747951093 # Ignore this for now.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

pop_dist_flies

In [None]:
test_3.2.1()

**Question 3.2.2**
<br> {points: 1}

In this exercise, you will check the shape of the Normal distribution and compare it with the histogram of the population distribution. In the cell below, we prepared a `tibble` for you with the values of the Normal density (i.e., the formula that will approximate the population distribution). Your job is to fill in the scaffolding below to add the Normal density to the population histogram you obtained in the previous exercise.

```r
pop_dist_normal <- 
    pop_dist_flies + 
    geom_line(data = ..., aes(..., ...), color="red", lwd = 2)

```

_Assign your plot to an object called `pop_dist_normal`._

In [None]:
# Run this cell before continuing
data_normal <- tibble(wings_length=seq(min(houseflies$wings_length), 
                                             max(houseflies$wings_length),0.01), 
                      density = dnorm(wings_length, pop_mu, pop_sd))
head(data_normal)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

pop_dist_normal

In [None]:
test_3.2.2()

**What do you think? Is the Normal distribution a good approximation for the population distribution?**

**Question 3.2.3**
<br>{points: 1}

The Normal distribution family is indexed by two parameters, $\mu$ (read *mu*) and $\sigma$ (read *sigma*). In this exercise, your job is to investigate how changing $\mu$ and $\sigma$ affects the Normal density. To help you investigate the role of the parameters, we prepared a function that returns a tibble like `data_normal` for the parameters you specify. Then, you can plot the new curve. The scaffolding below also includes the Normal curve with parameters $\mu = 0$ and $\sigma = 1$, which is called the Standard Normal curve. Try different values for $\mu$ and $\sigma$ (note: $\sigma$ must be a positive number, but $\mu$ can be any real number).

After you investigate the role of the parameters, select the statements below that are true (select all that apply): 

A. The parameter $\mu$ controls how wide the curve is, while the parameter $\sigma$ controls its location.

B. The parameter $\mu$ controls the location of the curve, while the parameter $\sigma$ controls its spread.

C. As $\mu$ increases, the Normal curve becomes narrower. 

D. As $\sigma$ increases, the Normal curve becomes wider. 

_Assign your answer to an object called `answer3.2.3`. Your answer should contain all the letters you selected, surrounded by quotes (e.g., a possible solution is `"ABCD"`)._

In [None]:
# Run this before continuing

#' Generate the data_normal tibble for a given mu and sigma of your choice
#'
#' @param mu The desired mu value you want (it can be anything).
#' @param sigma The desired sigma value you want (it must be greater than 0).
#' @return Returns a tibble with two columns: x and density.
create_data_normal <- function(mu, sigma){
    return(tibble(x = seq(mu - 4 * sigma, mu + 4 * sigma, 0.01), 
                  density = dnorm(x, mu, sigma)))
}

In [None]:
# Use this cell for your experiments (uncomment the lines below)

#    ggplot() + 
#    geom_line(data = create_data_normal(mu = 0, sigma = 1), aes(x, density), color = "black", lwd = 2) + 
#    geom_line(data = create_data_normal(mu = ..., sigma = ...), aes(x, density), color = "red", lwd = 2)


In [None]:
# Your solution here:
# answer3.2.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2.3()

**Question 3.2.4** 
<br> {points: 1}

The Normal distribution has an interesting property: regardless of the values of $\mu$ and $\sigma$, we have that:

- Approximately 68% of the observations are between $[\mu -\sigma; \mu+\sigma]$.
- Approximately 95.5% of the observations are between $[\mu -2\sigma; \mu+2\sigma]$.
- Approximately 99.7% of the observations are between $[\mu -3\sigma; \mu+3\sigma]$.

The cell below illustrates this property.
Again, these percentages hold for any values of $\mu$ and $\sigma$. 
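The percentages above can be checked with R's Normal cumulative distribution function, `pnorm`. The sketch below does so for the Standard Normal and for one other parameter choice; the names `mu_example` and `sigma_example` and their values are hypothetical and chosen only for illustration.

```r
# Probability that a Normal observation falls within k standard deviations
# of its mean, for k = 1, 2, 3 (computed on the Standard Normal).
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
# 0.6827 0.9545 0.9973

# The same calculation for a hypothetical mu and sigma gives the same values.
mu_example <- 10
sigma_example <- 2.5
within_k_sd_example <-
    pnorm(mu_example + k * sigma_example, mean = mu_example, sd = sigma_example) -
    pnorm(mu_example - k * sigma_example, mean = mu_example, sd = sigma_example)
all.equal(within_k_sd, within_k_sd_example)
```

These are theoretical probabilities under an exact Normal model; proportions computed from real data will usually differ from them slightly.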

In this exercise, you will calculate the proportion of houseflies whose wing lengths fall in each of the following intervals:

- A. $[4.55 - 0.392; 4.55 + 0.392]$
- B. $[4.55 - 2\times0.392; 4.55 + 2\times0.392]$
- C. $[4.55 - 3\times0.392; 4.55 + 3\times0.392]$


What do you think? Are these values similar to the theoretical values?

_Assign your answers to objects called `answer3.2.4_partA`, `answer3.2.4_partB`, and `answer3.2.4_partC`. Each answer should be a single number._

In [None]:
# Run this cell to see the plots

data_normal <- tibble(x = seq(-4, 4, 0.01), density = dnorm(x, 0, 1)) 
plots_normal <- list()
percentages <- c("68%", "95.5%", "99.7%")
for (i in 1:3){
plots_normal[[i]] <- 
    data_normal %>% 
    ggplot() + 
    geom_line(aes(x, density), lwd=2) + 
    geom_ribbon(data=subset(data_normal, x>-i & x<i), 
                aes(x = x, ymax = density), ymin = 0, alpha = 0.3, fill = "blue") + 
    annotate("text", x = 0, y = 0.18, label = percentages[i], size = 15) +
    scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3), labels = c(expression(paste(mu, " - 3", sigma)), expression(paste(mu, " - 2", sigma)), expression(paste(mu, " - 1", sigma)),
                                                                      expression(mu),
                                                                      expression(paste(mu, " + 1", sigma)), expression(paste(mu, " + 2", sigma)), expression(paste(mu, " + 3", sigma)))) +
    theme(text = element_text(size=25))
}
do.call("grid.arrange", c(plots_normal, ncol=1))

In [None]:
# mu <- ... # The mean of the "population"
# sigma <- ... # The standard deviation of the "population" 

# (answer3.2.4_partA <- houseflies %>% filter(between(wings_length, left = ..., right = ...)) %>% nrow() / nrow(houseflies))
# (answer3.2.4_partB <- ... )
# (answer3.2.4_partC <- ... )

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2.4()

**Question 3.2.5** 
<br> {points: 1}

Knowing that the population approximately follows a Normal distribution allows us to find the formula for the sampling distribution of the sample mean. For example, for samples of $16$ houseflies, we claim that the sampling distribution of the sample mean will follow a Normal distribution with parameters $\mu_{\bar{X}}=4.55$ and $\sigma_{\bar{X}} = \frac{0.392}{4}$. In this exercise and the next, you are going to verify whether this is (approximately) accurate. 
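The two parameters in this claim follow from standard facts about the mean of $n$ independent observations drawn from a population with mean $\mu$ and standard deviation $\sigma$ (independence is an assumption made here; it matches sampling with replacement):

$$\mu_{\bar{X}} = \mu = 4.55, \qquad \sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{0.392}{\sqrt{16}} = \frac{0.392}{4} = 0.098.$$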

For this exercise, take 10000 samples of size 16, with replacement, from the `houseflies` data set.

_Assign your data frame to an object called `samples_houseflies`. This data frame should have two columns: `replicate` and `wings_length`._


In [None]:
set.seed(1)

# your code here
fail() # No Answer - remove if you provide an answer

head(samples_houseflies)

In [None]:
test_3.2.5()

**Question 3.2.6**
<br> {points: 1}

For each of the samples you obtained in the previous exercise, calculate the sample mean. Then, plot the histogram of all the sample means you obtained, together with the line for the Normal density. You can use the scaffolding below. 


```r
sampling_dist_mean_houseflies <-
    samples_houseflies %>% 
    group_by(...) %>% 
    summarise(sample_mean = ...) %>% 
    ggplot() + 
    geom_histogram(aes(sample_mean, y = after_stat(density)), color='white', binwidth=.05) + 
    theme(text = element_text(size=25)) +
    xlab("Mean houseflies' wing length (mm)") + 
    geom_line(data = tibble(sample_mean = seq(3.5, 5.8, 0.01), 
                            density = dnorm(sample_mean, 4.55, 0.392/4)), 
                            aes(sample_mean, density), color = "red", lwd = 2) + 
    ggtitle("Sampling distribution of sample mean, for sample size n=16")
              
```

Take a look at the plot and consider whether this is a good approximation. Also, keep in mind that we got the approximation without any simulation, just by using the fact that the population was approximately Normally distributed. 

_Assign your plot to an object called `sampling_dist_mean_houseflies`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_mean_houseflies

In [None]:
test_3.2.6()

### 3.3 Central Limit Theorem (CLT)

In many cases, the Normal distribution will not be a good approximation for the population distribution. So why are we studying the Normal distribution specifically? Because, luckily for us, there is a very strong result in probability, called the Central Limit Theorem, that allows us to approximate the sampling distribution of many estimators by a Normal distribution, even when the population distribution is not Normal.

The Central Limit Theorem roughly states that if you are summing up a very large number of independent random terms, the distribution of this sum is approximately Normal. For example, when you calculate the sample mean, you are summing the elements of your sample; therefore, if your sample size is large, then the sampling distribution of the sample mean will be approximately Normal, *regardless of the population distribution*. 
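One common way to state this result more precisely is the classical version for independent, identically distributed observations $X_1, \dots, X_n$ with mean $\mu$ and finite standard deviation $\sigma$ (these conditions are assumptions of that version and are not spelled out above). For large $n$:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\overset{\text{approx.}}{\sim}\; N(0, 1), \qquad \text{or equivalently} \qquad \bar{X}_n \;\overset{\text{approx.}}{\sim}\; N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

Here the second argument of $N(\cdot, \cdot)$ is a variance, so the standard deviation of $\bar{X}_n$ is $\sigma/\sqrt{n}$.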

Let us see the Central Limit Theorem in action. In the following exercises, we will study the sampling distribution of the sample mean, using different sample sizes from a peculiar population distribution, which is definitely not Gaussian. The cell below creates said population, stores it in an object named `weird_pop`, and plots its histogram (you do not need to worry about the code in the cell, you just need to run the cell).


In [None]:
# Run this cell before continuing

set.seed(1)
N <- 10e5 # Pop size
w <- rbinom(N, 1, 0.2) # Mixture weights

# Creates the pop
weird_pop <- 
    tibble(value = w*rnorm(N, rbinom(N,1,0.7)*15+5, 3) + (1-w)*rexp(N, 1) )

# Creates the histogram of the weird_pop
weird_pop_plot <-
    weird_pop %>% 
    ggplot() + 
    geom_histogram(aes(x=value, y = after_stat(density)), bins=70, color = "white") +
    theme(text = element_text(size=25)) +
    ggtitle("Weird population distribution - definitely not Gaussian")

# Obtain the density for different values of n.
mu <- mean(weird_pop$value)
sigma <- sd(weird_pop$value)
me <- 4*sigma

gaussian_densities <- 
    tibble(sample_size = c(10, 30, 500), 
           grid = map(.x = sample_size, 
                      .f = ~tibble(value = seq(mu-me/sqrt(.x), mu+me/sqrt(.x), 0.01), 
                                   density = dnorm(value, mu, sigma/sqrt(.x))))) %>% 
    unnest(grid)

# Plot the population distribution
weird_pop_plot

**Question 3.3.1** 
<br> {points: 1}

For this exercise, your job is to take 3000 samples of size 10. The samples should be taken without replacement.

_Assign your data frame to an object called `samples_size10`. The data frame should have two columns: `replicate` and `value`_

In [None]:
set.seed(1) # do not change this

# your code here
fail() # No Answer - remove if you provide an answer

head(samples_size10)

In [None]:
test_3.3.1()

**Question 3.3.2**
<br> {points: 1}

Fill in the scaffolding below to plot the sampling distribution of the sample mean versus the sampling distribution given by the CLT. Check how good the approximation is. 

```r
sampling_dist_size10 <-
    samples_size10 %>% 
    group_by(...) %>% 
    summarise(sample_mean = ..., `.groups` = "drop") %>% 
    ggplot() + 
    geom_histogram(aes(x=sample_mean, y = after_stat(density)), color = "white", binwidth = 0.2) +
    theme(text = element_text(size=25))+
    xlab("Sample mean") +
    ggtitle("Sampling distribution of the sample mean for samples of size 10 from Weird Population.") + 
    geom_line(data = gaussian_densities %>% filter(sample_size == 10), aes(value, density), color = "red", lwd = 2)
    
```


_Assign your plot to an object called `sampling_dist_size10`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_size10

In [None]:
test_3.3.2()

**Question 3.3.3** 
<br> {points: 1}

Let us repeat Question 3.3.1, but this time using a much larger sample size. Take 3,000 samples of size 500 from `weird_pop`. The samples should be taken without replacement.

_Assign your data frame to an object called `samples_size500`. The data frame should have two columns: `replicate` and `value`_

In [None]:
set.seed(2) # Do not change this
# your code here
fail() # No Answer - remove if you provide an answer

head(samples_size500)

In [None]:
test_3.3.3()

**Question 3.3.4**
<br> {points: 1}

Let's take a look at how well the CLT approximates the sampling distribution of the sample mean when the sample size is 500. Fill in the scaffolding below to plot the sampling distribution of the sample mean versus the sampling distribution given by the CLT. Check how good the approximation is. 

```r
sampling_dist_size500 <-
    samples_size500 %>% 
    group_by(...) %>% 
    summarise(sample_mean = ..., `.groups` = "drop") %>% 
    ggplot() + 
    geom_histogram(aes(x = sample_mean, y = after_stat(density)), color="white", binwidth = 0.2) +
    theme(text = element_text(size = 25))+
    xlab("Sample mean") +
    ggtitle("Sampling distribution of the sample mean for samples of size 500 from Weird Population.") + 
    geom_line(data = gaussian_densities %>% filter(sample_size == 500), aes(value, density), color = "red", lwd = 2)
    
```

_Assign your plot to an object called `sampling_dist_size500`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_size500

In [None]:
test_3.3.4()

The cell below plots the two figures you created in the exercises above, together with the case where the sample size is 30, so you can compare how the approximation improves as the sample size increases.

In [None]:
# Run this cell before continuing. 

set.seed(13)

all_samples <- 
    bind_rows(samples_size10 %>% mutate(sample_size = 10), 
              weird_pop %>% rep_sample_n(size = 30, reps = 3000) %>% mutate(sample_size = 30), 
              samples_size500 %>% mutate(sample_size = 500))

all_samples %>% 
    group_by(sample_size, replicate) %>% 
    summarise(sample_mean = mean(value), `.groups` = "drop") %>% 
    select(-replicate) %>% 
    mutate(sample_size = as_factor(sample_size)) %>% 
    ggplot() +
    facet_wrap(~sample_size, scales = "free") +
    geom_histogram(aes(x = sample_mean, y = after_stat(density), fill = sample_size),
                   color='white', 
                   binwidth = function(x) 2 * IQR(x) / (length(x)^(1/4)))+
    theme(text = element_text(size=25)) +
    ggtitle("Sampling distribution of the sample mean for different sample sizes (n)") +
    geom_line(data = gaussian_densities, aes(x = value, y = density), lwd=2) + 
    xlab("Sample mean")


The question that remains, and it is a hard one to answer, is: <em>what sample size is big enough?</em> There is no universal answer. Although it is true that (some) sampling distributions converge to a Gaussian (i.e., Normal) distribution no matter what the population distribution looks like, the sample size required for a good approximation depends on the population distribution. In many cases, a sample size between 30 and 50 is enough. But the more a population distribution differs from the Normal distribution (e.g., having multiple peaks or being asymmetric), the larger the sample size required for a good approximation.
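To make this dependence on the population shape concrete, here is a minimal sketch that is not part of the worksheet's graded code. It uses two stand-in populations chosen for illustration (a symmetric Uniform and a right-skewed Exponential) instead of `weird_pop`, and a hypothetical helper `skewness()` that measures asymmetry: values near 0 suggest a symmetric, Normal-like shape. It relies only on base R.

```r
# Illustrative sketch (base R only; not part of the worksheet's graded code).
# Compare how quickly the sampling distribution of the sample mean becomes
# symmetric for a symmetric population versus a strongly right-skewed one.
set.seed(2024)

# Sample skewness: close to 0 for a symmetric (e.g., Normal) distribution.
skewness <- function(x) {
    mean((x - mean(x))^3) / sd(x)^3
}

# Draw `reps` samples of size `n` using `rdist` and return their sample means.
simulate_means <- function(rdist, n, reps = 3000) {
    replicate(reps, mean(rdist(n)))
}

sample_sizes <- c(10, 30, 500)

# One row per sample size; one skewness column per stand-in population.
skew_table <- data.frame(
    sample_size = sample_sizes,
    uniform_pop = sapply(sample_sizes, function(n) skewness(simulate_means(runif, n))),
    exponential_pop = sapply(sample_sizes, function(n) skewness(simulate_means(rexp, n)))
)

print(skew_table)
```

For the symmetric population, the skewness of the sample means should already be near 0 at small sample sizes. For the skewed population, it should shrink toward 0 only as the sample size grows, which is why no single sample size works as a cutoff for every population.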

In the case of `weird_pop`, we can see that the approximation given by the CLT is quite poor for a sample of size 10. For a sample of size 500, the approximation is quite good. For a sample of size 30, the CLT already provides a decent approximation; however, if you look closely, the approximation is not as good in the left tail.

### CLT is not always applicable

It is important to note that the CLT is not magic: you should not rely on it automatically. There are three main things you need to check:
1. Is the size of your sample large enough? 
2. Was the sample taken in an independent fashion? 
3. Is the estimator being used a sum of random components? 

We have discussed the effects of the sample size in the previous exercises. Now, we will see an example of an estimator that is not a sum of random components and for which, therefore, you cannot use the CLT.

**Question 3.3.5** 
<br> {points: 1}

For this exercise, you are going to estimate the minimum of the population. To estimate the population minimum, we are going to use the sample minimum. Let us study the sampling distribution of the sample minimum. You have already drawn samples of different sizes; they are now stored in the object `all_samples`. Complete the scaffolding below to plot the sampling distribution of the sample minimum for different sample sizes. As the sample size increases, does the sampling distribution look more like a Normal distribution?

```r
sampling_dist_min <- 
    all_samples %>% 
    group_by(..., ...) %>% 
    summarise(sample_min = ..., `.groups` = "drop") %>% 
    ggplot() +
    geom_histogram(aes(x = sample_min, y = after_stat(density)), color="white", binwidth = 0.2) +
    facet_wrap(~sample_size, scales = "free") +
    theme(text = element_text(size = 25)) + 
    ggtitle("Sampling distribution of sample minimum for different sample sizes") +
    xlab("Sample Minimum")
    
```


_Assign your plot to an object called `sampling_dist_min`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_min

In [None]:
test_3.3.5()

### To summarise this section:

 - The Law of Large Numbers states that the sample mean converges (i.e., gets closer and closer) to the population mean when the sample size increases. 
 - The CLT states that if we have an estimator that is a sum of random components (like the mean), then the sampling distribution of that estimator converges (i.e., gets more and more similar) to the Normal distribution as the sample size increases.
 - The Normal distribution is very important in statistics. In particular:
     1. It is symmetric around the mean;
     2. It has two parameters: $\mu$, which is the mean of the distribution (a location parameter), and $\sigma$, which is the standard deviation of the distribution (a measure of spread).
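
The points above can be checked numerically. The sketch below is only an illustration under stated assumptions: the population is made up (100,000 draws from an Exponential distribution with mean 5, not `weird_pop`), samples are taken without replacement as in the exercises, and only base R is used. It relies on the standard CLT result that the standard error of the sample mean is approximately $\sigma / \sqrt{n}$.

```r
# Illustrative sketch (base R only); the population below is made up for this example.
set.seed(1)

# Hypothetical right-skewed population with mean close to 5.
population <- rexp(100000, rate = 1 / 5)
pop_mean <- mean(population)
pop_sd <- sqrt(mean((population - pop_mean)^2))  # population standard deviation

# Law of Large Numbers: sample means from larger samples tend to fall closer to pop_mean.
lln_sizes <- c(10, 1000, 50000)
lln_means <- sapply(lln_sizes, function(n) mean(sample(population, n)))
print(data.frame(sample_size = lln_sizes, sample_mean = lln_means, pop_mean = pop_mean))

# CLT: for samples of size n, the sampling distribution of the sample mean is
# approximately Normal with mean pop_mean and standard deviation pop_sd / sqrt(n).
n <- 100
sample_means <- replicate(3000, mean(sample(population, n)))
simulated_se <- sd(sample_means)
clt_se <- pop_sd / sqrt(n)
print(c(simulated_se = simulated_se, clt_se = clt_se))

# Symmetry of the Normal density around its mean: the two values agree
# (up to floating-point rounding).
mu <- pop_mean
sigma <- clt_se
print(c(left = dnorm(mu - sigma, mu, sigma), right = dnorm(mu + sigma, mu, sigma)))
```

The simulated standard error and the CLT value should be close to each other, even though the population itself is far from Normal.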