# Worksheet 5: Confidence Intervals Based on the Central Limit Theorem



#### Lecture and Tutorial Learning Goals:
From this section, students are expected to be able to:

1. Explain the role of the Central Limit Theorem in constructing confidence intervals.
2. Describe the $t$-distribution family and its relationship with the normal distribution.
3. Write a computer script to calculate confidence intervals based on distributional assumptions.
4. Calculate z-scores.
5. Discuss the potential limitations of these methods.
6. Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty.

In [None]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)
penguins <- read.csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
source("tests_worksheet_05.R")

## 1. Short Recap & Warm-Up

Before we start exploring the new material for this week, let's remind ourselves of some of the most important points that we covered in the previous week by answering a couple of questions.

**Question 1.0**
<br>{points: 1}

In Module 4, you calculated confidence intervals based on a simulation method: bootstrapping. What is one use of bootstrapping? 

A. Since bootstrapping resamples from our original sample many times, it helps reduce the variability of our statistic, which allows us to obtain narrower confidence intervals.  

B. Bootstrapping does not improve the quality of our estimate. Bootstrapping just allows us to study the sampling distribution of our statistic, which would be otherwise unknown.

C. Bootstrapping allows us to estimate the center as well as the variability of the sampling distribution of our statistic.

D. Bootstrapping estimates the population parameter.

_Assign your answer to an object called `answer1.0`. Your answer should be a single character surrounded by quotes._

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

## 2. Obtaining confidence intervals based on the CLT

In this section, you will explore how to obtain confidence intervals for the proportion and the mean using the Central Limit Theorem (CLT). Recall that to calculate a confidence interval for a parameter (e.g., the population mean) using the bootstrap, you had to:

1. Take a sample.
2. Construct the bootstrap sampling distribution.
3. Get the quantiles from the estimated sampling distribution.

Now, the only thing that will change is that we are going to use the CLT instead of the bootstrap to approximate the sampling distribution (when it applies; remember that the CLT is not always applicable). Therefore, you get to skip the resampling in Step 2, because you will approximate the sampling distribution using the Normal distribution (or, as we will see later, the $t$-distribution for the sample mean).
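To make the contrast concrete, here is a minimal sketch on a simulated sample. The data, the seed, and the names `x`, `boot_ci`, and `clt_ci` are illustrative assumptions and are not part of the worksheet's data or tests. It builds a 95% percentile bootstrap interval and a CLT-based interval for the same mean, using base R only.

```r
set.seed(2024)

# Hypothetical sample: 60 draws from a right-skewed population (mean 10).
x <- rexp(60, rate = 1 / 10)

# Bootstrap route: resample with replacement, recompute the mean each time,
# then read off the 2.5% and 97.5% quantiles of the bootstrap distribution.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# CLT route: no resampling; approximate the sampling distribution of the
# mean by a Normal centred at the sample mean with sd = s / sqrt(n).
std_error <- sd(x) / sqrt(length(x))
clt_ci <- qnorm(c(0.025, 0.975), mean = mean(x), sd = std_error)

boot_ci
clt_ci
```

With a sample of this size, the two intervals should be of similar width; the bootstrap interval may be slightly asymmetric around the sample mean because the sample is skewed.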

We will focus on two parameters in this section: the proportion and the mean.

For many of the following questions, we will use the `penguins` dataset, which contains the body mass of multiple penguin species. Run the following code to extract the body mass of Adelie penguins.

In [None]:
body_mass_g_adelie <- 
    penguins %>% 
    filter(species == 'Adelie' & !is.na(body_mass_g)) %>% 
    pull(body_mass_g)
body_mass_g_adelie

**Question 2.0 - Estimating the proportion using CLT**
<br>{points: 1}

For this question, we want to estimate the proportion of `Adelie` penguins with `body_mass_g` over 4000g.

You are going to apply the CLT to obtain the confidence interval for the proportion. A proportion is the average of a random variable that can only take the values 0 or 1. Therefore, when calculating a proportion, you are averaging random terms, and the CLT applies. The CLT for proportions states that, for large $n$, the sample proportion approximately follows a Normal distribution with mean equal to $p$, the population proportion, and standard deviation $\sqrt{p(1-p)/n}$:
$$
\hat{p}\sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)
$$

Since we do not know $p$, the best we can do is to use $\hat{p}$ in its place. For proportions, the CLT provides a fairly good approximation for values of $n$ such that $n\hat{p}\geq 10$ and $n(1-\hat{p})\geq 10$. Again, the larger $n$ is, the more accurate the approximation.
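The claim about the mean and standard deviation of $\hat{p}$ can be checked by simulation. The sketch below uses a made-up population proportion and sample size (`p`, `n`, and the seed are assumptions chosen for illustration, not values from the penguins data) and compares the simulated centre and spread of many sample proportions with the values given by the formula.

```r
set.seed(123)

p <- 0.3   # hypothetical population proportion
n <- 150   # hypothetical sample size; n * p = 45 and n * (1 - p) = 105

# Each binomial draw counts the successes in one sample of size n;
# dividing by n turns the count into a sample proportion.
phats <- rbinom(5000, size = n, prob = p) / n

simulated   <- c(mean = mean(phats), sd = sd(phats))
theoretical <- c(mean = p, sd = sqrt(p * (1 - p) / n))

rbind(simulated, theoretical)
```

The two rows should agree closely. Repeating the experiment with a much smaller `n` (so that the $n\hat{p}\geq 10$ rule of thumb fails) is a quick way to see the Normal approximation deteriorate.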


What would you use as the mean and standard deviation of the sampling distribution of $\hat{p}$?

_Assign the mean to an object called `answer2.0_mean` and the standard deviation to an object called `answer2.0_std_error`. These objects should be numbers, not data frames._

In [None]:
#phat <- mean(...)
#answer2.0_mean <- ... 
#answer2.0_std_error <- ...

# your code here
fail() # No Answer - remove if you provide an answer
cat("The phat estimate is", round(phat,4), "\nThe std. error estimate is", round(answer2.0_std_error,4))

In [None]:
test_2.0()

**Question 2.1** 
<br> {points: 1}

Using the sampling distribution you specified in the previous question, obtain a 90\%  confidence interval for the proportion of `Adelie` penguins with `body_mass_g` over 4000g. Use the scaffolding below:

```r
prop_adelie_ci <- tibble(
    lower_ci = ...,
    upper_ci = ...
)
```
(Hint: the function `qnorm` can help you).

_Assign your data frame to an object called `prop_adelie_ci`. The data frame should contain two columns only: `lower_ci` and `upper_ci`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(prop_adelie_ci)

In [None]:
test_2.1()

**Question 2.2 - Estimating the mean using CLT**
<br>{points: 1}

To estimate the population mean, we use the sample average, $\bar{X}$. The CLT roughly says that $\bar{X}$ follows a Normal distribution with parameters $\mu$ and $\frac{\sigma}{\sqrt{n}}$:
$$
\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)
$$
Since we do not know $\mu$ and $\sigma$ we replace them with their estimates $\bar{x}$ and $s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}$ (you can calculate $s$ with the `sd` function in R).


For this question, consider the `penguins` dataset. We want to estimate the mean `body_mass_g` of the `Adelie` species. What would you use as the mean and standard deviation of the sampling distribution of $\bar{X}$? 

_Assign the mean to an object called `answer2.2_mean` and the standard deviation to an object called `answer2.2_std_error`. These values should be numbers, not data frames._

In [None]:
#answer2.2_mean <- 
#    mean(..., na.rm = TRUE)

#answer2.2_std_error <-
#    sd(..., na.rm = TRUE) / ...

# your code here
fail() # No Answer - remove if you provide an answer

cat("The mean estimate is", round(answer2.2_mean,4), "\nThe std. error estimate is", round(answer2.2_std_error, 4))

In [None]:
test_2.2()

**Question 2.3** 
<br> {points: 1}

Using the sampling distribution you specified in the previous question, obtain a 95\%  confidence interval for the mean. Use the scaffolding below:

```r
mean_body_mass_adelie_ci <- tibble(
    lower_ci = qnorm(..., ..., ...),
    upper_ci = qnorm(..., ..., ...)
)
```
While we will use the $t$-distribution below, here we will first explore the Normal distribution.

_Assign your data frame to an object called `mean_body_mass_adelie_ci`. The data frame should contain two columns only: `lower_ci` and `upper_ci`._

In [None]:
# mean_body_mass_adelie_ci <-
#     tibble(
#         lower_ci = qnorm(0.025, ..., ...),
#         upper_ci = qnorm(..., ..., ...)
#     )

# your code here
fail() # No Answer - remove if you provide an answer

mean_body_mass_adelie_ci

In [None]:
test_2.3()

**Question 2.4**
<br> {points: 1}

For the sake of comparison, obtain a 95% confidence interval for the mean `body_mass_g` of the `Adelie` species using the bootstrap with 3000 replicates. You can use the scaffolding below to help you:

```r
bootstrap_ci <- 
    penguins %>% 
    filter(...) %>% 
    specify(...) %>% 
    generate(...) %>% 
    calculate(...) %>% 
    get_ci()
```

_Assign your data frame to an object called `bootstrap_ci`. The data frame should contain two columns only: `lower_ci` and `upper_ci`._

In [None]:
set.seed(54612) # Do not change this.

# your code here
fail() # No Answer - remove if you provide an answer

bootstrap_ci

In [None]:
test_2.4()

_Note: The bootstrap and CLT confidence intervals are quite close in this case, but the bootstrap interval is a little wider than the interval based on the CLT. One reason is that we do not know $\sigma$ and are using the sample standard deviation, $s$, to estimate it. Therefore, there is more uncertainty around our estimator $\bar{X}$ than we are accounting for. By underestimating our uncertainty, we make our interval narrower than it should be and, consequently, the coverage can be lower than the specified confidence level. Although the difference is small in this case, for smaller sample sizes, say $n<30$ or $n<20$, it can be notable. In Section 3, you will learn how to improve the confidence interval based on the CLT by properly accounting for the extra uncertainty of using $s$ in place of $\sigma$._

## 3. Student's t Distribution (or, $t$-distribution)

The $t$-distribution family is quite similar to the standard Normal distribution:
- it is symmetric;
- it is bell-shaped;
- it is unimodal.

Run the cell below to see a plot of some $t$-distributions.

In [None]:
options(repr.plot.width=15, repr.plot.height=7)

densities <- 
    tibble(degrees_of_freedom = c(1, 5, 10)) %>% 
    mutate(tdensity = map(degrees_of_freedom, ~tibble(x = seq(-4, 4, 0.01),
                                     t_density = dt(x,.x),
                                     std_Gaussian = dnorm(x) ))) %>% 
    mutate(degrees_of_freedom = as_factor(degrees_of_freedom)) %>% 
    unnest(tdensity)
    

densities %>% 
    ggplot() +
    geom_line(aes(x, t_density, color = degrees_of_freedom)) + 
    geom_line(aes(x, std_Gaussian), lwd = 1.2) + 
    ggtitle("Densities of t-Distributions and Standard Gaussian (the thicker black line)") + 
    ylab("Density") + 
    theme(text = element_text(size=20)) 

Although $t$-distributions are very similar to the standard Gaussian distribution, there are some key properties to keep in mind. A $t$-distribution:

- is always centred around 0;
- has only one parameter: the degrees of freedom (which controls the spread);
- has heavier tails than the Normal distribution (most noticeably for low degrees of freedom);
- converges to the Normal distribution as the degrees of freedom increase (they do not need to be very large; a $t$-distribution with 50 or more degrees of freedom is almost identical to the Normal distribution).
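These properties can also be seen numerically. The sketch below (the object names `comparison`, `normal_quantile`, and `normal_tail` are illustrative) compares the 0.975 quantile and the probability of falling more than 3 units from zero for a few $t$-distributions against the standard Normal, using the base R functions `qt`, `pt`, `qnorm`, and `pnorm`.

```r
dfs <- c(1, 5, 10, 50)

comparison <- data.frame(
  df = dfs,
  # 0.975 quantile: the multiplier used in a 95% confidence interval
  t_quantile = qt(0.975, df = dfs),
  # two-sided probability of landing beyond +/- 3
  tail_prob_beyond_3 = 2 * pt(-3, df = dfs)
)

normal_quantile <- qnorm(0.975)
normal_tail <- 2 * pnorm(-3)

comparison
c(normal_quantile = normal_quantile, normal_tail = normal_tail)
```

As the degrees of freedom grow, both columns shrink towards the Normal values: heavier tails mean larger quantiles, and therefore wider intervals, for small degrees of freedom.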

The heavier tails of the $t$-distribution allow us to account for "additional" uncertainty compared to the Normal distribution. In fact, that is the reason it was developed. The $t$-distribution family was derived by William Gosset, an employee at the Guinness Brewery, when studying the error around the sample mean for small samples (where the CLT is not applicable). The story behind the $t$-distribution is quite interesting, and you can read more [here](https://priceonomics.com/the-guinness-brewer-who-revolutionized-statistics/) if you are curious.

To better understand the heavier tails of the $t$-distributions, let us discuss an example.


In [None]:
# Run this cell before continuing. 

set.seed(1)

mu = 1.7
sigma = 0.07
gaussian_pop <- 
    tibble(height = rnorm(10000, mu, sigma))

**Question 3.1**
<br>{points: 1}

In the tibble `gaussian_pop`, we have the heights of 10,000 people, which will be our population of interest. Let us take a look at the population distribution. Use the scaffolding provided in the code cell below to plot a histogram of the population overlaid with the Normal density curve.

Note that the Normal distribution is often called the Gaussian distribution.

_Assign your answer to an object named `gaussian_pop_dist`._

In [None]:
# gaussian_pop_dist <- 
#     ... %>%   
#     ... +
#     ...(aes(..., y = after_stat(density)), color = 'white') +
#     geom_line(data = tibble(x = seq(mu - 3.5*sigma, mu + 3.5*sigma, 0.01), 
#                             density = dnorm(x, mu, sigma)), 
#               aes(x = x, y = density), color = "red", lwd = 2) +
#     ggtitle(...) +
#     theme(text = element_text(size = 22))

# your code here
fail() # No Answer - remove if you provide an answer

gaussian_pop_dist

In [None]:
test_3.1()

**Question 3.2**
<br> {points: 1}

In Module 03, we saw that the Central Limit Theorem roughly states that the sampling distribution of the sample mean converges to $N\left(\mu, \sigma/\sqrt{n}\right)$, where $\mu$ and $\sigma$ are, respectively, the mean and standard deviation of the population. But what is the distribution of the sample mean for small sample sizes? Unfortunately, it will be highly dependent on the population distribution. If the population is normally distributed, the sampling distribution of the mean is also normally distributed. More specifically, it is $N\left(\mu, \sigma/\sqrt{n}\right)$. 

The previous question clearly showed that our population follows a Normal distribution. Now, we are going to draw a large number of **small** samples from the population and calculate their sample means. But this time, we want to standardize the sample means by calculating the Z-score:

$$
Z = \frac{\bar{x}_i - \mu}{\sigma/\sqrt{n}}
$$

We are still pretending that we know $\mu$ and $\sigma$, which are stored in the `mu` and `sigma` variables, respectively. Our Z-score distribution will be the Standard Normal, i.e., $N(0,1)$. 

Here's your job:

1. draw 2000 samples of size seven from `gaussian_pop`;
2. for each sample, calculate the sample average;
3. then, obtain the Z-scores of the sample averages and store them in a column named `z`.

_Assign your data frame to an object called `zscore_sample_means`. The data frame should have three columns: `replicate`, `sample_mean`, and `z`._

In [None]:
set.seed(89) # Do not change this
n <- 7

# zscore_sample_means <-
#     gaussian_pop %>% 
#     rep_sample_n(...) %>% 
#     group_by(...) %>% 
#     summarise(sample_mean = ...) %>% 
#     mutate(z = ... )

# your code here
fail() # No Answer - remove if you provide an answer

head(zscore_sample_means)

In [None]:
test_3.2()

**Question 3.3**
<br> {points: 1}

Compare the sampling distribution of the z-scores of the sample mean that you obtained in the previous question with the density line of a $N(0, 1)$. Use a `binwidth` of 0.3.

_Assign your plot to an object called `sampling_dist_sample_mean_z`._

In [None]:
# sampling_dist_sample_mean_z <- 
#     ... %>% 
#     ... + 
#     geom_...(aes(..., after_stat(density)), color = 'white', binwidth = ...) + 
#     geom_line(data = tibble(x = seq(-3.5, 3.5, 0.01), 
#                             density = dnorm(x, 0, 1)), 
#               aes(x = x, y = density), color = "red", lwd = 2) +
#     theme(text = element_text(size = 22)) +
#     xlab("Sample Mean of Z-score") +
#     ggtitle("Sampling distribution of the Z-scores of the sample mean vs Standard Normal density")

# your code here
fail() # No Answer - remove if you provide an answer

sampling_dist_sample_mean_z

In [None]:
test_3.3()

**Question 3.4**
<br> {points: 1}

In the previous question, you used the population standard deviation (which is almost always unknown) to calculate the z-scores. What can we do when we do not know the true value of $\sigma$? A reasonable answer is to use the sample standard deviation, $s$. However, this approach increases the uncertainty. The value of $\sigma$ is fixed: a constant that just happens to be unknown. If we use $s$ instead, we replace the constant $\sigma$ with a random variable that changes from sample to sample, so the denominator of the z-score now also varies from sample to sample. Would this additional uncertainty affect the sampling distribution of the z-scores of the sample mean? Take a minute to think about this. What do you expect to happen to the sampling distribution above if we have this extra layer of uncertainty?
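To get a feel for how much $s$ moves around in small samples, here is a small sketch (the seed and the name `s_values` are illustrative assumptions). It draws 1000 samples of size 7 directly from a Normal distribution with the same mean and standard deviation used for `gaussian_pop` and records the sample standard deviation of each one.

```r
set.seed(7)

sigma <- 0.07  # same population standard deviation as in the cell above

# One sample standard deviation per simulated sample of size 7
s_values <- replicate(1000, sd(rnorm(7, mean = 1.7, sd = sigma)))

summary(s_values)

# Smallest and largest s, expressed as multiples of the true sigma
range(s_values) / sigma
```

The spread of `s_values` around 0.07 is exactly the extra source of randomness that enters the standardized statistic once $\sigma$ is replaced by $s$.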

In this exercise, you are going to:
1. take 10000 samples of size $n=7$
2. calculate the sample average and sample standard deviation of each sample, store them in variables named `sample_mean` and `sample_sd`, respectively
3. calculate the z-scores of the sample average, but this time using $s$ instead of $\sigma$, store them in a column called `z`
4. plot the histogram of the z-scores
5. plot the density line of the $N(0, 1)$
6. plot the density line of the $t_{6}$

Note that $t_6$ means that it's a $t$-distribution with 6 degrees of freedom. The $t$-distribution associated with the sample mean has $n-1$ degrees of freedom. It is 6 here, because the sample size is 7. 

The scaffolding below is provided to help you accomplish these steps. 
Pay close attention to the tails of the distributions.

_Assign your plot to an object called `sampling_dist_zscore_s`._

In [None]:
set.seed(5) # Do not change this

# n <- ...
# sampling_dist_zscore_s <-
#     ... %>% 
#     rep_sample_n(reps = ..., size = n, replace = ...) %>% 
#     group_by(...) %>% 
#     summarise(sample_mean = ..., sample_sd = ...) %>% 
#     mutate(z = ...) %>% 
#     ggplot() +
#     ...(aes(..., after_stat(density)), color = 'white', binwidth = 0.3) + 
#     geom_line(data = tibble(x = seq(-3.5, 3.5, 0.01), 
#        std_normal = dnorm(x, 0, 1), 
#        t = dt(x, n-1)) %>% pivot_longer(cols = c(std_normal, t), names_to = "distribution", values_to = "density"), 
#               aes(x = x, y = density, color = distribution), lwd = 2) +
#     theme(text = element_text(size = 22)) + 
#     xlab("Sample Mean of Z-score") +
#     ggtitle(...) +
#     xlim(-4, 4)

# your code here
fail() # No Answer - remove if you provide an answer
sampling_dist_zscore_s

In [None]:
test_3.4()

Please take a close look at the distribution's tails and note how the z-scores are more spread out than the standard Normal density suggests. If we use the Normal distribution to approximate this sampling distribution, we will end up with narrower confidence intervals than we should. Recall the comparison between the bootstrap confidence interval and the CLT confidence interval in Question 2.4. For larger sample sizes, however, the $t$-distribution becomes much closer to the Normal distribution, and the difference between using the Normal distribution and the $t$-distribution diminishes.
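The practical consequence of "narrower than we should" is lower coverage, which can be checked with a short simulation. The sketch below is an illustration under assumed settings (the seed, the number of repetitions, and the names `covers` and `coverage` are not part of the worksheet). It repeatedly draws samples of size 7 from a Normal population, builds nominal 95% intervals using $s$ with both the Normal and the $t_{6}$ quantiles, and records how often each interval contains the true mean.

```r
set.seed(42)

mu <- 1.7
sigma <- 0.07
n <- 7
reps <- 5000

# For each simulated sample, check whether the true mean falls inside
# the interval  xbar +/- quantile * s / sqrt(n)  for each quantile choice.
covers <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  std_error <- sd(x) / sqrt(n)
  c(normal = abs(mean(x) - mu) <= qnorm(0.975) * std_error,
    t      = abs(mean(x) - mu) <= qt(0.975, df = n - 1) * std_error)
})

# Proportion of intervals that captured mu, by method
coverage <- rowMeans(covers)
coverage
```

The Normal-based intervals should capture $\mu$ noticeably less often than 95% of the time, while the $t$-based intervals should land close to the nominal level.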

## 4. Further comparisons between confidence intervals based on the Normal and the t-distributions

**Question 4.0 - Estimating the difference in means using CLT**
<br>{points: 1}

Let's return to the penguins data set. Are `Adelie` penguins heavier than `Chinstrap` penguins? To answer this question, we will estimate the difference in mean body mass between the two species. Let's refer to the `Adelie` penguins as population 1 and the `Chinstrap` penguins as population 2.

Assuming both sample sizes are large enough, we can approximate the sampling distribution of $\bar{X}_1-\bar{X}_2$ by
$$
\bar{X}_1-\bar{X}_2\sim N\left(\mu_1 - \mu_2, \sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\right)
$$

For comparison purposes, let's first ignore the fact that using the sample standard deviations, $s_1$ and $s_2$, instead of the population standard deviations, $\sigma_1$ and $\sigma_2$, adds additional uncertainty, and let's obtain the confidence interval as:

$$
CI\left(\mu_1 - \mu_2\right) = \left(\bar{X}_1-\bar{X}_2\right) \pm z^*\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}
$$
where $z^*$ is the appropriate quantile of the standard Normal distribution (the 0.975 quantile for a 95% interval).

Using this equation, obtain the 95% confidence interval for the difference in means of Adelie penguins' weight and Chinstrap penguins' weight. The sample is stored in the object `adelie_chinstrap_sample`.

_Assign your data frame to an object called `penguins_diff_means_ci`. The data frame should have two columns: `lower_ci` and `upper_ci`._

In [None]:
# Run this cell before continuing
adelie_chinstrap_sample <- 
    penguins %>%
    filter(species %in% c("Adelie", "Chinstrap") & !is.na(body_mass_g)) %>% 
    select(species, body_mass_g)

In [None]:
adelie <- 
    adelie_chinstrap_sample %>% 
    filter(species == 'Adelie') %>% 
    pull(body_mass_g)

chinstrap <- 
    adelie_chinstrap_sample %>% 
    filter(species == 'Chinstrap') %>% 
    pull(body_mass_g)

# penguins_diff_means_ci <- 
#     tibble(
#         lower_ci = mean(...) - mean(...) - qnorm(...) * sqrt(var(...)/length(...) + var(...)/length(...)),
#         upper_ci = ...
#     )

# your code here
fail() # No Answer - remove if you provide an answer

penguins_diff_means_ci

In [None]:
test_4.0()

**Question 4.1 - Estimating the difference in means using the t-distribution**

As we mentioned in Section 3, it would be more appropriate to approximate the sampling distribution using the $t$-distribution, since we do not know the population standard deviations $\sigma_1$ and $\sigma_2$, and we are using the sample standard deviations, $s_1$ and $s_2$, to estimate them. Therefore, there is more uncertainty around our estimator $\bar{X}_1-\bar{X}_2$ than we are accounting for.

Thus, we will now compute confidence intervals with the $t$-distribution. The degrees of freedom we will use here are $\min(n_1-1, n_2-1)$.
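The choice $\min(n_1-1, n_2-1)$ is a simple, conservative one. For context only, the sketch below compares it with the Welch-Satterthwaite approximation, which is what base R's `t.test` uses by default for two samples. The two groups here are simulated and purely hypothetical (`x1`, `x2`, the seed, and all other names are assumptions, not the penguins data), and the worksheet's questions still expect the $\min$ rule.

```r
set.seed(10)

x1 <- rnorm(40, mean = 50, sd = 8)    # hypothetical group 1
x2 <- rnorm(25, mean = 47, sd = 12)   # hypothetical group 2

n1 <- length(x1)
n2 <- length(x2)
v1 <- var(x1) / n1   # estimated variance of the first sample mean
v2 <- var(x2) / n2   # estimated variance of the second sample mean

# Conservative rule used in this worksheet
df_conservative <- min(n1 - 1, n2 - 1)

# Welch-Satterthwaite approximation, computed by hand
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# The same quantity as reported by t.test (Welch is its default)
df_from_t_test <- unname(t.test(x1, x2)$parameter)

c(conservative = df_conservative, welch = df_welch, t_test = df_from_t_test)

# Fewer degrees of freedom give a larger multiplier, hence a wider interval
qt(0.975, df = c(df_conservative, df_welch))
```

Because the Welch degrees of freedom are never smaller than $\min(n_1-1, n_2-1)$, the interval built with the $\min$ rule is at least as wide, which is why it is described as conservative.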

Using a similar scaffolding as in the previous question, obtain the 95% confidence interval for the difference in means of Adelie penguins' weight and Chinstrap penguins' weight using the $t$-distribution. The sample is stored in the object `adelie_chinstrap_sample`.

_Assign your data frame to an object called `penguins_diff_means_ci_t`. The data frame should have two columns: `lower_ci` and `upper_ci`._

In [None]:
# penguins_diff_means_ci_t <- 
#     tibble(
#         lower_ci = mean(...) - mean(...) - qt(..., df = min(..., ...)) * sqrt(var(...)/length(...) + var(...)/length(...)),
#         upper_ci = ...
#     )

# your code here
fail() # No Answer - remove if you provide an answer

penguins_diff_means_ci_t

In [None]:
test_4.1()

Note that the confidence interval based on the $t$-distribution is wider as it accounts for the additional uncertainty.