# Week 7: Confidence Intervals (of means and proportions) Based on the Assumption of Normality or the Central Limit Theorem

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial, students are expected to be able to:

1. Describe the Law of Large Numbers.
2. Describe a normal distribution.
3. Explain the Central Limit Theorem and its role in constructing confidence intervals.
4. Write a computer script to calculate confidence intervals based on the assumption of normality / the Central Limit Theorem.
5. Discuss the potential limitations of these methods.
6. Decide whether to use asymptotic theory or bootstrapping to compute estimator uncertainty.

In [None]:
# Run this cell before continuing.
library(cowplot)
library(datateachr)
library(digest)
library(infer)
library(repr)
library(taxyvr)
library(tidyverse)
source("tests_tutorial_07.R")

## 1. Estimating the Mean using CLT

In this section, we will use the Central Limit Theorem to obtain interval estimates (i.e., confidence intervals) for the population mean instead of the simulation approach we used in Week 4. 

The US Food & Drug Administration (FDA) monitored the mercury levels in many different commercial fish and shellfish between 1990 and 2010. The mercury levels were measured in parts per million (ppm). This study is very relevant because high mercury levels are toxic to people: they can damage the brain and harm a developing fetus. Pretty serious!

Let us start by loading and taking a peek at the dataset.

In [None]:
salmon <- read_csv("salmon.csv")
head(salmon)

**Question 1.1**
<br> {points: 1}

Since we will be relying on the CLT, it is good to check that there is no severe violation of the CLT's assumptions. A good first step is to plot the sample distribution to understand what sort of distribution we are dealing with. Remember that the sample distribution is an estimate of the population distribution. This step is important because "weird" distributions, such as asymmetric and/or multimodal distributions, **might** require bigger sample sizes for the CLT to kick in.

Your job in this exercise is to plot the histogram of `mercury_concentration` of salmon. 

_Assign your plot to an object called `salmon_sample_dist`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

salmon_sample_dist

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

Based on the previous question's sample distribution plot, select all that apply:

A. The distribution is symmetric;

B. The distribution is left-skewed;

C. The distribution is right-skewed;

D. The distribution is multimodal;

E. The distribution is unimodal;

F. The distribution is quite similar to the Normal distribution;

_Assign your answer to an object called `answer1.2`. Your answer should be a sequence of characters surrounded by quotes, e.g., "ACF"._

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "answer1.2"', {
  expect_true(exists("answer1.2"))
})


**Question 1.3**
<br> {points: 4}

The previous question showed that the sample distribution is quite asymmetric, which suggests the population is too: most salmon have low mercury levels, but some have much higher levels. Asymmetry, especially with outliers, might require a larger **n** for the CLT to kick in. On the other hand, our sample size is already somewhat large, with 94 fish in the sample.

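Remember that, if its conditions are satisfied, the CLT states that the sample mean is approximately Normal, $\bar{X}\sim N\left(\mu, \sigma/\sqrt{n}\right)$, where the second parameter, $\sigma/\sqrt{n}$, is the standard deviation of $\bar{X}$ (its standard error).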
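To see the statement above in action, here is a minimal simulation sketch. It assumes a made-up, right-skewed population (an exponential distribution with mean 1 and standard deviation 1), not the salmon data; the names `n`, `n_reps`, and `sample_means` are illustrative only.

```r
set.seed(1)

n <- 94          # sample size (matches our salmon sample, for illustration)
n_reps <- 5000   # number of simulated samples

# Draw n_reps samples of size n from an Exponential(1) population
# (mu = 1, sigma = 1) and record each sample mean.
sample_means <- replicate(n_reps, mean(rexp(n, rate = 1)))

# If the CLT approximation is good, the mean and standard deviation of the
# sample means should be close to mu and sigma / sqrt(n), respectively.
c(mean_of_means = mean(sample_means),
  sd_of_means = sd(sample_means),
  clt_std_error = 1 / sqrt(n))
```

A histogram of `sample_means` should also look roughly bell-shaped, even though the population itself is strongly skewed.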

If you have a strong suspicion that your sample size is not large enough, you might want to consider using the bootstrap approach. Let's take a look at the approximation of the sampling distribution provided by both approaches: bootstrap and CLT. Let's use the sample mean and standard deviation as placeholders for $\mu$ and $\sigma$, respectively, in the case of CLT.

Your job:

1. Calculate the sample mean and store it in an object called `salmon_x_bar`. Note: `salmon_x_bar` must be a number, not a `tibble`.
2. Calculate the estimate of the standard error of $\bar{X}$ and store it in an object called `salmon_std_error`. Note: `salmon_std_error` must be a number, not a `tibble`.
3. Obtain the bootstrap sampling distribution using 2000 bootstrap samples and store them in a data frame called `salmon_btsp_samp_dist`. Note: the data frame `salmon_btsp_samp_dist` must have two columns only: (1) `replicate`; and (2) `stat`.
4. Create the histogram to visualize the bootstrap sampling distribution vs the Normal density given by the CLT. Assign your plot to an object called `salmon_btsp_vs_clt_samp_dist_plot`.
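Before working on the salmon data, it may help to see the two approaches side by side on a toy example. The sketch below uses base R resampling instead of the `infer` workflow used in this tutorial, and a hypothetical simulated sample (`toy_sample`) rather than the salmon data, so treat it as an illustration of the idea only.

```r
set.seed(2)

# Hypothetical right-skewed sample (NOT the salmon data).
toy_sample <- rgamma(80, shape = 2, rate = 4)

# CLT-based estimate of the standard error of the sample mean.
clt_std_error <- sd(toy_sample) / sqrt(length(toy_sample))

# Bootstrap: resample with replacement, keeping the original sample size,
# and store the mean of each bootstrap sample.
boot_means <- replicate(
    2000,
    mean(sample(toy_sample, size = length(toy_sample), replace = TRUE))
)

# The standard deviation of the bootstrap means estimates the same
# standard error, so the two numbers should be close.
c(clt = clt_std_error, bootstrap = sd(boot_means))
```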

In [None]:
set.seed(20210201) # Do not change this

# # Obtain the sample mean
# salmon_x_bar <- mean(salmon$...)

# # Obtain the sample std. error 
# salmon_std_error <- ...

# # Done for you: the normal curve
# clt_samp_dist <- 
#     tibble(x = seq(salmon_x_bar - 4 * salmon_std_error, 
#                    salmon_x_bar + 4 * salmon_std_error, 0.0001),
#            density = dnorm(x, salmon_x_bar, salmon_std_error))

# # Obtain the bootstrap sampling distribution
# salmon_btsp_samp_dist <-
#     salmon %>% 
#     ...

# # Let's plot the bootstrap vs the CLT estimates
# salmon_btsp_vs_clt_samp_dist_plot <- 
#     salmon_btsp_samp_dist %>% 
#     ggplot() + 
#     geom_histogram(aes(..., ..density..), color = 'white') + 
#     geom_line(data = clt_samp_dist, aes(x, density), lwd = 2, color = "red") + 
#     ... + 
#     ... + 
#     theme(text = element_text(size = 20))

# your code here
fail() # No Answer - remove if you provide an answer

salmon_btsp_vs_clt_samp_dist_plot

In [None]:
test_1.3()

As we can see from the plot above, the sampling distribution estimates given by the CLT and bootstrap approaches are fairly close. Given this similarity, we can already expect the two confidence intervals to be similar.

**Question 1.4** 
<br> {points: 1}


Obtain the 92.8% confidence interval for the mean mercury level in salmon by applying the CLT.

_Assign your data frame to an object called `salmon_clt_ci`. The data frame should have two columns: (1) `lower_ci`; and (2) `upper_ci`._
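As a reminder of the mechanics, a CLT-based interval has the form $\bar{x} \pm z_{1-\alpha/2} \times \text{SE}$. The sketch below applies it at a 90% level to hypothetical summary statistics (the values of `x_bar` and `std_error` are made up, and a base R `data.frame` stands in for a `tibble`).

```r
# Hypothetical summary statistics (NOT from the salmon data).
x_bar <- 12.4      # sample mean
std_error <- 0.8   # estimated standard error, s / sqrt(n)
conf_level <- 0.90

# For a 90% interval we leave 5% in each tail.
alpha <- 1 - conf_level

toy_clt_ci <- data.frame(
    lower_ci = x_bar + qnorm(alpha / 2) * std_error,      # qnorm(0.05) < 0
    upper_ci = x_bar + qnorm(1 - alpha / 2) * std_error
)
toy_clt_ci
```

Note that the probability passed to `qnorm()` depends on the confidence level: a different level needs a different tail area.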

In [None]:
# salmon_clt_ci <- 
#     tibble(lower_ci = ... + qnorm(...) * ..., 
#            upper_ci = ... )

# your code here
fail() # No Answer - remove if you provide an answer

head(salmon_clt_ci)

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "salmon_clt_ci"', {
  expect_true(exists("salmon_clt_ci"))
})
test_that("Solution should be a data frame", {
  expect_true("data.frame" %in% class(salmon_clt_ci))
})
test_that("Data frame does not contain the correct number of rows", {
  expect_equal(digest(as.integer(nrow(salmon_clt_ci))), "4b5630ee914e848e8d07221556b0a2fb")
})



**Question 1.5** 
<br> {points: 1}


Obtain the 92.8% confidence interval for the mean mercury level in salmon using the bootstrap distribution you obtained previously: `salmon_btsp_samp_dist`.

_Assign your data frame to an object called `salmon_btsp_ci`. The data frame should have two columns: (1) `lower_ci`; and (2) `upper_ci`._
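One common way to turn a bootstrap distribution into an interval is the percentile method, which takes the matching quantiles of the bootstrap statistics. Here is a minimal base R sketch at a 90% level; `toy_btsp_samp_dist` is a hypothetical stand-in generated with `rnorm()`, not a real bootstrap distribution.

```r
set.seed(3)

# Hypothetical stand-in for a bootstrap distribution of a sample mean,
# stored in a `stat` column like the data frames used in this tutorial.
toy_btsp_samp_dist <- data.frame(
    replicate = 1:2000,
    stat = rnorm(2000, mean = 5, sd = 0.3)
)

conf_level <- 0.90
alpha <- 1 - conf_level

# Percentile method: cut off alpha / 2 of the statistics in each tail.
toy_btsp_ci <- data.frame(
    lower_ci = unname(quantile(toy_btsp_samp_dist$stat, alpha / 2)),
    upper_ci = unname(quantile(toy_btsp_samp_dist$stat, 1 - alpha / 2))
)
toy_btsp_ci
```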

In [None]:
# salmon_btsp_ci <- 
#     salmon_btsp_samp_dist %>% 
#     ...

# your code here
fail() # No Answer - remove if you provide an answer

head(salmon_btsp_ci)

In [None]:
test_1.5()

As we expected, the CLT and bootstrap confidence intervals are quite close.

## 2. Estimating the Difference in Means using CLT

Is parking in Downtown Vancouver more expensive than in Kitsilano?
For this question, we will use the Vancouver parking meter data set. 
First, let's preview the dataset.

In [None]:
?parking_meters

In [None]:
head(parking_meters)

Let's focus on the Monday to Friday, 9 am to 6 pm rate (the `r_mf_9a_6p` column). We will take a sample of 53 Downtown meters and a sample of 40 Kitsilano meters. The sample is stored in `downtown_kitsilano_sample`.

In [None]:
# Run this cell before continuing

set.seed(4759)

parking_pop <- # Some data cleaning
    parking_meters %>% 
    filter((geo_local_area %in% c("Downtown", "Kitsilano")) & (!is.na(r_mf_9a_6p))) %>%
    select(geo_local_area, r_mf_9a_6p) %>% 
    mutate(r_mf_9a_6p = as.numeric(str_remove(r_mf_9a_6p, "\\$")))

downtown_kitsilano_sample <- # Taking the sample
    parking_pop %>% 
    group_by(geo_local_area) %>% 
    sample_n(size = case_when(geo_local_area == "Downtown" ~ 53,
                              geo_local_area == "Kitsilano" ~ 40), replace = FALSE) %>% 
    ungroup()

downtown_kitsilano_sample %>% # Let's take a peek
    group_by(geo_local_area) %>% 
    sample_n(size = 3)

**Question 2.1**
<br> {points: 1}

As usual, let's start by checking the sample distribution of each neighbourhood. Use `binwidth = 1`.

_Assign your plot to an object called `parking_samp_dist_plot`._

In [None]:
# parking_samp_dist_plot <- 
#     downtown_kitsilano_sample %>% 
#     ... + 
#     ...(..., ... = 1, color = 'white') +
#     facet_wrap(~ geo_local_area) + 
#     ...
#     theme(text = element_text(size = 22))


# your code here
fail() # No Answer - remove if you provide an answer

parking_samp_dist_plot

In [None]:
test_2.1()

**Question 2.2** 
<br> {points: 1}

Obtain the sample average of the parking rates for each region, `Kitsilano` and `Downtown`, as well as the standard error of each sample average.

_Assign your data frame to an object called `parking_summary`. The data frame should have three columns: (1) `geo_local_area`; (2) `sample_mean`; and (3) `sample_std_error`_

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

head(parking_summary)

In [None]:
test_2.2()

**Question 2.3** 
<br> {points: 1}

Obtain a 94% confidence interval for the difference between the means of Downtown and Kitsilano's parking rates using the CLT. 

_Assign your data frame to an object called `parking_clt_ci`. The data frame should have two columns: `lower_ci` and `upper_ci`._
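For two independent samples, the variances of the two sample means add up, so the standard error of the difference is $\sqrt{\text{SE}_1^2 + \text{SE}_2^2}$. The sketch below illustrates this at a 90% level with hypothetical numbers (none of them come from the parking data).

```r
# Hypothetical summary statistics for two independent samples
# (NOT the parking data).
mean_a <- 6.0
std_error_a <- 0.30
mean_b <- 4.5
std_error_b <- 0.40
conf_level <- 0.90

# Standard error of the difference: square root of the summed variances.
diff_std_error <- sqrt(std_error_a^2 + std_error_b^2)

alpha <- 1 - conf_level
toy_diff_ci <- data.frame(
    lower_ci = (mean_a - mean_b) + qnorm(alpha / 2) * diff_std_error,
    upper_ci = (mean_a - mean_b) + qnorm(1 - alpha / 2) * diff_std_error
)
toy_diff_ci
```

Remember that a standard error is already a standard deviation of the sample mean, so squaring it gives the variance of that sample mean.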

In [None]:
# downtown_mean <- parking_summary$...
# downtown_var <- ...
# kits_mean <- ...
# kits_var <- ...

# parking_clt_ci <- 
#     tibble(lower_ci = ...,
#            upper_ci = ...)

# your code here
fail() # No Answer - remove if you provide an answer

head(parking_clt_ci)

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "parking_clt_ci"', {
  expect_true(exists("parking_clt_ci"))
})
test_that("Solution should be a data frame", {
  expect_true("data.frame" %in% class(parking_clt_ci))
})


**Question 2.4**
<br> {points: 1}


Are these intervals accurate? In other words, if we were to repeat this process many times, would we capture the true difference in mean parking rates approximately 94% of the time? This is what we are going to investigate in this exercise. The first cell below takes 100 samples and stores their summary statistics in the `parking_multiple_samples` object.

Your job is:
1. Add two columns to the `parking_multiple_samples` data frame, namely `lower_ci` and `upper_ci`, with the lower and upper bounds of the respective 94% confidence interval.
2. Add a third column, `captured`, which is `TRUE` if the interval captured the true difference and `FALSE` otherwise.
3. Finally, select only the `replicate`, `lower_ci`, `upper_ci`, and `captured` columns.


_Assign your data frame to an object called `parking_multiple_ci`._
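The idea of "capturing the true value a given percentage of the time" can be checked by simulation whenever the population is known. The following minimal sketch does this for a single mean, using a made-up exponential population and 90% intervals; the helper `one_interval_captures()` is hypothetical and is not part of this tutorial's code.

```r
set.seed(4)

true_mean <- 2      # known only because we simulate the population
n <- 50
conf_level <- 0.90
z <- qnorm(1 - (1 - conf_level) / 2)

# Build one CLT interval from a fresh sample and check whether it
# contains true_mean.
one_interval_captures <- function() {
    x <- rexp(n, rate = 1 / true_mean)   # skewed population with mean 2
    std_error <- sd(x) / sqrt(n)
    lower_ci <- mean(x) - z * std_error
    upper_ci <- mean(x) + z * std_error
    lower_ci <= true_mean && true_mean <= upper_ci
}

captured <- replicate(1000, one_interval_captures())

# Proportion of intervals that captured the true mean.
mean(captured)
```

If the CLT approximation is reasonable for this sample size, the proportion should land in the neighbourhood of the nominal 90%.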

In [None]:
# Run this cell before continuing

set.seed(20210301) # Do not change this.

# Obtain the means of parking rates in each neighbourhood.
true_means <-
    parking_pop %>% 
    group_by(geo_local_area) %>% 
    summarise(sample_mean = mean(r_mf_9a_6p)) %>% 
    pull(sample_mean)

# Obtain the true difference in mean
true_diff = true_means[1] - true_means[2]

parking_multiple_samples <- 
    tibble(replicate = 1:100) %>% 
    mutate(sample = map(replicate,
                        `.f` = ~
                            parking_pop %>% 
                            group_by(geo_local_area) %>% 
                            sample_n(size = case_when(geo_local_area == "Downtown" ~ 53,
                                                      geo_local_area == "Kitsilano" ~ 40), replace = FALSE) %>% 
                            ungroup() 
                    )
    ) %>% 
    unnest(sample) %>% 
    group_by(replicate, geo_local_area) %>% 
    summarise(sample_mean = mean(r_mf_9a_6p),
              sample_std_error = sd(r_mf_9a_6p)/sqrt(n()),
              n = n()) %>% 
    pivot_wider(names_from = geo_local_area, values_from = c(sample_mean, sample_std_error, n))

head(parking_multiple_samples)

In [None]:
# parking_multiple_ci <-
#     parking_multiple_samples %>% 
#     mutate(lower_ci = ...,
#            upper_ci = ...) %>% 
#     select(replicate, lower_ci, upper_ci) %>% 
#     mutate(captured = between(true_diff, ..., ...))


# your code here
fail() # No Answer - remove if you provide an answer

head(parking_multiple_ci)

In [None]:
test_2.4()

Nice job! Run the cell below to visualize the confidence intervals.

In [None]:
parking_multiple_ci %>% 
    ggplot() +
    scale_colour_manual(breaks = c("TRUE", "FALSE"), # Change colour scale for better visibility.
                        values = c("grey", "black")) +
    geom_segment(aes(x = lower_ci,
                     xend = upper_ci,
                     y = replicate,
                     yend = replicate,
                     colour = captured)) +
    geom_vline(xintercept = true_diff, colour = "red", size = 1) +
    labs(title = "100 94% Confidence Intervals",
         x = 'Difference in means',
         y = "Sample ID",
         colour = "Captured?") +
    theme_bw() + # Sets a theme for better visibility.
    theme(text = element_text(size = 18)) 
    

As you can see, the CLT approximation seems quite reasonable, even though our population is not Normal.

## 3. Estimating Proportions using CLT

There was a provincial election in British Columbia in 2020. Before the election, the pollsters were hard at work trying to estimate the proportion of votes each party would get if the election happened when the data was collected. In this section, we will work with a data set from [a poll performed by the Angus Reid Institute](https://angusreid.org/bc-election-post-debate/) that asked people which party they intended to vote for. Let's start by reading the data set.


In [None]:
polls <- 
    read_csv("angus_reid_poll.csv") %>% 
    mutate(party = as.factor(party))

head(polls)

**Question 3.1** 
<br> {points: 1}

Since we intend to use the CLT, let us start by calculating the quantities that we need:

1. the number of respondents who intend to vote for each party, stored in a column named `n`;
2. the proportion of respondents who intend to vote for each party, stored in a column named `prop`;
3. the standard error of the sample proportion of each party, stored in a column named `se`;
4. the lower and upper bounds of the 95% confidence interval, stored in columns `lower_ci` and `upper_ci`, respectively.

_Assign your data frame to an object called `polls_summary`._
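For a single proportion, the CLT-based standard error is $\sqrt{\hat{p}(1-\hat{p})/n}$, where $n$ is the total sample size. A minimal sketch with a hypothetical poll (the counts below are made up and unrelated to the Angus Reid data):

```r
# Hypothetical poll: 400 respondents, 132 of whom pick a given party.
n_respondents <- 400
n_party <- 132

prop <- n_party / n_respondents
se <- sqrt(prop * (1 - prop) / n_respondents)

# 95% interval: 2.5% in each tail of the Normal approximation.
toy_prop_ci <- data.frame(
    prop = prop,
    se = se,
    lower_ci = prop - qnorm(0.975) * se,
    upper_ci = prop + qnorm(0.975) * se
)
toy_prop_ci
```

The denominator inside the square root is the total number of respondents, not the number who chose that party.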

In [None]:
# polls_summary <- 
#     polls %>% 
#     ... %>% 
#     summarise(n =  n(),
#               prop = ...,              
#               se = ...,
#               lower_ci = ...,
#               upper_ci = ...) %>% 
#     mutate(party = fct_reorder(party, prop, .desc = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer

head(polls_summary)

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.
test_that('Did not assign answer to an object called "polls_summary"', {
  expect_true(exists("polls_summary"))
})
test_that("Solution should be a data frame", {
  expect_true("data.frame" %in% class(polls_summary))
})
expected_colnames <- c("party", "prop", "n", "se", "lower_ci", "upper_ci")

given_colnames <- colnames(polls_summary)
test_that("Data frame does not have the correct columns", {
  expect_equal(length(setdiff(
    union(expected_colnames, given_colnames),
    intersect(expected_colnames, given_colnames)
  )), 0)
})

test_that("Data frame does not contain the correct number of rows", {
  expect_equal(digest(as.integer(nrow(polls_summary))), "234a2a5581872457b9fe1187d1616b13")
})



How about we visualize the information you obtained with a plot? 

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

poll_ci_plot <- 
    polls_summary %>% 
    ggplot(aes(x = party, y = prop, fill=party)) +
      geom_bar(stat = "identity", 
               colour="black",
               alpha = .6) +
      geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                    size = 0.5, color = "black", width=.2) +
      theme_bw() +
      xlab("Political party") +
      ylab("Proportion intending to vote") +
      theme(text = element_text(size = 20)) + 
      ggtitle("Poll BC 2020 election")

poll_ci_plot

**Question 3.2** 
<br> {points: 4}

Use written English to interpret and report the estimates (and their confidence intervals) for each party. Is there any concern over using the CLT to obtain the confidence interval for the proportion for any of the parties?


DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 4. Estimating the Difference in Proportions using CLT

In Question 3 of Tutorial 6, we used the [Breast Cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer) data set and studied the effects of radiation therapy on the recurrence of breast cancer. Now, we will obtain the confidence interval for the difference in proportions using both bootstrapping and the CLT.

In [None]:
breast_cancer <- 
    read_csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data"), 
             col_names = c("class", "age", "menopause", "tumor-size", "inv-nodes", "node-caps", "deg-malig", "breast", "breast-quad", "irradiat")) %>% 
    select(class, irradiat)

# Taking a peek of the data set
breast_cancer %>% 
    group_by(class,irradiat) %>% 
    sample_n(size = 2)

Let $p_{1}$ be the proportion of patients with past radiation treatment (`irradiat = yes`) who had recurrent cancer, and let $p_{2}$ be the proportion of patients with no radiation treatment (`irradiat = no`) who had recurrent cancer. We would like to estimate $p_1-p_2$ and find its 95% confidence interval.

<b>Question 4.1: Sample Proportions</b>
<br>{points: 1}

Find the sample proportions of $p_1$ (proportion of patients with radiation treatment with recurrent cancer) and $p_2$ (proportion of patients without radiation treatment with recurrent cancer).

<i>Assign your answers to a data frame named `p_summary`. The data frame should have four columns: `p_yes`, `p_no`, `n_yes`, and `n_no`.</i>

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

p_summary

In [None]:
test_4.1()

<b> Question 4.2 </b>
<br>{points: 1}

Add two more columns to the `p_summary` data frame: 

1. `p_diff` to store the observed difference in proportions, i.e., $\hat{p}_1-\hat{p}_2$;
2. `p_diff_std_error` to store the estimated standard error of the difference in proportions.
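For two independent groups, the estimated standard error of $\hat{p}_1-\hat{p}_2$ is $\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$. The sketch below works through it with hypothetical counts (not the breast cancer data); all variable names are illustrative.

```r
# Hypothetical counts for two independent groups
# (NOT the breast cancer data).
n_1 <- 120
successes_1 <- 48
n_2 <- 150
successes_2 <- 45

p_hat_1 <- successes_1 / n_1   # 0.4
p_hat_2 <- successes_2 / n_2   # 0.3

# Observed difference and its estimated standard error.
p_diff <- p_hat_1 - p_hat_2
p_diff_std_error <- sqrt(p_hat_1 * (1 - p_hat_1) / n_1 +
                         p_hat_2 * (1 - p_hat_2) / n_2)

c(p_diff = p_diff, p_diff_std_error = p_diff_std_error)
```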

In [None]:
# p_summary <-
#     p_summary %>% 
#     mutate(p_diff = ...,
#            p_diff_std_error = ...)

# your code here
fail() # No Answer - remove if you provide an answer

p_summary

In [None]:
test_4.2()

<b>Question 4.3</b>
<br>{points: 1}

Finally, obtain the 95% confidence interval for the difference in proportions. Add two more columns to the `p_summary` data frame: (1) `lower_ci`; and (2) `upper_ci`.
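There are two equivalent ways to get a Normal-based bound in R: shift and scale a standard Normal quantile, or pass the mean and standard deviation directly to `qnorm()`. A small sketch with hypothetical values for `estimate` and `std_error`:

```r
# Hypothetical estimate and standard error
# (NOT from the breast cancer data).
estimate <- 0.10
std_error <- 0.05

# Form 1: shift and scale a standard Normal quantile.
lower_1 <- estimate + qnorm(0.025) * std_error

# Form 2: ask qnorm() for the quantile of N(estimate, std_error) directly.
lower_2 <- qnorm(0.025, mean = estimate, sd = std_error)

c(lower_1 = lower_1, lower_2 = lower_2)
```

Both forms give the same lower bound; the upper bound follows by replacing `0.025` with `0.975`.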

In [None]:
# p_summary <- 
#     p_summary %>% 
#     mutate(lower_ci = qnorm(..., ..., ...),
#            upper_ci = ...)

# your code here
fail() # No Answer - remove if you provide an answer

p_summary

In [None]:
# Here we check to see if you have given your answer the correct object name
# and if your answer is plausible. However, all other tests have been hidden
# so you can practice deciding when you have the correct answer.

test_that('Did not assign answer to an object called "p_summary"', {
  expect_true(exists("p_summary"))
})

test_that("Solution should be a data frame", {
  expect_true("data.frame" %in% class(p_summary))
})

expected_colnames <- c("n_no", "n_yes", "p_no", "p_yes", "p_diff", "p_diff_std_error", "lower_ci", "upper_ci")
given_colnames <- colnames(p_summary)
test_that("Data frame does not have the correct columns", {
  expect_equal(length(setdiff(
    union(expected_colnames, given_colnames),
    intersect(expected_colnames, given_colnames)
  )), 0)
})

test_that("Data frame does not contain the correct number of rows", {
  expect_equal(digest(as.integer(nrow(p_summary))), "4b5630ee914e848e8d07221556b0a2fb")
})

<b>Question 4.4: Confidence Interval by Bootstrap</b>
<br>{points: 1}

Obtain the 95% confidence interval for the difference in proportions ($p_1-p_2$) via bootstrapping, by generating 1000 bootstrap samples from `breast_cancer`.
```r
diff_in_props_btsp_ci <- 
    breast_cancer %>%
    specify(formula = class ~ irradiat, success=...) %>%
    generate(...) %>%
    calculate(stat = ..., order = c(...)) %>%
    get_ci()
```
<i>Assign your answer to a variable called `diff_in_props_btsp_ci`</i>.

In [None]:
set.seed(20210301) # Do not change this

# your code here
fail() # No Answer - remove if you provide an answer

diff_in_props_btsp_ci

In [None]:
test_4.4()